ETL Pipelines Vs. ML Pipelines – Similarities and Differences

follow for more: https://medium.com/@fahadthedatascientist
Introduction
The Data Pipeline, used for reporting and analytics, and the ML Pipeline, used to learn and make predictions, have many similarities. Data Engineers build Data Pipelines for Business Users, whereas Data Scientists construct and operate the ML Pipeline. Both pipelines access data from corporate systems and intelligent devices and store the collected data in data stores. Both transform the raw data to scrub it and prepare it for analysis or learning. Both keep historical data. Both need to be scalable, secure, and hosted in the cloud. Both need to be monitored and maintained regularly.
Definition of a data pipeline
The Data Pipeline comprises several specific modules and processes designed to enable reporting, analysis, and forecasting capabilities. The Data Pipeline moves data from an enterprise’s operational systems to a central data store, either on-premises or in the cloud. Data from various connected devices and IoT systems can also be added to the pipeline for specific business cases.
Continuous maintenance and monitoring are essential to keep the Data Pipeline modules and processes running smoothly and correctly. Problems that arise must be resolved quickly, and the systems the pipeline relies on (software, hardware, and networking components) must be kept up to date. Data Pipelines are also often adjusted to reflect business changes.
What is data extraction in the data pipeline
Retrieving data from the enterprise operational systems and connected devices is the first module of the Data Pipeline. At this point in the process, the pipeline collects raw data from numerous separate enterprise data systems (ERP, CRM), production systems, and application logs. Extraction processes are set up to extract the data from each data source.
Two types of extraction mechanisms are possible.
- Batch processes ingest sets of records according to criteria defined by the Data Engineers. They can run on a set schedule or be triggered by external events.
- Streaming is an…
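The batch mechanism described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory SQLite database stands in for an operational system, the `orders` table and its columns are hypothetical, and the selection criterion (rows changed since the last run, tracked with a "watermark" timestamp) is one common choice rather than something the article prescribes.

```python
import sqlite3


def make_demo_source():
    """Build an in-memory stand-in for an operational system (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            (1, 10.0, "2024-01-01T08:00:00"),
            (2, 25.5, "2024-01-02T09:30:00"),
            (3, 7.25, "2024-01-03T11:15:00"),
        ],
    )
    conn.commit()
    return conn


def extract_batch(conn, watermark):
    """Return the rows changed after `watermark`, plus the updated watermark.

    Timestamps are ISO 8601 strings in the same format, so plain string
    comparison orders them chronologically.
    """
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # If nothing changed, keep the old watermark so the next run starts from the same point.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark


if __name__ == "__main__":
    source = make_demo_source()
    rows, watermark = extract_batch(source, "2024-01-01T12:00:00")
    print(f"extracted {len(rows)} rows, next watermark: {watermark}")
```

A scheduler (cron, an orchestrator, or an external trigger) would call `extract_batch` on each run and persist the returned watermark, so every batch picks up only the records that changed since the previous one.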