In Depth
A data pipeline is a series of automated steps that move data from source systems through processing stages to its final destination. In AI contexts, data pipelines handle the massive data workflows required for model training: collecting raw data, cleaning and validating it, transforming it into training-ready formats, and loading it into training systems.
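The stages above can be sketched as composed functions; this is a minimal illustration using hypothetical names (`collect`, `clean`, `transform`, `load`), not any specific library's API:

```python
def collect():
    # Stand-in for reading from a source system (API, database, object store).
    return ["  Hello WORLD ", "", "data pipelines move data", "  Hello WORLD "]

def clean(records):
    # Cleaning/validation step: drop empty records, normalize whitespace.
    return [r.strip() for r in records if r.strip()]

def transform(records):
    # Make records training-ready; lowercasing here is a simple stand-in
    # for real transformations like tokenization or feature encoding.
    return [r.lower() for r in records]

def load(records, destination):
    # Stand-in for writing to a training data store.
    destination.extend(records)

store = []
load(transform(clean(collect())), store)
print(store)  # ['hello world', 'data pipelines move data', 'hello world']
```

Real pipelines run each stage as a separately monitored, retryable step rather than a single function call, but the data flow is the same.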
Modern data pipelines must handle diverse data types (text, images, audio, structured data), enormous volumes (petabytes for large model training), and complex transformations (tokenization, augmentation, deduplication). Tools like Apache Spark, Apache Beam, dbt, and Airflow orchestrate these workflows, while cloud services like AWS Glue and Google Dataflow provide managed solutions.
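Of the transformations listed above, exact deduplication is the simplest to illustrate: hash each record and keep only the first occurrence. This is a sketch, not production code; large-scale pipelines typically add near-duplicate detection (e.g. MinHash) and run the step distributed via tools like Spark:

```python
import hashlib

def deduplicate(records):
    # Keep the first occurrence of each record, keyed by content hash.
    seen = set()
    unique = []
    for record in records:
        digest = hashlib.sha256(record.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique

docs = ["the cat sat", "a dog ran", "the cat sat"]
print(deduplicate(docs))  # ['the cat sat', 'a dog ran']
```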
Data pipeline quality directly impacts model quality; the adage 'garbage in, garbage out' is especially true for AI. Issues like data drift (changing data distributions over time), incomplete data, duplicates, and biased sampling can all degrade model performance. Robust data pipelines include monitoring, validation checks, and lineage tracking to ensure data quality and enable debugging when model performance degrades.
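A validation check of the kind described above can be as simple as comparing a batch's summary statistics against a reference baseline and flagging possible drift. The threshold and function name below are illustrative assumptions, not taken from any particular monitoring tool:

```python
import statistics

def check_drift(reference, batch, tolerance=0.25):
    """Flag drift if the batch mean deviates from the reference mean
    by more than `tolerance` standard deviations of the reference."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    shift = abs(statistics.mean(batch) - ref_mean)
    return shift > tolerance * ref_std

reference = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]   # historical feature values
healthy_batch = [10.1, 10.3, 9.9]                # similar distribution
drifted_batch = [14.0, 15.2, 13.8]               # distribution has shifted

print(check_drift(reference, healthy_batch))  # False
print(check_drift(reference, drifted_batch))  # True
```

Production systems use richer tests (e.g. comparing full distributions, not just means) and tie failed checks to alerts and lineage records so the offending upstream data can be traced.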