What It Is

Data labeling (also called data annotation) is the process of assigning informative tags, categories, or markings to raw data — images, text, audio, or video — so that machine learning models can learn from it. Supervised learning, which powers the majority of production AI systems, requires labeled examples: images tagged with what they contain, text classified by sentiment, audio transcribed word-by-word.

The data labeling industry exceeded $3 billion in 2025. Scale AI, Labelbox, Appen, Toloka, and Sama are major platforms. The rise of large language models has changed but not eliminated labeling needs — foundation model training requires massive labeled datasets for supervised fine-tuning and reinforcement learning from human feedback (RLHF).

As the AI saying goes: "garbage in, garbage out." Model architecture and training algorithms matter, but data quality is often the decisive factor in real-world performance.

Types of Annotation

Image annotation:

  • Bounding boxes — rectangles drawn around objects of interest. Used for object detection in autonomous vehicles, security cameras, and retail analytics.
  • Semantic segmentation — labeling every pixel in an image with a class (road, sidewalk, vehicle, pedestrian). Critical for autonomous driving and medical imaging.
  • Instance segmentation — distinguishing individual instances of the same class (separating each person in a crowd).
  • Keypoint annotation — marking specific points on objects (facial landmarks, body joints). Used for pose estimation and facial recognition.
  • Polygon annotation — drawing precise boundaries around irregularly shaped objects. Common in satellite imagery and geospatial analysis.
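Image annotations are typically stored as structured records. A minimal sketch of one such record, loosely following the COCO convention (the field names and values here are illustrative, not a full specification):

```python
# A minimal image-annotation record, loosely following the COCO convention:
# bounding boxes are [x, y, width, height] in pixels, polygons are flat
# [x1, y1, x2, y2, ...] coordinate lists, keypoints are (x, y, visibility)
# triples. All field names and values below are illustrative.
annotation = {
    "image_id": 42,
    "category": "pedestrian",
    "bbox": [310, 128, 45, 110],                                 # x, y, w, h
    "segmentation": [[312, 130, 350, 132, 348, 236, 314, 234]],  # polygon
    "keypoints": [330, 140, 2, 335, 160, 2],                     # x, y, vis
}

def bbox_area(record):
    """Area of the record's bounding box in square pixels."""
    _, _, w, h = record["bbox"]
    return w * h

print(bbox_area(annotation))  # 45 * 110 = 4950
```

Keeping boxes, polygons, and keypoints in one record per object is what lets instance segmentation and pose estimation share the same labeled dataset.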

Text annotation:

  • Named entity recognition — labeling spans of text with entity types (person, organization, date, product). See natural language understanding.
  • Sentiment labeling — classifying text as positive, negative, or neutral.
  • Intent classification — labeling user messages with the intended action.
  • Relationship annotation — marking relationships between entities in text.
  • Text classification — categorizing documents by topic, type, or other taxonomies.

Audio annotation:

  • Transcription — converting speech to text with timestamps. See speech recognition.
  • Speaker diarization — labeling which speaker is speaking when.
  • Sound event detection — tagging audio with event labels (glass breaking, siren, speech).

Video annotation:

  • Object tracking — annotating objects across video frames with consistent IDs.
  • Action recognition — labeling video segments with the actions being performed.
  • Temporal annotation — marking the start and end times of events.

Labeling Workforce and Operations

Data labeling is performed by human annotators: in-house teams, managed service providers, or distributed crowd workers.

Crowdsourcing — platforms like Amazon Mechanical Turk, Toloka, and Clickworker distribute tasks to large pools of workers. Crowdsourcing is fast and scalable but requires quality control mechanisms: consensus voting (multiple annotators per item), gold standard questions (items with known answers to check worker quality), and iterative review.
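The two quality-control mechanisms above can be sketched in a few lines: consensus voting picks the majority label (escalating ties), and gold standard questions score each worker against known answers. A minimal sketch; thresholds and escalation policy would vary by platform:

```python
from collections import Counter

def consensus_label(votes):
    """Majority vote across annotators; ties return None for escalation."""
    counts = Counter(votes).most_common(2)
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no consensus -> route to expert review
    return counts[0][0]

def worker_accuracy(worker_answers, gold):
    """Fraction of gold-standard items a worker labeled correctly."""
    scored = [item for item in worker_answers if item in gold]
    if not scored:
        return None  # worker saw no gold questions yet
    correct = sum(worker_answers[item] == gold[item] for item in scored)
    return correct / len(scored)

# Three annotators label the same item; gold questions check worker quality.
print(consensus_label(["cat", "cat", "dog"]))   # cat
gold = {"q1": "positive", "q2": "negative"}
answers = {"q1": "positive", "q2": "positive", "x9": "neutral"}
print(worker_accuracy(answers, gold))           # 0.5
```

In practice, workers who fall below an accuracy threshold on gold questions are retrained or removed, and items with no consensus go to iterative review.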

Managed teams — companies like Scale AI and Appen maintain trained annotator teams for specialized tasks. Medical image labeling requires trained technicians. Autonomous driving annotation requires understanding of traffic rules. RLHF annotation for LLMs requires skilled writers who can evaluate response quality.

In-house annotation — organizations with specialized requirements build internal teams. This provides the highest quality and domain expertise but is expensive to scale.

Annotation guidelines — clear, detailed labeling instructions are essential. Ambiguous guidelines produce inconsistent labels. Best practices include visual examples, edge case documentation, and regular calibration sessions where annotators discuss disagreements.

Quality Metrics

Inter-annotator agreement — the degree to which multiple annotators produce the same label for the same item. Measured by Cohen's Kappa, Krippendorff's Alpha, or simple agreement percentage. Low agreement indicates ambiguous guidelines or a poorly defined task.
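Cohen's Kappa corrects raw agreement for agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the chance agreement implied by each annotator's label frequencies. A self-contained sketch for two annotators:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both independently pick each class,
    # summed over all classes either annotator used.
    expected = sum(
        (freq_a[c] / n) * (freq_b[c] / n)
        for c in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected)

a = ["pos", "pos", "neg", "neg", "pos", "neu"]
b = ["pos", "neg", "neg", "neg", "pos", "neu"]
print(round(cohens_kappa(a, b), 3))  # 0.739
```

Here the annotators agree on 5 of 6 items (p_o ≈ 0.83), but chance alone predicts p_e ≈ 0.36, so kappa lands around 0.74, which is substantial but not perfect agreement.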

Accuracy against gold standard — comparing annotator labels against expert-produced reference labels. This measures individual annotator quality and identifies workers who need retraining or removal.

Consistency — tracking annotator performance over time to detect fatigue, drift, or changing interpretation of guidelines.

AI-Assisted Labeling

AI increasingly accelerates the labeling process:

Pre-labeling — a model generates initial labels that human annotators verify and correct. This can be 3-5x faster than labeling from scratch for tasks where models are already partially accurate.
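A common way to organize pre-labeling is to route by model confidence: high-confidence pre-labels are accepted (and spot-checked), while low-confidence ones go to full human review. A minimal sketch; the threshold and the (item, label, confidence) tuples standing in for a real model's output are illustrative:

```python
REVIEW_THRESHOLD = 0.9  # illustrative cutoff; tuned per task in practice

def route_prelabels(predictions):
    """Split model pre-labels into auto-accept and human-review queues.

    `predictions` is a list of (item_id, label, confidence) tuples,
    a stand-in here for any pre-labeling model's output.
    """
    auto_accept, needs_review = [], []
    for item_id, label, confidence in predictions:
        if confidence >= REVIEW_THRESHOLD:
            auto_accept.append((item_id, label))
        else:
            needs_review.append((item_id, label))
    return auto_accept, needs_review

preds = [("img1", "car", 0.97), ("img2", "bus", 0.62), ("img3", "car", 0.91)]
accepted, review = route_prelabels(preds)
print(len(accepted), len(review))  # 2 1
```

The speedup comes from annotators spending their time on the review queue rather than drawing every label from scratch.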

Active learning — the model identifies the most informative unlabeled examples — typically those near decision boundaries — and requests labels for only those items. This maximizes model improvement per labeled example, reducing total labeling cost.
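One standard way to find the "most informative" examples is uncertainty sampling: rank the unlabeled pool by the entropy of the model's predicted class distribution and send the highest-entropy items to annotators. A sketch, with a toy pool of predicted probabilities standing in for real model output:

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def select_for_labeling(pool, batch_size):
    """Pick the `batch_size` most uncertain items from the unlabeled pool.

    `pool` maps item ids to the model's predicted class probabilities;
    items near the decision boundary have high-entropy predictions.
    """
    ranked = sorted(pool, key=lambda item: entropy(pool[item]), reverse=True)
    return ranked[:batch_size]

pool = {
    "a": [0.98, 0.02],  # confident prediction -> low entropy
    "b": [0.51, 0.49],  # near the decision boundary -> high entropy
    "c": [0.75, 0.25],
}
print(select_for_labeling(pool, 2))  # ['b', 'c']
```

Variants swap entropy for least-confidence or margin scores, but the loop is the same: label the selected batch, retrain, and re-rank the pool.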

Self-supervised and semi-supervised learning — these techniques reduce labeling needs by learning from unlabeled data. Self-supervised pre-training (as in BERT and GPT) learns representations from raw text, requiring labels only for fine-tuning. Semi-supervised methods propagate labels from a small labeled set to a large unlabeled set.
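The label-propagation idea can be sketched with a deliberately simple rule: copy a label from the small labeled set to any unlabeled point whose nearest labeled neighbor is within a distance threshold. This nearest-neighbor rule and its threshold are illustrative stand-ins for real propagation algorithms:

```python
import math

def nearest(point, labeled):
    """Closest labeled point's label and its Euclidean distance (2-D)."""
    best_label, best_dist = None, math.inf
    for (x, y), label in labeled:
        d = math.dist(point, (x, y))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label, best_dist

def propagate_labels(labeled, unlabeled, max_dist=1.0):
    """One round of label propagation: copy a label to any unlabeled
    point whose nearest labeled neighbor is within `max_dist`
    (an illustrative confidence threshold)."""
    newly_labeled = []
    for point in unlabeled:
        label, dist = nearest(point, labeled)
        if dist <= max_dist:
            newly_labeled.append((point, label))
    return newly_labeled

seeds = [((0.0, 0.0), "spam"), ((5.0, 5.0), "ham")]
unlabeled = [(0.3, 0.2), (4.8, 5.1), (2.5, 2.5)]
print(propagate_labels(seeds, unlabeled))
# the ambiguous midpoint (2.5, 2.5) stays unlabeled
```

Running the round repeatedly, adding newly labeled points to the seed set each time, spreads labels outward while leaving genuinely ambiguous points for human annotators.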

LLM-based labeling — large language models can perform many annotation tasks (sentiment, classification, entity extraction) at near-human quality. Using LLMs as annotators dramatically reduces cost and turnaround time, though human review remains important for quality assurance.

Challenges

  • Scale vs. quality tradeoff — fast, cheap labeling often produces lower quality. High-quality expert annotation is expensive and slow. Finding the right balance depends on the application's accuracy requirements and budget.
  • Subjectivity — many labeling tasks involve subjective judgments. Is a product review "positive" or "neutral"? Is an image "offensive"? Annotator demographics, cultural backgrounds, and personal experiences influence subjective labels.
  • Annotator well-being — content moderation labeling exposes workers to disturbing material (violence, abuse, hate speech). Companies face ethical obligations to provide psychological support and fair compensation. Investigations have revealed exploitative working conditions at some labeling operations.
  • Label noise — even with quality control, labeled datasets contain errors. Models trained on noisy labels learn incorrect patterns. Noise-robust training techniques and data cleaning pipelines help mitigate this but don't eliminate it.
  • Cost at scale — training frontier AI models requires millions to billions of labeled examples. At $0.01-$10 per label depending on complexity, data labeling represents a significant fraction of AI development costs.