hmu.ai
AI Dictionary

Corpus

Definition

A large collection of text or audio data used to train and evaluate natural language processing models.

Deep Dive

In natural language processing (NLP) and machine learning, a corpus is a large, structured collection of text or audio data used to train and evaluate language models. It supplies the linguistic evidence from which algorithms learn patterns, grammatical regularities, semantic relationships, and contextual meaning in human language. A corpus is typically curated for a specific purpose, from general language understanding to domain-specific applications, and often includes metadata or annotations that make it more useful for research and development.
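As a rough illustration of the idea, a corpus can be modeled as a list of documents, each pairing raw text with metadata. The sketch below is a minimal one under stated assumptions: the `Document` class, its `metadata` keys, the naive `tokenize` function, and the three sample sentences are all hypothetical and chosen only for illustration. Real corpora are far larger and usually rely on dedicated tokenizers and annotation formats.

```python
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Document:
    """One corpus entry: raw text plus free-form metadata (hypothetical schema)."""
    text: str
    metadata: dict = field(default_factory=dict)


def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract runs of letters or apostrophes (deliberately naive)."""
    return re.findall(r"[a-z']+", text.lower())


def term_frequencies(corpus: list[Document]) -> Counter:
    """Count how often each token occurs across the whole corpus."""
    counts = Counter()
    for doc in corpus:
        counts.update(tokenize(doc.text))
    return counts


# A toy corpus; the sentences and metadata values are made up for this example.
corpus = [
    Document("The market rallied today.", {"source": "news", "year": 2021}),
    Document("The call was dropped twice.", {"source": "phone_transcript"}),
    Document("The whale surfaced at dawn.", {"source": "literature"}),
]

freqs = term_frequencies(corpus)

# Metadata makes it possible to select a domain-specific slice of the corpus.
news_only = [d for d in corpus if d.metadata.get("source") == "news"]
```

Here `freqs` holds corpus-wide token counts, the simplest kind of linguistic evidence a model could draw on, and `news_only` shows how metadata supports carving out a domain-specific subset.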

Examples & Use Cases

  • A collection of millions of news articles used to train a language model for text summarization
  • A dataset of transcribed phone calls for developing a speech recognition system
  • A compilation of digitized literary works used for linguistic analysis and stylometry

Related Terms

Natural Language Processing (NLP), Training Data, Dataset
