AI Dictionary
Token
Definition
A basic unit of text (like a word or character) that an LLM processes.
Deep Dive
In the context of large language models (LLMs) and natural language processing (NLP), a token is the most fundamental unit of text that the model processes. This unit can vary significantly depending on the tokenization strategy employed; it might represent an entire word, a subword (like "un-" in "unhappy"), a single character, or even a punctuation mark. The process of breaking down raw text into these discrete tokens, known as tokenization, is a critical preprocessing step before any text can be fed into an LLM for analysis or generation.
Examples & Use Cases
- The sentence "Hello, world!" might be tokenized into ["Hello", ",", "world", "!"].
- A model processing a long document will count the total number of tokens to estimate processing cost and time.
- Identifying common subword tokens like "un-" or "-ing" lets a model represent variations of words more efficiently.
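The first example above can be sketched with a minimal word-and-punctuation tokenizer. This is a simplified illustration using a regular expression, not how production LLMs tokenize text; real systems typically use learned subword schemes such as byte-pair encoding (BPE).

```python
import re

def tokenize(text):
    # Match runs of word characters, or any single
    # non-word, non-space character (punctuation).
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, world!"))
# → ['Hello', ',', 'world', '!']
```

A subword tokenizer would go further, splitting rare words into pieces (e.g. "unhappy" into "un" + "happy") so a fixed vocabulary can cover open-ended text.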
Related Terms
Tokenization, Embedding, Vocabulary