In natural language processing (NLP), an n-gram is a contiguous sequence of
items (typically words or characters) from a given text or speech sample. The value of n can vary, leading to different types of n-grams:
- Unigram (1-gram): Single words (e.g., "cat")
- Bigram (2-gram): Pairs of consecutive words (e.g., "the cat")
- Trigram (3-gram): Triples of consecutive words (e.g., "the black cat")
- Higher-order n-grams: Sequences of four or more words
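For concreteness, here is a minimal Python sketch (a hypothetical `ngrams` helper, assuming simple whitespace tokenization) that extracts these different n-gram types from a token list:

```python
# Minimal n-gram extraction sketch (whitespace tokenization assumed).
def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the black cat sat on the mat".split()
print(ngrams(tokens, 1))  # unigrams: [('the',), ('black',), ('cat',), ...]
print(ngrams(tokens, 2))  # bigrams:  [('the', 'black'), ('black', 'cat'), ...]
print(ngrams(tokens, 3))  # trigrams: [('the', 'black', 'cat'), ...]
```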
Applications of N-grams in NLP:
Language Modeling: N-grams are used to predict the next word in a sequence, helping in applications like speech recognition and text generation.
Text Classification: N-grams can serve as features for classifying text into categories (e.g., spam detection, sentiment analysis).
Machine Translation: N-gram models help improve the accuracy of translating phrases from one language to another by capturing the context of words.
Information Retrieval: N-grams enhance search engines by improving the matching of user queries with relevant documents.
Spell Checking: N-grams can assist in identifying likely misspellings and suggesting corrections based on common word sequences.
Sentiment Analysis: Analyzing n-grams can help identify sentiment in a text by examining the frequency and combination of words.
Text Summarization: N-grams can be used to identify key phrases and concepts in a document, aiding in the generation of concise summaries.
Chatbots and Virtual Assistants: N-grams help in understanding and generating human-like responses in conversational agents.
Counting Words in Corpora
Counting words in corpora is a fundamental task in natural language processing (NLP) and text analysis. It provides essential insights into language usage, helps in building language models, and supports various linguistic studies. A corpus (plural: corpora) is essentially a large and structured set of texts, used extensively in NLP for statistical analysis of language patterns. These texts can be written documents, transcripts of spoken language, or a combination of both.
Counting words in a corpus serves several purposes. One of the primary goals is frequency analysis, where the occurrence of words or phrases is measured. This is useful for identifying trends, common words, or rare terms. It also plays a critical role in determining the lexical diversity of a text, which refers to how varied the vocabulary is, often quantified by metrics such as the Type-Token Ratio (TTR). Beyond that, counting words helps in language modeling, where probabilities of word sequences are estimated to predict future text, and in information retrieval, where understanding the frequency and context of keywords can improve search engine algorithms.
The process of counting words typically begins with text preprocessing. This involves preparing the text so that the analysis can be conducted accurately. One common step is converting all text to lowercase, which prevents the algorithm from treating "The" and "the" as two different words. Another essential step is tokenization, where the text is split into individual words or tokens. Depending on the application, tokenization might involve splitting by spaces or using more complex rules like regular expressions to capture words accurately. Additionally, it is common to remove punctuation and sometimes perform stemming or lemmatization, which reduces words to their base forms, ensuring that different variants of a word (like "run," "running," and "ran") are treated as the same term.
Once preprocessing is complete, the actual word counting takes place. This can be implemented using various methods, such as dictionaries or hash maps, where each word is stored as a key, and its count as a value. As the program iterates through the text, it updates the count for each word. After this, the results are often presented as a frequency distribution, which shows the most common words and their frequencies. The results can be visualized through graphs, histograms, or word clouds, making it easier to identify patterns.
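As an illustration, a minimal preprocessing-and-counting sketch in Python might look like this (it assumes a simple regular-expression tokenizer and omits stemming and lemmatization):

```python
import re
from collections import Counter

# Lowercase, tokenize with a simple regex, then count tokens with a hash map.
text = "The dog barks. The cat meows. The dog runs."
tokens = re.findall(r"[a-z']+", text.lower())   # lowercasing + tokenization
counts = Counter(tokens)                        # word -> frequency

print(counts.most_common(3))   # most frequent words, e.g. [('the', 3), ('dog', 2), ('barks', 1)]
```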
There are also several ambiguities in word counting that must be handled carefully. For instance, homographs—words that are spelled the same but have different meanings—can lead to confusion if the context is not taken into account. Similarly, handling hyphenated words (e.g., "well-being") can pose challenges; should they be treated as one word or two? Compound words (e.g., "toothbrush") and contractions (e.g., "don't" vs. "do not") present similar dilemmas. Properly addressing these ambiguities is essential for achieving accurate results in word counting.
The applications of word counting are wide-ranging. In content analysis, word frequencies can help identify the key themes or topics of a document. Lexical profiling involves measuring the complexity and richness of language in a text, which can be used to compare different authors or genres. In language learning, word counting can help instructors or students analyze vocabulary usage in educational materials to ensure that texts are at the appropriate difficulty level.
Simple (Unsmoothed) N-grams
In natural language processing (NLP), n-grams are contiguous sequences of words (or characters) extracted from a given text. The "n" in n-grams refers to the number of elements in each sequence. For instance, when n=1, the sequence is a unigram; for n=2, it is a bigram, and for n=3, it is a trigram. These simple n-gram models help capture the structure and dependencies of words in a language, making them useful in tasks like language modeling, speech recognition, and text generation.
1. Definition of Simple (Unsmoothed) N-grams
A simple n-gram model assumes that the probability of a word in a sequence depends only on the previous n−1 words. It calculates the probability of a word sequence by breaking it into smaller components (n-grams) and multiplying the probabilities of these n-grams. In the case of unsmoothed n-grams, the probabilities are calculated directly from the frequency of occurrences in the training data, without any adjustments for unseen data.
The basic formula for an n-gram probability is:

P(w_n \mid w_1, \ldots, w_{n-1}) = \frac{\text{Count}(w_1, \ldots, w_n)}{\text{Count}(w_1, \ldots, w_{n-1})}

For example, in a bigram model, the probability of a word w_i is conditioned on the previous word w_{i-1}:

P(w_i \mid w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i)}{\text{Count}(w_{i-1})}

Here, Count(w_{i-1}, w_i) represents how often the word pair appears in the corpus, and Count(w_{i-1}) is the frequency of the first word in the pair.
2. Example
Consider the following corpus:
"The dog barks. The cat meows. The dog runs."
To build a bigram model, we first tokenize the corpus and calculate word pair frequencies:
- ("The", "dog"): 2 occurrences
- ("dog", "barks"): 1 occurrence
- ("The", "cat"): 1 occurrence
- ("cat", "meows"): 1 occurrence
- ("dog", "runs"): 1 occurrence
To calculate the probability of the sentence "The dog barks", we multiply the bigram probabilities:

P(\text{"The dog barks"}) = P(\text{"dog"} \mid \text{"The"}) \times P(\text{"barks"} \mid \text{"dog"})

Using the counts from the corpus:

P(\text{"dog"} \mid \text{"The"}) = \frac{\text{Count("The dog")}}{\text{Count("The")}} = \frac{2}{3}, \qquad P(\text{"barks"} \mid \text{"dog"}) = \frac{\text{Count("dog barks")}}{\text{Count("dog")}} = \frac{1}{2}

Thus:

P(\text{"The dog barks"}) = \frac{2}{3} \times \frac{1}{2} = \frac{1}{3} \approx 0.33

This means that, according to the bigram model, the probability of the sentence "The dog barks" is about 0.33.
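A short sketch of this unsmoothed bigram calculation, assuming naive whitespace tokenization and ignoring sentence boundaries, could look like this:

```python
from collections import Counter

# Unsmoothed bigram model over the toy corpus above.
corpus = "The dog barks . The cat meows . The dog runs ."
tokens = [t for t in corpus.split() if t != "."]   # drop punctuation tokens

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(prev, word):
    # P(word | prev) = Count(prev, word) / Count(prev); zero for unseen pairs.
    return bigram_counts[(prev, word)] / unigram_counts[prev]

sentence = ["The", "dog", "barks"]
p = 1.0
for prev, word in zip(sentence, sentence[1:]):
    p *= bigram_prob(prev, word)
print(p)   # (2/3) * (1/2) = 0.333...
```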
3. Strengths of Simple (Unsmoothed) N-grams
Simplicity: N-grams are easy to compute and understand. They require basic counting and division operations, making them a straightforward introduction to statistical language modeling.
Efficiency: For small corpora or simple tasks, n-gram models are computationally efficient. They don't require large amounts of data or complex algorithms.
Local Context: N-grams capture local dependencies between words, which can often be sufficient for short-range predictions or simple language tasks.
4. Limitations of Simple (Unsmoothed) N-grams
Data Sparsity: One of the most significant challenges with unsmoothed n-grams is the problem of data sparsity. In many cases, certain word combinations may not appear in the training data, leading to zero probabilities for valid sequences. For example, in the bigram model, if the word pair (w_{i-1}, w_i) is not observed in the training data, P(w_i \mid w_{i-1}) will be zero, making the entire sentence probability zero. This severely limits the model's ability to generalize to new data.
Memory Requirements: As n increases, the number of possible n-grams grows exponentially, leading to large storage requirements. For example, a trigram model would need to store all possible triples of words, which is impractical for larger vocabularies.
Lack of Long-range Dependencies: Simple n-gram models only capture relationships between adjacent words or a small context window (depending on n). This ignores long-range dependencies between words that might be crucial for understanding the meaning of a sentence.
Zero Probability for Unseen N-grams: Unsmoothed n-gram models assign zero probability to any word sequence that does not appear in the training data, making them unsuitable for handling real-world text, where new phrases or expressions are common.
Relative Frequency and Maximum Likelihood Estimation (MLE)
Estimating the probability of words and sequences of words (n-grams) is crucial for tasks like language modeling, speech recognition, machine translation, and text generation. Two foundational methods for estimating these probabilities are relative frequency and maximum likelihood estimation (MLE). These methods provide simple yet effective ways to model language based on observed data.
1. Relative Frequency
Relative frequency is a basic method for estimating the probability of an event by dividing the number of times that event occurs by the total number of all possible events. In the context of NLP, the event is typically a word or n-gram, and the total number of events is the total number of words or word sequences in a given corpus.
The formula for relative frequency is:
P(w) = \frac{\text{Count}(w)}{\text{Total number of words in the corpus}}

Where:
- P(w) is the probability of a word w,
- Count(w) is the number of times w occurs in the corpus,
- Total number of words is the sum of all word occurrences in the corpus.
Example
Consider the following corpus:
"The dog barks. The cat meows."
- The total number of words is 6 ("The," "dog," "barks," "The," "cat," "meows").
- The word "The" occurs 2 times, so its relative frequency is:
- The word "dog" occurs 1 time, so its relative frequency is:
Relative frequency is a straightforward and interpretable way to calculate probabilities in NLP, especially when dealing with simple word-level statistics.
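As a quick illustration, a minimal sketch of this relative-frequency computation (lowercasing the toy corpus and splitting on whitespace) might be:

```python
from collections import Counter

# Relative frequency: P(w) = Count(w) / total number of tokens.
tokens = "The dog barks . The cat meows .".lower().replace(".", "").split()
counts = Counter(tokens)
total = sum(counts.values())

print(counts["the"] / total)   # 2/6 ~= 0.33
print(counts["dog"] / total)   # 1/6 ~= 0.17
```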
2. Maximum Likelihood Estimation (MLE)
Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution by maximizing the likelihood that the observed data came from that distribution. In NLP, MLE is used to estimate the probability of words or word sequences by assuming that the most likely model of the language is the one that maximizes the probability of the observed corpus.
For n-gram language models, MLE assumes that the best estimate of the probability of an n-gram is the relative frequency of that n-gram in the corpus. This is the same approach as calculating relative frequency, making MLE a specific case of relative frequency estimation in the context of language modeling.
The formula for MLE of an n-gram is:
P(w_n \mid w_1, \ldots, w_{n-1}) = \frac{\text{Count}(w_1, \ldots, w_n)}{\text{Count}(w_1, \ldots, w_{n-1})}
Where:
- P(w_n \mid w_1, \ldots, w_{n-1}) is the probability of word w_n given the previous words w_1, \ldots, w_{n-1},
- Count(w_1, \ldots, w_n) is the number of times the n-gram occurs in the corpus,
- Count(w_1, \ldots, w_{n-1}) is the number of times the previous words appear.
To calculate the MLE for the bigram ("The", "dog") from the corpus above:
- The bigram ("The," "dog") occurs 1 time,
- The word "The" occurs 2 times.
Thus, the MLE estimate for P(\text{"dog"} \mid \text{"The"}) is:

P(\text{"dog"} \mid \text{"The"}) = \frac{\text{Count("The dog")}}{\text{Count("The")}} = \frac{1}{2} = 0.5
In this case, MLE gives us the best estimate for the probability of "dog" following "The" based on the data we have observed.
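A generic MLE estimator for arbitrary n could be sketched as follows (the mle helper and its use of the toy corpus are illustrative assumptions, not a standard API):

```python
from collections import Counter

tokens = "The dog barks . The cat meows .".replace(".", "").split()

def mle(history, word, tokens=tokens):
    """MLE estimate P(word | history) = Count(history + word) / Count(history)."""
    n = len(history) + 1
    ngram_counts = Counter(zip(*[tokens[i:] for i in range(n)]))
    history_counts = Counter(zip(*[tokens[i:] for i in range(n - 1)]))
    return ngram_counts[tuple(history) + (word,)] / history_counts[tuple(history)]

print(mle(("The",), "dog"))   # Count("The dog") / Count("The") = 1/2 = 0.5
```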
Smoothing
N-gram smoothing is a technique used in natural language processing and statistical language modeling to handle the problem of zero probabilities in n-gram models. In an n-gram model, the probability of a word depends on the preceding n−1 words. However, in practice, many word sequences might not appear in the training data, leading to zero probabilities. Smoothing helps adjust these probabilities so that unseen sequences can still have a non-zero probability.
Laplace Smoothing
Laplace Smoothing provides a solution by adding a small, constant value (usually 1) to all n-gram counts. This ensures that every possible word sequence, even those not observed in the training data, has a non-zero probability.
Formula for Laplace Smoothing
For an n-gram model, the probability of a word w_n given its preceding words w_1, \ldots, w_{n-1} is calculated as:

P(w_n \mid w_1, \ldots, w_{n-1}) = \frac{\text{Count}(w_1, \ldots, w_n) + 1}{\text{Count}(w_1, \ldots, w_{n-1}) + V}

Where:
- Count(w_1, \ldots, w_n) is the count of the n-gram (the sequence of n words) in the training data.
- Count(w_1, \ldots, w_{n-1}) is the count of the preceding words.
- V is the total number of unique words in the vocabulary.
- The "+1" in the numerator adds one count to every n-gram.
- The "+V" in the denominator accounts for the additional count added to each of the V possible words.
Why Add-One?
By adding one to each count, we ensure that no probability is ever zero. The technique spreads the probability mass slightly across all possible n-grams, including those that were not observed. This results in a smoother, more generalizable model, especially in cases with limited training data.
Example
Suppose you are using a bigram model (where the probability of a word depends on the previous word) and you want to calculate the probability of "cat" following "the":
Without Smoothing:
P(\text{cat} \mid \text{the}) = \frac{C(\text{the cat})}{C(\text{the})}

If "the cat" never appeared in the training data, C(\text{the cat}) = 0, leading to a zero probability.
With Laplace Smoothing:
P(\text{cat} \mid \text{the}) = \frac{C(\text{the cat}) + 1}{C(\text{the}) + V}

Even if C(\text{the cat}) = 0, the probability will now be non-zero.
Limitations
- Overcorrection: By adding one to every count, this method might overestimate the probability of unseen n-grams, especially when the vocabulary size is large.
- Better Alternatives: More sophisticated smoothing methods (like Good-Turing or Kneser-Ney) often provide more accurate probability estimates, especially for large and complex datasets.
Despite its simplicity, Laplace Smoothing is a useful technique when you need a quick, easy way to prevent zero probabilities in n-gram models.
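For illustration, the following sketch applies add-one smoothing to a bigram model over a small lowercase toy corpus (the corpus, helper name, and variable names are assumptions made for this example):

```python
from collections import Counter

# Laplace (add-one) smoothing for a bigram model.
tokens = "the cat sat on the mat the dog sat on the mat the dog barked".split()
V = len(set(tokens))                      # vocabulary size

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def laplace_bigram(prev, word):
    # (C(prev, word) + 1) / (C(prev) + V): never zero, even for unseen pairs.
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

print(laplace_bigram("the", "cat"))      # seen bigram
print(laplace_bigram("the", "barked"))   # unseen bigram, still non-zero
```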
Good-Turing Smoothing
The Good-Turing smoothing method is a statistical technique used to estimate the probability of unseen events in a dataset, particularly in language modeling and other natural language processing tasks. It was developed by Alan Turing and later refined by his colleague I. J. Good. The method is particularly useful for dealing with the problem of zero counts in n-gram models, where certain word sequences may not have appeared in the training data.
The central concept of the Good-Turing method is that the probability of unseen events (i.e., events that haven't been observed in the training data) can be estimated by redistributing the probability mass from low-frequency events (i.e., events that have been observed only a few times). The assumption is that if you've seen an event with frequency 1, there's a reasonable chance that other unseen events might occur with similar probabilities.
Key Terminology
- r: The count (or frequency) of an n-gram or event in the training data.
- N_r: The number of n-grams that occur exactly r times in the training data.
- For example, N_1 represents the number of n-grams that occur exactly once (singleton counts).
- Adjusted Counts: Good-Turing calculates an adjusted count C^* for an event that has been observed r times. These adjusted counts are used to smooth probabilities.
Good-Turing Formula
Good-Turing replaces the raw counts C(w) with adjusted counts C^*(w). For an event observed r times, the adjusted count is:

C^*(r) = (r + 1) \frac{N_{r+1}}{N_r}

Where:
- r is the observed count of an n-gram or event.
- N_r is the number of n-grams or events with count r.
- N_{r+1} is the number of n-grams or events with count r + 1.

This formula essentially adjusts the count of events with frequency r to account for events with frequency r + 1. The intuition is that if you observe some events only a few times, it is likely there are other events you haven't seen yet that would appear at similar rates in a larger dataset.
Estimating Probabilities
Once the adjusted counts are computed, the probabilities are calculated by normalizing them:

P_{GT}(w) = \frac{C^*(w)}{N}

Where N is the total number of observed events.
For unseen events (i.e., r = 0), the total probability mass is estimated using the following formula:

P_0 = \frac{N_1}{N}
This gives the probability of encountering an event that has not been seen in the training data, which is derived from the number of events that have been observed exactly once.
Example of Good-Turing Smoothing
Suppose you have the following bigram counts in your training data:
| Bigram | Count |
|---|---|
| the cat | 3 |
| a dog | 1 |
| a cat | 1 |
| the dog | 0 |
Step 1: Count Frequencies
- N_1 = 2 (two bigrams occur once: "a dog" and "a cat")
- N_3 = 1 (one bigram occurs three times: "the cat")
Step 2: Compute Adjusted Counts
For the bigram "the cat" ():
Since (no bigrams occur 4 times), we might need to revert to the raw count for .
For the bigram "a dog" or "a cat" (r=1):
In this case, we see that , meaning no bigrams occur twice, so the adjusted count would also be zero.
However, in practice, the Good-Turing method involves smoothing over the adjustments using techniques like regression for low r values to avoid such zeros.
Step 3: Calculate Probabilities
Once the adjusted counts are computed, we normalize them by dividing by the total number of observations to get probabilities.
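A minimal sketch of these adjusted-count calculations, using the toy bigram counts above and falling back to the raw count whenever N_r or N_{r+1} is zero (as discussed), might look like this:

```python
from collections import Counter

# Good-Turing adjusted counts for the toy bigram counts above.
bigram_counts = {("the", "cat"): 3, ("a", "dog"): 1, ("a", "cat"): 1}

N = sum(bigram_counts.values())        # total number of observed bigrams
Nr = Counter(bigram_counts.values())   # count-of-counts: N_r

def adjusted_count(r):
    # c* = (r + 1) * N_{r+1} / N_r; fall back to the raw count when undefined.
    if Nr[r] == 0 or Nr[r + 1] == 0:
        return r
    return (r + 1) * Nr[r + 1] / Nr[r]

p_unseen = Nr[1] / N                   # probability mass reserved for unseen bigrams
print(adjusted_count(1), adjusted_count(3), p_unseen)
```

In a full implementation the count-of-counts N_r would first be smoothed (for example by fitting a regression over r) so that N_{r+1} is never zero, as noted above.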
Advantages of Good-Turing Smoothing
- Effective for sparse data: It redistributes probability mass from observed events toward low-frequency and unseen events, making it particularly useful in cases where the training data is limited.
- Handles unseen events: It explicitly gives a non-zero probability for events that have never been seen, preventing the model from assigning zero probability to unseen events.
Limitations of Good-Turing Smoothing
- Low frequency instability: For events with very low counts (e.g., counts of 1), the method can be sensitive to fluctuations in data.
- Complexity: While more accurate than simpler smoothing methods like Laplace smoothing, Good-Turing requires calculating and smoothing the count of counts, which can make it computationally more involved.
Interpolation and Backoff
Interpolation and backoff are two techniques used in n-gram language models to handle the estimation of probabilities for word sequences, especially when data is sparse or when dealing with unseen n-grams. Both methods aim to improve the model's ability to generalize from limited data by combining information from different n-gram levels (e.g., unigrams, bigrams, trigrams).
1. Interpolation
Interpolation is a technique where the probability of an n-gram is estimated as a weighted combination of the probabilities from different n-gram models (e.g., unigram, bigram, trigram models). This approach ensures that the model takes into account not just the specific n-gram but also more general lower-order n-grams, balancing between specificity and generalization.
Formula for Interpolation
The interpolated probability of a word w_n given the preceding k words is:

P_{\text{interp}}(w_n \mid w_{n-1}, \ldots, w_{n-k}) = \lambda_k \, P(w_n \mid w_{n-1}, \ldots, w_{n-k}) + \lambda_{k-1} \, P(w_n \mid w_{n-1}, \ldots, w_{n-k+1}) + \cdots + \lambda_1 \, P(w_n)

Where:
- P(w_n \mid w_{n-1}, \ldots, w_{n-k}) is the probability of w_n given the preceding k words (the higher-order n-gram).
- \lambda_1, \ldots, \lambda_k are the interpolation weights, which sum to 1 (\lambda_1 + \lambda_2 + \cdots + \lambda_k = 1).
Example
Suppose we want to estimate the probability of "cat" given "the":
P_{\text{interp}}(\text{"cat"} \mid \text{"the"}) = \lambda_2 \cdot P(\text{"cat"} \mid \text{"the"}) + \lambda_1 \cdot P(\text{"cat"})

If the bigram "the cat" is not common, \lambda_2 might be smaller, and more weight would be given to the unigram probability.
2. Backoff
Backoff is another technique used when the higher-order n-gram has zero or very low counts, indicating that the model might not have enough data to reliably estimate the probability. In such cases, the model "backs off" to a lower-order n-gram, which is more general and likely to have better estimates.
Formula for Backoff
The backoff probability for a bigram model is calculated as follows:

P_{\text{backoff}}(w_n \mid w_{n-1}) =
\begin{cases}
P(w_n \mid w_{n-1}) & \text{if } \text{Count}(w_{n-1}, w_n) > 0 \\
\alpha(w_{n-1}) \cdot P(w_n) & \text{otherwise}
\end{cases}

Where:
- \alpha(w_{n-1}) is a normalization factor to ensure the probabilities sum to 1.
Example
Consider estimating the probability of "cat" given "the":
- If "the cat" is observed in the training data, use the bigram probability
- If "the cat" is not observed, back off to the unigram model and use , scaled by a normalization factor
Differences Between Interpolation and Backoff
- Interpolation combines probabilities from different n-gram levels simultaneously, using weights to balance their contributions.
- Backoff relies on higher-order n-grams when available and backs off to lower-order n-grams only when the higher-order n-grams have insufficient data.
Use Cases and Considerations
- Interpolation is useful when you want to always include information from all n-gram levels, which helps in creating smoother models.
- Backoff is more conservative, as it only resorts to lower-order n-grams when necessary. It's often used with methods like Kneser-Ney smoothing to ensure that probabilities are properly adjusted when backing off.
Example Corpus
"The cat sat on the mat. The dog sat on the mat. The dog barked."
From this, we can extract unigrams (single words), bigrams (two-word sequences), and trigrams (three-word sequences) with their frequencies:
Unigram Frequencies:
- The: 3
- cat: 1
- sat: 2
- on: 2
- the: 2
- mat: 2
- dog: 2
- barked: 1
Bigram Frequencies:
- The cat: 1
- cat sat: 1
- sat on: 2
- on the: 2
- the mat: 2
- The dog: 2
- dog sat: 1
- dog barked: 1
Trigram Frequencies:
- The cat sat: 1
- cat sat on: 1
- sat on the: 2
- on the mat: 2
- The dog sat: 1
- The dog barked: 1
Now, let's explain how interpolation and backoff use these different n-grams to smooth probability estimates.
1. Backoff
Backoff is a technique where the model first tries to estimate the probability of a word sequence using higher-order n-grams (e.g., trigrams). If the required n-gram doesn't exist in the training data, the model "backs off" to lower-order n-grams (e.g., bigrams or unigrams).
Example of Backoff
Let's compute the probability of the sentence "The cat sat on the mat" using a trigram backoff model.
Trigram Probability:
We first check if the trigram "The cat sat" exists in the corpus. From the trigram frequencies, we see that it does, with a frequency of 1. We can compute the probability of the trigram as:
P("sat"∣"The cat")=Count("The cat sat")Count("The cat")=11=1
Bigram Probability (Backing Off):
Next, we need P(\text{"on"} \mid \text{"cat sat"}). Suppose for a moment that the trigram "cat sat on" had not been observed in the training data. Since the trigram would then be unseen, we "back off" to the bigram ("sat", "on"), which has a frequency of 2. Thus:

P(\text{"on"} \mid \text{"sat"}) = \frac{\text{Count("sat on")}}{\text{Count("sat")}} = \frac{2}{2} = 1
Unigram Probability (Further Backing Off):
Now, consider P(\text{"the"} \mid \text{"on"}). We can estimate it from the bigram ("on", "the"), but if "on the" were not available, we could back off further to the unigram:

P(\text{"the"}) = \frac{\text{Count("the")}}{\text{Total number of words}} = \frac{2}{15} \approx 0.13
In backoff models, each time we encounter an unseen n-gram, we move to a lower-order model to find an available estimate.
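The following sketch implements this backoff cascade over the toy corpus; it is a simplified illustration, since real backoff models (e.g., Katz backoff) also apply discounting and the normalization factor α described earlier:

```python
from collections import Counter

# Simple trigram -> bigram -> unigram backoff over the toy corpus.
tokens = ("The cat sat on the mat . The dog sat on the mat . "
          "The dog barked .").replace(".", "").split()

uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
N = len(tokens)

def backoff_prob(w1, w2, w3):
    if tri[(w1, w2, w3)] > 0:
        return tri[(w1, w2, w3)] / bi[(w1, w2)]   # use the trigram estimate
    if bi[(w2, w3)] > 0:
        return bi[(w2, w3)] / uni[w2]             # back off to the bigram
    return uni[w3] / N                            # back off to the unigram

print(backoff_prob("The", "cat", "sat"))   # seen trigram: 1/1 = 1.0
print(backoff_prob("The", "cat", "on"))    # unseen trigram and bigram -> unigram
```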
2. Interpolation
Interpolation combines probabilities from higher-order and lower-order n-grams rather than fully backing off to lower-order ones. In interpolation, we weight the contributions of different n-gram levels (e.g., trigram, bigram, unigram) and combine them to get the final probability.
The interpolated probability formula is:
P(w_n \mid w_{n-2}, w_{n-1}) = \lambda_3 \, P(w_n \mid w_{n-2}, w_{n-1}) + \lambda_2 \, P(w_n \mid w_{n-1}) + \lambda_1 \, P(w_n)

Where:
- \lambda_3, \lambda_2, \lambda_1 are the weights for the trigram, bigram, and unigram probabilities, respectively, and
- \lambda_3 + \lambda_2 + \lambda_1 = 1.
Example of Interpolation
Let's again estimate the probability of "sat" following "The cat" using interpolation. Assume we set \lambda_3 = 0.5, \lambda_2 = 0.3, and \lambda_1 = 0.2.
Trigram Probability: P(\text{"sat"} \mid \text{"The cat"}) = \frac{\text{Count("The cat sat")}}{\text{Count("The cat")}} = \frac{1}{1} = 1
Bigram Probability: P(\text{"sat"} \mid \text{"cat"}) = \frac{\text{Count("cat sat")}}{\text{Count("cat")}} = \frac{1}{1} = 1
Unigram Probability: P(\text{"sat"}) = \frac{\text{Count("sat")}}{\text{Total number of words}} = \frac{2}{15} \approx 0.13
Thus, the interpolated probability for "sat" given "The cat" is:

P_{\text{interp}}(\text{"sat"} \mid \text{"The cat"}) = 0.5 \times 1 + 0.3 \times 1 + 0.2 \times 0.13 = 0.5 + 0.3 + 0.026 \approx 0.83
In interpolation, we do not discard lower-order n-grams, even if higher-order ones exist. Instead, we give each n-gram level a weight and combine them to estimate the probability.
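Finally, a small sketch of this interpolation scheme over the same toy corpus, using the illustrative weights λ3 = 0.5, λ2 = 0.3, λ1 = 0.2 from the example above:

```python
from collections import Counter

# Linear interpolation of trigram, bigram, and unigram estimates.
tokens = ("The cat sat on the mat . The dog sat on the mat . "
          "The dog barked .").replace(".", "").split()

uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
N = len(tokens)

LAMBDAS = (0.5, 0.3, 0.2)   # trigram, bigram, unigram weights (sum to 1)

def interp_prob(w1, w2, w3):
    p_tri = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p_bi = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p_uni = uni[w3] / N
    return LAMBDAS[0] * p_tri + LAMBDAS[1] * p_bi + LAMBDAS[2] * p_uni

print(interp_prob("The", "cat", "sat"))   # 0.5*1 + 0.3*1 + 0.2*(2/15) ≈ 0.83
```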