Saturday, September 21, 2024

Natural Language Processing - N-Grams

In natural language processing (NLP), an n-gram is a contiguous sequence of n items (typically words or characters) from a given text or speech sample. The value of n can vary, leading to different types of n-grams:

  • Unigram (1-gram): Single words (e.g., "cat")
  • Bigram (2-gram): Pairs of consecutive words (e.g., "the cat")
  • Trigram (3-gram): Triples of consecutive words (e.g., "the black cat")
  • Higher-order n-grams: Sequences of four or more words
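As a quick illustration, here is a minimal Python sketch (the helper name extract_ngrams is an illustrative choice, not a standard-library function) that pulls word-level n-grams of any order out of a tokenized sentence:

# Minimal sketch: extracting word-level n-grams from a whitespace-tokenized sentence.
def extract_ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples) in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the black cat sat on the mat".split()
print(extract_ngrams(tokens, 1))  # unigrams: ('the',), ('black',), ...
print(extract_ngrams(tokens, 2))  # bigrams:  ('the', 'black'), ('black', 'cat'), ...
print(extract_ngrams(tokens, 3))  # trigrams: ('the', 'black', 'cat'), ...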

Applications of N-grams in NLP:

  1. Language Modeling: N-grams are used to predict the next word in a sequence, helping in applications like speech recognition and text generation.

  2. Text Classification: N-grams can serve as features for classifying text into categories (e.g., spam detection, sentiment analysis).

  3. Machine Translation: N-gram models help improve the accuracy of translating phrases from one language to another by capturing the context of words.

  4. Information Retrieval: N-grams enhance search engines by improving the matching of user queries with relevant documents.

  5. Spell Checking: N-grams can assist in identifying likely misspellings and suggesting corrections based on common word sequences.

  6. Sentiment Analysis: Analyzing n-grams can help identify sentiment in a text by examining the frequency and combination of words.

  7. Text Summarization: N-grams can be used to identify key phrases and concepts in a document, aiding in the generation of concise summaries.

  8. Chatbots and Virtual Assistants: N-grams help in understanding and generating human-like responses in conversational agents.

Counting Words in Corpora

Counting words in corpora is a fundamental task in natural language processing (NLP) and text analysis. It provides essential insights into language usage, helps in building language models, and supports various linguistic studies. A corpus (plural: corpora) is essentially a large and structured set of texts, used extensively in NLP for statistical analysis of language patterns. These texts can be written documents, transcripts of spoken language, or a combination of both.

Counting words in a corpus serves several purposes. One of the primary goals is frequency analysis, where the occurrence of words or phrases is measured. This is useful for identifying trends, common words, or rare terms. It also plays a critical role in determining the lexical diversity of a text, which refers to how varied the vocabulary is, often quantified by metrics such as the Type-Token Ratio (TTR). Beyond that, counting words helps in language modeling, where probabilities of word sequences are estimated to predict future text, and in information retrieval, where understanding the frequency and context of keywords can improve search engine algorithms.

The process of counting words typically begins with text preprocessing. This involves preparing the text so that the analysis can be conducted accurately. One common step is converting all text to lowercase, which prevents the algorithm from treating "The" and "the" as two different words. Another essential step is tokenization, where the text is split into individual words or tokens. Depending on the application, tokenization might involve splitting by spaces or using more complex rules like regular expressions to capture words accurately. Additionally, it is common to remove punctuation and sometimes perform stemming or lemmatization, which reduces words to their base forms, ensuring that different variants of a word (like "run," "running," and "ran") are treated as the same term.

Once preprocessing is complete, the actual word counting takes place. This can be implemented using various methods, such as dictionaries or hash maps, where each word is stored as a key, and its count as a value. As the program iterates through the text, it updates the count for each word. After this, the results are often presented as a frequency distribution, which shows the most common words and their frequencies. The results can be visualized through graphs, histograms, or word clouds, making it easier to identify patterns.
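As a concrete sketch of the pipeline just described (lowercasing, tokenization, counting, frequency distribution), the following Python uses only the standard library; the simple regular-expression tokenizer is an assumption made for brevity, not the only reasonable choice:

import re
from collections import Counter

def word_frequencies(text):
    """Lowercase, tokenize with a crude word regex, and count the tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())  # letters and apostrophes only
    return Counter(tokens)

corpus = "The dog barks. The cat meows. The dog runs."
freqs = word_frequencies(corpus)
print(freqs.most_common(3))              # most frequent words, e.g. [('the', 3), ('dog', 2), ...]
print(len(freqs) / sum(freqs.values()))  # type-token ratio (TTR) as a lexical-diversity measure

A real pipeline might swap in a proper tokenizer, stop-word removal, or stemming/lemmatization, as discussed above.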

There are also several ambiguities in word counting that must be handled carefully. For instance, homographs—words that are spelled the same but have different meanings—can lead to confusion if the context is not taken into account. Similarly, handling hyphenated words (e.g., "well-being") can pose challenges; should they be treated as one word or two? Compound words (e.g., "toothbrush") and contractions (e.g., "don't" vs. "do not") present similar dilemmas. Properly addressing these ambiguities is essential for achieving accurate results in word counting.

The applications of word counting are wide-ranging. In content analysis, word frequencies can help identify the key themes or topics of a document. Lexical profiling involves measuring the complexity and richness of language in a text, which can be used to compare different authors or genres. In language learning, word counting can help instructors or students analyze vocabulary usage in educational materials to ensure that texts are at the appropriate difficulty level.

Simple (Unsmoothed) N-grams

In natural language processing (NLP), n-grams are contiguous sequences of words (or characters) extracted from a given text. The "n" in n-grams refers to the number of elements in each sequence. For instance, when n = 1, the sequence is a unigram; for n = 2, it is a bigram; and for n = 3, it is a trigram. These simple n-gram models help capture the structure and dependencies of words in a language, making them useful in tasks like language modeling, speech recognition, and text generation.

1. Definition of Simple (Unsmoothed) N-grams

A simple n-gram model assumes that the probability of a word in a sequence depends only on the previous n-1 words. It calculates the probability of a word sequence by breaking it into smaller components (n-grams) and multiplying the probabilities of these n-grams. In the case of unsmoothed n-grams, the probabilities are calculated directly from the frequency of occurrences in the training data, without any adjustments for unseen data.

The basic formula for an n-gram probability is:

P(w_1, w_2, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i | w_{i-(n-1)}, \dots, w_{i-1})

For example, in a bigram model, the probability of a word w_i is conditioned on the previous word w_{i-1}:

P(w_i | w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i)}{\text{Count}(w_{i-1})}

Here, \text{Count}(w_{i-1}, w_i) represents how often the word pair (w_{i-1}, w_i) appears in the corpus, and \text{Count}(w_{i-1}) is the frequency of the first word in the pair.

2. Example

"The dog barks. The cat meows. The dog runs."

To build a bigram model, we first tokenize the corpus and calculate word pair frequencies:

  • ("The", "dog"): 2 occurrences
  • ("dog", "barks"): 1 occurrence
  • ("The", "cat"): 1 occurrence
  • ("cat", "meows"): 1 occurrence
  • ("dog", "runs"): 1 occurrence

To calculate the probability of the sentence "The dog barks", we use the bigram probabilities:

P("The dog barks")=P("dog""The")×P("barks""dog")P(\text{"The dog barks"}) = P(\text{"dog"}|\text{"The"}) \times P(\text{"barks"}|\text{"dog"})

Using the counts from the corpus:

P("dog""The")=23,P("barks""dog")=12P(\text{"dog"}|\text{"The"}) = \frac{2}{3}, \quad P(\text{"barks"}|\text{"dog"}) = \frac{1}{2}

Thus:

P("The dog barks")=23×12=13P(\text{"The dog barks"}) = \frac{2}{3} \times \frac{1}{2} = \frac{1}{3}

This means that, according to the bigram model, the probability of the sentence "The dog barks" is \frac{1}{3}.
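A minimal sketch of this unsmoothed bigram calculation over the same toy corpus (plain Python; stripping periods and splitting on whitespace is a simplifying assumption about tokenization):

from collections import Counter

corpus = "The dog barks. The cat meows. The dog runs."
tokens = corpus.replace(".", "").split()          # ['The', 'dog', 'barks', 'The', ...]

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_prob(w_prev, w):
    """Unsmoothed MLE estimate P(w | w_prev) = Count(w_prev, w) / Count(w_prev)."""
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

# P("The dog barks") ~ P(dog | The) * P(barks | dog) = 2/3 * 1/2 = 1/3
print(bigram_prob("The", "dog") * bigram_prob("dog", "barks"))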

3. Strengths of Simple (Unsmoothed) N-grams

  1. Simplicity: N-grams are easy to compute and understand. They require basic counting and division operations, making them a straightforward introduction to statistical language modeling.

  2. Efficiency: For small corpora or simple tasks, n-gram models are computationally efficient. They don't require large amounts of data or complex algorithms.

  3. Local Context: N-grams capture local dependencies between words, which can often be sufficient for short-range predictions or simple language tasks.

4. Limitations of Simple (Unsmoothed) N-grams

  1. Data Sparsity: One of the most significant challenges with unsmoothed n-grams is the problem of data sparsity. In many cases, certain word combinations may not appear in the training data, leading to zero probabilities for valid sequences. For example, in the bigram model, if the word pair (w_{i-1}, w_i) is not observed in the training data, P(w_i | w_{i-1}) will be zero, making the entire sentence probability zero. This severely limits the model's ability to generalize to new data.

  2. Memory Requirements: As n increases, the number of possible n-grams grows exponentially, leading to large storage requirements. For example, a trigram model would need to store all possible triples of words, which is impractical for larger vocabularies.

  3. Lack of Long-range Dependencies: Simple n-gram models only capture relationships between adjacent words or a small context window (depending on n). This ignores long-range dependencies between words that might be crucial for understanding the meaning of a sentence.

  4. Zero Probability for Unseen N-grams: Unsmoothed n-gram models assign zero probability to any word sequence that does not appear in the training data, making them unsuitable for handling real-world text, where new phrases or expressions are common.

Relative Frequency and Maximum Likelihood Estimation (MLE) 

Estimating the probability of words and sequences of words (n-grams) is crucial for tasks like language modeling, speech recognition, machine translation, and text generation. Two foundational methods for estimating these probabilities are relative frequency and maximum likelihood estimation (MLE). These methods provide simple yet effective ways to model language based on observed data.

1. Relative Frequency

Relative frequency is a basic method for estimating the probability of an event by dividing the number of times that event occurs by the total number of all possible events. In the context of NLP, the event is typically a word or n-gram, and the total number of events is the total number of words or word sequences in a given corpus.

The formula for relative frequency is:

P(w) = \frac{\text{Count}(w)}{\text{Total number of words in the corpus}}

Where:

  • P(w) is the probability of a word w,
  • \text{Count}(w) is the number of times w occurs in the corpus,
  • the total number of words is the sum of all word occurrences in the corpus.

Example:

"The dog barks. The cat meows."

  • The total number of words is 6 ("The," "dog," "barks," "The," "cat," "meows").
  • The word "The" occurs 2 times, so its relative frequency is: P(\text{"The"}) = \frac{2}{6} \approx 0.33
  • The word "dog" occurs 1 time, so its relative frequency is: P(\text{"dog"}) = \frac{1}{6} \approx 0.17

Relative frequency is a straightforward and interpretable way to calculate probabilities in NLP, especially when dealing with simple word-level statistics.

2. Maximum Likelihood Estimation (MLE)

Maximum likelihood estimation (MLE) is a statistical method used to estimate the parameters of a probability distribution by maximizing the likelihood that the observed data came from that distribution. In NLP, MLE is used to estimate the probability of words or word sequences by assuming that the most likely model of the language is the one that maximizes the probability of the observed corpus.

For n-gram language models, MLE assumes that the best estimate of the probability of an n-gram is the relative frequency of that n-gram in the corpus. This is the same approach as calculating relative frequency, making MLE a specific case of relative frequency estimation in the context of language modeling.

The formula for MLE of an n-gram is:

P(w_n | w_{n-1}, \dots, w_1) = \frac{\text{Count}(w_1, \dots, w_n)}{\text{Count}(w_1, \dots, w_{n-1})}

Where:

  • P(w_n | w_{n-1}, \dots, w_1) is the probability of word w_n given the previous words w_1, \dots, w_{n-1},
  • \text{Count}(w_1, \dots, w_n) is the number of times the n-gram (w_1, \dots, w_n) occurs in the corpus,
  • \text{Count}(w_1, \dots, w_{n-1}) is the number of times the preceding n-1 words appear.

Using the same corpus ("The dog barks. The cat meows."), we can calculate the MLE estimate for the bigram P(\text{"dog"} | \text{"The"}):

  • The bigram ("The," "dog") occurs 1 time,
  • The word "The" occurs 2 times.

Thus, the MLE estimate for P(\text{"dog"} | \text{"The"}) is:

P("dog""The")=Count("The dog")Count("The")=12=0.5

In this case, MLE gives us the best estimate for the probability of "dog" following "The" based on the data we have observed.

Smoothing

N-gram smoothing is a technique used in natural language processing and statistical language modeling to handle the problem of zero probabilities in n-gram models. In an n-gram model, the probability of a word depends on the preceding n-1 words. However, in practice, many word sequences might not appear in the training data, leading to zero probabilities. Smoothing adjusts these probabilities so that unseen sequences can still have a non-zero probability.

Laplace Smoothing

Laplace Smoothing provides a solution by adding a small, constant value (usually 1) to all n-gram counts. This ensures that every possible word sequence, even those not observed in the training data, has a non-zero probability.

Formula for Laplace Smoothing

For an n-gram model, the probability of a word w_n given its k = n-1 preceding words w_{n-1}, w_{n-2}, \ldots, w_{n-k} is calculated as:

P(w_n | w_{n-1}, \ldots, w_{n-k}) = \frac{C(w_{n-k}, \ldots, w_{n-1}, w_n) + 1}{C(w_{n-k}, \ldots, w_{n-1}) + V}

  • C(w_{n-k}, \ldots, w_{n-1}, w_n) is the count of the n-gram (the sequence of n words) in the training data.
  • C(w_{n-k}, \ldots, w_{n-1}) is the count of the preceding n-1 context words.
  • V is the total number of unique words in the vocabulary.
  • The "+1" in the numerator adds one count to every n-gram.
  • The "+V" in the denominator accounts for the additional count added for each of the V possible words.

Why Add-One?

By adding one to each count, we ensure that no probability is ever zero. The technique spreads the probability mass slightly across all possible n-grams, including those that were not observed. This results in a smoother, more generalizable model, especially in cases with limited training data.

Example

Suppose you are using a bigram model (where the probability of a word depends on the previous word) and you want to calculate the probability of "cat" following "the":

  1. Without Smoothing:

    P(\text{cat} | \text{the}) = \frac{C(\text{the cat})}{C(\text{the})}

    If "the cat" never appeared in the training data, C(\text{the cat}) = 0, leading to a zero probability.

  2. With Laplace Smoothing:

    P(\text{cat} | \text{the}) = \frac{C(\text{the cat}) + 1}{C(\text{the}) + V}

    Even if C(\text{the cat}) = 0, the probability will now be non-zero.

Limitations

  • Overcorrection: By adding one to every count, this method might overestimate the probability of unseen n-grams, especially when the vocabulary size is large.
  • Better Alternatives: More sophisticated smoothing methods (like Good-Turing or Kneser-Ney) often provide more accurate probability estimates, especially for large and complex datasets.

Despite its simplicity, Laplace Smoothing is a useful technique when you need a quick, easy way to prevent zero probabilities in n-gram models.
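A hedged sketch of add-one smoothing for bigrams on the earlier toy corpus; taking V to be the number of distinct tokens seen in training is one common convention, assumed here for simplicity:

from collections import Counter

tokens = "The dog barks. The cat meows. The dog runs.".replace(".", "").split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)  # vocabulary size: distinct tokens in the training data (assumed convention)

def laplace_bigram_prob(w_prev, w):
    """Add-one smoothed estimate: (C(w_prev, w) + 1) / (C(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

print(laplace_bigram_prob("The", "dog"))    # seen bigram: 3/9, lower than the unsmoothed 2/3
print(laplace_bigram_prob("The", "meows"))  # unseen bigram: small but non-zero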

Good-Turing Smoothing

The Good-Turing smoothing method is a statistical technique used to estimate the probability of unseen events in a dataset, particularly in language modeling and other natural language processing tasks. It was developed by Alan Turing and later refined by his colleague I. J. Good. The method is particularly useful for dealing with the problem of zero counts in n-gram models, where certain word sequences may not have appeared in the training data.

The central concept of the Good-Turing method is that the probability of unseen events (i.e., events that haven't been observed in the training data) can be estimated by redistributing the probability mass from low-frequency events (i.e., events that have been observed only a few times). The assumption is that if you've seen an event with frequency 1, there's a reasonable chance that other unseen events might occur with similar probabilities.

Key Terminology

  • C(w): The count (or frequency) of an n-gram or event w in the training data.
  • N_r: The number of n-grams that occur exactly r times in the training data.
    • For example, N_1 represents the number of n-grams that occur exactly once (singleton counts).
  • Adjusted Counts: Good-Turing calculates an adjusted count C^*(w) for an event that has been observed r times. These adjusted counts are used to smooth probabilities.

Good-Turing Formula

Good-Turing replaces the raw counts C(w) with adjusted counts C^*(w), which are calculated as follows:

C^*(r) = (r+1) \frac{N_{r+1}}{N_r}

Where:

  • r is the observed count of an n-gram or event.
  • N_r is the number of n-grams or events with count r.
  • N_{r+1} is the number of n-grams or events with count r+1.

This formula essentially adjusts the count of events with frequency r to account for events with frequency r+1. The intuition is that if you observe some events r times, it's likely there are other events you haven't seen yet that would appear r+1 times if you had a larger dataset.

Estimating Probabilities

Once the adjusted counts are computed, the probabilities are calculated by normalizing them:

P(w) = \frac{C^*(w)}{N}

Where N is the total number of observed events.

For unseen events (i.e., r = 0), the total probability mass reserved for them is estimated using the following formula:

P(\text{unseen}) = \frac{N_1}{N}

This gives the probability of encountering an event that has not been seen in the training data, which is derived from the number of events that have been observed exactly once.

Example of Good-Turing Smoothing

Suppose you have the following bigram counts in your training data:

Bigram     Count C
the cat    3
a dog      1
a cat      1
the dog    0

Step 1: Count Frequencies

  • N_1 = 2 (two bigrams occur once: "a dog" and "a cat")
  • N_3 = 1 (one bigram occurs three times: "the cat")

Step 2: Compute Adjusted Counts

For the bigram "the cat" (r=3r = 3):

C(3)=(3+1)N4N3=401=0

Since N4=0N_4 = 0 (no bigrams occur 4 times), we might need to revert to the raw count for r=3r = 3.

For the bigram "a dog" or "a cat" (r=1r = 1):

C(1)=(1+1)N2N1=202=0

In this case, we see that N2=0N_2 = 0, meaning no bigrams occur twice, so the adjusted count would also be zero.

However, in practice, the Good-Turing method involves smoothing over the adjustments using techniques like regression for low rr values to avoid such zeros.

Step 3: Calculate Probabilities

Once the adjusted counts are computed, we normalize them by dividing by the total number of observations to get probabilities.
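The sketch below illustrates the core computation (count-of-counts and adjusted counts) on the toy bigram counts above. Falling back to the raw count whenever N_{r+1} = 0 is a simplification of the regression-based smoothing just mentioned, not the full method, and the resulting values would still need renormalization:

from collections import Counter

bigram_counts = {("the", "cat"): 3, ("a", "dog"): 1, ("a", "cat"): 1}
N = sum(bigram_counts.values())                      # total observed bigram tokens
count_of_counts = Counter(bigram_counts.values())    # N_r: how many bigrams occur exactly r times

def adjusted_count(r):
    """Good-Turing adjusted count c* = (r + 1) * N_{r+1} / N_r (fall back to r if N_{r+1} = 0)."""
    if count_of_counts[r + 1] == 0:
        return r                                     # simplification: keep the raw count
    return (r + 1) * count_of_counts[r + 1] / count_of_counts[r]

for bigram, r in bigram_counts.items():
    print(bigram, adjusted_count(r) / N)             # smoothed estimate (not renormalized here)
print("P(unseen) =", count_of_counts[1] / N)         # mass reserved for unseen bigrams: N_1 / N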

Advantages of Good-Turing Smoothing

  • Effective for sparse data: It redistributes probability mass from high-frequency events to low-frequency and unseen events, making it particularly useful in cases where the training data is limited.
  • Handles unseen events: It explicitly gives a non-zero probability for events that have never been seen, preventing the model from assigning zero probability to unseen events.

Limitations of Good-Turing Smoothing

  • Low frequency instability: For events with very low counts (e.g., counts of 1), the method can be sensitive to fluctuations in data.
  • Complexity: While more accurate than simpler smoothing methods like Laplace smoothing, Good-Turing requires calculating and smoothing the count of counts, which can make it computationally more involved.

Interpolation and Backoff

Interpolation and backoff are two techniques used in n-gram language models to handle the estimation of probabilities for word sequences, especially when data is sparse or when dealing with unseen n-grams. Both methods aim to improve the model's ability to generalize from limited data by combining information from different n-gram levels (e.g., unigrams, bigrams, trigrams).

1. Interpolation

Interpolation is a technique where the probability of an n-gram is estimated as a weighted combination of the probabilities from different n-gram models (e.g., unigram, bigram, trigram models). This approach ensures that the model takes into account not just the specific n-gram but also more general lower-order n-grams, balancing between specificity and generalization.

Formula for Interpolation

The interpolated probability P_{\text{interp}}(w_n | w_{n-1}, \ldots, w_{n-k}) is given by:

P_{\text{interp}}(w_n | w_{n-1}, \ldots, w_{n-k}) = \lambda_k P(w_n | w_{n-1}, \ldots, w_{n-k}) + \lambda_{k-1} P(w_n | w_{n-1}, \ldots, w_{n-k+1}) + \cdots + \lambda_1 P(w_n)

Where:

  • P(w_n | w_{n-1}, \ldots, w_{n-k}) is the probability of w_n given the preceding k words (the higher-order n-gram).
  • \lambda_k, \lambda_{k-1}, \ldots, \lambda_1 are the interpolation weights, which sum to 1 (\sum_i \lambda_i = 1).

Example

Suppose we want to estimate the probability of "cat" given "the":

Pinterp("cat""the")=λ2P("cat""the")+λ1P("cat")P_{interp}(\text{"cat"} | \text{"the"}) = \lambda_2 \cdot P(\text{"cat"} | \text{"the"}) + \lambda_1 \cdot P(\text{"cat"})
  • P("cat""the")is the bigram probability.
  • P("cat") is the unigram probability.

If the bigram "the cat" is not common, λ2\lambda_2 might be smaller, and more weight would be given to the unigram probability.

2. Backoff

Backoff is another technique used when the higher-order n-gram has zero or very low counts, indicating that the model might not have enough data to reliably estimate the probability. In such cases, the model "backs off" to a lower-order n-gram, which is more general and likely to have better estimates.

Formula for Backoff

The backoff probability P_{\text{backoff}}(w_n | w_{n-1}, \ldots, w_{n-k}) is calculated as follows:

P_{\text{backoff}}(w_n | w_{n-1}, \ldots, w_{n-k}) = \begin{cases} P(w_n | w_{n-1}, \ldots, w_{n-k}), & \text{if } C(w_{n-k}, \ldots, w_{n-1}, w_n) > 0 \\ \alpha(w_{n-1}, \ldots, w_{n-k}) \cdot P_{\text{backoff}}(w_n | w_{n-1}, \ldots, w_{n-k+1}), & \text{otherwise} \end{cases}

Where:

  • \alpha(w_{n-1}, \ldots, w_{n-k}) is a normalization factor that ensures the probabilities sum to 1.

Example

Consider estimating the probability of "cat" given "the":

  • If "the cat" is observed in the training data, use the bigram probability P("cat""the")P(\text{"cat"} | \text{"the"})
  • If "the cat" is not observed, back off to the unigram model and use P("cat")P(\text{"cat"}), scaled by a normalization factor α("the").

Differences Between Interpolation and Backoff

  • Interpolation combines probabilities from different n-gram levels simultaneously, using weights to balance their contributions.
  • Backoff relies on higher-order n-grams when available and backs off to lower-order n-grams only when the higher-order n-grams have insufficient data.

Use Cases and Considerations

  • Interpolation is useful when you want to always include information from all n-gram levels, which helps in creating smoother models.
  • Backoff is more conservative, as it only resorts to lower-order n-grams when necessary. It's often used with methods like Kneser-Ney smoothing to ensure that probabilities are properly adjusted when backing off.

Example Corpus

"The cat sat on the mat. The dog sat on the mat. The dog barked."

From this, we can extract unigrams (single words), bigrams (two-word sequences), and trigrams (three-word sequences) with their frequencies:

Unigram Frequencies:

  • The: 3
  • cat: 1
  • sat: 2
  • on: 2
  • the: 2
  • mat: 2
  • dog: 2
  • barked: 1

Bigram Frequencies:

  • The cat: 1
  • cat sat: 1
  • sat on: 2
  • on the: 2
  • the mat: 2
  • The dog: 2
  • dog sat: 1
  • dog barked: 1

Trigram Frequencies:

  • The cat sat: 1
  • cat sat on: 1
  • sat on the: 2
  • on the mat: 2
  • The dog sat: 1
  • dog sat on: 1
  • The dog barked: 1

Now, let's explain how interpolation and backoff use these different n-grams to smooth probability estimates.


1. Backoff

Backoff is a technique where the model first tries to estimate the probability of a word sequence using higher-order n-grams (e.g., trigrams). If the required n-gram doesn't exist in the training data, the model "backs off" to lower-order n-grams (e.g., bigrams or unigrams).

Example of Backoff

Let's compute the probability of the sentence "The cat sat on the mat" using a trigram backoff model.

  1. Trigram Probability: We first check if the trigram "The cat sat" exists in the corpus. From the trigram frequencies, we see that it does, with a frequency of 1. We can compute the probability of the trigram as:

    P("sat""The cat")=Count("The cat sat")Count("The cat")=11=1
  2. Bigram Probability (Backing Off): Next, suppose we need P(\text{"on"} | \text{"cat sat"}) but that the trigram "cat sat on" had not been observed in the training data (it occurs only once here, so it could easily be missing from a slightly different corpus). Since the trigram would then be unseen, we "back off" to the bigram P(\text{"on"}|\text{"sat"}), which has a frequency of 2. Thus:

    P(\text{"on"} | \text{"sat"}) = \frac{\text{Count}(\text{"sat on"})}{\text{Count}(\text{"sat"})} = \frac{2}{2} = 1
  3. Unigram Probability (Further Backing Off): Now consider P(\text{"the"} | \text{"sat on"}). We could back off to the bigram P(\text{"the"} | \text{"on"}), and if even that bigram were unavailable, we would back off further to the unigram:

    P(\text{"the"}) = \frac{\text{Count}(\text{"the"})}{\text{Total number of words}} = \frac{2}{15} \approx 0.13

In backoff models, each time we encounter an unseen n-gram, we move to a lower-order model to find an available estimate.
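A minimal sketch of this trigram-to-bigram-to-unigram backoff on the example corpus; it uses unsmoothed relative frequencies and omits the discounting and normalization factor α that a proper backoff model (e.g., Katz backoff) requires:

from collections import Counter

tokens = "The cat sat on the mat. The dog sat on the mat. The dog barked.".replace(".", "").split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
N = len(tokens)

def backoff_prob(w, w1=None, w2=None):
    """Try the trigram P(w | w2 w1), then the bigram P(w | w1), then the unigram P(w)."""
    if w2 is not None and tri[(w2, w1, w)] > 0:
        return tri[(w2, w1, w)] / bi[(w2, w1)]
    if w1 is not None and bi[(w1, w)] > 0:
        return bi[(w1, w)] / uni[w1]
    return uni[w] / N

print(backoff_prob("sat", w1="cat", w2="The"))   # trigram available: 1.0
print(backoff_prob("on", w1="sat"))              # bigram: 2/2 = 1.0
print(backoff_prob("the"))                       # unigram fallback: 2/15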


2. Interpolation

Interpolation combines probabilities from higher-order and lower-order n-grams rather than fully backing off to lower-order ones. In interpolation, we weight the contributions of different n-gram levels (e.g., trigram, bigram, unigram) and combine them to get the final probability.

The interpolated probability formula is:

P_{\text{interp}}(w_n | w_{n-2}, w_{n-1}) = \lambda_3 P(w_n | w_{n-2}, w_{n-1}) + \lambda_2 P(w_n | w_{n-1}) + \lambda_1 P(w_n)

Where:

  • \lambda_3, \lambda_2, \lambda_1 are the weights for the trigram, bigram, and unigram probabilities, respectively, and
  • \lambda_3 + \lambda_2 + \lambda_1 = 1.

Example of Interpolation

Let’s again calculate the probability of "The cat sat on the mat" using interpolation. Assume we set \lambda_3 = 0.5, \lambda_2 = 0.3, and \lambda_1 = 0.2.

  1. Trigram Probability: P(\text{"sat"}|\text{"The cat"}) = \frac{1}{1} = 1

  2. Bigram Probability: P(\text{"sat"}|\text{"cat"}) = \frac{1}{1} = 1

  3. Unigram Probability: P(\text{"sat"}) = \frac{2}{15} \approx 0.13

Thus, the interpolated probability for "The cat sat" is:

P("sat""The cat")=0.5×1+0.3×1+0.2×0.14=0.5+0.3+0.028=0.828

In interpolation, we do not discard lower-order n-grams, even if higher-order ones exist. Instead, we give each n-gram level a weight and combine them to estimate the probability.
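The same interpolation can be written as a short Python function; the weights λ3 = 0.5, λ2 = 0.3, λ1 = 0.2 are the illustrative values assumed above, not weights learned from held-out data:

from collections import Counter

tokens = "The cat sat on the mat. The dog sat on the mat. The dog barked.".replace(".", "").split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
N = len(tokens)

def interp_prob(w, w1, w2, lambdas=(0.5, 0.3, 0.2)):
    """Interpolated P(w | w2 w1) = l3*P_tri + l2*P_bi + l1*P_uni (unsmoothed components)."""
    l3, l2, l1 = lambdas
    p_tri = tri[(w2, w1, w)] / bi[(w2, w1)] if bi[(w2, w1)] else 0.0
    p_bi = bi[(w1, w)] / uni[w1] if uni[w1] else 0.0
    p_uni = uni[w] / N
    return l3 * p_tri + l2 * p_bi + l1 * p_uni

# P("sat" | "The cat") = 0.5*1 + 0.3*1 + 0.2*(2/15), roughly 0.83
print(interp_prob("sat", w1="cat", w2="The"))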

Deleted Interpolation

Deleted interpolation is a smoothing technique used in natural language processing (NLP), particularly in language modeling. It is used to combine multiple models of varying orders (e.g., unigrams, bigrams, trigrams) to estimate the probability of a word given its history, while handling the problem of sparse data. It allows for the dynamic adjustment of interpolation weights based on the amount of data available for each context.

Language models assign probabilities to sequences of words. For example, in an n-gram model, the probability of a word w_i is conditioned on the previous n-1 words:

P(w_i | w_{i-(n-1)}, \dots, w_{i-1})

However, higher-order n-gram models (e.g., n = 3) tend to suffer from data sparsity because the more specific the context (e.g., the previous two words), the fewer occurrences there are in the training corpus. Lower-order models (e.g., bigrams or unigrams), while less sparse, provide less precise estimates.

What is Deleted Interpolation?

Deleted interpolation is a method of combining different n-gram models (e.g., trigram, bigram, and unigram) into a single estimate, with the interpolation weights chosen using held-out ("deleted") data rather than the data used to collect the counts.

The Deleted Interpolation Formula

The probability of a word given its previous history can be interpolated from several n-gram models of different orders. For a trigram model:

P_{\text{interp}}(w_i | w_{i-2}, w_{i-1}) = \lambda_3 P(w_i | w_{i-2}, w_{i-1}) + \lambda_2 P(w_i | w_{i-1}) + \lambda_1 P(w_i)

Here:

  • P(w_i | w_{i-2}, w_{i-1}) is the trigram probability.
  • P(w_i | w_{i-1}) is the bigram probability.
  • P(w_i) is the unigram probability.
  • \lambda_1, \lambda_2, \lambda_3 are interpolation weights, which sum to 1 (\lambda_1 + \lambda_2 + \lambda_3 = 1).

Deleted interpolation adjusts these weights dynamically based on the data, so that more reliable models (based on more frequent contexts) receive higher weights.

How Deleted Interpolation Works

  1. Estimate Model Probabilities:
    First, you estimate the individual n-gram probabilities (trigram, bigram, and unigram) from the training data, typically by maximum likelihood estimation.

  2. Hold-out Data:
    In deleted interpolation, the weights \lambda_1, \lambda_2, \lambda_3 are estimated on held-out data, i.e., a portion of the data that was not used to collect the n-gram counts.

  3. Calculate Weights:
    The interpolation weights are dynamically adjusted based on the reliability of the different n-gram estimates: frequent, well-attested contexts push weight toward the higher-order model, while sparse contexts shift weight toward the lower-order models.

  4. Deleted Estimation:
    Instead of using the same data to estimate both the model and the interpolation weights, the "deleted estimation" procedure involves holding out different parts of the data to estimate the interpolation weights. This process prevents overfitting and ensures the interpolation weights are chosen based on unseen data.

    Example

    Suppose you have the following probability estimates from a language model:

    • Trigram probability: P(\text{cat} | \text{the, big}) = 0.02
    • Bigram probability: P(\text{cat} | \text{big}) = 0.05
    • Unigram probability: P(\text{cat}) = 0.1

    Using deleted interpolation, you can estimate the probability of "cat" based on the trigram, bigram, and unigram models as follows:

    P_{\text{interp}}(\text{cat} | \text{the, big}) = \lambda_3 P(\text{cat} | \text{the, big}) + \lambda_2 P(\text{cat} | \text{big}) + \lambda_1 P(\text{cat})

    Assuming \lambda_3 = 0.5, \lambda_2 = 0.3, and \lambda_1 = 0.2:

    P_{\text{interp}}(\text{cat} | \text{the, big}) = 0.5(0.02) + 0.3(0.05) + 0.2(0.1) = 0.01 + 0.015 + 0.02 = 0.045

    In this case, the probability of "cat" following "the big" is 0.045, which incorporates information from the trigram, bigram, and unigram models.
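For the weight-setting step itself, the sketch below follows one widely described leave-one-out ("deleted") estimation scheme: each trigram votes, with its count, for whichever component best predicts it once that occurrence is removed from the counts. The toy corpus is an assumption for illustration, and real implementations typically bucket the weights by context frequency and handle ties more carefully:

from collections import Counter

# Toy training data; real models would use a large corpus.
train = "the big cat sat on the big mat the big cat ran".split()

uni = Counter(train)
bi = Counter(zip(train, train[1:]))
tri = Counter(zip(train, train[1:], train[2:]))
N = len(train)

lambdas = [0.0, 0.0, 0.0]   # l1 (unigram), l2 (bigram), l3 (trigram)
for (w2, w1, w), c in tri.items():
    # Leave-one-out relative frequencies: remove this occurrence before comparing.
    cases = [
        (uni[w] - 1) / (N - 1) if N > 1 else 0.0,
        (bi[(w1, w)] - 1) / (uni[w1] - 1) if uni[w1] > 1 else 0.0,
        (tri[(w2, w1, w)] - 1) / (bi[(w2, w1)] - 1) if bi[(w2, w1)] > 1 else 0.0,
    ]
    lambdas[cases.index(max(cases))] += c   # the best-predicting order gets this trigram's count

total = sum(lambdas)
lambdas = [l / total for l in lambdas]      # normalize so the weights sum to 1
print("lambda1, lambda2, lambda3 =", lambdas)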

N-Grams for Spelling and Pronunciation

In natural language processing (NLP), n-grams are used extensively to model language patterns and help improve various language tasks such as spelling error correction and pronunciation prediction. This section dives into how n-grams can enhance these tasks.

1. Context-Sensitive Spelling Error Correction

Spelling error correction can be improved using n-grams. This is especially useful in detecting real-word spelling errors, where a typographical mistake results in a valid but incorrect word. For instance, typing "there" instead of "three" or "dessert" instead of "desert." These errors can result from:

  • Typographical mistakes (insertion, deletion, substitution, or transposition of letters).
  • Homophone or near-homophone confusion, where words that sound similar are mixed up.

Traditional methods, such as checking words against a dictionary or using a finite-state model, fail to detect these errors because they do not account for context. However, n-grams can model the probability of word sequences, helping to identify unlikely word combinations.

Importance of Real-Word Spelling Errors

  • Studies like Peterson (1986) estimate that 15% of spelling errors lead to valid English words.
  • Empirical studies by Kukich (1992) suggest that 25-40% of spelling errors produce valid words, which makes them harder to detect without contextual information.

Local vs Global Errors

  • Local errors: These errors are detectable by analyzing the surrounding words. For example, "They are leaving in about fifteen minuets" (instead of minutes).
  • Global errors: These require examination of a larger context. For instance, "Won’t they heave if next Monday at that time?" (where "heave" is incorrect for "leave").

2. N-Gram Based Spelling Correction

The n-gram approach to spelling error detection and correction was proposed by Mays et al. (1991). It works by generating all possible misspellings of each word in a sentence and selecting the most likely sequence based on an n-gram language model.

Steps:
  1. For a sentence W = \{w_1, w_2, \dots, w_k, \dots, w_n\}, the method generates, for each word w_k, a set of candidate spellings consisting of w_k itself and its possible variants w'_k, w''_k, \dots.
  2. It then computes the probability of each possible sentence using the n-gram model.
  3. The spelling with the highest probability P(W) is chosen.

This method allows the model to choose the most contextually appropriate word based on the prior probability of word sequences.
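A minimal sketch of that idea: each word contributes a small candidate set (here a hand-written confusion set standing in for Mays et al.'s candidate generation), and an add-one-smoothed bigram model scores every combination. The tiny training corpus and candidate sets are assumptions for illustration only:

from collections import Counter
from itertools import product

# Tiny "training corpus" for the bigram model (illustrative only).
tokens = "about fifteen minutes later they are leaving in about fifteen minutes".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
V = len(uni)

def bigram_prob(w_prev, w):
    """Add-one smoothed bigram probability, so unseen pairs are not zero."""
    return (bi[(w_prev, w)] + 1) / (uni[w_prev] + V)

# Hand-written confusion sets standing in for generated candidate spellings.
candidates = [["fifteen"], ["minuets", "minutes"]]

def sentence_score(words):
    score = uni[words[0]] / len(tokens) if uni[words[0]] else 1 / (len(tokens) + V)
    for prev, cur in zip(words, words[1:]):
        score *= bigram_prob(prev, cur)
    return score

best = max(product(*candidates), key=sentence_score)
print(best)   # ('fifteen', 'minutes') -- context prefers the real-word correction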

3. Other Approaches to Context-Sensitive Spelling Correction

Besides n-grams, several other statistical methods can be used for context-sensitive spelling correction:

  • Bayesian classifiers: Can be combined with trigrams (e.g., Gale et al. 1993).
  • Decision lists: Used by Yarowsky (1994).
  • Transformation-based learning: Proposed by Mangu and Brill (1997).
  • Latent Semantic Analysis (LSA): Proposed by Jones and Martin (1997).
  • Winnow algorithm: Found to perform best in a study by Golding and Roth (1999).

All of these algorithms rely on features such as word and part-of-speech n-grams, and many are based on linear predictors known as Linear Statistical Queries (LSQ) hypotheses, as shown by Roth (1998, 1999).

4. N-Grams for Pronunciation Modeling

N-grams also play a crucial role in pronunciation modeling. Consider a task where the goal is to predict the correct word based on a surface pronunciation. For instance, given the input [n iy] after the word "I", possible words include "need," "new," "neat," "knee," and "the."

Using a unigram model, the word with the highest probability might be incorrectly chosen because it does not account for context. However, using a bigram model that incorporates the preceding word (e.g., "I need"), the correct prediction can be made.

Example: Pronunciation Prediction with N-Grams

Suppose we have a table showing, for each candidate word w, the bigram count C(\text{'I'}, w), the corresponding bigram probability P(w | \text{'I'}), and the likelihood of the observed pronunciation given that word.


Using the bigram probabilities, the model correctly predicts the word "need," whereas the unigram model might incorrectly choose "new."
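Because the original table is not reproduced here, the sketch below uses made-up bigram counts and pronunciation likelihoods (assumptions for illustration only) just to show the decision rule: choose the candidate w that maximizes P(w | 'I') times the likelihood of the observed pronunciation [n iy] given w:

# Hypothetical numbers for illustration only (not the counts from the original table).
bigram_count_after_I = {"need": 153, "new": 2, "neat": 2, "knee": 1, "the": 1}
count_I = 2500                    # assumed total count of the word "I"
likelihood = {"need": 0.11, "new": 0.36, "neat": 0.52, "knee": 1.00, "the": 0.00}  # assumed P([n iy] | w)

def score(word):
    # posterior is proportional to prior (bigram) times likelihood (pronunciation model)
    return (bigram_count_after_I[word] / count_I) * likelihood[word]

best = max(bigram_count_after_I, key=score)
print(best)   # "need": the bigram context outweighs the higher likelihoods of other candidates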

Entropy

Entropy and perplexity are two key metrics used in NLP, especially in the evaluation of probabilistic models like N-gram language models. They help us understand the uncertainty or unpredictability of a model when predicting the next word in a sequence, which is crucial for tasks such as language modeling, machine translation, and speech recognition.

Entropy measures the uncertainty or information content in a random variable. In NLP, it represents the average amount of information needed to predict the next word in a sequence based on a probabilistic model.

Definition:

The entropy of a random variable X is defined as:

H(X) = -\sum_{x \in \chi} p(x) \log_2 p(x)


Where:

  • X is the random variable (such as a word in a sentence).
  • \chi is the set of all possible outcomes (e.g., the vocabulary).
  • p(x) is the probability of a specific word x.
  • \log_2 p(x) is the base-2 logarithm, measuring how surprising or unexpected the word is.

Intuition:

  • High entropy means the model is uncertain about what the next word will be. This indicates there are many possible outcomes with similar probabilities.
  • Low entropy means the model is confident about the next word because it has fewer likely options.

Example:

Imagine you are trying to predict the next word in a sentence. If the sentence is "I would like a cup of...", the next word is likely to be "coffee" or "tea." Since there are only a few likely options, the entropy is relatively low. On the other hand, if the sentence is "I would like to...", the next word could be almost anything, leading to higher entropy.

Entropy in a Language Model:

In a probabilistic language model like an N-gram model, entropy measures how well the model predicts the next word based on previous words. A lower entropy indicates a better-performing model with more confident predictions.

Perplexity

Perplexity is derived from entropy and provides a more interpretable metric for evaluating language models. It tells us how well a model predicts a sequence of words and can be thought of as the weighted average number of choices the model has at each step.

Definition:

Perplexity is related to entropy by the formula:

\text{Perplexity} = 2^{H(X)}


Where H(X) is the entropy of the model.

Intuition:

  • Perplexity can be understood as the number of equally likely choices the model is considering for the next word. For instance, if the perplexity is 10, the model is, on average, choosing between 10 different possible words for the next word in the sequence.
  • Lower perplexity indicates that the model is more confident in its predictions, meaning fewer choices are likely. A higher perplexity means more choices are likely, indicating the model is less certain.

Example:

Consider the same sentence: "I would like a cup of...". If the model predicts the next word with a low perplexity, it might only be considering a small number of options like "coffee" or "tea." But if the perplexity is high, the model might be considering many less likely options, like "water," "juice," "soup," etc.
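A small sketch that computes entropy and perplexity for two made-up next-word distributions mirroring the examples above (the probability values are assumptions for illustration):

import math

def entropy(dist):
    """H(X) = -sum p(x) log2 p(x) over outcomes with non-zero probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def perplexity(dist):
    return 2 ** entropy(dist)

confident = {"coffee": 0.6, "tea": 0.3, "water": 0.1}               # "a cup of ..."
uncertain = {w: 0.1 for w in ["go", "see", "eat", "buy", "know",
                              "try", "ask", "read", "run", "stay"]}  # "I would like to ..."

print(entropy(confident), perplexity(confident))   # low entropy, low perplexity (about 2.4 choices)
print(entropy(uncertain), perplexity(uncertain))   # H = log2(10), perplexity = 10 equally likely choices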

Connection Between Entropy and Perplexity

  • Entropy measures the amount of uncertainty in a language model. It tells us how unpredictable the next word is based on the current model.
  • Perplexity translates this uncertainty into the average number of likely choices the model has for the next word.

Both metrics are crucial for evaluating language models:

  • A model with low entropy and low perplexity is performing well because it is making confident and accurate predictions.
  • A model with high entropy and high perplexity is performing poorly, as it is uncertain about the next word and considers many possible outcomes.

English Word Classes

1. Open vs. Closed Word Classes

  • Open Classes: Word types with a large and expanding membership (e.g., nouns, verbs, adjectives, adverbs). New words can be added to these classes.
  • Closed Classes: Word types with a fixed set of members, rarely adding new words (e.g., prepositions, pronouns, conjunctions, determiners, auxiliary verbs).

2. Open Classes

  • Nouns: Words representing people, places, things, or abstract concepts.
    • Divided into proper nouns (names, e.g., Regina, IBM) and common nouns (e.g., goat, snow).
    • Count nouns: Can be singular or plural (e.g., goat/goats).
    • Mass nouns: Conceptualized as homogeneous (e.g., snow, salt).
  • Verbs: Words referring to actions or processes.
    • Includes main verbs like eat, draw, and auxiliary verbs like can, may, should.
    • Verbs change form depending on tense and subject (e.g., eat, eats, eating).
  • Adjectives: Describe properties or qualities (e.g., big, red).
    • Some languages treat adjectives as a subclass of verbs (e.g., Korean).
  • Adverbs: Modify verbs, adjectives, or other adverbs, often describing how, when, where, or to what degree an action occurs.
    • Categories: Locative (e.g., home), Manner (e.g., quickly), Degree (e.g., very), Temporal (e.g., yesterday).

3. Closed Classes

  • Prepositions: Relational words that connect noun phrases, often indicating space or time (e.g., on, in, under).
  • Determiners: Words like a, an, the that precede nouns, marking them as definite or indefinite.
  • Pronouns: Words that stand in for nouns (e.g., he, she, they).
    • Subtypes: Personal (e.g., I, you), Possessive (e.g., my, your), Wh-pronouns (e.g., who, what).
  • Conjunctions: Words that link phrases or clauses.
    • Coordinating conjunctions: (e.g., and, but, or).
    • Subordinating conjunctions: (e.g., that, because).
  • Auxiliary Verbs: Verbs that assist the main verb in indicating tense, aspect, mood, etc. (e.g., is, have, will).
    • Includes modal verbs (e.g., can, may, must).

4. Special Cases

  • Particles: Words that resemble prepositions or adverbs and combine with verbs to form phrasal verbs (e.g., turn down, find out).
  • Interjections: Express emotion or sudden reactions (e.g., oh, ah).
  • Politeness markers: Words like please or thank you.
  • Existential there: Used to express the existence of something (e.g., There are two apples).

Tagsets for English

Tagsets are standardized sets of part-of-speech (POS) labels used to tag words in a corpus, helping with POS tagging and other linguistic tasks. Several well-known tagsets are used in English, each with varying degrees of granularity.

Popular Tagsets for English

  1. Penn Treebank Tagset (45 tags):

    • One of the most widely used tagsets, it consists of 45 tags and is applied to many corpora, including the Brown corpus.
    • The Penn Treebank tagset reduces some of the granularity found in larger tagsets, as it omits certain distinctions that can be inferred from the context or lexical item. For instance, forms of verbs like do, be, and have are not individually tagged.
    • Example of a tagged sentence:
      The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

  2. Brown Corpus Tagset (87 tags):

    • This was the original tagset used for the Brown corpus and laid the foundation for future tagsets.
    • Unlike Penn Treebank, it provides more detailed tagging for specific verb forms, for example giving the forms of do their own tags, such as did (DOD) and does (DOZ).
  3. C5 Tagset (61 tags):

    • Used by the Lancaster UCREL project’s CLAWS tagger to tag the British National Corpus (BNC).
    • It is a medium-sized tagset, striking a balance between detailed and compact tagging.
  4. C7 Tagset (146 tags):

    • A more granular version of the C5 tagset, the C7 tagset distinguishes more syntactic nuances. For example, it differentiates prepositions from subordinating conjunctions and separates the preposition to (II) from the infinitive marker to (TO).

Features of the Penn Treebank Tagset

  • Simplicity: It is more compact than other tagsets, combining certain distinctions that might be unnecessary in parsed corpora. For instance, prepositions and subordinating conjunctions are both tagged as IN, and the structure of the sentence helps disambiguate their use.
  • Versatility: Despite its simplicity, the Penn Treebank tagset is suitable for many natural language processing (NLP) tasks, though it may not be detailed enough for projects requiring more specific syntactic information.

Granularity of Larger Tagsets

For projects that require more detailed information, such as distinguishing prepositions from subordinating conjunctions or the infinitive marker to, larger tagsets like C7 are more appropriate. These tagsets allow for finer distinctions that may be crucial in syntactic parsing or machine learning applications.

Choosing a Tagset

The choice of a tagset depends on the needs of the application. For tasks where syntactic details are not critical, simpler tagsets like the Penn Treebank tagset may suffice. However, for applications requiring more in-depth syntactic information, a more detailed tagset such as C7 may be necessary.

Part-of-Speech Tagging

Part-of-speech (POS) tagging is the process of assigning a part-of-speech label (e.g., noun, verb, adjective) or another lexical class marker to each word in a text. This process is critical in various natural language processing (NLP) tasks, including speech recognition, parsing, and information retrieval.

Input and Output of Tagging

The input to a tagging algorithm consists of:

  1. A string of words (a sentence or text).
  2. A predefined tagset (such as the Penn Treebank tagset).

The output is a tagged version of the text where each word is associated with its most likely POS tag. For example, using the Penn Treebank tagset, the sentence "Book that flight." can be tagged as:

  • Book/VB that/DT flight/NN ./.

In this example, "Book" is a verb (VB), "that" is a determiner (DT), and "flight" is a noun (NN). POS tagging involves disambiguating the lexical category of words based on context, a task that becomes more challenging with words that have multiple possible POS tags.

Ambiguity in POS Tagging

Many words in English are ambiguous, meaning they can belong to more than one part of speech. For example:

  • Book can be either a verb ("book that flight") or a noun ("hand me that book").
  • That can function as a determiner ("Does that flight serve dinner?") or a complementizer ("I thought that your flight was earlier").

DeRose (1988) found that while only 11.5% of English word types (distinct words) in the Brown Corpus are ambiguous, more than 40% of word tokens (instances of words) are ambiguous. Despite this, many ambiguous words are easy to disambiguate based on their context. For instance, the determiner sense of "a" (as in "a book") is much more common than its usage as a letter.

Tagging Algorithms

There are three main approaches to POS tagging:

  1. Rule-based Taggers:

    • These taggers rely on a set of manually written rules for disambiguation. For example, a rule might specify that an ambiguous word is a noun if it follows a determiner.
    • ENGTWOL is a well-known rule-based tagger, based on the Constraint Grammar architecture of Karlsson et al. (1995).
  2. Stochastic Taggers:

    • Stochastic (or statistical) taggers use probabilities derived from a large corpus of tagged text. They assign tags based on the likelihood of a word having a specific tag in a given context.
    • One common approach is the Hidden Markov Model (HMM) tagger, also known as the Maximum Likelihood Tagger.
  3. Transformation-based Taggers (Brill Tagger):

    • This method, introduced by Brill (1995), combines aspects of both rule-based and stochastic tagging. It uses rules for disambiguation, but the rules are automatically learned from a previously tagged corpus through a machine-learning process.
    • The Brill tagger is particularly effective in environments where a balance between rule-based precision and statistical adaptability is needed.
