The field of speech and language processing is experiencing significant growth, driven by advances in computing power, the proliferation of the Internet, and the increasing use of mobile technology. This surge has led to the development and deployment of sophisticated systems that illustrate the power and potential of these technologies.
Four Paradigms: 1970–1983
Between 1970 and 1983, research in speech and language processing saw significant growth, leading to the establishment of several research paradigms that continue to shape the field. During this period, four primary paradigms emerged, each contributing to the advancement of natural language processing (NLP) in unique ways.
The stochastic paradigm played a pivotal role, particularly in the development of speech recognition algorithms. A major breakthrough in this paradigm was the application of the Hidden Markov Model (HMM) and the concepts of the noisy channel and decoding. These concepts were independently developed by researchers like Jelinek, Bahl, Mercer, and their colleagues at IBM’s Thomas J. Watson Research Center, as well as by Baker at Carnegie Mellon University. Their work was heavily influenced by Baum and his team at the Institute for Defense Analyses in Princeton. Concurrently, AT&T’s Bell Laboratories emerged as a hub for pioneering work in speech recognition and synthesis, laying the groundwork for the modern advancements in these areas.
The logic-based paradigm was another crucial development during this era, initiated by the work of Colmerauer and his colleagues on Q-systems and metamorphosis grammars, which eventually led to the creation of Prolog and Definite Clause Grammars. Kay’s work on functional grammar and later, Bresnan and Kaplan’s development of Lexical Functional Grammar (LFG), underscored the importance of feature structure unification, further advancing the field.
The field of natural language understanding (NLU) also made significant strides. Terry Winograd’s SHRDLU system was a groundbreaking development that simulated a robot operating in a world of toy blocks. This system could understand and respond to complex natural language commands, demonstrating a level of sophistication previously unseen. SHRDLU’s success highlighted the progress in parsing and shifted the focus toward semantics and discourse models. Around the same time, Roger Schank and his colleagues at Yale, often referred to as the Yale School, developed a series of language understanding programs that explored human conceptual knowledge, including scripts, plans, goals, and memory organization. This work integrated network-based semantics and began incorporating Fillmore’s notion of case roles into their representations.
The discourse modeling paradigm focused on understanding and representing discourse structure. Key contributions in this area included the study of substructure in discourse and discourse focus by Grosz and her colleagues, as well as the development of automatic reference resolution techniques by researchers like Hobbs. Additionally, the BDI (Belief-Desire-Intention) framework was introduced, providing a logic-based approach to analyzing speech acts.
These paradigms laid the foundation for many of the advancements in NLP and continue to influence ongoing research and development in the field.
Survey of English Morphology
Introduction to Morphology
Morphology is the study of the structure and formation of words from smaller units of meaning called morphemes. A morpheme is the smallest meaning-bearing unit in a language. For example:
- Fox consists of a single morpheme (fox).
- Cats consists of two morphemes: cat (the stem) and -s (a suffix indicating plural).
Classes of Morphemes
Morphemes are broadly categorized into:
- Stems: The main morpheme that provides the core meaning of the word.
- Affixes: Morphemes that attach to a stem to modify its meaning. Affixes include:
- Prefixes: Attached to the beginning of a stem (e.g., un-buckle).
- Suffixes: Attached to the end of a stem (e.g., eat-s).
- Infixes: Inserted into the middle of a stem (common in some languages like Tagalog).
- Circumfixes: Attached around a stem (e.g., ge-sag-t in German for the past participle of "sagen").
Word Formation Processes
Words can be formed in various ways, playing a crucial role in language processing. The four key processes include:
- Inflection: Combines a stem with grammatical morphemes, typically resulting in a word of the same class but performing a different syntactic function (e.g., plural -s or past tense -ed in English).
- Derivation: Combines a stem with a grammatical morpheme, often resulting in a word of a different class with a meaning that can be hard to predict (e.g., computerize → computerization).
- Compounding: Combines multiple stems into a single word (e.g., doghouse).
- Cliticization: Attaches clitics (reduced forms of words) to a stem (e.g., 've in I’ve).
Inflectional Morphology in English
English has a relatively simple inflectional system, with inflectional changes mostly occurring in nouns, verbs, and sometimes adjectives:
- Nouns: Inflection for plurality (e.g., cat/cats) and possession (e.g., llama’s).
- Verbs: Inflection based on tense and agreement (e.g., walk/walks/walking/walked).
Regular Verbs
Regular verbs follow predictable patterns in their morphological forms:
- Stem: walk
- -s Form: walks
- -ing Participle: walking
- Past Form: walked
Irregular Verbs
Irregular verbs have more idiosyncratic forms and do not always follow predictable patterns:
- Stem: eat
- -s Form: eats
- -ing Participle: eating
- Preterite: ate
- Past Participle: eaten
Derivational Morphology in English
Derivational morphology is more complex than inflection, often producing new words with different syntactic categories:
- Nominalization: Creating nouns from verbs or adjectives (e.g., computerize → computerization).
- Adjective Formation: Creating adjectives from nouns or verbs (e.g., computation → computational).
Derivational processes are less productive than inflectional ones, meaning not every stem can be combined with every derivational suffix.
Cliticization
Clitics are units that lie between a full word and an affix:
- Proclitics: Precede the word they attach to.
- Enclitics: Follow the word they attach to (e.g., ’m in I’m, ’ve in I’ve).
In English, clitics often appear as reduced forms attached to verbs, articles, or conjunctions.
Non-concatenative Morphology
Some languages use non-concatenative morphology, where morphemes combine in non-linear ways:
- Templatic Morphology: Common in Semitic languages like Arabic and Hebrew, where a root and a pattern combine to form different words (e.g., the Hebrew root lmd combines with various templates to form words meaning "studied," "taught," and "was taught").
Agreement
In English, subject nouns and verbs must agree in number, meaning they must both be singular or both plural. Other languages may also require gender agreement between nouns, adjectives, and sometimes verbs.
Finite-State Morphological Parsing
Morphological parsing involves analyzing the structure of words to determine their stems and the associated morphological features, such as tense, number, gender, etc. The goal of morphological parsing is to take an input word form (e.g., "cats," "cities," "bebo," "canto") and produce a morphological parse that identifies the stem of the word along with its morphological features.
Example: English and Spanish Morphological Parsing
Below are examples of morphological parsing for English and Spanish words:
| English | Morphological Parse | Spanish | Morphological Parse | Gloss |
|---|---|---|---|---|
| cats | cat +N +Pl | pavos | pavo +N +Masc +Pl | turkeys |
| cat | cat +N +Sg | pavo | pavo +N +Masc +Sg | turkey |
| cities | city +N +Pl | bebo | beber +V +PInd +1P +Sg | I drink |
| geese | goose +N +Pl | canto | cantar +V +PInd +1P +Sg | I sing |
| goose | goose +N +Sg | canto | canto +N +Masc +Sg | song |
| goose | goose +V | puse | poner +V +Perf +1P +Sg | I put |
| gooses | goose +V +3P +Sg | vino | venir +V +Perf +3P +Sg | he/she came |
| merging | merge +V +PresPart | vino | vino +N +Masc +Sg | wine |
| caught | catch +V +PastPart | lugar | lugar +N +Masc +Sg | place |
| caught | catch +V +Past | | | |
Morphological Features
The morphological parse columns in the table above show the stem of each word together with its associated morphological features. Some key features include:
- +N: Indicates the word is a noun.
- +Sg: Indicates singular form.
- +Pl: Indicates plural form.
- +V: Indicates the word is a verb.
- +Masc: Masculine gender (specific to Spanish nouns).
- +PInd: Present Indicative (specific to Spanish verbs).
- +Perf: Perfect tense (specific to Spanish verbs).
- +1P: First person.
- +3P: Third person.
Note: Some input forms, such as "caught," "goose," "canto," or "vino," are ambiguous and can have multiple morphological parses. At this stage, morphological parsing aims to list all possible parses, and the task of disambiguating between them will be discussed later.
Components of a Morphological Parser
To build a morphological parser, three essential components are needed:
Lexicon: A list of stems and affixes, along with basic information about them (e.g., whether a stem is a noun or verb).
Morphotactics: The rules governing the order of morphemes within a word. For example, in English, the plural morpheme ("-s") follows the noun rather than preceding it.
Orthographic Rules: Spelling rules that model changes occurring when two morphemes combine. An example is the rule that changes "city + -s" to "cities" instead of "citys."
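To make these three components concrete, here is a minimal Python sketch; the entries, class names, and the attach_plural_s helper are illustrative assumptions, not part of any real toolkit.

```python
# A minimal sketch of the three components a morphological parser needs.
import re

# 1. Lexicon: stems and affixes with basic category information.
LEXICON = {
    "cat":   {"pos": "noun", "class": "reg-noun"},
    "goose": {"pos": "noun", "class": "irreg-sg-noun", "plural": "geese"},
    "walk":  {"pos": "verb", "class": "reg-verb-stem"},
}
AFFIXES = {"-s": "+Pl", "-ed": "+Past", "-ing": "+PresPart"}

# 2. Morphotactics: which affixes may follow which stem classes.
MORPHOTACTICS = {
    "reg-noun": ["-s"],                      # cat -> cat + -s
    "irreg-sg-noun": [],                     # goose -> geese is listed, not suffixed
    "reg-verb-stem": ["-s", "-ed", "-ing"],
}

# 3. Orthographic rule: e-insertion (fox + -s -> foxes) and y -> ie (city + -s -> cities).
def attach_plural_s(stem: str) -> str:
    if re.search(r"(s|z|x|ch|sh)$", stem):
        return stem + "es"
    if re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + "ies"
    return stem + "s"

print(attach_plural_s("fox"), attach_plural_s("city"), attach_plural_s("cat"))
# foxes cities cats
```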
Finite-State Morphological Parsing
To represent a simple lexicon for morphological recognition, we use finite-state automata (FSAs) to model morphotactic knowledge. This approach is extended by introducing finite-state transducers (FSTs) to model morphological features and handle the task of morphological parsing. FSTs are also used to model orthographic rules and address the complexities of spelling changes in morphological analysis.
Building a Finite-State Lexicon
What is a Lexicon?
A lexicon in computational linguistics is a repository of words or morphemes (the smallest units of meaning). The simplest form of a lexicon is a list of every word in a language, including proper names and abbreviations, such as:
a, AAA, AA, Aachen, aardvark, aardwolf, aba, abaca, aback, ...
However, due to the vast number of words in a language, it's often impractical to list every word explicitly. Instead, computational lexicons are typically structured with lists of stems and affixes, along with a model of the morphotactics that describes how these morphemes can combine to form valid words.
Modeling Morphotactics with Finite-State Automata (FSA)
One common way to model morphotactics is by using Finite-State Automata (FSA). Below are examples of how FSAs can be used to model English nominal and verbal inflections.
English Nominal Inflection
The FSA in Figure 3.3 models the nominal inflection of English nouns. It includes:
- Regular nouns (reg-noun): These take the regular "-s" plural (e.g., cat, dog, fox).
- Irregular singular nouns (irreg-sg-noun): These don't take "-s" in the plural form (e.g., goose, mouse).
- Irregular plural nouns (irreg-pl-noun): These have irregular plural forms (e.g., geese, mice).
| reg-noun | irreg-sg-noun | irreg-pl-noun | plural |
|---|---|---|---|
| fox | goose | geese | -s |
| cat | mouse | mice | -s |
| aardvark | sheep | sheep | -s |
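A rough Python sketch of this morpheme-level FSA follows; the state names, and the decision to let both q1 and q2 accept (so that bare singulars like "cat" are recognized), are assumptions made for illustration.

```python
# Sketch of the morpheme-level FSA for English nominal inflection, using the
# classes from the table above.
REG_NOUN      = {"fox", "cat", "aardvark"}
IRREG_SG_NOUN = {"goose", "mouse", "sheep"}
IRREG_PL_NOUN = {"geese", "mice", "sheep"}

def recognize_noun(morphemes):
    state = "q0"
    for m in morphemes:
        if state == "q0" and m in REG_NOUN:
            state = "q1"                      # regular noun stem
        elif state == "q0" and (m in IRREG_SG_NOUN or m in IRREG_PL_NOUN):
            state = "q2"                      # irregular form, no further suffix
        elif state == "q1" and m == "-s":
            state = "q2"                      # plural suffix on a regular stem
        else:
            return False
    return state in {"q1", "q2"}

print(recognize_noun(["cat", "-s"]))    # True
print(recognize_noun(["geese"]))        # True
print(recognize_noun(["goose", "-s"]))  # False: irregular singulars take no -s
```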
English Verbal Inflection
The FSA in Figure 3.4 models the verbal inflection of English verbs. It includes:
- Regular verb stems (reg-verb-stem): These take regular inflectional endings (e.g., walk, talk).
- Irregular verb stems (irreg-verb-stem): These have irregular forms in past tense or participles (e.g., cut, sing).
- Irregular past verb forms (irreg-past-stem): These do not follow the regular "-ed" past tense rule (e.g., caught, ate).
| reg-verb-stem | irreg-verb-stem | irreg-past-stem | past | past-part | pres-part | 3sg |
|---|---|---|---|---|---|---|
| walk | cut | caught | -ed | -ed | -ing | -s |
| fry | speak | ate | | | | |
| talk | sing | sang, eaten | | | | |

(Each column is an independent list of examples of that class; entries in the same row are not related to one another.)
English Derivational Morphology
Derivational morphology in English, which deals with how new words are formed by adding prefixes or suffixes, is more complex than inflectional morphology. The complexity of derivational morphology often requires more sophisticated models, such as context-free grammars.
Example: English Adjective Morphotactics
Consider the morphotactics of English adjectives. Here are some examples:
- Basic adjectives: big, cool, happy, red
- Comparative forms: bigger, cooler, happier, redder
- Superlative forms: biggest, coolest, happiest, reddest
- Adverbial forms: coolly, happily, really, clearly, unclearly
- Negated forms: unhappy, unclear, unreal
An initial FSA might suggest the following structure, where both the un- prefix and the suffixes are optional (ε transitions):

q0 --(un- | ε)--> q1 --adj-root--> q2 --(-er | -est | -ly | ε)--> q3

This FSA recognizes all the adjectives listed above but also incorrectly accepts ungrammatical forms like "unbig," "unfast," or "oranger." To correct this, we need to divide adjective roots into different classes and specify which prefixes and suffixes each class can take, as sketched below.
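One possible fix is to split the adjective roots into a class that allows un- and -ly and a class that does not; the class memberships in the sketch below are assumptions made for illustration.

```python
# Sketch of the fix: adjective-root classes with different combinatorics.
ADJ_ROOT_1 = {"clear", "happy", "real"}   # allow un- and -ly (unclear, happily)
ADJ_ROOT_2 = {"big", "cool", "red"}       # allow -er/-est only, and no un-

def accepts_adjective(morphemes):
    """morphemes, e.g. ["un-", "happy", "-ly"] or ["big", "-er"]."""
    prefixed = morphemes[0] == "un-"
    stem = morphemes[1] if prefixed else morphemes[0]
    suffixes = morphemes[2:] if prefixed else morphemes[1:]

    if stem in ADJ_ROOT_1:
        allowed = {"-er", "-est", "-ly"}
    elif stem in ADJ_ROOT_2 and not prefixed:
        allowed = {"-er", "-est"}
    else:
        return False
    return all(s in allowed for s in suffixes)

print(accepts_adjective(["un-", "happy", "-ly"]))  # True  (unhappily)
print(accepts_adjective(["big", "-er"]))           # True  (bigger)
print(accepts_adjective(["un-", "big"]))           # False (blocks "unbig")
```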
Modeling English Derivational Morphology
The figure shows an FSA that models fragments of English nominal and verbal derivational morphology. For instance, it models the general rule that any verb ending in "-ize" can take the nominalizing suffix "-ation" (e.g., fossilize → fossilization). Similarly, adjectives ending in "-al" or "-able" can take the suffix "-ity" or "-ness" (e.g., formal → formality, natural → naturalness).
Morphological Recognition
The FSAs developed for morphotactics can be expanded to solve the problem of morphological recognition—determining whether an input string is a legitimate word. This is done by expanding each arc in the FSA (e.g., the "reg-noun-stem" arc) with all the morphemes in that category. The expanded FSA operates at the level of individual letters.
For example, the figure above shows an expanded FSA that recognizes English nouns like "aardvarks" by matching the input letter by letter against the FSA's transitions.
Finite-State Transducers
Finite-State Transducers (FSTs) are a generalization of Finite-State Automata (FSAs). While FSAs are used to model the morphotactic structure of a lexicon and for word recognition, FSTs map between two sets of symbols. Essentially, FSTs are like two-tape automata that recognize or generate pairs of strings. They can be visualized as having transitions labeled with pairs of symbols, one from each tape.
Example of an FST
Consider the FST below, where each transition is labeled with an input and output string separated by a colon:
q0 --aa:b--> q1
q1 --b:ε--> q2
q1 --b:a--> q3
q3 --a:ba--> q4
This FST maps the string "aa" to "b," "b" to either "a" or an empty string, and "a" to "ba." FSTs have a more general function than FSAs; while FSAs define a formal language as a set of strings, FSTs define a relation between sets of strings.
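The sketch below encodes that toy FST directly as transition tables and enumerates every output it can pair with a given input; the choice of accepting states is an assumption made for illustration.

```python
# The toy FST above as a nondeterministic two-tape machine: each transition
# consumes an input substring and emits an output substring.
TRANSITIONS = {
    "q0": [("aa", "b", "q1")],
    "q1": [("b", "", "q2"), ("b", "a", "q3")],
    "q3": [("a", "ba", "q4")],
}
FINAL = {"q2", "q3", "q4"}   # assumed accepting states

def transduce(inp, state="q0", out=""):
    """Return every output string the FST can pair with the whole input."""
    results = []
    if not inp and state in FINAL:
        results.append(out)
    for in_sym, out_sym, nxt in TRANSITIONS.get(state, []):
        if inp.startswith(in_sym):
            results += transduce(inp[len(in_sym):], nxt, out + out_sym)
    return results

print(transduce("aab"))   # ['b', 'ba']  -- "aa" maps to "b", then "b" to ε or "a"
print(transduce("aaba"))  # ['baba']     -- continuing through q3, "a" maps to "ba"
```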
Four Ways to Think About FSTs
- FST as Recognizer: An FST can take a pair of strings as input and output "accept" if the pair is in the language, and "reject" if it is not.
- FST as Generator: An FST can generate pairs of strings from the language.
- FST as Translator: An FST can read one string and output another string, effectively translating between representations.
- FST as Set Relater: An FST computes relations between sets of strings.
These functions are useful in various applications in speech and language processing. For example, in morphological parsing, FSTs can translate an input string of letters into a corresponding string of morphemes.
Formal Definition of an FST
An FST can be formally defined with the following seven parameters:
- Q: A finite set of states {q0,q1,...,qN−1}.
- Σ: A finite set representing the input alphabet.
- Δ: A finite set representing the output alphabet.
- q0: The start state, where q0∈Q.
- F: The set of final states, F⊆Q.
- δ(q, w): The transition function, which returns a set of new states given a state q∈Q and an input string w∈Σ∗.
- σ(q, w): The output function, which returns a set of output strings given a state q∈Q and an input string w∈Σ∗.
Where FSAs are isomorphic to regular languages (sets of strings), FSTs are isomorphic to regular relations (sets of pairs of strings).
Closure Properties of FSTs
Like regular languages, FSTs and regular relations are closed under union. However, they are generally not closed under difference, complementation, or intersection, although some subclasses of FSTs have closure properties under these operations.
In addition to union, FSTs have two important closure properties:
Inversion: The inversion of a transducer T (written T⁻¹) switches the input and output labels. If T maps from input alphabet I to output alphabet O, then T⁻¹ maps from O to I. This property is useful for converting an FST from a parser to a generator.
Composition: If T1 is a transducer from I1 to O1 and T2 is a transducer from O1 to O2, then the composition T1 ∘ T2 maps from I1 to O2. Composition allows two transducers running in series to be replaced with a single, more complex transducer. Applying T1 ∘ T2 to an input sequence S is equivalent to applying T1 to S and then T2 to the result: (T1 ∘ T2)(S) = T2(T1(S)).
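As a rough illustration, the sketch below represents relations extensionally as finite sets of (input, output) string pairs rather than as genuine transducers; on such sets, inversion and composition reduce to simple set operations. The pairs themselves are made up for the example.

```python
# Toy relations as sets of (input, output) pairs.
T1 = {("cat", "cat^s"), ("fox", "fox^s")}       # e.g. lexical -> intermediate
T2 = {("cat^s", "cats"), ("fox^s", "foxes")}    # e.g. intermediate -> surface

def invert(T):
    """T⁻¹ swaps input and output labels (turns a generator into a parser)."""
    return {(o, i) for (i, o) in T}

def compose(T1, T2):
    """(T1 ∘ T2) pairs i with o whenever T1 maps i to some x and T2 maps x to o."""
    return {(i, o) for (i, x1) in T1 for (x2, o) in T2 if x1 == x2}

print(compose(T1, T2))   # {('cat', 'cats'), ('fox', 'foxes')}
print(invert(T2))        # {('cats', 'cat^s'), ('foxes', 'fox^s')}
```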
Projection of an FST
The projection of an FST is the Finite-State Automaton (FSA) produced by extracting only one side of the relation. The projection to the left or upper side of the relation is called the first projection, and the projection to the lower or right side is called the second projection.
Sequential Transducers and Determinism
Transducers, as discussed, can be nondeterministic. This means that a single input may correspond to multiple possible outputs. Nondeterministic Finite-State Transducers (FSTs) can be slow because they require complex search algorithms, similar to those discussed earlier in NLP, to process. This raises the question of whether nondeterministic FSTs can be converted into deterministic ones. However, unlike Finite-State Automata (FSAs), not all FSTs can be determinized.
Sequential Transducers
Sequential transducers are a subtype of FSTs that are deterministic on their input. This means that for any state in the transducer, each symbol in the input alphabet Σ can label at most one transition out of that state. In a sequential transducer, the transitions out of each state are uniquely determined by the state and the input symbol. An example is shown in the following diagram:
q0 --a:ba--> q1
q0 --b:b--> q1
In the above diagram, each input symbol leads to a deterministic transition to another state. Sequential transducers can have epsilon symbols (which represent the empty string) in the output, but not in the input.
However, sequential transducers are not necessarily sequential on their output. For example, in some cases, two distinct transitions from the same state may produce the same output symbol. Therefore, when discussing sequentiality, it is important to specify the direction of the transduction (input to output or output to input).
Formal Definition:
- The transition function δ(q, w) maps from Q × Σ* to Q, rather than to a set of states (2^Q).
- The output function σ(q, w) maps from Q × Σ* to Δ*, rather than to a set of output strings (2^(Δ*)).
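The sketch below shows why sequentiality on the input matters: with at most one transition per (state, input symbol), an input string is processed in a single left-to-right pass. The toy machine is made up for illustration and is not the one pictured above.

```python
# A transducer that is sequential (deterministic) on its input.
DELTA = {("q0", "a"): "q1", ("q0", "b"): "q1", ("q1", "a"): "q0"}  # next state
SIGMA = {("q0", "a"): "ba", ("q0", "b"): "b", ("q1", "a"): ""}     # output string

def run(inp, state="q0"):
    out = []
    for sym in inp:
        if (state, sym) not in DELTA:
            return None                 # no transition: input rejected
        out.append(SIGMA[(state, sym)])
        state = DELTA[(state, sym)]
    return "".join(out)

print(run("aa"))  # 'ba'  (a:ba from q0, then a:ε from q1)
print(run("ab"))  # None  ('b' has no transition out of q1 in this toy machine)
```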
Subsequential Transducers
A subsequential transducer is a generalization of sequential transducers. It allows for the generation of an additional output string at the final states, which is concatenated to the output produced so far. This concept was introduced by Schützenberger in 1977.
Example:
Consider a subsequential transducer that, upon reaching a final state, appends an additional string to the output generated by the transitions. This feature is particularly useful in applications where deterministic processing of input is required, but with some flexibility in the final output.
The key advantages of subsequential transducers include:
- Efficiency: They are deterministic on input, so they can process inputs in linear time relative to the length of the input string. This contrasts with nondeterministic transducers, where processing time can depend on the number of states.
- Determinization and Minimization: There are efficient algorithms for determinizing and minimizing subsequential transducers, extending similar algorithms for FSAs. These algorithms are crucial for optimizing the performance of transducers in NLP applications.
Handling Ambiguity with p-Subsequential Transducers
While sequential and subsequential transducers are deterministic and efficient, they transduce each input string to exactly one possible output string. However, natural language often involves ambiguity, where a single input can correspond to multiple valid outputs.
To address this, p-subsequential transducers have been introduced. A p-subsequential transducer allows for p (where p≥1) final output strings to be associated with each final state. This capability to handle a finite amount of ambiguity is highly useful in many NLP tasks, such as:
- Representation of Dictionaries
- Compilation of Morphological and Phonological Rules
- Local Syntactic Constraints
For many problems in these areas, p-subsequential transducers can be determinized and minimized, making them efficient and practical for real-world applications.
FSTs for Morphological Parsing
Morphological parsing involves analyzing a word to identify its constituent morphemes (the smallest meaningful units of language) and the grammatical information they convey. For example:
- Given the English word "cats," a morphological parser should output "cat +N +Pl," indicating that "cat" is a plural noun.
- Given the Spanish word "bebo" (meaning "I drink"), the parser should output "beber +V +PInd +1P +Sg," indicating that "bebo" is the first-person singular form of the verb "beber" in the present indicative tense.
Finite-State Morphology
In finite-state morphology, words are represented as a correspondence between:
- Lexical Level: Represents the concatenation of morphemes that make up a word.
- Surface Level: Represents the actual spelling of the word as it appears.

The FST implementing this correspondence can be viewed as a two-tape machine:
- Lexical Tape: Composed of characters from the lexical alphabet Σ.
- Surface Tape: Composed of characters from the surface alphabet Δ.
Each transition in the FST corresponds to a pair of symbols: one from the lexical tape and one from the surface tape. For example:
- a:a means that the symbol "a" on the lexical tape corresponds to "a" on the surface tape.
- a:ϵ means that the symbol "a" on the lexical tape corresponds to nothing on the surface tape (indicating a deletion).
These symbol pairs form a new alphabet Σ′, where Σ′⊆Σ×Δ. This new alphabet is used to define the FST.
Feasible Pairs and Default Pairs
- Feasible Pair: A pair a:b in Σ′ that indicates how a symbol on the lexical tape maps to a symbol on the surface tape.
- Default Pair: A pair in which a symbol maps to itself (e.g., a:a). These are often abbreviated by writing just the single letter (e.g., "a" instead of "a:a").
Building an FST Morphological Parser
To construct a morphological parser using FSTs, we need to expand earlier morphotactic Finite-State Automata (FSAs) by adding:
- An extra "lexical" tape.
- Morphological features (e.g., +Sg for singular, +Pl for plural).
For example, consider an FST that handles English nominal number inflection. The symbols above each arc represent elements of the morphological parse on the lexical tape, while the symbols below represent the surface tape or intermediate tape.
- Morpheme Boundary (^): Indicates the boundary between morphemes.
- Word Boundary (#): Indicates the boundary between words.

The output includes morpheme boundary markers (^) and word boundary markers (#). These markers are present on intermediate tapes, and further processing is needed to remove them to achieve the final surface form.
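As a toy illustration of the lexical-to-intermediate stage for nominal number, here is a small sketch; the function name and its very limited coverage are assumptions, not part of a real system.

```python
# Lexical tape (stem + features) to intermediate tape (morphemes with markers).
def lexical_to_intermediate(stem, features):
    if features == ["+N", "+Sg"]:
        return stem + "#"       # singular: no affix, just the word boundary
    if features == ["+N", "+Pl"]:
        return stem + "^s#"     # plural: morpheme boundary, s, word boundary
    raise ValueError("only +N +Sg / +N +Pl are handled in this sketch")

print(lexical_to_intermediate("fox", ["+N", "+Pl"]))  # fox^s#
print(lexical_to_intermediate("cat", ["+N", "+Sg"]))  # cat#
```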
Combining FST Lexicon and Rules
In natural language processing, finite-state transducers (FSTs) can be used for both parsing and generating words by combining a lexicon with spelling rules. The lexicon transducer maps between:
- Lexical Level: Consists of word stems and morphological features.
- Intermediate Level: Represents a simple concatenation of morphemes.
Architecture of Two-Level Morphology Systems
The architecture of a two-level morphology system typically consists of a cascade of transducers (see Fig. 3.19):
- Lexicon Transducer: Maps from the lexical level to the intermediate level.
- Spelling Rule Transducers: A set of parallel transducers that map from the intermediate level to the surface level.
Cascade of Transducers: A cascade involves running transducers in series, where the output of one transducer becomes the input for the next. This setup can be used both for generating words (top-down approach) and parsing them (bottom-up approach).
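Here is a minimal sketch of such a cascade run in the generation direction, with the lexicon stage and a single e-insertion spelling rule written as ordinary string rewrites rather than real transducers; names and coverage are illustrative.

```python
import re

def lexical_to_intermediate(stem, features):          # lexicon stage (toy)
    return stem + ("^s#" if features == ["+N", "+Pl"] else "#")

def e_insertion(intermediate):                         # spelling-rule stage (toy)
    s = re.sub(r"(s|z|x|ch|sh)\^s#", r"\1es#", intermediate)  # fox^s# -> foxes#
    return s.replace("^", "").replace("#", "")                # strip the markers

def generate(stem, features):
    # running the cascade in series: the output of one stage feeds the next
    return e_insertion(lexical_to_intermediate(stem, features))

print(generate("fox", ["+N", "+Pl"]))  # foxes
print(generate("cat", ["+N", "+Pl"]))  # cats
```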
Example: Parsing and Generating "Foxes"
Consider the word "foxes":
- Lexical Level: fox +N +Pl (noun, plural)
- Intermediate Level: fox^s# (morpheme boundary ^ and word boundary #)
- Surface Level: foxes (the final output)
During parsing, the transducers process the input to map from the surface level back to the lexical level, handling ambiguities that may arise (e.g., "foxes" could be a verb in another context). The transducer identifies multiple possible parses, such as fox +N +Pl (noun) and fox +V +3Sg (verb, third person singular). Disambiguation requires external evidence, such as the surrounding context.
Handling Ambiguity in Parsing
Ambiguity: During parsing, ambiguity can arise due to multiple possible interpretations of a word. For example, "foxes" might be parsed as a plural noun or as a verb. In cases like these, the transducer enumerates all possible parses. Disambiguation relies on external context, such as sentence structure.
Local Ambiguity: This occurs during the parsing process itself. For instance, when parsing the word "assess," the E-insertion transducer might initially interpret "ass" as leading to "asses" (plural form), but later corrects itself upon encountering the next character. This requires the FST to incorporate a search algorithm to handle non-determinism in parsing.
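A parser in this setting therefore returns a set of candidate analyses rather than a single one; the toy lookup table below (illustrative entries only) stands in for running the transducer cascade in the parsing direction.

```python
# All candidate analyses are returned; the choice is left to later,
# context-based processing.
PARSES = {
    "foxes":  ["fox +N +Pl", "fox +V +3Sg"],
    "caught": ["catch +V +PastPart", "catch +V +Past"],
}

def parse(surface_form):
    return PARSES.get(surface_form, [])

print(parse("foxes"))  # ['fox +N +Pl', 'fox +V +3Sg'] -- disambiguated later
```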
Simplifying the Cascade: Composition and Intersection
Running multiple transducers in a cascade can be complex. Fortunately, transducers can be:
- Composed: Combine a series of transducers into a single, more complex transducer.
- Intersected: Parallel transducers can be combined using an intersection algorithm, which creates a new state for each combination of states from the original transducers.
Intersection Algorithm: The intersection algorithm takes the Cartesian product of the states of the two machines. For any input symbol, if the first machine would transition to state q_i and the second to state q_j, the combined machine transitions to the paired state (q_i, q_j), as sketched below.
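Here is a sketch of that product construction for two deterministic automata, using dictionaries for the transition functions; the state names and the toy machines are illustrative.

```python
# Product construction: combined states are pairs (q_i, q_j).
def intersect(delta1, start1, finals1, delta2, start2, finals2):
    start = (start1, start2)
    delta, finals, stack, seen = {}, set(), [start], {start}
    while stack:
        q1, q2 = stack.pop()
        if q1 in finals1 and q2 in finals2:
            finals.add((q1, q2))
        for (s, sym) in list(delta1):
            if s == q1 and (q2, sym) in delta2:
                nxt = (delta1[(q1, sym)], delta2[(q2, sym)])
                delta[((q1, q2), sym)] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return delta, start, finals

# Toy machines: one accepts any string of a's, the other any even-length string;
# their intersection accepts even-length strings of a's ("", "aa", "aaaa", ...).
d1 = {("p", "a"): "p"}
d2 = {("e", "a"): "o", ("o", "a"): "e"}
print(intersect(d1, "p", {"p"}, d2, "e", {"e"}))
```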
Lexicon-Free FSTs: The Porter Stemmer
Finite-State Transducers (FSTs) can be designed without a lexicon. In such systems, instead of relying on a predefined list of word stems, the FST applies a set of rules to manipulate the surface form of words directly. This approach is useful for tasks like stemming, where the goal is to reduce words to their root forms.
The Porter Stemmer
One of the most widely used lexicon-free stemming algorithms is the Porter Stemmer, developed by Martin Porter in 1980. The Porter Stemmer is a rule-based system that systematically removes common morphological and inflectional suffixes from English words to reduce them to their base or root form.
For example:
- "running" → "run"
- "happiness" → "happi"
How the Porter Stemmer Works
The Porter Stemmer operates in several steps, each designed to handle different types of suffixes. The process involves:
Step 1a: Deals with plural suffixes like -s or -es.
- Example: caresses → caress
- Example: ponies → poni

Step 1b: Removes -ed and -ing suffixes.
- Example: jumped → jump
- Example: running → run

Step 1c: Converts a terminal -y to -i when the word contains another vowel before the y.

Step 2: Handles common suffixes like -ational, -izer, and -iveness, replacing them with shorter forms.
- Example: relational → relate
- Example: conditional → condition

Step 3: Removes suffixes such as -ness and -ful that indicate the part of speech.
- Example: happiness → happi
- Example: usefulness → use

Step 4: Further reduces words by removing suffixes that mark the noun or adjective form, such as -ment, -ance, or -ence.
- Example: adjustment → adjust
- Example: dependency → depend

Step 5: Removes any remaining -e at the end of words and handles doubled consonants.
- Example: probate → probat
- Example: controll → control
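To give the flavor of such a lexicon-free rule cascade, here is a minimal sketch covering only a few step 1a/1b-style rewrites; the rule list and its behavior are heavily simplified relative to the full algorithm and omit Porter's measure conditions.

```python
import re

RULES = [
    (r"sses$", "ss"),      # caresses -> caress
    (r"ies$",  "i"),       # ponies   -> poni
    (r"ss$",   "ss"),      # caress   -> caress (blocks the bare s-removal below)
    (r"s$",    ""),        # cats     -> cat
    (r"(..)ing$", r"\1"),  # hopping  -> hopp (the real step 1b then undoubles)
    (r"(..)ed$",  r"\1"),  # jumped   -> jump
]

def stem(word):
    # apply the first matching rule; rule order matters, as in Porter's steps
    for pattern, repl in RULES:
        if re.search(pattern, word):
            return re.sub(pattern, repl, word)
    return word

print([stem(w) for w in ["caresses", "ponies", "cats", "jumped"]])
# ['caress', 'poni', 'cat', 'jump']
```

In practice, an existing implementation such as the PorterStemmer class shipped with NLTK would normally be used rather than re-deriving the rules by hand.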
Characteristics of the Porter Stemmer
Rule-Based: The Porter Stemmer applies a series of transformation rules to the word. These rules are designed to be broadly applicable, meaning that the stemmer can be applied to any word without the need for a lexicon.
No Lexicon Needed: Unlike FSTs that rely on a predefined lexicon of word stems, the Porter Stemmer works entirely by applying these rules. This makes it highly efficient and easy to implement.
Aggressiveness: The Porter Stemmer is known for being quite aggressive in reducing words to their stems, sometimes resulting in non-dictionary forms (e.g., happiness → happi).
Advantages and Limitations
Advantages:
- Efficiency: Because it does not require a lexicon, the Porter Stemmer is fast and can be applied to large corpora.
- Simplicity: The algorithm is straightforward to implement and understand, making it popular in many natural language processing applications.
Limitations:
- Over-stemming: The stemmer may sometimes be too aggressive, producing stems that are not real words (e.g., happiness → happi).
- Under-stemming: Conversely, some words may not be reduced enough (e.g., agreement → agree, but disagreement may be left unchanged).
- Language-Specific: The Porter Stemmer is designed specifically for English and may not be suitable for other languages without modification.
Applications
The Porter Stemmer is commonly used in various NLP tasks, such as:
- Information Retrieval: Reducing words to their stems can improve the accuracy of search engines by matching documents that contain different forms of a word.
- Text Mining: Helps in clustering and categorizing text data by reducing word variants to a common form.
- Sentiment Analysis: Assists in analyzing sentiments by focusing on the root forms of words, which can reveal patterns in language use.
Human Morphological Processing
In psycholinguistics, understanding how multi-morphemic words (e.g., "walk", "walks", "walked") are represented in the human mind is crucial. We explore whether these words are stored individually in the mental lexicon or whether their components (e.g., "walk" and "-ed") are separately represented and combined during processing.
Theories of Lexical Representation
Full Listing Hypothesis
- Description: All forms of a word (e.g., "walk", "walks", "walked") are individually listed in the mental lexicon.
- Implication: Morphological structure is considered an epiphenomenon.
- Criticism: This view is inadequate for morphologically rich languages (e.g., Turkish) where such an exhaustive listing would be impractical.
Minimum Redundancy Hypothesis
- Description: Only the basic morphemes (e.g., "walk" and "-s") are stored in the lexicon. Words are formed by combining these morphemes during processing.
- Criticism: This hypothesis may be too rigid as it does not account for all morphological structures in the lexicon.
Evidence from Speech Errors
- Slips of the Tongue: Errors in speech, known as slips of the tongue, provide evidence that some morphological structure is represented in the mental lexicon.
- Examples:
- "screw looses" for "screws loose"
- "words of rule formation" for "rules of word formation"
- "easy enoughly" for "easily enough"
- Significance: These errors suggest that affixes (e.g., "-s", "-ed") can be separated from their stems, indicating a partial representation of morphological structure.
Experimental Evidence
Repetition Priming
- Description: Repetition priming involves recognizing a word faster if it has been seen before. This method helps examine whether derived or inflected forms are stored separately from their stems.
- Findings:
- Derived forms like "happiness" do not prime their stems like "happy".
- Regularly inflected forms like "pouring" do not have separate representations from "pour".
- Study Reference: Stanners et al. (1979) demonstrated that priming effects differ based on word structure.
Semantic Relatedness
- Description: Derived words prime their stems if there is a close semantic relationship.
- Findings:
- Words like "government" prime "govern".
- Words like "department" do not prime "depart".
- Study Reference: Marslen-Wilson et al. (1994) showed that semantic relatedness influences the priming effect.
Morphological Family Size
- Definition: The morphological family size of a word is the number of related multi-morphemic words and compounds in which it appears.
- Example: The family size of "fear" includes "fearful", "fearfully", "fearlessness", etc.
- Findings: Words with larger morphological families are recognized more quickly.
- Study Reference: Research by Baayen et al. (1997), De Jong et al. (2002), and Moscoso del Prado Martín et al. (2004a) showed that morphological family size affects word recognition speed.
Entropy and Morphological Paradigms
- Definition: Entropy refers to the amount of information contained in a morphological paradigm.
- Findings: The total entropy of a word’s morphological paradigm also affects recognition speed.