Tuesday, August 13, 2024

Natural Language Processing - Introduction

 

Introduction to Natural Language Processing

  1. Overview:

    • The concept of enabling computers to process and understand human language dates back to the earliest days of computing.
    • NLP is an interdisciplinary field with several names reflecting its diverse aspects, including:
      • Speech and Language Processing
      • Human Language Technology
      • Computational Linguistics
      • Speech Recognition and Synthesis
    • Goal: To enable computers to perform tasks that involve human language, such as facilitating human-machine communication, improving human-human communication, and processing text or speech effectively.
  2. Key Applications:

    • Conversational Agents (Dialogue Systems):
      • A well-known example is HAL 9000 from the movie 2001: A Space Odyssey.
      • These systems can interact with humans using natural language.
      • Essential components include:
        • Language Input: Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU).
        • Language Output: Natural Language Generation (NLG) and Speech Synthesis.
    • Machine Translation:
      • Objective: To automatically convert text from one language to another.
      • This is crucial for making scientific knowledge and web content accessible in multiple languages.
      • Involves sophisticated algorithms and mathematical methods.
    • Web-based Question Answering:
      • Goes beyond simple web search by allowing users to ask complete questions.
      • Example questions:
        • Definitions (e.g., "What does 'divergent' mean?")
        • Simple factual inquiries (e.g., "When was Abraham Lincoln born?")
        • Complex questions that may require extracting information, making inferences, and synthesizing data from various sources.
      • Important components include information extraction and word sense disambiguation.
  3. Challenges and Current Progress:

    • While many NLP tasks are still not fully resolved, research in this field is active and ongoing.
    • Despite these challenges, many NLP technologies are already in commercial use, demonstrating the practical applications of this research.
  4. Essential Knowledge and Tools:

    • A thorough understanding of NLP requires knowledge in linguistics, computer science, and mathematics.
    • Key topics in NLP include:
      • Spell Correction
      • Grammar Checking
      • Information Extraction
      • Word Sense Disambiguation
      • Inference and Summarization


Knowledge in Speech and Language Processing

Types of Language Knowledge in NLP:

  • Phonetics and Phonology:
    • These areas involve the study of linguistic sounds.
    • Speech recognition systems must recognize words from audio signals, and speech synthesis systems must generate audio from text.
    • This requires knowledge of how words are pronounced (phonology) and how these sounds are acoustically realized (phonetics).
  • Morphology:
    • Involves understanding how words are structured and how they can be modified.
    • Example: Recognizing that "doors" is the plural form of "door" or that "I’m" is a contraction of "I am."
    • Knowledge of morphology helps systems handle variations in word forms, such as tense, number, and contractions.
  • Syntax:
    • This refers to the rules that govern how words are ordered and structured in sentences.
    • Example: Knowing that "I’m afraid I can’t do that, Sameer" makes sense, while "I’m I do, sorry that afraid Sameer I’m can’t" does not, even though they contain the same words.
  • Semantics:
    • Focuses on the meaning of words and how these meanings combine in sentences.
    • Example: Understanding what "Western Europe" refers to, or interpreting "by the end of the 18th century" as a temporal phrase.
    • Includes both lexical semantics (meaning of individual words) and compositional semantics (meaning derived from combining words in context).
  • Pragmatics:
    • Involves understanding the intended meaning or action behind a sentence.
    • Example: Differentiating between a request ("HAL, open the pod bay door"), a statement ("HAL, the pod bay door is open"), and a question ("HAL, is the pod bay door open?").
    • Pragmatic knowledge helps systems understand politeness and indirect speech acts.
  • Discourse:
    • Refers to understanding language in context, especially in conversations.
    • Example: Answering the question "How many states were in the United States that year?" requires knowing that "that year" refers to the year mentioned in a previous statement.
    • Discourse knowledge includes tasks like co-reference resolution, where systems must track and link references to entities across sentences.

Application of Language Knowledge:

  • Conversational Agents:
    • Systems like HAL need deep language knowledge to engage in meaningful dialogue, recognize speech, and generate appropriate responses.
  • Question Answering Systems:
    • Require a comprehensive understanding of language to accurately answer questions based on textual input, considering syntax, semantics, and discourse context.


Ambiguity in NLP

  1. Definition of Ambiguity:

    • Ambiguity occurs when a linguistic input can be interpreted in multiple ways, leading to different meanings.
    • Many tasks in speech and language processing involve resolving these ambiguities.
  2. Examples of Ambiguity:

    • Consider the spoken sentence, "I made her duck." This sentence can have several meanings:
      • I cooked waterfowl for her.
      • I cooked waterfowl belonging to her.
      • I created the duck she owns (possibly a plaster duck).
      • I caused her to quickly lower her head or body.
      • I used magic to transform her into a duck.
    • These different interpretations arise from various types of ambiguity, including morphological, syntactic, and semantic.
  3. Types of Ambiguity:

    • Morphological and Syntactic Ambiguity:
      • Words like "duck" can be either a noun or a verb, and "her" can be a possessive pronoun or a dative pronoun.
    • Semantic Ambiguity:
      • The word "make" can mean different things, such as "create" or "cook."
    • Syntactic Ambiguity:
      • The structure of the sentence can vary depending on how the words are grouped. For instance, "make" can take one or two objects, leading to different meanings.
  4. Resolving Ambiguity:

    • Resolving these ambiguities is a crucial part of NLP. Different models and algorithms are used to disambiguate language:
      • Part-of-Speech Tagging: Determines whether "duck" is used as a verb or noun.
      • Word Sense Disambiguation: Determines the meaning of words like "make" based on context.
      • Syntactic Disambiguation: Determines the correct syntactic structure, such as whether "her" and "duck" refer to the same or different entities.
      • Speech Act Interpretation: Resolves ambiguities related to whether a sentence is a statement or a question.
  5. Applications of Ambiguity Resolution:

    • A variety of NLP tasks require ambiguity resolution:
      • Text-to-Speech Synthesis: Decides the correct pronunciation of words based on context, such as "lead" in "lead pipe" versus "lead me on."
      • Probabilistic Parsing: Determines the most likely syntactic structure for a sentence, resolving ambiguities in word grouping and sentence meaning.

Models and Algorithms

Over the past five decades, research in Natural Language Processing (NLP) has revealed that the diverse types of linguistic knowledge necessary for processing language can be effectively captured using a select set of formal models. These models, derived from the foundational principles of computer science, mathematics, and linguistics, form the backbone of modern NLP systems. Among the most significant models are state machines, rule systems, logic, probabilistic models, and vector-space models, each contributing uniquely to the processing and understanding of human language.

State machines are one of the fundamental models used in NLP. They are formal structures consisting of states, transitions between those states, and an input representation that guides these transitions. Variations of state machines, such as deterministic and non-deterministic finite-state automata and finite-state transducers, are particularly useful for tasks related to phonology, morphology, and syntax. These models help in structuring linguistic knowledge by providing a framework for understanding how sounds and words are processed.

Closely related to state machines are formal rule systems, which offer a declarative approach to modeling language. These systems encompass regular grammars, context-free grammars, and feature-augmented grammars, and they can be applied in both probabilistic and non-probabilistic contexts. Rule systems are vital for managing the complexities of phonology, morphology, and syntax, offering structured ways to represent and manipulate linguistic rules.

Logic-based models also play a crucial role in NLP, particularly in the areas of semantics and pragmatics. First-order logic, or predicate calculus, along with related formalisms such as lambda calculus and feature structures, has traditionally been employed to model the meanings and intentions behind language. While these logical representations have been foundational, recent advancements have introduced more robust techniques drawn from non-logical lexical semantics, expanding the capabilities of NLP systems.

Probabilistic models are perhaps the most versatile and essential for capturing various types of linguistic knowledge. By augmenting other models like state machines, rule systems, and logic with probabilities, these models can handle the inherent ambiguities in language processing. A prime example is the Hidden Markov Model (HMM), which is extensively used in tasks such as part-of-speech tagging, speech recognition, dialogue understanding, text-to-speech, and machine translation. The strength of probabilistic models lies in their ability to frame almost any NLP problem as a matter of choosing the most probable interpretation among multiple ambiguous inputs.
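
To make the "most probable interpretation" view concrete, the following Python sketch runs Viterbi decoding over a hand-built HMM for the ambiguous sentence "I made her duck"; the tagset, vocabulary, and every probability are invented purely for illustration.

# Toy Viterbi decoder over a hand-built HMM POS tagger.
# States, vocabulary, and all probabilities are invented for illustration.

states = ["PRON", "VERB", "NOUN"]

trans = {  # P(tag | previous tag); "START" marks the beginning of a sentence
    ("START", "PRON"): 0.6, ("START", "VERB"): 0.2, ("START", "NOUN"): 0.2,
    ("PRON", "VERB"): 0.8, ("PRON", "NOUN"): 0.1, ("PRON", "PRON"): 0.1,
    ("VERB", "NOUN"): 0.5, ("VERB", "PRON"): 0.4, ("VERB", "VERB"): 0.1,
    ("NOUN", "VERB"): 0.4, ("NOUN", "NOUN"): 0.3, ("NOUN", "PRON"): 0.3,
}

emit = {  # P(word | tag); "duck" is ambiguous between NOUN and VERB
    ("PRON", "i"): 0.4, ("PRON", "her"): 0.3,
    ("VERB", "made"): 0.1, ("VERB", "duck"): 0.01,
    ("NOUN", "duck"): 0.05,
}

def viterbi(words):
    """Return the most probable tag sequence for words under the toy HMM."""
    # best[t][s] = (probability, backpointer) for the best path ending in s
    best = [{s: (trans.get(("START", s), 0.0) * emit.get((s, words[0]), 0.0),
                 None) for s in states}]
    for t in range(1, len(words)):
        col = {}
        for s in states:
            col[s] = max((best[t - 1][p][0] * trans.get((p, s), 0.0)
                          * emit.get((s, words[t]), 0.0), p) for p in states)
        best.append(col)
    tag = max(states, key=lambda s: best[-1][s][0])   # best final state
    path = [tag]
    for t in range(len(words) - 1, 0, -1):            # follow backpointers
        tag = best[t][tag][1]
        path.append(tag)
    return list(reversed(path))

print(viterbi("i made her duck".split()))  # e.g. ['PRON', 'VERB', 'PRON', 'VERB']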

Another critical model in NLP is the vector-space model, rooted in linear algebra. This model is particularly important for tasks such as information retrieval and the interpretation of word meanings. By representing words as vectors in a multi-dimensional space, vector-space models facilitate the comparison and processing of textual data in a way that reflects semantic similarity and other linguistic properties.
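
A minimal sketch of the vector-space idea, with invented count vectors over three context dimensions, shows how cosine similarity captures the intuition that words appearing in similar contexts are semantically close.

import math

# Hand-made count vectors; the numbers are invented purely to illustrate
# the geometry of a vector-space model.
vectors = {
    "duck":  [9.0, 1.0, 4.0],
    "goose": [8.0, 2.0, 3.0],
    "pipe":  [1.0, 9.0, 2.0],
}

def cosine(u, v):
    """Cosine similarity: the angle between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(vectors["duck"], vectors["goose"]))  # high: similar contexts
print(cosine(vectors["duck"], vectors["pipe"]))   # low: different contexts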

Algorithms:

State-Space Search: The application of these models in NLP typically involves searching through a space of possible hypotheses about the input. For instance, in speech recognition, the system searches through potential sequences of phonemes to identify the correct word. In parsing, it explores various tree structures to determine the syntactic structure of a sentence. Similarly, in machine translation, the system evaluates different translation hypotheses to find the most accurate translation of a sentence. For tasks that are non-probabilistic, traditional graph algorithms like depth-first search are employed. However, for probabilistic tasks, heuristic algorithms such as best-first and A* search, along with dynamic programming techniques, are used to ensure computational efficiency.
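
The skeleton below shows the shape of such a best-first search; the expand, score, and goal functions are placeholders that a real recognizer or parser would supply, and the demo target is a toy invention.

import heapq

def best_first_search(start, expand, score, is_goal):
    """Generic best-first search over a space of hypotheses.
    expand: hypothesis -> successor hypotheses; score: lower is better."""
    frontier = [(score(start), start)]
    seen = set()
    while frontier:
        _, hyp = heapq.heappop(frontier)
        if is_goal(hyp):
            return hyp
        if hyp in seen:
            continue
        seen.add(hyp)
        for nxt in expand(hyp):
            if nxt not in seen:
                heapq.heappush(frontier, (score(nxt), nxt))
    return None

# Toy demo: grow a letter string toward a target "transcription"; the score
# counts mismatches plus the length difference, so matching prefixes win.
target = "walked"
print(best_first_search(
    "",
    expand=lambda h: [h + c for c in "dekltaw"],
    score=lambda h: sum(a != b for a, b in zip(h, target)) + abs(len(target) - len(h)),
    is_goal=lambda h: h == target,
))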

Machine learning algorithms also play a significant role in NLP, especially in tasks that involve classification and sequence modeling. Classifiers, such as decision trees, support vector machines, and logistic regression, are used to assign individual objects, like words, to specific classes, such as determining whether a word is spelled correctly. Sequence models, including Hidden Markov Models (HMMs), Maximum Entropy Markov Models, and Conditional Random Fields (CRFs), are designed to classify sequences of objects collectively, such as labeling all words in a sentence simultaneously.
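
As an illustration of the classifier setting, the sketch below trains a scikit-learn logistic regression to label words as well-formed or not; the two character-level features, the training words, and the labels are all toy inventions, and a real spelling classifier would use far richer evidence.

from sklearn.linear_model import LogisticRegression

# Toy spelling classifier: label words as well-formed (1) or not (0) using
# two crude character-level features. Data and features are invented.
def features(word):
    vowels = sum(c in "aeiou" for c in word)
    doubled = sum(a == b for a, b in zip(word, word[1:]))
    return [vowels / len(word), doubled]

train_words = ["cat", "walk", "duck", "xzqk", "kkkk", "tttp"]
labels      = [1,      1,      1,      0,      0,      0]

clf = LogisticRegression().fit([features(w) for w in train_words], labels)
print(clf.predict([features("goose"), features("zzzt")]))  # likely [1 0]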

In NLP research, the methodologies used are closely aligned with those in broader machine learning research. Researchers rely on distinct training and test datasets, statistical techniques like cross-validation, and rigorous evaluation protocols to assess the performance of trained systems. This methodological rigor ensures that NLP models and algorithms are both effective and reliable in real-world applications.


Turing Test

The capacity of computers to process language with human-like proficiency is often seen as a milestone towards achieving true machine intelligence. This belief is rooted in the idea that effective language use is closely linked to cognitive abilities. Alan Turing's seminal 1950 paper introduced what is now known as the Turing Test, a method for evaluating a machine's ability to exhibit intelligent behavior indistinguishable from that of a human. Turing proposed an empirical test where a human interrogator interacts with both a machine and a human through text-based communication. The goal for the machine is to convince the interrogator that it is the human participant, while the human's task is to expose the machine. If the machine succeeds in this task, it is considered intelligent according to Turing’s criteria.

One example from Turing's paper shows how such a test might unfold: an interrogator might ask the machine to write a sonnet or perform a simple arithmetic task. The machine's ability to respond in a way that mimics human thought processes is central to the test. Turing estimated that by the end of the twentieth century, a machine with about 10 gigabytes of memory would have a 30% chance of fooling a human interrogator within five minutes. Despite the challenges in this metric, Turing's main point was that the use of language could serve as an operational test for machine intelligence.

The relevance of Turing's ideas became evident in 1966 with the development of the ELIZA program by Joseph Weizenbaum. ELIZA was an early natural language processing system designed to simulate a Rogerian psychotherapist through pattern-matching techniques. Although simple, ELIZA's responses, such as asking why a user thought it didn’t argue or why it was perceived as fearful, were effective enough to lead some users to believe it genuinely understood their concerns. Weizenbaum observed that many users continued to attribute understanding to ELIZA even after its operation was explained to them. This phenomenon highlighted the power of conversational simulation in creating the illusion of understanding.

In recent years, the Loebner Prize competition has continued to test computer programs against the Turing Test. While these contests have demonstrated that even rudimentary programs can sometimes deceive judges, they have also fueled ongoing debate about the efficacy of the Turing Test as a measure of intelligence. Nevertheless, Turing’s prediction about the evolving perception of machines has been affirmed. Research in social sciences, such as studies by Reeves and Nass, has shown that people interact with computers as if they were social entities. For example, individuals tend to respond more positively to computers that exhibit social behaviors, such as asking for feedback or giving compliments.

These insights underscore the importance of designing conversational agents that interact naturally with users. As people increasingly treat computers as social participants, the development of systems that effectively communicate in a conversational manner remains a significant focus in the field of natural language processing.


Language, Thought, and Understanding

The relationship between language, thought, and understanding is a central theme in natural language processing (NLP) and cognitive science. It explores how these three elements interact to enable effective communication and comprehension, both for humans and for artificial systems.


Language and thought are deeply interconnected but distinct processes. Language is a system of symbols and rules used for communication, while thought refers to the cognitive processes of understanding, reasoning, and problem-solving. The relationship between the two can be described as follows:

  1. Language as a Medium for Thought: Language provides a framework through which thoughts can be expressed, organized, and communicated. For humans, language facilitates the articulation of complex ideas and abstract concepts, allowing individuals to share and develop their thoughts collaboratively. This interaction implies that while thought can occur without language, the ability to communicate and refine these thoughts often relies on linguistic expression.

  2. Thought Influencing Language: The cognitive processes underlying thought influence how language is used. For example, the structure of our thoughts can shape the way we construct sentences and choose words. Cognitive theories, such as those proposed by Vygotsky, suggest that language development is closely tied to cognitive development, as linguistic skills support more sophisticated forms of thinking.

  3. Language Shaping Thought: The Sapir-Whorf Hypothesis posits that the language we use can shape our perception of reality and influence cognitive processes. This hypothesis, while controversial, suggests that the structure and vocabulary of a language can affect how speakers perceive and categorize the world. For instance, languages with multiple words for different types of snow may lead speakers to recognize and think about snow in more nuanced ways.

Understanding in NLP

In the context of NLP, understanding involves a machine's ability to interpret and process human language in a meaningful way. This process requires bridging the gap between syntactic structures (how sentences are formed) and semantic meaning (the underlying concepts and ideas). Key aspects include:

  1. Syntactic Processing: NLP systems need to analyze the grammatical structure of sentences to identify the relationships between words. This involves parsing sentences into their syntactic components, such as subjects, verbs, and objects. Syntactic models, like context-free grammars and dependency grammars, help NLP systems understand how words combine to form coherent sentences.

  2. Semantic Processing: Beyond syntax, understanding requires grasping the meaning of words and sentences. This involves word sense disambiguation, where a system determines the correct meaning of a word based on its context. Semantic models, such as those based on vector spaces or logical representations, help in capturing the nuanced meanings of words and phrases.

  3. Pragmatic Understanding: Effective communication often depends on context and intent. NLP systems must interpret not only the literal meaning of words but also the speaker’s intentions and the conversational context. Pragmatic analysis involves understanding subtleties such as politeness, sarcasm, and implied meanings.

  4. Integration of Knowledge: For a machine to truly understand language, it must integrate various types of knowledge, including world knowledge, common sense, and domain-specific information. This integration allows NLP systems to make informed decisions and provide relevant responses in a given context.
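
A small Lesk-style sketch illustrates semantic processing in code: pick the sense whose definition shares the most content words with the sentence. The two "bank" definitions and the stopword list are hand-written stand-ins for a real lexical resource such as WordNet.

# Minimal Lesk-style word sense disambiguation sketch.
STOP = {"the", "of", "a", "an", "and", "on", "i", "or", "is", "in"}

SENSES = {  # word -> {definition: sense label}; entries are illustrative
    "bank": {
        "financial institution where money is deposited": "bank/finance",
        "sloping land beside a river or lake": "bank/river",
    },
}

def lesk(word, sentence):
    """Choose the sense whose definition best overlaps the context."""
    context = set(sentence.lower().split()) - STOP
    def overlap(definition):
        return len(context & (set(definition.split()) - STOP))
    best = max(SENSES[word], key=overlap)
    return SENSES[word][best]

print(lesk("bank", "I sat on the bank of the river and watched the water"))
# -> bank/river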

The State of the Art 

The field of speech and language processing is experiencing significant growth, driven by advances in computing power, the proliferation of the Internet, and the increasing use of mobile technology. This surge has led to the development and deployment of various sophisticated systems that illustrate the power and potential of these technologies. Here are some prominent examples:

  1. Conversational Agents for Travel: Major travel providers like Amtrak and United Airlines utilize conversational agents to assist customers with making reservations and obtaining arrival and departure information. These systems guide users through the booking process, improving efficiency and user experience.

  2. In-Car Voice Control Systems: Luxury automakers such as Mercedes-Benz have integrated automatic speech recognition and text-to-speech systems into their vehicles. These systems allow drivers to control their environment, entertainment, and navigation systems using voice commands, enhancing safety and convenience. A similar technology is used by astronauts on the International Space Station to manage onboard systems through spoken commands.

  3. Video Search Technology: Companies like Blinkx leverage speech recognition technology to index and search millions of hours of video content available on the Web. By converting spoken words in video tracks into searchable text, these systems make it easier for users to find specific video content.

  4. Cross-Language Information Retrieval: Google’s translation and retrieval systems enable users to search for information across different languages. Users can input queries in their native language, which Google translates and uses to find relevant pages in other languages. The results are then translated back into the user’s language, facilitating cross-language access to information.

  5. Automated Essay Grading: Educational publishers like Pearson and testing services such as ETS use automated systems to evaluate and grade student essays. These systems assess writing with a level of consistency and efficiency comparable to human graders, supporting large-scale educational assessments.

  6. Interactive Tutoring: Animated characters are employed as interactive tutors to assist children in learning to read. These systems engage children through lifelike interactions, providing personalized instruction and support to enhance reading skills.

  7. Text Analysis for Marketing Intelligence: Companies like Nielsen Buzzmetrics, Umbria, and Collective Intellect analyze user opinions, preferences, and attitudes expressed in online content such as weblogs, discussion forums, and user groups. These text analysis systems help businesses gain insights into consumer behavior and market trends.

  8. Voice-Activated Virtual Assistants: Virtual assistants like Amazon's Alexa, Apple's Siri, and Google Assistant have become integral to daily life. These voice-activated systems enable users to perform tasks such as setting reminders, controlling smart home devices, and retrieving information by simply speaking commands. They utilize natural language processing to understand and respond to user requests, showcasing the practical application of speech and language technologies in everyday settings.

Four Paradigms: 1970–1983

Between 1970 and 1983, research in speech and language processing saw significant growth, leading to the establishment of several research paradigms that continue to shape the field. During this period, four primary paradigms emerged, each contributing to the advancement of natural language processing (NLP) in unique ways.

The stochastic paradigm played a pivotal role, particularly in the development of speech recognition algorithms. A major breakthrough in this paradigm was the application of the Hidden Markov Model (HMM) and the concepts of the noisy channel and decoding. These concepts were independently developed by researchers like Jelinek, Bahl, Mercer, and their colleagues at IBM’s Thomas J. Watson Research Center, as well as by Baker at Carnegie Mellon University. Their work was heavily influenced by Baum and his team at the Institute for Defense Analyses in Princeton. Concurrently, AT&T’s Bell Laboratories emerged as a hub for pioneering work in speech recognition and synthesis, laying the groundwork for the modern advancements in these areas.

The logic-based paradigm was another crucial development during this era, initiated by the work of Colmerauer and his colleagues on Q-systems and metamorphosis grammars, which eventually led to the creation of Prolog and Definite Clause Grammars. Kay’s work on functional grammar and later, Bresnan and Kaplan’s development of Lexical Functional Grammar (LFG), underscored the importance of feature structure unification, further advancing the field.

The field of natural language understanding (NLU) also made significant strides. Terry Winograd’s SHRDLU system was a groundbreaking development that simulated a robot operating in a world of toy blocks. This system could understand and respond to complex natural language commands, demonstrating a level of sophistication previously unseen. SHRDLU’s success highlighted the progress in parsing and shifted the focus toward semantics and discourse models. Around the same time, Roger Schank and his colleagues at Yale, often referred to as the Yale School, developed a series of language understanding programs that explored human conceptual knowledge, including scripts, plans, goals, and memory organization. This work integrated network-based semantics and began incorporating Fillmore’s notion of case roles into their representations.

The discourse modeling paradigm focused on understanding and representing discourse structure. Key contributions in this area included the study of substructure in discourse and discourse focus by Grosz and her colleagues, as well as the development of automatic reference resolution techniques by researchers like Hobbs. Additionally, the BDI (Belief-Desire-Intention) framework was introduced, providing a logic-based approach to analyzing speech acts.

These paradigms laid the foundation for many of the advancements in NLP and continue to influence ongoing research and development in the field.

Survey of English Morphology

Introduction to Morphology

Morphology is the study of the structure and formation of words from smaller units of meaning called morphemes. A morpheme is the smallest meaning-bearing unit in a language. For example:

  • Fox consists of a single morpheme (fox).
  • Cats consists of two morphemes: cat (the stem) and -s (a suffix indicating plural).

Classes of Morphemes

Morphemes are broadly categorized into:

  • Stems: The main morpheme that provides the core meaning of the word.
  • Affixes: Morphemes that attach to a stem to modify its meaning. Affixes include:
    • Prefixes: Attached to the beginning of a stem (e.g., un-buckle).
    • Suffixes: Attached to the end of a stem (e.g., eat-s).
    • Infixes: Inserted into the middle of a stem (common in some languages like Tagalog).
    • Circumfixes: Attached around a stem (e.g., ge-sag-t in German for the past participle of "sagen").

Word Formation Processes

Words can be formed in various ways, playing a crucial role in language processing. The four key processes include:

  1. Inflection: Combines a stem with grammatical morphemes, typically resulting in a word of the same class but performing a different syntactic function (e.g., plural -s or past tense -ed in English).
  2. Derivation: Combines a stem with a grammatical morpheme, often resulting in a word of a different class with a meaning that can be hard to predict (e.g., computerize → computerization).
  3. Compounding: Combines multiple stems into a single word (e.g., doghouse).
  4. Cliticization: Attaches clitics (reduced forms of words) to a stem (e.g., 've in I’ve).

Inflectional Morphology in English

English has a relatively simple inflectional system, with inflectional changes mostly occurring in nouns, verbs, and sometimes adjectives:

  • Nouns: Inflection for plurality (e.g., cat/cats) and possession (e.g., llama’s).
  • Verbs: Inflection based on tense and agreement (e.g., walk/walks/walking/walked).

Regular Verbs

Regular verbs follow predictable patterns in their morphological forms:

  • Stem: walk
  • -s Form: walks
  • -ing Participle: walking
  • Past Form: walked

Irregular Verbs

Irregular verbs have more idiosyncratic forms and do not always follow predictable patterns:

  • Stem: eat
  • -s Form: eats
  • -ing Participle: eating
  • Preterite: ate
  • Past Participle: eaten

Derivational Morphology in English

Derivational morphology is more complex than inflection, often producing new words with different syntactic categories:

  • Nominalization: Creating nouns from verbs or adjectives (e.g., computerize → computerization).
  • Adjective Formation: Creating adjectives from nouns or verbs (e.g., computation → computational).

Derivational processes are less productive than inflectional ones, meaning not every stem can be combined with every derivational suffix.

Cliticization

Clitics are units that lie between a full word and an affix:

  • Proclitics: Precede a word (e.g., 'm in I’m).
  • Enclitics: Follow a word (e.g., ’ve in I’ve).

In English, clitics often appear as reduced forms attached to verbs, articles, or conjunctions.

Non-concatenative Morphology

Some languages use non-concatenative morphology, where morphemes combine in non-linear ways:

  • Templatic Morphology: Common in Semitic languages like Arabic and Hebrew, where a root and a pattern combine to form different words (e.g., the Hebrew root lmd combines with various templates to form words meaning "studied," "taught," and "was taught").

Agreement

In English, subject nouns and verbs must agree in number, meaning they must both be singular or both plural. Other languages may also require gender agreement between nouns, adjectives, and sometimes verbs.

Finite-State Morphological Parsing

Morphological parsing involves analyzing the structure of words to determine their stems and the associated morphological features, such as tense, number, gender, etc. The goal of morphological parsing is to take an input word form (e.g., "cats," "cities," "bebo," "canto") and produce a morphological parse that identifies the stem of the word along with its morphological features.

Example: English and Spanish Morphological Parsing

Below are examples of morphological parsing for English and Spanish words:

English  | Morphological Parse | Spanish | Morphological Parse      | Gloss
cats     | cat +N +Pl          | pavos   | pavo +N +Masc +Pl        | ducks
cat      | cat +N +Sg          | pavo    | pavo +N +Masc +Sg        | duck
cities   | city +N +Pl         | bebo    | beber +V +PInd +1P +Sg   | I drink
geese    | goose +N +Pl        | canto   | cantar +V +PInd +1P +Sg  | I sing
goose    | goose +N +Sg        | canto   | canto +N +Masc +Sg       | song
goose    | goose +V            | puse    | poner +V +Perf +1P +Sg   | I was able
gooses   | goose +V +1P +Sg    | vino    | venir +V +Perf +3P +Sg   | he/she came
merging  | merge +V +PresPart  | vino    | vino +N +Masc +Sg        | wine
caught   | catch +V +PastPart  | lugar   | lugar +N +Masc +Sg       | place
caught   | catch +V +Past      |         |                          |

Morphological Features

The second column in the table above shows the stem of each word and its associated morphological features. Some key features include:

  • +N: Indicates the word is a noun.
  • +Sg: Indicates singular form.
  • +Pl: Indicates plural form.
  • +V: Indicates the word is a verb.
  • +Masc: Masculine gender (specific to Spanish nouns).
  • +PInd: Present Indicative (specific to Spanish verbs).
  • +Perf: Perfect tense (specific to Spanish verbs).
  • +1P: First person.
  • +3P: Third person.

Note: Some input forms, such as "caught," "goose," "canto," or "vino," are ambiguous and can have multiple morphological parses. At this stage, morphological parsing aims to list all possible parses, and the task of disambiguating between them will be discussed later.
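
As a toy illustration, a parser that merely enumerates parses can be sketched as a lookup table over the forms in the table above; real systems compute these parses with the finite-state machinery introduced below rather than by listing them.

# Toy morphological lexicon mapping surface forms to all of their parses.
# Listing every parse is the parser's job here; choosing between them is
# left to later (contextual) disambiguation.
PARSES = {
    "cats":   ["cat +N +Pl"],
    "geese":  ["goose +N +Pl"],
    "goose":  ["goose +N +Sg", "goose +V"],
    "caught": ["catch +V +PastPart", "catch +V +Past"],
    "canto":  ["cantar +V +PInd +1P +Sg", "canto +N +Masc +Sg"],
}

def parse(word):
    """Return every morphological parse of word (or [] if unknown)."""
    return PARSES.get(word, [])

print(parse("caught"))  # ['catch +V +PastPart', 'catch +V +Past']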

Components of a Morphological Parser

To build a morphological parser, three essential components are needed:

  1. Lexicon: A list of stems and affixes, along with basic information about them (e.g., whether a stem is a noun or verb).

  2. Morphotactics: The rules governing the order of morphemes within a word. For example, in English, the plural morpheme ("-s") follows the noun rather than preceding it.

  3. Orthographic Rules: Spelling rules that model changes occurring when two morphemes combine. An example is the rule that changes "city + -s" to "cities" instead of "citys."
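
The sketch below wires toy versions of all three components together for English plural nouns; the lexicon entries and the two spelling rules are simplified inventions for illustration.

# A minimal sketch of the three components: a lexicon of stems, a
# morphotactic rule (plural -s follows the noun stem), and orthographic
# rules (y -> ie and e-insertion). All entries are illustrative.
LEXICON = {"cat": "N", "city": "N", "fox": "N"}

def pluralize(stem):
    """Concatenate stem + '-s', applying simple English spelling rules."""
    if LEXICON.get(stem) != "N":
        raise ValueError(f"{stem!r} is not a noun in the lexicon")
    if stem.endswith("y") and stem[-2] not in "aeiou":
        return stem[:-1] + "ies"     # city + -s -> cities
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"           # fox + -s -> foxes (e-insertion)
    return stem + "s"                # cat + -s -> cats

print(pluralize("city"), pluralize("fox"), pluralize("cat"))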

Finite-State Morphological Parsing

To represent a simple lexicon for morphological recognition, we use finite-state automata (FSAs) to model morphotactic knowledge. This approach is extended by introducing finite-state transducers (FSTs) to model morphological features and handle the task of morphological parsing. FSTs are also used to model orthographic rules and address the complexities of spelling changes in morphological analysis.

Building a Finite-State Lexicon 

What is a Lexicon?

A lexicon in computational linguistics is a repository of words or morphemes (the smallest units of meaning). The simplest form of a lexicon is a list of every word in a language, including proper names and abbreviations, such as:

a, AAA, AA, Aachen, aardvark, aardwolf, aba, abaca, aback, ...

However, due to the vast number of words in a language, it's often impractical to list every word explicitly. Instead, computational lexicons are typically structured with lists of stems and affixes, along with a model of the morphotactics that describes how these morphemes can combine to form valid words.

Modeling Morphotactics with Finite-State Automata (FSA)

One common way to model morphotactics is by using Finite-State Automata (FSA). Below are examples of how FSAs can be used to model English nominal and verbal inflections.

English Nominal Inflection

An FSA modeling the nominal inflection of English nouns includes:

  • Regular nouns (reg-noun): These take the regular "-s" plural (e.g., cat, dog, fox).
  • Irregular singular nouns (irreg-sg-noun): These don't take "-s" in the plural form (e.g., goose, mouse).
  • Irregular plural nouns (irreg-pl-noun): These have irregular plural forms (e.g., geese, mice).
reg-noun  | irreg-sg-noun | irreg-pl-noun | plural
fox       | goose         | geese         | -s
cat       | mouse         | mice          | -s
aardvark  | sheep         | sheep         | -s
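
In code, this FSA can be simulated directly with the word classes from the table above. The simulation works at the morphotactic level, so a regular stem plus "-s" is accepted as is; the orthographic rules discussed later are what turn fox^s into "foxes". This is an illustrative sketch, not a full FSA library.

# Simulating the nominal-inflection FSA with the word classes above.
REG_NOUN = {"fox", "cat", "aardvark"}
IRREG_SG = {"goose", "mouse", "sheep"}
IRREG_PL = {"geese", "mice", "sheep"}

def accepts(word):
    """q0 --noun-stem--> q1 --(-s for regular nouns)--> q2 (final)."""
    if word in REG_NOUN or word in IRREG_SG or word in IRREG_PL:
        return True                      # bare stem or irregular plural
    return word.endswith("s") and word[:-1] in REG_NOUN   # regular plural

for w in ["cats", "geese", "goose", "gooses", "aardvarks"]:
    print(w, accepts(w))   # "gooses" is rejected: goose is irregular
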
English Verbal Inflection

An FSA modeling the verbal inflection of English verbs includes:

  • Regular verb stems (reg-verb-stem): These take regular inflectional endings (e.g., walk, talk).
  • Irregular verb stems (irreg-verb-stem): These have irregular forms in past tense or participles (e.g., cut, sing).
  • Irregular past verb forms (irreg-past-stem): These do not follow the regular "-ed" past tense rule (e.g., caught, ate).
reg-verb-stem | irreg-verb-stem | irreg-past-stem | past | past-part | pres-part | 3sg
walk          | cut             | caught          | -ed  | -ed       | -ing      | -s
fry           | speak           | ate             | -ed  | eaten     | -ing      | -s
talk          | sing            | sang            | -ed  | eaten     | -ing      | -s

English Derivational Morphology

Derivational morphology in English, which deals with how new words are formed by adding prefixes or suffixes, is more complex than inflectional morphology. The complexity of derivational morphology often requires more sophisticated models, such as context-free grammars.

Example: English Adjective Morphotactics

Consider the morphotactics of English adjectives. Here are some examples:

  • Basic adjectives: big, cool, happy, red
  • Comparative forms: bigger, cooler, happier, redder
  • Superlative forms: biggest, coolest, happiest, reddest
  • Adverbial forms: coolly, happily, really, clearly, unclearly
  • Negated forms: unhappy, unclear, unreal

An initial FSA might suggest the following structure, with ε allowing the prefix or suffix to be skipped:
q0 --(un- | ε)--> q1 --adj-root--> q2 --(-er | -est | -ly | ε)--> q3

This FSA recognizes all the adjectives listed above but also incorrectly recognizes ungrammatical forms like "unbig," "unfast," or "oranger." To correct this, we need to categorize adjectives into different classes and specify which suffixes they can take.

Modeling English Derivational Morphology


An FSA can model fragments of English nominal and verbal derivational morphology. For instance, it can capture the general rule that any verb ending in "-ize" can take the nominalizing suffix "-ation" (e.g., fossilize → fossilization). Similarly, adjectives ending in "-al" or "-able" can take the suffix "-ity" or "-ness" (e.g., formal → formality, natural → naturalness).

Morphological Recognition

The FSAs developed for morphotactics can be expanded to solve the problem of morphological recognition—determining whether an input string is a legitimate word. This is done by expanding each arc in the FSA (e.g., the "reg-noun-stem" arc) with all the morphemes in that category. The expanded FSA operates at the level of individual letters.


For example, such an expanded FSA recognizes English nouns like "aardvarks" by matching the input letter by letter against its transitions.

Finite-State Transducers

Finite-State Transducers (FSTs) are a generalization of Finite-State Automata (FSAs). While FSAs are used to model the morphotactic structure of a lexicon and for word recognition, FSTs map between two sets of symbols. Essentially, FSTs are like two-tape automata that recognize or generate pairs of strings. They can be visualized as having transitions labeled with pairs of symbols, one from each tape.

Example of an FST
Consider the FST below, where each transition is labeled with an input and output string separated by a colon:
q0 --aa:b--> q1
q1 --b:ε--> q2
q1 --b:a--> q3
q3 --a:ba--> q4

This FST maps the string "aa" to "b," "b" to either "a" or the empty string ε, and "a" to "ba." FSTs have a more general function than FSAs; while FSAs define a formal language as a set of strings, FSTs define a relation between sets of strings.
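
This toy FST can be simulated in a few lines of Python: the recursion tries every arc whose input label matches a prefix of the remaining string, so the nondeterministic choice at q1 falls out naturally.

# Simulating the toy FST above. Each arc consumes an input substring and
# emits an output substring; the two arcs for "b" at q1 make it
# nondeterministic, so one input may yield several outputs.
TRANSITIONS = {
    ("q0", "aa"): [("q1", "b")],
    ("q1", "b"):  [("q2", ""), ("q3", "a")],   # b:ε and b:a
    ("q3", "a"):  [("q4", "ba")],
}
FINAL = {"q2", "q4"}

def transduce(state, rest, out, results):
    """Collect every output reachable while consuming all of rest."""
    if not rest and state in FINAL:
        results.append(out)
    for (src, inp), arcs in TRANSITIONS.items():
        if src == state and rest.startswith(inp):
            for dst, emitted in arcs:
                transduce(dst, rest[len(inp):], out + emitted, results)
    return results

print(transduce("q0", "aab", "", []))    # ['b']    (aa:b, then b:ε)
print(transduce("q0", "aaba", "", []))   # ['baba'] (aa:b, b:a, a:ba)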

Four Ways to Think About FSTs

  1. FST as Recognizer: An FST can take a pair of strings as input and output "accept" if the pair is in the language, and "reject" if it is not.
  2. FST as Generator: An FST can generate pairs of strings from the language.
  3. FST as Translator: An FST can read one string and output another string, effectively translating between representations.
  4. FST as Set Relater: An FST computes relations between sets of strings.

These functions are useful in various applications in speech and language processing. For example, in morphological parsing, FSTs can translate an input string of letters into a corresponding string of morphemes.

Formal Definition of an FST

An FST can be formally defined with the following seven parameters:

  • Q: A finite set of states {q0, q1, ..., q_{N-1}}.
  • Σ: A finite set representing the input alphabet.
  • Δ: A finite set representing the output alphabet.
  • q0: The start state, where q0 ∈ Q.
  • F: The set of final states, F ⊆ Q.
  • δ(q, w): The transition function, which returns a set of new states given a state q ∈ Q and an input string w ∈ Σ*.
  • σ(q, w): The output function, which returns a set of output strings given a state q ∈ Q and an input string w ∈ Σ*.

Where FSAs are isomorphic to regular languages (sets of strings), FSTs are isomorphic to regular relations (sets of pairs of strings).

Closure Properties of FSTs

Like regular languages, FSTs and regular relations are closed under union. However, they are generally not closed under difference, complementation, or intersection, although some subclasses of FSTs have closure properties under these operations.

In addition to union, FSTs have two important closure properties:

  1. Inversion: The inversion of a transducer T (denoted T^-1) switches the input and output labels. If T maps from input alphabet I to output alphabet O, then T^-1 maps from O to I. This property is useful for converting an FST from a parser into a generator.

  2. Composition: If T1 is a transducer from I1 to O1 and T2 is a transducer from O1 to O2, then the composition T1 ∘ T2 maps from I1 to O2. Composition allows two transducers running in series to be replaced by a single, more complex transducer. Applying T1 ∘ T2 to an input sequence S is equivalent to applying T1 to S and then T2 to the result: (T1 ∘ T2)(S) = T2(T1(S)).
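
Because FSTs denote regular relations, inversion and composition can be illustrated directly on small relations, i.e., sets of string pairs; the pairs below are invented to echo the lexical, intermediate, and surface levels used later in this chapter.

# Inversion and composition on finite relations (sets of string pairs),
# the semantics that FST inversion and composition implement.
T1 = {("cat", "cat^s"), ("fox", "fox^s")}      # lexical -> intermediate
T2 = {("cat^s", "cats"), ("fox^s", "foxes")}   # intermediate -> surface

def invert(t):
    """T^-1: swap input and output, turning a parser into a generator."""
    return {(o, i) for (i, o) in t}

def compose(t1, t2):
    """T1 o T2: relate x to z whenever T1 maps x->y and T2 maps y->z."""
    return {(x, z) for (x, y1) in t1 for (y2, z) in t2 if y1 == y2}

print(compose(T1, T2))           # {('cat', 'cats'), ('fox', 'foxes')}
print(invert(compose(T1, T2)))   # surface -> lexical: the parsing direction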

Projection of an FST

The projection of an FST is the Finite-State Automaton (FSA) produced by extracting only one side of the relation. The projection to the left or upper side of the relation is called the first projection, and the projection to the lower or right side is called the second projection.

Sequential Transducers and Determinism 

Transducers, as discussed, can be nondeterministic. This means that a single input may correspond to multiple possible outputs. Nondeterministic Finite-State Transducers (FSTs) can be slow because they require complex search algorithms, similar to those discussed earlier in NLP, to process. This raises the question of whether nondeterministic FSTs can be converted into deterministic ones. However, unlike Finite-State Automata (FSAs), not all FSTs can be determinized.

Sequential Transducers

Sequential transducers are a subtype of FSTs that are deterministic on their input. This means that for any state in the transducer, each symbol in the input alphabet Σ can label at most one transition out of that state. In a sequential transducer, the transitions out of each state are uniquely determined by the state and the input symbol. An example is shown in the following diagram:

q0 --a:ba--> q1

q0 --b:b--> q1

In the above diagram, each input symbol leads to a deterministic transition to another state. Sequential transducers can have epsilon symbols (which represent the empty string) in the output, but not in the input.

However, sequential transducers are not necessarily sequential on their output. For example, in some cases, two distinct transitions from the same state may produce the same output symbol. Therefore, when discussing sequentiality, it is important to specify the direction of the transduction (input to output or output to input).

Formal Definition:

  • The transition function δ is modified to map from Q × Σ* to Q, rather than to a set of states 2^Q.
  • The output function σ maps from Q × Σ* to Δ*, rather than to a set of output strings 2^(Δ*).

Subsequential Transducers

A subsequential transducer is a generalization of sequential transducers. It allows for the generation of an additional output string at the final states, which is concatenated to the output produced so far. This concept was introduced by Schützenberger in 1977.

Example: Consider a subsequential transducer that, upon reaching a final state, appends an additional string to the output generated by the transitions. This feature is particularly useful in applications where deterministic processing of input is required, but with some flexibility in the final output.

The key advantages of subsequential transducers include:

  1. Efficiency: They are deterministic on input, so they can process inputs in linear time relative to the length of the input string. This contrasts with nondeterministic transducers, where processing time can depend on the number of states.
  2. Determinization and Minimization: There are efficient algorithms for determinizing and minimizing subsequential transducers, extending similar algorithms for FSAs. These algorithms are crucial for optimizing the performance of transducers in NLP applications.

Handling Ambiguity with p-Subsequential Transducers

While sequential and subsequential transducers are deterministic and efficient, they transduce each input string to exactly one possible output string. However, natural language often involves ambiguity, where a single input can correspond to multiple valid outputs.

To address this, p-subsequential transducers have been introduced. A p-subsequential transducer allows p (where p ≥ 1) final output strings to be associated with each final state. This capability to handle a finite amount of ambiguity is highly useful in many NLP tasks, such as:

  • Representation of Dictionaries
  • Compilation of Morphological and Phonological Rules
  • Local Syntactic Constraints

For many problems in these areas, p-subsequential transducers can be determinized and minimized, making them efficient and practical for real-world applications.

FSTs for Morphological Parsing

Morphological parsing involves analyzing a word to identify its constituent morphemes (the smallest meaningful units of language) and the grammatical information they convey. For example:

  • Given the English word "cats," a morphological parser should output "cat +N +Pl," indicating that "cat" is a plural noun.
  • Given the Spanish word "bebo" (meaning "I drink"), the parser should output "beber +V +PInd +1P +Sg," indicating that "bebo" is the first-person singular form of the verb "beber" in the present indicative tense.

Finite-State Morphology

In finite-state morphology, words are represented as a correspondence between:

  1. Lexical Level: Represents the concatenation of morphemes that make up a word.
  2. Surface Level: Represents the actual spelling of the word as it appears.
 
  • Lexical Tape: Composed of characters from the lexical alphabet Σ.
  • Surface Tape: Composed of characters from the surface alphabet Δ.

Each transition in the FST corresponds to a pair of symbols: one from the lexical tape and one from the surface tape. For example:

  • a:a means that the symbol "a" on the lexical tape corresponds to "a" on the surface tape.
  • a:ε means that the symbol "a" on the lexical tape corresponds to nothing on the surface tape (indicating a deletion).

These symbol pairs form a new alphabet Σ′, where Σ′ ⊆ Σ × Δ. This new alphabet is used to define the FST.

Feasible Pairs and Default Pairs

  • Feasible Pair: A pair a:b in Σ′ that indicates how a symbol on the lexical tape maps to a symbol on the surface tape.
  • Default Pair: A pair where a symbol maps to itself (e.g., a:a). These are often simplified by referring to them with the single letter (e.g., "a" instead of "a:a").

Building an FST Morphological Parser

To construct a morphological parser using FSTs, we need to expand earlier morphotactic Finite-State Automata (FSAs) by adding:

  • An extra "lexical" tape.
  • Morphological features (e.g., +Sg for singular, +Pl for plural).

For example, consider an FST that handles English nominal number inflection. The symbols above each arc represent elements of the morphological parse on the lexical tape, while the symbols below represent the surface tape or intermediate tape.

  • Morpheme Boundary (^): Indicates the boundary between morphemes.
  • Word Boundary (#): Indicates the boundary between words.


The output includes morpheme boundary markers (^) and word boundary markers (#). These markers are present on intermediate tapes, and further processing is needed to remove them to achieve the final surface form.

 Combining FST Lexicon and Rules

In natural language processing, finite-state transducers (FSTs) can be used for both parsing and generating words by combining a lexicon with spelling rules. The lexicon transducer maps between:

  • Lexical Level: Consists of word stems and morphological features.
  • Intermediate Level: Represents a simple concatenation of morphemes.

Architecture of Two-Level Morphology Systems

The architecture of a two-level morphology system typically consists of a cascade of transducers:

  1. Lexicon Transducer: Maps from the lexical level to the intermediate level.
  2. Spelling Rule Transducers: A set of parallel transducers that map from the intermediate level to the surface level.

Cascade of Transducers: A cascade involves running transducers in series, where the output of one transducer becomes the input for the next. This setup can be used both for generating words (top-down approach) and parsing them (bottom-up approach).

Example: Parsing and Generating "Foxes"

Consider the word "foxes":

  • Lexical Level: fox +N +Pl (noun, plural).
  • Intermediate Level: fox^s# (morpheme boundary ^ and word boundary #).
  • Surface Level: foxes (final output).

During parsing, the transducers process the input to map from the surface level back to the lexical level, handling ambiguities that may arise (e.g., "foxes" could be a verb in another context). The transducer identifies multiple possible parses, such as fox +N +Pl (noun) and fox +V +3Sg (verb, third person singular). Disambiguation requires external evidence, such as the surrounding context.
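
A minimal Python sketch of the generation direction of this cascade, with a single hard-coded e-insertion spelling rule, looks as follows; the rule pattern is deliberately simplified and covers only these illustrative cases.

# Sketch of the two-level cascade for generation: the lexicon transducer
# maps "fox +N +Pl" to the intermediate tape "fox^s#", and an e-insertion
# spelling rule maps that to the surface form "foxes".
import re

def lexicon_transducer(lexical):
    stem, *feats = lexical.split()
    suffix = "^s" if "+Pl" in feats else ""
    return stem + suffix + "#"           # intermediate tape

def e_insertion(intermediate):
    # Insert e before s when the morpheme boundary follows x, s, or z.
    surface = re.sub(r"([xsz])\^s#", r"\1es", intermediate)
    return surface.replace("^", "").replace("#", "")

inter = lexicon_transducer("fox +N +Pl")              # 'fox^s#'
print(inter, "->", e_insertion(inter))                # fox^s# -> foxes
print(e_insertion(lexicon_transducer("cat +N +Pl")))  # cats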

Handling Ambiguity in Parsing

Ambiguity: During parsing, ambiguity can arise due to multiple possible interpretations of a word. For example, "foxes" might be parsed as a plural noun or as a verb. In cases like these, the transducer enumerates all possible parses. Disambiguation relies on external context, such as sentence structure.

Local Ambiguity: This occurs during the parsing process itself. For instance, when parsing the word "assess," the E-insertion transducer might initially interpret "ass" as leading to "asses" (plural form), but later corrects itself upon encountering the next character. This requires the FST to incorporate a search algorithm to handle non-determinism in parsing.

Simplifying the Cascade: Composition and Intersection

Running multiple transducers in a cascade can be complex. Fortunately, transducers can be:

  • Composed: Combine a series of transducers into a single, more complex transducer.
  • Intersected: Parallel transducers can be combined using an intersection algorithm, which creates a new state for each combination of states from the original transducers.

Intersection Algorithm: The intersection algorithm takes the Cartesian product of the states of the two machines. For any input symbol, if one machine transitions to state qn and the other to qm, the combined machine transitions to the paired state q_{nm}.
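
The sketch below runs two toy DFAs in lockstep, which is exactly the product construction: the pair of current states plays the role of the combined state q_{nm}. Both automata and their alphabet are invented for illustration.

# Product construction for intersecting two DFAs: the combined machine's
# states are pairs (qa, qb), and it accepts only strings both accept.
# Toy DFAs over {a, b}: A accepts strings ending in 'a'; B accepts strings
# of even length.
A = {("a0", "a"): "a1", ("a0", "b"): "a0", ("a1", "a"): "a1", ("a1", "b"): "a0"}
B = {("b0", "a"): "b1", ("b0", "b"): "b1", ("b1", "a"): "b0", ("b1", "b"): "b0"}
FINAL_A, FINAL_B = {"a1"}, {"b0"}

def accepts_intersection(s):
    qa, qb = "a0", "b0"
    for ch in s:
        qa, qb = A[(qa, ch)], B[(qb, ch)]   # the product state (qa, qb)
    return qa in FINAL_A and qb in FINAL_B

print(accepts_intersection("ba"))   # True: ends in 'a' and has even length
print(accepts_intersection("a"))    # False: odd length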

Lexicon-Free FSTs: The Porter Stemmer

Finite-State Transducers (FSTs) can be designed without a lexicon. In such systems, instead of relying on a predefined list of word stems, the FST applies a set of rules to manipulate the surface form of words directly. This approach is useful for tasks like stemming, where the goal is to reduce words to their root forms.

The Porter Stemmer

One of the most widely used lexicon-free stemming algorithms is the Porter Stemmer, developed by Martin Porter in 1980. The Porter Stemmer is a rule-based system that systematically removes common morphological and inflectional suffixes from English words to reduce them to their base or root form.

For example:

  • "running""run"
  • "happiness""happi"

How the Porter Stemmer Works

The Porter Stemmer operates in several steps, each designed to handle different types of suffixes. The process involves:

  1. Step 1a: Deals with plural suffixes like -s or -es.

    • Example: caresses → caress
    • Example: ponies → poni
  2. Step 1b: Removes -ed and -ing suffixes.

    • Example: jumped → jump
    • Example: running → run
  3. Step 1c: Converts a terminal -y to -i when the rest of the word contains a vowel.

    • Example: happy → happi
  4. Step 2: Handles common suffixes like -ational, -izer, -iveness, and replaces them with shorter forms.

    • Example: relational → relate
    • Example: conditional → condition
  5. Step 3: Deals with the removal of suffixes like -ness, -ful, and others that indicate the part of speech.

    • Example: happiness → happi
    • Example: usefulness → use
  6. Step 4: Further reduces words by removing suffixes that indicate the noun or adjective form, such as -ment, -ance, or -ence.

    • Example: adjustment → adjust
    • Example: dependency → depend
  7. Step 5: Removes any remaining -e at the end of words and handles the doubling of consonants.

    • Example: probate → probat
    • Example: controll → control
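
NLTK ships a widely used implementation of this algorithm; assuming the nltk package is installed, the snippet below reproduces several of the examples above (exact outputs can vary slightly across NLTK versions, which add small extensions to Porter's original rules).

# Using the Porter stemmer implementation that ships with NLTK.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "running", "happiness", "probate"]:
    print(word, "->", stemmer.stem(word))
# caresses -> caress, ponies -> poni, running -> run,
# happiness -> happi, probate -> probat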

Characteristics of the Porter Stemmer

  • Rule-Based: The Porter Stemmer applies a series of transformation rules to the word. These rules are designed to be broadly applicable, meaning that the stemmer can be applied to any word without the need for a lexicon.

  • No Lexicon Needed: Unlike FSTs that rely on a predefined lexicon of word stems, the Porter Stemmer works entirely by applying these rules. This makes it highly efficient and easy to implement.

  • Aggressiveness: The Porter Stemmer is known for being quite aggressive in reducing words to their stems, sometimes resulting in non-dictionary forms (e.g., happiness → happi).

Advantages and Limitations

Advantages:

  • Efficiency: Because it does not require a lexicon, the Porter Stemmer is fast and can be applied to large corpora.
  • Simplicity: The algorithm is straightforward to implement and understand, making it popular in many natural language processing applications.

Limitations:

  • Over-stemming: The stemmer may sometimes be too aggressive, producing stems that are not real words (e.g., happiness → happi).
  • Under-stemming: Conversely, related forms may not be conflated (e.g., disagreement is reduced to disagree, while agreement is left unchanged because its stem is too short for the -ment rule to apply).
  • Language-Specific: The Porter Stemmer is designed specifically for English and may not be suitable for other languages without modification.

Applications

The Porter Stemmer is commonly used in various NLP tasks, such as:

  • Information Retrieval: Reducing words to their stems can improve the accuracy of search engines by matching documents that contain different forms of a word.
  • Text Mining: Helps in clustering and categorizing text data by reducing word variants to a common form.
  • Sentiment Analysis: Assists in analyzing sentiments by focusing on the root forms of words, which can reveal patterns in language use.

Human Morphological Processing

In psycholinguistics, understanding how multi-morphemic words (e.g., "walk", "walks", "walked") are represented in the human mind is crucial. We explore whether these words are stored individually in the mental lexicon or whether their components (e.g., "walk" and "-ed") are separately represented and combined during processing.

Theories of Lexical Representation

  1. Full Listing Hypothesis

    • Description: All forms of a word (e.g., "walk", "walks", "walked") are individually listed in the mental lexicon.
    • Implication: Morphological structure is considered an epiphenomenon.
    • Criticism: This view is inadequate for morphologically rich languages (e.g., Turkish) where such an exhaustive listing would be impractical.
  2. Minimum Redundancy Hypothesis

    • Description: Only the basic morphemes (e.g., "walk" and "-s") are stored in the lexicon. Words are formed by combining these morphemes during processing.
    • Criticism: This hypothesis may be too rigid as it does not account for all morphological structures in the lexicon.

Evidence from Speech Errors

  • Slips of the Tongue: Errors in speech, known as slips of the tongue, provide evidence that some morphological structure is represented in the mental lexicon.
    • Examples:
      • "screw looses" for "screws loose"
      • "words of rule formation" for "rules of word formation"
      • "easy enoughly" for "easily enough"
    • Significance: These errors suggest that affixes (e.g., "-s", "-ed") can be separated from their stems, indicating a partial representation of morphological structure.

Experimental Evidence

  1. Repetition Priming

    • Description: Repetition priming involves recognizing a word faster if it has been seen before. This method helps examine whether derived or inflected forms are stored separately from their stems.
    • Findings:
      • Derived forms like "happiness" do not fully prime their stems (e.g., "happy"), suggesting they are stored separately in the lexicon.
      • Regularly inflected forms like "pouring" fully prime their stems, suggesting they do not have representations separate from "pour".
    • Study Reference: Stanners et al. (1979) demonstrated that priming effects differ based on word structure.
  2. Semantic Relatedness

    • Description: Derived words prime their stems if there is a close semantic relationship.
    • Findings:
      • Words like "government" prime "govern".
      • Words like "department" do not prime "depart".
    • Study Reference: Marslen-Wilson et al. (1994) showed that semantic relatedness influences the priming effect.

Morphological Family Size

  • Definition: The morphological family size of a word is the number of related multi-morphemic words and compounds in which it appears.
    • Example: The family size of "fear" includes "fearful", "fearfully", "fearlessness", etc.
  • Findings: Words with larger morphological families are recognized more quickly.
  • Study Reference: Research by Baayen et al. (1997), De Jong et al. (2002), and Moscoso del Prado Martín et al. (2004a) showed that morphological family size affects word recognition speed.

Entropy and Morphological Paradigms

  • Definition: Entropy refers to the amount of information contained in a morphological paradigm.
  • Findings: The total entropy of a word’s morphological paradigm also affects recognition speed.
