Collocational grammar system
Abstract
A system for the grammatical annotation of natural language receives natural language text and annotates each word with a set of tags indicative of its possible grammatical or syntactic uses. An empirical probability of collocation function defined on pairs of tags is iteratively extended to a selected set of tag sequences of increasing length so as to select a most probable tag for each word of a sequence of ambiguously-tagged words. For listed pairs of commonly confused words a substitute calculation reveals erroneous use of the wrong word. For words with tags having abnormally low frequency of occurrence, a stored table of reduced probability factors corrects the calculation. Once the text words have been annotated with their most probable tags, the tagged text is parsed by a parser which successively applies phrasal, predicate and clausal analysis to build higher structures from the disambiguated tag strings. A voice/text translator including such a tag annotator resolves sound or spelling ambiguity of words by their differing tags. A database retrieval system, such as a spelling checker, includes a tag annotator to identify desired data by syntactic features.
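The tag selection step described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the dictionary entries, the tag names, the collocation values, the floor value for unseen pairs, and the function names annotate and best_tag_sequences are all hypothetical, and a simple beam search is assumed here as one way to keep the set of candidate tag sequences bounded while it is extended word by word.

```python
from math import log

# Hypothetical toy dictionary: word -> possible tags (illustrative only).
DICTIONARY = {
    "the": ["DET"],
    "dog": ["NOUN", "VERB"],
    "barks": ["NOUN", "VERB"],
}

# Hypothetical collocation probabilities for adjacent tag pairs.
COLLOCATION = {
    ("START", "DET"): 0.6,
    ("DET", "NOUN"): 0.8,
    ("DET", "VERB"): 0.01,
    ("NOUN", "VERB"): 0.5,
    ("NOUN", "NOUN"): 0.2,
    ("VERB", "NOUN"): 0.3,
    ("VERB", "VERB"): 0.05,
}
FLOOR = 1e-6  # assumed probability for tag pairs missing from the table


def annotate(words):
    """Build word records: each word paired with its possible tags."""
    return [(w, DICTIONARY.get(w.lower(), ["UNK"])) for w in words]


def best_tag_sequences(word_records, beam_width=3):
    """Extend pairwise collocation scores to whole tag sequences.

    Only the beam_width highest-scoring partial sequences survive each
    step, so the candidate set stays bounded (an assumed strategy).
    Returns (log-score, tag sequence) pairs, best first.
    """
    beam = [(0.0, ["START"])]
    for _word, tags in word_records:
        extended = []
        for score, seq in beam:
            for tag in tags:
                p = COLLOCATION.get((seq[-1], tag), FLOOR)
                extended.append((score + log(p), seq + [tag]))
        extended.sort(key=lambda item: item[0], reverse=True)
        beam = extended[:beam_width]
    return [(score, seq[1:]) for score, seq in beam]


results = best_tag_sequences(annotate(["The", "dog", "barks"]))
for score, tags in results:
    print(round(score, 3), tags)
```

With these toy values the highest-scoring sequence is DET NOUN VERB, so "dog" resolves to NOUN and "barks" to VERB. Keeping several sequences in the beam also shows how more than one candidate tag per word could be retained rather than a single tag.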
17 Claims
1. A processor for parsing digitally encoded natural language text, such processor comprising
means for receiving encoded natural language text for processing,
dictionary means for storing words of the natural language together with a list of associated tags indicative of the possible grammatical or syntactic properties of each word,
means for looking up a word of the text in the dictionary and annotating the word with its associated tags from the dictionary to provide a word record,
means operative on word records of words of a sentence for defining a relative probability of occurrence of a tag sequence consisting of one tag selected from the word record of each word of a sequence of words of the sentence,
means for constructing a selected set of tag sequences having a tag selected from the tags associated with each word of the sequence of words and determining a tag sequence of greatest relative probability of occurrence thereby identifying a single most probable tag for each word of the sequence, and
grammatical processing means for identifying grammatical structure from the ordering of the single tag for each said word so as to obtain a parse of the sentence.
7. A processor for processing digitally encoded natural language text, such processor comprising
means for receiving digitally encoded natural language text,
dictionary means for storing base forms of words of the natural language together with data codes indicative of grammatical or syntactic properties of each stored word,
means for looking up each word of the text in the dictionary and for annotating the word with its said data codes to create a word record,
collocational analysis means for performing a defined calculation to construct a function on a bounded set of selected sequences of data codes so as to determine for each word a data code indicative of its probable grammatical usage, said collocational analysis means including
first means operative on word records for assigning a likelihood of co-occurrence of data codes of adjacent words,
second means for iteratively applying said first means to develop a probability-like measure on each of an ordered set of sequences of data codes wherein each successive data code of a sequence is selected from the word record of a successive word of the text, and
means for determining a plurality of sequences of data codes of greatest probability, thereby associating with each word of the text a plurality of most probable data codes.
11. A grammatical processor for processing digitally encoded natural language text so as to parse sentences of the text, such processor comprising
a dictionary of words of the language including for each word indications of its possible grammatical tags,
tag annotation means for looking up words of a text sentence in the dictionary and annotating each word with its possible tags,
tag selection means, including means for applying a collocation probability matrix to syntactically adjacent pairs of tags, and means for iteratively building up a probability-like measure on sequences formed of the possible tags of a sequence of words to determine at least one sequence of greatest probability thereby determining, when a word is annotated with more than one tag, a most probable tag of the word as a function of possible tags of surrounding words of the text sentence, and
means for processing a string consisting of the most probable tags of the words of the sentence to identify the grammatical function of each word and determine a parse of the sentence.
12. A computerized system for the grammatical annotation of natural language, such system comprising
means for receiving encoded natural language text for processing,
means for annotating a word of the text with a tag set of tags indicative of possible grammatical or syntactic uses of the word,
selection means operative on the tag sets of the words of a sequence of words for determining the most probable tag of each word of the sequence, such selection means including
(i) means operative on pairs of tags, one from the tag set of each of two adjacent words, for defining an empirical collocational likelihood, and
(ii) means for extending said empirical likelihood to a function defined on a bounded subset of possible tag sequences constructed from the tag sets of words of a sequence of words of a sentence, the value of said function on a sequence of tags corresponding to the likelihood of occurrence of a sequence of words having said sequence of tags, whereby determination of the tag sequence of greatest value determines the most likely tag of each word of the sequence.
13. An improved annotator for annotating natural language words with tags indicative of possible grammatical or syntactic uses of the words, such annotator comprising
means for receiving an encoded natural language sentence for processing,
means for assigning to a word of the sentence a set of tags indicative of the possible grammatical or syntactic uses of said word, and
tag disambiguation means for identifying a single tag of a multiply-tagged word, such tag disambiguation means including
means for constructing a selected bounded subset of tag sequences representative of possible tags associated with a sequence of respective words of the sentence, the sequence of words including said multiply-tagged word,
means for defining a function value on each tag sequence of said subset of tag sequences, and
means for selecting a specific tag sequence having the greatest function value defined thereon, whereby a single tag is identified from the multiply-tagged word by a single tag of said specific tag sequence thereby associated with said word.
Specification