Collocational grammar system

US 4,868,750 A
Filed: 10/07/1987
Issued: 09/19/1989
Est. Priority Date: 10/07/1987
Status: Expired due to Term

Abstract

A system for the grammatical annotation of natural language receives natural language text and annotates each word with a set of tags indicative of its possible grammatical or syntactic uses. An empirical probability of collocation function defined on pairs of tags is iteratively extended to a selected set of tag sequences of increasing length so as to select a most probable tag for each word of a sequence of ambiguously-tagged words. For listed pairs of commonly confused words a substitute calculation reveals erroneous use of the wrong word. For words with tags having abnormally low frequency of occurrence, a stored table of reduced probability factors corrects the calculation. Once the text words have been annotated with their most probable tags, the tagged text is parsed by a parser which successively applies phrasal, predicate and clausal analysis to build higher structures from the disambiguated tag strings. A voice/text translator including such a tag annotator resolves sound or spelling ambiguity of words by their differing tags. A database retrieval system, such as a spelling checker, includes a tag annotator to identify desired data by syntactic features.
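
The core idea in the abstract, a pairwise tag collocation score extended step by step over a bounded set of tag sequences, can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: the dictionary, the tag names, the scores, the `DEFAULT_SCORE` floor, and the `beam_width` parameter are all invented for the example.

```python
# Hypothetical toy data: tag names and scores are invented for illustration.
DICTIONARY = {
    "the": ["DET"],
    "old": ["ADJ", "NOUN"],
    "man": ["NOUN", "VERB"],
    "boats": ["NOUN", "VERB"],
}

COLLOCATION = {
    ("DET", "ADJ"): 0.5, ("DET", "NOUN"): 0.4,
    ("ADJ", "NOUN"): 0.6, ("ADJ", "VERB"): 0.05,
    ("NOUN", "VERB"): 0.3, ("NOUN", "NOUN"): 0.1,
    ("VERB", "NOUN"): 0.3, ("VERB", "VERB"): 0.02,
}
DEFAULT_SCORE = 0.001  # assumed floor for tag pairs missing from the table


def pair_score(left, right):
    """Collocation score for two adjacent tags."""
    return COLLOCATION.get((left, right), DEFAULT_SCORE)


def best_tag_sequences(words, beam_width=3):
    """Extend tag sequences one word at a time, keeping only the top few."""
    beam = [((tag,), 1.0) for tag in DICTIONARY[words[0]]]
    for word in words[1:]:
        extended = [
            (tags + (tag,), score * pair_score(tags[-1], tag))
            for tags, score in beam
            for tag in DICTIONARY[word]
        ]
        extended.sort(key=lambda item: item[1], reverse=True)
        beam = extended[:beam_width]  # bounded set of surviving sequences
    return beam


def most_probable_tags(words, beam_width=3):
    """Pair each word with its tag from the highest-scoring sequence."""
    tags, _ = best_tag_sequences(words, beam_width)[0]
    return list(zip(words, tags))
```

With this toy data, `most_probable_tags(["the", "old", "man", "boats"])` selects DET, ADJ, NOUN, VERB. Truncating to a fixed number of sequences at each step keeps the work bounded, at the cost of possibly discarding a sequence that would have scored best overall.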

17 Claims

1. A processor for parsing digitally encoded natural language text, such processor comprising
means for receiving encoded natural language text for processing,
dictionary means for storing words of the natural language together with a list of associated tags indicative of the possible grammatical or syntactic properties of each word,
means for looking up a word of the text in the dictionary and annotating the word with its associated tags from the dictionary to provide a word record,
means operative on word records of words of a sentence for defining a relative probability of occurrence of a tag sequence consisting of one tag selected from the word record of each word of a sequence of words of the sentence,
means for constructing a selected set of tag sequences having a tag selected from the tags associated with each word of the sequence of words and determining a tag sequence of greatest relative probability of occurrence thereby identifying a single most probable tag for each word of the sequence, and
grammatical processing means for identifying grammatical structure from the ordering of the single tag for each said word so as to obtain a parse of the sentence.
  - 2. A processor according to claim 1, wherein the means for defining a relative probability of tag sequences includes means for selecting fewer than a fixed number n of sequences from said selected set and for defining said relative probability thereon.
  - 3. A processor according to claim 1, wherein the means for determining a tag sequence of greatest relative probability of occurrence further includes means for determining, in order, tag sequences having successively lesser relative probabilities of occurrence, thereby identifying a succession of next most probable tags for each word of the sequence, and wherein the means for further processing includes means for processing a said next most probable tag of a word in the event the most probable tag does not produce a parse of the sentence.
  - 4. A processor according to claim 1, wherein the means for defining a relative probability of occurrence of a tag sequence corresponding to a sequence of words includes means for modifying said relative probability in accordance with an observed reduced frequency of occurrence of a tag of said tag sequence corresponding to a particular word of the sequence of words.
  - 5. A processor according to claim 1, wherein the means for defining a relative probability of occurrence includes means for recognizing a word of the sequence of words which is commonly confused with a different word, and means for substituting a tag of such different word in the tag sequence, such that the means for selecting a tag sequence of greatest relative probability determines if the tag of said different word has a greater relative probability of occurrence.
  - 6. A processor according to claim 5, wherein the means for selecting a tag sequence of greatest relative probability of occurrence further includes means for identifying in order tag sequences having successively lesser relative probabilities of occurrence, thereby identifying a succession of next most probable tags for each word of the sequence, and wherein the means for further processing includes means for processing a said next most probable tag of a word in the event the most probable tag does not fit a correct parse of the sentence.
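
Claims 3 and 4 can be illustrated together with a small sketch: candidate tag sequences are ranked by score, rare word/tag readings are scaled down by a stored factor, and the next most probable sequence is tried when the best one does not parse. Everything named here is an assumption made for the example (the toy tables, the `RARE_TAG_FACTOR` values, and the caller-supplied `parses` check), and the exhaustive enumeration stands in for the bounded selection the claims describe.

```python
from itertools import product

# Hypothetical toy data, invented for illustration only.
TAG_SETS = {"time": ["NOUN", "VERB"], "flies": ["VERB", "NOUN"]}
COLLOCATION = {
    ("NOUN", "VERB"): 0.4, ("NOUN", "NOUN"): 0.2,
    ("VERB", "NOUN"): 0.3, ("VERB", "VERB"): 0.05,
}
# Assumed reduction factors for (word, tag) readings that are rare in practice.
RARE_TAG_FACTOR = {("time", "VERB"): 0.1}


def score(words, tags):
    """Product of adjacent-tag scores, scaled down for rare word/tag readings."""
    total = 1.0
    for left, right in zip(tags, tags[1:]):
        total *= COLLOCATION.get((left, right), 0.001)
    for word, tag in zip(words, tags):
        total *= RARE_TAG_FACTOR.get((word, tag), 1.0)
    return total


def ranked_sequences(words):
    """All candidate tag sequences, most probable first."""
    candidates = product(*(TAG_SETS[w] for w in words))
    return sorted(candidates, key=lambda tags: score(words, tags), reverse=True)


def first_parsable(words, parses):
    """Return the highest-ranked tag sequence accepted by the `parses` check."""
    for tags in ranked_sequences(words):
        if parses(tags):
            return tags
    return None
```

In this toy table the reduction factor changes the ranking: without it, the VERB NOUN reading of "time flies" would outrank NOUN NOUN; with it, VERB NOUN drops to third place.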

7. A processor for processing digitally encoded natural language text, such processor comprising
means for receiving digitally encoded natural language text,
dictionary means for storing base forms of words of the natural language together with data codes indicative of grammatical or syntactic properties of each stored word,
means for looking up each word of the text in the dictionary and for annotating the word with its said data codes to create a word record,
collocational analysis means for performing a defined calculation to construct a function on a bounded set of selected sequences of data codes so as to determine for each word a data code indicative of its probable grammatical usage,
said collocational analysis means including
first means operative on word records for assigning a likelihood of co-occurrence of data codes of adjacent words,
second means for iteratively applying said first means to develop a probability-like measure on each of an ordered set of sequences of data codes wherein each successive data code of a sequence is selected from the word record of a successive word of the text, and
means for determining a plurality of sequences of data codes of greatest probability thereby, associating with each word of the text a plurality of most probable data codes.
  - 8. A processor according to claim 7, wherein the collocational analysis means further comprises means for identifying commonly confused word pairs, and also includes means for constructing, when a given word of text is one word of a pair of commonly confused words, said function on sequences of data codes from each said word so as to determine whether a data code of the other word of the pair is more probable than the data codes of the given word.
  - 9. A processor according to claim 8, wherein the collocational analysis means further comprises means for identifying words which occur in particular grammatical uses with reduced frequency, and means for providing in the defined calculation a weight reduction factor corresponding to a said reduced frequency for modifying the determination of a said most probable data code for an identified word of the sentence.
  - 10. A processor according to claim 9, further comprising error display means for displaying an error message when the processor determines said data code of said other word is more probable than the data codes of the given word.
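
The confused-word check of claims 8 and 10 might look something like the sketch below: when a word belongs to a listed pair, the same context score is computed with the other word's tags, and a message is produced if the other word fits better. The word pair, tags, scores, and message wording are all hypothetical, and the sketch simplifies by giving each neighbouring word its first listed tag rather than running the full sequence calculation.

```python
# Hypothetical toy data, invented for illustration only.
TAG_SETS = {
    "their": ["POSS"], "there": ["ADV"],
    "is": ["VERB"], "dog": ["NOUN"], "barks": ["VERB"],
}
CONFUSABLE = {"their": "there", "there": "their"}
COLLOCATION = {
    ("POSS", "NOUN"): 0.6, ("ADV", "VERB"): 0.4,
    ("NOUN", "VERB"): 0.4,
}
DEFAULT_SCORE = 0.001  # assumed floor for tag pairs missing from the table


def pair(left, right):
    """Score for an adjacent tag pair; sentence boundaries (None) are neutral."""
    if left is None or right is None:
        return 1.0
    return COLLOCATION.get((left, right), DEFAULT_SCORE)


def best_score(word, prev_tag, next_tag):
    """Best context score over all tags the word can take."""
    return max(pair(prev_tag, t) * pair(t, next_tag) for t in TAG_SETS[word])


def check_confusables(words):
    """Return a message for each word whose confusable partner fits better."""
    messages = []
    for i, word in enumerate(words):
        other = CONFUSABLE.get(word)
        if other is None:
            continue
        # Simplifying assumption: neighbouring words use their first listed tag.
        prev_tag = TAG_SETS[words[i - 1]][0] if i > 0 else None
        next_tag = TAG_SETS[words[i + 1]][0] if i + 1 < len(words) else None
        if best_score(other, prev_tag, next_tag) > best_score(word, prev_tag, next_tag):
            messages.append(f"'{word}' may be a mistake for '{other}'")
    return messages
```

For example, `check_confusables(["there", "dog", "barks"])` flags "there", while `check_confusables(["their", "dog", "barks"])` returns an empty list.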

11. A grammatical processor for processing digitally encoded natural language text so as to parse sentences of the text, such processor comprising
a dictionary of words of the language including for each word indications of its possible grammatical tags,
tag annotation means for looking up words of a text sentence in the dictionary and annotating each word with its possible tags,
tag selection means, including means for applying a collocation probability matrix to syntactically adjacent pairs of tags, and
means for iteratively building up a probability-like measure on sequences formed of the possible tags of a sequence of words to determine at least one sequence of greatest probability thereby determining, when a word is annotated with more than one tag, a most probable tag of the word as a function of possible tags of surrounding words of the text sentence, and
means for processing a string consisting of the most probable tags of the words of the sentence to identify the grammatical function of each word and determine a parse of the sentence.

12. A computerized system for the grammatical annotation of natural language, such system comprising
means for receiving encoded natural language text for processing,
means for annotating a word of the text with a tag set of tags indicative of possible grammatical or syntactic uses of the word,
selection means operative on the tag sets of the words of a sequence of words for determining the most probable tag of each word of the sequence, such selection means including
(i) means operative on pairs of tags, one from the tag set of each of two adjacent words, for defining an empirical collocational likelihood, and
(ii) means for extending said empirical likelihood to a function defined on a bounded subset of possible tag sequences constructed from the tag sets of words of a sequence of words of a sentence, the value of said function on a sequence of tags corresponding to the likelihood of occurrence of a sequence of words having said sequence of tags, whereby determination of the tag sequence of greatest value determines the most likely tag of each word of the sequence.

13. An improved annotator for annotating natural language words with tags indicative of possible grammatical or syntactic uses of the words, such annotator comprising
means for receiving an encoded natural language sentence for processing,
means for assigning to a word of the sentence a set of tags indicative of the possible grammatical or syntactic uses of said word, and
tag disambiguation means for identifying a single tag of a multiply-tagged word, such tag disambiguation means including
means for constructing a selected bounded subset of tag sequences representative of possible tags associated with a sequence of respective words of the sentence, the sequence of words including said multiply-tagged word,
means for defining a function value on each tag sequence of said subset of tag sequences, and
means for selecting a specific tag sequence having the greatest function value defined thereon, whereby a single tag is identified from the multiply-tagged word by a single tag of said specific tag sequence thereby associated with said word.
  - 14. An annotator according to claim 13, further comprising a spelling verifier having means for detecting misspelled words and for identifying candidate replacement words, wherein the means for identifying candidate replacement words includes selection means for selecting candidate replacement words having a tag with a syntactic context compatible with that of the misspelled word.
  - 15. An annotator according to claim 13, further comprising transformation means for transforming a natural language between sound and text representations, wherein the transformation means includes means for resolving ambivalent representations of words by selection of the word whose tag is consistent with the syntactic context of the ambivalent word.
  - 16. An annotator according to claim 15, wherein the transformation means is a text-to-sound transformation system and the ambivalent representations are homographs.
  - 17. An annotator according to claim 15, wherein the transformation means is a sound-to-text transformation system and the ambivalent representations are homonyms.
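
As a rough illustration of claim 14, candidate replacements for a misspelled word can be filtered by how well their tags fit the tags of the neighbouring words. The candidate list, the tag tables, the scores, and the `threshold` parameter below are assumptions made for the example. How candidates are generated and how the context tags are obtained are outside this sketch.

```python
# Hypothetical toy data, invented for illustration only.
TAG_SETS = {"quiet": ["ADJ"], "quite": ["ADV"], "quit": ["VERB"]}
COLLOCATION = {
    ("DET", "ADJ"): 0.5, ("ADJ", "NOUN"): 0.6,
    ("DET", "ADV"): 0.05, ("ADV", "NOUN"): 0.01,
    ("DET", "VERB"): 0.01, ("VERB", "NOUN"): 0.3,
}
DEFAULT_SCORE = 0.001  # assumed floor for tag pairs missing from the table


def fit(candidate, prev_tag, next_tag):
    """Best context score over the candidate's possible tags."""
    return max(
        COLLOCATION.get((prev_tag, t), DEFAULT_SCORE)
        * COLLOCATION.get((t, next_tag), DEFAULT_SCORE)
        for t in TAG_SETS[candidate]
    )


def rank_replacements(candidates, prev_tag, next_tag, threshold=0.01):
    """Keep candidates whose tags fit the context, best fit first."""
    scored = [(fit(c, prev_tag, next_tag), c) for c in candidates]
    kept = [(s, c) for s, c in scored if s >= threshold]
    return [c for s, c in sorted(kept, reverse=True)]
```

For a misspelling between a determiner and a noun, as in "a quiit room", `rank_replacements(["quite", "quit", "quiet"], "DET", "NOUN")` keeps only "quiet", since the adverb and verb readings score poorly in that position.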

Current Assignee: Vantage Technology Holdings LLC
Original Assignee: Houghton Mifflin Harcourt Publishing Company (Houghton Mifflin Harcourt Co. (Private Equity))
Inventors: Hopkins, Jeffrey G.; Carus, Alwin B.; Kucera, Henry
Primary Examiner(s): Fleming, Michael R.
Application Number: US07/106,127
Time in Patent Office: 713 Days
Field of Search: 364/419, 364/900 MS File, 364/200 MS File
US Class Current: 704/8
CPC Class Codes:
G06F 40/253 Grammatical analysis; Style...
G06F 40/289 Phrasal analysis, e.g. fini...
