Multiple score language processing system
First Claim
1. A language processing system for generating the most likely analysis of the type of an annotated syntax tree of a sentence comprising a word sequence, wherein the word sequence is received from digitally encoded text, and outputting the most likely analysis via computer processing means, wherein said most likely analysis includes the most likely sequence of lexical categories for the words, the most likely syntactic structure of the type of a syntax tree for the sentence, and the most likely semantic attribute for each word, the language processing system comprising:
- means for storing dictionary data records containing possible lexical categories and semantic attributes of words in said computer;
- means for storing grammar rules, indicative of the parent-children node relationship among grammatical constituents, by computer processing means, and assigning an ordered list of numbers (hereinafter, a permutation vector), for each grammar rule indicative of the semantic precedence of each child node relative to the other nodes;
- means for decomposing a syntax tree into a plurality of phrase levels representative of the structure and substructures of said tree, and the context under which a substructure is constructed, by computer processing means;
- annotating means for forming an ordered semantic feature vector for each node of a syntax tree representative of the major semantic features of said each node, and the semantic relationship among the features of the words, by transferring the semantic attributes of the words upward to the tree nodes, according to said permutation vectors, by computer processing means;
- means for deriving data records indicative of the real usage of the words, lexical categories, syntactic structures and semantic feature co-occurrence, in text corpora annotated with lexical categories, syntax trees and semantic attributes, with computer processing means, by using said decomposing means and annotating means;
- means for storing statistical data, derived from said annotated text corpora, indicative of the probability of a word among all words having a common lexical category (hereinafter, lexical category probability), the probability of a lexical category being preceded by at least one neighboring lexical category (hereinafter, lexical context probability), the probability of a phrase level being reduced from a neighboring phrase level, or equivalently, the probability of constructing a nonterminal node under a particular contextual environment defined by neighboring terminal or nonterminal nodes (hereinafter, syntactic score probability), and the probability of a node being annotated with a particular ordered semantic feature vector given the syntactic subtree rooted at said node and at least one adjacent node of said node being annotated (hereinafter, semantic score probability);
- means for receiving a sentence from computer input devices or storage media;
- means, operative on said stored dictionary data, grammar rules and permutation vectors, for determining all possible annotated syntax trees, or equivalently, all possible lexical category sequences for the words, all syntactic structures, of the type of a syntax tree, for said lexical category sequences, and all semantic attribute sequences corresponding to said category sequences, and said syntactic structures, by computer processing means, for said sentence or word sequence;
- means, operative on said stored statistical data by computer processing means, for generating an analysis score, for each possible analysis (or annotated syntax tree), of said sentence or word sequence;
- means for determining the most likely sequence of lexical categories for the words;
- means for determining the most likely syntactic structure for a sentence;
- means for determining the most likely semantic attribute for a plurality of words in the text word; and
- means for outputting an output annotated syntax tree according to said analysis score thus generated.
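
The annotating means in claim 1 transfers semantic attributes of words upward through the tree according to a permutation vector attached to each grammar rule. The claim does not spell out the encoding, so the following Python sketch is only one possible reading: it assumes a permutation vector is a tuple of child indices listed from highest to lowest semantic precedence, and that a parent's ordered feature vector is built by concatenating its children's vectors in that order and keeping the first few entries. The `Node` class, the `annotate` function, the `max_features` limit and the toy features are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    features: list = field(default_factory=list)  # ordered semantic feature vector
    permutation: tuple = ()  # assumed encoding: child indices, highest precedence first

def annotate(node, max_features=3):
    """Fill in node.features bottom-up and return it."""
    if not node.children:
        return node.features  # leaf: features come from the dictionary entry
    child_vectors = [annotate(child, max_features) for child in node.children]
    merged = []
    for idx in node.permutation:  # visit children in precedence order
        for feature in child_vectors[idx]:
            if feature not in merged:
                merged.append(feature)
    node.features = merged[:max_features]
    return node.features

# Toy tree for "dog eats bone"; the features are made up for illustration.
subject = Node("NP", [Node("n", features=["animate"])], permutation=(0,))
obj = Node("NP", [Node("n", features=["food"])], permutation=(0,))
verb = Node("v", features=["ingest"])
vp = Node("VP", [verb, obj], permutation=(0, 1))      # verb outranks object
root = Node("S", [subject, vp], permutation=(1, 0))   # VP outranks subject
annotate(root)
```

With these assumed precedences the verb's feature ends up first in the feature vector of `S`, followed by the object's and then the subject's.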
Abstract
A language processing system includes a mechanism for measuring the syntax trees of sentences of material to be translated and a mechanism for truncating syntax trees in response to the measuring mechanism. In a particular embodiment, a Score Function is provided for disambiguating or truncating ambiguities on the basis of composite scores, generated at different stages of the processing.
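
As an illustration of how a score can rank competing analyses, the Python sketch below covers only the lexical part of the score, using the two lexical quantities defined in claim 1: the lexical category probability, read here as P(word | category), and the lexical context probability, read here as P(category | preceding category). Combining them as a sum of logarithms is an assumption; the text above does not give the formula, and the syntactic and semantic scores are left out. All names and numbers are hypothetical toy values.

```python
import itertools
import math

# Hypothetical dictionary: possible lexical categories for each word.
DICTIONARY = {"time": ["n", "v"], "flies": ["n", "v"]}

# Hypothetical lexical category probabilities, P(word | category).
LEX_CAT_PROB = {
    ("time", "n"): 0.02, ("time", "v"): 0.001,
    ("flies", "n"): 0.005, ("flies", "v"): 0.01,
}

# Hypothetical lexical context probabilities, P(category | preceding category).
# "<s>" marks the start of the sentence.
LEX_CTX_PROB = {
    ("<s>", "n"): 0.6, ("<s>", "v"): 0.4,
    ("n", "n"): 0.3, ("n", "v"): 0.7,
    ("v", "n"): 0.8, ("v", "v"): 0.2,
}

def lexical_log_score(words, categories):
    """Sum of log probabilities for one lexical category sequence."""
    score = 0.0
    prev = "<s>"
    for word, cat in zip(words, categories):
        score += math.log(LEX_CAT_PROB[(word, cat)])
        score += math.log(LEX_CTX_PROB[(prev, cat)])
        prev = cat
    return score

def best_category_sequence(words):
    """Enumerate every category sequence the dictionary allows; keep the top scorer."""
    candidates = itertools.product(*(DICTIONARY[w] for w in words))
    return max(candidates, key=lambda cats: lexical_log_score(words, cats))

best = best_category_sequence(["time", "flies"])
```

Exhaustive enumeration is used only to keep the sketch short; it grows exponentially with sentence length.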
454 Citations
20 Claims
1. A language processing system for generating the most likely analysis, of the type of an annotated syntax tree, of a sentence comprising a word sequence (full text given under First Claim above).
Dependent claims: 2, 3, 4, 5, 6, 7, 8
9. A system for processing digitally encoded language materials for quickly truncating unlikely analyses and outputting at least one most likely analysis, of the type of an annotated syntax tree, by computer processing means comprising:
- means for storing dictionary data records;
- means for storing grammar rules;
- means for assigning a permutation vector for each grammar rule indicative of the semantic precedence of the children nodes;
- means for storing a threshold for each word position indicative of an allowed lower bound of analysis score defined on the word sequence up to said word position;
- means for decomposing a tree into a plurality of phrase levels by computer processing means;
- annotating means for forming an ordered semantic feature vector for each node of a syntax tree and hence annotating a syntax tree into an annotated syntax tree and a phrase level into an annotated phrase level, according to said permutation vectors, by computer processing means;
- means for deriving data records indicative of the real usage of the words, lexical categories, syntactic structures and semantic feature co-occurrence in text corpora, with computer processing means, by using said decomposing means and annotating means;
- means for storing statistical data, derived from text corpora, of the type of lexical category probabilities, lexical context probabilities, syntactic score probabilities, and semantic score probabilities;
- input means for entering the language materials from computer input devices or storage media, including speech recognition means, said language materials including a plurality of words arranged into sentences;
- means for constructing a set of semantically annotated syntactic structures for each of the sentences, by computer processing means, according to the dictionary records, stored grammar rules, and stored permutation vectors for the grammar rules;
- score determination means for applying stored statistical data, word-by-word at each word position, to define an analysis score, and the corresponding lexical, syntactic and semantic scores, for each annotated syntax tree or partially constructed annotated syntax tree defined on the word sequence up to each word position;
- means for interrupting the constructing means, in the computer processing stage when an annotated syntax structure being constructed is of low analysis score in comparison with said threshold for the current word position or the lowest analysis score of previously analyzed complete analyses;
- means for restarting the constructing means to construct another annotated syntax structure; and
- means, operably coupled to the constructing means, for selecting from the set a best annotated syntax structure as output for a sentence.
Dependent claims: 10, 11, 12
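
The interrupt-and-restart behavior of claim 9 can be pictured as a depth-first search that abandons a partial analysis as soon as its running score drops below the threshold stored for the current word position, or can no longer beat a complete analysis already found. The Python sketch below is a simplified illustration under stated assumptions: each word has context-independent candidate probabilities, scores are log probabilities, and only the single best complete analysis is retained, so the "lowest score of previously analyzed complete analyses" is simply the best score so far. `CANDIDATES`, `THRESHOLDS` and `search` are hypothetical names with made-up values.

```python
import math

# Hypothetical per-word candidates: (lexical category, probability of that choice).
CANDIDATES = [
    [("n", 0.6), ("v", 0.4)],
    [("v", 0.7), ("n", 0.3)],
    [("adv", 0.9), ("n", 0.1)],
]

# Hypothetical lower bounds on the running log score at each word position.
THRESHOLDS = [math.log(0.5), math.log(0.2), math.log(0.05)]

def search(candidates, thresholds):
    """Return (best path, best log score, number of interrupted partial analyses)."""
    best = {"score": -math.inf, "path": None}
    stats = {"pruned": 0}

    def extend(pos, score, path):
        if pos == len(candidates):
            # A complete analysis: remember it if it is the best so far.
            if score > best["score"]:
                best["score"], best["path"] = score, tuple(path)
            return
        for cat, prob in candidates[pos]:
            new_score = score + math.log(prob)
            # Interrupt: below the positional threshold, or unable to beat a
            # complete analysis already found (log scores only decrease).
            if new_score < thresholds[pos] or new_score <= best["score"]:
                stats["pruned"] += 1
                continue  # restart with the next alternative
            extend(pos + 1, new_score, path + [cat])

    extend(0, 0.0, [])
    return best["path"], best["score"], stats["pruned"]

path, score, pruned = search(CANDIDATES, THRESHOLDS)
```

In this toy run three partial analyses are cut off early, so only one of the eight possible category sequences is ever completed.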
13. A method for translating digitally encoded language materials of a first language into a second language in text or speech with a computer system having a processor module, a memory module and other storage media, user input devices and output devices, the method comprising the steps of:
(a) deriving from text corpora, a set of lexical category probabilities, a set of lexical context probabilities, a set of syntactic score probabilities, and a set of semantic score probabilities, indicative of the use of words, lexical category sequences, syntactic structures and semantic features;
(b) storing into the memory module, by computer processing means, the dictionary data records containing possible lexical categories and semantic attributes of words, grammar rules concerning legal syntactic structures of the language of the input sentences, and a permutation vector for each grammar rule indicative of the semantic precedence of children nodes, and statistical data, of the type of lexical category probabilities, lexical context probabilities, syntactic score probabilities and semantic score probabilities;
(c) inputting a source text from said input devices or storage media, said source text having a plurality of words arranged into sentences;
(d) constructing a possible analysis for each sentence by:
    (1) determining one possible lexical category sequence, syntactic structure, of the type of a syntax tree, and semantic attributes of the words for said each sentence by computer processing means, in response to the stored dictionary data, grammar rules, and annotating the syntax tree by transferring the semantic attributes upward to the tree nodes according to stored permutation vectors;
    (2) determining an analysis score by applying stored statistical data according to determined lexical category sequence, syntactic structure, semantic attributes of the words and the annotated syntax tree, by computer processing means, for said each sentence; and
    (3) if the determined analysis score is below a preselected value, repeating step (1) with another different combination of lexical, syntactic and semantic information;
(e) repeating step (d) for each sentence of the source text, to construct a plurality of analyses for each sentence;
(f) outputting at least one analysis of said plurality of analyses thus constructed;
(g) selecting from the plurality of analyses a best candidate analysis;
(h) translating the source text into a target text based on the best candidate analysis for the source text; and
(i) optionally supplying the target text to a means for speech synthesis.
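
The control flow of steps (d) through (g) can be sketched in a few lines of Python. This is a minimal illustration, not the claimed method itself: it assumes every candidate combination is examined, that those scoring below the preselected value are discarded, and that the best candidate is the highest-scoring survivor. The function names, the candidate labels and the scores are hypothetical.

```python
def construct_analyses(candidates, score_fn, preselected_value):
    """Steps (d)(1)-(3): score each candidate combination, skipping low scorers."""
    kept = []
    for analysis in candidates:        # (d)(1): one possible combination at a time
        score = score_fn(analysis)     # (d)(2): analysis score
        if score < preselected_value:  # (d)(3): too low, try another combination
            continue
        kept.append((score, analysis))
    return kept

def select_best(kept):
    """Step (g): pick the highest-scoring analysis, or None if none survived."""
    if not kept:
        return None
    return max(kept, key=lambda pair: pair[0])[1]

# Hypothetical candidate analyses for one sentence, with made-up scores.
toy_scores = {"analysis_A": 0.02, "analysis_B": 0.30, "analysis_C": 0.11}
kept = construct_analyses(toy_scores, toy_scores.get, preselected_value=0.10)
best = select_best(kept)
```

Here `analysis_A` falls below the preselected value and is dropped, leaving two analyses from which `analysis_B` is selected for translation.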
14. A robust disambiguation system for selecting a preferred analysis, of the type of an annotated syntax tree, of a word sequence, with discrimination and robustness enhanced statistical data for the system, comprising:
- means for storing dictionary data records;
- means for storing grammar rules and assigning a permutation vector for each grammar rule indicative of the semantic precedence of children nodes;
- means for decomposing a syntax tree into a plurality of phrase levels by computer processing means;
- annotating means for forming an ordered semantic feature vector for each node of a syntax tree according to said permutation vectors, by computer processing means;
- means for deriving data records indicative of the real usage of the words, lexical categories, grammatical syntactic structures and semantic feature co-occurrence, in text corpora, with computer processing means, by using said decomposing means and annotating means;
- means for storing statistical data, of the type of lexical category probabilities, lexical context probabilities, syntactic score probabilities, and semantic score probabilities, derived by analyzing a master text using computer processing means, said master text comprising words and annotated lexical categories, syntactic structures of the type of a syntax tree, and semantic attributes;
- means for modifying said stored statistical data by enhancing the discrimination power and robustness of the stored statistical data for improving the performance of the system;
- means for receiving a word sequence from a digitally encoded input text;
- means for deriving a set of candidate analyses of said word sequence, in response to stored dictionary data, grammar rules and permutation vectors, with computer processing means, each said candidate analysis being a possible analysis of lexical category sequence, syntactic structure, and semantic attribute sequence for said word sequence;
- means for generating an analysis score, by computer processing means, for each analysis in said set using said statistical data;
- means for selecting a preferred analysis from said set of candidate analyses according to the generated analysis score for each candidate analysis in said set by computer processing means; and
- means for outputting said preferred analysis to make said preferred analysis available for further use in a language processing system.
Dependent claims: 15, 16, 17, 18, 19, 20
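
Claim 14 derives its statistical data by analyzing an annotated master text. As a rough illustration of what that derivation could look like for the two lexical quantities, the Python sketch below estimates them by relative frequency from a tiny tagged corpus. Relative-frequency estimation is an assumption, the claim does not name an estimator, and the sketch does not attempt the later step of enhancing discrimination power and robustness. The corpus, the `estimate` function and the `<s>` start marker are hypothetical.

```python
from collections import Counter

# Hypothetical annotated corpus: each sentence is a list of (word, lexical category).
CORPUS = [
    [("the", "det"), ("dog", "n"), ("barks", "v")],
    [("the", "det"), ("cat", "n"), ("sleeps", "v")],
    [("dogs", "n"), ("bark", "v")],
]

def estimate(corpus):
    """Relative-frequency estimates of the two lexical probability tables."""
    word_cat = Counter()    # count of (word, category)
    cat_total = Counter()   # count of each category
    bigram = Counter()      # count of (preceding category, category)
    prev_total = Counter()  # count of each preceding category
    for sentence in corpus:
        prev = "<s>"        # sentence-start marker
        for word, cat in sentence:
            word_cat[(word, cat)] += 1
            cat_total[cat] += 1
            bigram[(prev, cat)] += 1
            prev_total[prev] += 1
            prev = cat
    # Lexical category probability: P(word | category).
    lex_cat = {key: n / cat_total[key[1]] for key, n in word_cat.items()}
    # Lexical context probability: P(category | preceding category).
    lex_ctx = {key: n / prev_total[key[0]] for key, n in bigram.items()}
    return lex_cat, lex_ctx

lex_cat, lex_ctx = estimate(CORPUS)
```

Any word or category pair absent from the corpus gets no entry at all in these tables, which is the kind of sparseness a robustness-enhancing step would have to address.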
Specification