Method and system for generating lexicon of cooccurrence relations in natural language
First Claim
1. A method, using a computer including a processor and a memory, of generating cooccurrence relation information indicating whether a sequence of words in a given sentence described in a natural language is semantically correct or not, said method comprising the steps of:
(a) defining categories of sentences on the basis of the types of documents in which the sentences appear;
(b) defining fields of sentences on the basis of the subject matters of the sentences;
(c) preparing a text corpus by collecting input textual sentences belonging to the same category or the same field as the given sentence;
(d) preparing a cooccurrence relation table containing grammar or a set of grammatical rules for analyzing the textual sentences of the text corpus to permit determining a cooccurrence relation between words in the textual sentences;
(e) determining a hypothesized cooccurrence relation between words in the sequence of words in the given sentence on the basis of a cooccurrence relation from said cooccurrence relation table, the hypothesized cooccurrence relation indicating a particular possible cooccurrence relation between words in the given sentence;
(f) deriving an actual cooccurrence relation between words in the sequence of words in the given sentence from the determined hypothesized cooccurrence relation;
(g) determining whether the actual cooccurrence relation exceeds a predetermined threshold condition for a valid cooccurrence relation; and
(h) when the actual cooccurrence relation exceeds the predetermined threshold condition, outputting information indicating the actual cooccurrence relation as a valid cooccurrence relation.
Abstract
A method and an apparatus for generating and maintaining, automatically or interactively, a lexicon that stores cooccurrence relation information used to determine whether a sequence of words in a given natural-language sentence is semantically correct, with the aid of a memory, a data processor and a textual sentence file. A hypothesized cooccurrence relation table, storing hypothesized cooccurrence relations each having a high probability of being a valid cooccurrence relation, is prepared by consulting the file. A hypothesis for a cooccurrence relation is first established, by consulting the hypothesized cooccurrence relation table, on the basis of a cooccurrence relation pattern indicating a probably acceptable conjunction. Subsequently, a corresponding actual cooccurrence relation is derived from the textual sentence file by parsing the relevant textual sentences and is tested against predetermined threshold conditions to determine whether the cooccurrence relation is valid. On the basis of the test results, the cooccurrence relation information is modified accordingly. The method and apparatus can be used in a natural language parsing system and a machine translation system.
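As an illustration only, and not part of the patent text, the hypothesize-and-test flow summarized in the abstract can be sketched in Python. The sketch assumes that sentences have already been parsed into (head, role, dependent) relations and that the threshold condition is a simple occurrence count; the source specifies neither. The names Relation, Sentence, select_corpus, validate and min_count are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """A cooccurrence relation between two words (hypothetical structure)."""
    head: str       # governing word, e.g. a verb
    role: str       # grammatical role linking the words, e.g. "object"
    dependent: str  # governed word, e.g. a noun


@dataclass
class Sentence:
    category: str          # type of document the sentence appears in
    subject: str           # subject-matter field of the sentence
    relations: tuple = ()  # relations found by a parser (assumed to be given)


def select_corpus(sentences, category, subject):
    """Keep sentences in the same category or the same field as the given sentence."""
    return [s for s in sentences if s.category == category or s.subject == subject]


def validate(hypotheses, corpus, min_count=2):
    """Return hypothesized relations whose corpus count meets the assumed threshold."""
    counts = Counter(r for s in corpus for r in s.relations)
    return [h for h in hypotheses if counts[h] >= min_count]


# Usage: two hypotheses, only one of which the selected corpus supports.
press_key = Relation("press", "object", "key")
press_flower = Relation("press", "object", "flower")
sentences = [
    Sentence("manual", "computing", (press_key,)),
    Sentence("manual", "computing", (press_key,)),
    Sentence("novel", "gardening", (press_flower,)),
]
corpus = select_corpus(sentences, "manual", "computing")
valid = validate([press_key, press_flower], corpus)
print(valid)  # only press_key reaches the assumed threshold
```

A count threshold is the simplest stand-in for the "predetermined threshold conditions"; a relative frequency or a statistical association measure could be substituted without changing the overall flow.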
163 Citations
13 Claims
1. A method, using a computer including a processor and a memory, of generating cooccurrence relation information indicating whether a sequence of words in a given sentence described in a natural language is semantically correct or not, said method comprising the steps of:
(a) defining categories of sentences on the basis of the types of documents in which the sentences appear;
(b) defining fields of sentences on the basis of the subject matters of the sentences;
(c) preparing a text corpus by collecting input textual sentences belonging to the same category or the same field as the given sentence;
(d) preparing a cooccurrence relation table containing grammar or a set of grammatical rules for analyzing the textual sentences of the text corpus to permit determining a cooccurrence relation between words in the textual sentences;
(e) determining a hypothesized cooccurrence relation between words in the sequence of words in the given sentence on the basis of a cooccurrence relation from said cooccurrence relation table, the hypothesized cooccurrence relation indicating a particular possible cooccurrence relation between words in the given sentence;
(f) deriving an actual cooccurrence relation between words in the sequence of words in the given sentence from the determined hypothesized cooccurrence relation;
(g) determining whether the actual cooccurrence relation exceeds a predetermined threshold condition for a valid cooccurrence relation; and
(h) when the actual cooccurrence relation exceeds the predetermined threshold condition, outputting information indicating the actual cooccurrence relation as a valid cooccurrence relation.
Dependent claims: 2, 3, 4, 5
6. A method, using a computer including a processor and a memory, of automatically generating and maintaining a cooccurrence relation lexicon storing cooccurrence relation information indicating whether a sequence of words in a given sentence described in a natural language is semantically correct or not, said method comprising the steps of:
(a) storing in said memory a processing program for generating or maintaining said cooccurrence relation lexicon and a table containing hypothesized cooccurrence relations of high probability;
(b) defining categories of sentences on the basis of the types of documents in which the sentences appear;
(c) defining fields of sentences on the basis of the subject matters of the sentences;
(d) preparing a text corpus file by collecting input textual sentences belonging to the same category or the same field as the given sentence;
(e) determining a hypothesized cooccurrence relation between words in the sequence of words in the given sentence on the basis of a cooccurrence relation from said hypothesized cooccurrence relation table, the hypothesized cooccurrence relation indicating a particular possible cooccurrence relation between words in the given sentence;
(f) deriving from said text corpus file actual textual sentences relevant to terms contained in the most recently determined hypothesized cooccurrence relation, analyzing the derived actual textual sentences, and storing the result of the analysis in said memory;
(g) determining whether the result of the analysis indicates that information having the most recently determined hypothesized cooccurrence relation meets predetermined threshold conditions;
(h) when the result of the analysis indicates that the information having the most recently determined hypothesized cooccurrence relation meets the predetermined threshold conditions, including the most recently determined hypothesized cooccurrence relation in said lexicon unless data of cooccurrence relations corresponding to a super-concept or a subconcept of the most recently determined hypothesized cooccurrence relation are present in said lexicon, and examining the probability of determining another hypothesized cooccurrence relation;
(i) when the result of the analysis indicates that the information having the most recently determined hypothesized cooccurrence relation does not meet the predetermined threshold conditions, examining the probability of determining a further hypothesized cooccurrence relation;
(j) when the result of the most recent analysis indicates that the possible further hypothesized cooccurrence relation does not meet the predetermined threshold conditions, examining the probability of determining a still further hypothesized cooccurrence relation; and
(k) when a probability of establishing a further hypothesized cooccurrence relation is found in step (h), (i), or (j), re-executing the method commencing with step (e).
Dependent claims: 7, 8, 13
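For illustration only, the maintenance loop of steps (e) through (k) might be sketched in Python as follows. The sketch rests on several assumptions the claim does not state: a relation is reduced to a (head, dependent) pair, the corpus analysis of step (f) is replaced by a precomputed count table, the threshold is a minimum count, and super-concepts and subconcepts are looked up in a hypothetical parent_of mapping applied to the dependent word only. All function and parameter names are hypothetical.

```python
def is_super_concept(a, b, parent_of):
    """True if concept a is a strict super-concept of b (hierarchy assumed acyclic)."""
    while b in parent_of:
        b = parent_of[b]
        if b == a:
            return True
    return False


def covered(rel, lexicon, parent_of):
    """True if the lexicon already holds a super- or sub-concept version of rel."""
    head, dep = rel
    return any(
        h == head
        and (is_super_concept(d, dep, parent_of) or is_super_concept(dep, d, parent_of))
        for h, d in lexicon
    )


def maintain_lexicon(hypotheses, corpus_counts, parent_of, min_count=2):
    """Test each hypothesized relation and add the valid, uncovered ones."""
    lexicon = []
    pending = list(hypotheses)
    while pending:                         # step (k): repeat while hypotheses remain
        rel = pending.pop(0)               # step (e): next hypothesized relation
        count = corpus_counts.get(rel, 0)  # stand-in for the corpus analysis of step (f)
        # steps (g) and (h): threshold test, then the super-/sub-concept check
        if count >= min_count and not covered(rel, lexicon, parent_of):
            lexicon.append(rel)
    return lexicon


# Usage with a toy concept hierarchy: "key" is a kind of "button".
parent_of = {"key": "button", "button": "control"}
counts = {
    ("press", "button"): 3,
    ("press", "key"): 5,
    ("press", "flower"): 1,
    ("open", "file"): 2,
}
lexicon = maintain_lexicon(list(counts), counts, parent_of)
print(lexicon)
```

In this toy run, ("press", "key") passes the threshold but is skipped because the more general ("press", "button") is already in the lexicon, and ("press", "flower") is rejected by the threshold test.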
9. A system for generating cooccurrence relation information indicating whether a sequence of words in a given sentence described in a natural language is semantically correct or not, wherein the given sentence is defined as within a particular one of a plurality of sentence categories on the basis of the type of document in which the given sentence appears and is defined as within a particular one of a plurality of sentence fields on the basis of the subject matter of the given sentence, said system comprising:
a text corpus file including textual sentences belonging to the same category or the same field as the given sentence;
a cooccurrence relation table containing grammar or a set of grammatical rules for analyzing the textual sentences of said text corpus file to permit determining a cooccurrence relation between words in the textual sentences;
a memory including an area for storing a hypothesized cooccurrence relation table listing hypothesized cooccurrence relations having a high probability of valid cooccurrence relations and an area for storing a processing program for executing algorithms for automatically generating and maintaining a cooccurrence relation lexicon;
means for determining hypothesized cooccurrence relations between words of the sequence of words in the given sentence on the basis of cooccurrence relation patterns, indicative of high probability of a particular cooccurrence relation extracted from said hypothesized cooccurrence relation table in accordance with a processing program stored in said memory; and
testing means for responding to hypothesized cooccurrence relations determined by said determining means to derive textual sentences having relevant actual cooccurrence relation patterns from said text corpus file and for analyzing each of the derived textual sentences with the aid of sentence analysis or generation rules and a sentence analysis or generation lexicon, said testing means including means for examining whether the result of the analysis indicates that the derived textual sentences meet predetermined threshold conditions for a valid cooccurrence relation and means for outputting information indicating the valid cooccurrence relation.
Dependent claims: 10
11. A method, using a computer including a processor and a memory, of generating cooccurrence relation information indicating whether a sequence of words in a given sentence described in a natural language is semantically correct or not, said method comprising the steps of:
(a) defining categories of sentences on the basis of the types of documents in which the sentences appear;
(b) defining fields of sentences on the basis of the subject matters of the sentences;
(c) preparing a text corpus by collecting input textual sentences belonging to the same category or the same field as the given sentence;
(d) determining a hypothesized cooccurrence relation between words in the sequence of words in the given sentence on the basis of a cooccurrence relation pattern set up by an operator and indicating a particular possible cooccurrence relation between words in the given sentence;
(e) deriving an actual cooccurrence relation between words in the sequence of words in the given sentence from said text corpus for the determined hypothesized cooccurrence relation;
(f) determining whether the actual cooccurrence relation exceeds a predetermined threshold condition for a valid cooccurrence relation; and
(g) when the actual cooccurrence relation exceeds the predetermined threshold condition, outputting information indicating the actual cooccurrence relation as a valid cooccurrence relation.
Dependent claims: 12
Specification