Method and system for information extraction
Abstract
A method and a system for extracting information from a natural language text corpus based on a natural language query are disclosed. In the method, the natural language text corpus is analyzed with respect to surface structure of word tokens and surface syntactic roles of constituents, and the analyzed natural language text corpus is then indexed and stored. Furthermore, a natural language query is analyzed with respect to surface structure of word tokens and surface syntactic roles of constituents. From the analyzed natural language query, one or more surface variants are then created, where these surface variants are equivalent to the natural language query with respect to lexical meaning of word tokens and surface syntactic roles of constituents. The surface variants are then compared with the indexed and stored analyzed natural language text corpus, and each portion of text comprising a string of word tokens that matches any one of the surface variants or the natural language query is extracted from the indexed and stored analyzed natural language text corpus.
26 Claims
1. A method for extracting information from a natural language text corpus based on a natural language query, comprising the steps of:
analyzing said natural language text corpus with respect to surface structure of word tokens and surface syntactic roles of constituents;
indexing and storing the analyzed natural language text corpus;
analyzing a natural language query with respect to surface structure of word tokens and surface syntactic roles of constituents;
creating a number of surface variants of the analyzed natural language query by replacing word tokens of said natural language query, and for at least one surface variant by rearranging word tokens of said natural language query, in such a way that said number of surface variants are equivalent to said natural language query with respect to lexical meaning of word tokens and surface syntactic roles of constituents;
comparing said number of surface variants and said analyzed natural language query with the indexed and stored analyzed natural language text corpus; and
extracting from said indexed and stored analyzed natural language text corpus, each portion of text comprising a string of word tokens that matches any one of said surface variants or said analyzed natural language query.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 19, 20
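The steps of claim 1 can be illustrated with a deliberately small Python sketch. It is not the patented implementation: the synonym table `SYNONYMS`, the lemma table `LEMMAS`, the restriction to three-token subject-verb-object queries, and every function name are assumptions made only for illustration. Token replacement is modeled as a synonym swap on the verb, and rearrangement as a passive-voice reordering that keeps the same subject and object.

```python
import re

# Hypothetical lexical resources; a real system would use a full lexicon.
SYNONYMS = {"buy": ["buy", "purchase", "acquire"]}
LEMMAS = {"bought": "buy", "purchased": "purchase",
          "acquired": "acquire", "was": "be"}


def analyze(text):
    """Toy surface analysis: lower-case, tokenize, and lemmatize."""
    return [LEMMAS.get(t, t) for t in re.findall(r"[a-z0-9]+", text.lower())]


def build_index(sentences):
    """Inverted index: lemma -> set of ids of sentences containing it."""
    index = {}
    for sid, sentence in enumerate(sentences):
        for lemma in analyze(sentence):
            index.setdefault(lemma, set()).add(sid)
    return index


def surface_variants(query):
    """Variants of a query assumed to be exactly subject, verb, object."""
    subj, verb, obj = analyze(query)
    variants = []
    for v in SYNONYMS.get(verb, [verb]):
        variants.append([subj, v, obj])              # token replacement
        variants.append([obj, "be", v, "by", subj])  # rearrangement (passive)
    return variants


def contains(tokens, pattern):
    """True if pattern occurs as a contiguous run inside tokens."""
    n = len(pattern)
    return any(tokens[i:i + n] == pattern
               for i in range(len(tokens) - n + 1))


def extract(sentences, index, query):
    """Return the sentences matching the query or any surface variant."""
    hits = set()
    for variant in surface_variants(query):
        # Candidate sentences contain every lemma of the variant.
        candidates = set.intersection(*(index.get(t, set()) for t in variant))
        for sid in candidates:
            if contains(analyze(sentences[sid]), variant):
                hits.add(sid)
    return [sentences[sid] for sid in sorted(hits)]
```

With a corpus such as "IBM bought Lotus in 1995.", "Lotus was acquired by IBM." and "Lotus bought a small startup.", the query "IBM bought Lotus" would select the first two sentences and reject the third, because the third does not preserve the subject and object roles of the query.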
16. A system for extracting information from a natural language text corpus based on a natural language query, comprising:
a text analysis unit for analyzing a natural language text corpus and a natural language query with respect to surface structure of word tokens and surface syntactic roles of constituents;
storage means operatively connected to said text analysis unit, for storing the analyzed natural language text corpus;
an indexer, operatively connected to said storage means, for indexing the analyzed natural language text corpus;
an index, operatively connected to said indexer, for storing said indexed analyzed natural language text corpus;
a query manager, operatively connected to said text analysis unit, comprising means for creating surface variants of said natural language query by replacing word tokens and rearranging word tokens of said natural language query in such a way that said surface variants are equivalent to said natural language query with respect to lexical meaning of word tokens and surface syntactic roles of constituents, and means for comparing said surface variants and said analyzed natural language query with the indexed analyzed natural language text corpus in said index; and
a result manager operatively connected to said index, for extracting, from said indexed and stored analyzed natural language text corpus, each portion of text comprising a string of word tokens that matches any one of said surface variants or said analyzed natural language query.
Dependent claims: 17, 18
21. A method for extracting information from a natural language text corpus based on a natural language query, comprising the steps of:
analyzing said natural language text corpus with respect to location of phrases, location of word tokens, phrase types, and lexical meaning of word tokens;
indexing and storing the analyzed natural language text corpus;
analyzing a natural language query with respect to phrases, phrase types, word tokens of phrases, and lexical meaning of word tokens;
identifying, for at least one phrase of the analyzed natural language query, phrases of the indexed and stored analyzed natural language text corpus each having the same phrase type as the at least one phrase of the analyzed natural language query, and each comprising a word token being a lexical head and having the same lexical meaning as a word token being a lexical head of the at least one phrase of the analyzed natural language query; and
extracting, from the indexed and stored analyzed natural language text corpus, portions of text comprising the identified phrases.
Dependent claims: 22, 23, 24
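The phrase-based matching of claim 21 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed system: phrases are assumed to arrive already chunked and typed (as if produced by some parser), "same lexical meaning" is approximated by a hypothetical word-to-concept table `CONCEPTS`, and the `Phrase` class and function names are invented for the example. The index is keyed on the pair (phrase type, meaning of the lexical head).

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical sense table: word -> concept identifier.
CONCEPTS = {"car": "CAR", "automobile": "CAR", "truck": "TRUCK"}


def meaning(word):
    """Lexical meaning of a word; unknown words stand for themselves."""
    return CONCEPTS.get(word.lower(), word.lower())


@dataclass(frozen=True)
class Phrase:
    phrase_type: str       # e.g. "NP" or "VP"
    tokens: tuple          # word tokens of the phrase
    head: int              # position of the lexical head within tokens
    sentence_id: int = -1  # location in the corpus; -1 for query phrases

    @property
    def head_meaning(self):
        return meaning(self.tokens[self.head])


def build_phrase_index(phrases):
    """Index corpus phrases by (phrase type, meaning of lexical head)."""
    index = defaultdict(list)
    for p in phrases:
        index[(p.phrase_type, p.head_meaning)].append(p)
    return index


def extract_by_head(sentences, index, query_phrases):
    """Return sentences holding a phrase whose type and head meaning
    agree with at least one phrase of the query."""
    ids = set()
    for q in query_phrases:
        for p in index.get((q.phrase_type, q.head_meaning), []):
            ids.add(p.sentence_id)
    return [sentences[i] for i in sorted(ids)]
```

Under these assumptions, a query noun phrase "fast automobile" (head "automobile") would retrieve sentences containing noun phrases headed by "car" or "automobile", but not those headed by "truck", whatever modifiers the corpus phrases carry.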
25. A method for extracting information from a natural language text corpus based on a natural language query, comprising the steps of:
analyzing said natural language text corpus with respect to location of phrases, location of word tokens, phrase types, and lexical meaning of word tokens;
indexing and storing the analyzed natural language text corpus;
analyzing a natural language query consisting of one phrase with respect to phrase type, word tokens of the phrase, and lexical meaning of the word tokens;
identifying phrases of the indexed and stored analyzed natural language text corpus each having the same phrase type as the phrase of the analyzed natural language query, each comprising a word token being a lexical head and having the same lexical meaning as a word token being a lexical head of the phrase of the analyzed natural language query, and each comprising a word token being a modifier and having the same lexical meaning as a word token being a modifier of the lexical head of the phrase of the analyzed natural language query; and
extracting, from the indexed and stored analyzed natural language text corpus, portions of text comprising the identified phrases.
Dependent claim: 26
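Claim 25 narrows the match by also requiring agreement on a modifier of the lexical head. A minimal sketch of that test is shown below, with several assumptions: the sense table `SENSES`, the `Phrase` tuple, and the function names are hypothetical; the modifier condition is read here as "at least one modifier meaning in common"; and the corpus is scanned linearly rather than through an index, purely to keep the example short.

```python
from typing import NamedTuple

# Hypothetical sense table: word -> concept identifier.
SENSES = {"car": "CAR", "automobile": "CAR",
          "quick": "FAST", "fast": "FAST", "red": "RED"}


def sense(word):
    """Lexical meaning of a word; unknown words stand for themselves."""
    return SENSES.get(word.lower(), word.lower())


class Phrase(NamedTuple):
    phrase_type: str   # e.g. "NP"
    head: str          # lexical head of the phrase
    modifiers: tuple   # modifiers of the lexical head
    text: str          # portion of text the phrase was found in


def matches(query, candidate):
    """Same phrase type, same head meaning, and a shared modifier meaning."""
    if candidate.phrase_type != query.phrase_type:
        return False
    if sense(candidate.head) != sense(query.head):
        return False
    query_mods = {sense(m) for m in query.modifiers}
    candidate_mods = {sense(m) for m in candidate.modifiers}
    return bool(query_mods & candidate_mods)


def extract_by_head_and_modifier(corpus_phrases, query):
    """Return the text portions whose phrase matches the one-phrase query."""
    return [p.text for p in corpus_phrases if matches(query, p)]
```

For the one-phrase query "fast car", this sketch would accept "a quick car" and "a fast old automobile" but reject "the red car", since the head meaning agrees in all three cases and only the modifier differs.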
Specification