Method and system for information extraction
Abstract
A method and a system for extracting information from a natural language text corpus based on a natural language query are disclosed. In the method, the natural language text corpus is analyzed with respect to surface structure of word tokens and surface syntactic roles of constituents, and the analyzed natural language text corpus is then indexed and stored. Furthermore, a natural language query is analyzed with respect to surface structure of word tokens and surface syntactic roles of constituents. From the analyzed natural language query, one or more surface variants are then created, where these surface variants are equivalent to the natural language query with respect to lexical meaning of word tokens and surface syntactic roles of constituents. The surface variants are then compared with the indexed and stored analyzed natural language text corpus, and each portion of text comprising a string of word tokens that matches any one of the surface variants or the natural language query is extracted from the indexed and stored analyzed natural language text corpus.
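The query side of the abstract (create surface variants, compare them with the stored text, extract matching portions) can be illustrated with a deliberately simplified sketch. It is not the patented method: the `SYNONYMS` table, the function names, and the regex tokenizer are all assumptions made for illustration, and the analysis of surface syntactic roles is left out entirely. Variants are produced only by swapping in listed synonyms, and a sentence counts as a match when it contains a variant as a contiguous token string.

```python
import re
from itertools import product

# Hypothetical synonym table standing in for real lexical analysis.
SYNONYMS = {"buys": ["buys", "purchases", "acquires"]}


def tokenize(text):
    """Split text into lowercase word tokens (an assumed, naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def surface_variants(query):
    """Return every token tuple obtained by substituting listed synonyms."""
    options = [SYNONYMS.get(tok, [tok]) for tok in tokenize(query)]
    return {tuple(choice) for choice in product(*options)}


def extract_matches(sentences, query):
    """Return the sentences containing any variant as a contiguous token run."""
    variants = surface_variants(query)
    hits = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        for variant in variants:
            n = len(variant)
            if any(tuple(tokens[i:i + n]) == variant
                   for i in range(len(tokens) - n + 1)):
                hits.append(sentence)
                break
    return hits


corpus = ["Acme acquires Widget Corp.", "Acme sells widgets."]
print(extract_matches(corpus, "Acme buys Widget"))
```

The original query is itself one of the generated variants, so a literal match is also extracted, in line with the abstract's "any one of the surface variants or the natural language query".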
8 Claims
1. A method of storing a natural language text corpus in a database, comprising the steps of:
identifying word tokens of said natural language text corpus;
determining locations in the natural language text of the identified word tokens;
determining word types associated with the identified word tokens;
storing the determined word types in said database, wherein the number of stored word types is less than the number of identified word tokens;
storing word token location identifiers identifying the determined locations in the natural language text corpus of the identified word tokens; and
linking the stored word token location identifiers to the stored word types, such that, for a given identified word token, the stored word token location identifier identifying the location of the identified word token is logically linked to the stored word type associated with the identified word token.
(Dependent claims: 2, 3, 4, 5)
6. A system for storing a natural language text corpus, comprising:
a text analysis unit for identifying word tokens of said natural language text corpus, determining locations in the natural language text of the identified word tokens, and determining word types associated with the identified word tokens;
a database for storing the determined word types, wherein the number of stored word types is less than the number of identified word tokens, storing word token location identifiers identifying the location in the natural language text corpus of a respective identified word token, and linking the stored word token location identifiers to the stored word types, such that, for a given identified word token, the stored word token location identifier identifying the location of the identified word token is logically linked to the stored word type which is associated with the identified word token.
(Dependent claims: 7, 8)
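The storage scheme recited in claims 1 and 6 resembles an inverted index: each distinct word type is stored once, and every token location is linked to its type. The following is a minimal sketch under stated assumptions, not the implementation described in the specification. The table names, the function names, the use of SQLite, the choice of the lowercased word form as the "word type", and the choice of character offset as the "location identifier" are all illustrative assumptions.

```python
import re
import sqlite3


def store_corpus(conn, text):
    """Store one text: each word type once, each token location linked to it."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS word_types (
            id   INTEGER PRIMARY KEY,
            form TEXT UNIQUE NOT NULL
        );
        CREATE TABLE IF NOT EXISTS token_locations (
            char_offset INTEGER PRIMARY KEY,
            type_id     INTEGER NOT NULL REFERENCES word_types(id)
        );
    """)
    for match in re.finditer(r"\w+", text):
        # Assumption: the word type is simply the lowercased token.
        form = match.group().lower()
        conn.execute(
            "INSERT OR IGNORE INTO word_types (form) VALUES (?)", (form,))
        (type_id,) = conn.execute(
            "SELECT id FROM word_types WHERE form = ?", (form,)).fetchone()
        # Assumption: the location identifier is the character offset.
        conn.execute(
            "INSERT INTO token_locations (char_offset, type_id) VALUES (?, ?)",
            (match.start(), type_id))
    conn.commit()


def locations_of(conn, form):
    """Return the sorted character offsets of all tokens of a word type."""
    rows = conn.execute(
        "SELECT l.char_offset FROM token_locations AS l "
        "JOIN word_types AS t ON t.id = l.type_id "
        "WHERE t.form = ? ORDER BY l.char_offset", (form.lower(),))
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
store_corpus(conn, "The cat saw the other cat")
print(locations_of(conn, "cat"))  # [4, 22]
```

In this example six tokens are stored against four word types ("the", "cat", "saw", "other"), which illustrates the claimed condition that the number of stored word types is less than the number of identified word tokens. The sketch handles a single text per database, since the character offset serves as the primary key of the location table.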
Specification