Method and system for information extraction

US 6,842,730 B1
Filed: 06/23/2000
Issued: 01/11/2005
Est. Priority Date: 06/22/2000
Status: Active Grant


Abstract

A method and a system for extracting information from a natural language text corpus based on a natural language query are disclosed. In the method, the natural language text corpus is analyzed with respect to surface structure of word tokens and surface syntactic roles of constituents, and the analyzed natural language text corpus is then indexed and stored. Furthermore, a natural language query is analyzed with respect to surface structure of word tokens and surface syntactic roles of constituents. From the analyzed natural language query, one or more surface variants are then created, where these surface variants are equivalent to the natural language query with respect to lexical meaning of word tokens and surface syntactic roles of constituents. The surface variants are then compared with the indexed and stored analyzed natural language text corpus, and each portion of text comprising a string of word tokens that matches any one of the surface variants or the natural language query is extracted from the indexed and stored analyzed natural language text corpus.
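The query side of the abstract can be illustrated with a small sketch. It is not the patented method: it only substitutes lexically equivalent words to build surface variants and then looks for exact token-string matches, and it ignores the surface syntactic analysis the abstract describes. The `EQUIVALENTS` table and the function names `surface_variants` and `extract` are hypothetical and introduced here purely for illustration.

```python
from itertools import product

# Hypothetical table of lexically equivalent words (an assumption, not from the patent).
EQUIVALENTS = {
    "buy": ["buy", "purchase", "acquire"],
    "firm": ["firm", "company"],
}


def surface_variants(query_tokens):
    """Build every token string obtained by swapping in equivalent words.

    The original query is always among the variants, because each entry
    in EQUIVALENTS lists the word itself first.
    """
    choices = [EQUIVALENTS.get(token, [token]) for token in query_tokens]
    return [list(variant) for variant in product(*choices)]


def extract(corpus_tokens, query_tokens):
    """Return (start location, matched variant) for every matching token string."""
    hits = []
    for variant in surface_variants(query_tokens):
        n = len(variant)
        for start in range(len(corpus_tokens) - n + 1):
            if corpus_tokens[start:start + n] == variant:
                hits.append((start, variant))
    return hits


corpus = "acme will purchase the company".split()
print(extract(corpus, ["buy", "the", "firm"]))
```

A linear scan is used only to keep the example short; the abstract instead compares the variants against an indexed and stored corpus.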

80 Citations

8 Claims

1. A method of storing a natural language text corpus in a database, comprising the steps of:
- identifying word tokens of said natural language text corpus;
  
  determining locations in the natural language text of the identified word tokens;
  
  determining word types associated with the identified word tokens;
  
  storing the determined word types in said database, wherein the number of stored word types is less than the number of identified word tokens;
  
  storing word token location identifiers identifying the determined locations in the natural language text corpus of the identified word tokens; and
  
  linking the stored word token location identifiers to the stored word types, such that, for a given identified word token, the stored word token location identifier identifying the location of the identified word token is logically linked to the stored word type associated with the identified word token.
- Dependent claims (2, 3, 4, 5):
  - 2. The method according to claim 1, further comprising the steps of:
    - determining morpho-syntactic descriptions for the identified word tokens;
      
      storing the morpho-syntactic descriptions for the identified word tokens; and
      
      linking the stored morpho-syntactic descriptions to the stored word token location identifiers, such that, for a given identified word token, the stored morpho-syntactic description for the identified word token is logically linked to the stored word token location identifier identifying the location of the identified word token.
  - 3. The method according to claim 2, wherein the morpho-syntactic description of a word token comprises a part-of-speech and an inflectional form of the word token.
  - 4. The method according to claim 1, further comprising the steps of:
    - identifying phrases of said natural language text corpus;
      
      determining word tokens comprised in the identified phrases; and
      
      storing phrase location identifiers identifying the stored word token location identifiers of the word tokens comprised in the identified phrases, such that, for a given identified phrase, the stored phrase location identifier of the identified phrase identifies the stored word token location identifiers identifying the location of the word tokens comprised in the identified phrase.
  - 5. The method according to claim 4, further comprising the steps of:
    - determining phrase types of the identified phrases;
      
      storing the determined phrase types; and
      
      linking the stored phrase types to the stored phrase location identifiers, such that, for a given identified phrase, the phrase type for the identified phrase is logically linked to the stored phrase location identifier identifying the stored word token location identifiers identifying the location of the word tokens comprised in the identified phrase.
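The storage scheme in claims 1 to 5 can be sketched as an in-memory structure. This is a minimal illustration under stated assumptions, not the patent's implementation: the class name `CorpusStore`, the regular-expression tokenizer, the use of lowercasing to derive a word type, and the use of a running token count as the location identifier are all choices made for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CorpusStore:
    # word type -> type id; each distinct type is stored once (claim 1)
    type_ids: dict = field(default_factory=dict)
    # token location -> type id; this is the link from location to type (claim 1)
    token_type: dict = field(default_factory=dict)
    # token location -> (part of speech, inflectional form) (claims 2 and 3)
    msd: dict = field(default_factory=dict)
    # phrase id -> (phrase type, [token locations]) (claims 4 and 5)
    phrases: dict = field(default_factory=dict)

    def add_text(self, text):
        """Identify tokens, assign locations, and link each location to a type."""
        first = len(self.token_type)  # continue numbering after earlier text
        for location, match in enumerate(re.finditer(r"\w+", text), start=first):
            word_type = match.group().lower()  # assumed normalization
            type_id = self.type_ids.setdefault(word_type, len(self.type_ids))
            self.token_type[location] = type_id

    def locations_of(self, word):
        """Return every token location linked to the given word type."""
        type_id = self.type_ids.get(word.lower())
        return [loc for loc, tid in self.token_type.items() if tid == type_id]


store = CorpusStore()
store.add_text("The cat saw the dog. The dog ran.")
store.msd[1] = ("noun", "singular")          # annotation supplied by hand here
store.phrases[0] = ("noun phrase", [0, 1])   # "The cat"
print(len(store.type_ids), len(store.token_type), store.locations_of("dog"))
```

Because a word type is stored once and every occurrence only adds a location entry, the number of stored types stays below the number of tokens whenever any word repeats, which is the condition stated in claim 1.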

6. A system for storing a natural language text corpus, comprising:
- a text analysis unit for identifying word tokens of said natural language text corpus, determining locations in the natural language text of the identified word tokens, and determining word types associated with the identified word tokens;
  
  a database for storing the determined word types, wherein the number of stored word types is less than the number of identified word tokens, storing word token location identifiers identifying the location in the natural language text corpus of a respective identified word token, and linking the stored word token location identifiers to the stored word types, such that, for a given identified word token, the stored word token location identifier identifying the location of the identified word token is logically linked to the stored word type which is associated with the identified word token.
- Dependent claims (7, 8):
  - 7. The system according to claim 6, wherein the text analysis unit is further adapted to determine morpho-syntactic descriptions for the identified word tokens, and the database further stores the morpho-syntactic descriptions for the identified word tokens, and links the morpho-syntactic descriptions to the stored word type location identifiers, such that, for a given identified word token, the morpho-syntactic description for the identified word token is logically linked to the stored word token location identifier identifying the location of the identified word token.
  - 8. The system according to claim 7, wherein the morpho-syntactic description for the word token comprises a part-of-speech and an inflectional form of the word token.
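One way to picture the system of claims 6 to 8 is a text analysis function feeding a relational database. The sketch below uses Python's standard `sqlite3` module; the table and column names, the `analyze` and `store_corpus` functions, and the lowercasing tokenizer are assumptions made for this example and are not taken from the patent. The `pos` and `inflection` columns are left empty because no tagger is included.

```python
import re
import sqlite3

SCHEMA = """
CREATE TABLE word_type (
    id   INTEGER PRIMARY KEY,
    form TEXT UNIQUE NOT NULL              -- each word type stored once
);
CREATE TABLE word_token (
    location   INTEGER PRIMARY KEY,        -- token location identifier
    type_id    INTEGER NOT NULL REFERENCES word_type(id),  -- link to word type
    pos        TEXT,                       -- part of speech (claims 7 and 8)
    inflection TEXT                        -- inflectional form (claims 7 and 8)
);
"""


def analyze(text):
    """Stand-in for the text analysis unit: yield (location, word type) pairs."""
    for location, match in enumerate(re.finditer(r"\w+", text)):
        yield location, match.group().lower()


def store_corpus(conn, text):
    """Create the tables and store types, locations, and their links."""
    conn.executescript(SCHEMA)
    for location, form in analyze(text):
        conn.execute("INSERT OR IGNORE INTO word_type(form) VALUES (?)", (form,))
        (type_id,) = conn.execute(
            "SELECT id FROM word_type WHERE form = ?", (form,)
        ).fetchone()
        conn.execute(
            "INSERT INTO word_token(location, type_id) VALUES (?, ?)",
            (location, type_id),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
store_corpus(conn, "The cat saw the dog. The dog ran.")
rows = conn.execute(
    "SELECT t.location FROM word_token t "
    "JOIN word_type w ON w.id = t.type_id "
    "WHERE w.form = ? ORDER BY t.location",
    ("dog",),
).fetchall()
print([location for (location,) in rows])
```

The join from `word_token` to `word_type` plays the role of the logical link between a token location identifier and its word type.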

Current Assignee: Essencient Ltd
Original Assignee: Hapax Ltd.
Inventors: Braroe, Peter A.; Ejerhed, Eva Ingegord
Primary Examiner(s): Chawan, Vijay
Assistant Examiner(s): HARPER, V PAUL
Application Number: US09/599,563
Time in Patent Office: 1,663 Days
Field of Search: 707/5, 707/4, 707/3, 707/102, 704/9, 704/8, 704/257, 704/2, 704/1, 434/362
US Class Current: 704/9
CPC Class Codes:
G06F 16/3334   Selection or weighting of t...
G06F 16/3335   Syntactic pre-processing, e...
G06F 16/3344   using natural language anal...
G06F 40/20   Natural language analysis s...
G06F 40/211   Syntactic parsing, e.g. bas...
G06F 40/253   Grammatical analysis; Style...
G06F 40/268   Morphological analysis
G06F 40/30   Semantic analysis
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99934   Query formulation, input pr...
Y10S 707/99935   Query augmenting and refini...
