Creating a document index from a flex- and Yacc-generated named entity recognizer

US 20060047691A1
Filed: 09/30/2004
Published: 03/02/2006
Est. Priority Date: 08/31/2004
Status: Abandoned Application

Abstract

Methods of constructing a document index including named entity information generated by at least one tool associated with parsing computer programs are presented. The methods include using a lexical analyzer generator, e.g. Flex, and/or a parser generator, e.g. Yacc, to generate named entity recognizers. The named entity recognizers are used to identify named entities in documents, in particular, very large document sets such as web pages available on the Internet. The identified named entities are stored as named entity annotations in the document index. Also, methods of performing searches using the document index are presented. The searches are performed based on queries that can be received on an application programming interface (API). Relevant documents are obtained using the named entity annotations, which can be returned across the API. Also presented are associated computer readable media.
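
As an illustration of the lexical-analyzer approach the abstract describes, the following Python sketch mimics what a Flex-generated scanner does: each rule pairs a regular expression with a named entity class, and the scanner takes the longest match at each position. The rule set, the class labels (`DATE`, `MONEY`, `EMAIL`), and the `Annotation` record are assumptions made for illustration. The application does not specify them, and a real recognizer would be generated by Flex or Lex rather than written by hand.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int      # character offset of the match in the document
    length: int     # number of characters matched
    ne_class: str   # hypothetical class label, e.g. "DATE"
    text: str       # the matched surface string


# Hypothetical rules, each pairing a class with a pattern, in the way a
# Flex rule pairs a pattern with an action.
RULES = [
    ("DATE", re.compile(r"\d{1,2}/\d{1,2}/\d{4}")),
    ("MONEY", re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")),
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
]


def recognize(text: str) -> list[Annotation]:
    """Scan left to right; at each offset keep the longest matching rule."""
    annotations = []
    pos = 0
    while pos < len(text):
        best = None
        for ne_class, pattern in RULES:
            m = pattern.match(text, pos)
            # Strict ">" keeps the earlier rule on ties, as Flex does.
            if m and (best is None or m.end() > best[1].end()):
                best = (ne_class, m)
        if best is None:
            pos += 1  # no rule applies here; skip one character
            continue
        ne_class, m = best
        annotations.append(
            Annotation(m.start(), m.end() - m.start(), ne_class, m.group())
        )
        pos = m.end()
    return annotations
```

Calling `recognize("Filed 09/30/2004 for $1,250.00; contact ner@example.com.")` yields one annotation per recognized entity, each carrying its offset, length, and class, which is the information the abstract says is stored in the index.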

30 Claims

1. A method of generating a web/document index comprising the steps of:
- using a named entity recognizer generated from a tool used to parse computer programs to identify named entities in web pages/documents; and
- constructing a web/document index of web pages/documents based in part on the named entities identified by the tool.

2. The method of claim 1, and further comprising the steps of:
- receiving text documents, and generating named entity annotations from the identified named entities.

3. The method of claim 2, wherein constructing a web/document index comprises storing the named entity annotations in a database.

4. The method of claim 3, wherein storing the named entities comprises storing a position indicator for each identified named entity.

5. The method of claim 4, wherein storing the named entities comprises storing at least one class identifier for each identified named entity.

6. The method of claim 1, wherein using a named entity recognizer comprises using at least one lexical analyzer generator applying regular expression rules to identify classes of named entities.

7. The method of claim 1, wherein using a named entity recognizer comprises using at least one parser generator applying linguistic rules to identify classes of named entities.

8. The method of claim 7, wherein using at least one lexical analyzer generator comprises using one of Flex and Lex, and wherein using at least one parser generator comprises using one of Yacc and Bison.
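
Claims 6 and 7 separate a lexical layer (regular expression rules) from a parser layer (linguistic rules). The Python sketch below illustrates that split under assumed rules: a tokenizer assigns the hypothetical classes `TITLE`, `CAPWORD`, and `OTHER`, and a hand-written routine applies the single grammar rule `PERSON -> TITLE CAPWORD+`, which in a Yacc or Bison grammar would be written as a production. None of these names or rules come from the application.

```python
import re

# Lexical layer: token classes a generated scanner might emit (hypothetical).
TOKEN_RE = re.compile(
    r"(?P<TITLE>\b(?:Dr|Mr|Ms|Prof)\.)"
    r"|(?P<CAPWORD>\b[A-Z][a-z]+\b)"
    r"|(?P<OTHER>\S+)"
)


def tokenize(text):
    """Return (token_class, start, end) for each non-space token."""
    return [(m.lastgroup, m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def find_persons(text):
    """Parser layer: apply the rule PERSON -> TITLE CAPWORD+ to the tokens."""
    tokens = tokenize(text)
    persons = []
    i = 0
    while i < len(tokens):
        if tokens[i][0] == "TITLE":
            j = i + 1
            while j < len(tokens) and tokens[j][0] == "CAPWORD":
                j += 1
            if j > i + 1:  # at least one CAPWORD followed the title
                start, end = tokens[i][1], tokens[j - 1][2]
                persons.append(
                    {"class": "PERSON", "start": start, "end": end,
                     "text": text[start:end]}
                )
                i = j
                continue
        i += 1
    return persons
```

Each result carries a class identifier and a position indicator (here a start and end offset), matching the two pieces of information claims 4 and 5 say are stored for every identified entity.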

9. A computer readable medium having stored thereon computer readable instructions which, when read by the computer cause the computer to generate a document index by performing steps of:
- receiving text documents;
- identifying named entities in the text documents using a tool used to parse computer programs;
- generating named entity annotations corresponding with the identified named entities; and
- storing the generated named entity annotations in a database.

10. The computer readable medium of claim 9, wherein receiving text documents comprises receiving web pages.

11. The computer readable medium of claim 9, wherein identifying named entities comprises using at least one lexical analyzer applying regular expression rules associated with classes or constituent strings of named entities.

12. The computer readable medium of claim 11, wherein identifying named entities comprises using at least one parser applying grammar rules associated with classes or constituent strings of named entities.

13. The computer readable medium of claim 12, wherein the lexical analyzer is generated using one of Flex and Lex, and wherein the parser is generated using one of Yacc and Bison.

14. The computer readable medium of claim 9, wherein generating named entity annotations comprises generating position indicators for the identified named entities.

15. The computer readable medium of claim 14, wherein generating position indicators comprises generating position information that comprises a start position and a string length or a start position and an end position for each identified named entity.

16. The computer readable medium of claim 14, wherein generating named entity annotations comprises generating class identifiers for the identified named entities.

17. The computer readable medium of claim 16, wherein generating named entity annotations comprises generating sub-class identifiers for at least some of the identified named entities.

18. The computer readable medium of claim 9, wherein storing the generated named entity annotations comprises storing the named entity annotations along with information about the named entity class.

19. The computer readable medium of claim 9, and further comprising storing tokens with corresponding named entity annotations.
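
Claims 14 through 19 describe what each stored annotation carries: a position indicator (start plus length, or start plus end), a class identifier, an optional sub-class identifier, and the token itself. One possible storage layout is sketched below using Python's standard `sqlite3` module. The table name, column names, and sample rows are assumptions for illustration only; the application does not name a database product or schema.

```python
import sqlite3

# Hypothetical sample rows:
# (doc_id, start_pos, str_len, ne_class, sub_class, token)
SAMPLE = [
    ("doc1", 0, 9, "ORGANIZATION", "COMPANY", "Microsoft"),
    ("doc1", 20, 10, "DATE", None, "09/30/2004"),
    ("doc2", 5, 7, "LOCATION", "CITY", "Redmond"),
]


def build_store(annotations):
    """Create an in-memory table and insert one row per annotation."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE ne_annotation (
               doc_id    TEXT    NOT NULL,
               start_pos INTEGER NOT NULL,  -- position indicator: start
               str_len   INTEGER NOT NULL,  -- position indicator: length
               ne_class  TEXT    NOT NULL,  -- class identifier
               sub_class TEXT,              -- optional sub-class identifier
               token     TEXT    NOT NULL   -- token stored with annotation
           )"""
    )
    conn.executemany(
        "INSERT INTO ne_annotation VALUES (?, ?, ?, ?, ?, ?)", annotations
    )
    conn.commit()
    return conn


def docs_with_class(conn, ne_class):
    """Return the sorted ids of documents containing the given class."""
    rows = conn.execute(
        "SELECT DISTINCT doc_id FROM ne_annotation "
        "WHERE ne_class = ? ORDER BY doc_id",
        (ne_class,),
    )
    return [row[0] for row in rows]
```

Storing a start and a length makes the end position derivable as `start_pos + str_len`, so either of the two forms named in claim 15 can be recovered from the same row.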

20. A method of performing document searches comprising the steps of:
- constructing a document index with named entity annotations generated at least in part from a tool used for parsing computer programs;
- receiving a query comprising at least one named entity class;
- searching the document index for the at least one named entity class; and
- obtaining relevant documents.

21. The method of claim 20, wherein constructing a document index comprises identifying named entities in web pages available on the Internet.

22. The method of claim 20, wherein constructing a document index comprises periodically updating the document index.

23. The method of claim 20, wherein constructing a document index comprises using at least one named entity recognizer generated by a lexical analyzer generator.

24. The method of claim 23, wherein constructing a document index further comprises using at least one named entity recognizer generated using a parser generator.

25. The method of claim 20, wherein receiving a query comprises receiving a query through an application programming interface (API), and wherein obtaining relevant documents comprises returning the relevant documents through the API.

26. The method of claim 20, wherein receiving a query comprises receiving a query comprising at least one class of named entity.

27. The method of claim 20, wherein searching the document index comprises searching for at least one class of named entity, and wherein obtaining relevant documents comprises obtaining documents comprising the at least one class of named entity contained in the received query.

28. The method of claim 27, wherein searching the document index further comprises searching for at least one additional search term, and wherein obtaining relevant documents comprises obtaining documents comprising both the at least one class of named entity and at least one additional search term.

29. The method of claim 28, wherein the at least one additional search term is one of a named entity, a named entity sub-class, a named entity constituent, and a word that is not identified as a named entity.

30. The method of claim 20, wherein obtaining relevant documents comprises ranking the relevant documents for display based on named entity class information.
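
The search flow in claims 20 and 27 through 30 (query by named entity class, optionally narrowed by an additional term, then rank) can be sketched as a small in-memory index in Python. The `NEIndex` class and its methods are hypothetical. In particular, ranking by how many entities of the queried class a document contains is only one assumed reading of "based on named entity class information"; the claims do not define a ranking function.

```python
from collections import defaultdict


class NEIndex:
    """Toy index: entity-class counts per document plus a plain word index."""

    def __init__(self):
        # ne_class -> doc_id -> number of entities of that class in the doc
        self.class_counts = defaultdict(lambda: defaultdict(int))
        # lowercased word -> set of doc_ids containing it
        self.word_docs = defaultdict(set)

    def add(self, doc_id, words, ne_classes):
        """Index one document's words and the classes of its entities."""
        for word in words:
            self.word_docs[word.lower()].add(doc_id)
        for ne_class in ne_classes:
            self.class_counts[ne_class][doc_id] += 1

    def search(self, ne_class, term=None):
        """Return doc ids containing ne_class (and term, if given), ranked."""
        candidates = dict(self.class_counts.get(ne_class, {}))
        if term is not None:
            allowed = self.word_docs.get(term.lower(), set())
            candidates = {d: n for d, n in candidates.items() if d in allowed}
        # Assumed ranking: more entities of the class first, then doc id.
        return sorted(candidates, key=lambda d: (-candidates[d], d))
```

A query such as `index.search("ORGANIZATION", term="filed")` corresponds to claim 28: only documents that contain both an entity of the requested class and the additional search term are returned.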

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Powell, Kevin R.; Calcagno, Michael V.; Humphreys, Kevin W.
Application Number: US10/954,610
Publication Number: US 20060047691A1
US Class Current: 1/1
CPC Class Codes: G06F 40/295 Named entity recognition
