System, method and computer program product for performing unstructured information management and automatic text analysis, including an annotation inverted file system facilitating indexing and searching

US 20040243560A1
Filed: 05/30/2003
Published: 12/02/2004
Est. Priority Date: 05/30/2003
Status: Abandoned Application



Abstract

Disclosed are a system architecture, components and a searching technique for an Unstructured Information Management System (UIMS). The UIMS may be provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources. The architecture generally includes a search engine, data storage, analysis engines containing pipelined document annotators, and various adapters. The search uses a two-level technique. The data processing system includes a token inverted file system storing tokens obtained by at least one tokenizer from document data. An annotation inverted file system stores annotations, a list of one or more occurrences of each annotation, and, for each listed occurrence, a set comprised of at least two token locations spanned by the respective annotation.
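
The two index structures named in the abstract can be pictured with a minimal sketch. It is not the disclosed implementation: the class names `TokenIndex`, `AnnotationIndex` and `Occurrence`, the whitespace tokenizer, and the sample sentence are all assumptions introduced here for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    """One occurrence of an annotation: a document and the token positions it spans."""
    doc_id: str
    token_positions: tuple


class TokenIndex:
    """Token inverted file (hypothetical layout): token -> [(doc_id, position)]."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add_document(self, doc_id, text):
        # Whitespace tokenizer, assumed only for this example.
        tokens = text.lower().split()
        for position, token in enumerate(tokens):
            self.postings[token].append((doc_id, position))
        return tokens


class AnnotationIndex:
    """Annotation inverted file (hypothetical layout): annotation -> [Occurrence]."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, annotation, doc_id, token_positions):
        # The abstract requires at least two token locations per occurrence.
        if len(token_positions) < 2:
            raise ValueError("an occurrence spans at least two token locations")
        self.postings[annotation].append(Occurrence(doc_id, tuple(token_positions)))


tokens = TokenIndex()
annotations = AnnotationIndex()
tokens.add_document("d1", "Alan Marwick works at IBM Research")
annotations.add("Person", "d1", [0, 1])
annotations.add("Organization", "d1", [4, 5])
```

Keeping annotations in their own inverted file, keyed by annotation rather than by token, lets a search resolve an annotation to token positions without rescanning the document text.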


53 Claims

1. A data processing system, comprising:
- a token inverted file system storing tokens obtained by at least one tokenizer from document data; and
- an annotation inverted file system storing annotations, a list of one or more occurrences of each annotation, and, for each listed occurrence, a set comprised of at least two token locations spanned by the respective annotation.
- Dependent claims (2 to 18):
  - 2. A data processing system as in claim 1, where each occurrence is defined by a location of the respective annotation within a document.
  - 3. A data processing system as in claim 2, where a location is defined, relative to a document, by a starting location and at least one of an ending location and a length.
  - 4. A data processing system as in claim 1, where a set of token locations is monotonic.
  - 5. A data processing system as in claim 1, where a set of token locations is contiguous.
  - 6. A data processing system as in claim 1, where a set of token locations is non-contiguous.
  - 7. A data processing system as in claim 1, where an annotation type comprises one of a semantic type, a meta-value, a confidence and a price.
  - 8. A data processing system as in claim 1, where at least one token in a token set is spanned by at least two annotations.
  - 9. A data processing system as in claim 1, further comprising a document search engine coupled to document data storage that is responsive to a query that comprises at least one of an annotation, a token, and a token in relation to an annotation.
  - 10. A data processing system as in claim 9, further comprising a relationship data structure comprising at least one relationship comprised of arguments ordered in argument order, where a relationship is represented by a respective annotation, where said search engine is further responsive to a query comprising a specific relationship, and where said search engine searches said data storage to return at least one document having the specific relationship.
  - 11. A data processing system as in claim 10, where at least one argument comprises an argument annotation linked to the annotation.
  - 12. A data processing system as in claim 10, where said search engine further returns at least one argument in the specific relationship.
  - 13. A data processing system as in claim 12, where the at least one argument is returned in response to the query, but is not explicitly specified by the query.
  - 14. A data processing system as in claim 1, where said annotation comprises a relation identifier.
  - 15. A data processing system as in claim 14, where said relation identifier is comprised of at least one argument.
  - 16. A data processing system as in claim 15, where said at least one argument that comprises said relation identifier comprises at least one of:
    - at least one other annotation, a token, a string, a record, a meta-value, a category, a relation, a relation among at least two tokens, and a relation among at least two annotations.
  - 17. A data processing system as in claim 14, where said relation identifier comprises a logical predicate.
  - 18. A data processing system as in claim 10, where said search engine further returns a plurality of ordered arguments.
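
The query forms recited in claims 9 to 13 and 18 (a token in relation to an annotation, and a relationship whose ordered arguments are returned even when the query does not name them) can be illustrated as follows. This is a sketch under stated assumptions, not the claimed search engine: the dictionary layouts, the function names `token_within_annotation` and `find_relation`, the use of `None` as a wildcard, and the sample sentence are all hypothetical.

```python
from collections import defaultdict

# Hypothetical in-memory postings, assumed for illustration only.
token_postings = defaultdict(list)       # token -> [(doc_id, position)]
annotation_postings = defaultdict(list)  # annotation -> [(doc_id, (positions...))]
relation_postings = defaultdict(list)    # relation -> [(doc_id, (arg1, arg2, ...))]


def index_document(doc_id, text):
    """Record each whitespace-separated token with its position."""
    for pos, tok in enumerate(text.lower().split()):
        token_postings[tok].append((doc_id, pos))


def token_within_annotation(token, annotation):
    """Return documents in which `token` falls inside a span of `annotation`."""
    hits = set()
    for doc_id, pos in token_postings.get(token, []):
        for ann_doc, span in annotation_postings.get(annotation, []):
            if ann_doc == doc_id and pos in span:
                hits.add(doc_id)
    return sorted(hits)


def find_relation(name, pattern):
    """Match a relation by ordered arguments; None in `pattern` matches anything.

    The full argument tuple is returned, so arguments the query left
    unspecified come back with the matching document.
    """
    results = []
    for doc_id, args in relation_postings.get(name, []):
        if len(args) == len(pattern) and all(
            p is None or p == a for p, a in zip(pattern, args)
        ):
            results.append((doc_id, args))
    return results


index_document("d1", "Acme Corp acquired Widget Works in 2001")
annotation_postings["Organization"].append(("d1", (0, 1)))
annotation_postings["Organization"].append(("d1", (3, 4)))
relation_postings["acquire"].append(("d1", ("Acme Corp", "Widget Works")))

print(token_within_annotation("widget", "Organization"))
print(find_relation("acquire", ("Acme Corp", None)))
```

Because arguments are stored in argument order, the pattern `("Widget Works", None)` does not match the stored relation, which distinguishes the acquirer from the acquired party.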

19. A computer program product embodied on a computer-readable medium and comprising program code for directing at least one computer to process document data, comprising:
- a program code segment for implementing a token inverted file system storing tokens obtained by at least one tokenizer from document data; and
- a computer program code segment for implementing an annotation inverted file system storing annotations, a list of one or more occurrences of each annotation, and, for each listed occurrence, a set comprised of at least two token locations spanned by the respective annotation.
- Dependent claims (20 to 36):
  - 20. A computer program product as in claim 19, where each occurrence is defined by a location of the respective annotation within a document.
  - 21. A computer program product as in claim 20, where a location is defined, relative to a document, by a starting location and at least one of an ending location and a length.
  - 22. A computer program product as in claim 19, where a set of token locations is monotonic.
  - 23. A computer program product as in claim 19, where a set of token locations is one of contiguous or non-contiguous.
  - 24. A computer program product as in claim 19, where an annotation type comprises one of a semantic type, a meta-value, a confidence and a price.
  - 25. A computer program product as in claim 19, where at least one token in a token set is spanned by at least two annotations.
  - 26. A computer program product as in claim 19, further comprising a computer program code segment for implementing a search engine coupled to document data storage, said search engine being responsive to a query that comprises at least one of an annotation, a token, and a token in relation to an annotation.
  - 27. A computer program product as in claim 26, further comprising a relationship data structure comprising at least one relationship comprised of arguments ordered in argument order, where a relationship is represented by a respective annotation, where said search engine is further responsive to a query comprising a specific relationship, and where said search engine searches said data storage to return at least one document having the specific relationship.
  - 28. A computer program product as in claim 27, where at least one argument comprises an argument annotation linked to the annotation.
  - 29. A computer program product as in claim 27, where said search engine further returns at least one argument in the specific relationship.
  - 30. A computer program product as in claim 29, where the at least one argument is returned in response to the query, but is not explicitly specified by the query.
  - 31. A computer program product as in claim 19, where said annotation comprises a relation identifier.
  - 32. A computer program product as in claim 31, where said relation identifier is comprised of at least one argument.
  - 33. A computer program product as in claim 32, where said at least one argument that comprises said relation identifier comprises at least one of:
    - at least one other annotation, a token, a string, a record, a meta-value, a category, a relation, a relation among at least two tokens, and a relation among at least two annotations.
  - 34. A computer program product as in claim 31, where said relation identifier comprises a logical predicate.
  - 35. A computer program product as in claim 27, where said search engine further returns a plurality of ordered arguments.
  - 36. A computer program product as in claim 19, where at least some of said stored annotations are overlapping annotations.
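
Claims 25 and 36 allow a token to be spanned by more than one annotation. A short sketch of how such overlaps could be detected from stored occurrences follows; the function name `overlapping_tokens`, the tuple layout, and the sample annotations are assumptions made for illustration.

```python
from collections import defaultdict


def overlapping_tokens(occurrences):
    """Map (doc_id, position) to the annotations covering it, keeping only
    positions covered by two or more distinct annotations.

    `occurrences` is an iterable of (annotation, doc_id, positions) tuples,
    a hypothetical layout chosen for this example.
    """
    covering = defaultdict(set)
    for annotation, doc_id, positions in occurrences:
        for pos in positions:
            covering[(doc_id, pos)].add(annotation)
    return {key: sorted(names) for key, names in covering.items() if len(names) >= 2}


occurrences = [
    ("Person", "d1", (0, 1)),
    ("Employee", "d1", (0, 1, 2)),
    ("Organization", "d1", (4, 5)),
]
shared = overlapping_tokens(occurrences)
print(shared)
```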

37. A method for processing document data, comprising:
- storing tokens in a token inverted file system that are obtained by at least one tokenizer from document data; and
- storing, in an annotation inverted file system, annotations, a list of one or more occurrences of each annotation, and, for each listed occurrence, a set comprised of at least two token locations spanned by the respective annotation.
- Dependent claims (38 to 53):
  - 38. A method as in claim 37, where each occurrence is defined by a location of the respective annotation within a document.
  - 39. A method as in claim 38, where a location is defined, relative to a document, by a starting location and at least one of an ending location and a length.
  - 40. A method as in claim 37, where a set of token locations is monotonic.
  - 41. A method as in claim 37, where a set of token locations is one of contiguous or non-contiguous.
  - 42. A method as in claim 37, where an annotation type comprises one of a semantic type, a meta-value, a confidence and a price.
  - 43. A method as in claim 37, where at least one token in a token set is spanned by at least two annotations.
  - 44. A method as in claim 37, further comprising generating a search engine query that comprises at least one of an annotation, a token, and a token in relation to an annotation.
  - 45. A method as in claim 44, further comprising providing a relationship data structure comprising at least one relationship comprised of arguments ordered in argument order, where a relationship is represented by a respective annotation, where said search engine query further comprises a specific relationship, and searching a data storage to return at least one document having the specific relationship.
  - 46. A method as in claim 45, where at least one argument comprises an argument annotation linked to the annotation.
  - 47. A method as in claim 45, further comprising returning at least one argument in the specific relationship.
  - 48. A method as in claim 47, where the at least one argument is returned in response to the query, but is not explicitly specified by the query.
  - 49. A method as in claim 37, where said annotation comprises a relation identifier.
  - 50. A method as in claim 49, where said relation identifier is comprised of at least one argument.
  - 51. A method as in claim 49, where said at least one argument that comprises said relation identifier comprises at least one of:
    - at least one other annotation, a token, a string, a record, a meta-value, a category, a relation, a relation among at least two tokens, and a relation among at least two annotations.
  - 52. A method as in claim 49, where said relation identifier comprises a logical predicate.
  - 53. A method as in claim 45, further comprising returning a plurality of ordered arguments.
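
Claims 39 to 41 describe a location given by a start plus an end or a length, and token-location sets that are monotonic, contiguous or non-contiguous. The helpers below are a sketch of one reading of those terms. The claims do not define them here, so treating "monotonic" as strictly increasing positions, "contiguous" as consecutive positions, and the end location as inclusive are all assumptions, as are the function names.

```python
def is_monotonic(positions):
    """True when positions strictly increase (assumed reading of 'monotonic')."""
    return all(a < b for a, b in zip(positions, positions[1:]))


def is_contiguous(positions):
    """True when every position directly follows the previous one."""
    return all(b - a == 1 for a, b in zip(positions, positions[1:]))


def span_from_start_length(start, length):
    """Token positions for a location given as start plus length."""
    return tuple(range(start, start + length))


def span_from_start_end(start, end):
    """Token positions for a location given as start and inclusive end (assumed)."""
    return tuple(range(start, end + 1))


gapped = (3, 4, 7)                       # skips positions 5 and 6
print(is_monotonic(gapped), is_contiguous(gapped))
print(span_from_start_length(3, 2) == span_from_start_end(3, 4))
```

Under these assumptions a non-contiguous set such as `(3, 4, 7)` is still monotonic, which is how an annotation could span a discontinuous phrase while keeping its tokens in document order.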


Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Marwick, Alan; Zadrozny, Wlodek W.; Mass, Yosi; Broder, Andrei Z.; Ferrucci, David

Application Number: US10/449,398
Publication Number: US 20040243560A1

Field of Search
US Class Current: 707/3
CPC Class Codes:
G06F 16/319 Inverted lists
G06F 16/3344 using natural language anal...
