System, method and computer program product for performing unstructured information management and automatic text analysis

US 20040243554A1
Filed: 05/30/2003
Published: 12/02/2004
Est. Priority Date: 05/30/2003
Status: Abandoned Application

Abstract

Disclosed are a system architecture, components and a searching technique for an Unstructured Information Management System (UIMS). The UIMS may be provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources. The architecture generally includes a search engine, data storage, analysis engines containing pipelined document annotators, and various adapters. Searching is performed using a two-level technique.
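
The idea of an analysis engine built from pipelined annotators can be illustrated with a minimal sketch. Everything named here (`Annotation`, `Document`, `tokenizer`, `person_annotator`, `run_pipeline`, and the toy name list) is an assumption made for illustration and is not taken from the application; the sketch only shows annotators run in sequence over a shared document, each adding typed spans.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str   # hypothetical type name, e.g. "Token" or "Person"
    begin: int  # character offset where the span starts
    end: int    # character offset just past the span


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)


def tokenizer(doc):
    """Annotate each whitespace-delimited word as a Token."""
    pos = 0
    for word in doc.text.split():
        begin = doc.text.index(word, pos)
        pos = begin + len(word)
        doc.annotations.append(Annotation("Token", begin, pos))


def person_annotator(doc, names=("Ada", "Alan")):
    """Label tokens that match a toy name list as Person (illustrative only)."""
    tokens = [a for a in doc.annotations if a.type == "Token"]
    for tok in tokens:
        if doc.text[tok.begin:tok.end] in names:
            doc.annotations.append(Annotation("Person", tok.begin, tok.end))


def run_pipeline(doc, annotators):
    """Run each annotator in order; later ones can read earlier results."""
    for annotate in annotators:
        annotate(doc)
    return doc


doc = run_pipeline(Document("Ada met Alan today"), [tokenizer, person_annotator])
people = [doc.text[a.begin:a.end] for a in doc.annotations if a.type == "Person"]
token_count = sum(1 for a in doc.annotations if a.type == "Token")
```

The second annotator depends on the tokens produced by the first, which is the sense in which the annotators are coupled in this sketch.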

48 Claims

1. A data processing system for processing document data, comprising:
- data storage for storing a collection of document data that comprises unstructured document data;
  
  coupled to the data storage, a semantic search engine for retrieving document data from said data storage; and
  
  at least one analysis engine that comprises a plurality of coupled annotators at least some of which are operable for processing document data for tokenizing document data and for identifying and annotating a particular type of semantic content;
  
  where said data processing system comprises an inverted file system for storing said annotations, a list comprising occurrences of respective annotations and, for each listed occurrence of a respective annotation, a set comprised of a plurality of token locations spanned by said respective annotation.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
  - 2. A data processing system as in claim 1, where each said occurrence is defined by a location of said annotation.
  - 3. A data processing system as in claim 2, where a location is defined, relative to a document, by a starting location and at least one of an ending location and a length.
  - 4. A data processing system as in claim 1, where a set of token locations is monotonic.
  - 5. A data processing system as in claim 1, where a set of token locations is one of contiguous or non-contiguous.
  - 6. A data processing system as in claim 1, further comprising, coupled to said semantic search engine, said data store and said analysis engine, at least one collection analysis engine.
  - 7. A data processing system as in claim 1, where an annotation type comprises one of a semantic type and a meta-value.
  - 8. A data processing system as in claim 1, where at least one token in a token set is spanned by at least two annotations.
  - 9. A data processing system as in claim 1, where said search engine inputs a document search query from said collection analysis engine, and where the query comprises at least one of an annotation, a token, and a token in relation to an annotation.
  - 10. A data processing system as in claim 6, further comprising a relationship data structure comprising at least one relationship comprised of at least one argument ordered in argument order, where a relationship is represented by a respective annotation, where said search engine inputs from said collection analysis engine a document search query comprising a specific relationship, and where said search engine searches said data storage to return at least one document having the specific relationship.
  - 11. A data processing system as in claim 10, where at least one argument comprises an argument annotation linked to the annotation.
  - 12. A data processing system as in claim 10, where said search engine further returns at least one argument in the specific relationship.
  - 13. A data processing system as in claim 12, where the at least one argument is returned in response to the query, but is not explicitly specified by the query.
  - 14. A data processing system as in claim 1, where there are a plurality of said analysis engines operable to generate a corresponding plurality of views of a document, each view being derived from a different tokenization of the document.
  - 15. A data processing system as in claim 6, where said collection analysis engine comprises storage for a document retrieved by said search engine in association with meta-data output from said at least one analysis engine.
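
The inverted file recited in claims 1 to 9 can be sketched as follows, under stated assumptions: the class `AnnotationIndex`, its method names, and the sample documents are hypothetical and not drawn from the application. The sketch stores, for each annotation type, a list of occurrences, and for each occurrence the sorted set of token positions it spans, then answers queries by annotation and by token in relation to an annotation.

```python
from collections import defaultdict


class AnnotationIndex:
    """Toy inverted file: annotation type -> occurrences -> token positions."""

    def __init__(self):
        # annotation type -> list of (doc_id, sorted token positions spanned)
        self.postings = defaultdict(list)
        # lower-cased token text -> set of (doc_id, token position)
        self.tokens = defaultdict(set)

    def add_document(self, doc_id, tokens, annotations):
        for pos, tok in enumerate(tokens):
            self.tokens[tok.lower()].add((doc_id, pos))
        for ann_type, positions in annotations:
            # Positions are kept sorted (monotonic) and may be non-contiguous.
            self.postings[ann_type].append((doc_id, tuple(sorted(positions))))

    def docs_with_annotation(self, ann_type):
        return sorted({doc_id for doc_id, _ in self.postings.get(ann_type, [])})

    def docs_with_token_in_annotation(self, token, ann_type):
        hits = self.tokens.get(token.lower(), set())
        return sorted({
            doc_id
            for doc_id, span in self.postings.get(ann_type, [])
            if any((doc_id, p) in hits for p in span)
        })


index = AnnotationIndex()
index.add_document(
    "d1",
    ["Acme", "hired", "Jane", "Doe"],
    # Token 2 is spanned by both Person and Hire; Hire is non-contiguous.
    [("Organization", [0]), ("Person", [2, 3]), ("Hire", [0, 2, 3])],
)
index.add_document("d2", ["Jane", "Street", "is", "closed"], [("Location", [0, 1])])
```

With this data, a query for the token "jane" inside a `Person` annotation selects only `d1`, while the same token inside a `Location` annotation selects only `d2`, which is the kind of distinction a plain token index cannot make.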

16. A data processing system for processing document data, comprising:
- at least one application data storage interface for coupling to at least one database comprised of unstructured document data, said data storage interface for receiving at least database specification parameters, data source specification parameters and query command specification parameters; and
  
  at least one application text analysis engine interface for coupling to at least one text analysis engine that comprises a plurality of coupled annotators, at least some of which are operable for processing document data for identifying and annotating a particular type of semantic content, said text analysis interface for receiving at least text analysis engine flow parameters, document specification parameters and annotator specification parameters and producing analysis results;
  
  where an application is interoperable with said data storage and text analysis interfaces for specifying how to populate said at least one database, for specifying document selection and processing parameters for processing specified document data and analysis results, and for specifying at least one user interface, where at least one of the parameters sent through said application text analysis engine interface specifies a common abstract data format for specifying the operation of said at least one text analysis engine.
- Dependent Claims (17, 18, 19, 20, 21, 22)
  - 17. A data processing system as in claim 16, further comprising at least one application search engine interface for coupling to a semantic search engine, said search engine interface receiving queries and returning search results, where at least one query comprises an annotation produced by said text analysis engine.
  - 18. A data processing system as in claim 16, where said application data storage interface transmits and receives meta-data corresponding to documents stored in said database.
  - 19. A data processing system as in claim 16, further comprising an application knowledge access interface for coupling to at least one knowledge access system, said application knowledge access interface for receiving a knowledge predicate query from said application and for transmitting a query result to said application.
  - 20. A data processing system as in claim 16, further comprising an application directory service interface for coupling to a directory service system comprising a knowledge directory service, said application directory service interface for receiving Knowledge Source Adapter descriptors and for returning Knowledge Source Adapter service handles.
  - 21. A data processing system as in claim 16, further comprising an application directory service interface for coupling to a directory service system comprising a text analysis engine directory service, said application directory service interface for receiving a text analysis engine descriptor and for returning information for enabling said application to make use of a text analysis engine that corresponds to the received text analysis engine descriptor.
  - 22. A data processing system as in claim 16, where said common abstract data format comprises an object-based representation implemented as a type system supporting one of single or multiple inheritance.
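
Claim 22 describes the common abstract data format as an object-based representation implemented as a type system with inheritance. A minimal single-inheritance sketch is given below; the class `TypeSystem`, the root type name `Annotation`, and the example types are assumptions for illustration only, not the format defined in the application.

```python
class TypeSystem:
    """Toy single-inheritance type registry (illustrative assumption)."""

    def __init__(self, root="Annotation"):
        self.root = root
        self.parent = {root: None}  # type name -> parent type name

    def add_type(self, name, parent=None):
        parent = self.root if parent is None else parent
        if parent not in self.parent:
            raise KeyError(f"unknown parent type: {parent}")
        if name in self.parent:
            raise ValueError(f"type already defined: {name}")
        self.parent[name] = parent

    def subsumes(self, ancestor, name):
        """Return True if `name` is `ancestor` or inherits from it."""
        while name is not None:
            if name == ancestor:
                return True
            name = self.parent[name]
        return False


ts = TypeSystem()
ts.add_type("Entity")
ts.add_type("Person", parent="Entity")
ts.add_type("Organization", parent="Entity")
```

Under this sketch, a component that asks for `Entity` annotations can also accept `Person` and `Organization` annotations, because `ts.subsumes("Entity", "Person")` holds while the reverse does not.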

23. A modular text intelligence system, comprising:
- at least one document store interface coupled to at least one document store, the document store interface receiving at least one database specification and at least one data source and providing at least one database query command;
  
  at least one analysis engine interface coupled to at least one text analysis engine, the analysis engine interface receiving at least one document set specification of at least one document set and providing text analysis engine analysis results;
  
  an application interface for coupling to an application through which the application specifies:
  
  how to populate said at least one document store;
  
  an application logic for selecting at least one document set;
  
  processing of said selected document set by said at least one text analysis engine;
  
  processing of said analysis results; and
  
  at least one user interface, where the application specification occurs by setting at least one parameter, said at least one parameter comprising a specification of a common abstract data format for use by said at least one text analysis engine.
- Dependent Claims (24, 25)
  - 24. A modular text intelligence system as in claim 23, further comprising at least one search engine interface for receiving at least one search engine identifier of at least one search engine and at least one search engine specification, said search engine interface further receiving at least one search engine query search result.
  - 25. A modular text intelligence system as in claim 24, where said search engine interface is coupled to at least one of an index file system, a database and a ranking module.

26. A computer program product embodied on a computer-readable medium and comprising program code for directing operation of a text intelligence system in cooperation with at least one application, comprising:
- a program code segment for managing a collection of document data that comprises unstructured document data;
  
  a program code segment for implementing a semantic search engine;
  
  a program code segment for implementing at least one analysis engine comprising a plurality of annotators at least some of which are operable for processing document data for tokenizing document data and for identifying and annotating a particular type of semantic content; and
  
  a program code segment for creating and managing an inverted file system for storing, for each processed document, annotations, a list comprising occurrences of respective annotations and, for each listed occurrence of a respective annotation, a set comprised of token locations spanned by said respective annotation.
- Dependent Claims (27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41)
  - 27. A computer program product as in claim 26, where each said occurrence is defined by a location of said annotation.
  - 28. A computer program product as in claim 27, where a location is defined, relative to a document, by a starting location and at least one of an ending location and a length.
  - 29. A computer program product as in claim 26, where a set of token locations is monotonic.
  - 30. A computer program product as in claim 26, where a set of token locations is one of contiguous or non-contiguous.
  - 31. A computer program product as in claim 26, where at least one token in a token set is spanned by at least two annotations.
  - 32. A computer program product as in claim 26, where an annotation type comprises one of a semantic type and a meta-value.
  - 33. A computer program product as in claim 26, where said search engine inputs a document search query from said collection analysis engine, and where the query comprises at least one of an annotation, a token, and a token in relation to an annotation.
  - 34. A computer program product as in claim 33, further comprising a relationship data structure comprising at least one relationship comprised of at least one argument ordered in argument order, where a relationship is represented by a respective annotation, where said search engine inputs a document search query comprising a specific relationship, and where said search engine searches said data storage to return at least one document having the specific relationship.
  - 35. A computer program product as in claim 34, where at least one argument comprises an argument annotation linked to the annotation.
  - 36. A computer program product as in claim 34, where said search engine further returns at least one argument in the specific relationship.
  - 37. A computer program product as in claim 36, where the at least one argument is returned in response to the query, but is not explicitly specified by the query.
  - 38. A computer program product as in claim 26, where there is at least one program code segment for implementing a plurality of instances of said analysis engine for generating a corresponding plurality of views of a document, each view being derived from a different tokenization of the document.
  - 39. A computer program product as in claim 26, comprising storage for a document retrieved by said search engine in association with meta-data output from said at least one analysis engine.
  - 40. A computer program product as in claim 26, further comprising a computer program code segment for implementing an analysis engine assembler for creating an aggregate analysis engine through a declarative coordination of component analysis engines; and a computer program code segment for deploying a created aggregate analysis engine.
  - 41. A computer program product as in claim 26, where said plurality of annotators operate in a loosely coupled manner for storing a document tokenization within a plurality of memories.
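
The relationship queries of claims 34 to 37 can be sketched as below. The `Relationship` record, the `search` function, the use of `None` as an unspecified argument slot, and the sample data are all hypothetical choices for illustration; the application does not prescribe this representation. The sketch matches a named relationship with ordered arguments and returns, for each matching document, the arguments the query left unspecified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    doc_id: str
    name: str         # relationship annotation type, e.g. "Acquire"
    arguments: tuple  # argument texts in argument order, e.g. (buyer, target)


def search(relationships, name, pattern):
    """Find relationships by name; None in `pattern` leaves a slot unspecified.

    Returns (doc_id, unspecified_arguments) pairs for each match.
    """
    results = []
    for rel in relationships:
        if rel.name != name or len(rel.arguments) != len(pattern):
            continue
        pairs = list(zip(pattern, rel.arguments))
        if all(p is None or p == a for p, a in pairs):
            unspecified = tuple(a for p, a in pairs if p is None)
            results.append((rel.doc_id, unspecified))
    return results


store = [
    Relationship("d1", "Acquire", ("Acme", "Widgets Inc")),
    Relationship("d2", "Acquire", ("Globex", "Acme")),
    Relationship("d3", "Hire", ("Acme", "Jane Doe")),
]

# "What did Acme acquire?": the first argument is fixed, the second is open.
acquired_by_acme = search(store, "Acquire", ("Acme", None))
```

Because arguments are compared position by position, `d2` (where Acme is the target rather than the buyer) does not match, and the returned tuple for `d1` carries the argument that the query did not explicitly specify.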

42. A method to process document data, comprising:
- providing at least one application data storage interface for coupling to at least one database comprised of unstructured document data, and receiving at least database specification parameters, data source specification parameters and query command specification parameters through said data storage interface; and
  
  providing at least one application text analysis engine interface for coupling to at least one text analysis engine that comprises a plurality of coupled annotators, at least some of which are operable for processing document data for identifying and annotating a particular type of semantic content, and receiving at least text analysis engine flow parameters, document specification parameters and annotator specification parameters and producing analysis results through said text analysis interface;
  
  where an application is interoperable with said data storage and text analysis interfaces for specifying how to populate said at least one database, for specifying document selection and processing parameters for processing specified document data and analysis results, and for specifying at least one user interface, where at least one of the parameters sent through said application text analysis engine interface specifies a common abstract data format for specifying the operation of said at least one text analysis engine.
- Dependent Claims (43, 44, 45, 46, 47, 48)
  - 43. A method as in claim 42, further comprising providing at least one application search engine interface for coupling to a semantic search engine, and receiving queries and returning search results through said search engine interface, where at least one query comprises an annotation produced by said text analysis engine.
  - 44. A method as in claim 42, further comprising transmitting and receiving meta-data corresponding to documents stored in said database through said application data storage interface.
  - 45. A method as in claim 42, further comprising providing an application knowledge access interface for coupling to at least one knowledge access system, and receiving a knowledge predicate query from said application and transmitting a query result to said application through said application knowledge access interface.
  - 46. A method as in claim 42, further comprising providing an application directory service interface for coupling to a directory service system comprising a knowledge directory service, and receiving Knowledge Source Adapter descriptors and returning Knowledge Source Adapter service handles through said application directory service interface.
  - 47. A method as in claim 42, further comprising providing an application directory service interface for coupling to a directory service system comprising a text analysis engine directory service, and receiving a text analysis engine descriptor and returning information for enabling said application to make use of a text analysis engine that corresponds to the received text analysis engine descriptor through said application directory service interface.
  - 48. A method as in claim 42, where said common abstract data format comprises an object-based representation implemented as a type system supporting one of single or multiple inheritance.
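
The directory service step of claim 47, receiving a text analysis engine descriptor and returning what the application needs to use a matching engine, can be sketched as follows. The class `AnalysisEngineDirectory`, the representation of a descriptor as a set of capability names, and the two toy engines are assumptions for illustration; the application does not define descriptors this way.

```python
class AnalysisEngineDirectory:
    """Toy directory: maps capability descriptors to callable engine handles."""

    def __init__(self):
        self._registry = []  # list of (capabilities, engine handle)

    def register(self, capabilities, engine):
        self._registry.append((frozenset(capabilities), engine))

    def lookup(self, descriptor):
        """Return the first engine offering every requested capability."""
        wanted = frozenset(descriptor)
        for capabilities, engine in self._registry:
            if wanted <= capabilities:
                return engine
        return None


def split_engine(text):
    return text.split()


def lowercase_split_engine(text):
    return text.lower().split()


directory = AnalysisEngineDirectory()
directory.register({"tokenize"}, split_engine)
directory.register({"tokenize", "lowercase"}, lowercase_split_engine)

handle = directory.lookup({"tokenize", "lowercase"})
tokens = handle("Unstructured Information") if handle else []
```

The application in this sketch never names an engine directly; it describes what it needs and calls whatever handle the directory returns.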

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Zadrozny, Wlodek W.; Broder, Andrei Z.; Ferrucci, David; Ciccolo, Arthur C.; Marwick, Alan D.

Application Number: US10/448,859
Publication Number: US 20040243554A1
US Class Current: 707/3
CPC Class Codes:
G06F 16/31 Indexing; Data structures t...
G06F 16/38 Retrieval characterised by ...
