System, method and computer program product for performing unstructured information management and automatic text analysis, including an annotation inverted file system facilitating indexing and searching
Abstract
Disclosed are a system architecture, components, and a searching technique for an Unstructured Information Management System (UIMS). The UIMS may be provided as middleware for the effective management and interchange of unstructured information over a wide array of information sources. The architecture generally includes a search engine, data storage, analysis engines containing pipelined document annotators, and various adapters. Searching uses a two-level technique. The data processing system includes a token inverted file system that stores tokens obtained by at least one tokenizer from document data, and an annotation inverted file system that stores annotations, a list of one or more occurrences of each annotation, and, for each listed occurrence, a set comprised of at least two token locations spanned by the respective annotation.
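The two inverted files described in the abstract can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions, not the patented implementation: the whitespace tokenizer, the class names `TokenIndex` and `AnnotationIndex`, the helper `tokens_within`, and the choice to record each span as a (first, last) pair of token positions are all hypothetical.

```python
from collections import defaultdict


def tokenize(text):
    """Hypothetical whitespace tokenizer; returns lowercase tokens."""
    return text.lower().split()


class TokenIndex:
    """Token inverted file: token -> list of (doc_id, token_position)."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add_document(self, doc_id, text):
        tokens = tokenize(text)
        for pos, tok in enumerate(tokens):
            self.postings[tok].append((doc_id, pos))
        return tokens


class AnnotationIndex:
    """Annotation inverted file: label -> list of occurrences.

    Each occurrence is stored as (doc_id, first_token, last_token), i.e.
    two token locations delimiting the span (an assumed encoding).
    """

    def __init__(self):
        self.postings = defaultdict(list)

    def add_occurrence(self, label, doc_id, first_token, last_token):
        if last_token < first_token:
            raise ValueError("span must not end before it starts")
        self.postings[label].append((doc_id, first_token, last_token))


def tokens_within(token_index, annotation_index, token, label):
    """Return (doc_id, position) hits of `token` inside a `label` span."""
    hits = []
    for doc_id, pos in token_index.postings.get(token, []):
        for ann_doc, first, last in annotation_index.postings.get(label, []):
            if ann_doc == doc_id and first <= pos <= last:
                hits.append((doc_id, pos))
                break  # one matching span is enough for this token hit
    return hits


# Example: index one document and two annotation occurrences.
tokens = TokenIndex()
annotations = AnnotationIndex()
tokens.add_document("d1", "Acme Corp hired Jane Doe in Paris")
annotations.add_occurrence("Organization", "d1", 0, 1)  # "Acme Corp"
annotations.add_occurrence("Person", "d1", 3, 4)        # "Jane Doe"

print(tokens_within(tokens, annotations, "jane", "Person"))   # [('d1', 3)]
print(tokens_within(tokens, annotations, "paris", "Person"))  # []
```

Because annotation spans are expressed in token positions rather than character offsets, a token hit and an annotation occurrence can be compared directly, which is what makes a combined token-plus-annotation query cheap in this sketch.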
53 Claims
1. A data processing system, comprising:
a token inverted file system storing tokens obtained by at least one tokenizer from document data; and
an annotation inverted file system storing annotations, a list of one or more occurrences of each annotation, and, for each listed occurrence, a set comprised of at least two token locations spanned by the respective annotation.
Dependent claims: 2-18.
19. A computer program product embodied on a computer-readable medium and comprising program code for directing at least one computer to process document data, comprising:
a program code segment for implementing a token inverted file system storing tokens obtained by at least one tokenizer from document data; and
a computer program code segment for implementing an annotation inverted file system storing annotations, a list of one or more occurrences of each annotation, and, for each listed occurrence, a set comprised of at least two token locations spanned by the respective annotation.
Dependent claims: 20-36.
37. A method for processing document data, comprising:
storing tokens in a token inverted file system that are obtained by at least one tokenizer from document data; and
storing, in an annotation inverted file system, annotations, a list of one or more occurrences of each annotation, and, for each listed occurrence, a set comprised of at least two token locations spanned by the respective annotation.
Dependent claims: 38-53.
Specification