System for automatically organizing data in accordance with pattern hierarchies therein
Abstract
An information retrieval system processes a report-based data stream which includes report data having text lines comprised of fields, wherein each field is described by the type of data in the field. The system automatically abstracts the text line patterns from the report data by automatically classifying the text lines into species representative of text lines having a predetermined relationship to one another, automatically creating a definition of each species in terms of the species' constituent fields and automatically creating a virtual table definition specifying the hierarchical relationships among species based on functional type. The system automatically creates tables listing for each text line in the report, the species which the text line best matches, links entries in the list based on relationships specified in the virtual table, and then utilizes the linked list to generate virtual records in response to user-generated queries.
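The idea of describing each field by the type of data it holds can be illustrated with a minimal sketch. The four character classes ("N", "A", blank, "P"), the "X" marker for mixed fields, and the function names `char_class`, `line_signature` and `line_fields` are assumptions made for illustration and are not taken from the patent text.

```python
import re

def char_class(ch: str) -> str:
    """Map one character to a coarse class. The four classes are an assumption."""
    if ch.isdigit():
        return "N"  # numeric
    if ch.isalpha():
        return "A"  # alphabetic
    if ch.isspace():
        return " "  # blank (inter-field space)
    return "P"      # punctuation or anything else

def line_signature(line: str) -> str:
    """Replace every character of a text line by its class code."""
    return "".join(char_class(ch) for ch in line)

def line_fields(line: str):
    """Return (start, end, type) for each run of non-blank characters.

    A field whose characters all share one class gets that class as its type;
    otherwise it is marked "X" (mixed), a hypothetical convention.
    """
    result = []
    for match in re.finditer(r"\S+", line):
        classes = set(line_signature(match.group()))
        field_type = classes.pop() if len(classes) == 1 else "X"
        result.append((match.start(), match.end(), field_type))
    return result

print(line_signature("INV 0042  12.50"))
print(line_fields("INV 0042  12.50"))
```

Two report lines with the same signature would be natural candidates for the same species; how the patented system actually groups lines is defined by the claims, not by this sketch.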
24 Claims
1. In a system of the type including a computer with associated storage media and a user interface coupled thereto for retrieving information from a report-based stream of data which includes data from a report, the improvement comprising:
a routine executed by the computer, including:
(a) a first portion automatically identifying and defining patterns in data from a report and a hierarchy among such patterns; and
(b) a second portion using the patterns and the hierarchy to automatically extract information from the data to permit creation of virtual records in response to queries.
10. In a system of the type including a computer with associated storage media and a user interface coupled thereto for retrieving information in a report-based stream of data which includes data from a report arranged in text lines comprised of fields, wherein each field is described by the type of data in the field, the improvement comprising:
a routine executed by the computer, including:
(a) a first portion automatically classifying the text lines into text line species, wherein each species is representative of text lines having a predetermined relationship to one another; and
(b) a second portion automatically creating a definition of each species in terms of the constituent fields of the species to form an inventory of defined species.
(a) the text line is already included in a species in that the text line would not alter the species definition;
(b) the text line is related to a species and is added as a member of the species to alter the species definition;
or (c) the text line is unrelated to an existing species and is used to create a new species.
13. The system of claim 12, wherein a text line is already included in a species if the text line has a degree of membership to the species within a predetermined threshold and corresponds to the species as to field data type and inter-field space regions, wherein the degree of membership of a text line to a species is determined by the average similarity of character types at each character position.
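The degree of membership recited in claim 13 can be sketched as an average of per-position similarities. In the sketch below the similarity at a position is simply 1 when the character classes agree and 0 otherwise, the species is represented by a template string of class codes, and the 0.9 threshold is a hypothetical value; none of these choices is stated in the claim. The additional conditions on field data type and inter-field space regions are not modeled.

```python
def char_class(ch: str) -> str:
    """Coarse character classes (an illustrative assumption)."""
    if ch.isdigit():
        return "N"
    if ch.isalpha():
        return "A"
    if ch.isspace():
        return " "
    return "P"

def membership(line: str, template: str) -> float:
    """Average, over character positions, of a 0/1 class similarity.

    `template` is a string of class codes standing in for a species.
    Shorter inputs are padded with blanks so positions line up.
    """
    width = max(len(line), len(template))
    if width == 0:
        return 1.0
    line_classes = "".join(char_class(ch) for ch in line).ljust(width)
    hits = sum(a == b for a, b in zip(line_classes, template.ljust(width)))
    return hits / width

THRESHOLD = 0.9  # hypothetical "predetermined threshold"

def already_included(line: str, template: str, threshold: float = THRESHOLD) -> bool:
    """True when the membership score reaches the threshold."""
    return membership(line, template) >= threshold

print(membership("INV 0042", "AAA NNNN"))  # all eight positions agree
print(membership("INV 00A2", "AAA NNNN"))  # one position of eight differs
```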
14. The system of claim 11, wherein the first portion of the routine includes a sub-portion constructing for each species a template which is a listing of the most predominant character class for each character position in the collection of lines making up the species.
15. The system of claim 14, wherein said third sub-portion of the routine includes a portion constructing an initial template using a first text line in a collection of text lines, and a portion adjusting the template after each text line is added as a member of the species by comparing all current members of the species to the template for conformance with the predominant character class representation at each character position and, if more than half the current members fail to conform to the template, using all the members to find the character class with the maximum frequency at each character position and replacing the template at that position with the new character class representation.
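One possible reading of the template construction and adjustment in claims 14 and 15 is sketched below. It takes the initial template from the first line and, after each added member, replaces the template entry at any position where more than half of the current members disagree with it. Applying the "more than half" test per position is an interpretation, and the `Species` class and character classes are hypothetical names introduced for illustration.

```python
from collections import Counter

def char_class(ch: str) -> str:
    """Coarse character classes (an illustrative assumption)."""
    if ch.isdigit():
        return "N"
    if ch.isalpha():
        return "A"
    if ch.isspace():
        return " "
    return "P"

class Species:
    """A collection of text lines plus a per-position class template."""

    def __init__(self, first_line: str):
        self.members = [first_line]
        # Initial template: the class of each character of the first line.
        self.template = [char_class(ch) for ch in first_line]

    def add(self, line: str) -> None:
        """Add a member, then adjust the template position by position."""
        self.members.append(line)
        width = max(len(m) for m in self.members)
        self.template += [" "] * (width - len(self.template))
        for pos in range(width):
            column = [char_class(m[pos]) if pos < len(m) else " "
                      for m in self.members]
            misses = sum(c != self.template[pos] for c in column)
            if misses > len(column) / 2:
                # Most frequent class among all members at this position.
                self.template[pos] = Counter(column).most_common(1)[0][0]

species = Species("AB12")
species.add("A112")   # one of two members disagrees at position 1: no change
species.add("A212")   # two of three disagree at position 1: template updated
print("".join(species.template))
```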
16. The system of claim 10, wherein the routine includes a third portion automatically generating a virtual table classifying each species by functional type based upon the location and frequency of occurrence of that species in a list of species best matched by each text line in the report.
17. The system of claim 16, wherein the third portion of the routine includes a portion sorting the line species in the virtual table in the order of header species, detail species and trailer species, wherein detail species are species with high occurrence frequency and high variation on location, header species are species which occur before the first detail species and trailer species are species which occur after the last detail species.
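The header, detail and trailer ordering of claim 17 can be approximated from the list of best-matching species per line. In this sketch "high occurrence frequency" and "high variation on location" are reduced to two hypothetical thresholds (`min_count`, `min_spread`, using the population standard deviation of line positions), and species that fit none of the three types are labeled "other"; these are assumptions, not values from the patent.

```python
from collections import defaultdict
from statistics import pstdev

ORDER = {"header": 0, "detail": 1, "trailer": 2, "other": 3}

def classify_species(match_list, min_count=3, min_spread=1.0):
    """Assign a functional type to each species in a best-match list.

    match_list[i] is the species best matched by report line i.
    """
    positions = defaultdict(list)
    for line_no, species in enumerate(match_list):
        positions[species].append(line_no)

    # Detail species: frequent and spread out over the report.
    detail = {s for s, pos in positions.items()
              if len(pos) >= min_count and pstdev(pos) >= min_spread}
    if not detail:
        return {s: "other" for s in positions}

    first_detail = min(positions[s][0] for s in detail)
    last_detail = max(positions[s][-1] for s in detail)

    roles = {}
    for s, pos in positions.items():
        if s in detail:
            roles[s] = "detail"
        elif pos[0] < first_detail:
            roles[s] = "header"    # occurs before the first detail line
        elif pos[0] > last_detail:
            roles[s] = "trailer"   # occurs after the last detail line
        else:
            roles[s] = "other"
    return roles

def virtual_table_order(match_list, **thresholds):
    """Species sorted as header, detail, trailer, then by first occurrence."""
    roles = classify_species(match_list, **thresholds)
    return sorted(roles.items(),
                  key=lambda item: (ORDER[item[1]], match_list.index(item[0])))

matches = ["title", "colhead", "item", "item", "item", "item", "total"]
print(virtual_table_order(matches))
```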
18. The system of claim 10, wherein the defined species make up an inventory, the routine further including a portion for improving the inventory by applying the inventory to a stream of data and automatically adjusting the inventory to include every text line in the stream.
19. The system of claim 18, wherein the third portion of the routine includes a portion which classifies each text line in the data stream to determine whether the text line matches a defined species definition exactly, partially or not at all, creates a collection of unmatched text lines and automatically classifies the collection into new species;
and a portion determining whether any partially matched text line more closely matches the defined species definition which the partially matched text line partially matches or the definition of a new species and, if the former, adds the text line to the partially matched defined species and rebuilds the species definition accordingly and, if the latter, includes the text line in the collection of unmatched text lines and adds the text line to an existing new species or creates an additional new species for the text line.
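The first step of claim 19, sorting incoming lines into exact, partial and unmatched groups, might look like the sketch below. It scores each line against class-code templates with a 0/1 per-position similarity and treats a score of 1.0 as exact and a score at or above a hypothetical `partial` threshold of 0.7 as partial. The scoring rule, the threshold and all function names are assumptions; the later step of deciding between a defined species and a new species is not shown.

```python
def char_class(ch: str) -> str:
    """Coarse character classes (an illustrative assumption)."""
    if ch.isdigit():
        return "N"
    if ch.isalpha():
        return "A"
    if ch.isspace():
        return " "
    return "P"

def membership(line: str, template: str) -> float:
    """Fraction of positions whose character class equals the template's."""
    width = max(len(line), len(template))
    if width == 0:
        return 1.0
    line_classes = "".join(char_class(ch) for ch in line).ljust(width)
    hits = sum(a == b for a, b in zip(line_classes, template.ljust(width)))
    return hits / width

def match_kind(line, templates, partial=0.7):
    """Return ("exact" | "partial" | "none", best species name or None)."""
    best_name, best_score = None, 0.0
    for name, template in templates.items():
        score = membership(line, template)
        if score > best_score:
            best_name, best_score = name, score
    if best_score == 1.0:
        return "exact", best_name
    if best_score >= partial:
        return "partial", best_name
    return "none", None

def partition(lines, templates, partial=0.7):
    """Split lines into exact matches, partial matches and unmatched lines."""
    exact, partly, unmatched = [], [], []
    for line in lines:
        kind, name = match_kind(line, templates, partial)
        if kind == "exact":
            exact.append((line, name))
        elif kind == "partial":
            partly.append((line, name))
        else:
            unmatched.append(line)  # later classified into new species
    return exact, partly, unmatched

templates = {"detail": "AAA NNNN"}
print(partition(["INV 0042", "INV 00A2", "--------"], templates))
```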
20. In a system of the type including a computer with associated storage media and a user interface coupled thereto for retrieving meaningful information from a report-based stream of data which includes data from a report arranged in text lines, wherein there exists a classification of the text lines into species such that each species is representative of text lines having a predetermined relationship to one another, the improvement comprising:
a routine executed by the computer, including:
(a) a first portion automatically scanning data from a report line-by-line and determining for each text line the species which the text line best matches; and
(b) a second portion creating a list showing for each text line of the report the species which the text line best matches.
22. A computer-implemented method for retrieving meaningful information from a report-based stream of data which includes data from a report, the method comprising:
automatically identifying and defining patterns in data from a report and a hierarchy among such patterns; and
utilizing the patterns and the hierarchy to automatically extract information from the data to permit creation of virtual records in response to queries.
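How a hierarchy among line patterns could permit creation of virtual records can be shown with a small sketch. It assumes each report line has already been paired with its best-matching species and that each species has a functional type; a virtual record is then formed by combining each detail line with the most recent line of every header species. The data layout and the names `virtual_records` and `query` are hypothetical and are not taken from the claims.

```python
def virtual_records(matched_lines, roles):
    """Build one record per detail line, carrying the current header lines.

    matched_lines: list of (species, text) pairs in report order.
    roles: mapping species -> "header" | "detail" | "trailer".
    """
    current_headers = {}
    records = []
    for species, text in matched_lines:
        role = roles.get(species)
        if role == "header":
            current_headers[species] = text  # newest header of this species
        elif role == "detail":
            record = dict(current_headers)   # copy so later headers do not leak back
            record[species] = text
            records.append(record)
        # trailer lines and unknown species are ignored in this sketch
    return records

def query(records, predicate):
    """Answer a query by filtering the virtual records."""
    return [record for record in records if predicate(record)]

report = [
    ("cust", "Customer: ACME"),
    ("item", "Widget 2"),
    ("item", "Gadget 1"),
    ("cust", "Customer: Zed"),
    ("item", "Bolt 9"),
    ("total", "Total 12"),
]
roles = {"cust": "header", "item": "detail", "total": "trailer"}

records = virtual_records(report, roles)
print(query(records, lambda r: r["cust"].endswith("ACME")))
```

The records are built on demand from the linked lines rather than stored, which is the sense in which they are "virtual" here.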
23. A computer-implemented method for retrieving information in a report-based stream of data which includes data from a report arranged in text lines comprised of fields, wherein each field is described by the type of data in the field, the method comprising:
automatically classifying the text lines into text line species, wherein each species is representative of text lines having a predetermined relationship to one another; and
automatically creating a definition of each species in terms of the species' constituent fields to form defined species.
24. A computer-implemented method for obtaining meaningful information from a report-based stream of data which includes data from a report arranged in text lines, wherein there exists a classification of the text lines into species such that each species is representative of text lines having a predetermined relationship to one another, the method comprising:
automatically scanning data from a report line-by-line and determining for each text line the species which the text line best matches; and
creating a list showing for each text line of the report the species which the text line best matches.
Specification