Efficient storage mechanism for representing term occurrence in unstructured text documents
Abstract
A method and structure convert a document corpus containing an ordered plurality of documents into a compact in-memory representation of occurrence data. The representation is based on a dictionary previously developed for the document corpus, in which each term has a corresponding unique integer. The method includes developing a first vector for the entire document corpus: a sequential listing of the unique integers in which each document is represented in turn, according to the occurrence in that document of the corresponding dictionary terms. A second vector, also developed for the entire document corpus, indicates the location of each document's representation in the first vector.
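The two-vector layout described in the abstract can be illustrated with a minimal Python sketch. It is not the patent's implementation: the names `build_occurrence_vectors`, `all_data`, and `starts` are hypothetical, documents are assumed to be pre-tokenized, words missing from the dictionary are assumed to be skipped, and the trailing end offset appended to the second vector is a convenience that the source text does not mention.

```python
from typing import Dict, List, Sequence, Tuple


def build_occurrence_vectors(
    corpus: Sequence[Sequence[str]],
    dictionary: Dict[str, int],
) -> Tuple[List[int], List[int]]:
    """Return (all_data, starts) for an ordered, tokenized corpus."""
    all_data: List[int] = []  # first vector: term integers, one document after another
    starts: List[int] = []    # second vector: index in all_data where each document begins
    for tokens in corpus:
        starts.append(len(all_data))
        for token in tokens:
            term_id = dictionary.get(token)
            if term_id is not None:  # assumption: out-of-dictionary words are dropped
                all_data.append(term_id)
    starts.append(len(all_data))  # assumption: sentinel marking the end of the last document
    return all_data, starts


# Toy corpus and dictionary, invented for illustration only.
corpus = [["cat", "sat", "mat"], ["dog", "sat"], ["cat", "dog", "cat"]]
dictionary = {"cat": 0, "sat": 1, "mat": 2, "dog": 3}
all_data, starts = build_occurrence_vectors(corpus, dictionary)
print(all_data)  # [0, 1, 2, 3, 1, 0, 3, 0]
print(starts)    # [0, 3, 5, 8]
```

Because both vectors are flat integer lists, the whole corpus occupies two contiguous arrays rather than one object per document, which is the compactness the abstract refers to.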
17 Claims
1. A method of converting a document corpus containing an ordered plurality of documents into a compact representation in memory of occurrence data, said representation to be based on a dictionary previously developed for said document corpus and wherein each term in said dictionary has associated therewith a corresponding unique integer, said method comprising:
developing a first vector for said entire document corpus, said first vector being a listing of said unique integers corresponding to dictionary terms such that each said document in said document corpus is sequentially represented in said listing; and
developing a second vector for said entire document corpus, said second vector indicating the location of each said document's representation in said first vector.
Dependent claims: 2, 3, 4
5. A method of converting, organizing, and representing in a computer memory a document corpus containing an ordered plurality of documents, for use by a data mining application program requiring occurrence-of-terms data, said representation to be based on terms in a dictionary previously developed for said document corpus and wherein each said term in said dictionary has associated therewith a corresponding unique integer, said method comprising:
for said document corpus, taking in sequence each said ordered document and developing a first uninterrupted listing of said unique integers to correspond to the occurrence of said dictionary terms in the document corpus; and
developing a second uninterrupted listing for said entire document corpus, containing in sequence the location of each corresponding document in said first uninterrupted listing, wherein said first listing and said second listing are provided as input data for said data mining application program.
Dependent claims: 6, 7, 8
9. An apparatus for organizing and representing in a computer memory a document corpus containing an ordered plurality of documents, for use by a data mining applications program requiring occurrence-of-terms data, said representation to be based on terms in a dictionary previously developed for said document corpus and wherein each said term in said dictionary has associated therewith a corresponding unique integer, said apparatus comprising:
an integer determiner receiving in sequence each said ordered document of said document corpus and developing a first uninterrupted listing of said unique integers to correspond to the occurrence of said dictionary terms in the document corpus; and
a locator developing a second uninterrupted listing for said entire document corpus containing in sequence the location of each corresponding document in said first uninterrupted listing, wherein said first listing and said second listing are provided as input data for said data mining applications program.
Dependent claims: 10, 11, 12, 14, 16, 17
13. A signal-bearing medium tangibly embodying a program of machine-readable instructions executable by a digital processing apparatus to perform a method to organize and represent in a computer memory a document corpus containing an ordered plurality of documents, for use by a data mining algorithm requiring occurrence-of-terms data, said representation to be based on terms in a dictionary previously developed for said document corpus and wherein each said term in said dictionary has associated therewith a corresponding unique integer, said method comprising:
a first uninterrupted listing of said unique integers to correspond to the occurrence of said dictionary terms in the document corpus; and
a second uninterrupted listing for said entire document corpus containing in sequence the location of each corresponding document in said first uninterrupted listing, wherein said first listing and said second listing are provided as input data for said data mining algorithm.
15. A data converter for organizing and representing in a computer memory a document corpus containing an ordered plurality of documents, for use by a data mining applications program requiring occurrence-of-terms data, said representation to be based on terms in a dictionary previously developed for said document corpus and wherein each said term in said dictionary has associated therewith a corresponding unique integer, said data converter comprising:
means for developing a first uninterrupted listing of said unique integers to correspond to the occurrence of said dictionary terms in the document corpus; and
means for developing a second uninterrupted listing for said entire document corpus containing in sequence the location of each corresponding document in said first uninterrupted listing, wherein said first listing and said second listing are provided as input data for said data mining applications program.
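Several of the claims above recite that the two listings are provided as input data to a data mining program. As a rough sketch of what such a consumer might do, the Python below recovers one document's term integers from the first listing using the second listing, then tallies them. The function names and sample values are hypothetical, and the sketch assumes the second listing holds only start positions, so the last document is taken to run to the end of the first listing.

```python
from collections import Counter
from typing import Dict, List, Sequence


def document_terms(
    first_listing: Sequence[int],
    second_listing: Sequence[int],
    doc_index: int,
) -> List[int]:
    """Return the term integers of one document, in order of occurrence."""
    begin = second_listing[doc_index]
    # Assumption: a document ends where the next one starts; the last
    # document ends at the end of the first listing.
    if doc_index + 1 < len(second_listing):
        end = second_listing[doc_index + 1]
    else:
        end = len(first_listing)
    return list(first_listing[begin:end])


def term_counts(
    first_listing: Sequence[int],
    second_listing: Sequence[int],
    doc_index: int,
) -> Dict[int, int]:
    """Count how often each dictionary integer occurs in one document."""
    return dict(Counter(document_terms(first_listing, second_listing, doc_index)))


# Sample listings invented for illustration: three documents, four dictionary terms.
first_listing = [0, 1, 2, 3, 1, 0, 3, 0]
second_listing = [0, 3, 5]

print(document_terms(first_listing, second_listing, 1))  # [3, 1]
print(term_counts(first_listing, second_listing, 2))     # {0: 2, 3: 1}
```

Locating a document costs two index lookups regardless of corpus size, which is one reason a flat offset listing suits algorithms that repeatedly scan documents.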
Specification