Method and apparatus for establishing topic word classes based on an entropy cost function to retrieve documents represented by the topic words
Abstract
A computer-based method and system for establishing topic words to represent a document, the topic words being suitable for use in document retrieval. The method includes determining document keywords from the document; classifying each of the document keywords into one of a plurality of preestablished keyword classes; and selecting words as the topic words, each selected word from a different one of the preestablished keyword classes, to minimize a cost function on proposed topic words. The cost function may be a metric of dissimilarity, such as cross-entropy, between a first distribution of likelihood of appearance by the plurality of document keywords in a typical document and a second distribution of likelihood of appearance by the plurality of document keywords in a typical document, the second distribution being approximated using proposed topic words. The cost function can be a basis for sorting the priority of the documents.
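As an illustration of the dissimilarity metric named in the abstract, the following is a minimal sketch of cross-entropy and the closely related Kullback-Leibler divergence between two discrete distributions. The function names are hypothetical, and the abstract does not say which of the two quantities the patent means by "cross-entropy", so both are shown. The sketch assumes both distributions are lists over the same outcomes and that q(x) > 0 wherever p(x) > 0.

```python
import math

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_x p(x) * log2 q(x), in bits.

    Assumes q(x) > 0 wherever p(x) > 0; otherwise math.log2 raises ValueError.
    """
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """Relative entropy D(p || q) = sum_x p(x) * log2(p(x) / q(x)), in bits.

    It is zero when q equals p and positive otherwise.
    """
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

# Illustrative values only: p stands for the first distribution and q for
# an approximation of it built from proposed topic words.
p = [0.5, 0.5]
q = [0.25, 0.75]
print(cross_entropy(p, q), kl_divergence(p, q))
```

Under either reading, a better approximation q gives a smaller value, which is what makes the quantity usable as a cost to minimize.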
23 Claims
1. In a computer, a method for establishing topic words to represent a document wherein the topic words are suitable for inclusion in a computer database index structure, the method comprising the steps of:
accepting at least a portion of the document from a data input device, wherein the portion of the document includes words;
determining a plurality of document keywords from the portion of the document;
classifying each of the document keywords into one of a plurality of preestablished keyword classes; and
selecting words as the topic words, each said selected word from a different one of the preestablished keyword classes, to minimize an entropy-based cost function on proposed topic words.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
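The classifying and selecting steps of claim 1 can be sketched as an exhaustive search over every combination that takes one keyword from each class. This is only one possible search strategy, not necessarily the one the patent uses. The names `classify`, `select_topic_words`, `class_of`, and `toy_cost` are hypothetical, and the cost function is left as a parameter because the claim does not define it; the stand-in cost in the demo is not entropy-based.

```python
from itertools import product

def classify(keywords, class_of):
    """Group keywords by their preestablished class id (class_of: word -> id)."""
    groups = {}
    for word in keywords:
        groups.setdefault(class_of[word], []).append(word)
    return groups

def select_topic_words(keywords, class_of, cost):
    """Return the one-keyword-per-class tuple with the smallest cost.

    Exhaustive search: the number of candidates is the product of the
    class sizes, so this is only practical for small keyword sets.
    """
    groups = classify(keywords, class_of)
    ordered = [groups[class_id] for class_id in sorted(groups)]
    return min(product(*ordered), key=cost)

# Hypothetical data: two classes and a placeholder additive cost.
class_of = {"cat": 1, "dog": 1, "bank": 2, "loan": 2}
weights = {"cat": 3, "dog": 1, "bank": 2, "loan": 5}

def toy_cost(candidate):
    # Stand-in only; a real implementation would use an entropy-based cost.
    return sum(weights[w] for w in candidate)

print(select_topic_words(["cat", "dog", "bank", "loan"], class_of, toy_cost))
```

Passing the cost as a callable keeps the search independent of the particular entropy-based metric chosen.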
17. In a computer, a method for establishing topic words to represent a document from a database, the document including words, the topic words to be used by a computer-based document retrieval engine having a user interface, the method comprising the steps of:
extracting a subset of the words as nonrepeating document keywords;
grouping the document keywords into ordered groups d_1 to d_c, wherein each group uniquely corresponds to one of a plurality of preestablished keyword classes numbered from 1 to a number c from most common to least common in the database;
identifying, for each of the c-1 most common groups d_1 to d_{c-1}, a document keyword k_i that has greatest summed mutual information ##EQU20## wherein i is the number of the each group, K_i is a random variable corresponding to appearance or nonappearance of the keyword k_i in a typical document, and W_j is a random variable corresponding to appearance or nonappearance of a word w_j in a typical document; and
identifying, for the least common group c, a document keyword k_c that has greatest mutual information I(K_{c-1}; K_c), wherein the document keywords k_1, . . . , k_c are the topic words.
Dependent claims: 18
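Claim 17 relies on the mutual information between binary "appears / does not appear" variables. The sketch below estimates that quantity from document counts and picks the keyword in a group with the greatest summed value. The equation giving the exact set of words w_j to sum over is elided in this text (##EQU20##), so that set is passed in as the hypothetical parameter `others`. The function names and the representation of the corpus as a non-empty list of word sets are also assumptions.

```python
import math

def mutual_information(docs, a, b):
    """Estimate I(A; B) in bits, where A and B are the binary events
    'word a appears in a document' and 'word b appears in a document'.

    docs is a non-empty list of sets of words; probabilities are
    relative frequencies over docs.
    """
    n = len(docs)
    total = 0.0
    for in_a in (True, False):
        for in_b in (True, False):
            joint = sum(1 for d in docs if (a in d) == in_a and (b in d) == in_b) / n
            if joint > 0:
                pa = sum(1 for d in docs if (a in d) == in_a) / n
                pb = sum(1 for d in docs if (b in d) == in_b) / n
                total += joint * math.log2(joint / (pa * pb))
    return total

def best_keyword(docs, group, others):
    """Return the keyword in group with the greatest summed mutual
    information against the words in others (an assumed summation set)."""
    return max(group, key=lambda k: sum(mutual_information(docs, k, w) for w in others))

# Hypothetical four-document corpus: "a" always co-occurs with "w", "b" never does.
docs = [{"a", "w"}, {"a", "w"}, {"b"}, set()]
print(best_keyword(docs, ["a", "b"], ["w"]))
```

Two words that always appear together in half of the documents share one full bit of information, while statistically independent words share none.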
19. A computer-implemented method for matching a query from a user to at least one document in a database based on a sequence of topic words stored for each document in the database, the method comprising the steps of:
accepting the query from a data input device;
parsing the query to form query keywords which satisfy a predetermined definition of domain keywords;
computing a closeness score between the parsed query and the topic words for each of a plurality of documents in the database using an entropy-based metric; and
sorting the computed closeness score of at least two of the plurality of documents in the database.
Dependent claims: 20, 21
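The parse, score, and sort pipeline of claim 19 can be sketched as follows. All names here (`parse_query`, `rank_documents`, `topic_index`, `mismatch_count`) are hypothetical. The claim does not define the entropy-based metric, so the scoring function is a parameter and the demo uses a simple mismatch count as a stand-in that is not entropy-based. The sketch also assumes that a lower score means a closer match and that the domain keyword definition is a plain set of allowed words.

```python
def parse_query(query, domain_keywords):
    """Keep only the query words that are in the domain keyword set."""
    return [w for w in query.lower().split() if w in domain_keywords]

def rank_documents(query, topic_index, domain_keywords, closeness):
    """Score every document's stored topic words against the parsed query
    and return (doc_id, score) pairs sorted from lowest score to highest.

    topic_index maps doc_id -> sequence of topic words.
    closeness(query_keywords, topic_words) returns a number; lower is
    assumed to mean closer.
    """
    keywords = parse_query(query, domain_keywords)
    scored = [(doc_id, closeness(keywords, topics))
              for doc_id, topics in topic_index.items()]
    return sorted(scored, key=lambda pair: pair[1])

def mismatch_count(query_keywords, topic_words):
    # Stand-in metric only: counts query keywords absent from the topic words.
    return sum(1 for w in query_keywords if w not in topic_words)

# Hypothetical index of two documents.
topic_index = {"d1": ["bank", "loan"], "d2": ["cat", "dog"]}
domain = {"bank", "loan", "cat", "dog"}
print(rank_documents("Bank loan rates", topic_index, domain, mismatch_count))
```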
22. A system for establishing topic words for characterizing a document comprising:
a parser for determining a plurality of document keywords from the document;
a keyword classifier for generating groupings of the document keywords according to predefined classes; and
an entropy-based cost function minimizer for minimizing a cost function on potential topic words and establishing a lowest-cost set of potential topic words as the topic words for the document.
23. A system for matching a query from a user to at least one document in a database based on a plurality of topic words stored in an index structure for each document in the database, the system comprising:
a parser for parsing the query to form query keywords which satisfy a predetermined definition of domain keywords;
a closeness score computer for computing a closeness score between the parsed query and the topic words for each of a plurality of documents in the database using an entropy based metric, wherein the closeness score computer uses the same number of topic words for each of the plurality of documents in the database; and
a sorter for sorting the computed closeness score of at least two of the plurality of documents in the database.
Specification