Method and apparatus for establishing topic word classes based on an entropy cost function to retrieve documents represented by the topic words

US 6,128,613 A
Filed: 04/29/1998
Issued: 10/03/2000
Est. Priority Date: 06/26/1997
Status: Expired due to Term

Abstract

A computer-based method and system for establishing topic words to represent a document, the topic words being suitable for use in document retrieval. The method includes determining document keywords from the document; classifying each of the document keywords into one of a plurality of preestablished keyword classes; and selecting words as the topic words, each selected word from a different one of the preestablished keyword classes, to minimize a cost function on proposed topic words. The cost function may be a metric of dissimilarity, such as cross-entropy, between a first distribution of likelihood of appearance by the plurality of document keywords in a typical document and a second distribution of likelihood of appearance by the plurality of document keywords in a typical document, the second distribution being approximated using proposed topic words. The cost function can be a basis for sorting the priority of the documents.
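The dissimilarity metric described in the abstract can be illustrated with a small sketch. It assumes that "cross-entropy" is read as relative entropy (Kullback-Leibler divergence) between a reference distribution and its approximation; that reading, the function name `kl_divergence`, and the toy distributions are illustrative assumptions, not text taken from the patent.

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits.

    p and q are dicts mapping an outcome to its probability.
    (Assumed form of the dissimilarity metric; hypothetical helper.)
    """
    total = 0.0
    for outcome, p_x in p.items():
        if p_x == 0.0:
            continue  # a zero-probability outcome contributes nothing
        q_x = q.get(outcome, 0.0)
        if q_x == 0.0:
            return math.inf  # q rules out an outcome that p allows
        total += p_x * math.log2(p_x / q_x)
    return total

# Toy example: likelihood that one keyword appears in a typical document.
first_dist = {"present": 0.5, "absent": 0.5}
good_approx = {"present": 0.5, "absent": 0.5}
poor_approx = {"present": 0.25, "absent": 0.75}

cost_good = kl_divergence(first_dist, good_approx)
cost_poor = kl_divergence(first_dist, poor_approx)
```

Under this reading, a set of proposed topic words whose approximation yields the lower cost would be preferred.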

23 Claims

1. In a computer, a method for establishing topic words to represent a document wherein the topic words are suitable for inclusion in a computer database index structure, the method comprising the steps of:
- accepting at least a portion of the document from a data input device, wherein the portion of the document includes words;
  
  determining a plurality of document keywords from the portion of the document;
  
  classifying each of the document keywords into one of a plurality of preestablished keyword classes; and
  
  selecting words as the topic words, each said selected word from a different one of the preestablished keyword classes, to minimize an entropy-based cost function on proposed topic words.
  - 2. The method according to claim 1 wherein the cost function is a metric of dissimilarity between a first statistical distribution of likelihood of appearance by the plurality of document keywords in a typical document and a second statistical distribution of likelihood of appearance by the plurality of document keywords in a typical document, the second statistical distribution being approximated using proposed topic words.
  - 3. The method according to claim 2 wherein the metric of dissimilarity is cross-entropy.
  - 4. The method according to claim 3 wherein the step of selecting words from the document keywords comprises minimizing or maximizing an alternate metric to thereby minimize the cross-entropy without actually computing cross-entropy.
  - 5. The method according to claim 2 wherein the preestablished keyword classes are ordered such that at least one of the keyword classes has an immediately succeeding keyword class, and wherein the step of selecting words from the document keywords as the topic words comprises the steps of:
    - computing a sum of mutual information between appearance by a document keyword and appearance by each of the document keywords that belong to an immediately succeeding keyword class; and
      
      selecting a document keyword, from the plurality of document keywords belonging to a particular one of the keyword classes, that has the highest summed mutual information with respect to document keywords that belong to an immediately succeeding keyword class as the topic keyword from the particular keyword class.
  - 6. The method according to claim 2 wherein:
    - the preestablished keyword classes are ordered such that at least one of the keyword classes has an immediately preceding keyword class from among the preestablished keyword classes;
      
      the first statistical distribution of likelihood includes an assumption of conditional independence under which the likelihood of appearance of a keyword in any particular document depends only on the appearance or nonappearance of keywords from an immediately preceding keyword class in that particular document; and
      
      the second distribution of likelihood includes an approximation under which the likelihood of appearance of a keyword in any particular document depends only on the appearance or nonappearance of a proposed topic keyword from an immediately preceding keyword class in that particular document.
  - 7. The method according to claim 1 wherein the keyword classes consist of less than about fifty keyword classes.
  - 8. The method according to claim 1 wherein the keyword classes consist of less than about twenty keyword classes.
  - 9. The method according to claim 1 wherein the preestablished keyword classes have an ordering of classes, and an arbitrary word in one of the preestablished keyword classes is expected to appear substantially no more frequently in an information domain than any arbitrary word in an immediately preceding keyword class.
  - 10. The method according to claim 1 wherein:
    - the step of determining the plurality of document keywords comprises determining as the document keywords only words which qualify as domain keywords according to a preestablished definition of domain keywords.
  - 11. The method according to claim 10 wherein:
    - there exists a predetermined set of domain keywords found in the database;
      
      the step of determining as the document keywords only words which qualify as domain keywords according to a preestablished definition of domain keywords comprises extracting words from the portion of the document that belong to the predetermined set of domain keywords; and
      
      the method further comprises, before the step of classifying each of the document keywords into one of the preestablished keyword classes, the step of partitioning the set of domain keywords into an ordered plurality of domain keyword groups wherein a word in an arbitrary one of the domain keyword groups is expected to appear in substantially no more documents in the database than any word in a lower-ordered domain keyword group, and wherein the ordered plurality of domain keyword groups forms the preestablished keyword classes.
  - 12. The method according to claim 11 wherein the step of partitioning the set of domain keywords comprises the step of controlling relative numbers of domain keywords partitioned into each domain keyword group.
  - 13. The method according to claim 12 wherein the step of controlling relative numbers of domain keywords comprises substantially equalizing, across all the domain keyword groups, a sum for each domain keyword group of the probability for each domain keyword in the each domain keyword group that a randomly selected document from the database includes the each domain keyword.
  - 14. The method according to claim 11 wherein there exists a set of domain keyword training documents from an information domain, the method further comprising, before the step of determining a plurality of document keywords, the step of extracting non-stop words from the set of domain keyword training documents to thereby establish the set of domain keywords.
  - 15. The method according to claim 1 further comprising a step of writing the topic keywords to a computer-readable memory product to form a computer-readable database index product of topic keywords for at least one document, wherein the database index product includes the computer-readable memory product.
  - 16. A computer-readable database index product formed according to the method of claim 15.
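The partitioning step of claims 11 to 13 can be sketched as follows. This is a minimal greedy illustration, assuming document counts per domain keyword are available; the function name `partition_keywords`, the greedy strategy, and the sample counts are hypothetical. Dividing each count by the database size gives the probability named in claim 13, and that constant factor does not change the partition.

```python
def partition_keywords(doc_count, num_classes):
    """Split domain keywords into ordered classes, most common first,
    so that each class holds roughly the same total document count.

    doc_count maps keyword -> number of database documents containing it.
    (Greedy sketch under stated assumptions, not the patented procedure.)
    """
    # Most common keywords first; ties broken alphabetically for determinism.
    ordered = sorted(doc_count, key=lambda w: (-doc_count[w], w))
    target = sum(doc_count.values()) / num_classes
    classes = [[] for _ in range(num_classes)]
    idx, mass = 0, 0
    for word in ordered:
        # Start the next class once the current one has reached its share.
        if idx < num_classes - 1 and mass >= target:
            idx += 1
            mass = 0
        classes[idx].append(word)
        mass += doc_count[word]
    return classes

# Hypothetical counts for a small information domain.
counts = {"data": 40, "system": 40, "index": 20, "query": 20,
          "cluster": 10, "entropy": 10, "kernel": 10, "lattice": 10}
classes = partition_keywords(counts, 2)

# Lookup used to classify each document keyword into its class.
class_of = {word: i for i, group in enumerate(classes) for word in group}
```

With these counts, the two most common keywords fill the first class and the six rarer keywords share the second, so both classes carry the same total count.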

17. In a computer, a method for establishing topic words to represent a document from a database, the document including words, the topic words to be used by a computer-based document retrieval engine having a user interface, the method comprising the steps of:
- extracting a subset of the words as nonrepeating document keywords;
  
  grouping the document keywords into ordered groups d₁ to d_c, wherein each group uniquely corresponds to one of a plurality of preestablished keyword classes numbered from 1 to a number c from most common to least common in the database;
  
  identifying, for each of the c-1 most common groups d₁ to d_{c-1}, a document keyword k_i that has greatest summed mutual information Σ_{w_j ∈ d_{i+1}} I(K_i; W_j), wherein i is the number of the each group, K_i is a random variable corresponding to appearance or nonappearance of the keyword k_i in a typical document, and W_j is a random variable corresponding to appearance or nonappearance of a word w_j in a typical document; and
  
  identifying, for the least common group c, a document keyword k_c that has greatest mutual information I(K_{c-1}; K_c), wherein the document keywords k₁, . . . , k_c are the topic words.
  - 18. The method according to claim 17 wherein mutual information for any two random variables W_i, W_j is defined as:
    - I(W_i; W_j) = Σ_{w_i, w_j} P(w_i, w_j) log [ P(w_i, w_j) / ( P(w_i) P(w_j) ) ]
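The selection procedure of claims 17 and 18 can be sketched as below. The sketch assumes the standard definition of mutual information for two binary appearance variables, and it estimates probabilities by counting over a small corpus of documents represented as word sets. The function names, the corpus, and the grouping are hypothetical illustrations.

```python
import math

def mutual_information(a, b, docs):
    """Mutual information (in bits) between appearance of word a and
    appearance of word b, estimated from docs, a list of word sets."""
    n = len(docs)
    mi = 0.0
    for a_val in (True, False):
        for b_val in (True, False):
            p_a = sum((a in d) == a_val for d in docs) / n
            p_b = sum((b in d) == b_val for d in docs) / n
            p_ab = sum((a in d) == a_val and (b in d) == b_val for d in docs) / n
            if p_ab > 0:
                mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

def select_topic_words(groups, docs):
    """groups[i] holds the document keywords of class i, most common class
    first. Returns one topic word per group."""
    if len(groups) < 2:
        raise ValueError("need at least two keyword groups")
    topics = []
    for i in range(len(groups) - 1):
        successors = groups[i + 1]
        # Keyword with the greatest summed mutual information with the
        # keywords of the immediately succeeding group.
        best = max(groups[i],
                   key=lambda k: sum(mutual_information(k, w, docs) for w in successors))
        topics.append(best)
    # Last group: greatest mutual information with the previous topic word.
    topics.append(max(groups[-1],
                      key=lambda k: mutual_information(topics[-1], k, docs)))
    return topics

# Hypothetical corpus and keyword groups for one document.
corpus = [
    {"data", "system", "index", "entropy"},
    {"data", "index", "entropy"},
    {"system", "query"},
    {"kernel"},
]
groups = [["data", "system"], ["index", "query"], ["entropy", "kernel"]]
topic_words = select_topic_words(groups, corpus)
```

In this toy corpus "data", "index", and "entropy" always appear together, so each carries the most information about the next class and the three are chosen as topic words.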

19. A computer-implemented method for matching a query from a user to at least one document in a database based on a sequence of topic words stored for each document in the database, the method comprising the steps of:
- accepting the query from a data input device;
  
  parsing the query to form query keywords which satisfy a predetermined definition of domain keywords;
  
  computing a closeness score between the parsed query and the topic words for each of a plurality of documents in the database using an entropy-based metric; and
  
  sorting the computed closeness score of at least two of the plurality of documents in the database.
  - 20. The method according to claim 19 wherein the step of computing a closeness score comprises the steps of:
    - classifying each of the query keywords into one of a plurality of preestablished keyword classes; and
      
      computing as the closeness score a conditional likelihood of appearance by the query keywords given appearance by the topic words of one of the plurality of documents in the database.
  - 21. The method according to claim 20 wherein the database is the Internet and the method further comprises:
    - providing links to a user to documents in the database which have a closeness score above a selectable threshold value.
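The query matching of claims 19 to 21 can be sketched as follows. The claims do not spell out how the conditional likelihood is computed, so this sketch makes several assumptions: each query keyword is conditioned on the document's topic word from the immediately preceding class (echoing claim 6), keywords in the first class use their marginal likelihood, probabilities are estimated by counting with add-one smoothing, and the score is a sum of log-likelihoods. All names and data are hypothetical.

```python
import math

def cond_prob(word, given, docs):
    """Estimate P(word appears | given appears) with add-one smoothing.
    If given is None, estimate the marginal P(word appears)."""
    pool = docs if given is None else [d for d in docs if given in d]
    hits = sum(1 for d in pool if word in d)
    return (hits + 1) / (len(pool) + 2)

def closeness(query_keywords, topic_words, class_of, docs):
    """Log-likelihood of the query keywords given one document's topic words."""
    score = 0.0
    for q in query_keywords:
        i = class_of[q]
        # Assumption: condition on the topic word of the preceding class.
        given = topic_words[i - 1] if i > 0 else None
        score += math.log(cond_prob(q, given, docs))
    return score

def rank(query_keywords, index, class_of, docs):
    """Return (score, doc_id) pairs sorted from closest to least close."""
    scored = [(closeness(query_keywords, topics, class_of, docs), doc_id)
              for doc_id, topics in index.items()]
    return sorted(scored, reverse=True)

# Hypothetical corpus, keyword classes, and per-document topic words.
corpus = [
    {"data", "system", "index", "entropy"},
    {"data", "index", "entropy"},
    {"system", "query"},
    {"kernel"},
]
class_of = {"data": 0, "system": 0, "index": 1, "query": 1,
            "entropy": 2, "kernel": 2}
index = {
    "doc_a": ["data", "index", "entropy"],
    "doc_b": ["system", "query", "kernel"],
}
ranking = rank(["index", "entropy"], index, class_of, corpus)
```

Each document stores the same number of topic words, as in claim 23, and a threshold on the score could then select which links to present, as in claim 21.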

22. A system for establishing topic words for characterizing a document comprising:
- a parser for determining a plurality of document keywords from the document;
  
  a keyword classifier for generating groupings of the document keywords according to predefined classes; and
  
  an entropy-based cost function minimizer for minimizing a cost function on potential topic words and establishing a lowest-cost set of potential topic words as the topic words for the document.

23. A system for matching a query from a user to at least one document in a database based on a plurality of topic words stored in an index structure for each document in the database, the system comprising:
- a parser for parsing the query to form query keywords which satisfy a predetermined definition of domain keywords;
  
  a closeness score computer for computing a closeness score between the parsed query and the topic words for each of a plurality of documents in the database using an entropy based metric, wherein the closeness score computer uses the same number of topic words for each of the plurality of documents in the database; and
  
  a sorter for sorting the computed closeness score of at least two of the plurality of documents in the database.

Current Assignee: Chinese University of Hong Kong
Original Assignee: Chinese University of Hong Kong
Inventors: Qin, An; Wong, Wing S.
Primary Examiner(s): Black, Thomas G.
Assistant Examiner(s): Trinh, William B
Application Number: US09/069,618
Time in Patent Office: 888 Days
Field of Search: 707/7, 707/5, 707/532
US Class Current: 707/738
CPC Class Codes:
G06F 16/313   Selection or weighting of t...
G06F 16/353   into predefined classes
Y10S 707/968   Partitioning
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99937   Sorting
