Light weight document matcher

US 6,286,000 B1
Filed: 12/01/1998
Issued: 09/04/2001
Est. Priority Date: 12/01/1998
Status: Expired due to Fees

Abstract

A lightweight document matcher employs minimal processing and storage. The lightweight document matcher matches new documents to those stored in a database. The matcher lists, in order, those stored documents that are most similar to the new document. The new documents are typically problem statements or queries, and the stored documents are potential solutions such as FAQs (Frequently Asked Questions). Given a set of documents, titles, and possibly keywords, an automatic back-end process constructs a global dictionary of unique keywords and local dictionaries of relevant words for each document. The application front-end uses this information to score the relevance of stored documents to new documents. The scoring algorithm uses the count of matched words as a base score, and then assigns bonuses to words that have high predictive value. It optionally assigns an extra bonus for a match of words in special sections, e.g., titles. The method uses minimal data structures and lightweight scoring algorithms to compute efficiently even in restricted environments, such as mobile or small desktop computers.
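The scoring idea in the abstract (a base count of matched words, a bonus for predictive words, and an optional title bonus) can be sketched roughly as follows. This is only an illustration: the function name `score`, the `bonus` table, the `title_bonus` parameter, and the sample FAQ data are all assumptions, since the text here does not give the actual formula or weights.

```python
def score(query_words, doc_words, title_words, bonus, title_bonus=1.0):
    """Score one stored document against a new document (illustrative only).

    query_words: set of keywords found in the new document
    doc_words:   local dictionary (keyword set) of the stored document
    title_words: keywords that appear in the stored document's title
    bonus:       hypothetical table mapping a word to its extra credit
    """
    matched = query_words & doc_words
    total = float(len(matched))                        # base score: matched-word count
    total += sum(bonus.get(w, 0.0) for w in matched)   # bonus for predictive words
    total += title_bonus * len(matched & title_words)  # optional special-section bonus
    return total


# Hypothetical stored documents: name -> (local dictionary, title keywords)
faqs = {
    "reset-password": ({"reset", "password", "login"}, {"reset", "password"}),
    "printer-setup": ({"printer", "driver", "install", "login"}, {"printer"}),
}
bonus = {"password": 0.5, "printer": 0.5}  # assumed weights, not from the patent
query = {"forgot", "password", "login"}

# List stored documents in order, most similar first.
ranked = sorted(faqs, key=lambda name: score(query, *faqs[name], bonus), reverse=True)
```

Because each stored document is reduced to a small keyword set, scoring needs only set intersections and table lookups, which fits the abstract's emphasis on minimal processing and storage.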

10 Claims

1. A computer implemented document matcher comprising:
- a back-end processor receiving input documents and generating a first data structure consisting of a set of local dictionaries of keywords for each document and then generating a second data structure consisting of a global dictionary resulting from the union of all keywords in the first data structure, said back-end processor computing a table of word weights; and
  
  a front-end processor for matching input documents against documents represented by said second data structure, said front-end processor computing a score for the documents, then sorting the documents by score, stored documents being ranked by a relevance scoring scheme according to a formula
- Dependent claims: 2, 3, 4, 9
  - 2. The computer implemented document matcher recited in claim 1 wherein the back-end processor comprises:
    - a converter for converting an input document to a standard representation;
      
      a local dictionary extractor and store which receives the standard representation from the converter and generates the first data structure;
      
      a dictionary combiner and global dictionary store which accesses the first data structure stored in the local dictionary extractor and store to generate the second data structure; and
      
      a word weight calculator which computes a table of word weights based upon frequency of use in input documents.
  - 3. The computer implemented document matcher recited in claim 2 wherein the front-end processor comprises:
    - a document accumulator identifying words of an input document by matching against the global dictionary of the second data structure;
      
      a document scorer accessing the word weight table generated by the back-end processor and assigning a score to each document; and
      
      a document sorter sorting a list of matching documents with assigned scores.
  - 4. The computer implemented document matcher recited in claim 3 wherein the document scorer scores documents by assigning a value for every matched word, adding a bonus to the value assigned for a matched word from the word weight table, and adds or subtracts a bonus or penalty for every match or mismatch in special sections of a document.
  - 9. The computer implemented document matcher recited in claim 1, wherein the Bonus includes the title and special section keywords.
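As a rough illustration of the back-end components named in claims 1 and 2, the two data structures and the word weight table might be built along these lines. The function names, the choice of the most frequent words as keywords, the lowercase-and-split stand-in for the converter, and the 1/(document frequency) weight are all assumptions made for the sketch; the claims only say that weights are based on frequency of use.

```python
from collections import Counter


def build_local_dictionaries(documents, max_keywords=5):
    """First data structure: one keyword set per document.

    Keyword selection by raw frequency is an assumption of this sketch.
    """
    local = {}
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())  # crude stand-in for the converter
        local[doc_id] = {word for word, _ in counts.most_common(max_keywords)}
    return local


def build_global_dictionary(local):
    """Second data structure: the union of all keywords in the local dictionaries."""
    return set().union(*local.values())


def compute_word_weights(local, global_dictionary):
    """Word weight table: keywords used in fewer documents get a larger weight.

    The 1 / document-frequency form is hypothetical.
    """
    return {
        word: 1.0 / sum(word in keywords for keywords in local.values())
        for word in global_dictionary
    }


documents = {"a": "reset password login", "b": "printer login"}  # sample input
local = build_local_dictionaries(documents)
global_dictionary = build_global_dictionary(local)
weights = compute_word_weights(local, global_dictionary)
```

In this sample, `login` appears in both local dictionaries and so receives half the weight of the words that appear in only one.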

5. A computer implemented process for matching new documents to those stored in a database comprising the steps of:
- generating a first data structure consisting of a set of local dictionaries of keywords;
  
  generating a second data structure which is a global dictionary resulting from the union of all keywords in the first data structure;
  
  computing a table of word weights based on frequency of use in input documents;
  
  matching input documents against documents represented by said second data structure; and
  
  accessing the table of word weights, scoring input documents, and ranking stored documents by relevance scoring scheme according to a formula
- Dependent claims: 6, 7, 8, 10
  - 6. The computer implemented process recited in claim 5 further comprising the step of converting an input document to a standard representation prior to generating the first data structure.
  - 7. The computer implemented process recited in claim 6 wherein the step of matching comprises the step of identifying words of an input document by matching against the global dictionary of the second data structure.
  - 8. The computer implemented process recited in claim 7 wherein the step of scoring comprises the steps of:
    - assigning a value for every matched word;
      
      adding a bonus to the value assigned for a matched word from the word weight table; and
      
      adding or subtracting a bonus or penalty for every match or mismatch in special sections of a document.
  - 10. The computer implemented process recited in claim 5, wherein the Bonus includes the title and special section keywords.
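The process steps of claims 5 through 8 can be read as a small pipeline: convert the new document to a standard form, keep only words found in the global dictionary, score each stored document, and sort. The sketch below is one possible reading, not the claimed formula: `to_standard`, `rank`, `section_bonus`, and `section_penalty` are hypothetical names, and treating an unmatched title keyword as a "mismatch in special sections" is an assumption.

```python
import re


def to_standard(text):
    """Hypothetical standard representation: lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(new_text, local, titles, weights, section_bonus=1.0, section_penalty=0.5):
    """Rank stored documents against a new document, best match first."""
    global_dictionary = set().union(*local.values())
    # Identify words of the input document by matching against the global dictionary.
    query = {w for w in to_standard(new_text) if w in global_dictionary}
    results = []
    for doc_id, keywords in local.items():
        matched = query & keywords
        # A value of 1 for every matched word, plus its bonus from the weight table.
        score = sum(1.0 + weights.get(w, 0.0) for w in matched)
        # Bonus for each title keyword matched, penalty for each one missed (assumed).
        for w in titles.get(doc_id, set()):
            score += section_bonus if w in query else -section_penalty
        results.append((doc_id, score))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical stored data.
local = {
    "faq1": {"reset", "password", "login"},
    "faq2": {"printer", "driver", "login"},
}
titles = {"faq1": {"password"}, "faq2": {"printer"}}
weights = {"password": 0.5, "printer": 0.5}

ranking = rank("I forgot my Password, cannot login!", local, titles, weights)
```

Words such as "forgot" that are absent from the global dictionary are dropped before scoring, so the per-document work stays proportional to the size of the small keyword sets.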

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Weiss, Sholom M.; Damerau, Frederick J.; Apte, Chidanand; White, Brian F.
Primary Examiner(s): Black, Thomas
Assistant Examiner(s): Le, Uyen T
Application Number: US09/203,673
Time in Patent Office: 1,008 Days
Field of Search: 707/1-8, 707/10, 707/102, 712/203
US Class Current: 1/1
CPC Class Codes:
G06F 16/319   Inverted lists
Y10S 707/99934   Query formulation, input pr...
Y10S 707/99935   Query augmenting and refini...
