Document extraction and comparison method with applications to automatic personalized database searching

US 5,926,812 A
Filed: 03/28/1997
Issued: 07/20/1999
Est. Priority Date: 06/20/1996
Status: Expired due to Fees

Abstract

A computer-implemented method for comparing the contents of two sets of documents includes the step of extracting from a set of documents [44] corresponding sets of document extract entries [46]. The method further includes a step of generating from the sets of document extract entries [46] corresponding sets of word clusters [48]. Each word cluster comprises a cluster word list having N words, an N×N total distance matrix, and an N×N number of connections matrix. The preferred embodiment includes a step of grouping similar word clusters and combining the similar word clusters to form a single word cluster for each group. The grouping comprises evaluating a measure of cluster similarity between two word clusters, and placing them in a common group of similar word clusters if the measure of similarity exceeds a predetermined value. The step of evaluating cluster similarity comprises intersecting clusters to form subclusters and calculating a function of the subclusters. In the preferred embodiment, the method is implemented in a system to automatically identify database documents which are of interest to a given user or users. In this implementation, the method comprises the step of automatically deriving the first set of documents from a local data storage device, such as a user's hard disk. The method also comprises the step of deriving the second set of documents from a second data storage device, such as a network machine. This application of the invention, therefore, provides fast and accurate searching to identify documents of interest to a particular user or users without any need for the user or users to specify what search criteria to use.
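
The grouping step in the abstract (place two word clusters in a common group when their similarity exceeds a predetermined value) can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `group_clusters`, the `jaccard` stand-in similarity on word lists, and the choice to merge groups transitively with a union-find structure are assumptions for illustration, not the measure or procedure the patent specifies.

```python
def jaccard(words_a, words_b):
    """Stand-in similarity (an assumption): shared words / all words."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if a | b else 0.0


def group_clusters(clusters, similarity, threshold):
    """Return lists of cluster indices whose pairwise similarity exceeds threshold.

    Groups are merged transitively (single-link), which is an assumption.
    """
    parent = list(range(len(clusters)))

    def find(k):
        # Follow parent links to the group representative, compressing the path.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            if similarity(clusters[a], clusters[b]) > threshold:
                parent[find(b)] = find(a)

    groups = {}
    for k in range(len(clusters)):
        groups.setdefault(find(k), []).append(k)
    return list(groups.values())


word_lists = [
    ["cat", "mat", "sat"],
    ["cat", "mat", "dog"],
    ["stock", "bond"],
    ["bond", "stock", "yield"],
]
groups = group_clusters(word_lists, jaccard, threshold=0.4)
```

Passing the similarity as a callable keeps the grouping logic separate from whichever cluster measure is actually used.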

303 Citations

15 Claims

1. A computer-implemented method for determining the relevance of the content of a first set of documents to the content of a second set of documents, the method comprising:
- extracting from the first set of documents a corresponding first set of document extract entries and from the second set of documents a corresponding second set of document extract entries, wherein each entry in the first and second sets of document extract entries comprises a weighted word histogram for a corresponding document;
  
  generating from the first set of document extract entries a first set of word clusters, and generating from the second set of document extract entries a second set of word clusters, wherein each word cluster in the first and second sets of word clusters comprises a cluster word list, a total distance matrix, and a number of connections matrix; and
  
  determining a degree of similarity between clusters from the first set of word clusters and clusters from the second set of word clusters.
  - 2. The method of claim 1 further comprising:
    - deriving the first set of documents from a first data storage device.
  - 3. The method of claim 2 wherein the deriving step comprises selecting the first set of documents from a set of files on the first data storage device, wherein the set of files contain files associated with a predetermined set of users.
  - 4. The method of claim 1 wherein the extracting step comprises converting the first set of documents from text format to hypertext format.
  - 5. The method of claim 1 further comprising:
    - deriving the second set of documents from a second data storage device.
  - 6. The method of claim 1 wherein each entry in the first set of document extract entries and in the second set of document extract entries further comprises a list of people and a list of companies.
  - 7. The method of claim 1 wherein the weighted word histogram for a corresponding document comprises a set of histogram word records, wherein each word record comprises a word from the document, a word score, a number of appearances of the word in the document, and a list of position indices for the word within the document.
  - 8. The method of claim 1 wherein the generating step comprises:
    - grouping the word clusters within the first set of word clusters to form subsets of similar word clusters; and
      
      combining the similar word clusters within each subset to form a single word cluster for the subset.
  - 9. The method of claim 8 wherein the grouping comprises:
    - evaluating a measure of cluster similarity between a first word cluster and a second word cluster, where both the first and second word clusters are members of the first set of word clusters; and
      
      placing the first word cluster and the second word cluster in a common subset of similar word clusters if the measure of similarity exceeds a predetermined value.
  - 10. The method of claim 9 wherein the evaluating step comprises:
    - intersecting the first cluster and the second cluster, thereby dividing the first cluster into four first subclusters and the second cluster into four second subclusters; and
      
      calculating a function of the four first and four second subclusters, wherein the function comprises the calculation of a quantity chosen from the group consisting of a maximum value of matrix elements, a minimum value of matrix elements, a sum of diagonal matrix elements, a sum of off-diagonal matrix elements, and a sum of all matrix elements.
  - 11. The method of claim 8 wherein the combining comprises:
    - concatenating the word list of a first cluster and the word list of a second cluster to form a combined word list of a combined word cluster;
      
      merging the total distance matrix of the first word cluster and the total distance matrix of the second word cluster to form a total distance matrix of a combined word cluster; and
      
      merging the number of connections matrix of the first word cluster and the number of connections matrix of the second word cluster to form a number of connections matrix of a combined word cluster.
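
The histogram word record of claim 7 (a word, a word score, a number of appearances, and a list of position indices) maps naturally onto a small data structure. The following Python sketch is illustrative only: the names `WordRecord` and `build_histogram`, the tokenizer, the optional stop-word set, and in particular the scoring rule (relative frequency) are assumptions, since the claims do not say how the word score is computed.

```python
import re
from dataclasses import dataclass, field


@dataclass
class WordRecord:
    word: str
    score: float = 0.0          # weighting rule is an assumption (see below)
    count: int = 0              # number of appearances in the document
    positions: list = field(default_factory=list)  # token indices of each appearance


def build_histogram(text, stop_words=frozenset()):
    """Build one WordRecord per distinct word in a document."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    records = {}
    for pos, tok in enumerate(tokens):
        if tok in stop_words:
            continue
        rec = records.setdefault(tok, WordRecord(word=tok))
        rec.count += 1
        rec.positions.append(pos)
    # Assumed score: share of the kept tokens accounted for by this word.
    total = sum(rec.count for rec in records.values())
    for rec in records.values():
        rec.score = rec.count / total if total else 0.0
    return records


hist = build_histogram("The cat sat. The cat ran!", stop_words={"the"})
```

Position indices count every token, including skipped stop words, so distances between kept words still reflect their spacing in the original text.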

12. A computer-implemented method for determining the relevance of the content of a first set of documents to the content of a second set of documents, the method comprising:
- extracting from the first set of documents a corresponding first set of document extract entries and from the second set of documents a corresponding second set of document extract entries, wherein each entry in the first and second sets of document extract entries comprises a weighted word histogram for a corresponding document;
  
  generating from the first set of document extract entries a first set of word clusters, and from the second set of document extract entries a second set of word clusters, wherein each word cluster in the first and second sets of word clusters comprises a cluster word list, a total distance matrix, and a number of connections matrix; and
  
  determining a degree of similarity between clusters from the first set of word clusters and clusters from the second set of word clusters;
  
  wherein:
  
  the number of connections matrix for each word cluster comprises an N×N matrix, wherein N is equal to the number of words in the cluster word list, wherein the (i,j) entry of the number of connections matrix for i≠j contains a number of connections in the document between words i and j when word i precedes word j, and wherein the (i,i) entry of the number of connections matrix contains a number of appearances in the document of word i; and
  
  the total distance matrix for each word cluster comprises an N×N matrix, wherein the (i,j) entry of the total distance matrix for i≠j contains a total distance between words i and j for all connections in the document when word i precedes word j, and wherein the (i,i) entry of the total distance matrix contains a weight of word i in the document.
  - 13. The method of claim 12 wherein the generating step comprises:
    - determining the cluster word list by recursively calling a procedure that returns a list of words within a predetermined distance from a given word in the document, and calculating the number of connections matrix by repeatedly calling a procedure that determines the number of connections in the document between words i and j when word i precedes word j.
  - 14. The method of claim 13 wherein the recursive calling of the procedure is limited to a predetermined recursive depth.
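
The two N×N matrices recited in claim 12 can be made concrete with a short Python sketch. It rests on assumptions the claims leave open: a "connection" is taken to be an occurrence of word i followed by an occurrence of word j no more than `max_distance` tokens later, and the diagonal weight of the total distance matrix defaults to the appearance count unless a `weights` mapping is supplied. The function name `build_cluster_matrices` and its parameters are hypothetical.

```python
from collections import defaultdict


def build_cluster_matrices(tokens, cluster_words, max_distance=5, weights=None):
    """Return (connections, total_distance) for the words in cluster_words.

    Off-diagonal (i, j): connections where word i precedes word j, and the
    summed token distance over those connections.
    Diagonal (i, i): appearance count, and the word's weight.
    """
    n = len(cluster_words)
    wanted = set(cluster_words)
    positions = defaultdict(list)
    for pos, tok in enumerate(tokens):
        if tok in wanted:
            positions[tok].append(pos)

    connections = [[0] * n for _ in range(n)]
    total_distance = [[0.0] * n for _ in range(n)]
    for i, wi in enumerate(cluster_words):
        count = len(positions[wi])
        connections[i][i] = count
        # Assumed default weight: the appearance count.
        total_distance[i][i] = float(weights.get(wi, count)) if weights else float(count)
        for j, wj in enumerate(cluster_words):
            if i == j:
                continue
            for pi in positions[wi]:
                for pj in positions[wj]:
                    gap = pj - pi
                    if 0 < gap <= max_distance:  # word i precedes word j
                        connections[i][j] += 1
                        total_distance[i][j] += gap
    return connections, total_distance


tokens = "the cat sat on the mat the cat ran".split()
conn, dist = build_cluster_matrices(tokens, ["cat", "sat", "mat"], max_distance=3)
```

Because entry (i, j) only counts cases where word i comes first, neither matrix is symmetric in general.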

15. A computer-implemented method for determining the relevance of the content of a first set of documents to the content of a second set of documents, the method comprising:
- extracting from the first set of documents a corresponding first set of document extract entries and from the second set of documents a corresponding second set of document extract entries, wherein each entry in the first and second sets of document extract entries comprises a weighted word histogram for a corresponding document;
  
  generating from the first set of document extract entries a first set of word clusters, and from the second set of document extract entries a second set of word clusters, wherein each word cluster in the first and second sets of word clusters comprises a cluster word list, a total distance matrix, and a number of connections matrix; and
  
  determining a degree of similarity between clusters from the first set of word clusters and clusters from the second set of word clusters;
  
  wherein the determining step comprises:
  
  intersecting a first cluster from the first set of word clusters and a second cluster from the second set of word clusters, thereby dividing the first cluster into four first subclusters and the second cluster into four second subclusters; and
  
  calculating a function of the four first and four second subclusters, wherein the function comprises the calculation of a quantity chosen from the group consisting of a sum of diagonal matrix elements, a sum of off-diagonal matrix elements, and a sum of all matrix elements.
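
One possible reading of the intersecting step in claim 15 is sketched below in Python. The interpretation is an assumption: each cluster's words are split into those shared with the other cluster and those unique to it, which partitions the cluster's N×N matrix into four blocks (shared/shared, shared/unique, unique/shared, unique/unique). The helper names and the final `overlap_score` formula are hypothetical; the claim only requires that the function use a sum of diagonal, off-diagonal, or all matrix elements.

```python
def split_blocks(words, matrix, other_words):
    """Partition a cluster matrix into four blocks relative to another cluster."""
    other = set(other_words)
    shared = [k for k, w in enumerate(words) if w in other]
    unique = [k for k, w in enumerate(words) if w not in other]

    def block(rows, cols):
        return [[matrix[r][c] for c in cols] for r in rows]

    return {
        "shared_shared": block(shared, shared),
        "shared_unique": block(shared, unique),
        "unique_shared": block(unique, shared),
        "unique_unique": block(unique, unique),
    }


def sum_all(block):
    return sum(sum(row) for row in block)


def sum_diagonal(block):
    size = min(len(block), len(block[0])) if block else 0
    return sum(block[k][k] for k in range(size))


def sum_off_diagonal(block):
    return sum_all(block) - sum_diagonal(block)


def overlap_score(words_a, matrix_a, words_b, matrix_b):
    """Assumed measure: mean share of each matrix's total held by its shared block."""
    shares = []
    for words, matrix, other in ((words_a, matrix_a, words_b),
                                 (words_b, matrix_b, words_a)):
        blocks = split_blocks(words, matrix, other)
        total = sum(sum_all(b) for b in blocks.values())
        shares.append(sum_all(blocks["shared_shared"]) / total if total else 0.0)
    return sum(shares) / 2


words_a, matrix_a = ["cat", "sat", "mat"], [[2, 1, 0], [0, 1, 1], [1, 0, 1]]
words_b, matrix_b = ["cat", "mat", "dog"], [[1, 1, 0], [0, 1, 0], [1, 0, 2]]
score = overlap_score(words_a, matrix_a, words_b, matrix_b)
```

The same block decomposition applies to either the number of connections matrix or the total distance matrix, so a fuller measure could combine scores from both.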

Current Assignee: Mantra Technologies Incorporated
Original Assignee: Mantra Technologies Incorporated
Inventors: Carmel, Ron; Ariel, Hagai; Hilsenrath, Oliver A.
Primary Examiner(s): Black, Thomas G.
Assistant Examiner(s): Alam, Hosain T.
Application Number: US08/829,451
Time in Patent Office: 844 Days
Field of Search: 707/5, 707/6, 707/1-4, 707/501, 707/513, 395/200.47, 395/200.48
US Class Current: 707/737
CPC Class Codes:
G06F 16/355   Class or cluster creation o...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99936   Pattern matching access
