Document extraction and comparison method with applications to automatic personalized database searching
First Claim
1. A computer-implemented method for determining the relevance of the content of a first set of documents to the content of a second set of documents, the method comprising:
extracting from the first set of documents a corresponding first set of document extract entries and from the second set of documents a corresponding second set of document extract entries, wherein each entry in the first and second sets of document extract entries comprises a weighted word histogram for a corresponding document;
generating from the first set of document extract entries a first set of word clusters, and generating from the second set of document extract entries a second set of word clusters, wherein each word cluster in the first and second sets of word clusters comprises a cluster word list, a total distance matrix, and a number of connections matrix; and
determining a degree of similarity between clusters from the first set of word clusters and clusters from the second set of word clusters.
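The extracting step can be illustrated with a minimal Python sketch of a weighted word histogram. The claim does not say how words are weighted or which words are ignored, so the stop-word list, the tokenizer, and the relative-frequency weighting below are all assumptions made for illustration, and `weighted_word_histogram` is a hypothetical name.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the claim does not say which words are ignored.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def weighted_word_histogram(text):
    """Return {word: weight} for one document.

    Assumed weighting: each kept word's share of all kept words.
    The claim only requires that the histogram be weighted somehow.
    """
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    counts = Counter(words)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


# One document extract entry for a toy document.
extract = weighted_word_histogram("The cat chased the dog and the cat slept")
```

Here `extract` maps each remaining word to its relative frequency, so "cat" (two of five kept words) receives the largest weight.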
Abstract
A computer-implemented method for comparing the contents of two sets of documents includes the step of extracting from a set of documents 44 corresponding sets of document extract entries 46. The method further includes a step of generating from the sets of document extract entries 46 corresponding sets of word clusters 48. Each word cluster comprises a cluster word list having N words, an N×N total distance matrix, and an N×N number of connections matrix. The preferred embodiment includes a step of grouping similar word clusters and combining the similar word clusters to form a single word cluster for each group. The grouping comprises evaluating a measure of cluster similarity between two word clusters, and placing them in a common group of similar word clusters if the measure of similarity exceeds a predetermined value. The step of evaluating cluster similarity comprises intersecting clusters to form subclusters and calculating a function of the subclusters. In the preferred embodiment, the method is implemented in a system to automatically identify database documents that are of interest to a given user or users. In this implementation, the method comprises the step of automatically deriving the first set of documents from a local data storage device, such as a user's hard disk. The method also comprises the step of deriving the second set of documents from a second data storage device, such as a network machine. This application of the invention, therefore, provides fast and accurate searching to identify documents of interest to a particular user or users without any need for the user or users to specify what search criteria to use.
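The grouping rule in the abstract (place two clusters in a common group when their similarity exceeds a predetermined value) can be sketched as follows. This is only an illustration: the abstract's similarity measure is a function of subclusters, whereas the sketch substitutes a plain Jaccard overlap of the cluster word lists as a stand-in. Treating group membership as transitive is also an assumption, and `group_similar` and `jaccard` are hypothetical names.

```python
def jaccard(words_a, words_b):
    """Stand-in similarity: overlap of two cluster word lists (not the patent's measure)."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_similar(clusters, threshold, similarity=jaccard):
    """Return groups of cluster indices.

    Two clusters land in the same group when their similarity exceeds
    `threshold`; membership is assumed to be transitive (union-find).
    """
    parent = list(range(len(clusters)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if similarity(clusters[i], clusters[j]) > threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(clusters)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


clusters = [
    ["stock", "market", "price"],
    ["market", "price", "trade"],
    ["soccer", "goal"],
]
groups = group_similar(clusters, threshold=0.4)
```

With these toy word lists, the first two clusters share two of four distinct words and are grouped together, while the third stays on its own. Combining each group into a single word cluster, as the preferred embodiment describes, is not shown.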
303 Citations
15 Claims
12. A computer-implemented method for determining the relevance of the content of a first set of documents to the content of a second set of documents, the method comprising:
extracting from the first set of documents a corresponding first set of document extract entries and from the second set of documents a corresponding second set of document extract entries, wherein each entry in the first and second sets of document extract entries comprises a weighted word histogram for a corresponding document;
generating from the first set of document extract entries a first set of word clusters, and from the second set of document extract entries a second set of word clusters, wherein each word cluster in the first and second sets of word clusters comprises a cluster word list, a total distance matrix, and a number of connections matrix; and
determining a degree of similarity between clusters from the first set of word clusters and clusters from the second set of word clusters;
wherein:
the number of connections matrix for each word cluster comprises an N×N matrix, wherein N is equal to the number of words in the cluster word list, wherein the (i,j) entry of the number of connections matrix for i≠j contains a number of connections in the document between words i and j when word i precedes word j, and wherein the (i,i) entry of the number of connections matrix contains a number of appearances in the document of word i; and
the total distance matrix for each word cluster comprises an N×N matrix, wherein the (i,j) entry of the total distance matrix for i≠j contains a total distance between words i and j for all connections in the document when word i precedes word j, and wherein the (i,i) entry of the total distance matrix contains a weight of word i in the document.
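The two matrices recited in claim 12 can be made concrete with a short Python sketch. The claim does not define what counts as a "connection" or how distance is measured, so the sketch assumes that word i connects to word j whenever j follows i within `window` token positions, and that distance is the difference in token positions. The function name, the `window` parameter, and the supplied `weights` mapping are hypothetical.

```python
def build_cluster_matrices(tokens, cluster_words, weights, window=3):
    """Build the N×N number-of-connections and total-distance matrices.

    Assumptions (not from the claim): a connection is word i followed by
    word j within `window` token positions; distance is the position gap.
    """
    n = len(cluster_words)
    index = {w: k for k, w in enumerate(cluster_words)}
    connections = [[0] * n for _ in range(n)]
    distance = [[0.0] * n for _ in range(n)]

    # Positions of cluster words in the document, in reading order.
    positions = [(p, index[t]) for p, t in enumerate(tokens) if t in index]

    for a, (pos_i, i) in enumerate(positions):
        connections[i][i] += 1  # (i,i): number of appearances of word i
        for pos_j, j in positions[a + 1:]:
            gap = pos_j - pos_i
            if gap > window:
                break
            if i != j:
                connections[i][j] += 1  # (i,j): word i precedes word j
                distance[i][j] += gap   # (i,j): summed over all connections

    for word, i in index.items():
        distance[i][i] = weights.get(word, 0.0)  # (i,i): weight of word i

    return connections, distance


tokens = "cat chased dog cat slept".split()
connections, distance = build_cluster_matrices(
    tokens, ["cat", "dog"], {"cat": 0.4, "dog": 0.2}, window=3
)
```

Because only pairs in which word i precedes word j are counted, neither matrix is symmetric in general: the ("cat", "dog") and ("dog", "cat") entries are accumulated separately.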
15. A computer-implemented method for determining the relevance of the content of a first set of documents to the content of a second set of documents, the method comprising:
extracting from the first set of documents a corresponding first set of document extract entries and from the second set of documents a corresponding second set of document extract entries, wherein each entry in the first and second sets of document extract entries comprises a weighted word histogram for a corresponding document;
generating from the first set of document extract entries a first set of word clusters, and from the second set of document extract entries a second set of word clusters, wherein each word cluster in the first and second sets of word clusters comprises a cluster word list, a total distance matrix, and a number of connections matrix; and
determining a degree of similarity between clusters from the first set of word clusters and clusters from the second set of word clusters;
wherein the determining step comprises:
intersecting a first cluster from the first set of word clusters and a second cluster from the second set of word clusters, thereby dividing the first cluster into four first subclusters and the second cluster into four second subclusters; and
calculating a function of the four first and four second subclusters, wherein the function comprises the calculation of a quantity chosen from the group consisting of a sum of diagonal matrix elements, a sum of off-diagonal matrix elements, and a sum of all matrix elements.
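One plausible reading of the intersecting step in claim 15 is sketched below in Python. The claim does not spell out how the four subclusters are formed; the sketch assumes that a cluster's words are split into those shared with the other cluster and those unique to it, which partitions each N×N matrix into four blocks. All function names and block labels are hypothetical.

```python
def split_into_subclusters(words, matrix, other_words):
    """Split one cluster matrix into four blocks by intersecting word lists.

    Assumed reading: rows and columns are partitioned into words shared
    with the other cluster ("common") and words unique to this cluster.
    """
    other = set(other_words)
    common = [k for k, w in enumerate(words) if w in other]
    unique = [k for k, w in enumerate(words) if w not in other]

    def block(rows, cols):
        return [[matrix[r][c] for c in cols] for r in rows]

    return {
        "common-common": block(common, common),
        "common-unique": block(common, unique),
        "unique-common": block(unique, common),
        "unique-unique": block(unique, unique),
    }


def sum_all(block):
    """Sum of all matrix elements."""
    return sum(sum(row) for row in block)


def sum_diagonal(block):
    """Sum of diagonal matrix elements (meaningful for the square blocks)."""
    size = min(len(block), len(block[0])) if block else 0
    return sum(block[k][k] for k in range(size))


def sum_off_diagonal(block):
    """Sum of off-diagonal matrix elements."""
    return sum_all(block) - sum_diagonal(block)


words_a = ["cat", "dog", "fish"]
matrix_a = [[2, 1, 0], [1, 1, 3], [0, 2, 1]]
words_b = ["dog", "cat", "bird"]
subclusters_a = split_into_subclusters(words_a, matrix_a, words_b)
```

Applying the same split to the second cluster with the roles reversed yields its four subclusters. How the claimed function combines the resulting sums into a degree of similarity is not specified in this claim, so the sketch stops at the three sums.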
Specification