Method and apparatus for measuring similarity among electronic documents
Abstract
A method and apparatus are provided for determining when electronic documents stored in a large collection of documents are similar to one another. A plurality of similarity information is derived from the documents. The similarity information may be based on a variety of factors, including hyperlinks in the documents, text similarity, user click-through information, similarity in the titles of the documents or their location identifiers, and patterns of user viewing. The similarity information is fed to a combination function that synthesizes the various measures of similarity information into combined similarity information. Using the combined similarity information, an objective function is iteratively maximized in order to yield a generalized similarity value that expresses the similarity of particular pairs of documents. In an embodiment, the generalized similarity value is used to determine the proper category, among a taxonomy of categories in an index, cache or search system, into which certain documents belong.
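As an illustration only, the combination step described in the abstract can be sketched as a weighted average of several pairwise similarity matrices. The abstract does not specify the combination function, so the weighted average, the function name `combine_similarity`, the weights, and the sample matrices below are all assumptions; the iterative maximization of an objective function is not shown.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def combine_similarity(measures: Sequence[Matrix], weights: Sequence[float]) -> Matrix:
    """Combine several n x n similarity matrices into one.

    Hypothetical combination function: a weighted average. The source only
    says that multiple measures are synthesized into combined similarity
    information; it does not give the formula.
    """
    if not measures or len(measures) != len(weights):
        raise ValueError("need exactly one weight per similarity measure")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n = len(measures[0])
    for m in measures:
        if len(m) != n or any(len(row) != n for row in m):
            raise ValueError("all similarity matrices must be n x n")
    # Entry (i, j) is the weighted mean of entry (i, j) across all measures.
    return [
        [sum(w * m[i][j] for m, w in zip(measures, weights)) / total for j in range(n)]
        for i in range(n)
    ]


# Illustrative (made-up) measures for three documents.
text_sim = [[1.0, 0.8, 0.1], [0.8, 1.0, 0.2], [0.1, 0.2, 1.0]]
link_sim = [[1.0, 0.4, 0.0], [0.4, 1.0, 0.6], [0.0, 0.6, 1.0]]

combined = combine_similarity([text_sim, link_sim], weights=[0.75, 0.25])
```

A weighted average keeps the result on the same scale as its inputs, which makes it easy to inspect; other combination functions would fit the description equally well.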
34 Claims
1. A computer implemented method of categorizing a plurality of new electronic documents into a set of categories, comprising the steps of:
establishing a plurality of training sets, wherein each training set is associated with a category and includes training documents that have been classified as belonging to said associated category;
determining how strongly each document of said plurality of documents corresponds to each of said plurality of categories by determining similarity between said each document and the training documents that belong to the training set of said category; and
wherein the step of determining similarity is performed using a matrix representing document similarity that is derived by combining two or more measures of document similarity.
Dependent claims: 2-33.
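To make the steps of claim 1 concrete, the following is a minimal sketch of categorizing new documents against per-category training sets using a document similarity matrix. The claim does not say how similarity to the training documents is aggregated, so using the mean similarity as the "strength" of correspondence is an assumption, as are the function names, the category labels, and the sample matrix.

```python
from typing import Dict, List, Sequence

Matrix = List[List[float]]


def category_scores(sim: Matrix, doc: int, training_sets: Dict[str, Sequence[int]]) -> Dict[str, float]:
    """Score how strongly `doc` corresponds to each category.

    Assumption: strength is the mean similarity between the document and
    the training documents of that category. Empty training sets are skipped.
    """
    return {
        category: sum(sim[doc][t] for t in members) / len(members)
        for category, members in training_sets.items()
        if members
    }


def categorize(sim: Matrix, new_docs: Sequence[int], training_sets: Dict[str, Sequence[int]]) -> Dict[int, str]:
    """Assign each new document to its highest-scoring category."""
    assignments = {}
    for doc in new_docs:
        scores = category_scores(sim, doc, training_sets)
        assignments[doc] = max(scores, key=scores.get)
    return assignments


# Made-up similarity matrix for five documents; in the claimed method this
# matrix would be derived by combining two or more similarity measures.
sim = [
    [1.0, 0.9, 0.1, 0.2, 0.9],
    [0.9, 1.0, 0.2, 0.1, 0.7],
    [0.1, 0.2, 1.0, 0.8, 0.1],
    [0.2, 0.1, 0.8, 1.0, 0.2],
    [0.9, 0.7, 0.1, 0.2, 1.0],
]

# Documents 0-1 and 2-3 are already classified; document 4 is new.
training_sets = {"sports": [0, 1], "finance": [2, 3]}

assignments = categorize(sim, new_docs=[4], training_sets=training_sets)
```

Here document 4 scores higher against the hypothetical "sports" training set than against "finance", so it is assigned to "sports".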
34. A computer-readable recording medium carrying one or more sequences of instructions, wherein execution of the one or more sequences of instructions by one or more processors causes the one or more processors to perform the steps of:
establishing a plurality of training sets, wherein each training set is associated with a category and includes training documents that have been classified as belonging to said associated category;
determining how strongly each document of said plurality of documents corresponds to each of said plurality of categories by determining similarity between said each document and the documents that belong to the training set of said category; and
wherein the step of determining similarity is performed using a matrix representing document similarity that is derived by combining two or more measures of document similarity.
Specification