System and method for identifying similarities among objects in a collection

US 6,941,321 B2
Filed: 10/19/1999
Issued: 09/06/2005
Est. Priority Date: 01/26/1999
Status: Expired due to Fees

Abstract

A system and method for browsing, retrieving, and recommending information from a collection uses multi-modal features of the documents in the collection, as well as an analysis of users' prior browsing and retrieval behavior. The system and method are premised on various disclosed methods for quantitatively representing documents in a document collection as vectors in multi-dimensional vector spaces, quantitatively determining similarity between documents, and clustering documents according to those similarities. The system and method also rely on methods for quantitatively representing users in a user population, quantitatively determining similarity between users, clustering users according to those similarities, and visually representing clusters of users by analogy to clusters of documents.

30 Claims

1. A computer-implemented method for calculating the similarity between two objects in a collection of objects, wherein each object is associated with at least a first feature vector and a second feature vector, each of the first and second feature vectors being a multi-dimensional vector, the first feature vector being representative of a first feature of the objects and the second feature vector being representative of a second feature of the objects, the first feature being a one of a first set of multi-modal features including a text feature, a URL feature, an inlink feature and an outlink feature, and the second feature being an image feature, comprising the steps of:
- identifying the first feature vector for a first object and the first feature vector of a second object;
  
  computing a first distance metric between the first feature vector for the first object and the first feature vector for the second object;
  
  identifying, without reference to textual information, the second feature vector of the first object and the second feature vector of the second object;
  
  computing a second distance metric between the second feature vector for the first object and the second feature vector for the second object; and
  
  computing a sum of the first distance metric and the second distance metric.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
- - 2. The method of claim 1, wherein each object corresponds to a document in a collection of documents.
  - 3. The method of claim 2, wherein the first feature comprises the text feature, and wherein the first distance metric comprises a cosine similarity measure between the first feature vector for the first object and the first feature vector for the second object.
  - 4. The method of claim 2, wherein the first feature comprises the URL feature, and wherein the first distance metric comprises a cosine similarity measure between the first feature vector of the first object and the first feature vector of the second object.
  - 5. The method of claim 2, wherein the first feature comprises the inlink feature, and wherein the first distance metric comprises a cosine similarity measure between the first feature vector of the first object and the first feature vector of the second object.
  - 6. The method of claim 2, wherein the first feature comprises the outlink feature, and wherein the first distance metric comprises a cosine similarity measure between the first feature vector of the first object and the first feature vector of the second object.
  - 7. The method of claim 2, wherein a third feature vector is associated with each document, the third feature vector being a multi-dimensional vector representative of a text genre feature and wherein the method further comprises the steps of:
    - identifying the third feature vector for the first document and the third feature vector of the second document;
      
      computing a third distance metric between the third feature vector for the first document and the third feature vector for the second document; and
      
      computing a sum of the first, second and third distance metrics.
  - 8. The method of claim 2, wherein the second feature comprises the image feature as represented by a color histogram of an image associated with the document, and wherein the second distance metric comprises a cosine similarity measure between the second feature vector of the first object and the second feature vector of the second object.
  - 9. The method of claim 2, wherein the second feature comprises the image feature as represented by a color histogram of an image associated with the document, and wherein the second distance metric comprises a normalized intersection measure between the second feature vector of the first object and the second feature vector of the second object.
  - 10. The method of claim 2, wherein the second feature comprises the image feature as represented by a color complexity feature of an image associated with the document, and wherein the second distance metric comprises a cosine similarity measure between the second feature vector of the first object and the second feature vector of the second object.
  - 11. The method of claim 1, wherein the step of computing a sum uses a first weighting factor for the first distance metric and a second weighting factor for the second distance metric.
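
As a minimal sketch of how claims 1, 3, 9 and 11 could fit together, the Python below scores two documents with a weighted sum of a cosine similarity over text vectors and a normalized intersection over color histograms. The function names, the default weights, and the choice to normalize the intersection by the smaller histogram total are illustrative assumptions, not details taken from the patent.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def normalized_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Histogram intersection divided by the smaller histogram total.

    Assumption: this particular normalization is one common choice; the
    claims do not spell out which normalization is intended.
    """
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    overlap = sum(min(a, b) for a, b in zip(h1, h2))
    smaller_total = min(sum(h1), sum(h2))
    return overlap / smaller_total if smaller_total > 0 else 0.0


def combined_similarity(
    text_a: Sequence[float],
    text_b: Sequence[float],
    hist_a: Sequence[float],
    hist_b: Sequence[float],
    w_text: float = 1.0,   # hypothetical first weighting factor
    w_image: float = 1.0,  # hypothetical second weighting factor
) -> float:
    """Weighted sum of the text measure and the image measure."""
    text_score = cosine_similarity(text_a, text_b)
    image_score = normalized_intersection(hist_a, hist_b)
    return w_text * text_score + w_image * image_score


if __name__ == "__main__":
    # Toy term-count vectors and 2-bin color histograms (made-up data).
    score = combined_similarity([1, 0], [1, 0], [2, 2], [1, 3], w_text=0.5, w_image=2.0)
    print(score)
```

Both measures here grow as the documents become more alike, so the sum behaves as a similarity score even though the claims use the term "distance metric".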

12. A computer-readable medium storing instructions for calculating the similarity between two documents in a collection of documents, wherein each document is associated with at least two multi-dimensional vectors representative of a color complexity feature of an image included in the document, the instructions comprising:
- identifying a first horizontal complexity vector corresponding to a first document without reference to textual information, a first vertical complexity vector corresponding to the first document without reference to textual information, a second horizontal complexity vector corresponding to a second document without reference to textual information, and a second vertical complexity vector corresponding to the second document without reference to textual information; and
  
  computing a distance metric between the first document and the second document, wherein the distance metric comprises a normalized sum of a cosine similarity measure between the first horizontal complexity vector and the second horizontal complexity vector, and between the first vertical complexity vector and the second vertical complexity vector.
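
The normalized sum in claim 12 can be sketched as follows. This is a hedged illustration only: the helper names are hypothetical, and "normalized" is assumed here to mean averaging the horizontal and vertical cosine measures so that the result stays in the same range as a single cosine.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms > 0 else 0.0


def complexity_similarity(
    horiz_a: Sequence[float],
    vert_a: Sequence[float],
    horiz_b: Sequence[float],
    vert_b: Sequence[float],
) -> float:
    """Average of the horizontal and vertical cosine measures.

    Assumption: dividing by 2 is the normalization; the claim only says
    "normalized sum" without fixing the constant.
    """
    return 0.5 * (cosine(horiz_a, horiz_b) + cosine(vert_a, vert_b))


if __name__ == "__main__":
    # Made-up complexity vectors for two documents' images.
    print(complexity_similarity([1, 2], [1, 0], [2, 4], [0, 1]))
```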

13. A computer-readable medium for transmitting computer instructions for calculating the similarity between two objects in a collection of objects, wherein each object is associated with at least a first set of feature vectors and a second set of feature vectors, each of the feature vectors of the first and second set of feature vectors being a multi-dimensional vector representative of a feature of an object, the features of the first set of feature vectors being a one of a text feature, a URL feature, an inlink feature and an outlink feature, and the features of the second set of feature vectors being an image feature, the instructions comprising:
- identifying the first set of feature vectors corresponding to a first object and the first set of feature vectors corresponding to the second object;
  
  identifying without reference to textual information the second set of feature vectors corresponding to the first object and the second set of feature vectors corresponding to the second object;
  
  computing a distance metric between each vector in the sets of feature vectors associated with the first object and each vector in the sets of feature vectors associated with the second object; and
  
  summing the distance metrics into a composite distance metric.
- Dependent claims (14)
- - 14. The computer-readable medium for transmitting computer instructions for calculating the similarity between two objects in a collection of objects of claim 13, wherein the step of summing the distance metrics uses a distinct weighting factor for each type of feature vector.
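
One compact way to write the composite metric of claims 13 and 14 is as a weighted sum. The symbols $K$, $\tau$, $w$, and $d_k$ are notation introduced here for illustration and do not appear in the patent:

$$
D(a, b) = \sum_{k=1}^{K} w_{\tau(k)}\, d_k\!\left(\mathbf{x}_k^{(a)}, \mathbf{x}_k^{(b)}\right)
$$

Here $k$ ranges over all feature vectors in both sets, $\mathbf{x}_k^{(a)}$ and $\mathbf{x}_k^{(b)}$ are the $k$-th feature vectors of the first and second object, $d_k$ is the distance metric computed for that pair, and $\tau(k)$ is the feature type (text, URL, inlink, outlink, or image) of vector $k$. Claim 13 corresponds to every weight being 1; claim 14 lets $w_{\tau(k)}$ differ by feature type.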

15. A computer-implemented method for calculating the similarity between characteristics of two users in a population of users of a document collection, wherein each user is associated with a multi-dimensional vector representative of a user feature, and each document in the collection of documents is associated with at least one multi-dimensional vector representative of a document feature, the user feature representing for each user at least a document browsing history, comprising the steps of:
- identifying a first vector corresponding to a first user and a second vector corresponding to a second user; and
  
  wherein the first vector represents a mediated representation of the first user through the document feature corresponding to the documents in the collection accessed by the first user; and
  
  the second vector represents a mediated representation of the second user through the document feature corresponding to the documents in the collection accessed by the second user;
  
  computing a first distance metric between the first vector and the second vector.
- Dependent claims (16, 17, 18, 19, 20, 21, 22)
- - 16. The method of claim 15, wherein the user feature further comprises demographic information about the users of the population of users of the document collection.
  - 17. The method of claim 15, wherein the user feature further comprises group membership information about the users of the population of users of the document collection.
  - 18. The method of claim 15, wherein the mediated representation of each user is calculated by multiplying a first matrix and a second matrix, wherein:
    - the first matrix comprises a first plurality of column vectors each representing a document in the collection by way of the document feature; and
      
      the second matrix comprises a second plurality of column vectors each representing a user in the user population by way of document accesses.
  - 19. The method of claim 18, wherein the document feature comprises the text represented by documents in the collection.
  - 20. The method of claim 18, wherein the document feature comprises the outlinks represented by documents in the collection.
  - 21. The method of claim 18, wherein the document feature comprises the inlinks represented by documents in the collection.
  - 22. The method of claim 18, wherein the document feature comprises the URLs represented by documents in the collection.
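
The matrix product in claim 18 can be illustrated with a small pure-Python sketch. The first matrix holds one column per document (rows are feature dimensions) and the second holds one column per user (rows are documents, entries are access counts), so each column of the product is a user expressed through the features of the documents that user accessed. All names and the toy data are hypothetical, and using access counts as the entries of the second matrix is an assumption.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a @ b for list-of-rows matrices."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(len(a))
    ]


def mediated_user_vectors(doc_features: Matrix, user_accesses: Matrix) -> Matrix:
    """Return one feature-space vector per user (the columns of the product)."""
    product = matmul(doc_features, user_accesses)
    n_users = len(user_accesses[0])
    return [[row[j] for row in product] for j in range(n_users)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norms if norms > 0 else 0.0


# Toy data: 3 feature dimensions x 3 documents (columns are documents).
DOC_FEATURES: Matrix = [
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
]
# 3 documents x 3 users (columns are users, entries are access counts).
USER_ACCESSES: Matrix = [
    [2, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
]

if __name__ == "__main__":
    users = mediated_user_vectors(DOC_FEATURES, USER_ACCESSES)
    print(users)
    print(cosine(users[0], users[2]))  # users 0 and 2 both accessed document 0
    print(cosine(users[0], users[1]))  # users 0 and 1 share no documents
```

Comparing users in feature space rather than by raw access lists means two users can come out as similar even when they read different documents, as long as those documents have similar features.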

23. A computer-readable medium storing instructions for calculating the similarity between two documents in a collection of documents, wherein each document is associated with at least a first feature vector, a second feature vector, a third feature vector and a fourth feature vector, each of the first, second, third and fourth feature vectors being a multi-dimensional vector, the first feature vector being representative of a text feature of the documents, the second feature vector being representative of an image feature of the documents, the third feature vector being representative of a text genre feature of the documents, and the fourth feature vector being representative of a link feature of the documents, the instructions comprising:
- identifying the first, second, third and fourth feature vectors for the first document and the first, second, third and fourth feature vectors of a second document;
  
  computing a first distance metric between the first feature vector for the first document and the first feature vector for the second document;
  
  computing a second distance metric between the second feature vector for the first document and the second feature vector for the second document; and
  
  computing a third distance metric between the third feature vector for the first document and the third feature vector for the second document; and
  
  computing a fourth distance metric between the fourth feature vector for the first document and the fourth feature vector for the second document.
- Dependent claims (24, 25, 26, 27, 28)
- - 24. The computer readable medium of claim 23 wherein the image feature is represented by a color histogram and wherein the second distance metric comprises a normalized intersection measure between the second feature vector of the first document and the second feature vector of the second document.
  - 25. The computer readable medium of claim 24 wherein the instructions further comprise:
    - computing a sum of the first, second, third and fourth distance metrics.
  - 26. The computer readable medium of claim 23 wherein the image feature is represented by a color complexity feature and wherein the second distance metric comprises a cosine similarity measure between the second feature vector of the first document and the second feature vector of the second document.
  - 27. The computer readable medium of claim 23 wherein each document is associated with a fifth feature vector, the fifth feature vector being a multi-dimensional vector representative of a user information feature of the documents, and wherein the instructions further comprise:
    - identifying the fifth feature vector for the first document and the second document; and
      
      computing a fifth distance metric between the fifth feature vector for the first document and the fifth feature vector for the second document.
  - 28. The computer readable medium of claim 27 wherein the instructions further comprise:
    - computing a sum of the first, second, third, fourth and fifth distance metrics.

29. A computer-implemented system for calculating the similarity between two objects in a collection of objects, wherein each object is associated with at least a first feature vector and a second feature vector, each of the first and second feature vectors being a multi-dimensional vector, the first feature vector being representative of a first feature of the objects and the second feature vector being representative of a second feature of the objects, the first feature being a one of a first set of multi-modal features including a text feature, a URL feature, an inlink feature and an outlink feature, and the second feature being an image feature, comprising:
- means for identifying the first feature vector for a first object and the first feature vector of a second object;
  
  means for computing a first distance metric between the first feature vector for the first object and the first feature vector for the second object;
  
  means for identifying, without reference to textual information, the second feature vector of the first object and the second feature vector of the second object;
  
  means for computing a second distance metric between the second feature vector for the first object and the second feature vector for the second object; and
  
  means for computing a sum of the first distance metric and the second distance metric.

30. A computer-implemented system for calculating the similarity between two objects in a collection of objects, wherein each object is associated with at least a first feature vector and a second feature vector, each of the first and second feature vectors being a multi-dimensional vector, the first feature vector being representative of a first feature of the objects and the second feature vector being representative of a second feature of the objects, the first feature being a one of a first set of multi-modal features including a text feature, a URL feature, an inlink feature and an outlink feature, and the second feature being an image feature, comprising:
- a processor adapted to execute instructions; and
  
  a computer-readable memory storing instructions for causing the processor to calculate the similarity between two objects in a collection of objects;
  
  identifying the first feature vector for a first object and the first feature vector of a second object;
  
  computing a first distance metric between the first feature vector for the first object and the first feature vector for the second object;
  
  identifying, without reference to textual information, the second feature vector of the first object and the second feature vector of the second object;
  
  computing a second distance metric between the second feature vector for the first object and the second feature vector for the second object; and
  
  computing a sum of the first distance metric and the second distance metric.

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Li, Jun; Chi, Ed H.; Chen, Francine R.; Pirolli, Peter L.; Pitkow, James E.; Schuetze, Hinrich
Primary Examiner(s): Alam, Shahid
Assistant Examiner(s): Fleurantin, Jean Bolte
Application Number: US09/421,767
Publication Number: US 20030074369A1
Time in Patent Office: 2,149 Days
Field of Search: 704/10, 704/222, 709/203, 715/513, 707/1-10, 707/100-104.1, 707/200-206
US Class Current: 1/1
CPC Class Codes:
G06F 16/355   Class or cluster creation o...
G06F 16/904   Browsing; Visualisation the...
Y10S 707/99944   Object-oriented database st...
Y10S 707/99945   Object-oriented database st...
