Propagating relevance from labeled documents to unlabeled documents

US 8,019,763 B2
Filed: 02/27/2006
Issued: 09/13/2011
Est. Priority Date: 02/27/2006
Status: Expired due to Fees

First Claim

Patent Images

1. A computing device with a processor and memory for propagating relevance of labeled documents to unlabeled documents, comprising:

a document store that contains representations of documents, some of the documents being labeled with relevance to a query and others of the documents not being labeled with relevance to the query;

a graph component that creates a graph of the documents with the documents represented as nodes being connected by edges representing similarity between documents having content, includinga document similarity component that determines the similarity between two documents by retrieving the content of the two documents and generating a similarity metric based on the retrieved content of the two documents to indicate similarity between the retrieved content of the two documents;

a build graph component that builds a graph in which nodes representing similar documents are connected via edges, such that each node has an edge to a number of other nodes that are most similar to it;

a generate weights component that generates weights for the edges based on similarity of the documents represented by the connected nodes such that the weight for an edge between documents whose content are similar is higher than the weight for an edge between documents whose content are not as similar; and

a normalize weights component that normalizes the weights of the graph so that weights are relative to the weights of the connected documents;

a propagate relevance component that propagates relevance of the labeled documents to the unlabeled documents based on the similarity between documents as indicated by the weights of the edges of the graph connecting the documents such that an unlabeled document that is similar to labeled documents as represented by the weights of the edges of the graph has its relevance set based on the relevance of those similar labeled documents, the propagation of relevance for a first document includes determining the relevance for the first document based on a summation of weighted relevances of the documents wherein the weighted relevance of a second document to the first document is based on the weight of the edge of the graph connecting the second document to the first document and the relevance of the second document; and

a training component that trains a ranking function to generate relevance of a document to a query based on the documents, the query, and the labeled and propagated relevanceswherein the components comprise computer-executable instructions for execution by the processor.

View all claims

2 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A method and system for propagating the relevance of labeled documents to a query to unlabeled documents is provided. The propagation system provides training data that includes queries, documents labeled with their relevance to the queries, and unlabeled documents. The propagation system then calculates the similarity between pairs of documents in the training data. The propagation system then propagates the relevance of the labeled documents to similar, but unlabeled, documents. The propagation system may iteratively propagate labels of the documents until the labels converge on a solution. The training data with the propagated relevances can then be used to train a ranking function.

Citations

20 Claims

1. A computing device with a processor and memory for propagating relevance of labeled documents to unlabeled documents, comprising:
- a document store that contains representations of documents, some of the documents being labeled with relevance to a query and others of the documents not being labeled with relevance to the query;
  
  a graph component that creates a graph of the documents with the documents represented as nodes being connected by edges representing similarity between documents having content, includinga document similarity component that determines the similarity between two documents by retrieving the content of the two documents and generating a similarity metric based on the retrieved content of the two documents to indicate similarity between the retrieved content of the two documents;
  
  a build graph component that builds a graph in which nodes representing similar documents are connected via edges, such that each node has an edge to a number of other nodes that are most similar to it;
  
  a generate weights component that generates weights for the edges based on similarity of the documents represented by the connected nodes such that the weight for an edge between documents whose content are similar is higher than the weight for an edge between documents whose content are not as similar; and
  
  a normalize weights component that normalizes the weights of the graph so that weights are relative to the weights of the connected documents;
  
  a propagate relevance component that propagates relevance of the labeled documents to the unlabeled documents based on the similarity between documents as indicated by the weights of the edges of the graph connecting the documents such that an unlabeled document that is similar to labeled documents as represented by the weights of the edges of the graph has its relevance set based on the relevance of those similar labeled documents, the propagation of relevance for a first document includes determining the relevance for the first document based on a summation of weighted relevances of the documents wherein the weighted relevance of a second document to the first document is based on the weight of the edge of the graph connecting the second document to the first document and the relevance of the second document; and
  
  a training component that trains a ranking function to generate relevance of a document to a query based on the documents, the query, and the labeled and propagated relevanceswherein the components comprise computer-executable instructions for execution by the processor.
- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
- - 2. The computing device of claim 1 wherein the build graph component establishes edges between nodes using a nearest neighbor algorithm.
  - 3. The computing device of claim 2 wherein the nearest neighbor algorithm uses a Euclidean distance metric.
  - 4. The computing device of claim 2 wherein the build graph component connects a node to its 10 nearest neighbors.
  - 5. The computing device of claim 1 wherein the build graph component establishes edges between nodes to create a minimum spanning tree.
  - 6. The computing device of claim 1 wherein the relevance of the labeled documents is generated by searching for documents related to the query in a corpus of documents and the unlabeled documents are not included in the corpus of documents.
  - 7. The computing device of claim 1 wherein the propagate relevance component propagates relevance using a manifold ranking based algorithm.
  - 8. The computing device of claim 1 wherein the propagation of the relevance is performed iteratively until the relevances of the documents converge.
  - 9. The computing device of claim 8 wherein the propagation of the relevance factors in a decay factor.

10. A computing device with a processor and memory for propagating relevance of labeled documents to unlabeled documents, comprising:
- a document store that contains representations of documents, some of the documents being labeled with relevance to a query and others of the documents not being labeled with relevance to the query;
  
  a graph component that creates a graph of the documents with the documents represented as nodes being connected by edges representing similarity between documents having content, includinga document similarity component that determines the similarity between two documents by retrieving the content of the two documents and generating a similarity metric based on the retrieved content of the two documents to indicate similarity between the retrieved content of the two documents;
  
  a build graph component that builds a graph in which nodes representing similar documents are connected via edges, such that each node has an edge to a number of other nodes that are most similar to it;
  
  a generate weights component that generates weights for the edges based on similarity of the documents represented by the connected nodes such that the weight for an edge between documents whose content are similar is higher than the weight for an edge between documents whose content are not as similar; and
  
  a normalize weights component that normalizes the weights of the graph so that weights are relative to the weights of the connected documents;
  
  a propagate relevance component that propagates relevance of the labeled documents to the unlabeled documents based on the similarity between documents as indicated by the weights of the edges of the graph connecting the documents such that an unlabeled document that is similar to labeled documents as represented by the weights of the edges of the graph has its relevance set based on the relevance of those similar labeled documents; and
  
  a training component that trains a ranking function to generate relevance of a document to a query based on the documents, the query, and the labeled and propagated relevanceswherein the components comprise computer-executable instructions for execution by the processor andwherein the propagate relevance component propagates relevance according to the following equation;
  
  ƒ
  
  *=(1−
  
  α
  
  )(I−
  
  α
  
  S)^−
  
  1y where ƒ
  
  * represents a propagated relevance vector, S is a similarity matrix, y represents an initial relevance vector, and α
  
  represents a decay rate.

11. A computing device with a processor and memory for propagating relevance of labeled documents to unlabeled documents, comprising:
- a document store that contains representations of documents, some of the documents being labeled with relevance to a query and others of the documents not being labeled with relevance to the query;
  
  a graph component that creates a graph of the documents with the documents represented as nodes being connected by edges representing similarity between documents having content, includinga document similarity component that determines the similarity between two documents by retrieving the content of the two documents and generating a similarity metric based on the retrieved content of the two documents to indicate similarity between the retrieved content of the two documents;
  
  a build graph component that builds a graph in which nodes representing similar documents are connected via edges, such that each node has an edge to a number of other nodes that are most similar to it;
  
  a generate weights component that generates weights for the edges based on similarity of the documents represented by the connected nodes such that the weight for an edge between documents whose content are similar is higher than the weight for an edge between documents whose content are not as similar; and
  
  a normalize weights component that normalizes the weights of the graph so that weights are relative to the weights of the connected documents;
  
  a propagate relevance component that propagates relevance of the labeled documents to the unlabeled documents based on the similarity between documents as indicated by the weights of the edges of the graph connecting the documents such that an unlabeled document that is similar to labeled documents as represented by the weights of the edges of the graph has its relevance set based on the relevance of those similar labeled documents; and
  
  a training component that trains a ranking function to generate relevance of a document to a query based on the documents, the query, and the labeled and propagated relevanceswherein the components comprise computer-executable instructions for execution by the processor andwherein the propagate relevance component propagates relevance according to the following equation;
  
  ƒ
  
  *=(I+α
  
  S+α
  
  ²S²+ . . . +α
  
  ^NS^N)y where ƒ
  
  * represents a propagated relevance vector, S is a similarity matrix, y represents an initial relevance vector, and α
  
  represents a decay rate, and where n represents an exponent for which ƒ
  
  * converges on a solution.

12. A computing device with a processor and memory for propagating relevance of labeled pages to a query to unlabeled pages to the query, comprising:
- a page store that contains representations of pages, some of the pages being labeled with relevance to a query and others of the pages not being labeled with relevance to the query;
  
  a graph component that creates a graph of the pages with the pages represented as nodes connected by edges representing similarity between the pages having content, including;
  
  a similarity component that determines the similarity between two pages by retrieving the content of the two pages and generating a similarity metric based on the retrieved content of the two pages to indicate similarity between the retrieved content of the two pages;
  
  a build graph component that builds a graph in which nodes representing similar pages are connected via edges; and
  
  a generate weights component that generates weights for the edges based on similarity of the pages represented by the connected nodes such that the weight for an edge between pages whose content are similar is higher than the weight for an edge between pages whose content are not as similar; and
  
  a propagate relevance component that propagates relevance of the labeled pages to the unlabeled pages based on the similarity between pages as indicated by the weights of the edges of the graph connecting the pages and based on a manifold ranking algorithm such that an unlabeled page that is similar to labeled pages as represented by the edges of the graph has its relevance set based on the relevance of those similar labeled pages, the propagation of relevance for a first page includes determining the relevance for the first page based on a summation of weighted relevances of the pages wherein the weighted relevance of a second page to the first page is based on the weight of the edge of the graph connecting the second page to the first page and the relevance of the second page.
- View Dependent Claims (13, 14, 15, 16, 17)
- - 13. The computing device of claim 12 wherein the build graph component establishes edges between nodes using a nearest neighbor algorithm.
  - 14. The computing device of claim 13 wherein the nearest neighbor algorithm uses a Euclidean distance metric.
  - 15. The computing device of claim 13 wherein the build graph component connects a node to its 10 nearest neighbors.
  - 16. The computing device of claim 12 wherein the generate weights component uses a Manhattan distance metric for representing the similarity between pages.
  - 17. The computing device of claim 12 wherein each page is represented by a feature vector and the similarity between pages is represented by distance in feature vector space.

18. A computer-readable storage medium containing instructions for controlling a computer system to propagate relevance of documents to a query to other documents, by a method comprising:
- for each pair of documents, determining similarity between a first document of the pair and a second document of the pair, the determining including retrieving first content of the first document and second content of second document and generating a similarity metric based on comparison of the first content and the second content to indicate similarity between the first document and the second document;
  
  creating a graph of the documents represented as nodes connected by edges having weights representing similarity between documents, some documents being labeled indicating their relevance to the query and other documents being unlabeled, the weight for an edge between documents whose content are similar is higher than the weight for an edge between document whose content are not as similar;
  
  propagating the relevance of the labeled documents to the unlabeled documents based on the weights of the edges between nodes using a manifold ranking based algorithm such that an unlabeled document that is similar to labeled documents as represented by the edges of the graph has its relevance set based on the relevance of those similar labeled documents, the propagation of relevance for a first document includes determining the relevance for the first document based on a summation of weighted relevances of the documents wherein the weighted relevance of a second document to the first document is based on the weight of the edge of the graph connecting the second document to the first document and the relevance of the second document; and
  
  training a ranking function to generate relevance of a document to a query base on the documents, the query, and the labeled relevance and the propagated relevance.
- View Dependent Claims (19)
- - 19. The computer-readable storage medium of claim 18 wherein the creating of the graph includes connecting edges using a nearest neighbor algorithm and establishing the weight of an edge based on distance between documents represented by the nodes connected by the edge.

20. A computer-readable storage medium containing instructions for controlling a computer system to propagate relevance of documents to a query to other documents, by a method comprising:
- for each pair of documents, determining similarity between a first document of the pair and a second document of the pair, the determining including retrieving first content of the first document and second content of second document and generating a similarity metric based on comparison of the first content and the second content to indicate similarity between the first document and the second document;
  
  creating a graph of the documents represented as nodes connected by edges having weights representing similarity between documents, some documents being labeled indicating their relevance to the query and other documents being unlabeled, the weight for an edge between documents whose content are similar is higher than the weight for an edge between documents whose content are not as similar;
  
  propagating the relevance of the labeled documents to the unlabeled documents based on the weights of the edges between nodes using a manifold ranking based algorithm such that an unlabeled document that is similar to labeled documents as represented by the edges of the graph has its relevance set based on the relevance of those similar labeled documents; and
  
  training a ranking function to generate relevance of a document to a query base on the documents, the query, and the labeled relevance and the propagated relevancewherein the propagating of the relevance of the labeled documents includes using a Taylor expansion to iteratively solve the following equation;
  
  ƒ
  
  *=(1−
  
  α
  
  )(I−
  
  α
  
  S)^−
  
  1y where ƒ
  
  * represents a propagated relevance vector, S is a similarity matrix, y represents an initial relevance vector, and a represents α
  
  decay rate.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Ma, Wei-Ying, Li, Mingjing, Li, Zhiwei, Wang, Jue
Primary Examiner(s)
Cottingham; John R.
Assistant Examiner(s)
Reyes; Mariela D

Application Number

US11/364,807
Publication Number

US 20070203940A1
Time in Patent Office

2,024 Days
Field of Search

707/736, 707/738, 707/748, 707/749
US Class Current

707/738
CPC Class Codes

G06F 16/951 Indexing; Web crawling tech...

Propagating relevance from labeled documents to unlabeled documents

First Claim

2 Assignments

0 Petitions

Accused Products

Abstract

Citations

20 Claims

Specification

Solutions

Use Cases

Quick Links

Propagating relevance from labeled documents to unlabeled documents

First Claim

2 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

Citations

20 Claims

Specification

Subscription Required

Solutions

Use Cases

Quick Links