ITERATIVE SET EXPANSION USING SAMPLES

US 20120323932A1
Filed: 06/20/2011
Published: 12/20/2012
Est. Priority Date: 06/20/2011
Status: Active Grant

Abstract

A set expansion system is described herein that uses general-purpose web data to expand a set of seed entities. The system includes a simple yet effective quality metric to measure the expanded set, and includes two iterative thresholding processes to rank candidate entities. The system models web data sources and integrates relevance and coherence measurements to evaluate potential set candidates using an iterative process. The system uses general-purpose web data that is not specific to the given seeds. The system defines quality of the result set as the sum of two component scores: the relevance of a set of entities that measures their similarity with the given seeds, and the coherence of the set of entities produced which is how closely the entities in the set are related to each other. Based on this quality measure, the system develops a class of iterative set expansion processes.
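
The quality measure described in the abstract can be illustrated with a minimal sketch. It assumes that relevance is the average similarity between set members and the seeds, that coherence is the average pairwise similarity inside the set, and that the two are combined with a weight `alpha`; the function names, the weight, and the toy similarity table are all hypothetical and are not taken from the patent.

```python
from itertools import combinations

def relevance(entities, seeds, sim):
    """Average similarity between every entity and every seed."""
    pairs = [(e, s) for e in entities for s in seeds]
    return sum(sim(e, s) for e, s in pairs) / len(pairs) if pairs else 0.0

def coherence(entities, sim):
    """Average similarity over all unordered pairs inside the set."""
    pairs = list(combinations(entities, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0

def quality(entities, seeds, sim, alpha=0.5):
    """Weighted sum of relevance and coherence (alpha is an assumed weight)."""
    return alpha * relevance(entities, seeds, sim) + (1 - alpha) * coherence(entities, sim)

# Toy symmetric similarity table (made-up values for illustration only).
TOY = {
    frozenset(("canon", "nikon")): 0.8,
    frozenset(("canon", "sony")): 0.6,
    frozenset(("canon", "apple")): 0.6,
    frozenset(("nikon", "sony")): 0.7,
    frozenset(("nikon", "apple")): 0.1,
}

def toy_sim(a, b):
    return 1.0 if a == b else TOY.get(frozenset((a, b)), 0.0)

coherent = quality(["nikon", "sony"], ["canon"], toy_sim)
incoherent = quality(["nikon", "apple"], ["canon"], toy_sim)
print(coherent > incoherent)
```

In this toy data both candidate sets are equally relevant to the seed, so the difference in quality comes entirely from the coherence term.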

20 Claims

1. A computer system for iterative set expansion using samples, the system comprising:
- a processor and memory configured to execute software instructions embodied within the following components;
  
  an input component that receives a set of seed terms and a set of terms and associated contexts with which to expand the set of seed terms;
  
  a data modeling component that models the received terms and seeds as a bipartite graph with candidate terms being nodes on one side and identified context nodes on the other side;
  
  a similarity determining component that determines a similarity metric between two candidate nodes in the graph based on the candidate nodes' relationship to the context nodes in the graph;
  
  a relevance determining component that determines a relevance metric that indicates how similar a node in the graph is to the received seed terms and corresponding nodes;
  
  a coherence determining component that determines a coherence metric that indicates how consistent a concept set is that includes the seed nodes and one or more candidate nodes;
  
  a quality measurement component that combines the determined relevance metric and coherence metric to determine a quality metric that indicates relevance and coherence among a set of nodes in the graph;
  
  an iterative expansion component that identifies an expanded seed set having a high quality metric; and
  
  a set reporting component that reports the identified expanded seed set as output.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The system of claim 1 wherein the data modeling component assigns weights to the edges between nodes based on a quality of a source from which the terms were extracted.
  - 3. The system of claim 1 wherein the data modeling component models web query log data by dividing each query into a context of a fixed number of tokens of prefix or suffix and a remaining term.
  - 4. The system of claim 1 wherein the data modeling component models web query log data by weighting the edges using a mutual information probability calculation and discarding edges below a threshold probability.
  - 5. The system of claim 1 wherein the coherence determining component considers similarity of nodes to other candidate nodes to identify nodes that are relevant but nonetheless likely do not belong in the same expanded set because they are incoherent compared to other candidate nodes.
  - 6. The system of claim 1 wherein the quality measurement component determines coherence in addition to relevance to reduce noise and allows the system to operate with readily available but noisy datasets.
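
As a rough illustration of the data modeling and similarity components in claims 1 and 3, the following sketch builds an unweighted bipartite graph from a few made-up queries and compares two terms by the Jaccard similarity of their context sets. The helper names, the `prefix:`/`suffix:` labels, and the sample queries are assumptions for this example; edge weighting (claims 2 and 4) is left out for brevity.

```python
from collections import defaultdict

def split_query(query, n_context=1):
    """Split a query into (term, context) pairs, using the first or last
    n_context tokens as the context and the remaining tokens as the term."""
    tokens = query.lower().split()
    if len(tokens) <= n_context:
        return []  # nothing left over to serve as a term
    prefix_ctx = "prefix:" + " ".join(tokens[:n_context])
    suffix_ctx = "suffix:" + " ".join(tokens[-n_context:])
    return [
        (" ".join(tokens[n_context:]), prefix_ctx),
        (" ".join(tokens[:-n_context]), suffix_ctx),
    ]

def build_bipartite(queries, n_context=1):
    """Map each candidate term (one side) to its set of context nodes (other side)."""
    graph = defaultdict(set)
    for query in queries:
        for term, context in split_query(query, n_context):
            graph[term].add(context)
    return dict(graph)

def jaccard(graph, a, b):
    """Jaccard similarity of the context sets of two candidate terms."""
    ctx_a, ctx_b = graph.get(a, set()), graph.get(b, set())
    union = ctx_a | ctx_b
    return len(ctx_a & ctx_b) / len(union) if union else 0.0

queries = ["canon reviews", "nikon reviews", "canon price", "nikon price", "sony reviews"]
graph = build_bipartite(queries)
print(jaccard(graph, "canon", "nikon"), jaccard(graph, "canon", "sony"))
```

Terms that appear in the same query contexts end up with overlapping neighbor sets, which is what the similarity metric measures here.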

7. A computer-implemented method to expand a set of seeds while applying a dynamic threshold of relatedness, the method comprising:
- receiving a set of terms with contexts and one or more identified seeds, wherein the seeds are terms that are related to a concept for which to identify additional related terms from the set of terms;
  
  determining a relevance score for each term based on the identified seeds;
  
  ranking the received set of terms by the determined relevance score;
  
  selecting an initial threshold ranking value for separating terms in the set related to the seeds from terms not related to the seeds;
  
  picking a top ranked number of terms above a threshold from the ranked set of terms based on the selected initial threshold to form a new set;
  
  determining a quality measurement that identifies how well each term relates to the picked threshold number of terms in the new set;
  
  ranking the terms in the new set based on the determined quality measurement;
  
  selecting a next threshold to use to separate terms in the new set related to the seeds from terms not related to the seeds;
  
  using the selected next threshold to select a threshold number of terms from the ranked new set;
  
  repeating the steps of determining the quality measurement, ranking the terms, and selecting a threshold number of terms for a determined number of iterations; and
  
  reporting the resulting expanded seed set that includes the terms in the received set that are the highest quality matches to the received seeds, wherein the preceding steps are performed by at least one processor.
- Dependent claims (8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19):
  - 8. The method of claim 7 wherein receiving the terms and seeds comprises receiving the terms and seeds programmatically through an application-programming interface (API) that exposes the method to software components to provide set expansion functionality.
  - 9. The method of claim 7 wherein receiving the terms and seeds comprises receiving the set of terms from a list retrieved from the Internet.
  - 10. The method of claim 7 wherein receiving the terms and seeds comprises receiving the set of terms from a web query log.
  - 11. The method of claim 7 wherein receiving the terms and seeds comprises receiving a noisy set of terms that includes many unrelated terms in the set, and wherein the method identifies those terms that are most related to expand the set of seeds while eliminating the noise.
  - 12. The method of claim 7 wherein determining the relevance score comprises calculating a similarity metric that is determined using a Jaccard similarity or Cosine similarity function between each term and the identified seeds.
  - 13. The method of claim 7 wherein ranking the received terms comprises invoking a sorting function that orders the terms by the determined relevance scores.
  - 14. The method of claim 7 wherein selecting the initial threshold comprises identifying those terms with relevance scores above the threshold as related to the seeds and those terms below as not to be related to the seeds.
  - 15. The method of claim 7 wherein selecting the initial threshold comprises using iterative threshold selection or Otsu's thresholding to select the threshold.
  - 16. The method of claim 7 wherein picking the top ranked number of terms comprises selecting terms with a relevance score that is above or equal to the selected initial threshold and using matching terms to form an initial expanded seed set that will be refined in each iteration of an iterative process to determine terms most related to the seeds.
  - 17. The method of claim 7 wherein determining the quality measurement comprises calculating a relevance score and a coherence score, wherein the quality measurement is a combination of a weighted relevance score and a weighted coherence score.
  - 18. The method of claim 7 wherein selecting the next threshold comprises selecting a value that differs from the initial threshold based on a distribution of the data in the new set.
  - 19. The method of claim 7 further comprising repeating the steps of determining a quality measurement, ranking the terms, selecting the next threshold, and using the selected next threshold to select a threshold number of items for a fixed number of iterations to iteratively improve a resulting expanded seed set.
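
The dynamic-threshold method of claims 7 through 19 can be sketched roughly as follows. This is one possible reading, not the patented implementation: it assumes a simplified Otsu-style split over the raw score list, rescores every candidate term against the current set in each iteration, and stops early if the set stops changing. The names `otsu_threshold`, `mean_sim`, `dynamic_expand`, the weight `alpha`, and the toy similarity values are hypothetical.

```python
def otsu_threshold(scores):
    """Pick the score that maximizes between-class variance when the scores
    are split into a low group (< t) and a high group (>= t)."""
    values = sorted(set(scores))
    if len(values) < 2:
        return values[0] if values else 0.0
    best_t, best_var, n = values[-1], -1.0, len(scores)
    for t in values[1:]:
        low = [s for s in scores if s < t]
        high = [s for s in scores if s >= t]
        gap = sum(low) / len(low) - sum(high) / len(high)
        var = (len(low) / n) * (len(high) / n) * gap ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t

def mean_sim(term, others, sim):
    """Average similarity of term to the other members of a collection."""
    others = [o for o in others if o != term]
    return sum(sim(term, o) for o in others) / len(others) if others else 0.0

def dynamic_expand(terms, seeds, sim, alpha=0.5, iterations=5):
    # Initial pass: rank by relevance to the seeds and cut at a data-driven threshold.
    scores = {t: mean_sim(t, seeds, sim) for t in terms}
    threshold = otsu_threshold(list(scores.values()))
    current = {t for t, s in scores.items() if s >= threshold}
    for _ in range(iterations):
        # Quality = weighted relevance to seeds + weighted coherence with current set.
        scores = {t: alpha * mean_sim(t, seeds, sim)
                     + (1 - alpha) * mean_sim(t, current, sim) for t in terms}
        threshold = otsu_threshold(list(scores.values()))  # re-selected each round
        selected = {t for t, s in scores.items() if s >= threshold}
        if selected == current:
            break
        current = selected
    return sorted(current, key=scores.get, reverse=True)

# Toy symmetric similarity table (made-up values for illustration only).
TOY = {frozenset(p): v for p, v in [
    (("canon", "nikon"), 0.9), (("canon", "sony"), 0.8), (("canon", "apple"), 0.7),
    (("canon", "banana"), 0.1), (("nikon", "sony"), 0.9), (("nikon", "apple"), 0.1),
    (("sony", "apple"), 0.1), (("apple", "banana"), 0.8)]}

def toy_sim(a, b):
    return 1.0 if a == b else TOY.get(frozenset((a, b)), 0.0)

expanded = dynamic_expand(["nikon", "sony", "apple", "banana"], ["canon"], toy_sim)
print(expanded)
```

In the toy data, "apple" passes the initial relevance cut but is dropped once coherence with the other selected terms is taken into account, which is the behavior claim 5 describes for relevant but incoherent candidates.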

20. A computer-readable storage medium comprising instructions for controlling a computer system to expand a set of seeds using a static threshold, wherein the instructions, upon execution, cause a processor to perform actions comprising:
- receiving a set of terms with contexts modeled as a general bipartite graph and identified seeds, wherein the seeds are terms that are related to a concept for which to identify additional related terms from the set of terms;
  
  determining a relevance score for each term based on the identified seeds;
  
  ranking the received set of terms by the determined relevance score;
  
  determining a static threshold and picking a top ranked number of terms above a threshold from the ranked set of terms to form a new set;
  
  determining a quality measurement that identifies how well each term relates to the picked threshold number of terms in the new set;
  
  ranking the terms in the new set based on the determined quality measurement;
  
  using the previously determined static threshold to select a threshold number of terms from the ranked new set;
  
  upon determining that the selected threshold number of terms from the ranked new set does not match the top ranked number of terms from the previously ranked set of terms, replacing the lowest ranked term in the previously ranked set with the highest ranked term in the new set that is not already in the previously ranked set; and
  
  repeating the steps of determining the quality measurement, ranking the terms, selecting a threshold number of terms, and replacing the lowest ranked term until the sets match; and
  
  upon determining that the selected threshold number of terms from the ranked new set matches the top ranked number of terms from the previously ranked set of terms, reporting the resulting expanded seed set that includes the terms in the received set that are the highest quality matches to the received seeds.
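
The static-threshold procedure of claim 20 can be sketched in a similar way. The sketch assumes the static threshold is a fixed set size `k`, ranks the "lowest ranked term" of the previous set by the current quality scores, and adds a `max_iterations` guard that the claim does not mention; these choices, the function name `static_expand`, and the toy similarity values are assumptions for illustration.

```python
def static_expand(terms, seeds, sim, k, alpha=0.5, max_iterations=100):
    def mean_sim(term, others):
        others = [o for o in others if o != term]
        return sum(sim(term, o) for o in others) / len(others) if others else 0.0

    # Initial set: the k terms most relevant to the seeds.
    relevance = {t: mean_sim(t, seeds) for t in terms}
    current = set(sorted(terms, key=relevance.get, reverse=True)[:k])

    for _ in range(max_iterations):  # safety guard, not part of the claim
        # Quality of each term relative to the seeds and the current set.
        quality = {t: alpha * relevance[t] + (1 - alpha) * mean_sim(t, current)
                   for t in terms}
        ranked = sorted(terms, key=quality.get, reverse=True)
        top_k = ranked[:k]
        if set(top_k) == current:
            break  # the sets match, so the expansion has converged
        # Swap exactly one term: the best newcomer replaces the weakest member.
        newcomer = next(t for t in top_k if t not in current)
        weakest = min(current, key=quality.get)
        current = (current - {weakest}) | {newcomer}
    return sorted(current, key=relevance.get, reverse=True)

# Toy symmetric similarity table (made-up values for illustration only).
TOY = {frozenset(p): v for p, v in [
    (("canon", "nikon"), 0.9), (("canon", "apple"), 0.8), (("canon", "sony"), 0.7),
    (("canon", "banana"), 0.1), (("nikon", "sony"), 0.9), (("nikon", "apple"), 0.1),
    (("sony", "apple"), 0.1), (("apple", "banana"), 0.8)]}

def toy_sim(a, b):
    return 1.0 if a == b else TOY.get(frozenset((a, b)), 0.0)

result = static_expand(["nikon", "sony", "apple", "banana"], ["canon"], toy_sim, k=2)
print(result)
```

Replacing only one term per round keeps the set size fixed at `k` and changes the set gradually, so each new quality ranking is computed against a set that differs from the previous one by a single member.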

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Microsoft Corporation
Inventors: Xin, Dong; He, Yeye; Cheng, Tao
Granted Patent: US 8,589,408 B2
US Class Current: 707/749
CPC Class Codes: G06F 16/367 Ontology
