ITERATIVE SET EXPANSION USING SAMPLES
First Claim
1. A computer system for iterative set expansion using samples, the system comprising:
- a processor and memory configured to execute software instructions embodied within the following components;
an input component that receives a set of seed terms and a set of terms and associated contexts with which to expand the set of seed terms;
a data modeling component that models the received terms and seeds as a bipartite graph with candidate terms being nodes on one side and identified context nodes on the other side;
a similarity determining component that determines a similarity metric between two candidate nodes in the graph based on the candidate nodes'"'"' relationship to the context nodes in the graph;
a relevance determining component that determines a relevance metric that indicates how similar a node in the graph is to the received seed terms and corresponding nodes;
a coherence determining component that determines a coherence metric that indicates how consistent a concept set is that includes the seed nodes and one or more candidate nodes;
a quality measurement component that combines the determined relevance metric and coherence metric to determine a quality metric that indicates relevance and coherence among a set of nodes in the graph;
an iterative expansion component that identifies an expanded seed set having a high quality metric; and
a set reporting component that reports the identified expanded seed set as output.
2 Assignments
0 Petitions
Accused Products
Abstract
A set expansion system is described herein that uses general-purpose web data to expand a set of seed entities. The system includes a simple yet effective quality metric to measure the expanded set, and includes two iterative thresholding processes to rank candidate entities. The system models web data sources and integrates relevance and coherence measurements to evaluate potential set candidates using an iterative process. The system uses general-purpose web data that is not specific to the given seeds. The system defines quality of the result set as the sum of two component scores: the relevance of a set of entities that measures their similarity with the given seeds, and the coherence of the set of entities produced which is how closely the entities in the set are related to each other. Based on this quality measure, the system develops a class of iterative set expansion processes.
-
Citations
20 Claims
-
1. A computer system for iterative set expansion using samples, the system comprising:
-
a processor and memory configured to execute software instructions embodied within the following components; an input component that receives a set of seed terms and a set of terms and associated contexts with which to expand the set of seed terms; a data modeling component that models the received terms and seeds as a bipartite graph with candidate terms being nodes on one side and identified context nodes on the other side; a similarity determining component that determines a similarity metric between two candidate nodes in the graph based on the candidate nodes'"'"' relationship to the context nodes in the graph; a relevance determining component that determines a relevance metric that indicates how similar a node in the graph is to the received seed terms and corresponding nodes; a coherence determining component that determines a coherence metric that indicates how consistent a concept set is that includes the seed nodes and one or more candidate nodes; a quality measurement component that combines the determined relevance metric and coherence metric to determine a quality metric that indicates relevance and coherence among a set of nodes in the graph; an iterative expansion component that identifies an expanded seed set having a high quality metric; and a set reporting component that reports the identified expanded seed set as output. - View Dependent Claims (2, 3, 4, 5, 6)
-
-
7. A computer-implemented method to expand a set of seeds while applying a dynamic threshold of relatedness, the method comprising:
-
receiving a set of terms with contexts and one or more identified seeds, wherein the seeds are terms that are related to a concept for which to identify additional related terms from the set of terms; determining a relevance score for each term based on the identified seeds; ranking the received set of terms by the determined relevance score; selecting an initial threshold ranking value for separating terms in the set related to the seeds from terms not related to the seeds; picking a top ranked number of terms above a threshold from the ranked set of terms based on the selected initial threshold to form a new set; determining a quality measurement that identifies how well each term relates to the picked threshold number of terms in the new set; ranking the terms in the new set based on the determined quality measurement; selecting a next threshold to use to separate terms in the new set related to the seeds from terms not related to the seeds; using the selected next threshold to select a threshold number of terms from the ranked new set; repeating the steps of determining the quality measurement, ranking the terms, and selecting a threshold number of terms for a determined number of iterations; and reporting the resulting expanded seed set that includes the terms in the received set that are the highest quality matches to the received seeds, wherein the preceding steps are performed by at least one processor. - View Dependent Claims (8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19)
-
-
20. A computer-readable storage medium comprising instructions for controlling a computer system to expand a set of seeds using a static threshold, wherein the instructions, upon execution, cause a processor to perform actions comprising:
-
receiving a set of terms with contexts modeled as a general bipartite graph and identified seeds, wherein the seeds are terms that are related to a concept for which to identify additional related terms from the set of terms; determining a relevance score for each term based on the identified seeds; ranking the received set of terms by the determined relevance score; determining a static threshold and picking a top ranked number of terms above a threshold from the ranked set of terms to form a new set; determining a quality measurement that identifies how well each term relates to the picked threshold number of terms in the new set; ranking the terms in the new set based on the determined quality measurement; using the previously determined static threshold to select a threshold number of terms from the ranked new set; upon determining that the selected threshold number of terms from the ranked new set does not match the top ranked number of terms from the previously ranked set of terms, replacing the lowest ranked term in the previously ranked set with the highest ranked term in the new set that is not already in the previously ranked set; and repeating the steps of determining the quality measurement, ranking the terms, selecting a threshold number of terms, and replacing the lowest ranked term until the sets match; and upon determining that the selected threshold number of terms from the ranked new set matches the top ranked number of terms from the previously ranked set of terms, reporting the resulting expanded seed set that includes the terms in the received set that are the highest quality matches to the received seeds.
-
Specification