Device and method for term set expansion based on semantic similarity
First Claim
1. A set expansion processing device comprising:
a receiver for receiving a seed string from a user;
a searcher for ordering a search engine to search, with the seed string, a first set of documents containing the seed string and generate snippets from the first set of documents received from the search engine;
a segment acquirer for generating segments composed of strings by partitioning the generated snippets, including the seed string, using one or more predetermined segment partition strings, wherein the strings composing the segments are arranged in order of appearance;
a segment component acquirer for generating segment components by partitioning each of the generated segments using one or more predetermined segment component partition strings;
a segment score computer for computing a segment score for each of the generated segments based on the variance or the standard deviation from the mean value of the lengths of the segment components appearing in their corresponding segments;
a segment component score computer for computing a segment component score for each of the segment components contained in each of the generated segments, based on a distance between the position of the seed string and the position of each corresponding segment component in the segment in which the corresponding segment component appears, and further based on the segment score computed for the segment in which the corresponding segment component appears;
a selector for selecting, from the segment components, instance candidates as part of an expanded set of terms contained in the same semantic category as the seed string based on the computed segment component score for each of the generated segment components, wherein the instance candidates include the seed string; and
an extractor for:
ordering the search engine to search, using the instance candidates, a second set of documents containing the instance candidates and generate additional snippets from the second set of documents received from the search engine;
generating a connection graph indicating n-grams connected to each of the instance candidates from the additional snippets by searching using the instance candidates;
computing a semantic similarity between the seed string and the instance candidates based on a left-side context similarity between n-grams followed by the seed string and n-grams followed by each of the instance candidates in the connection graph, and based on a right-side context similarity between n-grams following the seed string and n-grams following each of the instance candidates in the connection graph; and
extracting an instance that should be contained in the expanded set of terms from the instance candidates based on the semantic similarity, wherein, when the searcher orders the search engine to search, with the same semantic category as the seed string, the search engine outputs a third set of documents containing the expanded set of terms, including the extracted instance.
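The extractor's context comparison in claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the claim does not name a similarity measure, so Jaccard overlap of context n-gram sets, the equal weighting of left and right contexts, whitespace tokenization, single-token candidates, and the `threshold` value are all assumptions made for illustration. The function names are hypothetical.

```python
def build_connection_graph(snippets, candidates, n=2):
    """Map each candidate to the n-grams seen immediately before and after it.

    Assumption: snippets are whitespace-tokenized and each candidate is a single token.
    """
    graph = {c: {"left": set(), "right": set()} for c in candidates}
    for snippet in snippets:
        tokens = snippet.split()
        for i, token in enumerate(tokens):
            if token not in graph:
                continue
            if i >= n:  # a full n-gram precedes the candidate
                graph[token]["left"].add(tuple(tokens[i - n:i]))
            if i + n < len(tokens):  # a full n-gram follows the candidate
                graph[token]["right"].add(tuple(tokens[i + 1:i + 1 + n]))
    return graph


def jaccard(a, b):
    """Set overlap in [0, 1]; an assumed stand-in for the unspecified similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def semantic_similarity(graph, seed, candidate):
    """Average of left-side and right-side context similarity (assumed weighting)."""
    left = jaccard(graph[seed]["left"], graph[candidate]["left"])
    right = jaccard(graph[seed]["right"], graph[candidate]["right"])
    return (left + right) / 2


def extract_instances(graph, seed, candidates, threshold=0.5):
    """Keep candidates whose contexts resemble the seed's; threshold is hypothetical."""
    return [
        c for c in candidates
        if c != seed and semantic_similarity(graph, seed, c) >= threshold
    ]


snippets = [
    "buy fresh apple juice online today",
    "buy fresh grape juice online today",
    "read the terms of service now",
]
candidates = ["apple", "grape", "terms"]
graph = build_connection_graph(snippets, candidates)
extracted = extract_instances(graph, "apple", candidates)
```

In this toy input, "grape" shares both its preceding and following bigrams with the seed "apple", while "terms" shares neither, so only "grape" passes the threshold.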
Abstract
A receiving unit (101) receives a seed string. A search unit (102) searches snippets of documents containing the seed string. A segment acquisition unit (103) obtains segments by partitioning the snippets using a segment partition string. A segment component acquisition unit (104) obtains segment components by partitioning the segments using a segment component partition string. A segment score computation unit (105) calculates a segment score for a segment based on the standard deviation of the lengths of the segment components. A segment component score computation unit (106) calculates a segment component score for a segment component based on the segment score and the distance between the position of the seed string and the position of the segment component. A selection unit (107) selects any of the segment components as candidates for instances contained in the expanded set of the seed string based on the segment component scores.
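The scoring flow described in the abstract (units 103 to 107) can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the patented formulas: the partition strings, the `1 / (1 + deviation)` form of the segment score, the `segment score / (1 + distance)` form of the component score, the summation across segments, and the cut-off `k` are all hypothetical choices. The sketch also assumes the seed appears as a whole component of a segment.

```python
import re
from collections import defaultdict
from statistics import pstdev

# Assumed partition strings; the source only says they are predetermined.
SEGMENT_SEPARATORS = [". ", "|", "\n"]
COMPONENT_SEPARATORS = [",", "/"]


def split_on(text, separators):
    """Split on any separator, preserving order of appearance and dropping empties."""
    pattern = "|".join(re.escape(s) for s in separators)
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


def segment_score(components):
    """Assumed form: segments whose component lengths deviate less score higher."""
    lengths = [len(c) for c in components]
    return 1.0 / (1.0 + pstdev(lengths))


def score_components(snippets, seed):
    """Sum, per component, segment_score / (1 + distance to the seed position)."""
    totals = defaultdict(float)
    for snippet in snippets:
        for segment in split_on(snippet, SEGMENT_SEPARATORS):
            components = split_on(segment, COMPONENT_SEPARATORS)
            if seed not in components:
                continue  # only segments containing the seed are scored
            seg = segment_score(components)
            seed_pos = components.index(seed)
            for pos, comp in enumerate(components):
                totals[comp] += seg / (1.0 + abs(pos - seed_pos))
    return dict(totals)


def select_candidates(scores, seed, k=3):
    """Take the k best-scoring components and make sure the seed is among them."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    if seed not in top:
        top.append(seed)
    return top


snippets = ["apple, grape, peach, lemon. Read our long privacy policy, terms"]
scores = score_components(snippets, "apple")
selected = select_candidates(scores, "apple")
```

Here the list-like segment has components of equal length, so its deviation is zero and its segment score is at the maximum; the second segment does not contain the seed and contributes nothing.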
18 Citations
6 Claims
6. A set expansion processing method comprising steps performed by a computer, the steps comprising:
a receiving step of receiving a seed string from a user;
a search step of ordering a search engine to search, with the seed string, a first set of documents containing the seed string and generate snippets from the first set of documents received from the search engine;
a segment acquisition step of generating segments composed of strings by partitioning the generated snippets, including the seed string, using one or more predetermined segment partition strings, wherein the strings composing the segments are arranged in order of appearance;
a segment component acquisition step of generating segment components by partitioning each of the generated segments using one or more predetermined segment component partition strings;
a segment score computation step of computing a segment score for each of the generated segments based on the variance or the standard deviation from the mean value of the lengths of the segment components appearing in their corresponding segments;
a segment component score computation step of computing a segment component score for each of the segment components contained in each of the generated segments, based on a distance between the position of the seed string and the position of each corresponding segment component in the segment in which the corresponding segment component appears, and further based on the segment score computed for the segment in which the corresponding segment component appears;
a selection step of selecting, from the segment components, instance candidates as part of an expanded set of terms contained in the same semantic category as the seed string based on the computed segment component score for each of the generated segment components, wherein the instance candidates include the seed string; and
an extraction step of:
ordering the search engine to search, using the instance candidates, a second set of documents containing the instance candidates and generate additional snippets from the second set of documents received from the search engine;
generating a connection graph indicating n-grams connected to each of the instance candidates from the additional snippets by searching using the instance candidates;
computing a semantic similarity between the seed string and the instance candidates based on a left-side context similarity between n-grams followed by the seed string and n-grams followed by each of the instance candidates in the connection graph, and based on a right-side context similarity between n-grams following the seed string and n-grams following each of the instance candidates in the connection graph; and
extracting an instance that should be contained in the expanded set of terms from the instance candidates based on the semantic similarity, wherein, when the search engine is ordered to search with the same semantic category as the seed string, the search engine outputs a third set of documents containing the expanded set of terms, including the extracted instance.
Specification