Device and method for term set expansion based on semantic similarity

US 9,268,821 B2
Filed: 02/22/2012
Issued: 02/23/2016
Est. Priority Date: 03/04/2011
Status: Active Grant

Abstract

A receiving unit (101) receives a seed string. A search unit (102) searches snippets of documents containing the seed string. A segment acquisition unit (103) obtains segments by partitioning the snippets using a segment partition string. A segment component acquisition unit (104) obtains segment components by partitioning the segments using a segment component partition string. A segment score computation unit (105) calculates a segment score for a segment based on the standard deviation of the lengths of the segment components. A segment component score computation unit (106) calculates a segment component score for a segment component based on the segment score and the distance between the position of the seed string and the position of the segment component. A selection unit (107) selects any of the segment components as candidates for instances contained in the expanded set of the seed string based on the segment component scores.
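
The partitioning and segment scoring described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the delimiter lists, the function names, the filter that keeps only segments containing the seed, and the mapping from standard deviation to a score (`1 / (1 + stdev)`) are all assumptions made for the example. The source only states that the partition strings are predetermined and that the segment score is based on the standard deviation of the component lengths.

```python
import re
from statistics import pstdev

# Hypothetical partition strings; the source only calls them "predetermined".
SEGMENT_DELIMS = ["。", ". ", "!", "?"]
COMPONENT_DELIMS = [",", "、", "/"]


def split_on(text, delims):
    """Split text on any of the delimiter strings, dropping empty pieces."""
    pattern = "|".join(re.escape(d) for d in delims)
    return [p.strip() for p in re.split(pattern, text) if p.strip()]


def segments_with_seed(snippet, seed):
    """Partition a snippet into segments; keep those containing the seed (assumed filter)."""
    return [s for s in split_on(snippet, SEGMENT_DELIMS) if seed in s]


def segment_score(components):
    """Assumed mapping: uniform component lengths score 1.0, uneven ones approach 0."""
    if not components:
        return 0.0
    lengths = [len(c) for c in components]
    return 1.0 / (1.0 + pstdev(lengths))


snippet = "Popular fruits: apple, banana, cherry, mango. Visit our store today"
segments = segments_with_seed(snippet, "banana")
components = split_on(segments[0], COMPONENT_DELIMS)
score = segment_score(components)
```

Here the first component still carries the lead-in text "Popular fruits:", which raises the length deviation and lowers the segment score. That is the intuition behind using length uniformity as a signal that a segment is a list of peer terms.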

18 Citations

6 Claims

1. A set expansion processing device comprising:
- a receiver for receiving a seed string from a user;
  
  a searcher for ordering a search engine to search, with the seed string, a first set of documents containing the seed string and generate snippets from the first set of documents received from the search engine;
  
  a segment acquirer for generating segments composed of strings by partitioning the generated snippets, including the seed string, using one or more predetermined segment partition strings, wherein the strings composing the segments are arranged in order of appearance;
  
  a segment component acquirer for generating segment components by partitioning each of the generated segments using one or more predetermined segment component partition strings;
  
  a segment score computer for computing a segment score for each of the generated segments based on the variance or the standard deviation from the mean value of the lengths of the segment components appearing in their corresponding segments;
  
  a segment component score computer for computing a segment component score for each of the segment components contained in each of the generated segments, based on a distance between the position of the seed string and the position of each corresponding segment component in the segment in which the corresponding segment component appears, and further based on the segment score computed for the segment in which the corresponding segment component appears;
  
  a selector for selecting, from the segment components, instance candidates as part of an expanded set of terms contained in the same semantic category as the seed string based on the computed segment component score for each of the generated segment components, wherein the instance candidates include the seed string; and
  
  an extractor for:
  
  ordering the search engine to search, using the instance candidates, a second set of documents containing the instance candidates and generate additional snippets from the second set of documents received from the search engine;
  
  generating a connection graph indicating n-grams connected to each of the instance candidates from the additional snippets by searching using the instance candidates;
  
  computing a semantic similarity between the seed string and the instance candidates based on a left-side context similarity between n-grams followed by the seed string and n-grams followed by each of the instance candidates in the connection graph, and based on a right-side context similarity between n-grams following the seed string and n-grams following each of the instance candidates in the connection graph; and
  
  extracting an instance that should be contained in the expanded set of terms from the instance candidates based on the semantic similarity, wherein, when the searcher orders the search engine to search, with the same semantic category as the seed string, the search engine outputs a third set of documents containing the expanded set of terms, including the extracted instance.

2. The set expansion processing device of claim 1, wherein:
- the extractor computes the similarity between the seed string and the instance candidates based on similarities between n-grams connected to the seed string before the seed string and n-grams connected to the instance candidate before the instance candidates, and similarities between n-grams connected to the seed string after the seed string and n-grams connected to the instance candidate after the instance candidates.

3. The set expansion processing device of claim 2, wherein for each of the generated segments, when the variance or standard deviation of the lengths of the segment components appearing in that segment exceeds a predetermined threshold value, the corresponding segment score and the corresponding segment component score become values such that segment components contained in that segment are not selected by the selector as the candidates.

4. The set expansion processing device of claim 1, wherein for each of the generated segments, when the variance or standard deviation of the lengths of the segment components appearing in that segment exceeds a predetermined threshold value, the corresponding segment score and the corresponding segment component score become values such that segment components contained in that segment are not selected by the selector as the candidates.

5. The set expansion processing device of claim 1, wherein the segment component score of each segment component appearing in each of the generated segments decays exponentially with respect to the shortest distance between the position where the received seed string appears in that segment and the position where the segment component appears in that segment.
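
The threshold cut-off in claims 3 and 4 and the exponential decay in claim 5 might be combined along the following lines. This is a sketch under stated assumptions: the function name, the `decay` and `max_stdev` defaults, the use of component index as "position", the use of zero as the "not selected" value, and the multiplication of segment score by the decay factor are all illustrative choices that the claims do not specify.

```python
from math import exp
from statistics import pstdev


def component_scores(components, seed, seg_score, decay=0.5, max_stdev=5.0):
    """Score the components of one segment (all names and defaults are assumptions)."""
    lengths = [len(c) for c in components]
    # Claims 3-4: a segment with too much length deviation contributes no candidates.
    # Zero is an assumed stand-in for "a value such that it is not selected".
    if seed not in components or pstdev(lengths) > max_stdev:
        return {c: 0.0 for c in components}

    seed_positions = [i for i, c in enumerate(components) if c == seed]
    scores = {}
    for i, c in enumerate(components):
        # Claim 5: decay with the shortest distance to any occurrence of the seed.
        distance = min(abs(i - p) for p in seed_positions)
        value = seg_score * exp(-decay * distance)
        scores[c] = max(scores.get(c, 0.0), value)
    return scores


scores = component_scores(["apple", "banana", "cherry", "mango"], "banana", 1.0)
rejected = component_scores(
    ["Popular fruits of the whole wide world: apple", "banana"], "banana", 1.0
)
```

In the first call the seed itself scores highest, its direct neighbours tie, and "mango" (two positions away) scores lowest. In the second call the length deviation exceeds the assumed threshold, so every component scores zero.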

6. A set expansion processing method comprising steps performed by a computer, the steps comprising:
- a receiving step of receiving a seed string from a user;
  
  a search step of ordering a search engine to search, with the seed string, a first set of documents containing the seed string and generate snippets from the first set of documents received from the search engine;
  
  a segment acquisition step of generating segments composed of strings by partitioning the generated snippets, including the seed string, using one or more predetermined segment partition strings, wherein the strings composing the segments are arranged in order of appearance;
  
  a segment component acquisition step of generating segment components by partitioning each of the generated segments using one or more predetermined segment component partition strings;
  
  a segment score computation step of computing a segment score for each of the generated segments based on the variance or the standard deviation from the mean value of the lengths of the segment components appearing in their corresponding segments;
  
  a segment component score computation step of computing a segment component score for each of the segment components contained in each of the generated segments, based on a distance between the position of the seed string and the position of each corresponding segment component in the segment in which the corresponding segment component appears, and further based on the segment score computed for the segment in which the corresponding segment component appears;
  
  a selection step of selecting, from the segment components, instance candidates as part of an expanded set of terms contained in the same semantic category as the seed string based on the computed segment component score for each of the generated segment components, wherein the instance candidates include the seed string; and
  
  an extraction step of:
  
  ordering the search engine to search, using the instance candidates, a second set of documents containing the instance candidates and generate additional snippets from the second set of documents received from the search engine;
  
  generating a connection graph indicating n-grams connected to each of the instance candidates from the additional snippets by searching using the instance candidates;
  
  computing a semantic similarity between the seed string and the instance candidates based on a left-side context similarity between n-grams followed by the seed string and n-grams followed by each of the instance candidates in the connection graph, and based on a right-side context similarity between n-grams following the seed string and n-grams following each of the instance candidates in the connection graph; and
  
  extracting an instance that should be contained in the expanded set of terms from the instance candidates based on the semantic similarity, wherein, when the search engine is ordered to search with the same semantic category as the seed string, the search engine outputs a third set of documents containing the expanded set of terms, including the extracted instance.
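
The left-side and right-side context comparison in the extraction step could be approximated as shown below. This is only an illustrative sketch: the claims do not name a similarity measure, so cosine similarity over n-gram counts, the equal weighting of the two sides, whitespace tokenization, and single-word terms are all assumptions, as are the function names. The pair of `Counter` objects per term stands in for the term's edges in the connection graph.

```python
from collections import Counter
from math import sqrt


def context_ngrams(snippets, term, n=2):
    """Count the word n-grams directly before and after each occurrence of term."""
    left, right = Counter(), Counter()
    for snippet in snippets:
        tokens = snippet.lower().split()
        for i, tok in enumerate(tokens):
            if tok != term:
                continue
            if i >= n:  # enough tokens on the left for a full n-gram
                left[tuple(tokens[i - n:i])] += 1
            if i + n < len(tokens):  # enough tokens on the right
                right[tuple(tokens[i + 1:i + 1 + n])] += 1
    return left, right


def cosine(a, b):
    """Cosine similarity between two count vectors (assumed measure)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_similarity(snippets, seed, candidate, n=2):
    """Average of left-context and right-context similarity (assumed weighting)."""
    seed_left, seed_right = context_ngrams(snippets, seed, n)
    cand_left, cand_right = context_ngrams(snippets, candidate, n)
    return 0.5 * (cosine(seed_left, cand_left) + cosine(seed_right, cand_right))


snippets = [
    "we sell fresh apple juice today",
    "we sell fresh mango juice today",
    "please contact store manager for details",
]
sim_mango = semantic_similarity(snippets, "apple", "mango")
sim_manager = semantic_similarity(snippets, "apple", "manager")
```

With these toy snippets, "mango" shares both its left and right contexts with the seed "apple", while "manager" shares neither, so a similarity threshold would keep the former and drop the latter.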

Current Assignee: Rakuten Group, Inc.
Original Assignee: Rakuten, Inc. (Rakuten Group, Inc.)
Inventors: Hagiwara, Masato
Primary Examiner(s): Saeed, Usmaan
Assistant Examiner(s): Tamaru, Michael K
Application Number: US13/700,898
Publication Number: US 20130144875A1
Time in Patent Office: 1,462 Days
US Class Current: 1/1

CPC Class Codes

G06F 16/24575   using context
G06F 16/24578   using ranking
G06F 16/3338   Query expansion
G06F 16/35   Clustering; Classification
G06Q 30/0601   Electronic shopping [e-shop...
