Search apparatus, search method, and non-transitory computer readable medium storing program that input a query representing a subset of a document set stored to a document database and output a keyword that often appears in the subset

US 8,892,574 B2
Filed: 11/06/2009
Issued: 11/18/2014
Est. Priority Date: 11/26/2008
Status: Active Grant

Abstract

Provided is a search apparatus, a search method, and a program that can improve search speed for a document set even when an object to be searched is a large-scale document set. A search apparatus, in an embodiment, includes an abstract matrix storage unit, a word frequency calculation unit, and a document frequency reference unit.

11 Citations

13 Claims

1. A search apparatus including a central processing unit, comprising:
- a cluster creation unit that creates a plurality of regions from a word document matrix specifying a co-occurrence relationship between a word set and a document set, by dividing the word set and the document set into a plurality of subsets;
  
  a region abstract creation unit that calculates, for each of the plurality of regions, a region frequency representing a number of documents including a word in each region, creates an abstract matrix specifying each region frequency for each region, and stores the created abstract matrix into an abstract matrix storage unit;
  
  a region upper limit calculation unit that, when information representing at least one subset of the plurality of subsets is input, examines a relationship between the information representing the at least one subset of the plurality of subsets and the plurality of regions, refers to abstract information for each of the plurality of regions from the obtained result of the relationship, and calculates, for each of the plurality of regions, an upper limit value of the frequency of the word included in each of the plurality of regions for the at least one subset of the plurality of subsets;
  
  a word frequency calculation unit that sums the upper limit value of the frequency of the word for each region with a common word in the plurality of regions, and specifies the summed value as the upper limit value of the frequency of the word for each region with the common word; and
  
  a document frequency reference unit that determines a region to be searched in the plurality of regions according to the upper limit value of the frequency of the word for each region with the common word, and further specifies a top number of words with high frequency according to the determined region to be searched, and outputs the specified top number of words with high frequency as characteristic words in the at least one subset of the plurality of subsets.
- Dependent claims (2, 3, 4)
- - 2. The search apparatus according to claim 1, wherein
    - the region abstract creation unit, for each of the plurality of regions, obtains at least the frequency in the region of the word included in each region, and further specifies a maximum value of the obtained frequency in the region of the word, and sets the specified maximum value as the abstract information, and
      
      the region upper limit calculation unit, for each of the plurality of regions, obtains a number of the documents composing the at least one subset of the plurality of subsets included in the region as a relationship between the information representing the at least one subset of the plurality of subsets and the plurality of regions, compares the obtained number of documents and the maximum value of the frequency for the region, and calculates the upper limit value of the frequency according to a comparison result.
  - 3. The search apparatus according to claim 1, wherein
    - the region abstract creation unit, for each of the plurality of regions, obtains at least a first bitstream indicating whether the word in the region is included in the document of each region, and sets the obtained first bitstream as the abstract information, and
      
      the region upper limit calculation unit, for each of the plurality of regions, obtains a second bitstream indicating whether the document composing the at least one subset of the plurality of subsets is included in the region as the relationship between the information representing the at least one subset of the plurality of subsets and the plurality of regions, executes an AND operation between the second bitstream obtained by the region upper limit calculation unit and the first bitstream obtained by the region abstract creation unit, and calculates the upper limit value of the frequency according to a result of the AND operation.
  - 4. The search apparatus according to claim 1, further comprising:
    - a plurality of cluster processing units;
      
      a cluster process expansion unit; and
      
      a cluster process selection unit, wherein
      
      the cluster creation unit, the region abstract creation unit, the abstract matrix storage unit, the region upper limit calculation unit, the word frequency calculation unit, and the document frequency reference unit are included in each cluster processing unit of the plurality of cluster processing units,
      
      each said cluster creation unit, included in each said cluster processing unit of the plurality of cluster processing units, performs a different clustering process from each other,
      
      when the information representing the at least one subset of the plurality of subsets is input, the cluster process expansion unit inputs the information representing the at least one subset of the plurality of subsets into each of the region upper limit calculation unit of the plurality of cluster processing units, and
      
      the cluster process selection unit receives the upper limit value of the frequency of the word for each region with the common word, which is specified by the word frequency calculation unit in each of the plurality of cluster processing units, and selects at least one of the plurality of cluster processing units according to distribution of the upper limit value of the frequency of the received each word, and makes only the document frequency reference unit of the selected cluster processing unit perform the process.
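
As a reading aid for claims 1 and 2, the pipeline can be sketched in Python. This is only an illustrative interpretation, not the patented implementation: the function names (`build_abstract`, `upper_bounds`, `top_words`), the postings-dictionary representation of the word document matrix, and the early-stopping rule are assumptions. The sketch assumes the abstract for each region is the maximum in-region frequency (claim 2), that the per-region upper limit is the smaller of that maximum and the number of queried documents falling in the region, and that word clusters are searched in descending order of their summed upper limit until no unsearched cluster can still enter the top k.

```python
import heapq


def build_abstract(postings, word_clusters, doc_clusters):
    """abstract[i][j] = largest number of documents of doc_clusters[j]
    that contain any single word of word_clusters[i]."""
    return [
        [max((len(postings[w] & docs) for w in words), default=0)
         for docs in doc_clusters]
        for words in word_clusters
    ]


def upper_bounds(abstract, doc_clusters, query_docs):
    """Per word cluster: sum over regions of min(queried docs in region, abstract)."""
    hits = [len(query_docs & docs) for docs in doc_clusters]
    return [sum(min(h, m) for h, m in zip(hits, row)) for row in abstract]


def top_words(postings, word_clusters, bounds, query_docs, k):
    """Scan word clusters by descending bound; stop once the k-th best exact
    frequency is at least the bound of the next cluster."""
    order = sorted(range(len(word_clusters)), key=lambda i: -bounds[i])
    best = []      # min-heap of (exact frequency, word)
    searched = []  # indices of word clusters actually scanned
    for i in order:
        if len(best) == k and best[0][0] >= bounds[i]:
            break
        searched.append(i)
        for w in word_clusters[i]:
            f = len(postings[w] & query_docs)
            if len(best) < k:
                heapq.heappush(best, (f, w))
            elif f > best[0][0]:
                heapq.heapreplace(best, (f, w))
    return sorted(best, key=lambda t: (-t[0], t[1])), searched


# Hypothetical toy data: word -> set of document ids containing it.
postings = {"cat": {0, 1, 2}, "dog": {0, 1}, "tax": {3, 4, 5}, "law": {4, 5}}
word_clusters = [["cat", "dog"], ["tax", "law"]]
doc_clusters = [{0, 1, 2}, {3, 4, 5}]

abstract = build_abstract(postings, word_clusters, doc_clusters)
bounds = upper_bounds(abstract, doc_clusters, {0, 1, 2})
result, searched = top_words(postings, word_clusters, bounds, {0, 1, 2}, k=2)
```

In this toy example the second word cluster has an upper limit of 0 for the queried documents, so only the first cluster is scanned for exact frequencies. The speed-up suggested by the abstract would come from skipping such clusters.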

5. A search method comprising:
- (a) creating a plurality of regions from a word document matrix specifying a co-occurrence relationship between a word set and a document set, by dividing the word set and the document set into a plurality of subsets;
  
  (b) calculating, for each of the plurality of regions, a region frequency representing a number of documents including a word in each region, creating an abstract matrix specifying each region frequency for each region, and storing the created abstract matrix into an abstract matrix storage unit;
  
  (c) when information representing at least one subset of the plurality of subsets is input, examining a relationship between the information representing the at least one subset of the plurality of subsets and the plurality of regions, referring to abstract information for each of the plurality of regions from the obtained result of the relationship, and calculating, for each of the plurality of regions, an upper limit value of the frequency of the word included in each of the plurality of regions for the at least one subset of the plurality of subsets;
  
  (d) summing the upper limit value of the frequency of the word for each region with a common word in the plurality of regions, and specifying the summed value as the upper limit value of the frequency of the word for each region with the common word; and
  
  (e) determining a region to be searched in the plurality of regions according to the upper limit value of the frequency of the word for each region with the common word, further specifying a top number of words with high frequency according to the determined region to be searched, and outputting the specified top number of words with high frequency as characteristic words to the at least one subset of the plurality of subsets.
- Dependent claims (6, 7, 8)
- - 6. The search method according to claim 5, wherein:
    - in the creating of (a), for each of the plurality of regions created, at least the frequency in the region of the word included in each region is obtained, a maximum value of the obtained frequency in the region of the word is specified, and the specified maximum value is set as the abstract information, and
      
      in the calculating of (c), for each of the plurality of regions, a number of documents is obtained that compose the at least one subset of the plurality of subsets included in the region as a relationship between the information representing the at least one subset of the plurality of subsets and the plurality of regions, the obtained number of documents is compared with the maximum value of the frequency for the region, and the upper limit value of the frequency is calculated according to a comparison result.
  - 7. The search method according to claim 5, wherein:
    - in the creating of (a), for each of the plurality of regions created, at least a first bitstream is obtained that indicates whether the word in the region is included in the document of each region, and the obtained first bitstream is set as the abstract information, and
      
      in the calculating of (c), for each of the plurality of regions, a second bitstream is obtained that indicates whether the document composing the at least one subset of the plurality of subsets is included in the region as the relationship between the information representing the at least one subset of the plurality of subsets and the plurality of regions, an AND operation is executed between the obtained second bitstream and the first bitstream obtained in the creation of (a), and the upper limit value of the frequency is calculated according to a result of the AND operation.
  - 8. The search method according to claim 5, wherein, for each of a plurality of clustering process types, (a) through (d) are executed with at least one upper limit value of the frequency of the word for each region with the common word being selected from the upper limit value of the frequency of the word for each region with the common word, and the output of (e) including the selected upper limit value.
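
The bitstream variant of claims 3 and 7 can be illustrated with Python integers used as bit sets. This is a sketch under stated assumptions rather than the claimed implementation: the helper names are hypothetical, document ids are assumed to be small non-negative integers used as bit positions, and the first bitstream is assumed to carry one bit per document, set when at least one word of the region occurs in that document.

```python
def region_bitmask(postings, words, docs_in_region):
    """First bitstream: bit d is set if any word of the region occurs in document d."""
    mask = 0
    for w in words:
        for d in postings[w] & docs_in_region:
            mask |= 1 << d
    return mask


def subset_bitmask(query_docs, docs_in_region):
    """Second bitstream: bit d is set if queried document d lies in the region."""
    mask = 0
    for d in query_docs & docs_in_region:
        mask |= 1 << d
    return mask


def upper_bound(first, second):
    """Count the bits surviving the AND: no word of the region can occur in
    more queried documents than this."""
    return bin(first & second).count("1")


# Hypothetical region: words {cat, dog} over documents {0, 1, 2, 3}.
postings = {"cat": {0, 1}, "dog": {1}}
region_docs = {0, 1, 2, 3}

first = region_bitmask(postings, ["cat", "dog"], region_docs)  # 0b0011
second = subset_bitmask({1, 2, 3}, region_docs)                # 0b1110
bound = upper_bound(first, second)                             # 1
```

For this toy region the maximum-frequency abstract of claim 6 would give a bound of min(3, 2) = 2, while the AND of the two bitstreams gives 1, which equals the true frequency of both words in the queried documents. Under these assumptions the bitstream costs more storage per region but can yield a tighter limit.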

9. A non-transitory recording medium storing a program for causing a computer to execute:
- (a) creates a plurality of regions from a word document matrix specifying a co-occurrence relationship between a word set and a document set, by dividing the word set and the document set into a plurality of subsets;
  
  (b) a process that calculates, for each of the plurality of regions, a region frequency representing a number of documents including a word in each region, creates an abstract matrix specifying each region frequency for each region, and stores the created abstract matrix into an abstract matrix storage unit;
  
  (c) a process that, when information representing the at least one subset of the plurality of subsets is input, examines a relationship between the information representing the at least one subset of the plurality of subsets and the plurality of regions, refers to abstract information for each of the plurality of regions from the obtained result of the relationship, and calculates, for each of the plurality of regions, an upper limit value of the frequency of the word included in each of the plurality of regions for the at least one subset of the plurality of subsets;
  
  (d) a process that sums the upper limit value of the frequency of the word for each region with a common word in the plurality of regions, and specifies the summed value as the upper limit value of the frequency of the word by each region with the common word; and
  
  (e) a process that determines a region to be searched in the plurality of regions according to the upper limit value of the frequency of the word for each region with the common word, further specifies a top number of words with high frequency according to the determined region to be searched, and outputs the specified top number of words with high frequency as characteristic words to the at least one subset of the plurality of subsets.
- Dependent claims (10, 11, 12)
- - 10. The non-transitory recording medium storing the program according to claim 9, wherein:
    - in the process of (a), for each of the plurality of regions created, at least the frequency in the region of the word included in each region is obtained, further a maximum value of the obtained frequency of the word is specified, and the specified maximum value is set as the abstract information, and
      
      in the process of (c), for each of the plurality of regions, a number of documents is obtained that compose the at least one subset of the plurality of subsets included in the region as a relationship between the information representing the at least one subset of the plurality of subsets and the plurality of regions, the obtained number of documents is compared with the maximum value of the frequency for the region, and the upper limit value of the frequency is calculated according to a comparison result.
  - 11. The non-transitory recording medium storing the program according to claim 10, wherein:
    - in the process of (a), for each of the plurality of regions created, at least a first bitstream is obtained that indicates whether the word in the region is included in the document of each region, and the obtained first bitstream is set as the abstract information, and
      
      in the process of (c), for each of the plurality of regions, a second bitstream is obtained that indicates whether the document composing the at least one subset of the plurality of subsets is included in the region as the relationship between the information representing the at least one subset of the plurality of subsets and the plurality of regions, an AND operation is executed between the obtained second bitstream and the first bitstream obtained in the process of (a), and the upper limit value of the frequency is calculated according to a result of the AND operation.
  - 12. The non-transitory recording medium storing the program according to claim 9, wherein, for each of a plurality of clustering process types, (a) through (d) are executed with at least one upper limit value of the frequency of the word for each region with the common word being selected from the upper limit value of the frequency of the word for each region with the common word, and the process of (e) includes the selected upper limit value.
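
Claims 4, 8, and 12 describe running several clustering process types and selecting among them according to the distribution of the upper limit values, without stating a concrete selection rule. The sketch below uses one hypothetical criterion, chosen purely for illustration: prefer the clustering whose upper limits, weighted by word-cluster size, sum to the smallest total, on the assumption that tighter limits let more word clusters be skipped. The function name and the input layout are likewise assumptions.

```python
def select_clustering(candidates):
    """candidates maps a clustering name to a list of
    (upper limit, number of words in the word cluster) pairs.
    Returns the name with the smallest size-weighted sum of upper limits.
    This criterion is an illustrative assumption, not taken from the claims."""
    def looseness(pairs):
        return sum(bound * size for bound, size in pairs)

    return min(candidates, key=lambda name: looseness(candidates[name]))


# Hypothetical upper limits computed by two clustering process types
# for the same queried document subset.
candidates = {
    "by_topic": [(3, 2), (0, 2)],  # one cluster can be skipped outright
    "random": [(2, 2), (2, 2)],    # no cluster can be ruled out early
}
chosen = select_clustering(candidates)
```

Here `chosen` is `"by_topic"`, since its weighted total (6) is lower than that of `"random"` (8). Only the exact-frequency search of the selected clustering would then be executed, in line with claim 4.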

13. A search apparatus including a central processing unit, comprising:
- a cluster creation unit that creates a plurality of regions from a word document matrix specifying a co-occurrence relationship between a word set and a document set, by dividing the word set and the document set into a plurality of subsets;
  
  a region abstract creation unit that calculates, for each of the plurality of regions, a region frequency representing a number of documents including a word in each region, creates an abstract matrix specifying each region frequency for each region, and stores the created abstract matrix into an abstract matrix storage unit;
  
  a calculator that, when the information representing the at least one subset of the plurality of subsets is input, examines a relationship between the information representing the at least one subset of the plurality of subsets and the plurality of regions, refers to abstract information for each of the plurality of regions from the obtained result of the relationship, and calculates, for each of the plurality of regions, an upper limit value of the frequency of the word included in each of the plurality of regions for the at least one subset of the plurality of subsets;
  
  a word frequency calculator that sums the upper limit value of the frequency for each region with a common word in the plurality of regions, and specifies the summed value as the upper limit value of the frequency of the word for each region with the common word; and
  
  a document frequency reference device that determines a region to be searched in the plurality of regions according to the upper limit value of the frequency of the word for each region with the common word, and further specifies a top number of words with high frequency according to the determined region to be searched, and outputs the specified top number of words with high frequency as characteristic words in the at least one subset of the plurality of subsets.

Current Assignee: NEC Corporation
Original Assignee: NEC Corporation
Inventors: Kusumura, Yukitaka
Primary Examiner(s): Perveen, Rehana
Assistant Examiner(s): Waldron, Scott A
Application Number: US13/129,342
Publication Number: US 20110219000A1
Time in Patent Office: 1,838 Days
Field of Search: 707/737, 707/750
US Class Current: 707/750
CPC Class Codes: G06F 16/3346 using probabilistic model