Search apparatus, search method, and non-transitory computer readable medium storing program that input a query representing a subset of a document set stored to a document database and output a keyword that often appears in the subset
Abstract
Provided are a search apparatus, a search method, and a program that can improve search speed over a document set even when the set to be searched is large-scale. In an embodiment, the search apparatus includes an abstract matrix storage unit, a word frequency calculation unit, and a document frequency reference unit.
13 Claims
1. A search apparatus including a central processing unit, comprising:
a cluster creation unit that creates a plurality of regions from a word document matrix specifying a co-occurrence relationship between a word set and a document set, by dividing the word set and the document set into a plurality of subsets;
a region abstract creation unit that calculates, for each of the plurality of regions, a region frequency representing a number of documents including a word in each region, creates an abstract matrix specifying each region frequency for each region, and stores the created abstract matrix into an abstract matrix storage unit;
a region upper limit calculation unit that, when information representing at least one subset of the plurality of subsets is input, examines a relationship between the information representing the at least one subset of the plurality of subsets and the plurality of regions, refers to abstract information for each of the plurality of regions from the obtained result of the relationship, and calculates, for each of the plurality of regions, an upper limit value of the frequency of the word included in each of the plurality of regions for the at least one subset of the plurality of subsets;
a word frequency calculation unit that sums the upper limit value of the frequency of the word for each region with a common word in the plurality of regions, and specifies the summed value as the upper limit value of the frequency of the word for each region with the common word; and
a document frequency reference unit that determines a region to be searched in the plurality of regions according to the upper limit value of the frequency of the word for each region with the common word, and further specifies a top number of words with high frequency according to the determined region to be searched, and outputs the specified top number of words with high frequency as characteristic words in the at least one subset of the plurality of subsets.
Dependent claims: 2, 3, 4
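One way to read the upper limit recited in claim 1 is written out below. This is only an illustrative interpretation: the symbols W_i (a word subset), D_j (a document subset), R_ij (the region they span), Q (the queried document subset), a_ij and U_i(Q) are assumed notation, and taking the region frequency as a maximum over the words of the region is an assumption, since the claim does not fix that detail.

```latex
% Assumed notation, not taken from the claims:
%   R_{ij} = W_i \times D_j   region spanned by word subset W_i and document subset D_j
%   Q                         queried subset of the document set
% Region frequency stored in the abstract matrix (assumed to be a maximum):
a_{ij} = \max_{w \in W_i} \bigl|\{\, d \in D_j : w \in d \,\}\bigr|
% For any word w in W_i, its document frequency inside Q is bounded region by region,
% and the per-region bounds are summed over the regions that share the word:
f_Q(w) = \sum_{j} \bigl|\{\, d \in Q \cap D_j : w \in d \,\}\bigr|
\;\le\; \sum_{j} \min\bigl(a_{ij},\; |Q \cap D_j|\bigr) \;=\; U_i(Q)
```

Under this reading, a word subset whose summed bound U_i(Q) is already below the frequency of the current top words cannot contribute a characteristic word, so its regions need not be searched.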
5. A search method comprising:
(a) creating a plurality of regions from a word document matrix specifying a co-occurrence relationship between a word set and a document set, by dividing the word set and the document set into a plurality of subsets;
(b) calculating, for each of the plurality of regions, a region frequency representing a number of documents including a word in each region, creating an abstract matrix specifying each region frequency for each region, and storing the created abstract matrix into an abstract matrix storage unit;
(c) when information representing at least one subset of the plurality of subsets is input, examining a relationship between the information representing the at least one subset of the plurality of subsets and the plurality of regions, referring to abstract information for each of the plurality of regions from the obtained result of the relationship, and calculating, for each of the plurality of regions, an upper limit value of the frequency of the word included in each of the plurality of regions for the at least one subset of the plurality of subsets;
(d) summing the upper limit value of the frequency of the word for each region with a common word in the plurality of regions, and specifying the summed value as the upper limit value of the frequency of the word for each region with the common word; and
(e) determining a region to be searched in the plurality of regions according to the upper limit value of the frequency of the word for each region with the common word, further specifying a top number of words with high frequency according to the determined region to be searched, and outputting the specified top number of words with high frequency as characteristic words to the at least one subset of the plurality of subsets.
Dependent claims: 6, 7, 8
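Steps (a) to (e) of the method can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the partition of words and documents from step (a) is taken as given, the region frequency is assumed to be the maximum per-word document count inside a region, the query is assumed to be an arbitrary set of document ids, and every name (`build_abstract`, `upper_limits`, `top_words`, the toy data) is hypothetical.

```python
def build_abstract(docs, word_clusters, doc_clusters):
    """Step (b): abstract[i][j] = max, over words in word cluster i, of the
    number of documents in document cluster j containing that word.
    (Using the maximum is an assumption made for this sketch.)"""
    return [
        [max((sum(1 for d in dc if w in docs[d]) for w in wc), default=0)
         for dc in doc_clusters]
        for wc in word_clusters
    ]


def upper_limits(abstract, doc_clusters, query):
    """Steps (c) and (d): bound each region by min(region frequency, overlap
    with the query), then sum the bounds over regions sharing a word cluster."""
    overlap = [len(query & set(dc)) for dc in doc_clusters]
    return [sum(min(a, o) for a, o in zip(row, overlap)) for row in abstract]


def top_words(docs, word_clusters, doc_clusters, abstract, query, k):
    """Step (e): scan word clusters in order of decreasing upper limit and stop
    once the k-th best exact frequency is at least the next cluster's bound."""
    bounds = upper_limits(abstract, doc_clusters, query)
    order = sorted(range(len(word_clusters)), key=lambda i: -bounds[i])
    exact, scanned = {}, []
    for i in order:
        best = sorted(exact.values(), reverse=True)
        if len(best) >= k and best[k - 1] >= bounds[i]:
            break  # no remaining cluster can beat the current top k
        scanned.append(i)
        for w in word_clusters[i]:
            exact[w] = sum(1 for d in query if w in docs[d])
    ranked = sorted(exact.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return ranked, scanned


# Toy data (hypothetical): six documents, a fixed partition standing in for step (a).
docs = {
    0: {"cat", "dog"}, 1: {"cat", "fish"}, 2: {"cat", "dog", "bird"},
    3: {"stock", "bond"}, 4: {"stock", "fund"}, 5: {"stock", "bond", "cat"},
}
word_clusters = [["cat", "dog", "fish", "bird"], ["stock", "bond", "fund"]]
doc_clusters = [[0, 1, 2], [3, 4, 5]]

abstract = build_abstract(docs, word_clusters, doc_clusters)  # [[3, 1], [0, 3]]
query = {0, 1, 2}
ranked, scanned = top_words(docs, word_clusters, doc_clusters, abstract, query, k=2)
print(ranked)   # [('cat', 3), ('dog', 2)]
print(scanned)  # [0]: the second word cluster was pruned by its bound of 0
```

In this toy run only the first word cluster is examined exactly; the second is skipped because its summed upper limit for the query is 0, which is the kind of saving the bound is meant to provide on large document sets.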
9. A non-transitory recording medium storing a program for causing a computer to execute:
(a) a process that creates a plurality of regions from a word document matrix specifying a co-occurrence relationship between a word set and a document set, by dividing the word set and the document set into a plurality of subsets;
(b) a process that calculates, for each of the plurality of regions, a region frequency representing a number of documents including a word in each region, creates an abstract matrix specifying each region frequency for each region, and stores the created abstract matrix into an abstract matrix storage unit;
(c) a process that, when information representing the at least one subset of the plurality of subsets is input, examines a relationship between the information representing the at least one subset of the plurality of subsets and the plurality of regions, refers to abstract information for each of the plurality of regions from the obtained result of the relationship, and calculates, for each of the plurality of regions, an upper limit value of the frequency of the word included in each of the plurality of regions for the at least one subset of the plurality of subsets;
(d) a process that sums the upper limit value of the frequency of the word for each region with a common word in the plurality of regions, and specifies the summed value as the upper limit value of the frequency of the word by each region with the common word; and
(e) a process that determines a region to be searched in the plurality of regions according to the upper limit value of the frequency of the word for each region with the common word, further specifies a top number of words with high frequency according to the determined region to be searched, and outputs the specified top number of words with high frequency as characteristic words to the at least one subset of the plurality of subsets.
Dependent claims: 10, 11, 12
13. A search apparatus including a central processing unit, comprising:
a cluster creation unit that creates a plurality of regions from a word document matrix specifying a co-occurrence relationship between a word set and a document set, by dividing the word set and the document set into a plurality of subsets;
a region abstract creation unit that calculates, for each of the plurality of regions, a region frequency representing a number of documents including a word in each region, creates an abstract matrix specifying each region frequency for each region, and stores the created abstract matrix into an abstract matrix storage unit;
a calculator that, when the information representing the at least one subset of the plurality of subsets is input, examines a relationship between the information representing the at least one subset of the plurality of subsets and the plurality of regions, refers to abstract information for each of the plurality of regions from the obtained result of the relationship, and calculates, for each of the plurality of regions, an upper limit value of the frequency of the word included in each of the plurality of regions for the at least one subset of the plurality of subsets;
a word frequency calculator that sums the upper limit value of the frequency for each region with a common word in the plurality of regions, and specifies the summed value as the upper limit value of the frequency of the word for each region with the common word; and
a document frequency reference device that determines a region to be searched in the plurality of regions according to the upper limit value of the frequency of the word for each region with the common word, and further specifies a top number of words with high frequency according to the determined region to be searched, and outputs the specified top number of words with high frequency as characteristic words in the at least one subset of the plurality of subsets.
Specification