Document retrieval system and document retrieval method
First Claim
1. A document retrieval system, comprising:
a document database for storing data for a plurality of documents;
an arithmetic unit that;
includes a numeric value data reading unit configured to read, from the data on the documents stored in the document database, numeric value data for which numeric value intervals are to be generated;
calculates indices used for indexing numeric values and texts in each of the documents stored in the document database, each of the indices used for indexing the text being a group of a term constituting the text and a frequency of the term in the document, each of the indices used for indexing the numeric value being a group of a label describing a feature represented by the numeric value, an interval including the numeric value, and a frequency of the numeric value in the document;
receives a designation of a document as a retrieval input; and
computes a similarity between the designated document and each of the documents stored in the document database by use of the indices; and
a numeric value distribution percentage designating unit configured to designate the numeric value intervals from distribution percentages of numeric values based on a distribution of the numeric value data, wherein
the arithmetic unit includes a numeric value range designating unit configured to generate the numeric value intervals based on distribution percentages inputted by the numeric value distribution percentage designating unit; and
the numeric value distribution percentage designating means generates numeric value intervals whose numeric value widths are so adjusted that each of the numeric value intervals includes an equal number of numeric values.
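The final limitation of the claim, intervals whose widths are adjusted so that each holds an equal number of numeric values, can be illustrated with a minimal sketch. The function name `equal_frequency_intervals`, the rounding rule, and the choice to report each interval as the closed range of the values it contains are assumptions made for illustration; the claim does not prescribe them.

```python
from typing import List, Sequence, Tuple


def equal_frequency_intervals(
    values: Sequence[float], percentages: Sequence[float]
) -> List[Tuple[float, float]]:
    """Cut the sorted values into intervals that each hold the given share.

    Hypothetical helper: `percentages` plays the role of the designated
    distribution percentages and must sum to 100.
    """
    if abs(sum(percentages) - 100.0) > 1e-9:
        raise ValueError("percentages must sum to 100")

    ordered = sorted(values)
    n = len(ordered)
    intervals: List[Tuple[float, float]] = []
    start = 0
    cumulative = 0.0
    for share in percentages:
        cumulative += share
        # Position in the sorted list where this interval ends (exclusive).
        end = round(n * cumulative / 100.0)
        if end > start:
            # Assumption: an interval is reported as [lowest, highest] member.
            intervals.append((ordered[start], ordered[end - 1]))
        start = end
    return intervals


# Equal percentages give equal counts per interval but unequal widths.
print(equal_frequency_intervals([1, 2, 3, 4, 10, 20, 30, 100], [25, 25, 25, 25]))
```

With four equal percentages and eight values, every interval receives two values while the widths grow from 1 to 70, which is the behavior the claim describes.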
Abstract
Document retrieval is performed taking into account similarities between documents in their numeric data. To this end, a set E of intervals is generated such that each element of a set D of numeric values representing a feature A falls within one of the intervals. Each numeric value in each document is indexed by assigning 1 to an interval that includes an element x of the set D and 0 to an interval that does not. Each document data item that includes numeric values is indexed by indexing its text part with term frequencies and its numeric-value part with the numeric value indexing scheme described above. Using the indices thus created for each document data item, similarities between the document data are calculated with a vector space model or a probability model, and the document data are presented in order of similarity.
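The indexing and ranking steps in the abstract can be sketched as follows, assuming the vector space model with cosine similarity. The names `index_document`, `cosine`, and `rank`, the `("term", ...)` key prefix, and the sample intervals are hypothetical; the abstract equally allows a probability model, which this sketch does not cover.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Interval = Tuple[float, float]


def index_document(
    terms: Sequence[str],
    numerics: Sequence[Tuple[str, float]],
    intervals: Dict[str, List[Interval]],
) -> Counter:
    """Build one sparse index: term frequencies plus (label, interval) counts."""
    index: Counter = Counter()
    for term in terms:
        index[("term", term)] += 1
    for label, value in numerics:
        # Only the interval that contains the value gets a count;
        # every other interval implicitly stays at 0.
        for low, high in intervals.get(label, []):
            if low <= value <= high:
                index[(label, (low, high))] += 1
                break
    return index


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse index vectors."""
    dot = sum(weight * b.get(key, 0) for key, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: Counter, corpus: Dict[str, Counter]) -> List[str]:
    """Return document names in descending order of similarity to the query."""
    return sorted(corpus, key=lambda name: cosine(query, corpus[name]), reverse=True)


# Hypothetical data: one numeric feature with two intervals.
intervals = {"price": [(0, 100), (101, 1000)]}
corpus = {
    "a": index_document(["laptop", "fast"], [("price", 90)], intervals),
    "b": index_document(["desk", "slow"], [("price", 80)], intervals),
    "c": index_document(["laptop", "fast"], [("price", 500)], intervals),
}
query = index_document(["laptop", "fast"], [("price", 95)], intervals)
print(rank(query, corpus))
```

Because numeric values are mapped to intervals before comparison, prices of 90 and 95 count as a match even though the raw numbers differ, which is how numeric similarity enters the ranking.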
7 Citations
13 Claims
1. A document retrieval system, comprising:
a document database for storing data for a plurality of documents;
an arithmetic unit that;
includes a numeric value data reading unit configured to read, from the data on the documents stored in the document database, numeric value data for which numeric value intervals are to be generated;
calculates indices used for indexing numeric values and texts in each of the documents stored in the document database, each of the indices used for indexing the text being a group of a term constituting the text and a frequency of the term in the document, each of the indices used for indexing the numeric value being a group of a label describing a feature represented by the numeric value, an interval including the numeric value, and a frequency of the numeric value in the document;
receives a designation of a document as a retrieval input; and
computes a similarity between the designated document and each of the documents stored in the document database by use of the indices; and
a numeric value distribution percentage designating unit configured to designate the numeric value intervals from distribution percentages of numeric values based on a distribution of the numeric value data, wherein
the arithmetic unit includes a numeric value range designating unit configured to generate the numeric value intervals based on distribution percentages inputted by the numeric value distribution percentage designating unit; and
the numeric value distribution percentage designating means generates numeric value intervals whose numeric value widths are so adjusted that each of the numeric value intervals includes an equal number of numeric values.
Dependent claims: 2, 3, 4, 5
6. A document retrieval method comprising:
receiving a designation of a document as a retrieval input;
reading from data on documents stored in the document database, numeric value data for which numeric value intervals are to be generated;
calculating a similarity between the document designated as the retrieval input and each of the documents stored in the document database by use of indices of the designated document and indices of each document stored in the document database, the indices used for indexing numeric values and texts in a corresponding document, each of the indices used for indexing the text being a group of a term constituting the text and a frequency of the term in the corresponding document, each of the indices used for indexing the numeric value being a group of a label describing a feature represented by the numeric value, an interval including the numeric value, and a frequency of the numeric value in the corresponding document;
presenting the documents stored in the document database in order of similarity;
deciding the numeric value intervals from distribution percentages of numeric values based on a distribution of the numeric value data; and
generating the numeric value intervals based on distribution percentages inputted by the numeric value distribution percentage designating unit;
wherein the generating the numeric value intervals generates numeric value intervals whose numeric value widths are so adjusted that each of the numeric value intervals includes an equal number of numeric values.
Dependent claims: 7, 8, 9, 10
11. A document retrieval method comprising:
extracting a group of a feature and a numeric value from each of a plurality of document data stored in a document database, to obtain numeric value data for which numeric value intervals are to be generated;
converting the extracted numeric value into an interval in accordance with a numeric conversion table, and then indexing the extracted numeric value with a group of the feature, the interval and a frequency, the numeric conversion table being dedicated to each feature type, and used for converting a numeric value into an interval;
indexing each text in the document with a group of a term constituting the text and a frequency of the term in the document;
calculating a similarity between document data designated as a retrieval input and each of the documents stored in the document database by use of data on the document indexed as above;
presenting the document data stored in the document database in order of similarity;
deciding the numeric value intervals from distribution percentages of numeric values based on a distribution of the numeric value data;
generating the numeric value intervals based on distribution percentages inputted by the numeric value distribution percentage designating unit; and
wherein the generating the numeric value intervals generates numeric value intervals whose numeric value widths are so adjusted that each of the numeric value intervals includes an equal number of numeric values.
Dependent claims: 12, 13
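The extraction and conversion steps of claim 11 can be pictured with a short sketch. Everything concrete here is an assumption for illustration: the `feature: value` text pattern, the table contents in `CONVERSION_TABLES`, the boundary representation, and the clamping of out-of-range values to the nearest interval are not taken from the claim.

```python
import bisect
import re
from typing import Dict, List, Tuple

# Hypothetical conversion tables, one per feature type.
# Each table lists ascending interval boundaries.
CONVERSION_TABLES: Dict[str, List[float]] = {
    "weight": [0.0, 1.0, 2.5, 10.0],
    "price": [0.0, 100.0, 1000.0],
}

# Assumed surface form of a (feature, numeric value) group in the text.
PAIR = re.compile(r"(\w+)\s*[:=]\s*(-?\d+(?:\.\d+)?)")


def to_interval(feature: str, value: float) -> Tuple[float, float]:
    """Look up the interval for `value` in the table dedicated to `feature`."""
    bounds = CONVERSION_TABLES[feature]
    i = bisect.bisect_right(bounds, value) - 1
    # Assumption: values outside the table fall into the first or last interval.
    i = max(0, min(i, len(bounds) - 2))
    return (bounds[i], bounds[i + 1])


def index_numeric(text: str) -> Dict[Tuple[str, Tuple[float, float]], int]:
    """Index numeric values as (feature, interval) -> frequency."""
    index: Dict[Tuple[str, Tuple[float, float]], int] = {}
    for feature, raw in PAIR.findall(text):
        if feature not in CONVERSION_TABLES:
            continue  # no table for this feature type, so it is not indexed
        key = (feature, to_interval(feature, float(raw)))
        index[key] = index.get(key, 0) + 1
    return index


print(index_numeric("weight: 1.8 kg, price = 250, weight: 2.0"))
```

Keeping a separate table per feature type lets each feature use interval boundaries suited to its own value distribution, so a weight of 1.8 and a price of 250 are never compared on the same scale.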
Specification