Information processing apparatus, method, and computer-readable recording medium for performing full text retrieval of documents

US 8,180,781 B2
Filed: 05/28/2009
Issued: 05/15/2012
Est. Priority Date: 05/28/2008
Status: Active Grant


Abstract

An information processing apparatus creates a retrieval result that displays a list of retrieval documents. Retrieval documents corresponding to a retrieval condition are classified into groups based on scores indicating degrees of relevance to the retrieval condition. A clustering process is conducted with respect to the retrieval documents in a group, for each of the groups to which the retrieval documents belong.
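
As an illustration of the grouping step, the sketch below follows the rule stated in claim 1: ranked documents are split wherever the score difference between two neighbouring documents is greater than the average rate of change of all the scores. The text here does not define how that average is computed, so the code assumes it is the mean of the differences between adjacent scores in the ranked list. The function name `group_by_score_gaps` and the sample scores are hypothetical.

```python
from typing import List


def group_by_score_gaps(scores: List[float]) -> List[List[float]]:
    """Split relevance scores into groups at larger-than-average gaps."""
    # Rank documents by score, most relevant first.
    ordered = sorted(scores, reverse=True)
    if len(ordered) < 2:
        return [ordered] if ordered else []

    # Differences between neighbouring documents in the ranked list.
    gaps = [ordered[i] - ordered[i + 1] for i in range(len(ordered) - 1)]

    # ASSUMPTION: "average rate of change of all the scores" is read as
    # the mean of the adjacent differences.
    average_gap = sum(gaps) / len(gaps)

    groups = [[ordered[0]]]
    for gap, score in zip(gaps, ordered[1:]):
        if gap > average_gap:
            groups.append([score])    # large drop: start a new group
        else:
            groups[-1].append(score)  # small drop: stay in the same group
    return groups


# Hypothetical scores: the gaps are 2, 3, 30, 2, 38 and their mean is 15.
example_groups = group_by_score_gaps([95, 93, 90, 60, 58, 20])
print(example_groups)  # [[95, 93, 90], [60, 58], [20]]
```

Under this reading, a result list with uniformly spaced scores stays in a single group, because no gap exceeds the mean gap.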

6 Claims

1. An information processing apparatus for creating a retrieval result displaying a list of retrieval documents, comprising:
- a computer memory that stores a feature word file database configured to register, for each of a plurality of stored documents, document identification identifying the document, feature words extracted from full text data of the document, and weight values indicating weights of the feature words in which the feature words and the weight values are corresponded to the document identification;
  
  a computer processor;
  
  a document retrieval part, executable by the computer processor, configured to retrieve the retrieval documents, from among the plurality of stored documents, corresponding to a retrieval condition by conducting a full text retrieval of documents;
  
  a document scoring part, executable by the computer processor, configured to order the retrieval documents by scores indicating degrees of relevance to the retrieval condition;
  
  a document grouping part, executable by the computer processor, configured to group the retrieval documents into a plurality of groups based on an average rate of change of all the scores such that the groups are divided at a point where a difference in respective scores between two retrieval documents is greater than the average rate of change of all the scores; and
  
  a document clustering part, executable by the computer processor, configured to conduct a clustering process with respect to the retrieval documents based on the feature words and the weight values of the feature words acquired from the feature word file database, by using the document identifications of the retrieval documents as keys, wherein the average rate of change of the scores indicates a clustering accuracy, and the document clustering part conducts the clustering process with respect to the retrieval documents in a group, for each of the plurality of groups to which the retrieval documents are grouped by the document grouping part;
  
  wherein the feature words are extracted based on first values indicating appearance frequencies of words obtained from the full text data and second values indicating appearance frequencies of morpheme occurrences obtained when a morphological analysis is conducted, wherein a morpheme is a smallest semantically meaningful unit in a language, and the morphological analysis analyzing behavior and combination of morphemes.
- Dependent claims (2, 3, 4)
  - 2. The information processing apparatus as claimed in claim 1, wherein the document clustering part expresses the feature words by vectors, and conducts the clustering process with respect to the retrieval documents based on values of cosines of angles formed by the vectors.
  - 3. The information processing apparatus as claimed in claim 1, wherein the feature words are extracted based on the first values indicating the appearance frequencies of words obtained from the full text data and third values indicating appearance frequencies of words obtained from a corpus containing a large and structured set of texts.
  - 4. The information processing apparatus as claimed in claim 1, further comprising a cluster merge part configured to express the feature words of the retrieval document in each of clusters in which the clustering process is conducted by the clustering process part by vectors, and merge clusters in which the retrieval documents are closer than a predetermined threshold in distance.
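
Claims 2 and 4 describe clustering on cosines of angles between feature-word vectors and merging clusters that are closer than a predetermined threshold. The following is a minimal sketch of one way this could look, not the patented implementation. It assumes each document is a mapping from feature word to weight value, uses a greedy single-pass assignment against cluster centroids, and treats `1 - cosine` as the distance for merging. All names (`cosine`, `centroid`, `cluster_group`, `merge_clusters`), the thresholds, and the sample data are hypothetical.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]  # feature word -> weight value


def cosine(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two sparse feature-word vectors."""
    dot = sum(w * b.get(word, 0.0) for word, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors: List[Vector]) -> Vector:
    """Average of a non-empty list of sparse vectors."""
    total: Vector = {}
    for vec in vectors:
        for word, w in vec.items():
            total[word] = total.get(word, 0.0) + w
    return {word: w / len(vectors) for word, w in total.items()}


def cluster_group(docs: Dict[str, Vector], threshold: float = 0.5) -> List[List[str]]:
    """Greedy clustering of one score group, keyed by document id."""
    clusters: List[List[str]] = []
    for doc_id, vec in docs.items():
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, centroid([docs[d] for d in cluster]))
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([doc_id])  # nothing similar enough: new cluster
        else:
            best.append(doc_id)
    return clusters


def merge_clusters(clusters: List[List[str]], docs: Dict[str, Vector],
                   max_distance: float = 0.3) -> List[List[str]]:
    """Merge clusters whose centroids are closer than max_distance.

    ASSUMPTION: distance is taken to be 1 - cosine similarity.
    """
    merged: List[List[str]] = []
    for cluster in clusters:
        c = centroid([docs[d] for d in cluster])
        for target in merged:
            t = centroid([docs[d] for d in target])
            if 1.0 - cosine(c, t) < max_distance:
                target.extend(cluster)
                break
        else:
            merged.append(list(cluster))
    return merged


# Hypothetical feature words and weights for three documents in one group.
docs = {
    "d1": {"printer": 0.9, "toner": 0.4},
    "d2": {"printer": 0.8, "toner": 0.5},
    "d3": {"scanner": 0.7, "ocr": 0.6},
}
print(cluster_group(docs))                            # [['d1', 'd2'], ['d3']]
print(merge_clusters([["d1"], ["d2"], ["d3"]], docs))  # [['d1', 'd2'], ['d3']]
```

In the arrangement the claims describe, a routine like `cluster_group` would be run once per score group rather than over the whole result list.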

5. A full text retrieval method in an information processing apparatus for creating a retrieval result displaying a list of retrieval documents, the method executable by a computer processor and comprising steps of:
- retrieving, by a document retrieval part, the retrieval documents, from among a plurality of stored documents, corresponding to a retrieval condition by conducting a full text retrieval of documents;
  
  ordering, by a document scoring part, the retrieval documents by scores indicating degrees of relevance to the retrieval condition;
  
  grouping, by a document grouping part, the retrieval documents based on the scores into a plurality of groups based on an average rate of change of all the scores such that the groups are divided at a point where a difference in scores between two retrieval documents is greater than the average rate of change of all the scores, wherein the average rate of change of the scores indicates a clustering accuracy; and
  
  conducting a clustering process, by a document clustering part, with respect to the retrieval documents based on feature words of the documents extracted from full text data of the documents and weight values indicating weights of the feature words, wherein the document clustering part conducts the clustering process with respect to the retrieval documents in a group, for each of the groups to which the retrieval documents are grouped by the document grouping part;
  
  wherein the feature words are extracted based on first values indicating appearance frequencies of words obtained from the full text data and second values indicating appearance frequencies of morpheme occurrences obtained when a morphological analysis is conducted, wherein a morpheme is a smallest semantically meaningful unit in a language and the morphological analysis analyzing behavior and combination of morphemes.

6. A non-transitory computer-readable recording medium recorded thereon a computer program for causing an information processing apparatus to perform a full text retrieval method for creating a retrieval result displaying a list of retrieval documents, the method comprising:
- retrieving, by a document retrieval part, the retrieval documents, from among a plurality of stored documents, corresponding to a retrieval condition by conducting a full text retrieval of documents;
  
  ordering, by a document scoring part, the retrieval documents by scores indicating degrees of relevance to the retrieval condition;
  
  grouping, by a document grouping part, the retrieval documents into a plurality of groups based on an average rate of change of all the scores such that the groups are divided at a point where a difference in respective scores between two retrieval documents is greater than the average rate of change of all the scores; and
  
  conducting a clustering process, by a document clustering part, with respect to the retrieval documents based on feature words of the documents extracted from full text data of the documents and weight values indicating weights of the feature words, wherein the average rate of change of the scores indicates a clustering accuracy, and the document clustering part conducts the clustering process with respect to the retrieval documents in a group, for each of the plurality of groups to which the retrieval documents are grouped by the document grouping part;
  
  wherein the feature words are extracted based on first values indicating appearance frequencies of words obtained from the full text data and second values indicating appearance frequencies of morpheme occurrences obtained when a morphological analysis is conducted, wherein a morpheme is a smallest semantically meaningful unit in a language, and the morphological analysis analyzing behavior and combination of morphemes.

Current Assignee: Ricoh Company Limited
Original Assignee: Ricoh Company Limited
Inventors: Hiraoka, Takuya
Primary Examiner(s): Hoang, Son T
Application Number: US12/473,616
Publication Number: US 20090300007A1
Time in Patent Office: 1,083 Days
Field of Search: 707/748, 707/758
US Class Current: 707/748
CPC Class Codes: G06F 16/93 Document management systems
