Generating a data structure for information retrieval

US 8,229,900 B2
Filed: 04/03/2008
Issued: 07/24/2012
Est. Priority Date: 12/19/2002
Status: Expired due to Fees

Abstract

A computer system generates data structures for information retrieval of documents stored in a database. The computer system includes a neighborhood patch generation subsystem for defining patches of nodes having predetermined similarities in a hierarchy structure. The neighborhood patch generation subsystem includes a hierarchy generation subsystem for generating a hierarchy structure upon the document-keyword vectors and a patch definition subsystem. The computer system also comprises a cluster estimation subsystem for generating cluster data of the document-keyword vectors using the similarities of the patches.
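As an illustration of the terms used in the abstract, the following minimal sketch builds document-keyword vectors and a neighborhood patch for one document. It is not the patented method: it assumes raw term counts as vector weights, cosine similarity as the similarity measure, and a brute-force scan in place of the hierarchy structure. The names `keyword_vector`, `cosine`, `patch`, and the sample documents are hypothetical.

```python
import math
from collections import Counter

def keyword_vector(text, vocabulary):
    """Term-count vector over a fixed vocabulary (assumed weighting)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def patch(index, vectors, k):
    """Indices of the k documents most similar to document `index`.

    Brute-force ranking over all documents (an assumption standing in
    for the hierarchy structure). The document itself is included.
    """
    ranked = sorted(
        range(len(vectors)),
        key=lambda j: cosine(vectors[index], vectors[j]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical sample data.
vocabulary = ["apple", "banana", "car", "engine", "wheel"]
docs = ["apple banana apple", "apple banana", "car engine", "car engine wheel"]
vectors = [keyword_vector(d, vocabulary) for d in docs]
```

Calling `patch(0, vectors, 2)` ranks all four sample documents against the first one and keeps the two closest. A real system would replace the linear scan with an index structure to avoid comparing every pair.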

42 Citations

8 Claims

1. A computer system comprising:
- a computer processor configured to store documents in a database;
- a cluster subsystem configured to convert documents in a database into vectors;
- a construction subsystem configured to construct a hierarchical structure for the vectors by randomly assigning the vectors to nodes;
- a comparison subsystem configured to generate for each one of a plurality documents in the database a patch comprising a list of the documents in the database most similar to the respective one of a plurality of documents in the database;
- a confidence subsystem configured to generate self-confidence values for each of the generated patches such that the generated self-confidence values comprise the proportion of documents of a first one of the generated patches that are also in a second one of the generated patches, the confidence subsystem being configured to use weighted self-confidence values to compute relative self-confidence values for each of the generated patches;
- a cluster estimation subsystem configured to determine best size of a cluster of each of the generated patches, and
- a graphical subsystem for displaying the generated patches.
- Dependent Claims (2, 3, 4, 5)
  - 2. The system of claim 1, wherein said system includes a confidence determination subsystem for computing inter-patch confidence values between said patches and intra-patch confidence values, and said cluster estimation subsystem being configured to select said patches depending on said inter-patch confidence values to represent clusters of said document-keyword vectors; and wherein one of the patches is created for each of a plurality of elements in the hierarchical structure.
  - 3. The system of claim 1, wherein said cluster estimation subsystem estimates sizes of said clusters depending on said intra-patch confidence values.
  - 4. The system of claim 1, wherein said system further comprises a user query receiving subsystem for receiving said query and extracting data for information retrieval to generate a query vector, and an information retrieval subsystem for computing similarities between said document-keyword vectors and said query vector to select said document-keyword vectors.
  - 5. The system of claim 4, wherein said best size of a cluster is estimated using said vectors with respect to said user query.
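Claim 1 describes a confidence value as the proportion of documents of one patch that are also in a second patch. A minimal sketch of that proportion follows. The `self_confidence` helper, which averages the proportion over the patches of a patch's own members, is an assumption made for illustration; the claims do not state that formula, and the weighting and relative self-confidence steps are not modeled. The patch lists are hypothetical sample data.

```python
def confidence(patch_a, patch_b):
    """Proportion of documents in patch_a that also appear in patch_b."""
    if not patch_a:
        return 0.0
    members_b = set(patch_b)
    return sum(1 for doc in patch_a if doc in members_b) / len(patch_a)

def self_confidence(doc, patches, k):
    """Assumed definition: mean confidence between the size-k patch of
    `doc` and the size-k patches of each of its member documents."""
    own = patches[doc][:k]
    return sum(confidence(own, patches[m][:k]) for m in own) / len(own)

# Hypothetical patches: document id -> ids of its most similar documents,
# ordered from most to least similar.
patches = {
    0: [0, 1, 2],
    1: [1, 0, 3],
    2: [2, 0, 1],
    3: [3, 1, 4],
    4: [4, 3, 1],
}
```

With this data, `confidence(patches[0], patches[1])` counts documents 0 and 1 as shared out of the three in the first patch. Note that the proportion is not symmetric in general when the two patches have different sizes.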

6. A graphical user interface system for graphically presenting estimated clusters on a display device in response to a user input query, said graphical user interface system comprising:
- a database for storing documents;
- a computer for generating document-keyword vectors for said documents stored in said database and for estimating clusters of documents in response to said user input query; and
- a display for displaying on screen said estimated clusters together with confidence relations between said clusters and hierarchical information pertaining to cluster size.

7. A computer system comprising:
- a neighborhood patch generation subsystem configured to generate groups of nodes having similarities as determined using a search structure, said neighborhood patch generation subsystem including a subsystem configured to generate a hierarchical structure upon said document-keyword vectors;
- a patch defining subsystem configured to create patch relationships among said nodes with respect to a metric distance between nodes, wherein a size of one of the patches is based on a cost of patch boundary sharpness;
- a cluster estimation subsystem configured to generate cluster data of said document-keyword vectors using said similarities of patches; and
- a cluster defining subsystem configured to increase cluster size and reduce the number of clusters of a smallest size.
- Dependent Claims (8)
  - 8. The system of claim 7, including a computer that comprises a confidence determination subsystem for computing inter-patch confidence values between said patches and intra-patch confidence values, said cluster estimation subsystem being configured to:
    - select said patches depending on said inter-patch confidence values to represent clusters of said document-keyword vectors;
    - determine a size of a best subset of each of the patches to serve as a cluster candidate;
    - estimate sizes of said clusters depending on said intra-patch confidence values; and
    - eliminate redundant cluster candidates.
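Claim 8 recites eliminating redundant cluster candidates without stating how. One plausible approach, sketched below purely as an assumption, is a greedy pass: visit candidates from highest to lowest score and drop any candidate whose members overlap too heavily with one already kept. The function name `eliminate_redundant`, the `(score, members)` representation, and the `max_overlap` threshold are all hypothetical.

```python
def eliminate_redundant(candidates, max_overlap=0.5):
    """Greedy de-duplication of cluster candidates (assumed strategy).

    candidates: list of (score, members) pairs, where members is a list
    of document ids. A candidate is dropped when the fraction of its
    members already covered by a kept candidate exceeds max_overlap.
    """
    kept = []
    for score, members in sorted(candidates, key=lambda c: c[0], reverse=True):
        member_set = set(members)
        if not member_set:
            continue  # an empty candidate cannot represent a cluster
        redundant = any(
            len(member_set & set(kept_members)) / len(member_set) > max_overlap
            for _, kept_members in kept
        )
        if not redundant:
            kept.append((score, members))
    return kept

# Hypothetical candidates: the second shares two of its three members
# with the higher-scoring first candidate.
candidates = [(0.9, [0, 1, 2]), (0.8, [0, 1, 3]), (0.7, [4, 5])]
result = eliminate_redundant(candidates)
```

The threshold trades off coverage against duplication: a lower `max_overlap` keeps fewer, more distinct candidates.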

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Houle, Michael Edward
Primary Examiner(s): Trujillo, James
Assistant Examiner(s): VU, THONG H
Application Number: US12/062,411
Publication Number: US 20090006378A1
Time in Patent Office: 1,573 Days
Field of Search: 707/999.3, 707/736, 707/737, 707/804, 707/6, 707/748, 707/100, 707/5, 707/738, 707/101, 707/693, 707/709, 707/764
US Class Current: 707/692
CPC Class Codes:
- G06F 16/35   Clustering; Classification
- G06F 16/93   Document management systems
- Y10S 707/99933   Query processing, i.e. sear...
- Y10S 707/99943   Generating database or data...
