Searching databases that identifying group documents forming high-dimensional torus geometric k-means clustering, ranking, summarizing based on vector triplets

US 6,862,586 B1
Filed: 02/11/2000
Issued: 03/01/2005
Est. Priority Date: 02/11/2000
Status: Expired due to Fees

Abstract

A method and structure for performing a database search includes searching a database using a query (searching producing result items), and ranking the result items based on one or more of a frequency of an occurrence of in-links and out-links in each of the result items.
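
As a rough illustration of the ranking step described in the abstract, the Python sketch below scores each result item by its number of in-links and out-links. The function name `rank_by_link_frequency`, the decision to count only links between result items, and the plain sum of the two counts are assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter

def rank_by_link_frequency(out_links):
    """Rank result items by in-link plus out-link count (assumed scoring).

    out_links maps each result item to the list of items it links to.
    """
    results = set(out_links)
    in_count = Counter()
    out_count = {}
    for doc, targets in out_links.items():
        # Assumption: keep only links that stay inside the result set,
        # ignoring self-links and repeated links to the same target.
        kept = {t for t in targets if t in results and t != doc}
        out_count[doc] = len(kept)
        in_count.update(kept)
    # Highest combined count first; ties are broken by name for determinism.
    return sorted(results, key=lambda d: (-(in_count[d] + out_count[d]), d))

# Hypothetical result set: "x" is not a result item, so d's link to it is ignored.
hits = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["c", "x"]}
print(rank_by_link_frequency(hits))  # → ['c', 'a', 'b', 'd']
```

Here `c` ranks first because three result items link to it, even though it has no out-links of its own.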

8 Claims

1. A method of performing a database search comprising:
- searching a database using a query, said searching identifying a group of hyperlinked documents;
  
  forming a high-dimensional torus geometric representation of said hyperlinked documents, wherein each hyperlinked document is represented by a vector triplet comprising a normalized word frequency, a normalized out-link frequency and a normalized in-link frequency;
  
  clustering said result items into clusters based on said high-dimensional torus geometric representation;
  
  ranking items within each cluster of said clusters based on said high-dimensional torus geometric representation;
  
  summarizing contents of said clusters based on said high-dimensional torus geometric representation, wherein said clustering of the said vector triplets on said high-dimensional torus geometric representation is performed using a toric k-means clustering process that uses a cosine-type similarity measure between document vector triplets, thereby producing clusters of vector triplets and producing a concept triplet for each of the clusters; and
  
  summarizing said clusters of vector triplets based on nuggets of information including:
  
  identifying a closeness of said vector triplets in a cluster to said concept triplet for said cluster on said high-dimensional torus geometric representation;
  
  identifying said words with a highest normalized word frequency in said concept triplet for said cluster as the most frequent key-words for each of said clusters;
  
  identifying said out-links with a highest normalized out-link frequency in the concept triplet for the cluster as most frequent key out-links for each of said clusters;
  
  identifying said in-links with a highest normalized in-link frequency in the concept triplet for the cluster as most frequent important in-links for each cluster;
  
  identifying hypertext items relevant to the user's query by using a weighting of terms used in said query;
  
  identifying documents closest to said concept triplet as most typical documents in a cluster, using a cosine-type textual content similarity measure between document vector triplets; and
  
  identifying documents closest to said concept triplet as most typical documents in a cluster, using a cosine-type out-link similarity measure between document vector triplets; and
  
  identifying documents closest to said concept triplet as most typical documents in a cluster, using a cosine-type in-link similarity measure between document vector triplets.

2. A method of performing a database search comprising:
- searching a database using a query, said searching identifying a group of documents;
  
  forming a high-dimensional torus geometric representation of said documents, wherein each document is represented by a vector triplet comprising a normalized word frequency, a normalized out-link frequency and a normalized in-link frequency;
  
  identifying documents closest to a concept triplet as most typical documents in a cluster, using a cosine-type out-link similarity measure between document vector triplets; and
  
  identifying documents closest to said concept triplet as most typical documents in a cluster, using a cosine-type in-link similarity measure between document vector triplets.

3. The method in claim 2, further comprising:
- clustering said result items into clusters based on said high-dimensional torus geometric representation;
  
  ranking items within each cluster of said clusters based on said high-dimensional torus geometric representation; and
  
  summarizing contents of said clusters based on said high-dimensional torus geometric representation.

4. The method in claim 3, wherein said clustering comprises agglomerative clustering, hierarchical clustering, EM algorithm, or mixture modeling.

5. The method in claim 3, wherein said ranking includes identifying a most typical vector triplet in each of said clusters of vector triplets.

6. The method in claim 2, wherein said normalized out-link frequency comprises a number of said documents linked to, cited, or pointed to by said document.

7. The method in claim 2, wherein said normalized in-link frequency comprises a number of said documents linking to, citing, or pointing to said document.

8. The method in claim 2, wherein said normalized word frequency comprises a number of unique words, terms, or n-grams contained in said document.
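
The claims describe vector triplets, a cosine-type similarity, toric k-means clustering with concept triplets, and the identification of typical documents and key-words. The Python sketch below shows one plausible reading of those steps. It assumes that each of the three components is scaled to unit length separately, that the similarity is an equally weighted sum of the three per-component cosines, and that a concept triplet is the renormalized per-component sum of a cluster's members. All function names, the weights, the seeding, and the toy data are hypothetical and are not taken from the patent.

```python
import math

def unit(v):
    """Scale a vector to unit Euclidean length (zero vectors stay zero)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def triplet(words, out_links, in_links):
    """Vector triplet with each component normalized separately."""
    return (unit(words), unit(out_links), unit(in_links))

def similarity(t, u, weights=(1/3, 1/3, 1/3)):
    """Weighted sum of the three per-component cosines (weights assumed)."""
    return sum(w * sum(a * b for a, b in zip(x, y))
               for w, x, y in zip(weights, t, u))

def concept(members):
    """Concept triplet: per-component sum of the members, renormalized."""
    return tuple(unit([sum(col) for col in zip(*(m[i] for m in members))])
                 for i in range(3))

def toric_kmeans(triplets, seeds, iters=20):
    """Alternate assignment and concept updates until labels stop changing."""
    concepts, labels = list(seeds), None
    for _ in range(iters):
        new = [max(range(len(concepts)),
                   key=lambda j: similarity(t, concepts[j])) for t in triplets]
        if new == labels:
            break
        labels = new
        for j in range(len(concepts)):
            members = [t for t, l in zip(triplets, labels) if l == j]
            if members:  # assumption: an emptied cluster keeps its old concept
                concepts[j] = concept(members)
    return labels, concepts

def most_typical(triplets, labels, concepts, j, weights=(1/3, 1/3, 1/3)):
    """Indices of cluster j's members, closest to the concept triplet first."""
    idx = [i for i, l in enumerate(labels) if l == j]
    return sorted(idx, key=lambda i: -similarity(triplets[i], concepts[j], weights))

def top_features(vec, names, n=2):
    """Names of the n largest entries of one concept component."""
    return [names[i] for i in sorted(range(len(vec)), key=lambda i: -vec[i])[:n]]

# Toy data: 5 documents, a 3-word vocabulary, link vectors indexed by document.
vocab = ["torus", "cluster", "link"]
docs = [
    triplet([3, 1, 0], [0, 1, 0, 0, 0], [0, 1, 0, 0, 0]),
    triplet([2, 1, 0], [1, 0, 0, 0, 0], [1, 0, 0, 0, 1]),
    triplet([0, 1, 3], [0, 0, 0, 1, 0], [0, 0, 0, 1, 0]),
    triplet([0, 0, 2], [0, 0, 1, 0, 0], [0, 0, 1, 0, 0]),
    triplet([4, 0, 1], [0, 1, 0, 0, 0], [0, 0, 0, 0, 0]),
]
labels, concepts = toric_kmeans(docs, seeds=[docs[0], docs[2]])
print(labels)                                    # → [0, 0, 1, 1, 0]
print(most_typical(docs, labels, concepts, 0))   # → [0, 1, 4]
print(top_features(concepts[0][0], vocab))       # → ['torus', 'cluster']
```

Passing `weights=(1, 0, 0)`, `(0, 1, 0)`, or `(0, 0, 1)` to `most_typical` ranks a cluster's documents by textual content, out-link, or in-link similarity alone, which mirrors the three separate "most typical documents" steps in the claims.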

Current Assignee
International Business Machines Corporation
Original Assignee
International Business Machines Corporation
Inventors
Modha, Dharmendra Shantilal; Spangler, William Scott; Krishna, Vikas; Kreulen, Jeffrey Thomas; Strong, Hovey Raymond Jr.
Primary Examiner(s)
Channavajjala, Srirama

Application Number

US09/502,452
Time in Patent Office

1,845 Days
Field of Search

707/1-10, 707/100-104.1, 707/200-205, 707/500.1-501.1, 707/512-515, 707/529-532, 707/900-902, 707/907-908, 382/224-225, 382/228, 382/230, 382/156-160, 382/305-308, 358/403, 706/15, 706/47-50, 345/440, 704/9-10
US Class Current

1/1
CPC Class Codes

G06F 16/94   Hypermedia Hyperlinking G06...

Y10S 707/99933   Query processing, i.e. sear...

Y10S 707/99937   Sorting

Y10S 707/99943   Generating database or data...
