System And Method For Scoring Concepts In A Document Set

US 20100049708A1
Filed: 10/26/2009
Published: 02/25/2010
Est. Priority Date: 07/25/2003
Status: Active Grant

First Claim

1. A system for scoring concepts in a document set, comprising:

a database to maintain a set of documents;

a concept identification module to identify concepts comprising two or more terms extracted from the document set and to designate each document having one or more of the concepts as a candidate seed document;

a scoring module to calculate a score for each of the concepts identified within each candidate seed document based on a frequency of occurrence, concept weight, structural weight, and corpus weight;

a vector module to form a vector for each candidate seed document comprising the concepts located in that candidate seed document and the associated concept scores;

a document comparison module to compare the vector for each candidate seed document with a center of one or more clusters each comprising thematically-related documents and to select at least one of the candidate seed documents that is sufficiently distinct from the other candidate seed documents as a seed document for a new cluster; and

a clustering module to place each of the unselected candidate seed documents into one of the clusters having a most similar cluster center.
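
As a rough illustration of the scoring and vector modules in this claim, the sketch below combines the four named factors into one score per concept and stores the scores as a sparse vector. The plain sum, the `ConceptStats` fields, and the function names are assumptions made for illustration; the claim only requires that the score be based on these factors.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConceptStats:
    """Hypothetical per-document inputs for one concept."""
    frequency: int            # occurrences of the concept in the document
    concept_weight: float     # specificity of meaning
    structural_weight: float  # significance of where the concept appears
    corpus_weight: float      # inverse weighting of occurrence counts


def score_concept(stats: ConceptStats) -> float:
    # Assumption: a plain sum of the four factors. The claim says only
    # that the score is "based on" them, so other functions are possible.
    return (stats.frequency + stats.concept_weight
            + stats.structural_weight + stats.corpus_weight)


def build_vector(doc_concepts: Dict[str, ConceptStats]) -> Dict[str, float]:
    """Map each concept found in a document to its score (a sparse vector)."""
    return {concept: score_concept(s) for concept, s in doc_concepts.items()}


vector = build_vector({"data mining": ConceptStats(3, 0.5, 1.0, 0.25)})
```

Keying the vector by concept keeps it sparse: a document stores scores only for the concepts it actually contains.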

Abstract

A system and method for scoring concepts in a document set is provided. Concepts including two or more terms extracted from the document set are identified. Each document having one or more of the concepts is designated as a candidate seed document. A score is calculated for each of the concepts identified within each candidate seed document based on a frequency of occurrence, concept weight, structural weight, and corpus weight. A vector is formed for each candidate seed document. The vector is compared with a center of one or more clusters each comprising thematically-related documents. At least one of the candidate seed documents that is sufficiently distinct from the other candidate seed documents is selected as a seed document for a new cluster. Each of the unselected candidate seed documents is placed into one of the clusters having a most similar cluster center.
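
The seed selection and placement steps described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the patented procedure: it unit-normalizes the vectors, uses each seed's own vector as its cluster center, and treats a candidate as "sufficiently distinct" when its inner product with every existing center is below a hypothetical `distinct_below` threshold.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]  # concept -> score


def normalize(v: Vector) -> Vector:
    """Scale a sparse vector to unit length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in v.values()))
    return {k: x / norm for k, x in v.items()} if norm else dict(v)


def inner(a: Vector, b: Vector) -> float:
    """Inner product of two sparse vectors."""
    return sum(w * b.get(k, 0.0) for k, w in a.items())


def cluster(docs: Dict[str, Vector],
            distinct_below: float = 0.25) -> Dict[str, List[str]]:
    vectors = {doc_id: normalize(v) for doc_id, v in docs.items()}

    # Pass 1: a candidate that is dissimilar to every existing center
    # becomes the seed (and, by assumption, the center) of a new cluster.
    centers: Dict[str, Vector] = {}
    for doc_id, vec in vectors.items():
        if all(inner(vec, c) < distinct_below for c in centers.values()):
            centers[doc_id] = vec

    # Pass 2: every unselected candidate joins the most similar center.
    clusters = {seed: [seed] for seed in centers}
    for doc_id, vec in vectors.items():
        if doc_id in centers:
            continue
        best = max(centers, key=lambda s: inner(vec, centers[s]))
        clusters[best].append(doc_id)
    return clusters


result = cluster({
    "a": {"x": 1.0},
    "b": {"y": 1.0},
    "c": {"x": 0.9, "y": 0.1},
})
```

In this example, documents `a` and `b` share no concepts and each seed a cluster, while `c` is close to `a` and is placed in its cluster.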

20 Claims

1. A system for scoring concepts in a document set, comprising:
- a database to maintain a set of documents;
  
  a concept identification module to identify concepts comprising two or more terms extracted from the document set and to designate each document having one or more of the concepts as a candidate seed document;
  
  a scoring module to calculate a score for each of the concepts identified within each candidate seed document based on a frequency of occurrence, concept weight, structural weight, and corpus weight;
  
  a vector module to form a vector for each candidate seed document comprising the concepts located in that candidate seed document and the associated concept scores;
  
  a document comparison module to compare the vector for each candidate seed document with a center of one or more clusters each comprising thematically-related documents and to select at least one of the candidate seed documents that is sufficiently distinct from the other candidate seed documents as a seed document for a new cluster; and
  
  a clustering module to place each of the unselected candidate seed documents into one of the clusters having a most similar cluster center.
  - 2. A system according to claim 1, further comprising:
    - a similarity module to determine a similarity between each candidate seed document and each of the one or more cluster centers based on the comparison.
  - 3. A system according to claim 2, wherein the similarity is determined as an inner product of the candidate seed document and the cluster center.
  - 4. A system according to claim 1, further comprising:
    - a preprocessing module to convert each document into a document record and to preprocess the document records to obtain the extracted terms.
  - 5. A system according to claim 1, wherein the score is determined as a function of summation of the frequency of occurrence, concept weight, structural weight, and corpus weight.
  - 6. A system according to claim 1, wherein the scoring module determines one or more of the frequency of occurrence as a count of occurrences for at least one of the concepts within one of the documents, the concept weight as reflecting a specificity of meaning for the at least one concept within the document, the structural weight reflecting a degree of significance based on structural location within the document for the at least one concept, and the corpus weight by inversely weighing a count of occurrences for the at least one concept within the document.
  - 7. A system according to claim 1, wherein the clustering module applies a minimum fit criterion to the placement of the unselected candidate seed documents.
  - 8. A system according to claim 1, further comprising:
    - a document relocation module to apply a threshold to each of the clusters, to select those documents within at least one of the clusters that fall outside of the threshold as outlier documents, and to relocate the outlier documents.
  - 9. A system according to claim 8, wherein each outlier document is placed into the cluster having a best fit based on measures of similarity between that outlier document and that cluster.
  - 10. A system according to claim 1, further comprising:
    - a score compression module to compress the concept scores.
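
The outlier relocation of claims 8 and 9 could look roughly like the sketch below. It assumes that "outside of the threshold" means an inner-product similarity to the document's own cluster center below a minimum value, and that "best fit" means the highest inner product with any center; the claims fix neither detail, and all names here are hypothetical.

```python
from typing import Dict, List

Vector = Dict[str, float]  # concept -> score


def inner(a: Vector, b: Vector) -> float:
    """Inner product of two sparse vectors (the similarity named in claim 3)."""
    return sum(w * b.get(k, 0.0) for k, w in a.items())


def relocate_outliers(clusters: Dict[str, List[str]],
                      centers: Dict[str, Vector],
                      vectors: Dict[str, Vector],
                      threshold: float) -> Dict[str, List[str]]:
    kept: Dict[str, List[str]] = {cid: [] for cid in clusters}
    outliers: List[str] = []

    # Split each cluster into documents that fit and outliers that do not.
    for cid, members in clusters.items():
        for doc_id in members:
            if inner(vectors[doc_id], centers[cid]) < threshold:
                outliers.append(doc_id)
            else:
                kept[cid].append(doc_id)

    # Move each outlier to the cluster whose center it matches best.
    for doc_id in outliers:
        best = max(centers, key=lambda c: inner(vectors[doc_id], centers[c]))
        kept[best].append(doc_id)
    return kept


centers = {"c1": {"x": 1.0}, "c2": {"y": 1.0}}
vectors = {"d1": {"x": 0.9}, "d2": {"x": 0.1, "y": 0.8}}
relocated = relocate_outliers({"c1": ["d1", "d2"], "c2": []},
                              centers, vectors, threshold=0.5)
```

Here `d2` scores only 0.1 against `c1`, so it is flagged as an outlier and moved to `c2`, where it scores 0.8.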

11. A method for scoring concepts in a document set, comprising:
- maintaining a set of documents;
  
  identifying concepts comprising two or more terms extracted from the document set and designating each document having one or more of the concepts as a candidate seed document;
  
  determining a score for each of the concepts identified within each candidate seed document based on a frequency of occurrence, concept weight, structural weight, and corpus weight;
  
  forming a vector for each candidate seed document comprising the concepts located in that candidate seed document and the associated concept scores;
  
  comparing the vector for each candidate seed document with a center of one or more clusters each comprising thematically-related documents and selecting at least one of the candidate seed documents that is sufficiently distinct from the other candidate seed documents as a seed document for a new cluster; and
  
  placing each of the unselected candidate seed documents into one of the clusters having a most similar cluster center.
  - 12. A method according to claim 11, further comprising:
    - determining a similarity between each candidate seed document and each of the one or more cluster centers based on the comparison.
  - 13. A method according to claim 12, wherein the similarity is determined as an inner product of the candidate seed document and the cluster center.
  - 14. A method according to claim 11, further comprising:
    - converting each document into a document record; and
      
      preprocessing the document records to obtain the extracted terms.
  - 15. A method according to claim 11, wherein the score is determined as a function of summation of the frequency of occurrence, concept weight, structural weight, and corpus weight.
  - 16. A method according to claim 11, further comprising one or more of:
    - determining the frequency of occurrence as a count of occurrences for at least one of the concepts within one of the documents;
      
      determining the concept weight as reflecting a specificity of meaning for the at least one concept within the document;
      
      determining the structural weight reflecting a degree of significance based on structural location within the document for the at least one concept; and
      
      determining the corpus weight by inversely weighing a count of occurrences for the at least one concept within the document.
  - 17. A method according to claim 11, further comprising:
    - applying a minimum fit criterion to the placement of the unselected candidate seed documents.
  - 18. A method according to claim 11, further comprising:
    - applying a threshold to each of the clusters and selecting those documents within at least one of the clusters that fall outside of the threshold as outlier documents; and
      
      relocating the outlier documents.
  - 19. A method according to claim 18, wherein each outlier document is placed into the cluster having a best fit based on measures of similarity between that outlier document and that cluster.
  - 20. A method according to claim 11, further comprising:
    - compressing the concept scores.
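
Claims 10 and 20 do not say how the concept scores are compressed. One common way to narrow a wide range of scores is logarithmic scaling, sketched below purely as an assumption; the function name and the choice of `log1p` are illustrative and are not taken from the claims.

```python
import math
from typing import Dict


def compress_scores(scores: Dict[str, float]) -> Dict[str, float]:
    """Compress non-negative concept scores with log(1 + score).

    Assumption: logarithmic compression. It keeps a score of 0 at 0,
    preserves ordering, and flattens very large scores.
    """
    return {concept: math.log1p(score) for concept, score in scores.items()}


compressed = compress_scores({"data mining": 0.0, "text clustering": 99.0})
```

Because the logarithm is monotonic, the ranking of concepts within a document is unchanged; only the spread between high and low scores shrinks.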

Current Assignee: Nuix North America Inc. (Nuix Ltd.)
Original Assignee: Vanderbilt Minerals, LLC
Inventors: Kenji Kawai; Lynne Marie Evans
Granted Patent: US 8,626,761 B2
CPC Class Codes

G06F 16/35   Clustering; Classification

G06F 16/353   into predefined classes

G06F 16/355   Class or cluster creation o...

G06F 16/36   Creation of semantic tools,...

Y10S 707/99934   Query formulation, input pr...

Y10S 707/99937   Sorting
