System and method for scoring concepts in a document set

US 8,626,761 B2
Filed: 10/26/2009
Issued: 01/07/2014
Est. Priority Date: 07/25/2003
Status: Active Grant

11 Assignments

0 Petitions

Abstract

A system and method for scoring concepts in a document set is provided. Concepts including two or more terms extracted from the document set are identified. Each document having one or more of the concepts is designated as a candidate seed document. A score is calculated for each of the concepts identified within each candidate seed document based on a frequency of occurrence, concept weight, structural weight, and corpus weight. A vector is formed for each candidate seed document. The vector is compared with a center of one or more clusters each comprising thematically-related documents. At least one of the candidate seed documents that is sufficiently distinct from the other candidate seed documents is selected as a seed document for a new cluster. Each of the unselected candidate seed documents is placed into one of the clusters having a most similar cluster center.
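The abstract names four per-concept values, but this record does not show how they are combined into a score. The following is a minimal sketch, assuming the score is simply the product of the four values; that combination rule and the names `ConceptOccurrence`, `concept_score`, and `document_vector` are hypothetical and not taken from the patent text shown here.

```python
from dataclasses import dataclass


@dataclass
class ConceptOccurrence:
    """Values determined for one concept within one candidate seed document."""
    frequency: int            # occurrences of the concept in the document
    concept_weight: float     # specificity of meaning of the concept
    structural_weight: float  # significance of where the concept appears
    corpus_weight: float      # inverse weighting of the reference count in the set


def concept_score(occ: ConceptOccurrence) -> float:
    # Assumption: the four values are multiplied together.
    return (occ.frequency * occ.concept_weight
            * occ.structural_weight * occ.corpus_weight)


def document_vector(occurrences: dict[str, ConceptOccurrence]) -> dict[str, float]:
    # One vector per candidate seed document: concept -> score.
    return {concept: concept_score(occ) for concept, occ in occurrences.items()}
```

A sparse dictionary is used for the vector because most documents contain only a small share of the concepts found in the whole set.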

231 Citations

16 Claims

1. A system for scoring concepts in a document set, comprising:
- a database to maintain a set of documents;
  
  a concept identification module to identify concepts comprising two or more terms extracted from the document set and to designate each document having one or more of the concepts as a candidate seed document;
  
  a value module to determine for each of the concepts identified within each candidate seed document, values for a frequency of occurrence of that concept within that candidate seed document, a concept weight reflecting a specificity of meaning for that concept within that candidate seed document, a structural weight reflecting a degree of significance based on a location of that concept within that candidate seed document, and a corpus weight inversely weighing a reference count of the occurrence for that concept within the document set according to the equation;
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. A system according to claim 1, further comprising:
    - a similarity module to determine a similarity between each candidate seed document and each of the one or more cluster centers based on the comparison.
  - 3. A system according to claim 2, wherein the similarity is determined as an inner product of the candidate seed document and the cluster center.
  - 4. A system according to claim 1, further comprising:
    - a preprocessing module to convert each document into a document record and to preprocess the document records to obtain the extracted terms.
  - 5. A system according to claim 1, wherein the clustering module applies a minimum fit criterion to the placement of the unselected candidate seed documents.
  - 6. A system according to claim 1, further comprising:
    - a document relocation module to apply a threshold to each of the clusters, to select those documents within at least one of the clusters that falls outside of the threshold as outlier documents, and to relocate the outlier documents.
  - 7. A system according to claim 6, wherein each outlier document is placed into the cluster having a best fit based on measures of similarity between that outlier document and that cluster.
  - 8. A system according to claim 1, further comprising:
    - a score compression module to compress the concept scores.
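
Claims 2, 3, and 5, read with the abstract, describe comparing each candidate seed document with the cluster centers by inner product, placing it in the most similar cluster, and starting a new cluster when it is sufficiently distinct. The following is a minimal sketch under stated assumptions: vectors are scaled to unit length, cluster centers are not updated as documents are added, and "sufficiently distinct" is tested against the existing centers using a hypothetical `min_fit` threshold. None of these details are specified in this record.

```python
import math

Vector = dict[str, float]  # concept -> score


def normalize(vec: Vector) -> Vector:
    # Scale to unit length so inner products are comparable (an assumption).
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {c: w / norm for c, w in vec.items()} if norm else dict(vec)


def inner_product(a: Vector, b: Vector) -> float:
    # Concepts missing from b contribute zero.
    return sum(w * b.get(c, 0.0) for c, w in a.items())


def place_candidates(candidates: dict[str, Vector],
                     centers: list[Vector],
                     min_fit: float) -> tuple[dict[str, int], list[str]]:
    """Return (document -> cluster index, documents chosen as new seeds)."""
    centers = [normalize(c) for c in centers]
    placements: dict[str, int] = {}
    seeds: list[str] = []
    for doc_id, raw in candidates.items():
        vec = normalize(raw)
        sims = [inner_product(vec, c) for c in centers]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is None or sims[best] < min_fit:
            # No existing cluster fits well enough: seed a new cluster.
            centers.append(vec)
            seeds.append(doc_id)
            placements[doc_id] = len(centers) - 1
        else:
            placements[doc_id] = best
    return placements, seeds
```

For example, with no initial clusters and `min_fit=0.5`, two documents dominated by the same concept end up in one cluster, while a document sharing no concepts with them seeds a second cluster.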

9. A method for scoring concepts in a document set, comprising:
- maintaining a set of documents;
  
  identifying concepts comprising two or more terms extracted from the document set and designating each document having one or more of the concepts as a candidate seed document;
  
  determining for each of the concepts identified within each candidate seed document, values for a frequency of occurrence of that concept within that candidate seed document, a concept weight reflecting a specificity of meaning for that concept within that candidate seed document, a structural weight reflecting a degree of significance based on a location of that concept within that candidate seed document, and a corpus weight inversely weighing a reference count of the occurrence for that concept within the document set according to the equation;
- Dependent claims (10, 11, 12, 13, 14, 15, 16):
  - 10. A method according to claim 9, further comprising:
    - determining a similarity between each candidate seed document and each of the one or more cluster centers based on the comparison.
  - 11. A method according to claim 10, wherein the similarity is determined as an inner product of the candidate seed document and the cluster center.
  - 12. A method according to claim 9, further comprising:
    - converting each document into a document record; and
      
      preprocessing the document records to obtain the extracted terms.
  - 13. A method according to claim 9, further comprising:
    - applying a minimum fit criterion to the placement of the unselected candidate seed documents.
  - 14. A method according to claim 9, further comprising:
    - applying a threshold to each of the clusters and selecting those documents within at least one of the clusters that falls outside of the threshold as outlier documents; and
      
      relocating the outlier documents.
  - 15. A method according to claim 14, wherein each outlier document is placed into the cluster having a best fit based on measures of similarity between that outlier document and that cluster.
  - 16. A method according to claim 9, further comprising:
    - compressing the concept scores.
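
Claims 14 and 15 (and their system counterparts, claims 6 and 7) describe applying a threshold to each cluster, treating documents that fall outside it as outliers, and relocating each outlier to the best-fitting cluster. The following is a minimal sketch, assuming the threshold is a lower bound on the inner product between a document and its current cluster center; the function name `relocate_outliers` and its parameters are hypothetical.

```python
Vector = dict[str, float]  # concept -> score


def inner_product(a: Vector, b: Vector) -> float:
    # Concepts missing from b contribute zero.
    return sum(w * b.get(c, 0.0) for c, w in a.items())


def relocate_outliers(vectors: dict[str, Vector],
                      placements: dict[str, int],
                      centers: list[Vector],
                      threshold: float) -> dict[str, int]:
    """Return updated placements with each outlier moved to its best-fit cluster."""
    relocated = dict(placements)
    for doc_id, cluster in placements.items():
        vec = vectors[doc_id]
        if inner_product(vec, centers[cluster]) < threshold:
            # Outlier: pick the cluster whose center is most similar.
            sims = [inner_product(vec, c) for c in centers]
            relocated[doc_id] = max(range(len(centers)), key=sims.__getitem__)
    return relocated
```

Documents that already meet the threshold keep their cluster, so the pass only touches the outliers.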

Current Assignee: Nuix North America Inc. (Nuix Ltd.)
Original Assignee: FTI Consulting Technology LLC (FTI Consulting Incorporated)
Inventors: Kawai, Kenji; Evans, Lynne Marie
Primary Examiner(s): Yen, Syling
Assistant Examiner(s): Harper, Eliyah Stone
Application Number: US12/606,171
Publication Number: US 20100049708A1
Time in Patent Office: 1,534 Days
Field of Search: 707/999.102, 707/999.104, 707/999.101
Current US Class: 707/736

CPC Class Codes
- G06F 16/35: Clustering; Classification
- G06F 16/353: into predefined classes
- G06F 16/355: Class or cluster creation o...
- G06F 16/36: Creation of semantic tools,...
- Y10S 707/99934: Query formulation, input pr...
- Y10S 707/99937: Sorting
