System and method for grouping similar documents

US 8,380,718 B2
Filed: 09/02/2011
Issued: 02/19/2013
Est. Priority Date: 08/31/2001
Status: Active Grant

8 Assignments

0 Petitions

Abstract

A system and method for grouping similar documents is provided. Frequencies of occurrences are determined for terms and noun phrases within a set of documents. A subset of the documents is selected by removing those documents having terms and noun phrases that fall outside a bounded range of upper and lower conditions for frequency of occurrence. Each of the documents in the subset is mapped to a cluster of documents based on a similarity of the documents to the cluster documents.
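
The three steps in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: whitespace tokens stand in for terms and noun phrases, a document is dropped when any of its term counts falls outside the bounds (one possible reading of the claim language), and similarity is a simple count of shared terms. All function names, bounds, and sample data are hypothetical.

```python
from collections import Counter


def term_frequencies(text):
    """Count occurrences of each lowercase whitespace-separated token."""
    return Counter(text.lower().split())


def select_subset(docs, lower, upper):
    """Keep only documents whose term counts all lie within [lower, upper]."""
    kept = {}
    for name, text in docs.items():
        freqs = term_frequencies(text)
        if all(lower <= count <= upper for count in freqs.values()):
            kept[name] = freqs
    return kept


def map_to_clusters(subset, clusters):
    """Assign each document to the cluster sharing the most terms with it.

    Shared-term count is an assumed stand-in for the similarity measure.
    """
    return {
        name: max(clusters, key=lambda c: len(freqs.keys() & clusters[c]))
        for name, freqs in subset.items()
    }


# Hypothetical sample data.
docs = {
    "a": "cat sat mat",
    "b": "cat cat cat cat dog",  # "cat" occurs 4 times, above the upper bound
    "c": "dog ran park",
}
clusters = {"felines": {"cat", "mat"}, "canines": {"dog", "park"}}

subset = select_subset(docs, lower=1, upper=3)
assignment = map_to_clusters(subset, clusters)
```

Here document "b" is filtered out before mapping, so only "a" and "c" receive a cluster assignment.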

20 Claims

1. A system for grouping similar documents, comprising:
- a frequency determination module to determine frequencies of occurrences for terms and noun phrases within a set of documents;
- a threshold module to select a subset of the documents by removing those documents having terms and noun phrases that fall outside a bounded range of upper and lower conditions for frequency of occurrence;
- a mapping module to map each of the documents in the subset to a cluster of documents based on a similarity of the documents in the subset to the cluster documents; and
- a processor to execute the modules.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10):
  - 2. A system according to claim 1, further comprising:
    - a theme generation module to generate themes for the document subset from the terms and phrases that fall within the bounded range.
  - 3. A system according to claim 1, further comprising:
    - database records for each term and noun phrase, the database records each comprising an identifier, string, and frequency.
  - 4. A system according to claim 1, further comprising:
    - a frequency table comprising the terms and noun phrases, and their respective frequency of occurrence within each document.
  - 5. A system according to claim 1, further comprising:
    - a histogram comprising an x-axis defining the individual terms and noun phrases for each document and a y-axis defining the frequencies of occurrence for each term and noun phrase.
  - 6. A system according to claim 1, further comprising:
    - a corpus graph of the frequencies of occurrence for the terms and phrases over all the documents to determine a number of the documents including each of the terms and noun phrases.
  - 7. A system according to claim 1, further comprising:
    - a similarity module to measure the similarity as an inner product, which is represented as a distance.
  - 8. A system according to claim 7, further comprising:
    - a matrix of the mapped clusters and documents in the subset.
  - 9. A system according to claim 8, wherein the matrix includes the inner product between each mapped document in the subset and each cluster.
  - 10. A system according to claim 7, wherein the inner product for each document falls within a predefined variance of other related documents to identify a set amount of similarity.
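
Claims 7 through 9 describe similarity as an inner product, expressed as a distance and collected in a matrix of documents and clusters. The sketch below is one way this could look, under assumptions not stated in the claims: documents and clusters are term-frequency dictionaries, the inner product is normalized by vector length (cosine), and distance is taken as one minus that value. The names and sample vectors are hypothetical.

```python
import math


def inner_product(u, v):
    """Sum of products over the terms the two frequency vectors share."""
    return sum(u[t] * v[t] for t in u.keys() & v.keys())


def cosine(u, v):
    """Inner product normalized by vector lengths; 0.0 if either is empty."""
    norm = math.sqrt(inner_product(u, u) * inner_product(v, v))
    return inner_product(u, v) / norm if norm else 0.0


def similarity_matrix(docs, clusters):
    """Rows are documents, columns are clusters, cells are normalized inner products."""
    return {
        d: {c: cosine(dv, cv) for c, cv in clusters.items()}
        for d, dv in docs.items()
    }


# Hypothetical term-frequency vectors.
docs = {"d1": {"cat": 2, "mat": 1}, "d2": {"dog": 1, "park": 1}}
clusters = {"k1": {"cat": 1, "mat": 1}, "k2": {"dog": 2}}

matrix = similarity_matrix(docs, clusters)

# Assumed conversion: a larger normalized inner product means a smaller distance.
distances = {d: {c: 1.0 - s for c, s in row.items()} for d, row in matrix.items()}
nearest = {d: min(row, key=row.get) for d, row in distances.items()}
```

Normalizing keeps long documents from dominating purely because of their length; an unnormalized inner product would be a simpler alternative.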

11. A method for grouping similar documents, comprising:
- determining frequencies of occurrences for terms and noun phrases within a set of documents;
- selecting a subset of the documents by removing those documents having terms and noun phrases that fall outside a bounded range of upper and lower conditions for frequency of occurrence;
- mapping each of the documents in the subset to a cluster of documents based on a similarity of the documents in the subset to the cluster documents.
- Dependent claims (12, 13, 14, 15, 16, 17, 18, 19, 20):
  - 12. A method according to claim 11, further comprising:
    - generating themes for the document subset from the terms and phrases that fall within the bounded range.
  - 13. A method according to claim 11, further comprising:
    - maintaining database records for each term and noun phrase, the database records each comprising an identifier, string, and frequency.
  - 14. A method according to claim 11, further comprising:
    - generating a frequency table comprising the terms and noun phrases, and their respective frequency of occurrence within each document.
  - 15. A method according to claim 11, further comprising:
    - generating a histogram comprising an x-axis defining the individual terms and noun phrases for each document and a y-axis defining the frequencies of occurrence for each term and noun phrase.
  - 16. A method according to claim 11, further comprising:
    - generating a corpus graph of the frequencies of occurrence for the terms and phrases over all the documents to determine a number of the documents including each of the terms and noun phrases.
  - 17. A method according to claim 11, further comprising:
    - measuring the similarity as an inner product, which is represented as a distance.
  - 18. A method according to claim 17, further comprising:
    - generating a matrix of the mapped clusters and documents in the subset.
  - 19. A method according to claim 18, wherein the matrix includes the inner product between each mapped document in the subset and each cluster.
  - 20. A method according to claim 17, wherein the inner product for each document falls within a predefined variance of other related documents to identify a set amount of similarity.
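
The bookkeeping in claims 13 through 16 (database records, a per-document frequency table, and a corpus-wide count of documents per term) could be sketched as below. This is an illustrative assumption rather than the claimed data model: tokens stand in for terms and noun phrases, the record's frequency field is taken to be the total across the corpus, and every name here is hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TermRecord:
    identifier: int
    string: str
    frequency: int  # assumed: total occurrences across all documents


def frequency_table(docs):
    """Map each document name to the per-term counts within that document."""
    return {name: Counter(text.lower().split()) for name, text in docs.items()}


def document_counts(table):
    """Number of documents that contain each term (data behind a corpus graph)."""
    counts = Counter()
    for freqs in table.values():
        counts.update(freqs.keys())  # each term counted once per document
    return counts


def term_records(table):
    """Build one record per term, with identifiers assigned in sorted order."""
    totals = Counter()
    for freqs in table.values():
        totals.update(freqs)
    return [
        TermRecord(i, term, n)
        for i, (term, n) in enumerate(sorted(totals.items()), start=1)
    ]


# Hypothetical sample data.
docs = {"d1": "cat sat cat", "d2": "cat ran"}
table = frequency_table(docs)
doc_counts = document_counts(table)
records = term_records(table)
```

The per-document rows of `table` are the values a histogram as in claim 15 would plot, with terms on the x-axis and counts on the y-axis.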

Current Assignee: Nuix North America Inc. (Nuix Ltd.)
Original Assignee: FTI Consulting Technology LLC (FTI Consulting Incorporated)
Inventors: Gallivan, Dan; Kawai, Kenji
Primary Examiner(s): Spieler, William
Application Number: US13/225,325
Publication Number: US 20110320453A1
Time in Patent Office: 536 Days
Field of Search: 707/738, 707/750, 707/777
US Class Current: 707/738
CPC Class Codes:
G06F 16/23   Updating
G06F 16/24575   using context
G06F 16/285   Clustering or classification
G06F 16/313   Selection or weighting of t...
G06F 16/35   Clustering; Classification
G06F 16/355   Class or cluster creation o...
G06F 16/93   Document management systems
G06F 16/955   using information identifie...
G06F 3/0641   De-duplication techniques
Y10S 707/99932   Access augmentation or opti...
Y10S 707/99936   Pattern matching access
Y10S 707/99943   Generating database or data...
Y10S 707/99945   Object-oriented database st...
