System and method for dynamically evaluating latent concepts in unstructured documents

US 6,978,274 B1
Filed: 08/31/2001
Issued: 12/20/2005
Est. Priority Date: 08/31/2001
Status: Active Grant

Abstract

A system and method for dynamically evaluating latent concepts in unstructured documents is disclosed. A multiplicity of concepts are extracted from a set of unstructured documents into a lexicon. The lexicon uniquely identifies each concept and a frequency of occurrence. A frequency of occurrence representation is created for the documents set. The frequency representation provides an ordered corpus of the frequencies of occurrence of each concept. A subset of concepts is selected from the frequency of occurrence representation filtered against a pre-defined threshold. A group of weighted clusters of concepts selected from the concepts subset is generated. A matrix of best fit approximations is determined for each document weighted against each group of weighted clusters of concepts.
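
The first two steps of the abstract (extracting concepts into a lexicon, then building an ordered frequency-of-occurrence representation) can be sketched as follows. This is only an illustration under stated assumptions: `extract_concepts` and `build_lexicon` are hypothetical names, and plain lowercase word tokens stand in for the concepts, which the claims describe as nouns, noun phrases, and tri-grams.

```python
import re
from collections import Counter


def extract_concepts(text):
    """Stand-in extractor (assumption): lowercase alphabetic tokens,
    not the noun / noun-phrase mining described in the claims."""
    return re.findall(r"[a-z]+", text.lower())


def build_lexicon(documents):
    """Return (lexicon, per_doc).

    lexicon maps each concept to (unique id, corpus-wide frequency);
    per_doc holds one Counter of concept frequencies per document.
    """
    per_doc = [Counter(extract_concepts(d)) for d in documents]
    corpus = Counter()
    for counts in per_doc:
        corpus.update(counts)
    # Ordered corpus: most frequent first, ties broken alphabetically.
    ordered = sorted(corpus.items(), key=lambda kv: (-kv[1], kv[0]))
    lexicon = {concept: (cid, freq) for cid, (concept, freq) in enumerate(ordered)}
    return lexicon, per_doc


docs = ["oil price rises", "oil supply falls", "price of gas rises"]
lexicon, per_doc = build_lexicon(docs)
print(lexicon["oil"])  # (0, 2)
```

The ids are assigned in frequency order here purely for convenience; the source only requires that each concept be uniquely identified.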

44 Claims

1. A computer-implemented system for analyzing unstructured documents for conceptual relationships, comprising:
- a histogram module determining a frequency of occurrences of concepts in a set of unstructured documents, each concept representing an element occurring in one or more of the unstructured documents;
  
  a selection module selecting a subset of concepts out of the frequency of occurrences, grouping one or more concepts from the concepts subset, and assigning weights to one or more clusters of concepts for each group of concepts; and
  
  a best fit module calculating a best fit approximation for each document indexed by each such group of concepts between the frequency of occurrences and the weighted cluster for each such concept grouped into the group of concepts.
- - 2. A system according to claim 1, further comprising:
    - an extraction module extracting features from each of the unstructured documents and normalizing the extracted features into the concepts.
  - 3. A system according to claim 2, further comprising:
    - a structured database storing the extracted features as uniquely identified records.
  - 4. A system according to claim 1, further comprising:
    - a visualization module visualizing the frequency of occurrences, comprising at least one of creating a histogram mapping the frequency of occurrences for each document in the unstructured documents set and creating a corpus graph mapping the frequency of occurrence for all such documents in the unstructured documents set.
  - 5. A system according to claim 1, further comprising:
    - a threshold comprising a median and edge conditions, each such concept in the concepts subset occurring within the edge conditions.
  - 6. A system according to claim 1, further comprising:
    - an inner product module determining, for each group of concepts, the best fit approximation as the inner product between the frequency of occurrences and the weighted cluster for each such concept in the group of concepts.
  - 7. A system according to claim 6, wherein the inner product $d_{cluster}$ is calculated according to the equation comprising:
    - $d_{cluster} = \sum_{i \to n} doc_{term_i} \cdot cluster_{term_i}$ where $doc_{concept}$ represents the frequency of occurrence for a given concept in the document and $cluster_{concept}$ represents the weight for a given cluster.
  - 8. A system according to claim 1, further comprising:
    - a control module iteratively re-determining the best fit approximation responsive to a change in the set of unstructured documents.
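
The inner-product best fit of claims 6 and 7 can be illustrated with a short sketch. The function names, the dictionary layout, and the example weights are assumptions made for illustration; the claims only specify the sum of per-concept frequency times cluster weight.

```python
def inner_product_fit(doc_freq, cluster_weights):
    """Sum over the cluster's concepts of (document frequency * cluster weight).

    Concepts absent from the document contribute zero.
    """
    return sum(doc_freq.get(c, 0) * w for c, w in cluster_weights.items())


def best_fit_matrix(per_doc, clusters):
    """One row per document, one column per cluster (in dict order)."""
    return [
        [inner_product_fit(doc, weights) for weights in clusters.values()]
        for doc in per_doc
    ]


# Hypothetical document frequencies and cluster weights.
doc = {"oil": 3, "price": 1, "gas": 2}
clusters = {
    "energy": {"oil": 0.5, "gas": 0.5},
    "markets": {"price": 1.0},
}
matrix = best_fit_matrix([doc], clusters)
print(matrix)  # [[2.5, 1.0]]
```

A higher score means the document's concept frequencies line up more strongly with that cluster's weights. Re-running `best_fit_matrix` after the document set changes corresponds loosely to the re-determination in claim 8.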

9. A computer-implemented method for analyzing unstructured documents for conceptual relationships, comprising:
- determining a frequency of occurrences of concepts in a set of unstructured documents, each concept representing an element occurring in one or more of the unstructured documents;
  
  selecting a subset of concepts out of the frequency of occurrences;
  
  grouping one or more concepts from the concepts subset;
  
  assigning weights to one or more clusters of concepts for each group of concepts; and
  
  calculating a best fit approximation for each document indexed by each such group of concepts between the frequency of occurrences and the weighted cluster for each such concept grouped into the group of concepts.
- - 10. A method according to claim 9, further comprising:
    - extracting features from each of the unstructured documents; and
      
      normalizing the extracted features into the concepts.
  - 11. A method according to claim 10, further comprising:
    - storing the extracted features as uniquely identified records in a structured database.
  - 12. A method according to claim 9, further comprising:
    - visualizing the frequency of occurrences, comprising at least one of:
      
      creating a histogram mapping the frequency of occurrences for each document in the unstructured documents set; and
      
      creating a corpus graph mapping the frequency of occurrence for all such documents in the unstructured documents set.
  - 13. A method according to claim 9, further comprising:
    - defining a threshold comprising a median and edge conditions, each such concept in the concepts subset occurring within the edge conditions.
  - 14. A method according to claim 9, further comprising:
    - for each group of concepts, determining the best fit approximation as the inner product between the frequency of occurrences and the weighted cluster for each such concept in the group of concepts.
  - 15. A method according to claim 14, wherein the inner product $d_{cluster}$ is calculated according to the equation comprising:
    - $d_{cluster} = \sum_{i \to n} doc_{term_i} \cdot cluster_{term_i}$ where $doc_{concept}$ represents the frequency of occurrence for a given concept in the document and $cluster_{concept}$ represents the weight for a given cluster.
  - 16. A method according to claim 9, further comprising:
    - iteratively re-determining the best fit approximation responsive to a change in the set of unstructured documents.
  - 17. A computer-readable storage medium holding code for performing the method according to claim 9, 10, 11, 12, 13, 14, 15, or 16.
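
Claim 13 describes a threshold made of a median and edge conditions, with the selected concepts falling within the edges. The claims do not say how the edges are derived, so the sketch below assumes they are fixed multiples of the median; `select_concepts`, `lower`, and `upper` are hypothetical names and parameters.

```python
from statistics import median


def select_concepts(corpus_freq, lower=0.5, upper=2.0):
    """Keep concepts whose corpus frequency lies within the edge conditions.

    Assumption: the edges are lower * median and upper * median.
    """
    m = median(corpus_freq.values())
    low_edge, high_edge = lower * m, upper * m
    return {c for c, f in corpus_freq.items() if low_edge <= f <= high_edge}


# Hypothetical corpus-wide frequencies.
corpus_freq = {"the": 100, "oil": 8, "price": 6, "gas": 5, "zeugma": 1}
subset = select_concepts(corpus_freq)
print(sorted(subset))  # ['gas', 'oil', 'price']
```

With a median of 6 the edges are 3 and 12, so the very common "the" and the very rare "zeugma" are both dropped, which matches the intuition of filtering out concepts that are too frequent or too sparse to discriminate between documents.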

18. A computer-implemented system for dynamically evaluating latent concepts in unstructured documents, comprising:
- an extraction module extracting a multiplicity of concepts from a set of unstructured documents into a lexicon uniquely identifying each concept and a frequency of occurrence;
  
  a frequency mapping module creating a frequency of occurrence representation for each documents set, the representation providing an ordered corpus of the frequencies of occurrence of each concept;
  
  a concept selection module selecting a subset of concepts from the frequency of occurrence representation filtered against a minimal set of concepts each referenced in at least two documents with no document in the corpus being unreferenced;
  
  a group generation module generating a group of weighted clusters of concepts selected from the concepts subset; and
  
  a best fit module determining a matrix of best fit approximations for each document weighted against each group of weighted clusters of concepts.
- - 19. A system according to claim 18, further comprising:
    - a histogram module creating a histogram mapping the frequency of occurrence representation for each document in the documents set.
  - 20. A system according to claim 19, further comprising:
    - a data mining module mining the multiplicity of concepts from each document as at least one of a noun, noun phrase and tri-gram.
  - 21. A system according to claim 18, further comprising:
    - a normalizing module normalizing the multiplicity of concepts into a substantially uniform lexicon.
  - 22. A system according to claim 21, wherein the substantially uniform lexicon is in third normal form.
  - 23. A system according to claim 18, further comprising:
    - a corpus mapping module creating a corpus graph mapping the frequency of occurrence representation for all documents in the documents set.
  - 24. A system according to claim 18, further comprising:
    - a threshold module defining the pre-defined threshold as a median value and a set of edge conditions and choosing those concepts falling within the edge conditions as the concepts subset.
  - 25. A system according to claim 18, further comprising:
    - a cluster module naming one or more of the concepts within the concepts subset to a cluster and assigning a weight to each concept with each such cluster.
  - 26. A system according to claim 25, further comprising:
    - a group module grouping one or more of the clusters into each such group of weighted clusters of concepts.
  - 27. A system according to claim 18, further comprising:
    - a Euclidean module calculating a Euclidean distance between the frequency of occurrence for each document and a corresponding weighted cluster.
  - 28. A system according to claim 18, further comprising:
    - an iteration module removing select documents from the documents set and iteratively reevaluating the matrix of best fit approximations based on a revised frequency of occurrence representation and concepts subset.
  - 29. A system according to claim 18, further comprising:
    - a structured database storing the lexicon, the lexicon comprising a plurality of records each uniquely identifying one such concept and an associated frequency of occurrence.
  - 30. A system according to claim 29, wherein the structured database is an SQL database.
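
Two ideas from claims 18 and 27 lend themselves to a small sketch: filtering for concepts referenced in at least two documents while checking that no document is left unreferenced, and measuring the Euclidean distance between a document's frequencies and a weighted cluster. All function names here are hypothetical, and the sketch checks coverage only; it does not attempt to find a minimal concept set.

```python
import math


def concepts_in_two_or_more(per_doc):
    """Concepts that occur in at least two documents."""
    doc_count = {}
    for counts in per_doc:
        for concept in counts:
            doc_count[concept] = doc_count.get(concept, 0) + 1
    return {c for c, n in doc_count.items() if n >= 2}


def unreferenced_documents(per_doc, concepts):
    """Indices of documents containing none of the selected concepts."""
    return [i for i, counts in enumerate(per_doc) if concepts.isdisjoint(counts)]


def euclidean_distance(doc_freq, cluster_weights):
    """Distance between frequency and weight vectors; missing entries count as 0."""
    keys = set(doc_freq) | set(cluster_weights)
    return math.sqrt(
        sum((doc_freq.get(k, 0) - cluster_weights.get(k, 0)) ** 2 for k in keys)
    )


# Hypothetical per-document concept frequencies.
per_doc = [{"oil": 2, "price": 1}, {"oil": 1, "gas": 3}, {"wind": 4}]
kept = concepts_in_two_or_more(per_doc)
print(kept)                                   # {'oil'}
print(unreferenced_documents(per_doc, kept))  # [2]
print(euclidean_distance({"oil": 4, "gas": 4}, {"oil": 1.0}))  # 5.0
```

In this example the third document is flagged as unreferenced, so the selected subset would need to be widened before it satisfies the condition stated in claim 18.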

31. A computer-implemented method for dynamically evaluating latent concepts in unstructured documents, comprising:
- extracting a multiplicity of concepts from a set of unstructured documents into a lexicon uniquely identifying each concept and a frequency of occurrence;
  
  creating a frequency of occurrence representation for each documents set, the representation providing an ordered corpus of the frequencies of occurrence of each concept;
  
  selecting a subset of concepts from the frequency of occurrence representation filtered against a minimal set of concepts each referenced in at least two documents with no document in the corpus being unreferenced;
  
  generating a group of weighted clusters of concepts selected from the concepts subset; and
  
  determining a matrix of best fit approximations for each document weighted against each group of weighted clusters of concepts.
- - 32. A method according to claim 31, further comprising:
    - creating a histogram mapping the frequency of occurrence representation for each document in the documents set.
  - 33. A method according to claim 32, further comprising:
    - mining the multiplicity of concepts from each document as at least one of a noun, noun phrase and tri-gram.
  - 34. A method according to claim 31, further comprising:
    - normalizing the multiplicity of concepts into a substantially uniform lexicon.
  - 35. A method according to claim 34, wherein the substantially uniform lexicon is in third normal form.
  - 36. A method according to claim 31, further comprising:
    - creating a corpus graph mapping the frequency of occurrence representation for all documents in the documents set.
  - 37. A method according to claim 31, further comprising:
    - defining the pre-defined threshold as a median value and a set of edge conditions; and
      
      choosing those concepts falling within the edge conditions as the concepts subset.
  - 38. A method according to claim 31, further comprising:
    - naming one or more of the concepts within the concepts subset to a cluster; and
      
      assigning a weight to each concept with each such cluster.
  - 39. A method according to claim 38, further comprising:
    - grouping one or more of the clusters into each such group of weighted clusters of concepts.
  - 40. A method according to claim 31, further comprising:
    - calculating a Euclidean distance between the frequency of occurrence for each document and a corresponding weighted cluster.
  - 41. A method according to claim 31, further comprising:
    - removing select documents from the documents set; and
      
      iteratively reevaluating the matrix of best fit approximations based on a revised frequency of occurrence representation and concepts subset.
  - 42. A method according to claim 31, further comprising:
    - storing the lexicon in a structured database, the lexicon comprising a plurality of records each uniquely identifying one such concept and an associated frequency of occurrence.
  - 43. A method according to claim 42, wherein the structured database is an SQL database.
  - 44. A computer-readable storage medium holding code for performing the method according to claim 31, 32, 33, 34, 36, 37, 38, 39, 40, 41, or 42.
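
Claims 42 and 43 store the lexicon in an SQL database as records that each uniquely identify a concept and its frequency. A minimal sketch using Python's built-in `sqlite3` module follows; the table name, column names, and `store_lexicon` helper are assumptions, since the claims do not specify a schema or database product.

```python
import sqlite3


def store_lexicon(conn, corpus_freq):
    """Write one uniquely identified record per concept (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lexicon ("
        " concept_id INTEGER PRIMARY KEY,"
        " concept TEXT NOT NULL UNIQUE,"
        " frequency INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO lexicon (concept, frequency) VALUES (?, ?)",
        sorted(corpus_freq.items()),
    )
    conn.commit()


# In-memory database with hypothetical frequencies.
conn = sqlite3.connect(":memory:")
store_lexicon(conn, {"oil": 8, "price": 6, "gas": 5})
rows = conn.execute(
    "SELECT concept, frequency FROM lexicon ORDER BY frequency DESC"
).fetchall()
print(rows)  # [('oil', 8), ('price', 6), ('gas', 5)]
```

The `UNIQUE` constraint on `concept` enforces the one-record-per-concept property; ordering by frequency at query time reproduces the ordered corpus described in claim 31.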

Current Assignee: Nuix North America Inc. (Nuix Ltd.)
Original Assignee: Attenex Corp. (FTI Consulting Incorporated)
Inventors: Kawai, Kenji; Gallivan, Dan
Primary Examiner(s): Corrielus, Jean M.
Assistant Examiner(s): Ly, Anh
Application Number: US09/944,474
Time in Patent Office: 1,572 Days
Field of Search: 707/1-10, 707/100-104.1, 707/200-205, 715/529, 715/530, 715/531, 704/9, 704/10, 704/1, 706/45
US Class Current: 1/1

CPC Class Codes

G06F 16/23   Updating

G06F 16/24575   using context

G06F 16/285   Clustering or classification

G06F 16/313   Selection or weighting of t...

G06F 16/35   Clustering; Classification

G06F 16/355   Class or cluster creation o...

G06F 16/93   Document management systems

G06F 16/955   using information identifie...

G06F 3/0641   De-duplication techniques

Y10S 707/99932   Access augmentation or opti...

Y10S 707/99936   Pattern matching access

Y10S 707/99943   Generating database or data...

Y10S 707/99945   Object-oriented database st...
