Techniques for comparing and clustering documents

US 8,983,963 B2
Filed: 07/07/2011
Issued: 03/17/2015
Est. Priority Date: 07/07/2011
Status: Active Grant

Abstract

Certain example embodiments relate to techniques for analyzing documents. A plurality of documents/document portions are imported into a database, with at least some of the documents/document portions being structured and at least some being unstructured. The imported documents/document portions are organized into one or more collections. A selection of at least one of the one or more collections is made. An index of words and/or groups of words is built (and optionally refined in accordance with one or more predefined rules) based on each document or document portion in each selection. A document-word matrix is built (and optionally weighted using a semantic approach), with the matrix including a value indicative of a number of times each word and/or group of words in the index appears in each document/document portion. One or more clusters of documents are generated using the document-word matrix.
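
As an illustration of the index and document-word matrix described in the abstract, the following is a minimal sketch rather than the patented implementation. The function names (`tokenize`, `build_index`, `build_matrix`), the regular-expression tokenizer, and the two sample documents are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(documents):
    """Return a sorted list of every distinct word found in the documents."""
    return sorted({word for text in documents.values() for word in tokenize(text)})


def build_matrix(documents, index):
    """Map each document id to a row of counts, one count per index word."""
    matrix = {}
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        matrix[doc_id] = [counts[word] for word in index]
    return matrix


# Hypothetical sample collection, not taken from the patent.
documents = {
    "d1": "Order process: receive order, check order.",
    "d2": "Invoice process: check invoice.",
}
index = build_index(documents)
matrix = build_matrix(documents, index)
```

Each row of `matrix` lines up with `index`, so the value at position `i` is the number of times `index[i]` appears in that document.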

18 Citations

31 Claims

1. A method for analyzing documents, the method comprising:
- importing into a database a plurality of documents and/or document portions, at least some of the documents and/or document portions being structured and at least some of the documents and/or document portions being unstructured;
  
  organizing the imported documents and/or document portions into one or more collections by splitting the documents and/or document portions into one or more sub-documents and/or sub-document portions and treating both the structured and unstructured sub-documents and/or sub-document portions as respective unstructured sub-documents and/or sub-document portions;
  
  receiving a selection of at least one of said one or more collections;
  
  building one or more indexes of words and/or groups of words based on each said document or document portion in each said selection;
  
  building a document-word matrix including a value indicative of a number of times each said word and/or group of words in the index of words and/or groups of words appears in each said document or document portion in each said selection;
  
  generating, via at least one processor, clusters of documents from the selected one or more collections using the document-word matrix, wherein each cluster includes documents having a degree of similarity above a similarity threshold and wherein at least one document is included in a plurality of clusters;
  
  receiving a user selection of documents from one generated cluster or from different generated clusters; and
  
  in response to the user selection of the documents, calculating a degree of similarity between the selected documents.
- - 2. The method of claim 1, further comprising:
    - building a weighted document-word matrix including a weighted value indicative of a number of times each said word in the index of words appears in each said document or document portion in each said selection, the weighting being performed in accordance with a semantics-based algorithm, wherein the one or more clusters of documents are generated using the weighted document-word matrix.
  - 3. The method of claim 2, wherein the semantics-based algorithm involves Latent Semantic Indexing (LSI).
  - 4. The method of claim 1, further comprising refining (a) the index of words and/or groups of words prior to the building of the document-word matrix and/or (b) document-word matrix, the refining being performed in accordance with one or more predefined rules.
  - 5. The method of claim 4, wherein the one or more predefined rules include rules defining semantic and/or non-semantic stopwords and specifying how the defined semantic and/or non-semantic stopwords are to be handled.
  - 6. The method of claim 4, wherein the one or more predefined rules include rules for standardizing transliterations or suspected transliterations.
  - 7. The method of claim 4, wherein the one or more predefined rules include rules for applying a stemming algorithm to reduce the size of the index of words and/or groups of words prior to the building of the document-word matrix and/or the document-word matrix.
  - 8. The method of claim 4, wherein a plurality of predefined rules are provided, the rules including rules (i) defining semantic and/or non-semantic stopwords and specifying how the defined semantic and/or non-semantic stopwords are to be handled, (ii) for standardizing transliterations or suspected transliterations, and (iii) for applying a stemming algorithm to reduce the size of the index of words and/or groups of words prior to the building of the document-word matrix and/or the document-word matrix.
  - 9. The method of claim 4, wherein an index is built for words and for groups of words.
  - 10. The method of claim 9, wherein the groups of words include a given word and one word immediately to the left of the given word.
  - 11. The method of claim 1, further comprising removing linkages between documents in a given cluster when threshold values indicative of degrees of similarity between two documents are not met.
  - 12. The method of claim 11, wherein the clustering is based on a cosine similarity calculation.
  - 13. The method of claim 12, wherein the threshold value is 80%.
  - 14. The method of claim 12, wherein the threshold is user-adjustable.
  - 15. The method of claim 12, further comprising:
    - determining degrees of similarity between the structured and unstructured documents and/or document portions; and
      
      ordering respective structured and unstructured documents and/or documents portions in the one or more clusters based on respective degrees of similarity.
  - 16. A non-transitory computer readable storage medium storing instructions, that when executed by at least one processor of a computer system, perform a method according to claim 1.
  - 17. The method of claim 1, wherein building the one or more indexes of words and groups of words includes building a first index for the words and a second index for the groups of words, the first index and second index being separate indexes.
  - 18. The method of claim 1, further comprising removing irrelevant sections from one or more documents in the database such that the removed sections are not used in building the index of words and/or groups of words.
  - 19. The method of claim 1, wherein each cluster is independently structured from the other clusters.
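
Claims 1 and 11 to 14 describe thresholded, cosine-based clustering in which a document may belong to several clusters. The claims do not fix a particular algorithm, so the sketch below makes its own assumptions: linkages below the threshold are dropped, and each cluster is a maximal group of documents that are all pairwise linked (found with a small Bron-Kerbosch search). The names `cosine`, `linkages`, and `clusters`, and the sample rows, are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def linkages(matrix, threshold=0.8):
    """Keep only document pairs whose similarity meets the threshold."""
    neighbours = {doc: set() for doc in matrix}
    for a, b in combinations(matrix, 2):
        if cosine(matrix[a], matrix[b]) >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)
    return neighbours


def clusters(neighbours):
    """Return maximal groups of mutually linked documents (size two or more)."""
    found = []

    def expand(chosen, candidates, excluded):
        if not candidates and not excluded:
            found.append(sorted(chosen))
            return
        for doc in list(candidates):
            expand(chosen | {doc}, candidates & neighbours[doc], excluded & neighbours[doc])
            candidates = candidates - {doc}
            excluded = excluded | {doc}

    expand(set(), set(neighbours), set())
    return sorted(group for group in found if len(group) > 1)


# Hypothetical document-word rows: "b" resembles both "a" and "c".
matrix = {"a": [3, 1], "b": [1, 1], "c": [1, 3]}
result = clusters(linkages(matrix, threshold=0.8))
```

With these sample rows, "a" and "c" are too dissimilar to be linked, so "b" ends up in two clusters. The same `cosine` function can be applied directly to any pair of rows a user selects, whether they come from one cluster or from different clusters.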

20. A method for analyzing documents, the method comprising:
- organizing a plurality of assets including structured and unstructured documents and/or document portions into a plurality of user-defined collections by splitting the documents and/or document portions into one or more sub-documents and/or sub-document portions and treating both the structured and unstructured sub-documents and/or sub-document portions as respective unstructured sub-documents and/or sub-document portions;
  
  enabling a user to select one or more of said user-defined collections for subsequent analysis;
  
  building an index of words and/or groups of words based on the documents and/or document portions in each said selected collection;
  
  building a document-word matrix including a value indicative of a number of times each said word and/or group of words in the index of words and/or groups of words appears in each said document or document portion in each said selection;
  
  refining the index of words and/or groups of words, and/or the document-word matrix based on predefined rules stored in a database of rules;
  
  weighting entries in the document-word matrix based on a semantic indexing approach;
  
  clustering together, via at least one processor, documents from the selected one or more collections in dependence on the weighted entries into a plurality of clusters, wherein each cluster includes documents having a degree of similarity above a similarity threshold and wherein at least one document is included in a plurality of clusters;
  
  receiving a user selection of documents from one generated cluster or from different generated clusters; and
  
  in response to the user selection of the documents, calculating a degree of similarity between the selected documents.
- - 21. The method of claim 20, wherein the predefined rules include rules (a) defining semantic and/or non-semantic stopwords and specifying how the defined semantic and/or non-semantic stopwords are to be handled, (b) for standardizing transliterations or suspected transliterations, and/or (c) for applying a stemming algorithm to reduce the size of (1) the index of words and/or groups of words, prior to the building of the document-word matrix and/or (2) the document-word matrix.
  - 22. The method of claim 20, further comprising removing linkages between documents in a given cluster when threshold values indicative of degrees of similarity between two documents are not met.
  - 23. A non-transitory computer readable storage medium storing instructions, that when executed by at least one processor of a computer system, perform a method according to claim 20.
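
The three kinds of refinement rule named in claim 21 (stopwords, transliteration standardization, and stemming) can be sketched as follows. This is only an illustration under stated assumptions: the rule tables `STOPWORDS`, `TRANSLITERATIONS`, and `SUFFIXES` are invented for the example, and the suffix stripper is a deliberately naive stand-in for a real stemming algorithm.

```python
# Hypothetical rule tables; a real rules database would hold user-defined entries.
STOPWORDS = {"the", "a", "of", "and"}
TRANSLITERATIONS = {"mueller": "muller", "müller": "muller"}
SUFFIXES = ("ing", "s")  # naive suffix list, not a full stemming algorithm


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def refine(index):
    """Apply transliteration, stopword, and stemming rules to a word index."""
    refined = set()
    for word in index:
        word = TRANSLITERATIONS.get(word, word)  # standardize spelling variants
        if word in STOPWORDS:                    # drop stopwords entirely
            continue
        refined.add(stem(word))                  # merge inflected forms
    return sorted(refined)


index = ["and", "invoice", "invoices", "mueller", "müller", "ordering", "orders", "the"]
refined_index = refine(index)
```

Here the eight-entry index shrinks to three entries, which is the size reduction the stemming rule in the claim is aimed at.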

24. An asset analysis system having a memory and at least one processor, comprising:
- a database configured to store a plurality of imported assets in one or more collections, the plurality of imported assets being documents and/or document portions, wherein at least some of the documents and/or document portions being structured and at least some being unstructured;
  
  an asset splitting module, executable via at least one processor, configured to split the documents and/or document portions into sub-documents and/or sub-document portions automatically and/or based on user input and store generated sub-documents and/or sub-document portions as assets in the database and configured to treat both the structured and unstructured sub-documents and/or sub-document portions as respective unstructured sub-documents and/or sub-document portions;
  
  a user interface configured to enable a user to select one or more collections of the database assets for analysis;
  
  an index builder, under control of the at least one processor, configured to access from the database assets belonging to the one or more selected collections and generate a word and/or groups of words index, the word and/or groups of words index including a listing of words appearing in the accessed assets;
  
  a rules database configured to store a plurality of user-defined rules for refining the word and/or groups of words index and/or the document-word matrix;
  
  a matrix builder, under control of the at least one processor, configured to build a document-word matrix including a value indicative of a number of times each said word and/or group of words in the index of words and/or groups of words appears in each said accessed database asset;
  
  an index refining module, under control of the at least one processor, configured to refine the word and/or groups of words index and/or the document-word matrix, based on rules stored in the rules database;
  
  a weighting engine, under control of the at least one processor, configured to weight entries in the document-word matrix based on a semantic indexing approach;
  
  a clustering engine, under control of the at least one processor, configured to cluster together documents from the selected one or more collections into a plurality of clusters in dependence on the weighted entries, wherein each cluster includes documents having a degree of similarity above a similarity threshold and at least one document is included in a plurality of clusters;
  
  the user interface being further configured to receive a user selection of documents from one generated cluster or from different generated clusters; and
  
  a calculating engine, under control of the at least one processor, configured to in response to the user selection of the documents, calculate a degree of similarity between the selected documents.
- - 25. The system of claim 24, wherein the rules in the rules database include rules (a) defining semantic and/or non-semantic stopwords and specifying how the defined semantic and/or non-semantic stopwords are to be handled, (b) for standardizing transliterations or suspected transliterations, and (c) for applying a stemming algorithm to reduce the size of the index of words and/or groups of words prior to the building of the document-word matrix and/or the document-word matrix.
  - 26. The system of claim 24, wherein an index is built for words and for groups of words.
  - 27. The system of claim 26, wherein the groups of words include a given word and one word immediately to the left of the given word.
  - 28. The system of claim 24, wherein the clustering engine is further configured to remove linkages between documents in a given cluster when threshold values indicative of degrees of similarity between two documents are not met.
  - 29. The system of claim 24, wherein the clustering is based on a cosine similarity calculation.
  - 30. The system of claim 24, wherein the threshold value is 80%.
  - 31. The method of claim 24, wherein the threshold is user-adjustable.
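
Claims 26 and 27 (like claims 9, 10, and 17) refer to an index for words and an index for groups of words, where a group is a given word plus the word immediately to its left. A minimal sketch of that idea is shown below; the function names, the tokenizer, and the sample texts are assumptions for illustration and do not come from the specification.

```python
import re


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def build_indexes(texts):
    """Return two separate sorted indexes: single words, and two-word groups."""
    words, groups = set(), set()
    for text in texts:
        tokens = tokenize(text)
        words.update(tokens)
        # zip pairs every token (from the second onward) with its left neighbour.
        groups.update(zip(tokens, tokens[1:]))
    return sorted(words), sorted(" ".join(group) for group in groups)


# Hypothetical sample texts.
word_index, group_index = build_indexes(["Check the order", "Check the invoice"])
```

Keeping the two indexes separate means either one can be refined or counted on its own before the document-word matrix is built.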

Current Assignee: Software AG
Original Assignee: Software AG
Inventors: El Mansouri, Khalid; Fittges, Klaus
Primary Examiner(s): Richardson, James E
Application Number: US13/177,849
Publication Number: US 20130013612A1
Time in Patent Office: 1,349 Days
Field of Search: 707/739
US Class Current: 707/739
CPC Class Codes:
G06F 16/1774 Locking methods, e.g. locki...
G06F 16/353 into predefined classes
