Techniques for comparing and clustering documents

  • US 8,983,963 B2
  • Filed: 07/07/2011
  • Issued: 03/17/2015
  • Est. Priority Date: 07/07/2011
  • Status: Active Grant
First Claim

1. A method for analyzing documents, the method comprising:

    importing into a database a plurality of documents and/or document portions, at least some of the documents and/or document portions being structured and at least some of the documents and/or document portions being unstructured;

    organizing the imported documents and/or document portions into one or more collections by splitting the documents and/or document portions into one or more sub-documents and/or sub-document portions and treating both the structured and unstructured sub-documents and/or sub-document portions as respective unstructured sub-documents and/or sub-document portions;

    receiving a selection of at least one of said one or more collections;

    building one or more indexes of words and/or groups of words based on each said document or document portion in each said selection;

    building a document-word matrix including a value indicative of a number of times each said word and/or group of words in the index of words and/or groups of words appears in each said document or document portion in each said selection;

    generating, via at least one processor, clusters of documents from the selected one or more collections using the document-word matrix, wherein each cluster includes documents having a degree of similarity above a similarity threshold and wherein at least one document is included in a plurality of clusters;

    receiving a user selection of documents from one generated cluster or from different generated clusters; and

    in response to the user selection of the documents, calculating a degree of similarity between the selected documents.
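
The steps recited in claim 1 can be illustrated with a small sketch. It is not the patented implementation and does not define the claim's scope; it is a minimal Python example under several assumptions that the claim text does not specify: lowercase alphanumeric tokenization, raw word counts as the matrix values, cosine similarity as the "degree of similarity", and a seed-based grouping rule (each document seeds a cluster of the documents similar to it) so that one document can belong to several clusters. The function names (build_index, build_matrix, overlapping_clusters, selected_similarity), the sample documents, and the threshold of 0.3 are all hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    # Assumption: words are lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    # Index of words: the sorted vocabulary across all selected documents.
    return sorted({word for doc in docs for word in tokenize(doc)})


def build_matrix(docs, index):
    # Document-word matrix: one row per document, one column per indexed word,
    # each value being the number of times the word appears in the document.
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[word] for word in index])
    return rows


def cosine(u, v):
    # Assumption: cosine similarity stands in for the "degree of similarity".
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def overlapping_clusters(matrix, threshold):
    # Assumption: each document seeds one cluster holding itself plus every
    # document whose similarity to the seed is above the threshold.
    # Similarity is measured against the seed, not between every pair.
    # Single-document and duplicate clusters are dropped.
    clusters = []
    for i, seed in enumerate(matrix):
        members = frozenset(
            j for j, row in enumerate(matrix)
            if j == i or cosine(seed, row) > threshold
        )
        if len(members) > 1 and members not in clusters:
            clusters.append(members)
    return clusters


def selected_similarity(matrix, selected):
    # Pairwise similarity for a user-selected set of document indices,
    # which may come from one cluster or from several.
    return {
        (i, j): cosine(matrix[i], matrix[j])
        for i, j in combinations(sorted(selected), 2)
    }


docs = [
    "red apple pie",       # document 0
    "apple pie recipe",    # document 1
    "recipe book review",  # document 2
    "stock market news",   # document 3
]
index = build_index(docs)
matrix = build_matrix(docs, index)
clusters = overlapping_clusters(matrix, threshold=0.3)
print([sorted(c) for c in clusters])  # [[0, 1], [0, 1, 2], [1, 2]]
print(selected_similarity(matrix, [0, 2]))
```

In this toy collection document 1 appears in all three clusters, which mirrors the claim element that at least one document is included in a plurality of clusters. A production system would more likely use a sparse matrix and a dedicated clustering method; the dense lists here only keep the example self-contained.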
