Techniques for comparing and clustering documents
Abstract
Certain example embodiments relate to techniques for analyzing documents. A plurality of documents/document portions are imported into a database, with at least some of the documents/document portions being structured and at least some being unstructured. The imported documents/document portions are organized into one or more collections. A selection of at least one of the one or more collections is made. An index of words and/or groups of words is built (and optionally refined in accordance with one or more predefined rules) based on each document or document portion in each selection. A document-word matrix is built (and optionally weighted using a semantic approach), with the matrix including a value indicative of a number of times each word and/or group of words in the index appears in each document/document portion. One or more clusters of documents are generated using the document-word matrix.
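The index and document-word matrix described in the abstract can be illustrated with a minimal sketch. The tokenizer, the function names (`tokenize`, `build_index`, `build_matrix`), and the sample documents are assumptions made for illustration only; the abstract does not prescribe any particular implementation.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a document and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Return the sorted list of distinct words found across all documents."""
    return sorted({word for doc in docs for word in tokenize(doc)})

def build_matrix(docs, index):
    """One row per document, one column per index word, cell = occurrence count."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[word] for word in index])
    return rows

# Hypothetical sample documents, all treated as unstructured text.
docs = ["the cat sat on the mat", "the dog sat", "invoice total due"]
index = build_index(docs)
matrix = build_matrix(docs, index)
```

Each row of `matrix` lines up with `index`, so `matrix[0][index.index("the")]` holds the number of times "the" appears in the first document.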
18 Citations
31 Claims
1. A method for analyzing documents, the method comprising:
- importing into a database a plurality of documents and/or document portions, at least some of the documents and/or document portions being structured and at least some of the documents and/or document portions being unstructured;
- organizing the imported documents and/or document portions into one or more collections by splitting the documents and/or document portions into one or more sub-documents and/or sub-document portions and treating both the structured and unstructured sub-documents and/or sub-document portions as respective unstructured sub-documents and/or sub-document portions;
- receiving a selection of at least one of said one or more collections;
- building one or more indexes of words and/or groups of words based on each said document or document portion in each said selection;
- building a document-word matrix including a value indicative of a number of times each said word and/or group of words in the index of words and/or groups of words appears in each said document or document portion in each said selection;
- generating, via at least one processor, clusters of documents from the selected one or more collections using the document-word matrix, wherein each cluster includes documents having a degree of similarity above a similarity threshold and wherein at least one document is included in a plurality of clusters;
- receiving a user selection of documents from one generated cluster or from different generated clusters; and
- in response to the user selection of the documents, calculating a degree of similarity between the selected documents.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
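The clustering and similarity steps of claim 1 might be sketched as follows. The claim does not name a similarity measure or a clustering algorithm, so cosine similarity and a simple seed-based grouping are assumptions chosen for illustration, as are the function names and the toy matrix. In this sketch every member of a cluster exceeds the threshold relative to the cluster's seed document, and a document may land in several clusters.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def overlapping_clusters(matrix, threshold):
    """Seed one candidate cluster per document; a document may join several."""
    clusters = []
    for i, seed in enumerate(matrix):
        members = frozenset(
            j for j, row in enumerate(matrix)
            if j == i or cosine(seed, row) > threshold
        )
        # Skip singletons and clusters that were already produced by another seed.
        if len(members) > 1 and members not in clusters:
            clusters.append(members)
    return clusters

def selection_similarity(matrix, selected):
    """Pairwise similarity for the documents a user picked, keyed by row numbers."""
    return {
        (a, b): cosine(matrix[a], matrix[b])
        for a, b in combinations(sorted(selected), 2)
    }

# Toy document-word matrix: document 1 shares one word with each neighbour.
matrix = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]]
clusters = overlapping_clusters(matrix, threshold=0.4)
scores = selection_similarity(matrix, {0, 2})
```

Here document 1 appears in more than one cluster, and `scores` holds the similarity for the user-selected pair (documents 0 and 2), which may come from different clusters.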
20. A method for analyzing documents, the method comprising:
- organizing a plurality of assets including structured and unstructured documents and/or document portions into a plurality of user-defined collections by splitting the documents and/or document portions into one or more sub-documents and/or sub-document portions and treating both the structured and unstructured sub-documents and/or sub-document portions as respective unstructured sub-documents and/or sub-document portions;
- enabling a user to select one or more of said user-defined collections for subsequent analysis;
- building an index of words and/or groups of words based on the documents and/or document portions in each said selected collection;
- building a document-word matrix including a value indicative of a number of times each said word and/or group of words in the index of words and/or groups of words appears in each said document or document portion in each said selection;
- refining the index of words and/or groups of words, and/or the document-word matrix based on predefined rules stored in a database of rules;
- weighting entries in the document-word matrix based on a semantic indexing approach;
- clustering together, via at least one processor, documents from the selected one or more collections in dependence on the weighted entries into a plurality of clusters, wherein each cluster includes documents having a degree of similarity above a similarity threshold and wherein at least one document is included in a plurality of clusters;
- receiving a user selection of documents from one generated cluster or from different generated clusters; and
- in response to the user selection of the documents, calculating a degree of similarity between the selected documents.
Dependent claims: 21, 22, 23
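The refining and weighting steps of claim 20 could look roughly like the sketch below. The claim names neither the rules nor the weighting scheme, so two stand-ins are assumed here: a stop-word rule represents the "predefined rules", and TF-IDF weighting (often applied before latent semantic indexing) represents the "semantic indexing approach". The function names and sample data are hypothetical.

```python
import math

def refine(index, matrix, stop_words):
    """Apply an assumed stop-word rule: drop matching index words and their columns."""
    keep = [k for k, word in enumerate(index) if word not in stop_words]
    new_index = [index[k] for k in keep]
    new_matrix = [[row[k] for k in keep] for row in matrix]
    return new_index, new_matrix

def tfidf(matrix):
    """Weight raw counts by inverse document frequency (a stand-in weighting)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # Document frequency: in how many documents each index word occurs.
    df = [sum(1 for row in matrix if row[k] > 0) for k in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[row[k] * idf[k] for k in range(n_terms)] for row in matrix]

# Hypothetical index and document-word matrix for two documents.
index = ["cat", "sat", "the"]
matrix = [[1, 1, 2], [0, 1, 1]]

refined_index, refined_matrix = refine(index, matrix, stop_words={"the"})
weighted = tfidf(refined_matrix)
```

With this weighting, a word that occurs in every document ("sat" here) receives weight zero, so clustering on `weighted` is driven by the words that actually distinguish documents.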
24. An asset analysis system having a memory and at least one processor, comprising:
- a database configured to store a plurality of imported assets in one or more collections, the plurality of imported assets being documents and/or document portions, wherein at least some of the documents and/or document portions are structured and at least some are unstructured;
- an asset splitting module, executable via at least one processor, configured to split the documents and/or document portions into sub-documents and/or sub-document portions automatically and/or based on user input and store generated sub-documents and/or sub-document portions as assets in the database, and configured to treat both the structured and unstructured sub-documents and/or sub-document portions as respective unstructured sub-documents and/or sub-document portions;
- a user interface configured to enable a user to select one or more collections of the database assets for analysis;
- an index builder, under control of the at least one processor, configured to access from the database assets belonging to the one or more selected collections and generate a word and/or groups of words index, the word and/or groups of words index including a listing of words appearing in the accessed assets;
- a rules database configured to store a plurality of user-defined rules for refining the word and/or groups of words index and/or the document-word matrix;
- a matrix builder, under control of the at least one processor, configured to build a document-word matrix including a value indicative of a number of times each said word and/or group of words in the index of words and/or groups of words appears in each said accessed database asset;
- an index refining module, under control of the at least one processor, configured to refine the word and/or groups of words index and/or the document-word matrix, based on rules stored in the rules database;
- a weighting engine, under control of the at least one processor, configured to weight entries in the document-word matrix based on a semantic indexing approach;
- a clustering engine, under control of the at least one processor, configured to cluster together documents from the selected one or more collections into a plurality of clusters in dependence on the weighted entries, wherein each cluster includes documents having a degree of similarity above a similarity threshold and at least one document is included in a plurality of clusters;
- the user interface being further configured to receive a user selection of documents from one generated cluster or from different generated clusters; and
- a calculating engine, under control of the at least one processor, configured to, in response to the user selection of the documents, calculate a degree of similarity between the selected documents.
Dependent claims: 25, 26, 27, 28, 29, 30, 31
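As a rough illustration of the asset splitting module in claim 24, the sketch below flattens a structured asset into plain text and then splits any asset into sub-documents. Representing a structured asset as a dictionary of fields and splitting on blank lines are both assumptions made for this example; the claim only requires that structured and unstructured sub-documents end up being treated alike as unstructured.

```python
def flatten(asset):
    """Treat a structured asset (assumed: dict of fields) as plain unstructured text."""
    if isinstance(asset, dict):
        return "\n\n".join(f"{key}: {value}" for key, value in asset.items())
    return asset

def split_asset(asset):
    """Split an asset into sub-documents at blank lines (assumed splitting rule)."""
    text = flatten(asset)
    return [part.strip() for part in text.split("\n\n") if part.strip()]

# Hypothetical assets: one structured record and one free-text document.
structured = {"title": "Invoice 7", "body": "Total due"}
unstructured = "First paragraph.\n\nSecond paragraph."

sub_documents = split_asset(structured) + split_asset(unstructured)
```

After splitting, both kinds of asset are plain strings, so the same index builder and matrix builder can process them without caring about the original format.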
Specification