TECHNIQUES FOR COMPARING AND CLUSTERING DOCUMENTS
Abstract
Certain example embodiments relate to techniques for analyzing documents. A plurality of documents/document portions are imported into a database, with at least some of the documents/document portions being structured and at least some being unstructured. The imported documents/document portions are organized into one or more collections. A selection of at least one of the one or more collections is made. An index of words and/or groups of words is built (and optionally refined in accordance with one or more predefined rules) based on each document or document portion in each selection. A document-word matrix is built (and optionally weighted using a semantic approach), with the matrix including a value indicative of the number of times each word and/or group of words in the index appears in each document/document portion. One or more clusters of documents are generated using the document-word matrix.
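The pipeline summarized in the abstract (index, document-word matrix, clusters) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `build_index`, `build_matrix`, `cosine`, and `cluster` are hypothetical, the tokenizer is a plain regular expression, and the clustering step is single-link grouping on cosine similarity with an arbitrary threshold. The abstract does not name a particular clustering algorithm, so this is one possible reading rather than the patented method.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9]+")  # assumed tokenizer: lowercase alphanumeric runs

def build_index(docs):
    """Return the sorted vocabulary of words found across all documents."""
    return sorted({w for text in docs for w in TOKEN.findall(text.lower())})

def build_matrix(docs, index):
    """One row per document, one column per index word; entries are raw counts."""
    rows = []
    for text in docs:
        counts = Counter(TOKEN.findall(text.lower()))
        rows.append([counts[w] for w in index])
    return rows

def cosine(u, v):
    """Cosine similarity of two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster(matrix, threshold=0.5):
    """Single-link grouping: join documents whose similarity meets the threshold."""
    parent = list(range(len(matrix)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(matrix)):
        for j in range(i + 1, len(matrix)):
            if cosine(matrix[i], matrix[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(matrix)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# A hypothetical selected collection of three short documents.
collection = [
    "invoice total amount due",
    "invoice amount due date",
    "server restart log entry",
]
index = build_index(collection)
matrix = build_matrix(collection, index)
clusters = cluster(matrix)
```

In this toy collection the two invoice documents share three of their four words, so they fall into one cluster, while the server log shares none and stays alone.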
28 Claims
1. A method for analyzing documents, the method comprising:
importing into a database a plurality of documents and/or document portions, at least some of the documents and/or document portions being structured and at least some of the documents and/or document portions being unstructured;
organizing the imported documents and/or document portions into one or more collections;
receiving a selection of at least one of said one or more collections;
building an index of words and/or groups of words based on each said document or document portion in each said selection;
building a document-word matrix including a value indicative of a number of times each said word and/or group of words in the index of words and/or groups of words appears in each said document or document portion in each said selection; and
generating, via at least one processor, one or more clusters of documents using the document-word matrix.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 19
16. A method for analyzing documents, the method comprising:
organizing a plurality of assets including structured and unstructured documents and/or document portions into a plurality of user-defined collections;
enabling a user to select one or more of said user-defined collections for subsequent analysis;
building an index of words and/or groups of words based on the documents and document portions in each said selected collection;
building a document-word matrix including a value indicative of a number of times each said word and/or group of words in the index of words and/or groups of words appears in each said document or document portion in each said selection;
refining the index of words and/or groups of words, and/or the document-word matrix based on predefined rules stored in a database of rules;
weighting entries in the document-word matrix based on a semantic indexing approach; and
clustering together, via at least one processor, documents in dependence on the weighted entries.
Dependent claims: 17, 18, 20
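The refining and weighting steps recited in claim 16 might look like the following sketch. The claim does not say what the predefined rules contain or which weighting formula the semantic indexing approach uses, so both are assumptions here: the rules are modeled as a stop-word set plus a minimum document frequency, and the weighting is ordinary TF-IDF substituted as a stand-in. The names `refine` and `tfidf` are hypothetical.

```python
import math

def refine(index, matrix, stop_words=frozenset(), min_doc_freq=1):
    """Drop index words (and their matrix columns) excluded by the assumed rules."""
    keep = [
        j for j, word in enumerate(index)
        if word not in stop_words
        and sum(1 for row in matrix if row[j] > 0) >= min_doc_freq
    ]
    refined_index = [index[j] for j in keep]
    refined_matrix = [[row[j] for j in keep] for row in matrix]
    return refined_index, refined_matrix

def tfidf(matrix):
    """Scale each count by log(N / document frequency). TF-IDF is an assumed stand-in."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / doc_freq[j]) if doc_freq[j] else 0.0
         for j in range(n_terms)]
        for row in matrix
    ]

# Hypothetical index and count matrix for three documents.
index = ["due", "invoice", "log", "the"]
matrix = [
    [1, 2, 0, 3],
    [1, 1, 0, 2],
    [0, 0, 4, 1],
]
index, matrix = refine(index, matrix, stop_words={"the"})
weighted = tfidf(matrix)
```

Under this weighting, a word that appears in every document gets weight zero, and words concentrated in few documents are emphasized. That is the property a clustering step would rely on when it groups documents by their weighted entries.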
21. An asset analysis system, comprising:
a database configured to store a plurality of imported assets in one or more collections, the assets being documents and/or document portions, wherein at least some of the documents and/or document portions being structured and at least some being unstructured;
an asset splitting module, executable via at least one processor, configured to split the documents and/or document portions into sub-documents and/or sub-document portions automatically and/or based on user input and store generated sub-documents and/or sub-document portions as assets in the database;
a user interface configured to enable a user to select one or more collections of assets for analysis;
an index builder, under control of the at least one processor, configured to access from the database assets belonging to the one or more selected collections and generate a word and/or groups of words index, the word and/or groups of words index including a listing of words appearing in the accessed assets;
a rules database configured to store a plurality of user-defined rules for refining the word and/or groups of words index and/or the document-word matrix;
a matrix builder, under control of the at least one processor, configured to build a document-word matrix including a value indicative of a number of times each said word and/or group of words in the index of words and/or groups of words appears in each said accessed asset;
an index refining module, under control of the at least one processor, configured to refine the word and/or groups of words index and/or the document-word matrix, based on rules stored in the rules database;
a weighting engine, under control of the at least one processor, configured to weight entries in the document-word matrix based on a semantic indexing approach; and
a clustering engine, under control of the at least one processor, configured to cluster together documents in dependence on the weighted entries.
Dependent claims: 22, 23, 24, 25, 26, 27, 28
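The asset splitting module of claim 21 can be pictured with a small sketch. It assumes, purely for illustration, that a document is split on a delimiter pattern (blank lines by default, or a pattern supplied by the user) and that each sub-document is stored as its own asset with a link back to its parent. The `Asset` class, the `split_asset` function, the id scheme, and the in-memory `store` dictionary standing in for the database are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Asset:
    asset_id: str
    text: str
    parent_id: Optional[str] = None  # set only for generated sub-documents

def split_asset(asset: Asset, delimiter: str = r"\n\s*\n") -> List[Asset]:
    """Split an asset's text on a delimiter pattern and return sub-assets.

    The default pattern splits on blank lines (the "automatic" case);
    passing another pattern models splitting based on user input.
    """
    parts = [p.strip() for p in re.split(delimiter, asset.text) if p.strip()]
    if len(parts) < 2:
        return []  # nothing to split
    return [
        Asset(f"{asset.asset_id}.{n}", part, asset.asset_id)
        for n, part in enumerate(parts, start=1)
    ]

# In-memory stand-in for the asset database.
store: Dict[str, Asset] = {}
report = Asset(
    "report-1",
    "Summary of findings.\n\nDetailed results follow.\n\n\nAppendix.",
)
store[report.asset_id] = report
for sub in split_asset(report):
    store[sub.asset_id] = sub
```

Storing the sub-documents alongside the original, rather than replacing it, lets later steps index either granularity. Whether the claimed system retains the parent asset is not stated in the claim.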
Specification