ENHANCED IDENTIFICATION OF DOCUMENT TYPES
Abstract
A method for document management includes automatically extracting respective features from each of a set of documents. The features are processed in a computer so as to generate respective vectors for the documents, each vector including elements having respective values that represent properties of a respective document. A similarity between the documents is assessed by computing a measure of distance between the respective vectors. The documents are automatically clustered responsively to the similarity so as to identify a cluster of the documents belonging to a common document type. Similar methods may be used in supervised categorization, wherein documents are compared and categorized based on a training set that is prepared for each document type.
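The unsupervised pipeline summarized above (features, then vectors, then distances, then clusters) can be sketched as follows. This is a minimal illustration only: the two features (digit fraction and uppercase fraction), the Euclidean distance, the single-linkage grouping, and the threshold value are all assumptions chosen for brevity, not details taken from the claims.

```python
import math
from itertools import combinations

def extract_vector(text):
    """Map a document to a numeric vector (hypothetical features)."""
    n = max(len(text), 1)
    digit_fraction = sum(ch.isdigit() for ch in text) / n
    upper_fraction = sum(ch.isupper() for ch in text) / n
    return [digit_fraction, upper_fraction]

def cluster(vectors, threshold):
    """Group vectors whose Euclidean distance is at most `threshold`
    (single linkage, implemented with a small union-find)."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(vectors)), 2):
        if math.dist(vectors[i], vectors[j]) <= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

docs = [
    "INVOICE 1001\nTOTAL 250.00\nDUE 2020-01-31",
    "Dear Anna,\nthank you for your letter.\nBest wishes, Tom",
    "INVOICE 1002\nTOTAL 99.50\nDUE 2020-02-15",
    "Dear Ben,\nit was good to see you.\nBest wishes, Sue",
]
vectors = [extract_vector(d) for d in docs]
clusters = cluster(vectors, threshold=0.2)
print(clusters)  # [[0, 2], [1, 3]]
```

Each resulting group holds the indices of documents that plausibly share a document type; in practice the feature set and the threshold would need tuning for the collection at hand.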
30 Claims
1. A method for document management, the method comprising:
automatically extracting respective features from each of a set of documents;
processing the features in a computer so as to generate respective vectors for the documents, each vector comprising elements having respective values that represent properties of a respective document;
assessing a similarity between the documents by computing a measure of distance between the respective vectors; and
automatically clustering the documents responsively to the similarity so as to identify a cluster of the documents belonging to a common document type.
8. A method for document management, the method comprising:
receiving respective file names of a plurality of documents;
processing each file name in a computer so as to divide the file name into a sequence of sub-tokens;
assigning respective weights to the sub-tokens;
assessing a similarity between the documents by computing a measure of distance between the respective file names based on the sub-tokens in each of the file names and on the respective weights of the sub-tokens; and
automatically clustering the documents responsively to the similarity so as to identify at least one cluster of the documents belonging to a common document type.
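One way to picture the file-name steps of claim 8 is the sketch below. The splitting rule (separators, case changes, letter/digit boundaries), the weights (numeric sub-tokens count less than alphabetic ones), and the weighted Jaccard distance are all illustrative assumptions; the claim does not prescribe them. As a simplification, the distance also treats the sub-tokens as a set and ignores their order.

```python
import re

def sub_tokens(file_name):
    """Split a file name into lowercase sub-tokens at separators,
    case changes, and letter/digit boundaries (assumed rule)."""
    pattern = r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+"
    return [t.lower() for t in re.findall(pattern, file_name)]

def weight(token):
    """Hypothetical weighting: numbers (dates, versions) matter less."""
    return 0.2 if token.isdigit() else 1.0

def file_name_distance(a, b):
    """Weighted Jaccard distance between two file names (0 = same tokens)."""
    ta, tb = set(sub_tokens(a)), set(sub_tokens(b))
    union = sum(weight(t) for t in ta | tb)
    shared = sum(weight(t) for t in ta & tb)
    return 1.0 - shared / union if union else 0.0

near = file_name_distance("Invoice_2020-01.pdf", "Invoice_2020-02.pdf")
far = file_name_distance("Invoice_2020-01.pdf", "MeetingNotes_March.docx")
print(sub_tokens("quarterlyReportQ3.docx"))
print(round(near, 3), far)
```

The pairwise distances produced this way could then be passed to any clustering step, so that names differing only in low-weight sub-tokens (such as a month number) end up in the same cluster.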
15. A method for document management, the method comprising:
automatically identifying respective embedded objects in each of a set of documents;
processing the embedded objects in a computer so as to extract respective embedded object features of the documents, wherein the embedded object features are indicative of format characteristics of the embedded objects in the documents;
assessing a similarity between the documents by computing a measure of distance between the documents based on the respective embedded object features; and
automatically clustering the documents responsively to the similarity so as to identify a cluster of the documents belonging to a common document type.
18. A method for document management, the method comprising:
automatically extracting headings from each of a set of documents;
processing the headings in a computer so as to generate respective heading features of the documents;
assessing a similarity between the documents by computing a measure of distance between the documents based on the respective heading features; and
automatically clustering the documents responsively to the similarity so as to identify a cluster of the documents belonging to a common document type.
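The heading-based comparison of claim 18 might look like the sketch below for plain-text documents. The heading heuristic (short lines that are all uppercase or start with a section number), the normalization into a set of lowercase titles, and the Jaccard distance are assumptions made for illustration; the claim leaves the heading features and the distance measure open.

```python
import re

NUMBERING = r"\d+(\.\d+)*\.?\s+"

def extract_headings(text):
    """Return lines that look like headings (assumed heuristic)."""
    headings = []
    for line in text.splitlines():
        s = line.strip()
        if s and len(s) <= 60 and (s.isupper() or re.match(NUMBERING + r"\S", s)):
            headings.append(s)
    return headings

def heading_features(text):
    """Normalize headings: drop section numbers, lowercase, deduplicate."""
    return {re.sub("^" + NUMBERING, "", h).lower() for h in extract_headings(text)}

def heading_distance(a, b):
    """Jaccard distance between the heading sets of two documents."""
    fa, fb = heading_features(a), heading_features(b)
    if not fa and not fb:
        return 0.0
    return 1.0 - len(fa & fb) / len(fa | fb)

paper1 = "ABSTRACT\nWe study X.\n1. Introduction\nText here.\n2. Methods\nMore text."
paper2 = "ABSTRACT\nWe study Y.\n1. Introduction\nOther text.\n2. Results\nNumbers."
memo = "To all staff,\nthe office is closed on Friday."

print(heading_distance(paper1, paper2))  # 0.5
print(heading_distance(paper1, memo))    # 1.0
```

Documents that share most of their section titles come out close together, which is the property a subsequent clustering step would rely on.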
26. A computer software product, comprising a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to extract respective features from each of a set of documents, to process the features so as to generate respective vectors for the documents, each vector comprising elements having respective values that represent properties of a respective document, to assess a similarity between the documents by computing a measure of distance between the respective vectors, and to cluster the documents responsively to the similarity so as to identify a cluster of the documents belonging to a common document type.
27. A computer software product, comprising a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to receive respective file names of a plurality of documents, to process each file name so as to divide the file name into a sequence of sub-tokens, to assign respective weights to the sub-tokens, to assess a similarity between the documents by computing a measure of distance between the respective file names based on the sub-tokens in each of the file names and on the respective weights of the sub-tokens, and to cluster the documents responsively to the similarity so as to identify at least one cluster of the documents belonging to a common document type.
28. A computer software product, comprising a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to identify respective embedded objects in each of a set of documents, to process the embedded objects so as to extract respective embedded object features of the documents, wherein the embedded object features are indicative of format characteristics of the embedded objects in the documents, to assess a similarity between the documents by computing a measure of distance between the documents based on the respective embedded object features, and to cluster the documents responsively to the similarity so as to identify a cluster of the documents belonging to a common document type.
29. A computer software product, comprising a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to extract headings from each of a set of the documents, to process the headings so as to generate respective heading features of the documents, to assess a similarity between the documents by computing a measure of distance between the documents based on the respective heading features, and to cluster the documents responsively to the similarity so as to identify a cluster of the documents belonging to a common document type.
30. A method for document management, the method comprising:
providing respective training sets comprising known documents belonging to each of a plurality of document types;
automatically extracting respective features from the known documents and from each of a set of new documents;
processing the features in a computer so as to generate respective vectors for the documents, each vector comprising elements having respective values that represent properties of a respective document;
assessing a similarity between the new documents and the known documents in each of the training sets by computing a measure of distance between the respective vectors; and
automatically categorizing the new documents with respect to the document types responsively to the similarity.
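The supervised variant of claim 30 can be illustrated with a short sketch, assuming the feature vectors have already been computed. The decision rule used here (assign the type whose known documents have the smallest mean Euclidean distance to the new document) and the example vectors and type names are hypothetical; the claim only requires that categorization respond to a distance-based similarity.

```python
import math

def categorize(vector, training_sets):
    """Return the document type whose training vectors are, on average,
    closest to `vector` (assumed rule: smallest mean Euclidean distance)."""
    def mean_distance(known_vectors):
        return sum(math.dist(vector, k) for k in known_vectors) / len(known_vectors)

    return min(training_sets, key=lambda doc_type: mean_distance(training_sets[doc_type]))

# Hypothetical training sets: document type -> vectors of known documents.
training_sets = {
    "invoice": [[0.42, 0.38], [0.41, 0.39]],
    "letter": [[0.00, 0.07], [0.00, 0.08]],
}

new_vectors = [[0.40, 0.35], [0.02, 0.10]]
labels = [categorize(v, training_sets) for v in new_vectors]
print(labels)  # ['invoice', 'letter']
```

A nearest-neighbor rule or a per-type distance threshold (to leave unfamiliar documents uncategorized) would fit the same claim language equally well.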
Specification