ENHANCED IDENTIFICATION OF DOCUMENT TYPES

US 20120041955A1
Filed: 08/10/2010
Published: 02/16/2012
Est. Priority Date: 08/10/2010
Status: Abandoned Application

Abstract

A method for document management includes automatically extracting respective features from each of a set of documents. The features are processed in a computer so as to generate respective vectors for the documents, each vector including elements having respective values that represent properties of a respective document. A similarity between the documents is assessed by computing a measure of distance between the respective vectors. The documents are automatically clustered responsively to the similarity so as to identify a cluster of the documents belonging to a common document type. Similar methods may be used in supervised categorization, wherein documents are compared and categorized based on a training set that is prepared for each document type.

30 Claims

1. A method for document management, the method comprising:
- automatically extracting respective features from each of a set of documents;
- processing the features in a computer so as to generate respective vectors for the documents, each vector comprising elements having respective values that represent properties of a respective document;
- assessing a similarity between the documents by computing a measure of distance between the respective vectors; and
- automatically clustering the documents responsively to the similarity so as to identify a cluster of the documents belonging to a common document type.

Dependent claims (2, 3, 4, 5, 6, 7):
- 2. The method according to claim 1, wherein processing the features comprises generating a string corresponding to the vector, and wherein the elements of the vector comprise respective characters in the string.
- 3. The method according to claim 2, wherein automatically extracting the respective features comprises parsing a hierarchical tree representation of each of the documents, and building the string to represent the tree by recursively traversing the nodes of the tree and adding the characters to the string so as to represent the traversed nodes.
- 4. The method according to claim 2, wherein generating the string comprises, when the string exceeds a predetermined length, truncating the string to the predetermined length by selecting a first sequence of the characters from a beginning of the string and concatenating it with a second sequence of the characters from an end of the string.
- 5. The method according to claim 2, wherein computing the measure of distance comprises computing a string distance between strings representing the respective vectors.
- 6. The method according to claim 1, wherein at least some of the elements of the vectors comprise symbols that represent respective ranges of values of the properties.
- 7. The method according to claim 1, wherein automatically extracting the respective features comprises identifying format features of the documents, and wherein the elements of the vectors represent respective characteristics of the format.
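
Claims 1 to 5 can be illustrated with a minimal Python sketch. It is not the applicants' implementation: the `Node` class, the `SYMBOLS` table, the even head/tail split in `truncate`, the choice of Levenshtein distance as the string distance, and the greedy threshold clustering are all assumptions made for illustration, since the claims do not fix any of them.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                                   # e.g. "section", "paragraph"
    children: list = field(default_factory=list)

# Hypothetical one-character codes per node kind (not from the source).
SYMBOLS = {"section": "S", "paragraph": "P", "table": "T", "image": "I"}

def tree_to_string(node):
    """Recursive depth-first traversal; parentheses mark child lists (claim 3)."""
    s = SYMBOLS.get(node.kind, "?")
    if node.children:
        s += "(" + "".join(tree_to_string(c) for c in node.children) + ")"
    return s

def truncate(s, max_len):
    """Keep a head and a tail of an over-long string (claim 4)."""
    if len(s) <= max_len:
        return s
    head = (max_len + 1) // 2                   # assumed split: half and half
    tail = max_len - head
    return s[:head] + (s[-tail:] if tail else "")

def edit_distance(a, b):
    """Levenshtein distance, one possible string distance (claim 5)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def normalized_distance(a, b):
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0

def cluster(strings, threshold):
    """Greedy clustering: join the first cluster whose first member is close enough."""
    clusters = []                               # each cluster is a list of indices
    for i, s in enumerate(strings):
        for members in clusters:
            if normalized_distance(s, strings[members[0]]) <= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

For instance, two documents whose trees differ in a single leaf produce strings one edit apart and fall into the same cluster at a threshold of 0.25, while a structurally unrelated string starts a cluster of its own.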

8. A method for document management, the method comprising:
- receiving respective file names of a plurality of documents;
- processing each file name in a computer so as to divide the file name into a sequence of sub-tokens;
- assigning respective weights to the sub-tokens;
- assessing a similarity between the documents by computing a measure of distance between the respective file names based on the sub-tokens in each of the file names and on the respective weights of the sub-tokens; and
- automatically clustering the documents responsively to the similarity so as to identify at least one cluster of the documents belonging to a common document type.

Dependent claims (9, 10, 11, 12, 13, 14):
- 9. The method according to claim 8, wherein processing each file name comprises separating the file name into alpha, numeric, and symbol sub-tokens.
- 10. The method according to claim 9, wherein each alpha sub-token consists of a sequence of letters, each having a respective case, such that the case does not change from lower case to upper case within the sequence.
- 11. The method according to claim 9, wherein assigning the respective weights comprises assigning a greater weight to the alpha sub-tokens than to the numeric and symbol sub-tokens.
- 12. The method according to claim 8, wherein assigning the respective weights comprises assigning a greater weight to acronyms than to other sub-tokens.
- 13. The method according to claim 8, wherein computing the measure of the distance comprises computing a weighted sum of sub-token distances between the sub-tokens of a first document and corresponding sub-tokens of a second document, wherein the sub-token distances are weighted by the respective weights of the sub-tokens.
- 14. The method according to claim 13, wherein computing the weighted sum comprises aligning each of the sub-tokens of the first document with a first corresponding sub-token of the second document in a forward order in order to compute a first weighted distance, and aligning each of the sub-tokens of the first document with a second corresponding sub-token of the second document in a reverse order in order to compute a second weighted distance, and combining the first and second weighted distances in order to find the measure of the distance between the respective file names.
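
The file-name comparison of claims 8 to 14 could be sketched as follows. This is an illustrative reading, not the method as implemented by the applicants. Assumptions: the tokenizer handles ASCII letters only, the numeric weight values are invented (the claims only require alpha above numeric and symbol, and acronyms above other sub-tokens), an acronym is taken to be an all-uppercase alpha sub-token of two or more letters, the per-token distance comes from the standard library's `difflib.SequenceMatcher`, and the forward and reverse distances are combined by a plain average.

```python
import re
from difflib import SequenceMatcher
from itertools import zip_longest

# Alpha runs with no lower-to-upper case change, digit runs, symbol runs.
TOKEN_RE = re.compile(r"[A-Z]+[a-z]*|[a-z]+|[0-9]+|[^A-Za-z0-9]+")

def sub_tokens(file_name):
    """Split a file name into alpha, numeric, and symbol sub-tokens."""
    return TOKEN_RE.findall(file_name)

def weight(token):
    """Hypothetical weights: acronym > alpha > numeric > symbol."""
    if token.isalpha():
        return 2.0 if len(token) > 1 and token.isupper() else 1.0
    if token.isdigit():
        return 0.3
    return 0.1

def aligned_distance(a, b):
    """Weighted mean of per-position sub-token distances (claim 13)."""
    total = weight_sum = 0.0
    for ta, tb in zip_longest(a, b, fillvalue=""):
        w = max(weight(ta), weight(tb))
        total += w * (1.0 - SequenceMatcher(None, ta, tb).ratio())
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0

def filename_distance(x, y):
    """Average of forward-aligned and reverse-aligned distances (claim 14)."""
    a, b = sub_tokens(x), sub_tokens(y)
    forward = aligned_distance(a, b)
    reverse = aligned_distance(a[::-1], b[::-1])
    return (forward + reverse) / 2
```

Aligning in both directions lets names that share a prefix (`Invoice_...`) and names that share a suffix (`..._final.pdf`) both register as similar, even when the number of sub-tokens differs.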

15. A method for document management, the method comprising:
- automatically identifying respective embedded objects in each of a set of documents;
- processing the embedded objects in a computer so as to extract respective embedded object features of the documents, wherein the embedded object features are indicative of format characteristics of the embedded objects in the documents;
- assessing a similarity between the documents by computing a measure of distance between the documents based on the respective embedded object features; and
- automatically clustering the documents responsively to the similarity so as to identify a cluster of the documents belonging to a common document type.

Dependent claims (16, 17):
- 16. The method according to claim 15, wherein the embedded object features comprise a respective shape of each of the embedded objects.
- 17. The method according to claim 15, wherein computing the measure of the distance comprises aligning each of the embedded objects in a first document with a corresponding embedded object in a second document, and computing an association score between the aligned embedded objects.
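
One way to picture claims 15 to 17 is the sketch below. Everything specific in it is assumed for illustration: the `EmbeddedObject` fields, the use of aspect ratio and area as the "shape" features, the greedy best-match alignment, and the normalization of the final distance. Widths and heights are assumed to be positive.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddedObject:
    kind: str        # e.g. "image", "chart", "table" (hypothetical labels)
    width: float
    height: float

def association(a, b):
    """Score in [0, 1]: 0 for different kinds, 1 for identical shape and size."""
    if a.kind != b.kind:
        return 0.0
    ratio_a, ratio_b = a.width / a.height, b.width / b.height
    aspect = min(ratio_a, ratio_b) / max(ratio_a, ratio_b)
    area_a, area_b = a.width * a.height, b.width * b.height
    size = min(area_a, area_b) / max(area_a, area_b)
    return (aspect + size) / 2

def object_distance(doc_a, doc_b):
    """Greedily align objects, then turn the summed scores into a distance."""
    if not doc_a and not doc_b:
        return 0.0
    remaining = list(doc_b)
    total = 0.0
    for obj in doc_a:
        if not remaining:
            break
        best = max(remaining, key=lambda other: association(obj, other))
        score = association(obj, best)
        if score > 0:                # leave non-matching objects available
            total += score
            remaining.remove(best)
    return 1.0 - total / max(len(doc_a), len(doc_b))
```

Dividing by the larger object count means that objects left unmatched on either side push the distance toward 1.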

18. A method for document management, the method comprising:
- automatically extracting headings from each of a set of documents;
- processing the headings in a computer so as to generate respective heading features of the documents;
- assessing a similarity between the documents by computing a measure of distance between the documents based on the respective heading features; and
- automatically clustering the documents responsively to the similarity so as to identify a cluster of the documents belonging to a common document type.

Dependent claims (19, 20, 21, 22, 23, 24, 25):
- 19. The method according to claim 18, wherein automatically extracting the headings comprises distinguishing the headings from paragraphs of text with which the headings are associated in the documents.
- 20. The method according to claim 19, wherein distinguishing the headings comprises assigning respective heading scores to the headings, indicating a respective level of confidence in each of the headings, and wherein processing the headings comprises choosing the headings for inclusion in the heading features responsively to the respective heading scores.
- 21. The method according to claim 20, wherein computing the measure comprises computing a weighted sum of association scores between the headings, weighted by the heading scores.
- 22. The method according to claim 18, wherein processing the headings comprises extracting format characteristics of the headings, and generating a heading style feature based on the format characteristics.
- 23. The method according to claim 18, wherein processing the headings comprises extracting textual content from the headings, and generating a heading text feature based on the textual content.
- 24. The method according to claim 23, wherein computing the measure of the distance comprises computing a heading text distance responsively to the textual content and computing a heading style distance responsively to format characteristics of the headings.
- 25. The method according to claim 18, wherein computing the measure of the distance comprises aligning each of the headings in a first document with a corresponding heading in a second document, and computing an association score between the aligned headings.
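
The heading-based comparison of claims 18 to 25 might look roughly like this. The sketch assumes heading extraction and scoring have already happened, so each `Heading` carries a confidence `score`. The `MIN_SCORE` cut-off, the style features (bold flag and font size), the equal text/style weighting, and the positional alignment in document order are hypothetical choices, not details taken from the application.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import zip_longest

@dataclass(frozen=True)
class Heading:
    text: str
    bold: bool
    font_size: float
    score: float     # confidence in (0, 1] that this line is a heading

MIN_SCORE = 0.5      # hypothetical cut-off for including a heading (claim 20)

def text_similarity(a, b):
    return SequenceMatcher(None, a.text.lower(), b.text.lower()).ratio()

def style_similarity(a, b):
    same_bold = 1.0 if a.bold == b.bold else 0.0
    size = min(a.font_size, b.font_size) / max(a.font_size, b.font_size)
    return (same_bold + size) / 2

def association(a, b, text_weight=0.5):
    """Blend of heading text and heading style similarity (claim 24)."""
    return (text_weight * text_similarity(a, b)
            + (1 - text_weight) * style_similarity(a, b))

def heading_distance(doc_a, doc_b):
    """1 minus the score-weighted mean association of aligned headings."""
    a = [h for h in doc_a if h.score >= MIN_SCORE]
    b = [h for h in doc_b if h.score >= MIN_SCORE]
    if not a and not b:
        return 0.0
    total = weight_sum = 0.0
    for ha, hb in zip_longest(a, b):
        if ha is None or hb is None:
            w, s = (ha or hb).score, 0.0     # unmatched heading: no association
        else:
            w, s = (ha.score + hb.score) / 2, association(ha, hb)
        total += w * s
        weight_sum += w
    return 1.0 - total / weight_sum
```

Weighting by the heading scores means a doubtful heading contributes less to the distance than one the extractor is confident about.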

26. A computer software product, comprising a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to extract respective features from each of a set of documents, to process the features so as to generate respective vectors for the documents, each vector comprising elements having respective values that represent properties of a respective document, to assess a similarity between the documents by computing a measure of distance between the respective vectors, and to cluster the documents responsively to the similarity so as to identify a cluster of the documents belonging to a common document type.

27. A computer software product, comprising a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to receive respective file names of a plurality of documents, to process each file name so as to divide the file name into a sequence of sub-tokens, to assign respective weights to the sub-tokens, to assess a similarity between the documents by computing a measure of distance between the respective file names based on the sub-tokens in each of the file names and on the respective weights of the sub-tokens, and to cluster the documents responsively to the similarity so as to identify at least one cluster of the documents belonging to a common document type.

28. A computer software product, comprising a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to identify respective embedded objects in each of a set of documents, to process the embedded objects so as to extract respective embedded object features of the documents, wherein the embedded object features are indicative of format characteristics of the embedded objects in the documents, to assess a similarity between the documents by computing a measure of distance between the documents based on the respective embedded object features, and to cluster the documents responsively to the similarity so as to identify a cluster of the documents belonging to a common document type.

29. A computer software product, comprising a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to extract headings from each of a set of the documents, to process the headings so as to generate respective heading features of the documents, to assess a similarity between the documents by computing a measure of distance between the documents based on the respective heading features, and to cluster the documents responsively to the similarity so as to identify a cluster of the documents belonging to a common document type.

30. A method for document management, the method comprising:
- providing respective training sets comprising known documents belonging to each of a plurality of document types;
- automatically extracting respective features from the known documents and from each of a set of new documents;
- processing the features in a computer so as to generate respective vectors for the documents, each vector comprising elements having respective values that represent properties of a respective document;
- assessing a similarity between the new documents and the known documents in each of the training sets by computing a measure of distance between the respective vectors; and
- automatically categorizing the new documents with respect to the document types responsively to the similarity.
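
The supervised variant in claim 30 can be sketched as a nearest-training-set rule. The sketch assumes numeric feature vectors of equal length, Euclidean distance, a non-empty training set per type, and an optional `max_distance` rejection threshold. None of these specifics come from the claim, which leaves the vector form and the distance measure open.

```python
import math
from statistics import mean

def categorize(vector, training_sets, max_distance=None):
    """Return the document type whose training vectors are closest on average.

    training_sets maps a type name to a non-empty list of feature vectors.
    Returns None when even the best type is farther than max_distance.
    """
    best_type, best_dist = None, math.inf
    for doc_type, examples in training_sets.items():
        d = mean(math.dist(vector, example) for example in examples)
        if d < best_dist:
            best_type, best_dist = doc_type, d
    if max_distance is not None and best_dist > max_distance:
        return None
    return best_type

# Hypothetical three-element vectors, e.g. heading count, table count, page count.
training = {
    "invoice": [(1.0, 0.0, 3.0), (1.2, 0.1, 2.8)],
    "contract": [(8.0, 5.0, 0.0), (7.5, 5.5, 0.2)],
}
label = categorize((1.1, 0.0, 3.1), training)
```

The rejection threshold lets a new document stay uncategorized when it resembles none of the known types.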

Current Assignee: Nogacom Ltd.
Original Assignee: Nogacom Ltd.
Inventors: Regev, Yizhar; Weiss, Gilad
Application Number: US12/853,310
Publication Number: US 20120041955A1
US Class Current: 707/740
CPC Class Codes: G06F 16/355 Class or cluster creation o...
