Method and system for creating subgroups of documents using optical character recognition data

US 9,069,768 B1
Filed: 04/03/2013
Issued: 06/30/2015
Est. Priority Date: 03/28/2012
Status: Active Grant


Abstract

Creating subgroups of documents using optical character recognition data is described. A matrix is created for words included in documents. Each column-row combination in the matrix indicates whether a corresponding word that is associated with the column-row combination is included in a corresponding document that is associated with the column-row combination. Distances are identified between pairs of the words. Each distance is based on a number of the documents that differ in including a corresponding pair of the words. Word clusters are created. Each word cluster includes pairs of words associated with a corresponding distance less than a distance threshold. Sets of word clusters are created. A set of word clusters includes word clusters that are not associated with any of the documents associated with other word clusters in the set. Subgroups of the digitized documents are created based on a set of word clusters with a highest word score.
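The first three steps of the abstract (matrix, pairwise distances, thresholded word pairs) can be sketched as follows. This is only an illustration under stated assumptions: the distance is read here as a Hamming distance between two word rows (the count of documents that contain exactly one of the two words), tokenization is a plain whitespace split, and the names `build_matrix`, `word_distance`, and `close_pairs` are hypothetical and do not come from the patent.

```python
from itertools import combinations


def build_matrix(docs):
    """Map each word to a row of booleans, one entry per document.

    matrix[word][i] is True when the word occurs in docs[i].
    Whitespace tokenization is an assumption made for this sketch.
    """
    tokenized = [set(text.split()) for text in docs]
    words = sorted(set().union(*tokenized))
    return {w: [w in toks for toks in tokenized] for w in words}


def word_distance(row_a, row_b):
    """Count the documents that include exactly one of the two words."""
    return sum(a != b for a, b in zip(row_a, row_b))


def close_pairs(matrix, threshold):
    """Return word pairs whose distance is strictly below the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(matrix), 2)
        if word_distance(matrix[a], matrix[b]) < threshold
    ]


# Hypothetical OCR output for four scanned documents.
docs = [
    "invoice total due",
    "invoice total paid",
    "resume skills education",
    "resume skills",
]
matrix = build_matrix(docs)
pairs = close_pairs(matrix, threshold=1)
print(pairs)
```

With a threshold of 1, only words that appear in exactly the same documents are paired, so "invoice" pairs with "total" and "resume" pairs with "skills". A looser threshold would also admit words such as "due" that appear in some but not all of those documents.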

20 Claims

1. A system for creating subgroups of documents using optical character recognition data, the system comprising:
- one or more processors; and
  
  a non-transitory computer readable medium storing a plurality of instructions, which when executed, cause the one or more processors to:
  
  create a matrix for words included in documents, wherein each column-row combination in the matrix indicates whether a corresponding word that is associated with the column-row combination is included in a corresponding document that is associated with the column-row combination;
  
  identify distances between pairs of the words in the matrix, wherein each distance is based on a number of the documents that differ in including a corresponding pair of the words;
  
  create word clusters, wherein each word cluster comprises pairs of words associated with a corresponding distance less than a distance threshold;
  
  create sets of word clusters, wherein a set of word clusters comprises word clusters that are not associated with any of the documents associated with other word clusters in the set of word clusters; and
  
  create subgroups of the digitized documents based on a set of word clusters corresponding to a high word score relative to at least one other word score corresponding to at least one other set of word clusters.
  - 2. The system of claim 1, wherein the words comprise keywords associated with the documents based on a comparison of the documents with at least one of a class and a template.
  - 3. The system of claim 1, wherein the documents comprise digitized optical character recognition data.
  - 4. The system of claim 1, wherein the documents are associated with a class in response to a comparison to classify documents similar to a first document of the documents.
  - 5. The system of claim 1, wherein the documents are associated with a template in response to a comparison to classify documents similar to a first document of the documents.
  - 6. The system of claim 1, wherein the highest word score is based on a total number of words in the set of word clusters.
  - 7. The system of claim 1, wherein the highest word score is based on an average number of words in the set of word clusters.
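Claims 6 and 7 name two bases for the word score: the total number of words in a set of word clusters and the average number of words. A minimal sketch of how the two can rank candidate sets differently is given below. The representation of a set of word clusters as a list of Python sets, the reading of "average" as words per cluster, and the function names are all assumptions made for illustration.

```python
def total_word_score(cluster_set):
    """Sum of the word counts over all clusters in the set."""
    return sum(len(cluster) for cluster in cluster_set)


def average_word_score(cluster_set):
    """Mean number of words per cluster (0.0 for an empty set)."""
    if not cluster_set:
        return 0.0
    return total_word_score(cluster_set) / len(cluster_set)


def best_cluster_set(cluster_sets, score):
    """Pick the candidate set with the highest score."""
    return max(cluster_sets, key=score)


# Two hypothetical candidate sets of word clusters.
set_a = [{"invoice", "total", "due"}, {"resume", "skills"}]
set_b = [{"invoice", "total", "amount", "date"}]

by_total = best_cluster_set([set_a, set_b], total_word_score)
by_average = best_cluster_set([set_a, set_b], average_word_score)
```

Here `set_a` holds five words across two clusters and `set_b` holds four words in one cluster, so the total-based score prefers `set_a` while the average-based score prefers `set_b`.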

8. A computer-implemented method for creating subgroups of documents using optical character recognition data, the method comprising:
- creating a matrix for words included in documents, wherein each column-row combination in the matrix indicates whether a corresponding word that is associated with the column-row combination is included in a corresponding document that is associated with the column-row combination;
  
  identifying distances between pairs of the words in the matrix, wherein each distance is based on a number of the documents that differ in including a corresponding pair of the words;
  
  creating word clusters, wherein each word cluster comprises pairs of words associated with a corresponding distance less than a distance threshold;
  
  creating sets of word clusters, wherein a set of word clusters comprises word clusters that are not associated with any of the documents associated with other word clusters in the set of word clusters; and
  
  creating subgroups of the digitized documents based on a set of word clusters corresponding to a high word score relative to at least one other word score corresponding to at least one other set of word clusters.
  - 9. The computer-implemented method of claim 8, wherein the words comprise keywords associated with the documents based on a comparison of the documents with at least one of a class and a template.
  - 10. The computer-implemented method of claim 8, wherein the documents comprise digitized optical character recognition data.
  - 11. The computer-implemented method of claim 8, wherein the documents are associated with a class in response to a comparison to classify documents similar to a first document of the documents.
  - 12. The computer-implemented method of claim 8, wherein the documents are associated with a template in response to a comparison to classify documents similar to a first document of the documents.
  - 13. The computer-implemented method of claim 8, wherein the highest word score is based on a total number of words in the set of word clusters.
  - 14. The computer-implemented method of claim 8, wherein the highest word score is based on an average number of words in the set of word clusters.
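The last two steps of the method (forming sets of word clusters that share no documents, then creating subgroups from the best-scoring set) might look like the sketch below. Several choices here are assumptions rather than anything the claims specify: a cluster is treated as "associated with" a document when the document contains every word in the cluster, candidate sets are enumerated by brute force (exponential in the number of clusters, so suitable only for small inputs), and the score is the total word count. The names `cluster_docs`, `disjoint_sets`, and `subgroups` are hypothetical.

```python
from itertools import combinations


def cluster_docs(cluster, doc_words):
    """Indices of documents that contain every word in the cluster."""
    return frozenset(
        i for i, words in enumerate(doc_words) if cluster <= words
    )


def disjoint_sets(clusters, doc_words):
    """All non-empty combinations of clusters with no shared documents."""
    docs_of = {c: cluster_docs(c, doc_words) for c in clusters}
    result = []
    for size in range(1, len(clusters) + 1):
        for combo in combinations(clusters, size):
            if all(
                docs_of[a].isdisjoint(docs_of[b])
                for a, b in combinations(combo, 2)
            ):
                result.append(combo)
    return result


def subgroups(clusters, doc_words):
    """Map each cluster of the best-scoring set to its document indices."""
    candidates = disjoint_sets(clusters, doc_words)
    best = max(candidates, key=lambda combo: sum(len(c) for c in combo))
    return {c: sorted(cluster_docs(c, doc_words)) for c in best}


# Hypothetical word sets for four documents and three word clusters.
doc_words = [
    {"invoice", "total", "due"},
    {"invoice", "total", "paid"},
    {"resume", "skills", "education"},
    {"resume", "skills"},
]
clusters = [
    frozenset({"invoice", "total"}),
    frozenset({"resume", "skills"}),
    frozenset({"skills"}),
]
groups = subgroups(clusters, doc_words)
```

In this example the two resume-related clusters cover the same documents and therefore cannot share a set. The best-scoring set pairs the invoice cluster with the larger resume cluster, which splits the four documents into subgroups `[0, 1]` and `[2, 3]`.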

15. A computer program product, comprising computer-readable program code to be executed by one or more processors when retrieved from a non-transitory computer-readable medium, the program code including instructions to:
- create a matrix for words included in documents, wherein each column-row combination in the matrix indicates whether a corresponding word that is associated with the column-row combination is included in a corresponding document that is associated with the column-row combination;
  
  identify distances between pairs of the words in the matrix, wherein each distance is based on a number of the documents that differ in including a corresponding pair of the words;
  
  create word clusters, wherein each word cluster comprises pairs of words associated with a corresponding distance less than a distance threshold;
  
  create sets of word clusters, wherein a set of word clusters comprises word clusters that are not associated with any of the documents associated with other word clusters in the set of word clusters; and
  
  create subgroups of the digitized documents based on a set of word clusters corresponding to a high word score relative to at least one other word score corresponding to at least one other set of word clusters.
  - 16. The computer program product of claim 15, wherein the words comprise keywords associated with the documents based on a comparison of the documents with at least one of a class and a template.
  - 17. The computer program product of claim 15, wherein the documents comprise digitized optical character recognition data.
  - 18. The computer program product of claim 15, wherein the documents are associated with a class in response to a comparison to classify documents similar to a first document of the documents.
  - 19. The computer program product of claim 15, wherein the documents are associated with a template in response to a comparison to classify documents similar to a first document of the documents.
  - 20. The computer program product of claim 15, wherein the highest word score is based on a total number of words in the set of word clusters.

Current Assignee: Open Text Corporation
Original Assignee: EMC Corporation (Dell Technologies Inc.)
Inventors: Sampson, Steven
Primary Examiner(s): NGUYEN, PHONG H
Application Number: US13/855,906
Time in Patent Office: 818 Days
Field of Search: 707/737, 707/749, 707/758
US Class Current: 1/1
CPC Class Codes:
- G06F 16/285   Clustering or classification
- G06F 16/35   Clustering; Classification
- G06F 16/355   Class or cluster creation o...
- G06V 30/10   Character recognition
- G06V 30/196   using sequential comparison...
- G06V 30/414   Extracting the geometrical ...
- G06V 30/418   Document matching, e.g. of ...
- G16C 99/00   Subject matter not provided...
