Method and system for creating subgroups of documents using optical character recognition data
First Claim
1. A system for creating subgroups of documents using optical character recognition data, the system comprising:
one or more processors; and
a non-transitory computer readable medium storing a plurality of instructions, which when executed, cause the one or more processors to:
create a matrix for words included in documents, wherein each column-row combination in the matrix indicates whether a corresponding word that is associated with the column-row combination is included in a corresponding document that is associated with the column-row combination;
identify distances between pairs of the words in the matrix, wherein each distance is based on a number of the documents that differ in including a corresponding pair of the words;
create word clusters, wherein each word cluster comprises pairs of words associated with a corresponding distance less than a distance threshold;
create sets of word clusters, wherein a set of word clusters comprises word clusters that are not associated with any of the documents associated with other word clusters in the set of word clusters; and
create subgroups of the digitized documents based on a set of word clusters corresponding to a high word score relative to at least one other word score corresponding to at least one other set of word clusters.
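The first two claimed steps, the word matrix and the pairwise word distances, can be illustrated with a minimal Python sketch. It assumes that "a number of the documents that differ in including a corresponding pair of the words" means the count of documents containing exactly one of the two words (a Hamming distance between matrix rows); the claim does not fix this reading. The function names `build_matrix` and `pair_distances`, the whitespace tokenization, and the sample documents are hypothetical and stand in for real optical character recognition output.

```python
from itertools import combinations

def build_matrix(docs):
    """Return (words, matrix); matrix[w][d] is 1 if words[w] occurs in docs[d]."""
    # Assumption: OCR text is tokenized by lowercasing and splitting on whitespace.
    token_sets = [set(text.lower().split()) for text in docs]
    words = sorted(set().union(*token_sets))
    matrix = [[1 if w in tokens else 0 for tokens in token_sets] for w in words]
    return words, matrix

def pair_distances(words, matrix):
    """Distance of a word pair = number of documents containing exactly one of the two."""
    dist = {}
    for i, j in combinations(range(len(words)), 2):
        dist[(words[i], words[j])] = sum(a != b for a, b in zip(matrix[i], matrix[j]))
    return dist

# Hypothetical OCR output for three documents.
docs = [
    "invoice total amount",
    "invoice total due",
    "resume skills education",
]
words, matrix = build_matrix(docs)
dist = pair_distances(words, matrix)
```

Under this reading, words that always appear in the same documents (here "invoice" and "total") have distance 0, while words that never co-occur have a distance equal to the number of documents containing either one.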
Abstract
Creating subgroups of documents using optical character recognition data is described. A matrix is created for words included in documents. Each column-row combination in the matrix indicates whether a corresponding word that is associated with the column-row combination is included in a corresponding document that is associated with the column-row combination. Distances are identified between pairs of the words. Each distance is based on a number of the documents that differ in including a corresponding pair of the words. Word clusters are created. Each word cluster includes pairs of words associated with a corresponding distance less than a distance threshold. Sets of word clusters are created. A set of word clusters includes word clusters that are not associated with any of the documents associated with other word clusters in the set. Subgroups of the digitized documents are created based on a set of word clusters with a highest word score.
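The clustering and selection steps described in the abstract can be sketched as follows. This is only an illustration under several assumptions that the abstract does not state: words are linked into clusters transitively (single linkage) whenever a pair's distance is below the threshold, a document is "associated with" a cluster if it contains any of the cluster's words, and the word score of a set is the total number of words in its clusters. The names `word_clusters` and `best_subgroups` are hypothetical, and the exhaustive search over sets is practical only for a small number of clusters.

```python
from itertools import combinations

def word_clusters(word_docs, threshold):
    """Link every pair of words whose distance is below the threshold (assumed single linkage)."""
    words = sorted(word_docs)
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            w = parent[w]
        return w

    for a, b in combinations(words, 2):
        # Symmetric difference = documents containing exactly one of the two words.
        if len(word_docs[a] ^ word_docs[b]) < threshold:
            parent[find(a)] = find(b)

    groups = {}
    for w in words:
        groups.setdefault(find(w), set()).add(w)
    # A cluster is built from pairs of words, so unpaired words are dropped.
    return [frozenset(g) for g in groups.values() if len(g) > 1]

def best_subgroups(word_docs, threshold):
    """Return document subgroups from the highest-scoring set of document-disjoint clusters."""
    clusters = word_clusters(word_docs, threshold)
    docs_of = {c: set().union(*(word_docs[w] for w in c)) for c in clusters}
    best, best_score = (), -1
    for r in range(1, len(clusters) + 1):
        for candidate in combinations(clusters, r):
            disjoint = all(docs_of[a].isdisjoint(docs_of[b])
                           for a, b in combinations(candidate, 2))
            score = sum(len(c) for c in candidate)  # assumed word score
            if disjoint and score > best_score:
                best, best_score = candidate, score
    return [sorted(docs_of[c]) for c in best]

# Hypothetical input: each word mapped to the ids of the documents containing it.
word_docs = {
    "invoice": {0, 1}, "total": {0, 1}, "amount": {0}, "due": {1},
    "resume": {2}, "skills": {2}, "education": {2},
}
subgroups = best_subgroups(word_docs, threshold=1)
```

With a threshold of 1, only words that occur in exactly the same documents are clustered, so the sample input yields one subgroup for documents 0 and 1 and another for document 2.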
61 Citations
20 Claims
1. A system for creating subgroups of documents using optical character recognition data, the system comprising:
one or more processors; and
a non-transitory computer readable medium storing a plurality of instructions, which when executed, cause the one or more processors to:
create a matrix for words included in documents, wherein each column-row combination in the matrix indicates whether a corresponding word that is associated with the column-row combination is included in a corresponding document that is associated with the column-row combination;
identify distances between pairs of the words in the matrix, wherein each distance is based on a number of the documents that differ in including a corresponding pair of the words;
create word clusters, wherein each word cluster comprises pairs of words associated with a corresponding distance less than a distance threshold;
create sets of word clusters, wherein a set of word clusters comprises word clusters that are not associated with any of the documents associated with other word clusters in the set of word clusters; and
create subgroups of the digitized documents based on a set of word clusters corresponding to a high word score relative to at least one other word score corresponding to at least one other set of word clusters.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A computer-implemented method for creating subgroups of documents using optical character recognition data, the method comprising:
creating a matrix for words included in documents, wherein each column-row combination in the matrix indicates whether a corresponding word that is associated with the column-row combination is included in a corresponding document that is associated with the column-row combination;
identifying distances between pairs of the words in the matrix, wherein each distance is based on a number of the documents that differ in including a corresponding pair of the words;
creating word clusters, wherein each word cluster comprises pairs of words associated with a corresponding distance less than a distance threshold;
creating sets of word clusters, wherein a set of word clusters comprises word clusters that are not associated with any of the documents associated with other word clusters in the set of word clusters; and
creating subgroups of the digitized documents based on a set of word clusters corresponding to a high word score relative to at least one other word score corresponding to at least one other set of word clusters.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A computer program product, comprising computer-readable program code to be executed by one or more processors when retrieved from a non-transitory computer-readable medium, the program code including instructions to:
create a matrix for words included in documents, wherein each column-row combination in the matrix indicates whether a corresponding word that is associated with the column-row combination is included in a corresponding document that is associated with the column-row combination;
identify distances between pairs of the words in the matrix, wherein each distance is based on a number of the documents that differ in including a corresponding pair of the words;
create word clusters, wherein each word cluster comprises pairs of words associated with a corresponding distance less than a distance threshold;
create sets of word clusters, wherein a set of word clusters comprises word clusters that are not associated with any of the documents associated with other word clusters in the set of word clusters; and
create subgroups of the digitized documents based on a set of word clusters corresponding to a high word score relative to at least one other word score corresponding to at least one other set of word clusters.
Dependent claims: 16, 17, 18, 19, 20
Specification