Taxonomy discovery
Abstract
A taxonomy of a subset of a collection of documents is discovered by preprocessing the document collection, calculating a vector space for the preprocessed document collection, and grouping and labeling at least a first level of the taxonomy of the subset.
83 Citations
20 Claims
1. A computer-based method for generating a taxonomy of a collection of documents, comprising:
generating a term-by-document matrix for the collection of documents;
generating a vector for each document in the collection of documents based on the term-by-document matrix;
identifying document clusters based on similarity comparisons between pairs of the vectors;
identifying labels for the document clusters based on generalized entities included in documents of the document clusters; and
storing the labels in an electronic format accessible to a user.
Dependent claims: 2, 6, 7, 8, 9, 10, 11
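The steps recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: it assumes raw term counts for the term-by-document matrix, cosine similarity for the pairwise comparison, a fixed similarity threshold with single-link grouping for the clusters, and a hypothetical pre-extracted list of entity types per document (`doc_entities`) standing in for the "generalized entities". All function names and the sample data are illustrative.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, matrix); matrix[i][j] counts term i in document j."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def document_vectors(matrix):
    """Each document vector is one column of the term-by-document matrix."""
    return [list(column) for column in zip(*matrix)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster(vectors, threshold=0.5):
    """Group documents linked by a pairwise similarity >= threshold (assumed rule)."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def label_cluster(doc_ids, doc_entities):
    """Label a cluster with the most frequent entity type among its documents."""
    tally = Counter(e for d in doc_ids for e in doc_entities[d])
    return tally.most_common(1)[0][0] if tally else "unlabeled"


# Illustrative data only.
docs = [
    "acme bank raises interest rates",
    "acme bank cuts interest rates",
    "storm floods coastal town",
    "storm damages coastal town",
]
doc_entities = [
    ["ORGANIZATION"],
    ["ORGANIZATION", "MONEY"],
    ["LOCATION"],
    ["LOCATION", "EVENT"],
]

vocab, matrix = term_document_matrix(docs)
vectors = document_vectors(matrix)
clusters = cluster(vectors)
labels = [label_cluster(ids, doc_entities) for ids in clusters]
```

With this sample data the two banking documents fall into one cluster and the two storm documents into another. The threshold and the labeling rule are the parts most likely to differ in a real system.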
3. A computer program product comprising a computer usable medium having computer readable program code stored therein that causes an application program for generating a taxonomy of a collection of documents to execute on an operating system of a computer, the computer readable program code comprising:
computer readable first program code for causing the computer to generate a term-by-document matrix for the collection of documents;
computer readable second program code for causing the computer to generate a vector for each document in the collection of documents based on the term-by-document matrix;
computer readable third program code for causing the computer to identify document clusters based on similarity comparisons between pairs of the vectors;
computer readable fourth program code for causing the computer to identify labels for the document clusters based on generalized entities included in documents of the document clusters; and
computer readable fifth program code for causing the computer to store the labels in an electronic format accessible to a user.
Dependent claims: 4, 12, 13, 14, 15, 16, 17
5. A system for generating a taxonomy of a collection of documents, comprising:
a plurality of processors that each communicate with at least one other processor in the plurality of processors over a network; and
a computer program product comprising a computer usable medium having computer readable program code stored therein that causes an application program for generating a taxonomy of a collection of documents to execute on at least one of the processors in the plurality of processors, wherein the computer program product includes
computer readable first program code for causing the computer to generate a term-by-document matrix for the collection of documents;
computer readable second program code for causing the computer to generate a vector for each document in the collection of documents based on the term-by-document matrix;
computer readable third program code for causing the computer to identify document clusters based on similarity comparisons between pairs of the vectors;
computer readable fourth program code for causing the computer to identify labels for the document clusters based on generalized entities included in documents of the document clusters; and
computer readable fifth program code for causing the computer to transmit the labels over the network.
Dependent claims: 18, 19, 20
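The claims leave open what "an electronic format accessible to a user" is and how labels are transmitted over the network. One plausible reading, shown here purely as an assumption, is to serialize the cluster labels and their member documents to JSON, which can be written to disk or sent to another processor. The function names and the payload layout (`taxonomy`, `label`, `documents`) are hypothetical.

```python
import json


def serialize_taxonomy(labels_to_docs):
    """Encode {label: [document ids]} as a JSON string (assumed payload layout)."""
    nodes = [
        {"label": label, "documents": sorted(doc_ids)}
        for label, doc_ids in sorted(labels_to_docs.items())
    ]
    return json.dumps({"taxonomy": nodes}, indent=2)


def deserialize_taxonomy(payload):
    """Decode the JSON string back into {label: [document ids]}."""
    data = json.loads(payload)
    return {node["label"]: node["documents"] for node in data["taxonomy"]}


# Illustrative labels and document ids only.
taxonomy = {"ORGANIZATION": [0, 1], "LOCATION": [2, 3]}
payload = serialize_taxonomy(taxonomy)
restored = deserialize_taxonomy(payload)
```

A text format such as JSON keeps the stored labels readable by a user and by other processes alike. A real deployment might use a different format or transport.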
Specification