Taxonomy generation for electronic documents
First Claim
1. A computer-implemented method comprising:
extracting terms from a plurality of electronic documents;
ranking the extracted terms using two or more term ranking algorithms;
aggregating rankings of the ranked extracted terms to produce first aggregate rankings, each of the rankings resulting from one of the two or more ranking algorithms;
selecting terms from the extracted terms, the selected terms having the first aggregate rankings above a pre-determined threshold;
generating term pairs from the selected terms;
ranking terms in each term pair based on a relative specificity of the selected terms using two or more term pair ranking algorithms;
aggregating the ranks of the terms in each term pair to produce second aggregate rankings, each of the ranks resulting from the two or more term pair ranking algorithms;
selecting term pairs having the second aggregate rankings above a pre-determined threshold;
generating a term hierarchy from the selected term pairs;
assigning documents to nodes of the term hierarchy based on a number of terms within a branch of the term hierarchy associated with each node that match terms extracted from each document; and
storing assignments of the documents to the nodes to a memory for retrieval of one or more documents responsive to a search query.
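The first four steps of the claim (extract, rank with two or more algorithms, aggregate, select above a threshold) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the whitespace tokenizer, the two ranking algorithms (total frequency and document frequency), the use of a mean of normalized rank scores as the aggregate, and the names `extract_terms`, `to_rank_scores`, and `select_terms`. The claim itself does not name any particular algorithm or aggregation rule.

```python
from collections import Counter

def extract_terms(doc):
    # Hypothetical extractor: lowercase whitespace tokens stand in for real term extraction.
    return doc.lower().split()

def to_rank_scores(raw):
    # Turn raw scores into rank scores in (0, 1]; the top-ranked term gets 1.0.
    # Ties are broken alphabetically, an arbitrary choice for this sketch.
    ordered = sorted(raw, key=lambda t: (-raw[t], t))
    n = len(ordered)
    return {t: (n - i) / n for i, t in enumerate(ordered)}

def total_frequency(docs_terms):
    # Assumed ranking algorithm 1: how often a term occurs across all documents.
    counts = Counter()
    for terms in docs_terms:
        counts.update(terms)
    return counts

def document_frequency(docs_terms):
    # Assumed ranking algorithm 2: how many documents contain the term.
    counts = Counter()
    for terms in docs_terms:
        counts.update(set(terms))
    return counts

def select_terms(docs, threshold):
    docs_terms = [extract_terms(d) for d in docs]
    # One ranking per algorithm.
    rankings = [to_rank_scores(alg(docs_terms))
                for alg in (total_frequency, document_frequency)]
    # First aggregate ranking: mean of the per-algorithm rank scores.
    aggregate = {t: sum(r[t] for r in rankings) / len(rankings)
                 for t in rankings[0]}
    # Keep terms whose aggregate ranking is above the threshold.
    return sorted(t for t, score in aggregate.items() if score > threshold)

docs = [
    "database index query",
    "database query optimizer",
    "database index btree index",
]
print(select_terms(docs, threshold=0.5))  # ['database', 'index', 'query']
```

Averaging rank scores rather than raw scores keeps algorithms with different score scales comparable, which is one plausible reason to aggregate rankings instead of scores.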
Abstract
Systems and techniques to generate a term taxonomy for a collection of documents and fill the taxonomy with documents from the collection. In general, in one implementation, the technique includes: extracting terms from a plurality of documents; generating term pairs from the terms; ranking terms in each term pair based on a relative specificity of the terms; aggregating the ranks of the terms in each term pair; selecting term pairs based on the aggregate rankings; and generating a term hierarchy from the selected term pairs.
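The pair-ranking and hierarchy steps in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the patented method: the two specificity measures (a document-frequency comparison and a co-occurrence subsumption heuristic), the mean used to aggregate them, the threshold value, and the function names are all hypothetical. A positive score for a pair (a, b) is taken to mean that a is the more general term.

```python
from itertools import combinations

def doc_freq_score(a, b, doc_sets):
    # Assumed measure 1: positive if a occurs in more documents than b.
    df_a = sum(a in d for d in doc_sets)
    df_b = sum(b in d for d in doc_sets)
    total = df_a + df_b
    return (df_a - df_b) / total if total else 0.0

def subsumption_score(a, b, doc_sets):
    # Assumed measure 2: positive if documents containing b usually contain a,
    # more so than the reverse.
    df_a = sum(a in d for d in doc_sets)
    df_b = sum(b in d for d in doc_sets)
    both = sum(a in d and b in d for d in doc_sets)
    p_a_given_b = both / df_b if df_b else 0.0
    p_b_given_a = both / df_a if df_a else 0.0
    return p_a_given_b - p_b_given_a

def build_hierarchy(terms, doc_sets, threshold=0.2):
    algorithms = (doc_freq_score, subsumption_score)
    edges = set()  # (parent, child) pairs that pass the threshold
    for a, b in combinations(sorted(terms), 2):
        # Aggregate ranking for the pair: mean of the per-algorithm scores.
        agg = sum(f(a, b, doc_sets) for f in algorithms) / len(algorithms)
        if abs(agg) > threshold:
            edges.add((a, b) if agg > 0 else (b, a))
    # Drop an edge when a two-step path already connects the same pair,
    # so each term hangs under its most specific selected parent.
    direct = {(p, c) for (p, c) in edges
              if not any((p, m) in edges and (m, c) in edges for m in terms)}
    hierarchy = {t: [] for t in terms}
    for parent, child in sorted(direct):
        hierarchy[parent].append(child)
    return hierarchy

doc_sets = [
    {"database", "index", "query"},
    {"database", "query", "optimizer"},
    {"database", "index", "btree"},
    {"database"},
]
tree = build_hierarchy(["database", "index", "btree"], doc_sets)
print(tree)  # {'database': ['index'], 'index': ['btree'], 'btree': []}
```

The two-step pruning is a simplification: it is a full transitive reduction only when the selected pairs are themselves transitive, which holds in this small example but is not guaranteed in general.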
92 Citations
25 Claims
1. A computer-implemented method comprising the steps recited under First Claim above.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 22, 23, 24, 25
12. An article comprising a machine-readable medium storing instructions executed by one or more machines to perform operations comprising:
extracting terms from a plurality of electronic documents;
ranking the extracted terms using two or more term ranking algorithms;
aggregating rankings of the ranked extracted terms to produce first aggregate rankings, each of the rankings resulting from one of the two or more ranking algorithms;
selecting terms from the extracted terms, the selected terms having the first aggregate rankings above a pre-determined threshold;
generating term pairs from the selected terms;
ranking terms in each term pair based on a relative specificity of the selected terms using two or more term pair ranking algorithms;
aggregating the ranks of the terms in each term pair to produce second aggregate rankings, each of the ranks resulting from the two or more term pair ranking algorithms;
selecting term pairs having the second aggregate rankings above a pre-determined threshold;
generating a term hierarchy from the selected term pairs; and
storing assignments of the documents to the nodes to a memory for retrieval of one or more documents responsive to a search query.
Dependent claims: 13, 14, 15, 16, 17, 18, 19
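The stored document-to-node assignments mentioned in the last step can be pictured with a short Python sketch. It assumes the branch-matching rule stated in claim 1 (count how many terms in a node's branch match the terms extracted from a document) and adds details the claims do not specify: a hypothetical `min_matches` cutoff, a plain dictionary as the in-memory store, and a lookup by a single query term.

```python
def branch_terms(hierarchy, node):
    # Collect the node's own term plus every term beneath it in the hierarchy.
    seen, stack = set(), [node]
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(hierarchy.get(term, []))
    return seen

def assign_documents(hierarchy, doc_terms, min_matches=2):
    # Assign a document to a node when enough branch terms match its extracted terms.
    # min_matches is an assumed cutoff; the claims only say the assignment is
    # "based on a number of terms" that match.
    assignments = {node: [] for node in hierarchy}
    for node in hierarchy:
        branch = branch_terms(hierarchy, node)
        for doc_id, terms in doc_terms.items():
            if len(branch & terms) >= min_matches:
                assignments[node].append(doc_id)
    return assignments

def search(assignments, query_term):
    # Retrieve the documents stored under the node matching the query term.
    return assignments.get(query_term, [])

hierarchy = {"database": ["index"], "index": ["btree"], "btree": []}
doc_terms = {
    "d1": {"database", "index", "query"},
    "d2": {"database", "query"},
    "d3": {"index", "btree"},
}
assignments = assign_documents(hierarchy, doc_terms)
print(assignments)                   # {'database': ['d1', 'd3'], 'index': ['d3'], 'btree': []}
print(search(assignments, "index"))  # ['d3']
```

Counting matches over the whole branch means a document about a narrow subtopic can still be filed under a broader ancestor node, as happens with `d3` under `database` here.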
20. An apparatus comprising:
a processor executing instructions to perform operations comprising:
ranking extracted terms using two or more term ranking algorithms, the extracted terms extracted from a plurality of electronic documents;
aggregating rankings of the ranked extracted terms to produce first aggregate rankings, each of the rankings resulting from one of the two or more ranking algorithms;
selecting terms from the extracted terms, the selected terms having the first aggregate rankings above a pre-determined threshold;
generating term pairs from the selected terms;
ranking terms in each term pair based on a relative specificity of the selected terms using two or more term pair ranking algorithms;
aggregating the ranks of the terms in each term pair to produce second aggregate rankings, each of the ranks resulting from the two or more term pair ranking algorithms;
selecting term pairs having the second aggregate rankings above a pre-determined threshold;
generating a term hierarchy from the selected term pairs; and
storing the term hierarchy to a memory for retrieval of one or more documents responsive to a search query.
Dependent claims: 21
Specification