Category processing of query topics and electronic document content topics
Abstract
A system for tailoring user queries and for categorizing and searching metadata about content provided on the internet and/or intranet for delivery in accordance with customized user profiles. The method and system categorize query content and document content to facilitate the collection, storage and usage of same. Query content and document content are tokenized, vectorized, and provided for comparison processing by the inventive method.
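The tokenize, vectorize, and compare sequence described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation. The function names, the choice of three-word phrases as the smallest case of "more than two words", and the use of raw occurrence counts as vector weights are all assumptions made here for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def parse_items(text, n=3):
    """Split text into contiguous n-word phrases (n=3 is an assumed phrase length)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def assign_token_ids(all_items):
    """Give each distinct item an integer token ID in first-seen order."""
    vocab = {}
    for items in all_items:
        for item in items:
            vocab.setdefault(item, len(vocab))
    return vocab

def vectorize(items, vocab):
    """Sparse count vector: token ID -> occurrences (count weighting is an assumption)."""
    return Counter(vocab[item] for item in items)

def cosine(u, v):
    """Cosine measure between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def pairwise_similarities(vectors):
    """One similarity measure for each document against each other document."""
    return {(i, j): cosine(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)}

# Hypothetical sample documents.
docs = [
    "the quick brown fox jumps",
    "the quick brown fox sleeps",
    "stock prices fell sharply today",
]
items = [parse_items(d) for d in docs]
vocab = assign_token_ids(items)
vectors = [vectorize(it, vocab) for it in items]
sims = pairwise_similarities(vectors)
```

In this sample the first two documents share two of their three phrases, so their cosine measure is 2/3, while the third document shares none and scores 0 against both.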
14 Claims
-
1. A method for categorizing electronic document content of a plurality of documents for matching to user requests comprising the steps of:
-
parsing said document content into a plurality of items, each of said items comprising a contiguous phrase of more than two words located within said document;
assigning each of said plurality of items at least one of a plurality of token IDs;
vectorizing said plurality of token IDs into a plurality of document vectors;
calculating the cosine measure of each of said document vectors against each other of said document vectors to provide a plurality of similarity measures, one similarity measure for each document against each other of said plurality of documents.
-
2. The method of claim 1 further comprising comparing each of said similarity measures to a pre-set threshold.
-
3. The method of claim 2 further comprising storing each of said similarity measures which exceeds said pre-set threshold in a sparse matrix.
-
4. The method of claim 3 further comprising clustering said stored similarity measures in a plurality of clusters according to said cosine measures.
-
5. The method of claim 4 further comprising calculating a summary vector for each of said plurality of clusters.
-
6. The method of claim 5 further comprising the steps of:
-
identifying said summary vector as representing a new category for said documents in said cluster; and
creating a new category tag for said documents in said cluster.
-
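The dependent steps of claims 2 through 6 (thresholding, sparse storage, clustering, summary vectors, and category tags) can be sketched as follows. The claims do not name a clustering algorithm or a summary-vector formula, so this sketch substitutes connected components over the stored pairs and a simple centroid. Those choices, the function names, the sample data, and the `category-N` tag format are assumptions for illustration only.

```python
from collections import defaultdict

def threshold_to_sparse(sims, threshold):
    """Keep only pairs whose similarity exceeds the threshold (dict-of-keys sparse matrix)."""
    return {pair: s for pair, s in sims.items() if s > threshold}

def cluster(n_docs, sparse):
    """Group documents linked by a stored similarity (connected components, an assumed method)."""
    parent = list(range(n_docs))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in sparse:
        parent[find(i)] = find(j)
    groups = defaultdict(list)
    for d in range(n_docs):
        groups[find(d)].append(d)
    # Documents with no stored similarity form no cluster.
    return [members for members in groups.values() if len(members) > 1]

def summary_vector(vectors, members):
    """Centroid of the members' sparse vectors (an assumed summary formula)."""
    total = defaultdict(float)
    for m in members:
        for token_id, weight in vectors[m].items():
            total[token_id] += weight
    return {t: w / len(members) for t, w in total.items()}

# Hypothetical inputs: sparse document vectors and their pairwise cosine measures.
vectors = [{0: 1, 1: 1}, {0: 1, 1: 1, 2: 1}, {3: 1, 4: 1}, {3: 1, 4: 1}]
sims = {(0, 1): 0.82, (0, 2): 0.0, (0, 3): 0.0,
        (1, 2): 0.0, (1, 3): 0.0, (2, 3): 1.0}

sparse = threshold_to_sparse(sims, threshold=0.5)
clusters = cluster(len(vectors), sparse)
summaries = [summary_vector(vectors, c) for c in clusters]
# Each summary vector stands for a new category; the tag format is hypothetical.
tags = {f"category-{k}": members for k, members in enumerate(clusters)}
```

Storing only the above-threshold pairs keeps memory proportional to the number of related document pairs, not to the square of the collection size, which is the usual motivation for a sparse matrix here.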
-
7. A method for categorizing user input query content for matching user requests to electronic document content comprising the steps of:
-
parsing said query content into a plurality of items, each of said items comprising a contiguous phrase of more than two words located within said document;
assigning each of said plurality of items at least one of a plurality of token IDs;
vectorizing said plurality of token IDs into a plurality of query vectors;
calculating the cosine measure of each of said query vectors against each other of said query vectors to provide a plurality of similarity measures, one similarity measure for each query against each other of said plurality of queries.
-
8. The method of claim 7 further comprising comparing each of said similarity measures to a pre-set threshold.
-
9. The method of claim 8 further comprising storing each of said similarity measures which exceeds said pre-set threshold in a sparse matrix.
-
10. The method of claim 9 further comprising clustering said stored similarity measures in a plurality of clusters according to said cosine measures.
-
11. The method of claim 10 further comprising calculating a summary vector for each of said plurality of clusters.
-
12. The method of claim 11 further comprising the steps of:
-
identifying said summary vector as representing a new category for said queries in said cluster;
graphically presenting said clusters for human analysis; and
creating a new category tag for said queries in said cluster.
-
-
13. A method for categorizing electronic document content of a plurality of documents for matching to user requests comprising the steps of:
-
parsing said document content into a plurality of items, each of said items comprising one of a word or a contiguous phrase of words located within said document;
assigning to said plurality of items at least one of a plurality of token IDs, said token IDs representing a plurality of items;
vectorizing said plurality of token IDs into a plurality of document vectors;
calculating the cosine measure of each of said document vectors against each other of said document vectors to provide a plurality of similarity measures, one similarity measure for each document against each other of said plurality of documents.
-
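Claim 13 differs from claim 1 in that an item may be a single word or a phrase, and a token ID may represent more than one item. One possible reading is a lookup table in which several words or phrases share a token ID. The sketch below follows that reading. The table contents, the greedy longest-match parsing, and the decision to skip unknown words are assumptions, not details taken from the claims.

```python
# Hypothetical lookup table: several items (words or phrases) share one token ID.
TOKEN_IDS = {
    "car": 1, "automobile": 1, "motor vehicle": 1,
    "new york": 2, "nyc": 2,
    "price": 3, "cost": 3,
}
MAX_PHRASE_WORDS = max(len(item.split()) for item in TOKEN_IDS)

def tokenize(text):
    """Map text to token IDs, preferring the longest known phrase at each position."""
    words = text.lower().split()
    ids = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_PHRASE_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in TOKEN_IDS:
                ids.append(TOKEN_IDS[phrase])
                i += n
                break
        else:
            i += 1  # no known item starts here; skip the word (an assumption)
    return ids

first = tokenize("motor vehicle price in New York")
second = tokenize("automobile cost nyc")
```

Under this reading, two differently worded texts can yield the same token ID sequence, so their vectors match even though they share no surface words.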
-
14. A method for categorizing user input query content for matching user requests to electronic document content comprising the steps of:
-
parsing said query content into a plurality of items, each of said items comprising one of a word or a contiguous phrase of words located within said document;
assigning to said plurality of items at least one of a plurality of token IDs, each of said token IDs representing a plurality of items;
vectorizing said plurality of token IDs into a plurality of query vectors; and
calculating the cosine measure of each of said query vectors against each other of said query vectors to provide a plurality of similarity measures, one similarity measure for each query against each other of said plurality of queries.
-
Specification