System for categorizing documents in a linked collection of documents
Abstract
A system for extracting and analyzing information from a collection of linked documents at a locality to enable categorization of documents and prediction of documents relevant to a focus document. The system obtains and analyzes topology, usage and path information for a collection at a locality, e.g., a web locality on the World Wide Web. For categorization, document meta information is represented as document vectors. Predefined criteria are applied to the document vectors to create lists of "similar" types of documents. For relevance prediction, networks representing topology, usage path and text similarity amongst the documents in the collection are created. A spreading activation technique is applied to the networks starting at a focus document to predict the documents relevant to the focus document. Using category and relevance prediction information, tools can be built to enable a user to more efficiently traverse through the collection of linked documents.
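The relevance-prediction step in the abstract can be illustrated with a generic spreading activation loop. The following is a minimal sketch, not the method of the specification: the network representation (a dict of weighted edges), the function name `spread_activation`, and the `steps` and `decay` parameters are all assumptions made for illustration.

```python
def spread_activation(network, focus, steps=10, decay=0.5):
    """Rank documents by activation received from a focus document.

    network: dict mapping a document to {neighbor: association strength}.
    All names and parameters here are hypothetical, chosen for the sketch.
    """
    nodes = set(network) | {n for nbrs in network.values() for n in nbrs}
    activation = {n: 0.0 for n in nodes}
    activation[focus] = 1.0

    for _ in range(steps):
        nxt = {n: 0.0 for n in nodes}
        nxt[focus] = 1.0  # the focus document is a constant activation source
        for src, nbrs in network.items():
            for dst, strength in nbrs.items():
                # activation flows along each edge, damped by the decay factor
                nxt[dst] += decay * strength * activation[src]
        activation = nxt

    # most activated documents first; ties broken by name
    return sorted((n for n in nodes if n != focus),
                  key=lambda n: (-activation[n], n))


# Hypothetical three-page network built from, say, hyperlink topology.
network = {"home": {"a": 0.8, "b": 0.4}, "a": {"c": 0.5}}
ranking = spread_activation(network, "home")
```

In this toy network, pages directly and strongly associated with the focus document rank ahead of pages reached only through an intermediate page. The same loop could be run over a usage path network or a text similarity network by swapping the edge strengths.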
14 Claims
1. A system for categorizing documents contained in a linked collection of documents comprising:
means for obtaining raw data from said linked collection of documents, said raw data including meta information for documents in said linked collection of documents;
means for creating a feature vector for documents in said linked collection of documents from said raw data, said feature vector comprising a plurality of elements;
means for defining classification criteria indicating particular categories of document types, said classification criteria comprising user defined weightings of the elements for said feature vector and a corresponding class threshold value;
processing means for applying said classification criteria to feature vectors to determine if a document is in a corresponding category.
Dependent claims: 2, 3, 4, 5

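Claim 1 recites classification criteria made of user defined weightings and a class threshold value, but does not state how the two are combined. One plausible reading, assumed here, is a weighted sum of feature vector elements compared against the threshold. The element names (`outlinks`, `size`), the `ClassCriteria` type, and the "index page" category below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClassCriteria:
    name: str          # category label, e.g. "index page" (hypothetical)
    weights: dict      # feature element name -> user defined weighting
    threshold: float   # class threshold value


def score(features, criteria):
    # Assumption: weightings combine with elements as a weighted sum.
    return sum(w * features.get(elem, 0.0)
               for elem, w in criteria.weights.items())


def in_category(features, criteria):
    # A document is in the category when its score reaches the threshold.
    return score(features, criteria) >= criteria.threshold


index_page = ClassCriteria("index page", {"outlinks": 1.0, "size": -0.5}, 5.0)
hub = {"outlinks": 10, "size": 4}      # many links, little content
article = {"outlinks": 2, "size": 8}   # few links, much content
```

Here `in_category(hub, index_page)` holds and `in_category(article, index_page)` does not, because the negative weighting on `size` penalizes long pages for this category.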
6. A method for generating a list of web pages in a web locality that are contained in a user defined class comprising the steps of:
a) obtaining raw data for said web locality, said raw data including topology information and web locality usage information;
b) generating page meta data for each web page in said web locality from said raw data;
c) generating feature vectors for each web page in said web locality using said page meta data and said topology information, said feature vector comprised of a plurality of elements;
d) obtaining a classification criteria for determining if a web page is a member of a category of web pages, said classification criteria comprising user defined weightings of the plurality of elements for said feature vector and a corresponding class threshold value; and
e) applying said classification criteria to said feature vectors to obtain a list of pages in said category.
Dependent claims: 7, 8, 9, 10

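Steps a) through e) of claim 6 can be pictured as a small pipeline. This is a sketch under stated assumptions: topology is taken to be a list of (source, destination) links, usage is taken to be a list of requested page paths, and the feature elements (`inlinks`, `outlinks`, `requests`) and function names are hypothetical. The weighted-sum test is one possible reading of the claimed criteria, not something the claim spells out.

```python
from collections import Counter


def build_page_meta(requests):
    # Step b: per-page meta data derived from usage data (request counts).
    hits = Counter(requests)
    return {page: {"requests": count} for page, count in hits.items()}


def build_feature_vectors(meta, links):
    # Step c: combine page meta data with topology information.
    inlinks = Counter(dst for _, dst in links)
    outlinks = Counter(src for src, _ in links)
    pages = set(meta) | set(inlinks) | set(outlinks)
    return {
        page: {
            "inlinks": inlinks[page],
            "outlinks": outlinks[page],
            "requests": meta.get(page, {}).get("requests", 0),
        }
        for page in pages
    }


def list_class_members(vectors, weights, threshold):
    # Steps d and e: apply weightings and threshold, collect matching pages.
    members = []
    for page in sorted(vectors):
        total = sum(weights.get(name, 0.0) * value
                    for name, value in vectors[page].items())
        if total >= threshold:
            members.append(page)
    return members


# Step a: hypothetical raw data for a tiny web locality.
links = [("/index", "/a"), ("/index", "/b"), ("/a", "/b")]
requests = ["/index", "/index", "/a", "/b", "/index"]

vectors = build_feature_vectors(build_page_meta(requests), links)
popular_hubs = list_class_members(
    vectors, {"outlinks": 1.0, "requests": 1.0}, threshold=4.0)
```

With this data only `/index` reaches the threshold, since it combines two outgoing links with three requests.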
11. A system for generating characteristic data for a linked collection of documents comprising:
means for obtaining raw data for said linked collection of documents, said raw data including usage data, topology data and content data;
means for creating a feature vector for each document in said linked collection of documents from said raw data; and
means for categorizing each of said documents in said linked collection of documents according to predetermined classification criteria, said predetermined classification criteria comprising user defined weightings of the elements for said feature vector and a corresponding class threshold value.
Dependent claims: 12

13. A system for categorizing documents contained in a linked collection of documents comprising:
means for obtaining raw data from said linked collection of documents, said raw data including meta information for documents in said linked collection of documents;
means for creating a feature vector for documents in said linked collection of documents from said raw data, said feature vector having at least one element indicating a frequency of request for an associated document;
means for defining classification criteria indicating particular categories of document types;
processing means for applying said classification criteria to feature vectors to determine if a document is in a corresponding category.

14. A method for generating a list of web pages in a web locality that are contained in a user defined class comprising the steps of:
a) obtaining raw data for said web locality, said raw data including topology information and web locality usage information;
b) generating page meta data for each web page in said web locality from said raw data, said meta data including data indicating a frequency of request for an associated document;
c) generating feature vectors for each web page in said web locality using said page meta data and said topology information;
d) obtaining a classification criteria for determining if a web page is a member of a category of web pages; and
e) applying said classification criteria to said feature vectors to obtain a list of pages in said category.
Specification