Enhanced hypertext categorization using hyperlinks
Abstract
A method, apparatus, and article of manufacture for a computer implemented hypertext classifier. A new document containing citations to and from other documents is classified. Initially, documents within a neighborhood of the new document are identified. For each document and each class, an initial probability is determined that indicates the probability that the document fits a particular class. Next, iterative relaxation is performed to identify a class for each document using the initial probabilities. A class is selected into which the new document is to be classified based on the initial probabilities and identified classes.
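The flow described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the initial probabilities come from some text-only classifier, that all probabilities are strictly positive, and that link evidence is combined through a hypothetical `affinity[c][c2]` table giving how likely a document of class `c` is to be linked with a document of class `c2`. The names `relax_hard_labels`, `initial`, `links`, `affinity`, and `known` are illustrative.

```python
import math


def relax_hard_labels(initial, links, affinity, known, max_iters=10):
    """Assign one class per document by iterative relaxation (hard labels).

    initial:  {doc: {cls: prob}} text-only initial probabilities (all > 0)
    links:    {doc: [neighbor docs]} citations to and from each document
    affinity: {cls: {neighbor_cls: prob}} assumed class-to-class link table
    known:    {doc: cls} documents whose class is already known
    """
    # Start from the text-only guess, then clamp the pre-classified documents.
    labels = {doc: max(probs, key=probs.get) for doc, probs in initial.items()}
    labels.update(known)

    for _ in range(max_iters):
        updated = dict(labels)
        for doc, probs in initial.items():
            if doc in known:
                continue
            scores = {}
            for cls, p_text in probs.items():
                # Combine text evidence with the neighbors' current classes.
                score = math.log(p_text)
                for nb in links.get(doc, []):
                    score += math.log(affinity[cls][labels[nb]])
                scores[cls] = score
            updated[doc] = max(scores, key=scores.get)
        if updated == labels:  # stopping criterion: no label changed
            break
        labels = updated
    return labels
```

In this sketch a document whose text alone slightly favors one class can still be moved to another class when most of its neighbors belong to that other class.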
51 Claims
-
1. A method of classifying a new document containing citations to and from other documents, comprising the steps of:
-
identifying documents within a multi-radius neighborhood of the new document, wherein the multi-radius neighborhood initially comprises a predetermined number of citations, irrespective of location, to and from the new document;
for each document and each class, determining an initial probability that indicates the probability that the document fits a particular class;
performing iterative relaxation to identify a class for each document using the initial probabilities, wherein contextual and link structures of an entire collection of document text and classes of all multi-radius neighbors are used; and
selecting a class into which the new document is to be classified based on the initial probabilities and identified classes, wherein the class of a document determines the text in the document as well as the document's propensity to link to documents from a set of related classes.
Dependent claims: 2-17
identifying classes for the documents within the radius of influence that are not pre-classified using a text-based classifier;
classifying the new document using the classes of the documents within the radius of influence and the initial probabilities; and
iteratively repeating the steps of identifying and classifying until a stopping criteria is achieved.
-
-
11. The method of claim 10, wherein, after each step, a class is assigned to each document within the radius of influence.
-
12. The method of claim 10, wherein, after each iteration, each document is assigned a probability vector containing estimated probabilities of that document being in a particular class, wherein the probability vector is assigned using the known classes of documents and the initial probabilities.
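The per-iteration probability vectors of claim 12 might be maintained as in the sketch below. It is an assumption-laden illustration: documents with known classes are clamped to a one-hot vector, every other document's vector is recomputed from its initial (text-only) probabilities and its neighbors' current vectors, and a hypothetical `affinity` table with strictly positive entries supplies the class-to-class link likelihoods. The function name and the tolerance-based stopping rule are also assumptions.

```python
def relax_probabilities(initial, links, affinity, known, tol=1e-6, max_iters=50):
    """Return {doc: {cls: prob}} after iterative relaxation with soft labels."""

    def clamp(doc, vec):
        # Known documents get probability 1 for their class, 0 elsewhere.
        if doc in known:
            return {c: 1.0 if c == known[doc] else 0.0 for c in vec}
        return dict(vec)

    current = {doc: clamp(doc, vec) for doc, vec in initial.items()}

    for _ in range(max_iters):
        updated = {}
        for doc, prior in initial.items():
            if doc in known:
                updated[doc] = current[doc]
                continue
            scores = {}
            for cls, p_text in prior.items():
                score = p_text
                for nb in links.get(doc, []):
                    # Expected link likelihood under the neighbor's current vector.
                    score *= sum(
                        p * affinity[cls][nb_cls]
                        for nb_cls, p in current[nb].items()
                    )
                scores[cls] = score
            total = sum(scores.values())
            updated[doc] = {c: s / total for c, s in scores.items()}

        change = max(
            abs(updated[d][c] - current[d][c]) for d in updated for c in updated[d]
        )
        current = updated
        if change < tol:  # stopping criterion: vectors have stabilized
            break
    return current
```

Keeping a full vector per document, rather than a single class, lets uncertain neighbors contribute proportionally instead of casting an all-or-nothing vote.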
-
13. The method of claim 1, wherein the step of performing iterative relaxation further comprises the step of using information from a radius of influence of at least two citations from the new document.
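A radius-two neighborhood of the kind claim 13 refers to could be collected with a breadth-first walk over citations in both directions. The sketch below is one possible reading: `out_links` maps each document to the documents it cites, in-links are derived by inverting that map, and the optional `max_docs` cap is a hypothetical stand-in for the "predetermined number of citations" in claim 1.

```python
from collections import deque


def invert(out_links):
    """Build {doc: [docs citing it]} from {doc: [docs it cites]}."""
    in_links = {}
    for src, targets in out_links.items():
        for target in targets:
            in_links.setdefault(target, []).append(src)
    return in_links


def neighborhood(doc, out_links, radius=2, max_docs=None):
    """Return {neighbor: distance} for documents within `radius` citations."""
    in_links = invert(out_links)
    seen = {doc: 0}
    queue = deque([doc])
    while queue:
        cur = queue.popleft()
        if seen[cur] == radius:
            continue  # do not expand beyond the radius of influence
        for nb in out_links.get(cur, []) + in_links.get(cur, []):
            if nb in seen:
                continue
            if max_docs is not None and len(seen) - 1 >= max_docs:
                return seen  # assumed cap on neighborhood size reached
            seen[nb] = seen[cur] + 1
            queue.append(nb)
    return seen
```

The returned map includes the starting document at distance 0, so callers that only want neighbors can drop that entry.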
-
14. The method of claim 13, wherein the step of performing iterative relaxation further comprises using bridges between documents to classify documents.
-
15. The method of claim 14, further comprising:
-
for each bridge of the new document, identifying documents linked to the bridge whose classes are known; and
assigning the new document to one of the known classes based on the number of occurrences of that class.
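The two steps of claim 15 amount to a vote over known classes. The sketch below assumes, since the claims shown here do not define the term, that a "bridge" is a document that cites the new document along with other documents; the function name and data layout are illustrative.

```python
from collections import Counter


def classify_by_bridges(new_doc, out_links, known):
    """Pick the most frequent known class among documents co-cited with new_doc.

    out_links: {doc: [docs it cites]}
    known:     {doc: cls} documents whose classes are known
    """
    votes = Counter()
    for bridge, targets in out_links.items():
        if new_doc not in targets:
            continue  # not a bridge of the new document (assumed definition)
        for target in targets:
            if target != new_doc and target in known:
                votes[known[target]] += 1
    if not votes:
        return None  # no bridge leads to a document of known class
    return votes.most_common(1)[0][0]
```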
-
-
16. The method of claim 15, wherein the documents linked to the bridge are IO-bridges to the new document, further comprising:
-
determining class paths for the IO-bridged documents;
augmenting a document using the prefixes of the determined class paths; and
submitting the augmented document to a text-based classifier to determine the class of the new document.
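The augmentation step of claim 16 can be illustrated as below. The sketch assumes that a class path is a slash-separated path in a topic hierarchy (for example `Arts/Music/Jazz`) and that each prefix is added to the document as a synthetic token; the `CLASS:` marker and both function names are hypothetical. The augmented token list would then be handed to whatever text-based classifier is in use.

```python
def class_path_prefixes(path, sep="/"):
    """Return every prefix of a class path, shortest first."""
    parts = path.split(sep)
    return [sep.join(parts[:i]) for i in range(1, len(parts) + 1)]


def augment(tokens, neighbor_paths, marker="CLASS:"):
    """Append marked class-path prefixes of bridged documents to the tokens."""
    extra = []
    for path in neighbor_paths:
        extra.extend(marker + prefix for prefix in class_path_prefixes(path))
    return list(tokens) + extra
```

Using prefixes rather than only the full path lets a neighbor in a nearby subclass still contribute evidence for the shared parent classes.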
-
-
17. The method of claim 14, wherein the bridge is not pure, further comprising the step of segmenting the bridge into segments, each of which is linked to one or more documents of a similar class.
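One way to picture the segmentation in claim 17 is to split a bridge's ordered list of links into contiguous runs whose known classes agree. This is only a sketch under assumptions not stated in the claim: links are taken in the order they appear in the bridge, "similar class" is simplified to "same class", and links to documents of unknown class are skipped.

```python
from itertools import groupby


def segment_bridge(ordered_links, known):
    """Split a bridge's links into contiguous segments of a single class.

    ordered_links: documents linked from the bridge, in order of appearance
    known:         {doc: cls} documents whose classes are known
    Returns a list of (cls, [docs]) segments.
    """
    labelled = [(doc, known[doc]) for doc in ordered_links if doc in known]
    return [
        (cls, [doc for doc, _ in run])
        for cls, run in groupby(labelled, key=lambda pair: pair[1])
    ]
```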
-
18. An apparatus for classifying a new document containing citations to and from other documents, comprising:
-
a computer having a data storage device connected thereto; and
one or more computer programs, performed by the computer, for identifying documents within a multi-radius neighborhood of the new document, wherein the multi-radius neighborhood initially comprises a predetermined number of citations, irrespective of location, to and from the new document, for each document and each class, determining an initial probability that indicates the probability that the document fits a particular class, performing iterative relaxation to identify a class for each document using the initial probabilities, wherein contextual and link structures of an entire collection of document text and classes of all multi-radius neighbors are used, and selecting a class into which the new document is to be classified based on the initial probabilities and identified classes, wherein the class of a document determines the text in the document as well as the document's propensity to link to documents from a set of related classes.
Dependent claims: 19-34
means for identifying classes for the documents within the radius of influence that are not pre-classified using a text-based classifier;
means for classifying the new document using the classes of the documents within the radius of influence and the initial probabilities; and
means for iteratively repeating the steps of identifying and classifying until a stopping criteria is achieved.
-
-
28. The apparatus of claim 27, wherein, after each step, a class is assigned to each document within the radius of influence.
-
29. The apparatus of claim 27, wherein, after each iteration, each document is assigned a probability vector containing estimated probabilities of that document being in a particular class, wherein the probability vector is assigned using the known classes of documents and the initial probabilities.
-
30. The apparatus of claim 22, wherein the means for performing iterative relaxation further comprises the means for using information from a radius of influence of at least two from the new document.
-
31. The apparatus of claim 30, wherein the means for performing iterative relaxation further comprises using bridges between documents to classify documents.
-
32. The apparatus of claim 31, further comprising:
-
for each bridge of the new document, means for identifying documents linked to the bridge whose classes are known; and
means for assigning the new document to one of the known classes based on the number of occurrences of that class.
-
-
33. The apparatus of claim 31, wherein the documents linked to the bridge are IO-bridges to the new document, further comprising:
-
means for determining class paths for the IO-bridged documents;
means for augmenting a document using the prefixes of the determined class paths; and
means for submitting the augmented document to a text-based classifier to determine the class of the new document.
-
-
34. The apparatus of claim 31, wherein the bridge is not pure, further comprising the means for segmenting the bridge into segments, each of which is linked to one or more documents of a similar class.
-
35. An article of manufacture comprising a computer program carrier readable by a computer and embodying one or more instructions executable by the computer to perform method steps for classifying a new document containing citations to and from other documents, the method comprising the steps of:
-
identifying documents within a multi-radius neighborhood of the new document, wherein the multi-radius neighborhood initially comprises a predetermined number of citations, irrespective of location, to and from the new document;
for each document and each class, determining an initial probability that indicates the probability that the document fits a particular class;
performing iterative relaxation to identify a class for each document using the initial probabilities, wherein contextual and link structures of an entire collection of document text and classes of all multi-radius neighbors are used; and
selecting a class into which the new document is to be classified based on the initial probabilities and identified classes, wherein the class of a document determines the text in the document as well as the document's propensity to link to documents from a set of related classes.
Dependent claims: 36-51
identifying classes for the documents within the radius of influence that are not pre-classified using a text-based classifier;
classifying the new document using the classes of the documents within the radius of influence and the initial probabilities; and
iteratively repeating the steps of identifying and classifying until a stopping criteria is achieved.
-
-
45. The method of claim 44, wherein, after each step, a class is assigned to each document within the radius of influence.
-
46. The method of claim 44, wherein, after each iteration, each document is assigned a probability vector containing estimated probabilities of that document being in a particular class, wherein the probability vector is assigned using the known classes of documents and the initial probabilities.
-
47. The method of claim 39, wherein the step of performing iterative relaxation further comprises the step of using information from a radius of influence of at least two from the new document.
-
48. The method of claim 47, wherein the step of performing iterative relaxation further comprises using bridges between documents to classify documents.
-
49. The method of claim 48, further comprising:
-
for each bridge of the new document, identifying documents linked to the bridge whose classes are known; and
assigning the new document to one of the known classes based on the number of occurrences of that class.
-
-
50. The method of claim 49, wherein the documents linked to the bridge are IO-bridges to the new document, further comprising:
-
determining class paths for the IO-bridged documents;
augmenting a document using the prefixes of the determined class paths; and
submitting the augmented document to a text-based classifier to determine the class of the new document.
-
-
51. The method of claim 48, wherein the bridge is not pure, further comprising the step of segmenting the bridge into segments, each of which is linked to one or more documents of a similar class.
Specification