Enhanced hypertext categorization using hyperlinks

US 6,389,436 B1
Filed: 12/15/1997
Issued: 05/14/2002
Est. Priority Date: 12/15/1997
Status: Expired due to Fees

Abstract

A method, apparatus, and article of manufacture for a computer implemented hypertext classifier. A new document containing citations to and from other documents is classified. Initially, documents within a neighborhood of the new document are identified. For each document and each class, an initial probability is determined that indicates the probability that the document fits a particular class. Next, iterative relaxation is performed to identify a class for each document using the initial probabilities. A class is selected into which the new document is to be classified based on the initial probabilities and identified classes.
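
The steps in the abstract (initial per-class probabilities, iterative relaxation over linked documents, then selection of the most probable class) can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the `affinity` matrix of class-to-class link weights, and the rule of multiplying neighbor contributions are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def relax(initial: Dict[str, List[float]],
          neighbors: Dict[str, List[str]],
          affinity: List[List[float]],
          known: Optional[Dict[str, int]] = None,
          iterations: int = 10) -> Dict[str, List[float]]:
    """Iteratively refine class probabilities using linked documents.

    initial   -- doc -> text-only class probability vector
    neighbors -- doc -> linked docs (each must also appear in `initial`)
    affinity  -- assumed weights: affinity[c][k] scores a class-c document
                 being linked with a class-k document
    known     -- doc -> fixed class index for pre-classified documents
    """
    known = known or {}
    n_classes = len(affinity)

    def one_hot(c: int) -> List[float]:
        return [1.0 if k == c else 0.0 for k in range(n_classes)]

    # Pre-classified documents start (and stay) at their known class.
    probs = {d: one_hot(known[d]) if d in known else list(p)
             for d, p in initial.items()}

    for _ in range(iterations):
        updated = {}
        for doc, prior in initial.items():
            if doc in known:
                updated[doc] = probs[doc]
                continue
            scores = []
            for c in range(n_classes):
                score = prior[c]
                # Assumed combination rule: multiply in each neighbor's
                # expected compatibility with class c.
                for nb in neighbors.get(doc, []):
                    score *= sum(affinity[c][k] * probs[nb][k]
                                 for k in range(n_classes))
                scores.append(score)
            total = sum(scores)
            updated[doc] = ([s / total for s in scores]
                            if total > 0 else list(prior))
        probs = updated
    return probs


def select_class(vector: List[float]) -> int:
    """Return the index of the largest component of a probability vector."""
    return max(range(len(vector)), key=vector.__getitem__)
```

In this sketch a document whose text alone is ambiguous is pulled toward the class of its pre-classified neighbors, and `select_class` picks the largest component of the resulting vector, as in dependent claim 4.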

51 Claims

1. A method of classifying a new document containing citations to and from other documents, comprising the steps of:
- identifying documents within a multi-radius neighborhood of the new document, wherein the multi-radius neighborhood initially comprises a predetermined number of citations, irrespective of location, to and from the new document;
  
  for each document and each class, determining an initial probability that indicates the probability that the document fits a particular class;
  
  performing iterative relaxation to identify a class for each document using the initial probabilities, wherein contextual and link structures of an entire collection of document text and classes of all multi-radius neighbors are used; and
  
  selecting a class into which the new document is to be classified based on the initial probabilities and identified classes, wherein the class of a document determines the text in the document as well as the document's propensity to link to documents from a set of related classes.
2. The method of claim 1, wherein the step of identifying documents further comprises the step of growing a neighborhood around the new document to a selected radius from the new document.
3. The method of claim 1, wherein the step of determining an initial probability further comprises the step of determining a probability vector, wherein each probability vector contains a component corresponding to a class into which a document can be classified.
4. The method of claim 3, wherein the step of selecting a class further comprises the step of selecting the class corresponding to the component of the probability vector having a largest value.
5. The method of claim 1, wherein the step of performing iterative relaxation further comprises the step of using information from a selected radius of influence from the new document.
6. The method of claim 5, wherein the step of performing iterative relaxation further comprises using text from each document.
7. The method of claim 6, wherein the step of using text further comprises the step of using text and classes of documents within the radius of influence to classify the new document.
8. The method of claim 7, wherein the documents within the radius of influence are pre-classified.
9. The method of claim 7, wherein one or more of the documents within the radius of influence are not pre-classified.
10. The method of claim 9, further comprising:
11. The method of claim 10, wherein, after each step, a class is assigned to each document within the radius of influence.
12. The method of claim 10, wherein, after each iteration, each document is assigned a probability vector containing estimated probabilities of that document being in a particular class, wherein the probability vector is assigned using the known classes of documents and the initial probabilities.
13. The method of claim 1, wherein the step of performing iterative relaxation further comprises the step of using information from a radius of influence of at least two citations from the new document.
14. The method of claim 13, wherein the step of performing iterative relaxation further comprises using bridges between documents to classify documents.
15. The method of claim 14, further comprising:
- for each bridge of the new document, identifying documents linked to the bridge whose classes are known; and
  
  assigning the new document to one of the known classes based on the number of occurrences of that class.
16. The method of claim 15, wherein the documents linked to the bridge are IO-bridges to the new document, further comprising:
- determining class paths for the IO-bridged documents;
  
  augmenting a document using the prefixes of the determined class paths; and
  
  submitting the augmented document to a text-based classifier to determine the class of the new document.
17. The method of claim 14, wherein the bridge is not pure, further comprising the step of segmenting the bridge into segments, each of which is linked to one or more documents of a similar class.
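
The bridge-based voting of claims 14 and 15 can be sketched as follows. The excerpt does not define a bridge, so this sketch assumes a bridge is any document that links to the new document, and it counts the known classes of the other documents that bridge links to. The function and parameter names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Optional


def classify_via_bridges(new_doc: str,
                         out_links: Dict[str, List[str]],
                         known_classes: Dict[str, Hashable]) -> Optional[Hashable]:
    """Assign `new_doc` the most frequent known class among documents
    that share a bridge with it.

    out_links     -- doc -> documents it links to
    known_classes -- doc -> class label, for pre-classified documents only
    Returns None when no co-linked document has a known class.
    """
    counts: Counter = Counter()
    for bridge, targets in out_links.items():
        if new_doc not in targets:
            continue  # Assumption: only pages linking to new_doc are bridges.
        for target in targets:
            if target != new_doc and target in known_classes:
                counts[known_classes[target]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

A plain majority vote ignores bridges that mix several topics; claim 17 addresses that case by segmenting an impure bridge, which this sketch does not attempt.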

18. An apparatus for classifying a new document containing citations to and from other documents, comprising:
- a computer having a data storage device connected thereto; and
  
  one or more computer programs, performed by the computer, for identifying documents within a multi-radius neighborhood of the new document, wherein the multi-radius neighborhood initially comprises a predetermined number of citations, irrespective of location, to and from the new document, for each document and each class, determining an initial probability that indicates the probability that the document fits a particular class, performing iterative relaxation to identify a class for each document using the initial probabilities, wherein contextual and link structures of an entire collection of document text and classes of all multi-radius neighbors are used, and selecting a class into which the new document is to be classified based on the initial probabilities and identified classes, wherein the class of a document determines the text in the document as well as the document's propensity to link to documents from a set of related classes.
19. The apparatus of claim 18, wherein the step of identifying documents further comprises means for growing a neighborhood around the new document to a selected radius from the new document.
20. The apparatus of claim 18, wherein the means for determining an initial probability further comprises the means for determining a probability vector, wherein each probability vector contains a component corresponding to a class into which a document can be classified.
21. The apparatus of claim 20, wherein the means for selecting a class further comprises the means for selecting the class corresponding to the component of the probability vector having a largest value.
22. The apparatus of claim 18, wherein the means for performing iterative relaxation further comprises the means for using information from a selected radius of influence from the new document.
23. The apparatus of claim 22, wherein the means for performing iterative relaxation further comprises using text from each document.
24. The apparatus of claim 23, wherein the means for using text further comprises the means for using text and classes of documents within the radius of influence to classify the new document.
25. The apparatus of claim 24, wherein the documents within the radius of influence are pre-classified.
26. The apparatus of claim 24, wherein one or more of the documents within the radius of influence are not pre-classified.
27. The apparatus of claim 26, further comprising:
28. The apparatus of claim 27, wherein, after each step, a class is assigned to each document within the radius of influence.
29. The apparatus of claim 27, wherein, after each iteration, each document is assigned a probability vector containing estimated probabilities of that document being in a particular class, wherein the probability vector is assigned using the known classes of documents and the initial probabilities.
30. The apparatus of claim 22, wherein the means for performing iterative relaxation further comprises the means for using information from a radius of influence of at least two from the new document.
31. The apparatus of claim 30, wherein the means for performing iterative relaxation further comprises using bridges between documents to classify documents.
32. The apparatus of claim 31, further comprising:
- for each bridge of the new document, means for identifying documents linked to the bridge whose classes are known; and
  
  means for assigning the new document to one of the known classes based on the number of occurrences of that class.
33. The apparatus of claim 31, wherein the documents linked to the bridge are IO-bridges to the new document, further comprising:
- means for determining class paths for the IO-bridged documents;
  
  means for augmenting a document using the prefixes of the determined class paths; and
  
  means for submitting the augmented document to a text-based classifier to determine the class of the new document.
34. The apparatus of claim 31, wherein the bridge is not pure, further comprising the means for segmenting the bridge into segments, each of which is linked to one or more documents of a similar class.

35. An article of manufacture comprising a computer program carrier readable by a computer and embodying one or more instructions executable by the computer to perform method steps for classifying a new document containing citations to and from other documents, the method comprising the steps of:
- identifying documents within a multi-radius neighborhood of the new document, wherein the multi-radius neighborhood initially comprises a predetermined number of citations, irrespective of location, to and from the new document;
  
  for each document and each class, determining an initial probability that indicates the probability that the document fits a particular class;
  
  performing iterative relaxation to identify a class for each document using the initial probabilities, wherein contextual and link structures of an entire collection of document text and classes of all multi-radius neighbors are used; and
  
  selecting a class into which the new document is to be classified based on the initial probabilities and identified classes, wherein the class of a document determines the text in the document as well as the document's propensity to link to documents from a set of related classes.
36. The method of claim 35, wherein the step of identifying documents further comprises the step of growing a neighborhood around the new document to a selected radius from the new document.
37. The method of claim 35, wherein the step of determining an initial probability further comprises the step of determining a probability vector, wherein each probability vector contains a component corresponding to a class into which a document can be classified.
38. The method of claim 37, wherein the step of selecting a class further comprises the step of selecting the class corresponding to the component of the probability vector having a largest value.
39. The method of claim 35, wherein the step of performing iterative relaxation further comprises the step of using information from a selected radius of influence from the new document.
40. The method of claim 38, wherein the step of performing iterative relaxation further comprises using text from each document.
41. The method of claim 40, wherein the step of using text further comprises the step of using text and classes of documents within the radius of influence to classify the new document.
42. The method of claim 41, wherein the documents within the radius of influence are pre-classified.
43. The method of claim 41, wherein one or more of the documents within the radius of influence are not pre-classified.
44. The method of claim 43, further comprising:
45. The method of claim 44, wherein, after each step, a class is assigned to each document within the radius of influence.
46. The method of claim 44, wherein, after each iteration, each document is assigned a probability vector containing estimated probabilities of that document being in a particular class, wherein the probability vector is assigned using the known classes of documents and the initial probabilities.
47. The method of claim 39, wherein the step of performing iterative relaxation further comprises the step of using information from a radius of influence of at least two from the new document.
48. The method of claim 47, wherein the step of performing iterative relaxation further comprises using bridges between documents to classify documents.
49. The method of claim 48, further comprising:
- for each bridge of the new document, identifying documents linked to the bridge whose classes are known; and
  
  assigning the new document to one of the known classes based on the number of occurrences of that class.
50. The method of claim 49, wherein the documents linked to the bridge are IO-bridges to the new document, further comprising:
- determining class paths for the IO-bridged documents;
  
  augmenting a document using the prefixes of the determined class paths; and
  
  submitting the augmented document to a text-based classifier to determine the class of the new document.
51. The method of claim 48, wherein the bridge is not pure, further comprising the step of segmenting the bridge into segments, each of which is linked to one or more documents of a similar class.
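
Growing a neighborhood around the new document to a selected radius (claims 2, 19 and 36) can be read as a breadth-first traversal over citations in both directions. The sketch below rests on that assumption; `grow_neighborhood` and the `links` mapping are illustrative names that do not appear in the patent text.

```python
from collections import deque
from typing import Dict, List, Set


def grow_neighborhood(start: str,
                      links: Dict[str, List[str]],
                      radius: int) -> Dict[str, int]:
    """Return every document within `radius` citations of `start`,
    mapped to its link distance from `start`.

    links -- doc -> documents it cites; citations are followed in both
             directions ("to and from" the new document).
    """
    # Build an undirected adjacency map from the directed citation lists.
    adjacency: Dict[str, Set[str]] = {}
    for source, targets in links.items():
        for target in targets:
            adjacency.setdefault(source, set()).add(target)
            adjacency.setdefault(target, set()).add(source)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        doc = queue.popleft()
        if distance[doc] == radius:
            continue  # Do not expand past the selected radius.
        for neighbor in adjacency.get(doc, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[doc] + 1
                queue.append(neighbor)
    return distance
```

Returning distances rather than a bare set keeps the radius of each neighbor available, which a caller could use to apply a narrower radius of influence later.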

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Indyk, Piotr; Chakrabarti, Soumen; Dom, Byron Edward
Primary Examiner(s): Hong, Stephen S.
Assistant Examiner(s): HUYNH, CONG LAC T
Application Number: US08/990,292
Time in Patent Office: 1,611 Days
Field of Search: 707/513, 707/1-7, 706/20
US Class Current: 715/229
CPC Class Codes:
G06F 16/355   Class or cluster creation o...
G06F 16/382   using citations hypermedia ...
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99934   Query formulation, input pr...
Y10S 707/99936   Pattern matching access
