System for predicting documents relevant to focus documents by spreading activation through network representations of a linked collection of documents

US 5,835,905 A
Filed: 04/09/1997
Issued: 11/10/1998
Est. Priority Date: 04/09/1997
Status: Expired due to Term

4 Assignments

0 Petitions

Abstract

A system for extracting and analyzing information from a collection of linked documents at a locality to enable categorization of documents and prediction of documents relevant to a focus document. The system obtains and analyzes topology, usage and path information for a collection at a locality, e.g. a web locality on the World Wide Web. For categorization, document meta information is represented as document vectors. Predefined criteria are applied to the document vectors to create lists of "similar" types of documents. For relevance prediction, networks representing topology, usage path and text similarity amongst the documents in the collection are created. A spreading activation technique is applied to the networks starting at a focus document to predict the documents relevant to the focus document. Using category and relevance prediction information, tools can be built to enable a user to more efficiently traverse the collection of linked documents.
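
The categorization step described in the abstract (document vectors plus predefined criteria) might be sketched as follows. This is only an illustration under stated assumptions: the record does not say which meta information goes into a document vector or what the criteria are, so the `DocVector` fields, the category names, and every numeric cutoff below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocVector:
    """Hypothetical meta-information vector for one document."""
    url: str
    inlinks: int
    outlinks: int
    size_bytes: int
    accesses: int


# Hypothetical predefined criteria: one predicate per category.
# The names and cutoffs are illustrative, not taken from the patent.
CRITERIA = {
    "index": lambda d: d.outlinks >= 10 and d.size_bytes < 5_000,
    "content": lambda d: d.outlinks < 5 and d.size_bytes >= 5_000,
    "popular": lambda d: d.accesses >= 100,
}


def categorize(docs):
    """Return {category: [urls]} listing documents that meet each criterion."""
    lists = {name: [] for name in CRITERIA}
    for d in docs:
        for name, test in CRITERIA.items():
            if test(d):
                lists[name].append(d.url)
    return lists
```

A document may appear in more than one list, since each criterion is checked independently.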

317 Citations

13 Claims

1. A system for identifying documents relevant to a focus document in a linked collection of documents, said system comprising:
- means for obtaining raw data for said linked collection of documents, said raw data including usage data, topology data and content data;
  
  means for creating usage, topology and text similarity maps for said linked collection of documents from said raw data; and
  
  means for predicting a relevant set of documents for a subset of said linked collection of documents using one or more of said usage, topology and text similarity maps.
  - 2. The system as recited in claim 1 wherein said means for obtaining raw data for said linked collection of documents is further comprised of a first agent for traversing said linked collection of documents to obtain topology information and a second agent for obtaining usage statistics for said linked collection of documents.
  - 3. The system as recited in claim 2 wherein said means for creating usage, topology and text similarity maps is comprised of:
    - means for creating a usage matrix wherein each row and column correspond to a document in said linked collection of documents and each intersection represents the number of times a user traversed from the document indicated by the row to the document indicated by said column;
      
      means for creating a topology matrix wherein each row and column correspond to a document in said linked collection of documents and each intersection indicates if there is a link between said document indicated by the row and the document indicated by the column;
      
      means for creating a text similarity matrix wherein each row and column correspond to a document in said linked collection of documents and each intersection indicates a measure of similarity between said document indicated by the row and the document indicated by the column.
  - 4. The system as recited in claim 3 wherein said means for predicting a relevant set of documents for a subset of said linked collection of documents is further comprised of means for spreading activation through one or more of said usage, topology and text similarity maps and means for identifying a predetermined number of corresponding documents having the most activation as said relevant set of documents.
  - 5. The system as recited in claim 4 wherein said means for predicting a relevant set of documents for a subset of said linked collection of documents is further comprised of means for spreading activation through one or more of said usage, topology and text similarity maps and means for identifying a predetermined number of corresponding documents having activation above a predetermined threshold as said relevant set of documents.
  - 6. The system as recited in claim 1 wherein said linked collection of documents is a Web locality.
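
The three matrices recited in claim 3 can be illustrated with a minimal sketch. The function names and the input shapes (lists of `(source, target)` pairs, a dict of document texts) are assumptions made for the example. The claim only requires "a measure of similarity"; cosine similarity over word counts is one plausible choice used here, not necessarily the one the patent describes.

```python
import math
import re
from collections import Counter


def topology_matrix(docs, links):
    """m[i][j] = 1 if document i contains a link to document j, else 0."""
    idx = {d: i for i, d in enumerate(docs)}
    m = [[0] * len(docs) for _ in docs]
    for src, dst in links:
        m[idx[src]][idx[dst]] = 1
    return m


def usage_matrix(docs, traversals):
    """m[i][j] = number of times a user went from document i to document j."""
    idx = {d: i for i, d in enumerate(docs)}
    m = [[0] * len(docs) for _ in docs]
    for src, dst in traversals:
        m[idx[src]][idx[dst]] += 1
    return m


def text_similarity_matrix(docs, texts):
    """m[i][j] = cosine similarity of word counts (assumed measure)."""
    counts = [Counter(re.findall(r"[a-z0-9]+", texts[d].lower())) for d in docs]
    norms = [math.sqrt(sum(v * v for v in c.values())) for c in counts]
    n = len(docs)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Diagonal left at 0.0: self-similarity is not needed here.
            if i != j and norms[i] and norms[j]:
                dot = sum(v * counts[j][t] for t, v in counts[i].items())
                m[i][j] = dot / (norms[i] * norms[j])
    return m
```

Each matrix uses the same document ordering, so row `i` and column `i` refer to the same document in all three.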

7. A method for identifying documents relevant to a focus document in a linked collection of documents, said method comprising the steps of:
- a) obtaining raw data for said linked collection of documents, said raw data including topology information and usage information;
  
  b) generating text similarity information between documents in said linked collection of documents;
  
  c) creating a plurality of characteristic maps from said raw data and text similarity information, each of said plurality of characteristic maps indicating relationships between documents in said linked collection of documents;
  
  d) selecting one or more focus documents from said linked collection of documents;
  
  e) spreading activation starting at said one or more focus documents through one or more of said plurality of characteristic maps until activation settles into an asymptotic pattern; and
  
  f) identifying relevant documents as those meeting a predetermined activation criteria.
  - 8. The method as recited in claim 7 wherein said step of obtaining raw data for said linked collection of documents is further comprised of the steps of:
    - a1) retrieving a web page;
      
      a2) storing location information for said web page;
      
      a3) parsing said web page to identify links to other web pages; and
      
      a4) repeating steps a1)-a3) for each of said other web pages.
  - 9. The method as recited in claim 8 wherein said step of obtaining raw data for said linked collection of documents is further comprised of the step of obtaining access data from said linked collection, said access data indicating when and from where documents in said linked collection have been accessed.
  - 10. The method as recited in claim 9 wherein said step of generating text similarity information is further comprised of the steps of:
    - b1) tokenizing said documents in said linked collection of documents, said tokens representing a word; and
      
      b2) comparing the similarity of documents by comparing the tokens contained in a document to obtain similarity measure information.
  - 11. The method as recited in claim 10 wherein said step of creating a plurality of characteristic maps from said raw data and text similarity information is further comprised of the steps of:
    - c1) for each characteristic creating a matrix where each row and column correspond to a document and the intersection of a row and column indicates a characteristic between documents;
      
      c2) generating topology characteristic information and usage path characteristic information from said raw data, said topology information for indicating if a document contains a link to another document and said usage path information indicating the number of times a document was accessed from another document; and
      
      c3) inserting the characteristic information between documents at each intersection point in the corresponding matrix.
  - 12. The method as recited in claim 7 wherein for said step of identifying relevant documents as those meeting a predetermined activation criteria, said predetermined criteria is an activation threshold value.
  - 13. The method as recited in claim 7 wherein for said step of identifying relevant documents as those meeting a predetermined activation criteria, said predetermined criteria is a predetermined number of most activated documents.
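
Steps e) and f) of claim 7, together with the two selection criteria of claims 12 and 13, can be sketched as an iterative update that runs until the activation values stop changing. The update rule below is an assumption chosen for simplicity: rows are normalized, and each pass adds a constant source at the focus documents plus a damped share (`alpha`) of the activation flowing in along the map. The claims do not specify the exact formula, and the parameter names and defaults are hypothetical.

```python
def spread_activation(matrix, focus, alpha=0.5, tol=1e-9, max_iters=1000):
    """Spread activation from the focus documents until the pattern settles."""
    n = len(matrix)
    # Row-normalize so each document splits its activation among its targets.
    rows = []
    for row in matrix:
        total = sum(row)
        rows.append([w / total if total else 0.0 for w in row])
    source = [1.0 if i in focus else 0.0 for i in range(n)]
    act = source[:]
    for _ in range(max_iters):
        new = [
            source[j] + alpha * sum(act[i] * rows[i][j] for i in range(n))
            for j in range(n)
        ]
        # Stop once no document's activation changes by more than tol.
        if max(abs(x - y) for x, y in zip(new, act)) < tol:
            return new
        act = new
    return act


def most_activated(act, count, exclude=()):
    """Claim 13 style criterion: the `count` most activated documents."""
    ranked = sorted(
        (i for i in range(len(act)) if i not in exclude),
        key=lambda i: act[i],
        reverse=True,
    )
    return ranked[:count]


def above_threshold(act, threshold, exclude=()):
    """Claim 12 style criterion: documents at or above a threshold."""
    return [i for i, a in enumerate(act) if a >= threshold and i not in exclude]
```

With `alpha` below 1 and normalized rows, each pass shrinks the remaining change, so the loop settles rather than growing without bound. Passing the focus documents as `exclude` keeps them out of their own result set.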

Current Assignee: Xerox Corporation (Xerox Holdings Corp.)
Original Assignee: Xerox Corporation (Xerox Holdings Corp.)
Inventors: Pirolli, Peter L.; Pitkow, James E.; Rao, Ramana B.
Primary Examiner(s): Lintz, Paul R.
Application Number: US08/831,807
Time in Patent Office: 580 Days
Field of Search: 707/3, 707/102, 707/5
US Class Current: 1/1
CPC Class Codes:
G06F 16/3334   Selection or weighting of t...
G06F 16/35   Clustering; Classification
Y10S 707/99933   Query processing, i.e. sear...
Y10S 707/99935   Query augmenting and refini...
Y10S 707/99943   Generating database or data...
