Method to efficiently partition large hyperlinked databases by hyperlink structure
First Claim
1. A method for partitioning a database containing a plurality of documents into desired and undesired type documents, the plurality of documents containing text and/or links to and from other documents in the database, the method comprising the steps of:
- providing a source document of the desired type, the source document including a collection of seed documents linked to more similar documents of the desired type than to dissimilar documents of the undesired type;
providing a sink document for providing access to the database, the sink document is generic and representative of the database;
identifying a cut-set of links which is the smallest set of links such that removing them from the database completely disconnects the source document and its linked documents from the sink document and its linked documents thereby defining first and second subsets of documents, respectively; and
defining the first subset of documents as desired type documents and the remaining documents as undesired type documents.
3 Assignments
0 Petitions
Accused Products
Abstract
Method for partitioning a database containing a plurality of documents into desired and undesired type documents, the plurality of documents containing text and/or links to and from other documents in the database, including: providing a source document of the desired type, the source document including a collection of seed documents linked to more similar documents of the desired type than to dissimilar documents of the undesired type; providing a sink document for providing access to the database, the sink document being generic and representative of the database; identifying a cut-set of links which is the smallest set of links such that removing them from the database completely disconnects the source document and its linked documents from the sink document and its linked documents into first and second subsets of documents, respectively; and defining the first subset of documents as desired type documents and the remaining documents as undesired type documents.
32 Citations
27 Claims
-
1. A method for partitioning a database containing a plurality of documents into desired and undesired type documents, the plurality of documents containing text and/or links to and from other documents in the database, the method comprising the steps of:
-
providing a source document of the desired type, the source document including a collection of seed documents linked to more similar documents of the desired type than to dissimilar documents of the undesired type;
providing a sink document for providing access to the database, the sink document is generic and representative of the database;
identifying a cut-set of links which is the smallest set of links such that removing them from the database completely disconnects the source document and its linked documents from the sink document and its linked documents thereby defining first and second subsets of documents, respectively; and
defining the first subset of documents as desired type documents and the remaining documents as undesired type documents. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
mapping at least a portion of the database into a graph structure; and
applying a maximum flow algorithm to the graph structure, the subset of the graph structure which remains after application of the maximum flow algorithm being the first subset of documents.
-
-
5. The method of claim 4, wherein the mapping step assigns all documents to have a corresponding vertex and all links to have a corresponding edge.
-
6. The method of claim 1, further comprising the step of applying a further search method to the first subset of documents to further partition the first subset of documents into a subset of more desired type documents.
-
7. The method of claim 1, wherein the desired type documents are those of interest to a user and the method further comprises the step of displaying the desired type documents to the user.
-
8. The method of claim 1, wherein the desired type documents are those to be filtered from a user and the method further comprises the step of prohibiting display of the desired type documents to the user.
-
9. The method of claim 1, wherein the sink document comprises a plurality of generic documents, each of which provides access to the database.
-
10. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for partitioning a database containing a plurality of documents into desired and undesired type documents, the plurality of documents containing text and/or links to and from other documents in the database, the method comprising the steps of:
-
providing a source document of the desired type, the source document including a collection of seed documents linked to more similar documents of the desired type than to dissimilar documents of the undesired type;
providing a sink document for providing access to the database, the sink document is generic and representative of the database;
identifying a cut-set of links which is the smallest set of links such that removing them from the database completely disconnects the source document and its linked documents from the sink document and its linked documents into first and second subsets of documents, respectively; and
defining the first subset of documents as desired type documents and the remaining documents as undesired type documents. - View Dependent Claims (11, 12, 13, 14, 15, 16, 17, 18)
mapping at least a portion of the database into a graph structure; and
applying a maximum flow algorithm to the graph structure, the subset of the graph structure which remains after application of the maximum flow algorithm being the first subset of documents.
-
-
14. The method of claim 13, wherein the mapping step assigns all documents to have a corresponding vertex and all links to have a corresponding edge.
-
15. The method of claim 10, further comprising the step of applying a further search method to the first subset of documents to further partition the first subset of documents into a subset of more desired type documents.
-
16. The method of claim 10, wherein the desired type documents are those of interest to a user and the method further comprises the step of displaying the desired type documents to the user.
-
17. The method of claim 10, wherein the desired type documents are those to be filtered from a user and the method further comprises the step of prohibiting display of the desired type documents to the user.
-
18. The method of claim 10, wherein the sink document comprises a plurality of generic documents, each of which provides access to the database.
-
19. A computer program product embodied in a computer-readable medium for partitioning a database containing a plurality of documents into desired and undesired type documents, the plurality of documents containing text and/or links to and from other documents in the database, the computer program product comprising:
-
computer readable program code means for providing a source document of the desired type, the source document including a collection of seed documents linked to more similar documents of the desired type than to dissimilar documents of the undesired type;
computer readable program code means for providing a sink document for providing access to the database, the sink document is generic and representative of the database;
computer readable program code means for identifying a cut-set of links which is the smallest set of links such that removing them from the database completely disconnects the source document and its linked documents from the sink document and its linked documents into first and second subsets of documents, respectively; and
computer readable program code means for defining the first subset of documents as desired type documents and the remaining documents as undesired type documents. - View Dependent Claims (20, 21, 22, 23, 24, 25, 26, 27)
computer readable program code means for mapping at least a portion of the database into a graph structure; and
computer readable program code means for applying a maximum flow algorithm to the graph structure, the subset of the graph structure which remains after application of the maximum flow algorithm being the first subset of documents.
-
-
23. The computer program product of claim 22, wherein the computer readable program code means for mapping at least a portion of the database into a graph structure assigns all documents to have a corresponding vertex and all links to have a corresponding edge.
-
24. The computer program product of claim 19, further comprising computer readable program code means for applying a further search method to the first subset of documents to further partition the first subset of documents into a subset of more desired type documents.
-
25. The computer program product of claim 21, wherein the desired type documents are those of interest to a user and the computer program product further comprises computer readable program code means for displaying the desired type documents to the user.
-
26. The computer program product of claim 19, wherein the desired type documents are those to be filtered from a user and the computer program product further comprises computer readable program code means for prohibiting display of the desired type documents to the user.
-
27. The computer program product of claim 19, wherein the sink document comprises a plurality of generic documents, each of which provides access to the database.
Specification