Method and system for trawling the World-wide Web to identify implicitly-defined communities of web pages
First Claim
1. A method for pre-identifying implicitly defined communities including groups of pages of common interest from a collection of hyper-linked pages, wherein the communities have not been previously identified, comprising the steps of:
identifying a collection of hyperlinked pages from a plurality of sites, wherein each of the sites includes one or more hyper-linked pages;
identifying hyper-links between any two pages on a same site, wherein the same site is included within the plurality of sites;
removing the identified hyper-links between the two pages on a same site;
identifying a plurality of (i,j)-cores within the identified collection, the (i,j)-cores including a first set of hyperlinked pages and a second set of hyper-linked pages, wherein each page in the first set of hyperlinked pages points to every page in the second set of hyperlinked pages, and where i and j are the numbers of hyper-linked pages in the first set and hyper-linked pages in the second set, respectively, that appear in each of the identified (i,j)-cores; and
expanding each of the identified (i,j)-cores into a full community, the full community being a subset of the pages regarding a particular topic.
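The step of removing hyper-links between pages on a same site can be illustrated with a minimal sketch. It is not taken from the patent: the function names, the example URLs, and the choice to treat the lowercase host name as the "site" are all assumptions made for illustration.

```python
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Return the host name of a URL, used here as the 'site' (an assumption)."""
    return (urlsplit(url).hostname or "").lower()


def remove_same_site_links(links):
    """Keep only links whose source and destination lie on different sites."""
    return [(src, dst) for src, dst in links if site_of(src) != site_of(dst)]


# Hypothetical example data: one same-site link and two cross-site links.
links = [
    ("http://a.example/fans.html", "http://a.example/index.html"),
    ("http://a.example/fans.html", "http://b.example/club.html"),
    ("http://c.example/list.html", "http://b.example/club.html"),
]
cross_site = remove_same_site_links(links)
```

In this sketch only the two links that cross from one host to another survive; the link from a.example to itself is dropped before any core identification would take place.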
Abstract
A method and system for identifying groups of pages of common interest from a collection of hyper-linked pages are disclosed. A plurality of community cores are identified from the collection, where each core includes first and second sets of pages and each page in the first set points to every page in the second set. Each identified core is expanded into a full community, which is a subset of the pages regarding a particular topic. The identification of community cores is based on an analysis of the Web graph, in which the communities correspond to instances of Web subgraphs. Extraneous pages are then pruned to improve the quality of the resulting communities.
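The core definition above (every page in the first set points to every page in the second set) can be made concrete with a small brute-force sketch. This is only an illustration of the definition, not the identification procedure disclosed in the patent; the function name, the toy graph, and the requirement that the two sets be disjoint are assumptions.

```python
from itertools import combinations


def find_cores(out_links, i, j):
    """Enumerate (i, j)-cores by brute force.

    out_links maps each page to the set of pages it points to.
    Returns (first_set, second_set) tuples in which every page of the
    first set points to every page of the second set.
    """
    cores = []
    pages = sorted(out_links)
    for first in combinations(pages, i):
        # Pages pointed to by every page in the candidate first set.
        common = set.intersection(*(set(out_links[p]) for p in first))
        # Assumption: the two sets of a core do not overlap.
        common -= set(first)
        for second in combinations(sorted(common), j):
            cores.append((first, second))
    return cores


# Hypothetical toy graph: three pages that all point to c1 and c2.
graph = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2"},
    "f3": {"c1", "c2", "x"},
    "c1": set(), "c2": set(), "c3": set(), "x": set(),
}
cores = find_cores(graph, 3, 2)
```

On this toy graph the only (3,2)-core has first set f1, f2, f3 and second set c1, c2. Exhaustive enumeration of this kind grows combinatorially with the number of pages, so it serves only to show what a core is.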
80 Citations
30 Claims
1. A method for pre-identifying implicitly defined communities including groups of pages of common interest from a collection of hyper-linked pages, wherein the communities have not been previously identified, comprising the steps of:
identifying a collection of hyperlinked pages from a plurality of sites, wherein each of the sites includes one or more hyper-linked pages;
identifying hyper-links between any two pages on a same site, wherein the same site is included within the plurality of sites;
removing the identified hyper-links between the two pages on a same site;
identifying a plurality of (i,j)-cores within the identified collection, the (i,j)-cores including a first set of hyperlinked pages and a second set of hyper-linked pages, wherein each page in the first set of hyperlinked pages points to every page in the second set of hyperlinked pages, and where i and j are the numbers of hyper-linked pages in the first set and hyper-linked pages in the second set, respectively, that appear in each of the identified (i,j)-cores; and
expanding each of the identified (i,j)-cores into a full community, the full community being a subset of the pages regarding a particular topic. - View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
11. A computer program product for use with a computer system for pre-identifying implicitly defined communities including groups of pages of common interest from a collection of hyper-linked pages, wherein the communities have not been previously identified, the computer program product comprising:
a computer-readable medium;
means, provided on the computer-readable medium, for identifying a collection of hyperlinked pages from a plurality of sites, wherein each of the sites includes one or more hyper-linked pages;
means, provided on the computer-readable medium, for identifying hyper-links between any two pages on a same site, wherein the same site is included within the plurality of sites;
means, provided on the computer-readable medium, for removing the identified hyper-links between the two pages on a same site;
means, provided on the computer-readable medium, for identifying a plurality of (i,j)-cores within the identified collection, the (i,j)-cores including a first set of hyperlinked pages and a second set of hyper-linked pages, wherein each page in the first set of hyperlinked pages points to every page in the second set of hyperlinked pages, and where i and j are the numbers of hyper-linked pages in the first set and hyper-linked pages in the second set, respectively, that appear in each of the identified (i,j)-cores; and
means, provided on the computer-readable medium, for expanding each of the identified (i,j)-cores into a full community, the full community being a subset of the pages regarding a particular topic. - View Dependent Claims (12, 13, 14, 15, 16, 17, 18, 19, 20)
21. A system for pre-identifying implicitly defined communities including groups of pages of common interest from a collection of hyper-linked pages, wherein the communities have not been previously identified, comprising:
means for identifying a collection of hyperlinked pages from a plurality of sites, wherein each of the sites includes one or more hyper-linked pages;
means for identifying hyper-links between any two pages on a same site, wherein the same site is included within the plurality of sites;
means for removing the identified hyper-links between the two pages on a same site;
means for identifying a plurality of (i,j)-cores within the identified collection, the (i,j)-cores including a first set of hyperlinked pages and a second set of hyper-linked pages, wherein each page in the first set of hyperlinked pages points to every page in the second set of hyperlinked pages, and where i and j are the numbers of hyper-linked pages in the first set and hyper-linked pages in the second set, respectively, that appear in each of the identified (i,j)-cores; and
means for expanding each of the identified (i,j)-cores into a full community, the full community being a subset of the pages regarding a particular topic. - View Dependent Claims (22, 23, 24, 25, 26, 27, 28, 29, 30)
Specification