XML: finding authoritative pages for mining communities based on page structure criteria
Abstract
A method of determining well-formed web pages which are authorities on a given topic utilizing link analysis. A root set of pages is first obtained by taking a given number of the highest ranked pages returned from a text-based searching and ranking system. Each page within the set is evaluated and given a structure score which reflects how well-formed the page is. The structure score is determined by evaluating each page within the set according to a set of parameters which relate to well-formed pages. For each parameter, the page is assigned a parameter score. These parameter scores are then weighted and summed to obtain the page's structure score. Each page within the set also has corresponding hub and authority weights which are updated and maintained to determine the strongest authorities. The initial hub and authority weights of each page are set to the corresponding structure score of the page. An iterative algorithm is then utilized to determine the strongest authorities. For each round of the algorithm, the authority weight of a page is updated by summing the hub weights of each page pointing to the page, while the hub weight of a page is updated by summing the authority weights of each page to which the page points. After a series of iterations, the pages having the highest authority weights are identified as the strongest authorities, with the best structure, on the query topic.
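The iterative procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `find_authorities`, the dictionary-based link graph, and the default values for `rounds` and `top_k` are all assumptions made for the example. The abstract also does not say whether the hub update uses the freshly updated authority weights; the sketch assumes that it does.

```python
from typing import Dict, List, Set


def find_authorities(links: Dict[str, Set[str]],
                     structure: Dict[str, float],
                     rounds: int = 2,
                     top_k: int = 2) -> List[str]:
    """Return the top_k pages by authority weight (illustrative sketch only)."""
    pages = list(structure)
    # Initial hub and authority weights equal each page's structure score.
    auth = dict(structure)
    hub = dict(structure)
    for _ in range(rounds):
        # Authority weight: sum of hub weights of pages pointing to the page.
        new_auth = {p: 0.0 for p in pages}
        for source in pages:
            for target in links.get(source, set()):
                if target in new_auth:
                    new_auth[target] += hub[source]
        # Hub weight: sum of authority weights of pages the page points to.
        # Assumption: the just-updated authority weights are used here.
        new_hub = {
            p: sum(new_auth[t] for t in links.get(p, set()) if t in new_auth)
            for p in pages
        }
        auth, hub = new_auth, new_hub
    # The pages with the highest authority weights are reported.
    return sorted(pages, key=lambda p: auth[p], reverse=True)[:top_k]


# Hypothetical four-page graph: "a" and "b" act as hubs for "c" and "d".
links = {"a": {"c"}, "b": {"c", "d"}, "c": set(), "d": set()}
structure = {"a": 1.0, "b": 1.0, "c": 0.5, "d": 0.5}
print(find_authorities(links, structure))  # -> ['c', 'd']
```

Normalization and per-round weighting by the structure score, which appear in some of the claims, are deliberately left out of this sketch to keep it close to the abstract's wording.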
29 Claims
-
1. A method of determining a set of well-formed hyperlinked documents based upon an analysis of the links between and structure of documents within a larger set of hyperlinked documents, said well-formed hyperlinked documents being authorities on a specified topic, said method comprising:
-
obtaining a base set of hyperlinked documents containing documents relevant to said specified topic and documents which are authorities on said specified topic;
determining a structure score for each document within said base set;
setting an authority weight and a hub weight of each document equal to said document's corresponding structure score;
for each document, updating said authority weight of the document to equal a sum of hub weights of all documents within said base set pointing to the document;
for each document, updating said hub weight of the document to equal a sum of authority weights of all documents within said base set the document is pointing to;
identifying a predetermined number of documents having the highest valued authority weights as said authorities on said specified topic.
-
Dependent claims: 2, 3, 4, 5, 6, 7
determining a first set of documents utilizing a textual search and ranking system;
expanding said first set to form said base set by including documents which have links pointing to documents within said first set and documents which are pointed to by links contained in documents of said first set.
-
-
3. A method of determining a set of well-formed hyperlinked documents based upon an analysis of the links between and structure of documents within a larger set of hyperlinked documents, said well-formed hyperlinked documents being authorities on a specified topic, as per claim 2, wherein an amount of said documents which have links pointing to documents within said first set included to form said base set is limited to a predetermined number for each document within said first set.
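The expansion from a first set to a base set, with the per-document cap on in-pointing documents, might look like the sketch below. The names `expand_to_base_set`, `out_links`, `in_links`, and `max_in_per_doc` are hypothetical, and the claims do not say which in-pointing documents are kept when the cap applies; the sketch simply assumes the first ones in list order.

```python
from typing import Dict, List, Set


def expand_to_base_set(first_set: List[str],
                       out_links: Dict[str, List[str]],
                       in_links: Dict[str, List[str]],
                       max_in_per_doc: int) -> Set[str]:
    """Grow a first set into a base set (illustrative sketch only)."""
    base = set(first_set)
    for doc in first_set:
        # Include every document that this document points to.
        base.update(out_links.get(doc, []))
        # Include documents pointing to this document, capped per document.
        # Assumption: the cap keeps the first entries in list order.
        base.update(in_links.get(doc, [])[:max_in_per_doc])
    return base
```

For example, with one first-set document `r1` that links to `x` and `y` and is linked from `p1`, `p2`, and `p3`, a cap of 2 yields the base set `{r1, x, y, p1, p2}`.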
-
4. A method of determining a set of well-formed hyperlinked documents based upon an analysis of the links between and structure of documents within a larger set of hyperlinked documents, said well-formed hyperlinked documents being authorities on a specified topic, as per claim 1, said determining step further comprising:
-
evaluating a document for conformance to a set of parameters;
determining a parameter score for each parameter of said set of parameters based upon said conformance;
performing a weighted sum of said parameters scores.
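The weighted sum of parameter scores can be illustrated with a short sketch. The parameter names used in the example (`has_title`, `images_have_alt`, `tags_closed`) and their weights are invented for illustration; this section does not list the actual parameters.

```python
from typing import Dict


def structure_score(parameter_scores: Dict[str, float],
                    parameter_weights: Dict[str, float]) -> float:
    """Weighted sum of per-parameter scores (illustrative sketch only)."""
    return sum(parameter_weights[name] * score
               for name, score in parameter_scores.items())


# Hypothetical parameters and weights, chosen only for the example.
scores = {"has_title": 1.0, "images_have_alt": 0.5, "tags_closed": 0.0}
weights = {"has_title": 0.2, "images_have_alt": 0.5, "tags_closed": 0.3}
page_score = structure_score(scores, weights)  # 0.2*1.0 + 0.5*0.5 + 0.3*0.0
```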
-
-
5. A method of determining a set of well-formed hyperlinked documents based upon an analysis of the links between and structure of documents within a larger set of hyperlinked documents, said well-formed hyperlinked documents being authorities on a specified topic, as per claim 1, wherein said updating steps are iterated a predetermined number of times and said authority and hub weights are weighted by said structure score during each iteration after an initial update of said authority and hub weights.
-
6. A method of determining a set of well-formed hyperlinked documents based upon an analysis of the links between and structure of documents within a larger set of hyperlinked documents, said well-formed hyperlinked documents being authorities on a specified topic, as per claim 5, wherein said authority and hub weights are normalized at an end of each iteration.
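One possible reading of claims 5 and 6 is that, in every round after the first, each freshly summed weight is multiplied by the page's structure score, and that the weights are then normalized. The sketch below follows that reading. The function name `weight_and_normalize` is hypothetical, and the choice of normalization (scaling so the squares of the weights sum to 1) is an assumption, since the claims do not specify a norm.

```python
import math
from typing import Dict


def weight_and_normalize(weights: Dict[str, float],
                         structure: Dict[str, float],
                         round_index: int) -> Dict[str, float]:
    """Post-process one round's summed weights (illustrative sketch only)."""
    if round_index > 0:
        # After the initial update, weight each value by the structure score.
        weights = {p: structure[p] * w for p, w in weights.items()}
    # Assumption: normalize so that the squared weights sum to 1.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return dict(weights)
    return {p: w / norm for p, w in weights.items()}
```

The same function would be applied to the authority weights and to the hub weights at the end of each round.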
-
7. A method of determining a set of well-formed hyperlinked documents based upon an analysis of the links between and structure of documents within a larger set of hyperlinked documents, said well-formed hyperlinked documents being authorities on a specified topic, as per claim 1, wherein said documents are world wide web pages.
-
8. A method of determining a weight to be assigned to a hyperlinked document, said weight indicative of both the structure of a hyperlinked document and relevance of said document to a specified topic, relative to documents within a focused set of documents when said weight is compared to weights of other documents in said focused set, said method comprising:
-
calculating a structure score of said hyperlinked document;
setting an authority weight and a hub weight of said hyperlinked document equal to said structure score;
updating said authority weight of said hyperlinked document to equal a sum of hub weights of all hyperlinked documents within said focused set pointing to said hyperlinked document;
updating said hub weight of said hyperlinked document to equal a sum of authority weights of all hyperlinked documents within said focused set said hyperlinked document is pointing to;
wherein said updating steps are iterated a predetermined number of times, said authority and hub weights weighted by said structure score during each iteration after an initial update of said authority and hub weights, and said authority weight and said hub weight are normalized at the end of each iteration.
-
Dependent claims: 9, 10, 11, 12, 13
comparing said document to a set of parameters relevant to proper structure;
setting a parameter score for each of said parameters based upon said comparing;
performing a weighted add of said parameter scores.
-
-
10. A method of determining a weight to be assigned to a hyperlinked document, said weight indicative of both the structure of a hyperlinked document and relevance of said document to a specified topic, relative to documents within a focused set of documents when said weight is compared to weights of other documents in said focused set, as per claim 8, wherein said focused set is obtained via determining a root set of documents which contain web pages relevant to said specified topic and including documents which are pointed to by documents within said root set and documents which point to documents within said root set.
-
11. A method of determining a weight to be assigned to a hyperlinked document, said weight indicative of both the structure of a hyperlinked document and relevance of said document to a specified topic, relative to documents within a focused set of documents when said weight is compared to weights of other documents in said focused set, as per claim 10, wherein an amount of said documents which point to documents within said root set included to form said base set is limited to a predetermined number for each document within said root set.
-
12. A method of determining a weight to be assigned to a hyperlinked document, said weight indicative of both the structure of a hyperlinked document and relevance of said document to a specified topic, relative to documents within a focused set of documents when said weight is compared to weights of other documents in said focused set, as per claim 10, wherein said root set is determined utilizing a textual search and ranking system.
-
13. A method of determining a weight to be assigned to a hyperlinked document, said weight indicative of both the structure of a hyperlinked document and relevance of said document to a specified topic, relative to documents within a focused set of documents when said weight is compared to weights of other documents in said focused set, as per claim 8, wherein said document is a world wide web page.
-
14. A method of locating web pages which are authorities upon a specified topic and which are well-formed, said method comprising:
-
obtaining a root set of web pages which contains web pages relevant to said specified topic;
generating a base set of web pages by including pages which are pointed to by pages within said root set and pages which point to pages within said root set;
evaluating each page of said base set for conformance to a set of parameters, said set of parameters relevant to a proper structure of a web page;
determining a structure score for each page based upon said conformance to said set of parameters;
setting an authority weight and a hub weight of each page equal to said page's corresponding structure score;
iteratively updating said authority and hub weights of each page a predetermined number of times, said updating for each page comprising;
setting said authority weight of the page equal to a sum of hub weights of all pages within said base set pointing to the page;
setting said hub weight of the page equal to a sum of authority weights of all pages within said base set the page is pointing to;
normalizing said authority and hub weight;
identifying a predetermined number of pages having the highest valued authority weights as said authorities upon completion of said iterative updating step, and wherein said authority and hub weights are weighted by said structure score during each iteration after an initial update of said authority and hub weights.
-
Dependent claims: 15, 16, 17
querying a text based search and ranking engine;
discarding all web pages returned by said query beyond a predetermined number of highest ranked web pages returned by said query.
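The two steps above (querying a text-based search and ranking engine, then discarding everything beyond a fixed number of top results) amount to a simple truncation. In the sketch below, `search` stands in for whatever engine is used; it is a hypothetical callable assumed to return page identifiers ordered from highest to lowest rank.

```python
from typing import Callable, List


def root_set(search: Callable[[str], List[str]],
             topic: str,
             limit: int) -> List[str]:
    """Keep only the `limit` highest-ranked results (illustrative sketch only)."""
    ranked = search(topic)  # assumed to be ordered best-first
    return ranked[:limit]   # everything past `limit` is discarded
```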
-
-
16. A method of locating web pages which are authorities upon a specified topic and which are well-formed, as per claim 14, wherein said determining step further comprises:
-
setting a parameter score for each of said parameters of said set;
performing a weighted add of said parameter scores.
-
-
17. A method of locating web pages which are authorities upon a specified topic and which are well-formed, as per claim 14, wherein an amount of said pages which point to pages within said root set included to form said base set is limited to a predetermined number for each page within said root set.
-
18. A method of searching the world wide web to locate authoritative web pages on a specified topic which are structured to be highly accessible regardless of limitations imposed upon a consumer of said web page, said method comprising:
-
searching said world wide web to obtain a root set of pages, a portion of said root set relevant to said specified topic;
expanding said root set to a base set of pages by including pages pointing to pages within said root set and pages pointed to by said pages in said root set;
determining a structure score and setting a hub weight and an authority weight equal to said structure score for each page within said base set, said structure score determined by evaluating each page according to a set of parameters, said parameters relevant to a proper structure of a web page;
iteratively updating said authority and hub weights of each page a predetermined number of times, said updating for each page comprising;
setting said authority weight of the page equal to a sum of hub weights of all pages within said base set pointing to the page;
setting said hub weight of the page equal to a sum of authority weights of all pages within said base set the page is pointing to;
normalizing said authority and hub weight;
identifying a predetermined number of pages having the highest valued authority weights as said authorities upon completion of said iterative updating step, and wherein said authority and hub weights are weighted by said structure score during each iteration after an initial update of said authority and hub weights.
-
Dependent claims: 19, 20, 21
comparing said pages to said set of parameters relevant to proper structure;
setting a parameter score for each of said parameters based upon said comparing;
performing a weighted add of said parameter scores to determine said structure score.
-
-
22. A system for determining a weight to be assigned to a hyperlinked document, said weight indicative of both the structure of a hyperlinked document and relevance of said document to a specified topic, relative to documents within a focused set of documents when said weight is compared to weights of other documents in said focused set, said method comprising:
-
a structure calculator, said structure calculator calculating a structure score of said hyperlinked document;
a weight initializer, said weight initializer setting an initial authority weight and an initial hub weight of said hyperlinked document equal to said structure score;
an authority weight updater, said authority weight updater updating said authority weight of said hyperlinked document to equal a sum of hub weights of all hyperlinked documents within said focused set pointing to said hyperlinked document;
a hub weight updater, said hub weight updater updating said hub weight of said hyperlinked document to equal a sum of authority weights of all hyperlinked documents within said focused set said hyperlinked document is pointing to;
wherein said updaters iteratively update said authority and hub weights a predetermined number of times and normalizes said authority weight and said hub weight at the end of each iteration and said authority and hub weights are weighted by said structure score during each iteration after an initial update of said authority and hub weights.
-
Dependent claims: 23, 24, 25, 26, 27
a comparator, said comparator comparing said document to a set of parameters relevant to proper structure;
a parameter score assigner, said parameter score assigner assigning a parameter score for each of said parameters based upon said comparing;
an adder, said adder performing a weighted add of said parameter scores.
-
-
24. A system for determining a weight to be assigned to a hyperlinked document, said weight indicative of both the structure of a hyperlinked document and relevance of said document to a specified topic, relative to documents within a focused set of documents when said weight is compared to weights of other documents in said focused set, as per claim 22, wherein said focused set is obtained via determining a root set of documents which contain web pages relevant to said specified topic and including documents which are pointed to by documents within said root set and documents which point to documents within said root set.
-
25. A system for determining a weight to be assigned to a hyperlinked document, said weight indicative of both the structure of a hyperlinked document and relevance of said document to a specified topic, relative to documents within a focused set of documents when said weight is compared to weights of other documents in said focused set, as per claim 24, wherein an amount of said documents which point to documents within said root set included to form said base set is limited to a predetermined number for each document within said root set.
-
26. A system for determining a weight to be assigned to a hyperlinked document, said weight indicative of both the structure of a hyperlinked document and relevance of said document to a specified topic, relative to documents within a focused set of documents when said weight is compared to weights of other documents in said focused set, as per claim 24, wherein said root set is determined utilizing a textual search and ranking system.
-
27. A system for determining a weight to be assigned to a hyperlinked document, said weight indicative of both the structure of a hyperlinked document and relevance of said document to a specified topic, relative to documents within a focused set of documents when said weight is compared to weights of other documents in said focused set, as per claim 22, wherein said document is a world wide web page.
-
28. An article of manufacture comprising a computer usable medium having computer readable program code embodied therein which determines a weight to be assigned to a hyperlinked document, said weight indicative of both the structure of a hyperlinked document and relevance of said document to a specified topic, relative to documents within a focused set of documents when said weight is compared to weights of other documents in said focused set, said computer readable program code comprising:
-
computer readable program code for calculating a structure score of said hyperlinked document;
computer readable program code for setting an authority weight and a hub weight of said hyperlinked document equal to said structure score;
computer readable program code for updating said authority weight of said hyperlinked document to equal a sum of hub weights of all hyperlinked documents within said focused set pointing to said hyperlinked document;
computer readable program code for updating said hub weight of said hyperlinked document to equal a sum of authority weights of all hyperlinked documents within said focused set said hyperlinked document is pointing to;
wherein said updating steps are iterated a predetermined number of times, said authority and hub weights are weighted by said structure score during each iteration after an initial update of said authority and hub weights and said authority weight and said hub weight are normalized at the end of each iteration.
-
-
29. An article of manufacture comprising a computer usable medium having computer readable program code embodied therein which locates web pages which are authorities upon a specified topic and which are well-formed, said computer readable program code comprising:
-
computer readable program code for obtaining a root set of web pages which contains web pages relevant to said specified topic;
computer readable program code for generating a base set of web pages by including pages which are pointed to by pages within said root set and pages which point to pages within said root set;
computer readable program code for evaluating each page of said base set for conformance to a set of parameters, said set of parameters relevant to a proper structure of a web page;
determining a structure score for each page based upon said conformance to said set of parameters;
computer readable program code for setting an authority weight and a hub weight of each page equal to said page's corresponding structure score;
computer readable program code for iteratively updating said authority and hub weights of each page a predetermined number of times, said computer readable program code for updating comprising;
computer readable program code for setting said authority weight of the page equal to a sum of hub weights of all pages within said base set pointing to the page;
computer readable program code for setting said hub weight of the page equal to a sum of authority weights of all pages within said base set the page is pointing to;
computer readable program code for normalizing said authority and hub weight;
computer readable program code for weighting said authority and hub weights by said structure score during each iteration after an initial update of said authority and hub weights; and
computer readable program code for identifying a predetermined number of pages having the highest valued authority weights as said authorities upon completion of said iterative updating step.
-
Specification