Computer method and apparatus for determining content types of web pages
First Claim
1. A computer-implemented method of determining content type of contents of a subject Web page, comprising the steps of:
- providing a predefined set of potential content types, content types being exclusive of indicating formal language of the content;
- for each potential content type, preparing a distinguishing series of tests, the distinguishing series of tests includes;
  i) at least one binary test, and
  ii) at least one non-binary test,
  the at least one binary test and the at least one non-binary test further including at least one test (a) examining syntax or grammar; or (b) examining page format or style other than position of data or a keyword in the subject Web page;
- for each potential content type, running the distinguishing series of tests of tests having test results which enable quantitative evaluation of at least some contents of the subject Web page being of the potential content type, mathematically combining the probabilities from all possible combinations of the test results and hypothesis values with respect to content of Web pages of determined content type with the test results of the subject Web page of undetermined content type using at least one Bayesian network; and
- based on the combined test results, assigning a respective probability, for each potential content type, that some contents of that type exists on the subject Web page, and indicating content type, said indicating being exclusive of indicating language in which content is written.
Abstract
Computer method and apparatus determines content type of contents of a subject Web page. A predefined set of potential content types is first provided. For each potential content type, there are one or more tests having test results that enable quantitative evaluation of the contents of the subject Web page. A respective probability of each potential content type being detected in some contents of the subject Web page is determined. A Bayesian network combines the test results to provide indications of the types of contents detected on the subject Web page. A confidence level per detected content type is also provided. A database stores the determined probabilities and confidence levels, and thus provides a cross reference between Web pages and respective content types of contents found on the Web pages.
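The combination step described above can be illustrated with a minimal sketch. It assumes the simplest possible Bayesian network, in which one hypothesis node per content type ("present" or "absent") is the parent of every test node, so the test results are treated as conditionally independent. The content type "contact", the two tests, the prior, and every conditional probability below are hypothetical values chosen for illustration; none of them come from the patent.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List


@dataclass
class Test:
    name: str
    run: Callable[[str], Hashable]          # maps page text to a discrete result
    p_given_present: Dict[Hashable, float]  # P(result | content type present)
    p_given_absent: Dict[Hashable, float]   # P(result | content type absent)


@dataclass
class ContentType:
    name: str
    prior: float                            # P(content type present) before testing
    tests: List[Test]


def posterior(ctype: ContentType, page: str) -> float:
    """Combine all test results for one content type with Bayes' rule."""
    present = ctype.prior
    absent = 1.0 - ctype.prior
    for test in ctype.tests:
        result = test.run(page)
        present *= test.p_given_present[result]
        absent *= test.p_given_absent[result]
    return present / (present + absent)


def classify(page: str, ctypes: List[ContentType]) -> Dict[str, float]:
    """Assign a probability to every potential content type."""
    return {c.name: posterior(c, page) for c in ctypes}


# Hypothetical binary test: does the page contain an e-mail-like token?
has_email = Test(
    name="has_email",
    run=lambda page: bool(re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", page)),
    p_given_present={True: 0.8, False: 0.2},
    p_given_absent={True: 0.1, False: 0.9},
)

# Hypothetical non-binary test: number of phone-like patterns, capped at 2.
phone_count = Test(
    name="phone_count",
    run=lambda page: min(len(re.findall(r"\b\d{3}-\d{3}-\d{4}\b", page)), 2),
    p_given_present={0: 0.3, 1: 0.4, 2: 0.3},
    p_given_absent={0: 0.85, 1: 0.1, 2: 0.05},
)

CONTENT_TYPES = [ContentType("contact", prior=0.2, tests=[has_email, phone_count])]

# A stand-in for the cross-reference database: page identifier -> probabilities.
database: Dict[str, Dict[str, float]] = {}
sample_page = "Contact: jane@example.com, tel 555-123-4567 or 555-987-6543"
database["page-1"] = classify(sample_page, CONTENT_TYPES)
database["page-2"] = classify("Hello world", CONTENT_TYPES)
```

In this sketch a page with an e-mail address and two phone numbers scores far higher for the hypothetical "contact" type than a page with neither. The confidence level mentioned in the abstract is not modeled here.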
76 Citations
20 Claims
1. A computer-implemented method of determining content type of contents of a subject Web page (recited in full under First Claim above).
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 17, 19
9. Apparatus for determining content type of contents of a subject Web page, comprising:
- a digital processor coupled to a memory;
- a predefined set of potential content types, each potential content type being exclusive of indicating formal language of the content and associated with a respective distinguishing series of tests, the distinguishing series of tests includes;
  i) at least one binary test, and
  ii) at least one non-binary test,
  the at least one binary test and the at least one non-binary test further including at least one test (a) examining syntax or grammar; or (b) examining page format or style other than position of data or a keyword in the subject Web page;
- a test module utilizing the predefined set, the test module employing the distinguishing series of tests as a plurality of processor-executed tests having test results which enable, for each potential content type, quantitative evaluation of at least some contents of the subject Web page being of the potential content type, for each potential content type, the test module (i) running the respective distinguishing series of tests, (ii) combining the probabilities from all possible combinations of the test results and hypothesis values with respect to content of Web pages of determined content type with the test results of the subject Web page of undetermined content type using at least one Bayesian network and (iii) for each potential content type, assigning a respective probability that at least some contents of that type exists on the subject Web page being of the potential content type, and indicating content type exclusive of indicating language in which content is written.
Dependent claims: 10, 11, 12, 13, 14, 15, 16, 18, 20
Specification