Using content analysis to detect spam web pages
First Claim
1. A method comprising:
receiving content by crawling a web page;
analyzing the content for web spam using a content-based identification technique, wherein the content-based identification technique comprises at least one of:
determining a fraction of visible content to total content on the web page; or
determining a ratio of compressed visible content to uncompressed visible content on the web page; and
classifying the content according to said analysis.
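The two measurements named in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that "visible content" means text outside `<script>`, `<style>`, `<noscript>`, and `<template>` elements, that sizes are measured in UTF-8 bytes, and that zlib stands in for the unspecified compressor. The names `visible_text`, `visible_fraction`, and `compression_ratio` are hypothetical.

```python
import zlib
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that lies outside non-rendered elements (an assumption)."""

    HIDDEN = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._depth = 0   # nesting depth inside hidden elements
        self.chunks = []  # text fragments treated as visible

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth == 0:
            self.chunks.append(data)


def visible_text(html: str) -> str:
    """Return the page's visible text with whitespace collapsed."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def visible_fraction(html: str) -> float:
    """Fraction of visible content to total content, by UTF-8 byte count."""
    total = len(html.encode("utf-8"))
    if total == 0:
        return 0.0
    return len(visible_text(html).encode("utf-8")) / total


def compression_ratio(html: str) -> float:
    """Ratio of compressed visible content to uncompressed visible content."""
    raw = visible_text(html).encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)
```

Under these assumptions, a page that repeats the same phrase many times yields a much smaller `compression_ratio` than a page of varied prose, and a page dominated by markup or scripts yields a small `visible_fraction`. How either value maps to a spam classification is not specified in the claim.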
3 Assignments
0 Petitions
Abstract
Evaluating content includes receiving content, analyzing the content for web spam using a content-based identification technique, and classifying the content according to the analysis. An index of analyzed contents may be created. A system for evaluating content includes a storage device configured to store data and a processor configured to analyze content for web spam using content-based identification techniques.
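The flow in the abstract (analyze, classify, optionally record in an index) could be wired together roughly as in the sketch below. Everything in it is an illustrative assumption: the `SpamIndex` class, the pluggable `score_fn`, the label strings, and the 0.5 cutoff are hypothetical, and the source does not specify a classifier, a threshold, or an index layout.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class SpamIndex:
    """Hypothetical in-memory index mapping each analyzed URL to a label."""

    score_fn: Callable[[str], float]  # any content-based score; higher = more spam-like (assumed)
    cutoff: float = 0.5               # assumed threshold, not taken from the source
    labels: Dict[str, str] = field(default_factory=dict)

    def add(self, url: str, content: str) -> str:
        # Classify the content according to the analysis, then record the result.
        label = "spam" if self.score_fn(content) >= self.cutoff else "non-spam"
        self.labels[url] = label
        return label
```

Taking `score_fn` as a parameter keeps the index independent of any particular content-based technique, so a visible-content fraction or a compression-based score could be substituted without changing the storage logic.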
36 Citations
20 Claims
1. A method comprising:
receiving content by crawling a web page;
analyzing the content for web spam using a content-based identification technique, wherein the content-based identification technique comprises at least one of:
determining a fraction of visible content to total content on the web page; or
determining a ratio of compressed visible content to uncompressed visible content on the web page; and
classifying the content according to said analysis.
Dependent claims: 2, 3, 4, 5, 6, 15, 17, 18
7. A system for identifying web spam, the system comprising:
a storage device configured to store an index; and
a processor configured to:
receive content from a crawled web page;
analyze the content using a content-based identification technique to determine whether web spam is present, wherein the content-based identification technique comprises at least one of:
determining a fraction of visible content to total content on the web page; or
determining a ratio of compressed visible content to uncompressed visible content on the web page; and
classify the content according to said analysis.
Dependent claims: 8, 9, 10, 11, 12, 16, 19
13. A computer-readable storage medium comprising computer-executable instructions stored thereon, wherein the computer-executable instructions, when executed by a computer, cause performance of acts, the acts comprising:
receiving a set of web pages in response to a query;
analyzing content of the set of web pages for web spam by using a content-based identification technique comprising at least one of:
determining a fraction of visible content to total content on the web page; or
determining a ratio of compressed visible content to uncompressed visible content on the web page; and
classifying the content according to said analysis.
Dependent claims: 14, 20
Specification