Avoiding masked web page content indexing errors for search engines
Abstract
Multiple non-host client sites provide cached user copies of web pages and/or web content, or summaries thereof, to a server. Obtaining data from non-host sources for indexing purposes avoids masked web page content indexing errors for search engines. The server aggregates, summarizes and indexes the web pages and/or web content in an index of cached content, in conjunction with updating, generating and storing a search index using an indexing agent such as a web crawler or spider. In response to receiving search requests from end users, the search engine uses comparisons between the index of cached content and the index of crawled content to identify potential page masking errors for specific search results and appropriately rank or omit results with a high risk of masking errors in a search result list.
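As a rough illustration of the comparison the abstract describes, the sketch below scores how far crawled index information diverges from index information built from cached user copies, then demotes or omits results accordingly. The term-set representation, the Jaccard measure, the omission threshold, and all names (`Result`, `masking_risk`, `rank_results`, `omit_above`) are assumptions made for illustration; the text here does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    relevance: float          # base relevance score from the search index
    crawled_terms: frozenset  # index information from the crawler's copy
    cached_terms: frozenset   # index information from cached user copies


def masking_risk(crawled: frozenset, cached: frozenset) -> float:
    """Return 1 - Jaccard similarity: 0.0 means identical, 1.0 means disjoint."""
    union = crawled | cached
    if not union:
        return 0.0
    return 1.0 - len(crawled & cached) / len(union)


def rank_results(results, omit_above=0.8):
    """Drop high-risk results, then sort the rest by risk-discounted relevance."""
    scored = []
    for r in results:
        risk = masking_risk(r.crawled_terms, r.cached_terms)
        if risk > omit_above:
            continue  # likely masked: omit from the result list
        scored.append((r.relevance * (1.0 - risk), r))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored]
```

Under these assumptions, a page whose crawled terms share nothing with what users actually received is dropped, while a page with partial overlap is kept but ranked lower than its raw relevance would suggest.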
25 Claims
1. A method comprising:
- accessing a first web page hosted at a network address using a server operating a web crawling application to create first index information for the first web page;
- receiving, at the server, second index information generated from a cached copy of a second web page, wherein the cached copy is not hosted at the network address and the second web page is previously received by a client via a browser application operated at the client in response to a request directed to the network address;
- comparing the first index information and the second index information to identify masked web page content; and
- ranking the first web page in a search results list generated at the server, based on the comparison between the first index information and the second index information, to reduce errors caused by the masked web page content.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10

11. A system for indexing web pages comprising:
- means for accessing a first web page hosted at a first network address to create a first index information of the first web page;
- means for receiving a cached copy of a second web page at a client, wherein the cached copy is not hosted at the network address, and the client receives the second web page in response to a browser request directed to the network address;
- means for generating a second index information of the second web page using the cached copy;
- means for comparing the first index information and the second index information to identify masked web page content; and
- means for ranking the first web page based on the comparison between the first index information and the second index information, to reduce errors caused by the masked web page content.
Dependent claims: 12, 13, 14, 15, 16, 18, 19

17. A system for indexing web pages comprising:
- a crawler to access a first web page hosted at a network address to create a first index information of the first web page;
- a server to receive second index information generated from a cached copy of a second web page, wherein the cached copy is not hosted at the network address, and the second web page is obtained by a browser application operating on a remote client using the network address;
- an index generator to generate a second index information of the second web page using the cached copy;
- an application to compare the first index information and the second index information to identify masked web page content; and
- an analyzer to rank the first web page based on the comparison between the first index information and the second index information, to reduce errors caused by the masked web page content.

20. A method for assisting a server to index web pages, the method comprising:
- distributing an application to a client, the application configured to operate on the client and to cause the client to periodically transmit a cached copy of a first web page to a server, wherein the client obtains the first web page using a browser application accessing a specified web address and does not host the first web page at the web address;
- generating a first index information based on the cached copy of the first web page received from the client;
- comparing the first index and a second index, generated from a sample of a second web page hosted at the specified web address and obtained by a web crawling application, to identify masked web page content; and
- updating a search index based on the comparison between the first index and the second index, to reduce errors caused by the masked web page content.
Dependent claims: 21

22. A method for avoiding masked web page content indexing errors, the method comprising:
- receiving cached user copies of web pages from client sources at a server, wherein the cached user copies are identified by respective URLs that designate network addresses other than the client sources and are obtained by browser-initiated requests;
- comparing the cached user copies to corresponding crawler copies obtained by a web crawler application and also identified by the respective URLs to identify masked web page content;
- updating a search index with the cached user copies, to reduce errors caused by the masked web page content; and
- storing the updated search index, wherein the updated search index is used to generate search results for at least one client.
Dependent claims: 23, 24, 25

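The client-to-server flow recited in claims 20 and 22 could look roughly like the following sketch: a client application reduces a cached copy to compact index information, and the server, on finding that the crawler's copy of the same URL differs, updates its search index from the cached user copy. The payload fields, the SHA-256 fingerprint, the naive tokenizer, and every function name are hypothetical; the claims do not prescribe a format or transport, and nothing is actually transmitted here.

```python
import hashlib
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase word tokens; a deliberately naive tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def summarize_cached_copy(url: str, cached_text: str) -> dict:
    """Client side: build index information from a browser-cached copy."""
    return {
        "url": url,  # the host's address, not the client's
        "fingerprint": hashlib.sha256(cached_text.encode("utf-8")).hexdigest(),
        "term_counts": dict(Counter(tokenize(cached_text))),
    }


def update_search_index(search_index: dict, crawler_text: str, report: dict) -> bool:
    """Server side: if the crawler copy differs, index the cached user copy.

    search_index maps term -> set of URLs (an assumed layout).
    Returns True when a mismatch, and so possible masking, was found.
    """
    crawler_fp = hashlib.sha256(crawler_text.encode("utf-8")).hexdigest()
    if crawler_fp == report["fingerprint"]:
        return False  # copies match; nothing to correct
    url = report["url"]
    for urls in search_index.values():
        urls.discard(url)  # remove postings derived from the crawler copy
    for term in report["term_counts"]:
        search_index.setdefault(term, set()).add(url)
    return True
```

An exact fingerprint match is the simplest possible test and would flag any page with dynamic content, so a practical comparison would presumably need some tolerance, for example a similarity threshold over the term counts.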
Specification