System and method for content selection for web page indexing
Abstract
An indexing system for documents such as web pages divides a document into elements, such as document object model elements. User attention data from prior interactions with the document are analyzed to determine those elements of a document that satisfy a threshold requirement of user attention. Elements meeting the user attention threshold requirement are added to a set of indexable content for the document. Furthermore, document sections are determined based on attention data and each section is indexed separately. Indexing is per section and based only on the indexable content, thereby enhancing the index relevance, increasing the efficiency of search engines and reducing spamdexing.
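As a rough illustration of the selection step the abstract describes, the following Python sketch keeps only the elements whose recorded attention meets a threshold. The `Element` type, the use of a simple event count as the attention measure, and the threshold value are all assumptions made for illustration; the abstract does not say how attention is measured.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a DOM element and its recorded attention."""
    element_id: str
    text: str
    attention_events: int  # assumed metric: count of prior UI events in the element


def select_indexable(elements, min_events):
    """Return the elements whose attention count meets the threshold."""
    return [e for e in elements if e.attention_events >= min_events]


page = [
    Element("nav", "Home | About | Contact", 2),
    Element("article", "Main article text", 140),
    Element("footer-links", "cheap pills cheap pills", 0),
]

indexable = select_indexable(page, min_events=10)
print([e.element_id for e in indexable])
```

In this sketch, content that users never interact with, such as a block of hidden keyword stuffing, simply never reaches the index, which is one way to read the abstract's remark about reducing spamdexing.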
20 Claims
1. A method for indexing a webpage comprising:
retrieving, by an indexer server, a plurality of webpages to be indexed from a webpage data store, wherein the indexer server comprises one or more computer systems configured to perform indexing of webpage documents within the webpage data store;
determining, by the indexer server, for each of the plurality of webpages, a document object model (DOM) containing one or more DOM elements within each webpage;
computing, by the indexer server, a DOM element identifier for each of the one or more DOM elements within each of the plurality of webpages, wherein each DOM element identifier is computed based on the content within the corresponding DOM element;
determining, by the indexer server, a first subset of the plurality of DOM elements having DOM element identifiers that satisfy a content similarity threshold to the DOM element identifiers of the other DOM elements;
retrieving attention history data associated with each of the first subset of DOM elements, wherein the attention history data for each particular DOM element is based on previous user interface events detected within the particular DOM element;
combining the attention history data associated with each of the first subset of DOM elements, and comparing the combined attention history data to an attention history threshold level; and
in response to a determination that the combined attention history data associated with the first subset of DOM elements meets the attention history threshold level, indexing, by the indexer server, each of the first subset of DOM elements.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9.
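The steps recited in claim 1 can be sketched in Python as follows. This is only one possible reading, under stated assumptions: the identifier is taken to be the set of lowercase words in an element, similarity is measured with the Jaccard index against a chosen seed element, and attention data is combined by summing event counts. None of these choices, nor the names `DomElement`, `element_identifier`, `similar_subset` and `index_if_attended`, come from the claim itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomElement:
    page_url: str
    text: str
    attention: int  # assumed: number of prior UI events recorded in the element


def element_identifier(element):
    """Assumed identifier: the set of lowercase words in the element's content."""
    return frozenset(element.text.lower().split())


def jaccard(a, b):
    """Similarity of two identifiers, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similar_subset(elements, seed, similarity_threshold):
    """Elements whose identifier is close enough to the seed element's identifier."""
    seed_id = element_identifier(seed)
    return [e for e in elements
            if jaccard(element_identifier(e), seed_id) >= similarity_threshold]


def index_if_attended(elements, seed, similarity_threshold, attention_threshold, index):
    """Index the similar subset only when its combined attention meets the threshold."""
    subset = similar_subset(elements, seed, similarity_threshold)
    combined = sum(e.attention for e in subset)  # assumed combination: a plain sum
    if combined >= attention_threshold:
        for e in subset:
            index.setdefault(e.page_url, []).append(e.text)
        return True
    return False


elements = [
    DomElement("a.example/1", "breaking news about web indexing", 40),
    DomElement("a.example/2", "breaking news about web search indexing", 35),
    DomElement("a.example/1", "copyright notice all rights reserved", 1),
]
index = {}
indexed = index_if_attended(elements, elements[0], 0.6, 50, index)
print(indexed, sorted(index))
```

Here the two news elements share five of six distinct words, so they form the similar subset, and their combined attention of 75 exceeds the assumed threshold of 50. The copyright notice, taken as a seed on its own, would fall short and would not be indexed.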
10. A system for indexing web pages comprising:
an indexer server comprising one or more processors, and memory storing computer-readable instructions that, when executed by the one or more processors, cause the indexer server to execute a content selection module programmed to:
retrieve a plurality of webpages from a webpage data store to be indexed;
determine, for each of the plurality of webpages to be indexed, a document object model (DOM) containing one or more DOM elements within each webpage;
compute a DOM element identifier for each of the one or more DOM elements within each of the plurality of webpages, wherein each DOM element identifier is computed based on the content within the corresponding DOM element;
determine a first subset of the plurality of DOM elements having DOM element identifiers that satisfy a content similarity threshold to the DOM element identifiers of the other DOM elements;
retrieve attention history data associated with each of the first subset of DOM elements, wherein the attention history data for each particular DOM element is based on previous user interface events detected within the particular DOM element;
combine the attention history data associated with each of the first subset of DOM elements, and compare the combined attention history data to an attention history threshold level; and
in response to a determination that the combined attention history data associated with the first subset of DOM elements meets the attention history threshold level, provide each of the first subset of DOM elements to an indexing module programmed to index the received first subset of DOM elements.
Dependent claims: 11, 12, 13, 14, 15.
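Claim 10 separates a content selection module from an indexing module. A minimal sketch of that hand-off, assuming attention history is a per-element count of logged user interface events and that the indexing module builds a simple inverted word index, might look like the following. The class names, the `(element_id, event_type)` event format and the element id strings are hypothetical.

```python
from collections import Counter


class IndexingModule:
    """Hypothetical indexer: maps each word to the element ids that contain it."""

    def __init__(self):
        self.inverted = {}

    def index(self, elements):
        for element_id, text in elements.items():
            for word in set(text.lower().split()):
                self.inverted.setdefault(word, set()).add(element_id)


class ContentSelectionModule:
    """Hypothetical selector that forwards well-attended elements to an indexer."""

    def __init__(self, indexing_module, attention_threshold):
        self.indexing_module = indexing_module
        self.attention_threshold = attention_threshold

    def attention_history(self, ui_events):
        """Count prior UI events (e.g. clicks, hovers) per element id."""
        return Counter(element_id for element_id, _event_type in ui_events)

    def process(self, subset, ui_events):
        """Forward the subset when its combined attention meets the threshold."""
        history = self.attention_history(ui_events)
        combined = sum(history[element_id] for element_id in subset)
        if combined >= self.attention_threshold:
            self.indexing_module.index(subset)
            return True
        return False


indexer = IndexingModule()
selector = ContentSelectionModule(indexer, attention_threshold=3)
subset = {"p1#story": "Local election results", "p2#story": "Election results update"}
events = [("p1#story", "click"), ("p1#story", "mouseover"),
          ("p2#story", "scroll"), ("p1#ad", "click")]
forwarded = selector.process(subset, events)
print(forwarded, sorted(indexer.inverted["election"]))
```

Keeping the two modules separate in this way means the indexer only ever sees elements that have already passed the attention check; events logged against elements outside the subset, such as the ad click above, do not count toward the combined total.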
16. A non-transitory computer-readable medium comprising computer-executable instructions for execution by a processor, that, when executed, cause the processor to:
retrieve a plurality of webpages from a webpage data store to be indexed;
determine, for each of the plurality of webpages to be indexed, a document object model (DOM) containing one or more DOM elements within each webpage;
compute a DOM element identifier for each of the one or more DOM elements within each of the plurality of webpages, wherein each DOM element identifier is computed based on the content within the corresponding DOM element;
determine a first subset of the plurality of DOM elements having DOM element identifiers that satisfy a content similarity threshold to the DOM element identifiers of the other DOM elements;
retrieve attention history data associated with each of the first subset of DOM elements, wherein the attention history data for each particular DOM element is based on previous user interface events detected within the particular DOM element;
combine the attention history data associated with each of the first subset of DOM elements, and compare the combined attention history data to an attention history threshold level; and
in response to a determination that the combined attention history data associated with the first subset of DOM elements meets the attention history threshold level, index each of the first subset of DOM elements.
Dependent claims: 17, 18, 19, 20.
Specification