System and method for content selection for web page indexing

US 10,303,722 B2
Filed: 05/05/2009
Issued: 05/28/2019
Est. Priority Date: 05/05/2009
Status: Active Grant

5 Assignments

0 Petitions

Abstract

An indexing system for documents such as web pages divides a document into elements, such as document object model elements. User attention data from prior interactions with the document are analyzed to determine those elements of a document that satisfy a threshold requirement of user attention. Elements meeting the user attention threshold requirement are added to a set of indexable content for the document. Furthermore, document sections are determined based on attention data and each section is indexed separately. Indexing is per section and based only on the indexable content, thereby enhancing the index relevance, increasing the efficiency of search engines and reducing spamdexing.
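
As a rough illustration of the selection step described in the abstract, the sketch below filters a document's elements by a user-attention threshold and keeps only the qualifying ones as indexable content. The element ids, the event counts, and the names `select_indexable` and `min_events` are hypothetical; the abstract does not prescribe a data layout or a particular threshold value.

```python
from typing import Dict, List


def select_indexable(attention_counts: Dict[str, int], min_events: int = 1) -> List[str]:
    """Return ids of elements whose attention count meets the threshold.

    attention_counts maps a (hypothetical) element id to the number of
    attention events recorded for it in prior user interactions.
    """
    return [elem_id for elem_id, count in attention_counts.items()
            if count >= min_events]


# Hypothetical attention data for a single page.
page_attention = {"article-body": 42, "nav-bar": 3, "hidden-keywords": 0}
indexable = select_indexable(page_attention, min_events=5)
print(indexable)
```

Under these assumed numbers, an element that never received attention (for example, hidden keyword stuffing) is left out of the indexable content, which is one plausible reading of how the approach reduces spamdexing.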

49 Citations

20 Claims

1. A method for indexing a webpage comprising:
- retrieving, by an indexer server, a plurality of webpages to be indexed from a webpage data store, wherein the indexer server comprises one or more computer systems configured to perform indexing of webpage documents within the webpage data store;
  
  determining, by the indexer server, for each of the plurality of webpages, a document object model (DOM) containing one or more DOM elements within each webpage;
  
  computing, by the indexer server, a DOM element identifier for each of the one or more DOM elements within each of the plurality of webpages, wherein each DOM element identifier is computed based on the content within the corresponding DOM element;
  
  determining, by the indexer server, a first subset of the plurality of DOM elements having DOM element identifiers that satisfy a content similarity threshold to the DOM element identifiers of the other DOM elements;
  
  retrieving attention history data associated with each of the first subset of DOM elements, wherein the attention history data for each particular DOM element is based on previous user interface events detected within the particular DOM element;
  
  combining the attention history data associated with each of the first subset of DOM elements, and comparing the combined attention history data to an attention history threshold level; and
  
  in response to a determination that the combined attention history data associated with the first subset of DOM elements meets the attention history threshold level, indexing, by the indexer server, each of the first subset of DOM elements.
  - 2. The method according to claim 1 wherein retrieving attention history data associated with each of the first subset of DOM elements comprises retrieving the attention history data from an interaction data store.
  - 3. The method according to claim 1 wherein the attention history data comprises interaction data that associates a document object model element with attention that the document object model element received from a user during an interaction with the webpage.
  - 4. The method according to claim 1 wherein the attention history threshold level comprises a requirement that the combination of the first subset of document object model elements has at least one associated attention event.
  - 5. The method according to claim 1 wherein the attention history threshold level comprises a requirement that the combination of the first subset of document object model elements has at least a threshold number of associated attention events.
  - 6. The method according to claim 1, further comprising retrieving section analysis data from a database, for each of the plurality of webpages.
  - 7. The method according to claim 1, further comprising:
    - identifying a first particular document object model element within the first subset of DOM elements, wherein the first particular document object model element is included within multiple of the plurality of webpages, and wherein the comparing the combined attention history data associated with the first subset of DOM elements to the attention history threshold level comprises summing the previous user interface events detected within the particular document object model element over the multiple webpages.
  - 8. The method according to claim 1, wherein each of the plurality of DOM element identifiers is computed using a hash function on the content of the DOM element.
  - 9. The method according to claim 8, wherein retrieving the attention history data for each of the first subset of DOM elements comprises excluding outclick events from the attention history data prior to the comparison to the attention history threshold level.
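
The steps recited in claims 1 and 7 to 9 can be sketched roughly as follows. This is only an illustration under stated assumptions: SHA-256 over whitespace-normalized text stands in for "a hash function", identical hashes stand in for the content similarity test, and the event tuple layout and the event names other than `outclick` are hypothetical. None of these choices is taken from the patent itself.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical event record: (page_url, element_content, event_type).
Event = Tuple[str, str, str]


def element_id(content: str) -> str:
    """Compute an element identifier from the element's content (cf. claim 8).

    SHA-256 over whitespace-normalized text is an assumption; the claim
    only says "a hash function".
    """
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def combined_attention(events: Iterable[Event]) -> Dict[str, int]:
    """Sum attention events per element identifier over all pages (cf. claim 7),
    skipping outclick events (cf. claim 9)."""
    totals: Dict[str, int] = defaultdict(int)
    for _page, content, event_type in events:
        if event_type == "outclick":
            continue
        totals[element_id(content)] += 1
    return dict(totals)


def elements_to_index(events: Iterable[Event], threshold: int) -> List[str]:
    """Return identifiers whose combined attention meets the threshold."""
    totals = combined_attention(events)
    return sorted(eid for eid, count in totals.items() if count >= threshold)


# Hypothetical interaction log covering two pages.
events = [
    ("a.html", "Shared  footer text", "mouseover"),
    ("b.html", "Shared footer text", "click"),
    ("b.html", "Shared footer text", "outclick"),
    ("a.html", "Unique intro", "mouseover"),
]
selected = elements_to_index(events, threshold=2)
```

With this assumed log, the element that appears on both pages collects two counted events (the outclick is ignored) and is selected, while the element seen once is not.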

10. A system for indexing web pages comprising:
- an indexer server comprising one or more processors, and memory storing computer-readable instructions that, when executed by the one or more processors, cause the indexer server to execute a content selection module programmed to:
  
  retrieve a plurality of webpages from a webpage data store to be indexed;
  
  determine, for each of the plurality of webpages to be indexed, a document object model (DOM) containing one or more DOM elements within each webpage;
  
  compute a DOM element identifier for each of the one or more DOM elements within each of the plurality of webpages, wherein each DOM element identifier is computed based on the content within the corresponding DOM element;
  
  determine a first subset of the plurality of DOM elements having DOM element identifiers that satisfy a content similarity threshold to the DOM element identifiers of the other DOM elements;
  
  retrieve attention history data associated with each of the first subset of DOM elements, wherein the attention history data for each particular DOM element is based on previous user interface events detected within the particular DOM element;
  
  combine the attention history data associated with each of the first subset of DOM elements, and compare the combined attention history data to an attention history threshold level; and
  
  in response to a determination that the combined attention history data associated with the first subset of DOM elements meets the attention history threshold level, provide each of the first subset of DOM elements to an indexing module programmed to index the received first subset of DOM elements.
  - 11. The system according to claim 10, further comprising an interaction data store that stores interaction data for the plurality of web pages.
  - 12. The system according to claim 10 wherein the content selection module adds content of the first subset of DOM elements to the indexable content if the interaction data associated with the first subset of DOM elements comprises a plurality of attention events associated with the content and if the number of attention events are above a threshold number.
  - 13. The system according to claim 12 wherein the interaction data comprises at least one association between the first subset of document object model elements and one or more attention events generated by a human user during an interaction with the first subset of document object model elements.
  - 14. The system according to claim 13 wherein the one or more attention events indicate regions within the plurality of web pages of interest to a human user during an interaction with the plurality of web pages.
  - 15. The system according to claim 10, wherein retrieving the attention history data for each of the first subset of DOM elements comprises excluding outclick events from the attention history data prior to the comparison to the attention history threshold level.
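
The division of labor between the content selection module and the indexing module in claim 10 might be sketched as below. The class names, the toy inverted index, and the dictionary-based inputs are all hypothetical and chosen only for illustration; the claims do not specify how either module is built.

```python
from collections import defaultdict
from typing import Dict, Set


class IndexingModule:
    """Toy inverted index: term -> set of element ids (a hypothetical structure)."""

    def __init__(self) -> None:
        self.postings: Dict[str, Set[str]] = defaultdict(set)

    def index(self, elem_id: str, text: str) -> None:
        for term in text.lower().split():
            self.postings[term].add(elem_id)


class ContentSelectionModule:
    """Forwards a group of elements to the indexer only if their combined
    attention count meets the threshold."""

    def __init__(self, indexer: IndexingModule, threshold: int) -> None:
        self.indexer = indexer
        self.threshold = threshold

    def process(self, elements: Dict[str, str], attention: Dict[str, int]) -> bool:
        # Combine the attention counts of every element in the group.
        combined = sum(attention.get(eid, 0) for eid in elements)
        if combined < self.threshold:
            return False
        for eid, text in elements.items():
            self.indexer.index(eid, text)
        return True


indexer = IndexingModule()
selector = ContentSelectionModule(indexer, threshold=3)
accepted = selector.process({"e1": "Release notes", "e2": "release schedule"},
                            {"e1": 2, "e2": 1})
rejected = selector.process({"e3": "buy cheap pills"}, {})
```

In this sketch the first group is indexed because its combined count reaches the assumed threshold of 3, and the second group, which has no recorded attention, never reaches the index.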

16. A non-transitory computer-readable medium comprising computer-executable instructions for execution by a processor, that, when executed, cause the processor to:
- retrieve a plurality of webpages from a webpage data store to be indexed;
  
  determine, for each of the plurality of webpages to be indexed, a document object model (DOM) containing one or more DOM elements within each webpage;
  
  compute a DOM element identifier for each of the one or more DOM elements within each of the plurality of webpages, wherein each DOM element identifier is computed based on the content within the corresponding DOM element;
  
  determine a first subset of the plurality of DOM elements having DOM element identifiers that satisfy a content similarity threshold to the DOM element identifiers of the other DOM elements;
  
  retrieve attention history data associated with each of the first subset of DOM elements, wherein the attention history data for each particular DOM element is based on previous user interface events detected within the particular DOM element;
  
  combine the attention history data associated with each of the first subset of DOM elements, and compare the combined attention history data to an attention history threshold level; and
  
  in response to a determination that the combined attention history data associated with the first subset of DOM elements meets the attention history threshold level, indexing each of the first subset of DOM elements.
  - 17. The non-transitory computer-readable medium of claim 16, wherein retrieving attention history data associated with each of the first subset of DOM elements comprises retrieving the attention history data from an interaction data store.
  - 18. The non-transitory computer-readable medium of claim 16, wherein the attention history data comprises interaction data that associates a document object model element with attention that the document object model element received from a user during an interaction with the webpage.
  - 19. The non-transitory computer-readable medium of claim 16, wherein the attention history threshold level comprises a requirement that the combination of the first subset of document object model elements has at least one associated attention event.
  - 20. The non-transitory computer-readable medium of claim 16, wherein the attention history threshold level comprises a requirement that the combination of the first subset of document object model elements has at least a threshold number of associated attention events.

Current Assignee: Oracle America, Inc. (Oracle Corporation)
Original Assignee: Oracle America, Inc. (Oracle Corporation)
Inventors: Hauser, Robert R
Primary Examiner(s): Savla, Arpan P.
Assistant Examiner(s): Davanlou, Soheila (Gina)

Application Number: US12/435,777
Publication Number: US 20100287462A1
Time in Patent Office: 3,675 Days
Field of Search: 707/711, 707/715, 707/732
US Class Current
CPC Class Codes:
G06F 16/81 Indexing, e.g. XML tags; Da...
G06F 16/951 Indexing; Web crawling tech...
