Pattern recognition in web search engine result pages
First Claim
1. A non-transitory computer readable medium comprising computer readable instructions, which, when executed by a computer, cause the computer to perform a method, the method comprising:
receiving a result page from a web search engine, the result page comprising text fields and markup tags, and an integer number for the results on the result page;
generating simplified variations of the result page, the generating comprising:
determining noisy markup tags in the result page;
generating a first variation of the result page by removing the noisy markup tags from the result page;
generating a plurality of other variations of the result page by preserving one noisy markup tag from the noisy markup tags and removing the rest of the noisy markup tags;
stripping inside of remaining markup tags in the first variation and the plurality of other variations; and
simplifying the text fields by marking the text fields with free text markers;
parsing the simplified variations of the result page to determine one or more repeating patterns, the one or more repeating patterns comprising a substring of the simplified variations of the result page, the substring beginning at a start of a remaining markup tag or a free text marker and ending at a close of the remaining markup tag or the free text marker;
selecting the one or more repeating patterns that are repeated the integer number of times in the result page as result patterns;
selecting one of the one or more result patterns as a highest rated result pattern according to predefined rating criteria; and
generating a regular expression from the highest rated result pattern as an output that matches the results on the result page.
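The variation-generating steps of claim 1 can be pictured with a short Python sketch. It is illustrative only and is not the patented implementation: the claim does not say which tags count as noisy or what a free text marker looks like, so the `NOISY_TAGS` list, the `#` marker, and every function name below are assumptions made for this example.

```python
import re

# Assumption: inline formatting tags are treated as "noisy" for illustration.
NOISY_TAGS = ("b", "i", "em", "strong", "span", "font", "br")
FREE_TEXT = "#"  # hypothetical free text marker

# Matches an opening or closing tag; group 1 is "/" or "", group 2 the tag name.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def find_noisy_tags(page):
    """Return the assumed-noisy tag names that actually occur in the page."""
    present = {m.group(2).lower() for m in TAG_RE.finditer(page)}
    return [t for t in NOISY_TAGS if t in present]


def simplify(page, remove):
    """Drop tags named in `remove`, strip attributes from the rest, mark text."""
    def replace_tag(m):
        closing, name = m.group(1), m.group(2).lower()
        if name in remove:
            return ""
        return "<" + closing + name + ">"  # tag with its inside stripped

    stripped = TAG_RE.sub(replace_tag, page)
    out = []
    # Split into tags and the text runs between them.
    for part in re.split(r"(<[^>]+>)", stripped):
        if part.startswith("<"):
            out.append(part)
        elif part.strip():
            out.append(FREE_TEXT)  # any non-blank text becomes one marker
    return "".join(out)


def simplified_variations(page):
    """First variation removes all noisy tags; each other one preserves one."""
    noisy = find_noisy_tags(page)
    variations = [simplify(page, set(noisy))]
    for keep in noisy:
        variations.append(simplify(page, set(noisy) - {keep}))
    return variations


page = '<div class="r"><a href="x"><b>Title</b> one</a><span>snip</span></div>'
for variation in simplified_variations(page):
    print(variation)
```

For the sample page, the first variation drops both `<b>` and `<span>`, and the two further variations each keep one of them. The tag regex is deliberately naive (an attribute value containing `>` would confuse it), which is acceptable for a sketch but not for real pages.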
1 Assignment
0 Petitions
Abstract
Described herein are methods and systems for pattern recognition in web search engine result pages. The input data is a result page from a web search engine as well as an integer number for the results on the page. The output is a regular expression that matches all the results on the page, capturing each result and its individual fields.
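As a rough illustration of the kind of output the abstract describes, the Python sketch below turns a simplified result pattern into a regular expression with one capture group per text field. The `#` free text marker, the function name `pattern_to_regex`, the sample page, and the translation rules (tags may carry attributes again, text is matched lazily) are all assumptions for this example, not details taken from the patent.

```python
import re

FREE_TEXT = "#"  # hypothetical free text marker

# Splits a simplified pattern into bare tags and free text markers.
TOKEN_RE = re.compile(r"</?[a-z][a-z0-9]*>|" + re.escape(FREE_TEXT))


def pattern_to_regex(pattern):
    """Build a compiled regex from a simplified pattern such as '<li><a>#</a></li>'."""
    pieces = []
    for tok in TOKEN_RE.findall(pattern):
        if tok == FREE_TEXT:
            pieces.append(r"(.*?)")  # one capture group per text field
        elif tok.startswith("</"):
            pieces.append(re.escape(tok[:-1]) + r"\s*>")
        else:
            # Opening tag: allow the attributes that were stripped earlier.
            pieces.append(re.escape(tok[:-1]) + r"(?:\s[^>]*)?>")
    # Tolerate whitespace between adjacent tokens.
    return re.compile(r"\s*".join(pieces), re.DOTALL | re.IGNORECASE)


page = (
    "<ul>"
    '<li class="r"><a href="/1"><b>First</b> hit</a><p>alpha</p></li>'
    '<li class="r"><a href="/2">Second hit</a><p>beta</p></li>'
    "</ul>"
)
rx = pattern_to_regex("<li><a>#</a><p>#</p></li>")
for match in rx.finditer(page):
    print(match.group(0))   # the whole result
    print(match.groups())   # its individual fields
```

Here `match.group(0)` is the full result and `match.groups()` holds its fields. Because text is matched lazily with `(.*?)`, markup that was removed during simplification (the `<b>` in the first result) ends up inside the captured field rather than breaking the match.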
23 Citations
11 Claims
1. A non-transitory computer readable medium comprising computer readable instructions, which, when executed by a computer, cause the computer to perform a method, the method comprising:
receiving a result page from a web search engine, the result page comprising text fields and markup tags, and an integer number for the results on the result page;
generating simplified variations of the result page, the generating comprising:
determining noisy markup tags in the result page;
generating a first variation of the result page by removing the noisy markup tags from the result page;
generating a plurality of other variations of the result page by preserving one noisy markup tag from the noisy markup tags and removing the rest of the noisy markup tags;
stripping inside of remaining markup tags in the first variation and the plurality of other variations; and
simplifying the text fields by marking the text fields with free text markers;
parsing the simplified variations of the result page to determine one or more repeating patterns, the one or more repeating patterns comprising a substring of the simplified variations of the result page, the substring beginning at a start of a remaining markup tag or a free text marker and ending at a close of the remaining markup tag or the free text marker;
selecting the one or more repeating patterns that are repeated the integer number of times in the result page as result patterns;
selecting one of the one or more result patterns as a highest rated result pattern according to predefined rating criteria; and
generating a regular expression from the highest rated result pattern as an output that matches the results on the result page.
Dependent claims: 2, 3, 4, 5
6. A computer implemented method for pattern recognition in web search engine result pages, comprising:
receiving an HTML result page from a web search engine, the HTML result page comprising text fields and markup tags, and an integer value with a number of results on the HTML result page;
generating simplified variations of the HTML result page, the generating comprising:
determining noisy markup tags in the result page;
generating a first variation of the result page by removing the noisy markup tags from the result page;
generating a plurality of other variations of the result page by preserving one noisy markup tag from the noisy markup tags and removing the rest of the noisy markup tags;
stripping inside of remaining markup tags in the first variation and the plurality of other variations; and
simplifying the text fields by marking the text fields with free text markers;
parsing the simplified variations of the HTML result pages to determine one or more repeating patterns, the one or more repeating patterns comprising a substring of the simplified variations of the HTML result page, the substring beginning at a start of a remaining markup tag or a free text marker and ending at a close of the remaining markup tag or the free text marker;
selecting the one or more repeating patterns that are repeated the integer value number of times in the HTML result page as result patterns;
selecting one of the one or more result patterns as a highest rated result pattern according to predefined rating criteria; and
generating a regular expression from the highest rated result pattern as an output that matches the results on the HTML result page.
Dependent claims: 7, 8, 9, 10, 11
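The parsing and selection steps recited in the independent claims can be sketched in Python as follows. This is a simplified reading under stated assumptions, not the claimed method itself: only tag-delimited substrings are considered (not ones that start at a free text marker), occurrences are counted within each simplified variation, `#` stands in for the free text marker, and the rating rule (more text fields first, then longer patterns) is invented here because the claims leave the "predefined rating criteria" unspecified.

```python
import re
from collections import Counter

# '#' is a hypothetical free text marker; tags are assumed already stripped.
TOKEN_RE = re.compile(r"</?[a-z][a-z0-9]*>|#")


def element_substrings(simplified):
    """Yield each substring running from an opening tag to its matching close."""
    tokens = TOKEN_RE.findall(simplified)
    stack = []  # (tag name, index of the opening token)
    for i, tok in enumerate(tokens):
        if tok == "#":
            continue
        if not tok.startswith("</"):
            stack.append((tok[1:-1], i))
            continue
        name = tok[2:-1]
        if not any(open_name == name for open_name, _ in stack):
            continue  # stray closing tag: ignore it
        while True:
            open_name, start = stack.pop()  # unclosed tags are discarded
            if open_name == name:
                yield "".join(tokens[start:i + 1])
                break


def result_patterns(variations, n_results):
    """Patterns that occur exactly n_results times in at least one variation."""
    found = set()
    for variation in variations:
        counts = Counter(element_substrings(variation))
        found.update(p for p, c in counts.items() if c == n_results)
    return found


def rate(pattern):
    # Assumed rating criteria: number of text fields, then pattern length.
    return (pattern.count("#"), len(pattern))


def best_pattern(variations, n_results):
    """Return the highest rated result pattern, or None if there is none."""
    candidates = result_patterns(variations, n_results)
    return max(candidates, key=rate) if candidates else None


variation = "<ul>" + "<li><a>#</a><p>#</p></li>" * 3 + "</ul><div><a>#</a></div>"
print(best_pattern([variation], 3))
```

In the sample variation, `<a>#</a>` appears four times, so it is rejected for a page known to hold three results, while both `<p>#</p>` and the full `<li>` element appear exactly three times. The assumed rating then prefers the `<li>` element because it carries two text fields.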
Specification