Systems and methods for identifying and extracting data from HTML pages
Abstract
Systems and methods for analyzing HTML formatted web pages to automatically identify and extract desired information. A computer algorithm identifies and extracts different pieces of information from different web pages automatically after minimal manual setup. The algorithm automatically analyzes pages with different content if they have the same, or similar, formats.
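To illustrate the idea of pages that differ in content but share a format, the following minimal Python sketch reduces a page to a sequence of symbols that represent only its HTML tags. The names `TagOnlyParser` and `tag_symbols`, and the choice of using the lowercase tag name as the symbol, are assumptions made for this example; the text above does not prescribe a particular encoding.

```python
from html.parser import HTMLParser


class TagOnlyParser(HTMLParser):
    """Records one symbol per HTML tag and ignores all content data."""

    def __init__(self):
        super().__init__()
        self.symbols = []  # tag symbols in document order

    def handle_starttag(self, tag, attrs):
        # Hypothetical encoding: the tag name itself is the symbol.
        # Attributes are deliberately dropped.
        self.symbols.append(tag)

    def handle_endtag(self, tag):
        # Closing tags get a leading slash so they differ from opening tags.
        self.symbols.append("/" + tag)


def tag_symbols(html):
    """Return the list of tag symbols for an HTML document (hypothetical helper)."""
    parser = TagOnlyParser()
    parser.feed(html)
    parser.close()
    return parser.symbols


model = "<html><body><table><tr><td>Price</td><td>$10</td></tr></table></body></html>"
other = "<html><body><table><tr><td>Cost</td><td>$25</td></tr></table></body></html>"

# Different content data, same formatting tags: the symbol sequences are equal.
print(tag_symbols(model) == tag_symbols(other))
```

Because text between tags never reaches the symbol list, two pages built from the same template produce the same sequence regardless of what they say.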
27 Claims
1. A computer implemented method of identifying desired content in HTML formatted web pages, comprising the steps of:
selecting a model page, wherein the model page includes content data and a plurality of HTML tags for formatting the content data;
identifying a first area of interest in the model page;
parsing the model page to generate a first string of symbols for the plurality of HTML tags, the generated symbols in the first string representing only HTML tags, wherein the first area of interest is identified by a first portion of the first string of symbols;
retrieving a second web page associated with a different URL than the model page;
parsing the second web page to generate a second string of symbols for a plurality of HTML tags of the second web page, the generated symbols in the second string representing only HTML tags; and
comparing the first and second symbol strings to determine whether the second string includes a second portion similar to the first portion of the first string, wherein the second portion corresponds to a second area of interest in the second page.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
12. A computer readable medium containing instructions for controlling a computer system to automatically identify desired content in a retrieved HTML formatted web page, by automatically:
parsing the HTML code of a manually selected model web page to generate a first string of symbols for a first plurality of HTML tags, the generated symbols in the first string representing only HTML tags;
retrieving a second web page associated with a different URL than the model web page;
parsing the HTML code of the second web page to generate a second string of symbols for HTML tags of the second page, the generated symbols in the second string representing only HTML tags; and
comparing the first and second symbol strings to determine whether the second page includes a second plurality of HTML tags substantially matching the first plurality of HTML tags.
Dependent claims: 13, 14, 15, 16, 17, 18, 19, 20
21. A computer system for identifying and extracting content from HTML formatted web pages, the system comprising:
means for retrieving web pages including content data and HTML tags for formatting the content data, wherein a model web page is retrieved;
means for manually identifying a first area of interest in the model page, wherein the first area of interest corresponds to a first plurality of HTML tags; and
a processor including:
means for parsing a page, wherein the parsing means parses the model page and generates a first string of symbols for the first plurality of HTML tags, the generated symbols in the first string representing only HTML tags, and wherein the parsing means thereafter parses an automatically retrieved second web page associated with a different URL than the model page and generates a second string of symbols for HTML tags of the second web page, the generated symbols in the second string representing only HTML tags;
means for comparing the first and second symbol strings to determine whether the second string includes a second portion similar to the first portion of the first string, wherein the second portion corresponds to a second area of interest in the second page; and
means for extracting content data in the second area of interest from the second page.
22. A computer implemented method of identifying desired content in web pages formatted using a markup language, comprising the steps of:
selecting a model page, wherein the model page includes a plurality of tokens, wherein tokens include HTML tag elements and content elements;
identifying a first area of interest in the model page;
parsing the model page to generate a first string of symbols for the plurality of tokens in the model page, the generated symbols in the first string representing only tag elements, wherein the first area of interest is identified by a first portion of the first string of symbols;
retrieving a second web page associated with a different URL than the model page;
parsing the second web page to generate a second string of symbols for a plurality of tokens of the second web page, the generated symbols in the second string representing only tag elements; and
comparing the first and second symbol strings to determine whether the second string includes a second portion similar to the first portion of the first string, wherein the second portion corresponds to a second area of interest in the second page.
Dependent claims: 23, 24
25. A computer-implemented method of identifying similar content in HTML formatted web pages, the method comprising:
selecting a model page, wherein the model page includes content data and a plurality of HTML tags for formatting the content data;
identifying a first area of interest in the model page;
generating a first string of symbols for the plurality of HTML tags associated with the first area of interest, the generated symbols in the first string representing only HTML tags;
retrieving a second web page associated with a different URL than the model page;
generating a second string of symbols for the HTML tags of the second web page, the generated symbols in the second string representing only HTML tags; and
comparing the first and second symbol strings to determine whether the second string includes a portion similar to the first string, wherein the portion corresponds to a second area of interest in the second page.
Dependent claims: 26, 27
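The comparing step recited in the independent claims can be sketched as a search for the part of the second symbol string that most resembles the first portion. The claims do not define "similar", so this sketch makes several assumptions: symbols are list items (for example tag names), similarity is the ratio reported by Python's `difflib.SequenceMatcher`, candidate windows have the same length as the first portion, and the `0.8` threshold and the function name `find_similar_portion` are hypothetical.

```python
from difflib import SequenceMatcher


def find_similar_portion(first_portion, second_string, threshold=0.8):
    """Return (start, end, score) for the window of second_string that is most
    similar to first_portion, or None if no window reaches the threshold.

    Assumption: only windows with the same length as first_portion are tried.
    """
    n = len(first_portion)
    best = None
    for start in range(0, max(1, len(second_string) - n + 1)):
        window = second_string[start:start + n]
        score = SequenceMatcher(None, first_portion, window, autojunk=False).ratio()
        # Keep the earliest window with the highest score.
        if best is None or score > best[2]:
            best = (start, start + len(window), score)
    if best is not None and best[2] >= threshold:
        return best
    return None


# Tag symbols of the area of interest in a model page (a two-cell table row).
first_portion = ["table", "tr", "td", "/td", "td", "/td", "/tr", "/table"]

# Tag symbols of a second page that has an extra paragraph before the table.
second_string = ["html", "body", "h1", "/h1", "p", "/p",
                 "table", "tr", "td", "/td", "td", "/td", "/tr", "/table",
                 "/body", "/html"]

match = find_similar_portion(first_portion, second_string)
print(match)  # (6, 14, 1.0)
```

The returned indices mark the second area of interest within the second page's symbol string. A fixed-length window is the simplest choice; tolerating inserted or missing tags more gracefully would call for an alignment-based comparison instead.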
Specification