Systems and methods for identifying and extracting data from HTML pages

US 6,920,609 B1
Filed: 08/24/2000
Issued: 07/19/2005
Est. Priority Date: 08/24/2000
Status: Expired due to Term

Abstract

Systems and methods for analyzing HTML formatted web pages to automatically identify and extract desired information. A computer algorithm identifies and extracts different pieces of information from different web pages automatically after minimal manual setup. The algorithm automatically analyzes pages with different content if they have the same, or similar, formats.

27 Claims

1. A computer implemented method of identifying desired content in HTML formatted web pages, comprising the steps of:
- selecting a model page, wherein the model page includes content data and a plurality of HTML tags for formatting the content data;
  
  identifying a first area of interest in the model page;
  
  parsing the model page to generate a first string of symbols for the plurality of HTML tags, the generated symbols in the first string representing only HTML tags, wherein the first area of interest is identified by a first portion of the first string of symbols;
  
  retrieving a second web page associated with a different URL than the model page;
  
  parsing the second web page to generate a second string of symbols for a plurality of HTML tags of the second web page, the generated symbols in the second string representing only HTML tags; and
  
  comparing the first and second symbol strings to determine whether the second string includes a second portion similar to the first portion of the first string, wherein the second portion corresponds to a second area of interest in the second page.
- Dependent claims 2-11:
  - 2. The method of claim 1, wherein the step of comparing includes applying an approximate pattern matching algorithm to the first and second strings.
  - 3. The method of claim 1, further comprising the step of storing the first and second areas of interest in a database.
  - 4. The method of claim 1, further comprising the step of extracting content data in the second area of interest from the second page.
  - 5. The method of claim 4, further comprising the step of applying a regular expression matching algorithm to the extracted second area of interest.
  - 6. The method of claim 1, wherein the first and second areas of interest each include two or more distinct sub-areas of the respective page.
  - 7. The method of claim 1, wherein the step of identifying a first area of interest includes the step of identifying portions of the HTML tags of the model page.
  - 8. The method of claim 1, wherein the step of identifying a first area of interest is performed using a manual pointing and selecting device.
  - 9. The method of claim 1, wherein the steps of selecting and identifying are performed manually and wherein the remaining steps are performed automatically.
  - 10. The method of claim 1, wherein the second web page is retrieved from a remote website over the Internet.
  - 11. The method of claim 1, wherein the HTML tags include attributes and attribute values.
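
The parsing steps of claim 1 (symbols that represent only HTML tags) and the attribute variant of claim 11 can be illustrated with a minimal sketch. This is not taken from the patent's specification: the names `TagSymbolizer` and `tag_symbols`, the choice of the tag name as the symbol, and the bracketed attribute encoding are all assumptions made for illustration, built on Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class TagSymbolizer(HTMLParser):
    """Emits one symbol per HTML tag; text content is ignored (hypothetical helper)."""

    def __init__(self, with_attributes=False):
        super().__init__()
        self.with_attributes = with_attributes
        self.symbols = []

    def handle_starttag(self, tag, attrs):
        symbol = tag
        if self.with_attributes:
            # Assumed encoding: fold sorted name=value pairs into the symbol.
            pairs = sorted((name, value or "") for name, value in attrs)
            symbol += "".join(f"[{n}={v}]" for n, v in pairs)
        self.symbols.append(symbol)

    def handle_endtag(self, tag):
        self.symbols.append("/" + tag)

    # handle_data is deliberately not overridden, so content never
    # contributes a symbol.


def tag_symbols(page, with_attributes=False):
    """Return the list of tag symbols for an HTML page given as a string."""
    parser = TagSymbolizer(with_attributes)
    parser.feed(page)
    parser.close()
    return parser.symbols


# Two pages with different content but the same formatting.
model = '<table class="quote"><tr><td>ACME</td><td>10.00</td></tr></table>'
other = '<table class="quote"><tr><td>Initech</td><td>7.25</td></tr></table>'

print(tag_symbols(model))
print(tag_symbols(model) == tag_symbols(other))
print(tag_symbols(model, with_attributes=True)[0])
```

Because the content is dropped, both sample pages produce the same symbol string, which is what lets a model page stand in for other pages that share its format.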

12. A computer readable medium containing instructions for controlling a computer system to automatically identify desired content in a retrieved HTML formatted web page, by automatically:
- parsing the HTML code of a manually selected model web page to generate a first string of symbols for a first plurality of HTML tags, the generated symbols in the first string representing only HTML tags;
  
  retrieving a second web page associated with a different URL than the model web page;
  
  parsing the HTML code of the second web page to generate a second string of symbols for HTML tags of the second page, the generated symbols in the second string representing only HTML tags; and
  
  comparing the first and second symbol strings to determine whether the second page includes a second plurality of HTML tags substantially matching the first plurality of HTML tags.
- Dependent claims 13-20:
  - 13. The computer readable medium of claim 12, wherein the first plurality of HTML tags are identified by an operator using a pointing and selection device coupled to the computer system.
  - 14. The computer readable medium of claim 12, wherein the second web page is retrieved from a remote website over the Internet.
  - 15. The computer readable medium of claim 12, further including instructions for extracting a portion of the second page corresponding to the second plurality of HTML tags.
  - 16. The computer readable medium of claim 15, wherein the instructions further control the computer system to store the extracted portion of the second page in a database.
  - 17. The computer readable medium of claim 15, further including instructions for controlling the computer system to apply a regular expression matching algorithm to the extracted portion of the second page.
  - 18. The computer readable medium of claim 15, wherein the extracted portion of the second page includes two or more distinct sub-areas.
  - 19. The computer readable medium of claim 12, wherein the instructions for comparing include instructions for applying an approximate string matching algorithm to the first and second strings.
  - 20. The computer readable medium of claim 12, wherein the HTML tags include attributes and attribute values.
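
Claims 15 and 17 describe extracting a portion of the second page and then applying a regular expression to it. A small sketch of that post-processing step follows. The price pattern, the crude tag-stripping expression, the sample fragment, and the function name `extract_prices` are assumptions chosen for illustration; the claims do not say which expressions are used.

```python
import re

# Hypothetical pattern: a dollar amount with optional cents.
PRICE = re.compile(r"\$\s*(\d+(?:\.\d{2})?)")

# Crude tag stripper, adequate for this sketch only.
TAG = re.compile(r"<[^>]+>")


def extract_prices(fragment):
    """Return the dollar amounts found in an extracted HTML fragment."""
    text = TAG.sub(" ", fragment)
    return PRICE.findall(text)


# Assume this fragment is the portion already extracted from the second page.
fragment = "<td>Cost</td><td><b>$12.50</b></td><td>was $15</td>"
print(extract_prices(fragment))  # ['12.50', '15']
```

Running the regular expression only on the extracted portion, rather than on the whole page, keeps unrelated numbers elsewhere on the page from matching.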

21. A computer system for identifying and extracting content from HTML formatted web pages, the system comprising:
- means for retrieving web pages including content data and HTML tags for formatting the content data, wherein a model web page is retrieved;
  
  means for manually identifying a first area of interest in the model page, wherein the first area of interest corresponds to a first plurality of HTML tags; and
  
  a processor including:
  
  means for parsing a page, wherein the parsing means parses the model page and generates a first string of symbols for the first plurality of HTML tags, the generated symbols in the first string representing only HTML tags, and wherein the parsing means thereafter parses an automatically retrieved second web page associated with a different URL than the model page and generates a second string of symbols for HTML tags of the second web page, the generated symbols in the second string representing only HTML tags;
  
  means for comparing the first and second symbol strings to determine whether the second string includes a second portion similar to the first portion of the first string, wherein the second portion corresponds to a second area of interest in the second page; and
  
  means for extracting content data in the second area of interest from the second page.

22. A computer implemented method of identifying desired content in web pages formatted using a markup language, comprising the steps of:
- selecting a model page, wherein the model page includes a plurality of tokens, wherein tokens include HTML tag elements and content elements;
  
  identifying a first area of interest in the model page;
  
  parsing the model page to generate a first string of symbols for the plurality of tokens in the model page, the generated symbols in the first string representing only tag elements, wherein the first area of interest is identified by a first portion of the first string of symbols;
  
  retrieving a second web page associated with a different URL than the model page;
  
  parsing the second web page to generate a second string of symbols for a plurality of tokens of the second web page, the generated symbols in the second string representing only tag elements; and
  
  comparing the first and second symbol strings to determine whether the second string includes a second portion similar to the first portion of the first string, wherein the second portion corresponds to a second area of interest in the second page.
- Dependent claims 23-24:
  - 23. The method of claim 22, further comprising the step of extracting content elements in the second area of interest from the second page.
  - 24. The method of claim 22, wherein the markup language is selected from the group consisting of HTML, XML, WML, DHTML and HDML.
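
Claims 22 and 23 distinguish tag elements from content elements: only tags become symbols, yet the content inside a matched area is what gets extracted. One way to keep both views aligned is sketched below. The token representation, the helper names, and the convention that an area is a half-open range of tag-symbol indices are assumptions for illustration, not details from the patent.

```python
from html.parser import HTMLParser


class Tokenizer(HTMLParser):
    """Splits a page into ("tag", name) and ("content", text) tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("tag", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("tag", "/" + tag))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self.tokens.append(("content", data.strip()))


def tokenize(page):
    parser = Tokenizer()
    parser.feed(page)
    parser.close()
    return parser.tokens


def tag_string(tokens):
    """Symbol string representing only the tag elements."""
    return [value for kind, value in tokens if kind == "tag"]


def content_in_tag_span(tokens, start, end):
    """Content elements between tag symbols start (inclusive) and end (exclusive)."""
    found = []
    tags_seen = 0
    for kind, value in tokens:
        if kind == "tag":
            tags_seen += 1
        elif start < tags_seen < end:
            # The content follows the span's first tag and precedes its last.
            found.append(value)
    return found


page = ("<html><body><p>Ad</p><table><tr><td>Cost</td><td>12</td></tr>"
        "</table></body></html>")
tokens = tokenize(page)
# Assume a comparison step reported tag symbols 4..11 (the table) as the area.
print(content_in_tag_span(tokens, 4, 12))  # ['Cost', '12']
```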

25. A computer-implemented method of identifying similar content in HTML formatted web pages, the method comprising:
- selecting a model page, wherein the model page includes content data and a plurality of HTML tags for formatting the content data;
  
  identifying a first area of interest in the model page;
  
  generating a first string of symbols for the plurality of HTML tags associated with the first area of interest, the generated symbols in the first string representing only HTML tags;
  
  retrieving a second web page associated with a different URL than the model page;
  
  generating a second string of symbols for the HTML tags of the second web page, the generated symbols in the second string representing only HTML tags; and
  
  comparing the first and second symbol strings to determine whether the second string includes a portion similar to the first string, wherein the portion corresponds to a second area of interest in the second page.
- Dependent claims 26-27:
  - 26. The method of claim 25, further comprising extracting content data in the second area of interest from the second page.
  - 27. The method of claim 25, wherein identifying is performed manually using a user-input device.
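
The comparing step of claim 25 asks whether the second string contains a portion similar to the first string. The claims do not name a specific algorithm, so the sketch below swaps in a standard technique as an assumption: edit distance with a free starting position (often attributed to Sellers) to find where the best match ends, followed by an anchored pass over the reversed sequences to find where it starts. The function names and the `max_errors` threshold are hypothetical.

```python
def last_row(pattern, text, free_start):
    """Final row of the edit-distance table of pattern against text.

    With free_start=True, entry j is the best distance between pattern and
    any portion of text ending at position j. With free_start=False, entry j
    is the distance between pattern and the prefix text[:j].
    """
    row = [0 if free_start else j for j in range(len(text) + 1)]
    for i, p in enumerate(pattern, start=1):
        new = [i] + [0] * len(text)
        for j, t in enumerate(text, start=1):
            new[j] = min(row[j - 1] + (p != t),  # match or substitute
                         row[j] + 1,             # symbol missing from text
                         new[j - 1] + 1)         # extra symbol in text
        row = new
    return row


def find_similar_portion(pattern, text, max_errors):
    """Return (start, end) of the closest portion of text, or None."""
    ends = last_row(pattern, text, free_start=True)
    end = min(range(len(ends)), key=ends.__getitem__)
    if ends[end] > max_errors:
        return None
    # Walk backwards from `end`: reversed pattern against reversed prefix.
    back = last_row(pattern[::-1], text[:end][::-1], free_start=False)
    length = min(range(len(back)), key=back.__getitem__)
    return end - length, end


# Tag symbols of the model's area of interest (a two-cell table row).
first = ["table", "tr", "td", "/td", "td", "/td", "/tr", "/table"]
# Tag symbols of a second page: extra markup before and inside the table.
second = ["html", "body", "div", "p", "/p", "/div",
          "table", "tr", "td", "/td", "td", "b", "/b", "/td", "/tr", "/table",
          "/body", "/html"]

print(find_similar_portion(first, second, max_errors=2))  # (6, 16)
print(find_similar_portion(first, second, max_errors=1))  # None
```

The threshold makes "similar" an explicit choice: the two inserted `b` symbols cost two edits, so the table is found with `max_errors=2` and rejected with `max_errors=1`.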

Current Assignee: R2 Solutions LLC (Acacia Research Corporation)
Original Assignee: Yahoo! Inc. (Apollo Global Management, Inc.)
Inventors: Manber, Udi; Lu, Qi
Primary Examiner(s): Hong, Stephen
Assistant Examiner(s): Basehoar, Adam L.
Application Number: US09/645,479
Time in Patent Office: 1,790 Days
Field of Search: 715/513, 715/501.1, 715/503, 715/514, 707/6, 707/1, 707/100, 707/102
US Class Current: 715/205
CPC Class Codes:
G06F 16/951   Indexing; Web crawling tech...
Y10S 707/99931   Database or file accessing
Y10S 707/99936   Pattern matching access
