Method, system, and computer-readable medium for filtering harmful HTML in an electronic document
First Claim
1. A computer-implemented method for filtering harmful markup language from an electronic document in a computer system, comprising:
receiving the electronic document, wherein the electronic document comprises data including a plurality of markup language content;
parsing the data received from the electronic document for the plurality of markup language content;
comparing the plurality of markup language content with a list of known markup language content in a library, wherein the library comprises a set of security rules for identifying safe and harmful markup language content, wherein the harmful markup language content comprises malicious markup language as determined by the set of security rules, wherein the set of security rules comprises flags which describe various levels of security and wherein the flags are passed to an application program interface;
determining a presence of safe markup language content and harmful markup language content in the electronic document based on the content library, wherein determining the presence of harmful markup language content comprises:
determining whether any of the plurality of markup language content is in the library; and
if any of the plurality of markup language content is not listed in the library, then determining that the content not listed in the library comprises the presence of harmful markup language content; and
removing the harmful markup language content from the electronic document.
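The security-rule flags recited above can be illustrated with a short sketch. The flag names, library contents, and function below are hypothetical, not drawn from the patent; they only show how level-of-security flags passed to a filtering interface could gate which known markup counts as safe, with unlisted markup treated as harmful by default.

```python
from enum import Flag, auto

# Hypothetical security-level flags of the kind the claim describes being
# passed to the application program interface; names are illustrative only.
class SecurityLevel(Flag):
    ALLOW_FORMATTING = auto()   # e.g. <b>, <i>
    ALLOW_LINKS = auto()        # e.g. <a href=...>
    ALLOW_IMAGES = auto()       # e.g. <img src=...>

# Hypothetical library: known markup mapped to the flag that must be set
# for that markup to be considered safe under the active security rules.
LIBRARY = {
    "b": SecurityLevel.ALLOW_FORMATTING,
    "i": SecurityLevel.ALLOW_FORMATTING,
    "a": SecurityLevel.ALLOW_LINKS,
    "img": SecurityLevel.ALLOW_IMAGES,
}

def is_harmful(element: str, rules: SecurityLevel) -> bool:
    """Markup not listed in the library is harmful by definition; listed
    markup is harmful unless the active security rules enable it."""
    required = LIBRARY.get(element)
    if required is None:          # not in the library -> harmful
        return True
    return not (rules & required)
```

Note the default-deny behavior: `is_harmful` never returns `False` for an element absent from the library, which mirrors the claim's treatment of unlisted content as harmful.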
Abstract
A method and system are provided for filtering harmful HTML content from an electronic document. An application program interface (API) examines the fundamental structure of the HTML content in the document. The HTML content in the electronic document is parsed into HTML elements and attributes by a tokenizer and compared to a content library by a filter in the API. The filter removes unknown HTML content as well as known content that is listed as harmful in the content library. After the harmful HTML content has been removed, a new document is encoded that includes the remaining safe HTML content for viewing in a web browser.
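The abstract's parse, compare, and re-encode flow can be sketched with Python's standard `html.parser`. The allowlisted element and attribute sets below are placeholders, not the patent's content library, and this minimal sketch does not suppress text inside removed containers such as `<script>`, which a real filter would also need to handle.

```python
from html.parser import HTMLParser

# Placeholder content library: an allowlist of known-safe elements and
# attributes; anything not listed is treated as harmful and removed.
SAFE_ELEMENTS = {"p", "b", "i", "em", "strong", "ul", "ol", "li", "a", "br"}
SAFE_ATTRIBUTES = {"href", "title"}

class AllowlistFilter(HTMLParser):
    """Tokenizes HTML into elements/attributes and keeps only allowlisted content."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []  # surviving tokens, re-encoded at the end

    def handle_starttag(self, tag, attrs):
        if tag in SAFE_ELEMENTS:
            kept = " ".join(f'{k}="{v}"' for k, v in attrs if k in SAFE_ATTRIBUTES)
            self.out.append(f"<{tag} {kept}>" if kept else f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in SAFE_ELEMENTS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)  # plain text passes through unchanged

def filter_html(document: str) -> str:
    f = AllowlistFilter()
    f.feed(document)
    f.close()
    return "".join(f.out)
```

Unknown tags (and non-allowlisted attributes such as event handlers) are simply never emitted, which corresponds to the abstract's removal of unknown content.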
25 Claims
1. A computer-implemented method for filtering harmful markup language from an electronic document in a computer system, comprising:
receiving the electronic document, wherein the electronic document comprises data including a plurality of markup language content;
parsing the data received from the electronic document for the plurality of markup language content;
comparing the plurality of markup language content with a list of known markup language content in a library, wherein the library comprises a set of security rules for identifying safe and harmful markup language content, wherein the harmful markup language content comprises malicious markup language as determined by the set of security rules, wherein the set of security rules comprises flags which describe various levels of security and wherein the flags are passed to an application program interface;
determining a presence of safe markup language content and harmful markup language content in the electronic document based on the content library, wherein determining the presence of harmful markup language content comprises:
determining whether any of the plurality of markup language content is in the library; and
if any of the plurality of markup language content is not listed in the library, then determining that the content not listed in the library comprises the presence of harmful markup language content; and
removing the harmful markup language content from the electronic document.
(Dependent claims: 2-12)
13. A computer-readable medium having computer-executable components for filtering harmful markup language from an electronic document in a computer system, comprising:
a parser component for parsing data received from the electronic document for the plurality of markup language content; and
a filter component for comparing the plurality of markup language content with a list of known markup language content in a library, wherein the filter component determines the presence of harmful markup language content in the electronic document by:
determining whether any of the plurality of markup language content is in the library; and
if any of the plurality of markup language content is not listed in the library, then determining that the content not listed in the library includes the harmful markup language content;
wherein the library comprises a set of security rules for identifying safe and harmful markup language content, wherein the harmful markup language content comprises malicious markup language as determined by the set of security rules;
an application program interface for receiving the set of security rules, wherein the set of security rules comprises flags which describe various levels of security;
determining the presence of safe markup language content in the electronic document as included on the list of known markup language content in the library; and
removing the harmful markup language content from the electronic document.
(Dependent claims: 14-19)
20. A system for filtering harmful markup language content from an electronic document comprising:
a computer for receiving the electronic document, wherein the electronic document comprises data including a plurality of markup language content;
an application program interface, stored on the computer, for filtering the electronic document and receiving a set of security rules, wherein the set of security rules comprises flags which describe various levels of security;
a decoder for decoding the data in the received electronic document;
a tokenizer for parsing the data into a plurality of tokens defining the structure of the electronic document;
a filter for:
receiving the tokens from the tokenizer;
determining whether any of the tokens represent harmful markup language content by:
comparing each token to a list of tokens in a content library stored in the filter; and
determining that a token represents harmful markup language content when the token is not in the list of tokens in the content library, wherein the harmful markup language content comprises malicious markup language as determined by the set of security rules; and
removing the tokens representing the harmful content;
a detokenizer for regenerating the tokens not representing harmful markup language content into new data; and
an encoder for encoding the new data into a new document.
(Dependent claims: 21-25)
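The decode, tokenize, filter, detokenize, encode pipeline recited in claim 20 might look like the following minimal sketch. The regex tokenizer, token library, and function names are assumptions for illustration, and text inside removed elements is not suppressed here.

```python
import re

# Hypothetical token library: tag names listed as known/safe; any tag token
# not in this list is treated as harmful, per the claim's default-deny rule.
TOKEN_LIBRARY = {"p", "b", "i"}
TAG_RE = re.compile(r"(<[^>]+>)")

def decode(raw: bytes) -> str:
    return raw.decode("utf-8")

def tokenize(data: str) -> list:
    # Split the data into tag tokens and text tokens that together
    # define the structure of the document.
    return [t for t in TAG_RE.split(data) if t]

def is_harmful(token: str) -> bool:
    if not token.startswith("<"):
        return False                  # plain text passes through
    parts = token.strip("</>").split()
    return not parts or parts[0].lower() not in TOKEN_LIBRARY

def detokenize(tokens: list) -> str:
    # Regenerate the surviving tokens into new data.
    return "".join(tokens)

def encode(data: str) -> bytes:
    return data.encode("utf-8")

def filter_document(raw: bytes) -> bytes:
    tokens = tokenize(decode(raw))
    safe = [t for t in tokens if not is_harmful(t)]
    return encode(detokenize(safe))
```

Each stage maps to one claimed component: `decode` to the decoder, `tokenize` to the tokenizer, the comprehension over `is_harmful` to the filter, `detokenize` to the detokenizer, and `encode` to the encoder producing the new document.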
Specification