Systems and methods for parsing user-generated content to prevent attacks

US 9,098,722 B2
Filed: 03/15/2013
Issued: 08/04/2015
Est. Priority Date: 03/15/2013
Status: Active Grant

Abstract

The present invention relates to systems and methods for parsing of a token stream for user generated content in order to prevent attacks on the user generated content. The systems and methods include a database which stores one or more whitelists, and a parser. The parser removes tokens from the token stream by comparing the tokens against the whitelist. Next, the parser validates CSS property values, encodes data within attribute values and text nodes, reconciles closing HTML tags, and coerces media tags into safe variants. The tokens removed may be any of HTML tags, HTML attributes, HTML protocols, CSS selectors and CSS properties.
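The whitelist comparison described in the abstract can be illustrated with a minimal sketch. Everything named here (the `Token` class, `TAG_WHITELIST`, `PROTOCOL_WHITELIST`, `URL_ATTRS`, and `sanitize`) is a hypothetical stand-in chosen for illustration, not the patent's actual data model; the whitelist contents are likewise assumed.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class Token:
    kind: str                      # "open", "close", or "text"
    value: str                     # tag name, or the text content
    attrs: dict = field(default_factory=dict)


# Hypothetical whitelist: allowed tags mapped to their allowed attributes.
TAG_WHITELIST = {"p": set(), "b": set(), "a": {"href", "title"}}
# Hypothetical whitelist of URL schemes for protocol-based attributes.
PROTOCOL_WHITELIST = {"http", "https", "mailto"}
# Attributes whose values are treated as URLs in this sketch.
URL_ATTRS = {"href", "src"}


def sanitize(tokens):
    """Return a new token list containing only whitelisted tags and attributes."""
    out = []
    for tok in tokens:
        if tok.kind == "text":
            out.append(tok)        # text nodes pass through; encoding is a later step
            continue
        if tok.value not in TAG_WHITELIST:
            continue               # tag not whitelisted: drop the token
        allowed = TAG_WHITELIST[tok.value]
        kept = {}
        for name, val in tok.attrs.items():
            if name not in allowed:
                continue           # attribute not whitelisted for this tag
            if name in URL_ATTRS:
                scheme = urlparse(val.strip()).scheme.lower()
                if scheme not in PROTOCOL_WHITELIST:
                    continue       # this sketch also rejects relative URLs (empty scheme)
            kept[name] = val
        out.append(Token(tok.kind, tok.value, kept))
    return out
```

In this sketch, a `script` tag and an `href` carrying a `javascript:` URL would both be dropped, while the text between the removed tags survives as a text node for the later encoding step to handle.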

18 Claims

1. A method for parsing a token stream symbolizing user generated content, using a computer implemented security system, the method comprising:
- removing tokens, using a processor, from the token stream to generate a sanitized token stream, wherein the removal of tokens is performed by;
  
  iterating over the token stream while filtering for nodes that are hypertext markup language tags, and cross referencing the tag against a whitelist;
  
  if the tag is in the whitelist, then iterating through the attributes of the tag and cross referencing the attributes against the whitelist;
  
  iterating through protocol-based hypertext markup language attributes to identify a valid URL, and cross referencing the valid URL with the whitelist;
  
  iterating through cascade style sheet selectors within <style> and <link> tags and cross referencing the cascade style sheet selector with the whitelist;
  
  if the cascade style sheet selector is in the whitelist, then iterating through properties for the cascade style sheet selector in <style>/<link> tags or as “style” attributes on a specific hypertext markup language tag, and cross referencing the properties against the whitelist; and
  
  removing any token which is not found in the whitelist when cross-referenced.
  - 2. The method of claim 1, further comprising:
    - validating cascade style sheet property values within the sanitized token stream;
      
      encoding data within attribute values and text nodes within the sanitized token stream;
      
      reconciling closing hypertext markup language tags within the sanitized token stream; and
      
      coercing media tags into safe variants within the sanitized token stream.
  - 3. The method of claim 1, further comprising removing matching opening/closing nodes to the removed token.
  - 4. The method of claim 2, wherein validating cascade style sheet property value includes iterating through each property/value combination and verifying that the cascade style sheet property value meets the requirements set forth in cascade style sheet specification.
  - 5. The method of claim 4, further comprising removing any cascade style sheet property value which does not meet requirements set forth in cascade style sheet specification.
  - 6. The method of claim 2, wherein the encoding includes running an encoder for text within attribute values, and running the encoder for text nodes.
  - 7. The method of claim 2, wherein the reconciling closing hypertext markup language tags includes removing closing hypertext markup language tags that have missing opening tags, and inserting matching closing hypertext markup language tags for opening tags that are not closed.
  - 8. The method of claim 7, further comprising updating opening/closing positions throughout the entire token stream when new matching nodes are added.
  - 9. The method of claim 2, wherein the coercing media tags into safe variants includes iterating through rich media tags and coercing the rich media tags based on type, class identifier, and URL endpoints for the destination file.
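
The tag reconciliation recited in claim 7 (dropping closing tags that have no opening tag, and inserting closing tags for openings that are never closed) can be sketched with a stack. This is only an illustration under simplifying assumptions: tokens are plain `(kind, name)` tuples, the function name `reconcile` is hypothetical, and void elements such as `<br>` are not handled.

```python
def reconcile(tokens):
    """Balance open/close tags in a list of (kind, name) tuples.

    kind is "open", "close", or "text"; for text tokens, name holds the text.
    """
    out = []
    stack = []                         # names of tags that are currently open
    for kind, name in tokens:
        if kind == "open":
            stack.append(name)
            out.append((kind, name))
        elif kind == "close":
            if name not in stack:
                continue               # closing tag with no opening tag: remove it
            # Close any inner tags that were left open, then the tag itself.
            while stack:
                top = stack.pop()
                out.append(("close", top))
                if top == name:
                    break
        else:
            out.append((kind, name))
    # Insert closing tags for anything still open at the end of the stream.
    while stack:
        out.append(("close", stack.pop()))
    return out
```

Because this sketch rebuilds the output list rather than editing the stream in place, it sidesteps the position updates that claim 8 describes; an in-place implementation would need to track them.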

10. A security system for parsing a token stream symbolizing user generated content comprising:
- a database, stored in a memory storage device, including a whitelist; and
  
  a parser configured by a processor device to generate a sanitized token stream by;
  
  iterating over the token stream while filtering for nodes that are hypertext markup language tags, and cross referencing the tag against the whitelist;
  
  if the tag is in the whitelist, then iterating through the attributes of the tag and cross referencing the attributes against the whitelist;
  
  iterating through protocol-based hypertext markup language attributes to identify a valid URL, and cross referencing the valid URL with the whitelist;
  
  iterating through cascade style sheet selectors within <style> and <link> tags and cross referencing the cascade style sheet selector with the whitelist;
  
  if the cascade style sheet selector is in the whitelist, then iterating through properties for the cascade style sheet selector in <style>/<link> tags or as “style” attributes on a specific hypertext markup language tag, and cross referencing the properties against the whitelist; and
  
  removing any token which is not found in the whitelist when cross-referenced.
  - 11. The system of claim 10, wherein the parser is further configured to:
    - validate cascade style sheet property values within the sanitized token stream;
      
      encode data within attribute values and text nodes within the sanitized token stream;
      
      reconcile closing hypertext markup language tags within the sanitized token stream; and
      
      coerce media tags into safe variants within the sanitized token stream.
  - 12. The system of claim 10, wherein the parser is further configured to remove matching opening/closing nodes to the removed token.
  - 13. The system of claim 11, wherein the parser iterates through each property/value combination and verifies that the cascade style sheet property value meets the requirements set forth in cascade style sheet specification.
  - 14. The system of claim 13, wherein the parser is further configured to remove any CSS property value which does not meet requirements set forth in cascade style sheet specification.
  - 15. The system of claim 11, wherein the parser is further configured to run an encoder for text within attribute values, and running the encoder for text nodes.
  - 16. The system of claim 11, wherein the parser is further configured to remove closing hypertext markup language tags that have missing opening tags, and insert matching closing hypertext markup language tags for opening tags that are not closed.
  - 17. The system of claim 16, wherein the parser is further configured to update opening/closing positions throughout the entire token stream when new matching nodes are added.
  - 18. The system of claim 11, wherein the parser iterates through rich media tags and coerces the rich media tags based on type, class identifier, and URL endpoints for the destination file.
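
The property/value check in claims 13 and 14 (iterate over each property/value combination and remove values that do not meet the requirements) might look roughly like the sketch below. The patterns in `CSS_VALUE_PATTERNS` are hypothetical and far narrower than the real CSS specification, and `validate_style` is an illustrative name. The naive split on `;` is also an assumption that would break on values containing quoted semicolons.

```python
import re

# Hypothetical rules: property name -> pattern its value must fully match.
CSS_VALUE_PATTERNS = {
    "color": re.compile(r"#[0-9a-fA-F]{3}|#[0-9a-fA-F]{6}|[a-zA-Z]+"),
    "font-size": re.compile(r"\d+(\.\d+)?(px|em|rem|%)"),
    "text-align": re.compile(r"left|right|center|justify"),
}


def validate_style(style):
    """Keep only declarations whose property is known and whose value matches."""
    kept = []
    for decl in style.split(";"):
        if ":" not in decl:
            continue                       # empty or malformed declaration
        prop, value = decl.split(":", 1)
        prop, value = prop.strip().lower(), value.strip()
        pattern = CSS_VALUE_PATTERNS.get(prop)
        if pattern is not None and pattern.fullmatch(value):
            kept.append(f"{prop}: {value}")
    return "; ".join(kept)
```

Using `fullmatch` rather than `search` matters in this sketch: a value such as `expression(alert(1))` must be rejected outright rather than accepted because part of it resembles a valid keyword.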

Current Assignee: Imperva Incorporated (Thales SA)
Original Assignee: Prevoty, Inc. (Thales SA)
Inventors: Anand, Kunal
Primary Examiner(s): Zecher, Dede
Assistant Examiner(s): DOAN, TRANG T
Application Number: US13/839,807
Publication Number: US 20140283139A1
Time in Patent Office: 872 Days
Field of Search
US Class Current: 1/1
CPC Class Codes:
G06F 21/54   by adding security routines...
G06F 21/554   involving event detection a...
G06F 21/6218   to a system of files or obj...
H04L 63/14   for detecting or protecting...
H04L 67/02   based on web technology, e....
