Systems and methods for tokenizing user-generated content to enable the prevention of attacks

US 9,313,223 B2
Filed: 03/15/2013
Issued: 04/12/2016
Est. Priority Date: 03/15/2013
Status: Active Grant

10 Assignments

0 Petitions

Abstract

The present invention relates to systems and methods for the tokenization of user-generated content in order to prevent attacks on the user-generated content. The systems and methods initially pre-process the user-generated content string utilizing a secondary input of target language. Pre-processing may also include initialization of finite state machines, token markers and string buffers (text, HTML tag name, HTML attribute name, HTML attribute value, CSS selector, CSS property name, and CSS property value). The user-generated content string is scanned by rune, and the system sends each rune to a specific buffer based upon signaling by individual finite state machine states. Buffers are then converted to token stream nodes to be inserted into the token stream. The tokens represent a string of characters and are symbolically categorized according to activated finite state machine states.
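
The scanning step described in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: the state names, the `tokenize` function, and the `(category, text)` node shape are assumptions made for illustration, only four of the seven listed buffers are modeled (no CSS), and a Python code point stands in for a "rune".

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical finite state machine states; one string buffer per state.
    TEXT = auto()
    TAG_NAME = auto()
    ATTR_NAME = auto()
    ATTR_VALUE = auto()


def tokenize(content: str) -> list[tuple[str, str]]:
    """Scan content one rune at a time and return (category, text) nodes."""
    stream: list[tuple[str, str]] = []
    buffers = {state: [] for state in State}
    state = State.TEXT
    quote = ""  # active quote character while inside a quoted attribute value

    def flush(s: State) -> None:
        # Convert a non-empty buffer into a token stream node, then clear it.
        if buffers[s]:
            stream.append((s.name, "".join(buffers[s])))
            buffers[s].clear()

    for rune in content:
        if state is State.TEXT:
            if rune == "<":
                # Existing text becomes a text node at the start of a tag.
                flush(State.TEXT)
                state = State.TAG_NAME
            else:
                buffers[State.TEXT].append(rune)
        elif state is State.TAG_NAME:
            if rune == ">":
                flush(State.TAG_NAME)
                state = State.TEXT
            elif rune.isspace():
                flush(State.TAG_NAME)
                state = State.ATTR_NAME
            else:
                buffers[State.TAG_NAME].append(rune)
        elif state is State.ATTR_NAME:
            if rune == "=":
                flush(State.ATTR_NAME)
                state = State.ATTR_VALUE
                quote = ""
            elif rune == ">":
                flush(State.ATTR_NAME)
                state = State.TEXT
            elif rune.isspace():
                flush(State.ATTR_NAME)
            else:
                buffers[State.ATTR_NAME].append(rune)
        else:  # State.ATTR_VALUE
            if not quote and not buffers[State.ATTR_VALUE] and rune in "\"'":
                quote = rune
            elif quote and rune == quote:
                flush(State.ATTR_VALUE)
                state = State.ATTR_NAME
                quote = ""
            elif not quote and (rune.isspace() or rune == ">"):
                flush(State.ATTR_VALUE)
                state = State.TEXT if rune == ">" else State.ATTR_NAME
            else:
                buffers[State.ATTR_VALUE].append(rune)

    # Insert whatever remains in the buffers as final nodes.
    for s in State:
        flush(s)
    return stream
```

A caller that receives the node list instead of the raw string can inspect each category separately, for example rejecting an `ATTR_VALUE` node that begins with `javascript:`.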

18 Claims

1. A method for tokenizing user-generated content, using a computer implemented security system, the method comprising:
- pre-processing a user-generated content input string utilizing a secondary input of target language, wherein the pre-preprocessing converts existing text into a token stream text node at the start of an HTML tag for insertion into a token stream; and
  
  extracting tokens, using a processor, from the pre-processed user-generated content string to generate the token stream, wherein the token stream is yielded to a caller rather than the user-generated content to prevent attacks on the user-generated content;
  
  wherein the extraction of tokens from the pre-processed user-generated content requires scanning the pre-processed user-generated content string by individual runes, and sending each rune to a specific buffer based upon signaling individual finite state machines.
- Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9)
  - 2. The method of claim 1, wherein the tokens represent a string of characters or equivalent in bytes.
  - 3. The method of claim 2, wherein the tokens are categorized according to rules as a symbol.
  - 4. The method of claim 3, wherein the extraction is performed through an activation of finite state machines.
  - 5. The method of claim 4, wherein the tokens belong to one of a plurality of buffers, wherein the plurality of buffers include at least one of text, HTML tag name, HTML attribute name, HTML attribute value, CSS selector, CSS property name, and CSS property value.
  - 6. The method of claim 5, further comprising a conversion of buffers into nodes and subsequently appending those nodes to the token stream.
  - 7. The method of claim 6, further comprising inserting remaining buffers that did not activate finite state machine states into the token stream as token stream nodes.
  - 8. The method of claim 7, further comprising returning the token stream, comprised of token stream nodes.
  - 9. The method of claim 5, further comprising clearing finite state machine variables, token markers, and string buffers at the end of the tokenizing.

10. A security system for tokenizing user-generated content comprising:
- a pre-processor configured to process a user-generated content input string utilizing a secondary input of target language, wherein the pre-processor converts existing text into a token stream text node at the start of an HTML tag for insertion into a token stream; and
  
  a tokenizer, including a processor, configured to extract tokens from the pre-processed user-generated content string to generate the token stream, wherein the token stream is yielded to a caller rather than the user-generated content to prevent attacks on the user-generated content;
  
  wherein the tokenizer scans the pre-processed user-generated content by individual runes, and sends each rune to a specific buffer based upon signaling individual finite state machine states.
- Dependent Claims (11, 12, 13, 14, 15, 16, 17, 18)
  - 11. The system of claim 10, wherein the tokens represent a string of characters or equivalent in bytes.
  - 12. The system of claim 11, wherein the tokens are categorized according to rules as a symbol.
  - 13. The system of claim 12, wherein the extraction of tokens is performed through activation of finite state machines.
  - 14. The system of claim 13, wherein the tokens belong to one of a plurality of buffers, wherein the plurality of buffers include at least one of text, HTML tag name, HTML attribute name, HTML attribute value, CSS selector, CSS property name, and CSS property value.
  - 15. The system of claim 14, wherein the tokenizer is further configured to convert buffers into strings which are HTML encoded, and appending the strings to the token stream as token stream nodes.
  - 16. The system of claim 15, wherein the tokenizer is further configured to insert remaining buffers that did not activate finite state machines into the token stream as token stream nodes.
  - 17. The system of claim 16, wherein the tokenizer is further configured to return the token stream full of token stream nodes.
  - 18. The system of claim 14, wherein the tokenizer is further configured to clear finite state machine variables, tokenizer markers, and string buffers at the end of tokenizing.
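
The buffer handling recited in claims 15, 16, and 18 (HTML-encoding buffer contents, appending them as token stream nodes, and clearing the buffers) might look roughly like the sketch below. The function name `buffers_to_nodes`, the dictionary-of-lists buffer layout, and the `(category, text)` node shape are assumptions for illustration and do not come from the patent; `html.escape` is the Python standard-library encoder.

```python
import html


def buffers_to_nodes(
    buffers: dict[str, list[str]],
    stream: list[tuple[str, str]],
) -> None:
    """Append each non-empty buffer to stream as an HTML-encoded node, then clear it."""
    for category, runes in buffers.items():
        if runes:
            # Encode <, >, &, and quote characters before the node joins the stream.
            encoded = html.escape("".join(runes), quote=True)
            stream.append((category, encoded))
        # Clear the buffer so no state carries over to the next tokenizing run.
        runes.clear()
```

Encoding at the moment a buffer becomes a node means every string in the returned stream is already safe to place in HTML text or a quoted attribute value, whichever buffer it came from.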

Current Assignee: Imperva Incorporated (Thales SA)
Original Assignee: Prevoty, Inc. (Thales SA)
Inventors: Anand, Kunal
Primary Examiner(s): Zecher, Dede
Assistant Examiner(s): Avery, Jeremiah L
Application Number: US13/839,622
Publication Number: US 20140283033A1
Time in Patent Office: 1,124 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
- G06F 21/554   involving event detection a...
- G06F 2221/031   Protect user input by softw...
- G06F 2221/2119   Authenticating web pages, e...
- H04L 63/1416   Event detection, e.g. attac...
- H04L 63/1441   Countermeasures against mal...
