Method, system, and computer-readable medium for filtering harmful HTML in an electronic document
First Claim
1. A computer-implemented method for filtering harmful markup language from an electronic document in a computer system, comprising:
receiving the electronic document, wherein the electronic document comprises data including a plurality of markup language content;
parsing the data received from the electronic document for the plurality of markup language content;
comparing the plurality of markup language content with a list of known markup language content in a library, wherein the library comprises a set of security rules for identifying safe and harmful markup language content, wherein the harmful markup language content comprises malicious markup language as determined by the set of security rules, wherein the set of security rules comprises flags which describe various levels of security and wherein the flags are passed to an application program interface;
determining a presence of safe markup language content and harmful markup language content in the electronic document based on the content library, wherein determining the presence of harmful markup language content comprises:
determining whether any of the plurality of markup language content is in the library; and
if any of the plurality of markup language content is not listed in the library, then determining that the content not listed in the library comprises the presence of harmful markup language content; and
removing the harmful markup language content from the electronic document.
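The security-rule flags recited above can be illustrated with a short sketch. The flag names, library contents, and function below are hypothetical, not drawn from the patent; they only show how level-of-security flags passed to a filtering interface could gate which known markup counts as safe, with unlisted markup treated as harmful by default.

```python
from enum import Flag, auto

# Hypothetical security-level flags of the kind the claim describes being
# passed to the application program interface; names are illustrative only.
class SecurityLevel(Flag):
    ALLOW_FORMATTING = auto()   # e.g. <b>, <i>
    ALLOW_LINKS = auto()        # e.g. <a href=...>
    ALLOW_IMAGES = auto()       # e.g. <img src=...>

# Hypothetical library: known markup mapped to the flag that must be set
# for that markup to be considered safe under the active security rules.
LIBRARY = {
    "b": SecurityLevel.ALLOW_FORMATTING,
    "i": SecurityLevel.ALLOW_FORMATTING,
    "a": SecurityLevel.ALLOW_LINKS,
    "img": SecurityLevel.ALLOW_IMAGES,
}

def is_harmful(element: str, rules: SecurityLevel) -> bool:
    """Markup not listed in the library is harmful by definition; listed
    markup is harmful unless the active security rules enable it."""
    required = LIBRARY.get(element)
    if required is None:          # not in the library -> harmful
        return True
    return not (rules & required)
```

Note the default-deny behavior: `is_harmful` never returns `False` for an element absent from the library, which mirrors the claim's treatment of unlisted content as harmful.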
Abstract
A method and system are provided for filtering harmful HTML content from an electronic document. An application program interface (API) examines the fundamental structure of the HTML content in the document. The HTML content in the electronic document is parsed into HTML elements and attributes by a tokenizer and compared to a content library by a filter in the API. The filter removes unknown HTML content as well as known content that is listed as harmful in the content library. After the harmful HTML content has been removed, a new document is encoded that includes the remaining safe HTML content for viewing in a web browser.
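The abstract's parse, compare, and re-encode flow can be sketched with Python's standard `html.parser`. The allowlisted element and attribute sets below are placeholders, not the patent's content library, and this minimal sketch does not suppress text inside removed containers such as `<script>`, which a real filter would also need to handle.

```python
from html.parser import HTMLParser

# Placeholder content library: an allowlist of known-safe elements and
# attributes; anything not listed is treated as harmful and removed.
SAFE_ELEMENTS = {"p", "b", "i", "em", "strong", "ul", "ol", "li", "a", "br"}
SAFE_ATTRIBUTES = {"href", "title"}

class AllowlistFilter(HTMLParser):
    """Tokenizes HTML into elements/attributes and keeps only allowlisted content."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []  # surviving tokens, re-encoded at the end

    def handle_starttag(self, tag, attrs):
        if tag in SAFE_ELEMENTS:
            kept = " ".join(f'{k}="{v}"' for k, v in attrs if k in SAFE_ATTRIBUTES)
            self.out.append(f"<{tag} {kept}>" if kept else f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in SAFE_ELEMENTS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)  # plain text passes through unchanged

def filter_html(document: str) -> str:
    f = AllowlistFilter()
    f.feed(document)
    f.close()
    return "".join(f.out)
```

Unknown tags (and non-allowlisted attributes such as event handlers) are simply never emitted, which corresponds to the abstract's removal of unknown content.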
25 Claims
1. A computer-implemented method for filtering harmful markup language from an electronic document in a computer system, comprising:
receiving the electronic document, wherein the electronic document comprises data including a plurality of markup language content;
parsing the data received from the electronic document for the plurality of markup language content;
comparing the plurality of markup language content with a list of known markup language content in a library, wherein the library comprises a set of security rules for identifying safe and harmful markup language content, wherein the harmful markup language content comprises malicious markup language as determined by the set of security rules, wherein the set of security rules comprises flags which describe various levels of security and wherein the flags are passed to an application program interface;
determining a presence of safe markup language content and harmful markup language content in the electronic document based on the content library, wherein determining the presence of harmful markup language content comprises:
determining whether any of the plurality of markup language content is in the library; and
if any of the plurality of markup language content is not listed in the library, then determining that the content not listed in the library comprises the presence of harmful markup language content; and
removing the harmful markup language content from the electronic document.
(Dependent claims: 2-12)
13. A computer-readable medium having computer-executable components for filtering harmful markup language from an electronic document in a computer system, comprising:
a parser component for parsing data received from the electronic document for the plurality of markup language content; and
a filter component for comparing the plurality of markup language content with a list of known markup language content in a library, wherein the filter component determines the presence of harmful markup language content in the electronic document by:
determining whether any of the plurality of markup language content is in the library; and
if any of the plurality of markup language content is not listed in the library, then determining that the content not listed in the library includes the harmful markup language content;
wherein the library comprises a set of security rules for identifying safe and harmful markup language content, wherein the harmful markup language content comprises malicious markup language as determined by the set of security rules;
an application program interface for receiving the set of security rules, wherein the set of security rules comprises flags which describe various levels of security;
determining the presence of safe markup language content in the electronic document as included on the list of known markup language content in the library; and
removing the harmful markup language content from the electronic document.
(Dependent claims: 14-19)
20. A system for filtering harmful markup language content from an electronic document comprising:
a computer for receiving the electronic document, wherein the electronic document comprises data including a plurality of markup language content;
an application program interface, stored on the computer, for filtering the electronic document and receiving a set of security rules, wherein the set of security rules comprises flags which describe various levels of security;
a decoder for decoding the data in the received electronic document;
a tokenizer for parsing the data into a plurality of tokens defining the structure of the electronic document;
a filter for:
receiving the tokens from the tokenizer;
determining whether any of the tokens represent harmful markup language content by:
comparing each token to a list of tokens in a content library stored in the filter; and
determining that a token represents harmful markup language content when the token is not in the list of tokens in the content library, wherein the harmful markup language content comprises malicious markup language as determined by the set of security rules; and
removing the tokens representing the harmful content;
a detokenizer for regenerating the tokens not representing harmful markup language content into new data; and
an encoder for encoding the new data into a new document.
(Dependent claims: 21-25)
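The decode, tokenize, filter, detokenize, encode pipeline recited in claim 20 might look like the following minimal sketch. The regex tokenizer, token library, and function names are assumptions for illustration, and text inside removed elements is not suppressed here.

```python
import re

# Hypothetical token library: tag names listed as known/safe; any tag token
# not in this list is treated as harmful, per the claim's default-deny rule.
TOKEN_LIBRARY = {"p", "b", "i"}
TAG_RE = re.compile(r"(<[^>]+>)")

def decode(raw: bytes) -> str:
    return raw.decode("utf-8")

def tokenize(data: str) -> list:
    # Split the data into tag tokens and text tokens that together
    # define the structure of the document.
    return [t for t in TAG_RE.split(data) if t]

def is_harmful(token: str) -> bool:
    if not token.startswith("<"):
        return False                  # plain text passes through
    parts = token.strip("</>").split()
    return not parts or parts[0].lower() not in TOKEN_LIBRARY

def detokenize(tokens: list) -> str:
    # Regenerate the surviving tokens into new data.
    return "".join(tokens)

def encode(data: str) -> bytes:
    return data.encode("utf-8")

def filter_document(raw: bytes) -> bytes:
    tokens = tokenize(decode(raw))
    safe = [t for t in tokens if not is_harmful(t)]
    return encode(detokenize(safe))
```

Each stage maps to one claimed component: `decode` to the decoder, `tokenize` to the tokenizer, the comprehension over `is_harmful` to the filter, `detokenize` to the detokenizer, and `encode` to the encoder producing the new document.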
Specification