System and method of obfuscating data
Abstract
A system and method of generating index information for electronic documents. The system includes a client, a server, and one or more information retrieval (IR) systems, such as search engines, each in communication with the others via a network. In one embodiment of the invention, the server maintains a plurality of data objects that are protected by digital rights management (DRM) software. Upon receiving a network request from one of the IR systems, the server dynamically generates an electronic document that provides index information associated with one of the data objects. In one embodiment of the invention, the server dynamically generates the contents of the electronic document based upon the indexing characteristics of the IR system. Furthermore, upon receiving a network request from the client, the server determines whether the client is authorized to access the data object associated with the network request. If the client is authorized, the server transmits the data object to the client. Otherwise, the server dynamically prepares instructions for the client describing the additional steps the user at the client may perform to become authorized to access the data object.
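The request handling summarized in the abstract can be pictured with a minimal Python sketch. Everything named here is an assumption made for illustration: the `Request` type, the `handle_request` function, and the user-agent substrings in `SPIDER_MARKERS` do not appear in the patent, which does not say how the server tells an IR system from a client.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical substrings used to recognise IR-system crawlers.
SPIDER_MARKERS = ("bot", "spider", "crawler")


@dataclass
class Request:
    user_agent: str
    object_id: str
    user: Optional[str] = None


def handle_request(request, authorized_users, data_objects):
    """Return a (kind, payload) pair describing the server's response."""
    agent = request.user_agent.lower()
    if any(marker in agent for marker in SPIDER_MARKERS):
        # IR system: serve dynamically generated index information,
        # never the protected data object itself.
        return ("index", f"index information for {request.object_id}")
    if request.user in authorized_users.get(request.object_id, set()):
        # Authorized client: transmit the data object.
        return ("object", data_objects[request.object_id])
    # Unauthorized client: describe how to become authorized.
    return ("instructions", f"steps to obtain access to {request.object_id}")
```

The three branches correspond to the three outcomes the abstract describes: index information for an IR system, the data object for an authorized client, and instructions for an unauthorized one.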
12 Claims
1. A method of obfuscating the text of an electronic document in a computing environment for indexing by search engine systems, the method comprising:
receiving a request from a search engine spider for a source electronic document, wherein said source electronic document is protected by a digital rights management system such that said digital rights management system does not enable said search engine spider to have full-access to said source electronic document;
retrieving source text from said source electronic document;
parsing said source text into tokens using a delimiter based upon indexing characteristics of a search engine system for which said search engine spider requested said source electronic document, wherein for search engine systems that recognize phrases, said tokens include groups of words;
providing a stop list comprising a predefined list of tokens;
removing tokens, from parsed source text, that are listed within said stop list;
inserting randomly selected tokens into said parsed source text, wherein said random tokens are selected from said stop list;
randomizing an order of adjacent tokens of said parsed source text when said stop list has a small number of tokens;
generating a second electronic document after said inserting random tokens and said removing tokens thereby said second electronic document of obfuscated index information, such that it is difficult to reconstruct said source electronic document from said second electronic document, and such that said second electronic document adequately represents said source electronic document for use by said search engine system; and
transmitting the second electronic document to said search engine spider.
Dependent claims: 2, 3, 4, 5, 6
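The token-level steps of claim 1 (parse, remove stop-list tokens, insert random stop-list tokens, shuffle adjacent tokens) can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the stop list contents, the `SMALL_STOP_LIST` threshold, the `insert_rate` parameter, and the choice of a non-word-character delimiter are all hypothetical, and phrase tokens (groups of words) are not modeled.

```python
import random
import re

# Illustrative stop list; the claim only requires "a predefined list of tokens".
STOP_LIST = ["the", "a", "of", "and", "to", "in"]
# Hypothetical threshold for "a small number of tokens".
SMALL_STOP_LIST = 10


def obfuscate(source_text, stop_list=STOP_LIST, insert_rate=0.3, seed=None):
    rng = random.Random(seed)

    # Parse the source text into tokens; the delimiter here (any run of
    # non-word characters) is an assumption about the target search engine.
    tokens = [t for t in re.split(r"[^\w']+", source_text.lower()) if t]

    # Remove tokens that are listed within the stop list.
    stops = set(stop_list)
    kept = [t for t in tokens if t not in stops]

    # Insert randomly selected stop-list tokens after some of the kept tokens.
    mixed = []
    for token in kept:
        mixed.append(token)
        if stop_list and rng.random() < insert_rate:
            mixed.append(rng.choice(stop_list))

    # Randomize the order of adjacent tokens when the stop list is small.
    if len(stop_list) < SMALL_STOP_LIST:
        for i in range(0, len(mixed) - 1, 2):
            if rng.random() < 0.5:
                mixed[i], mixed[i + 1] = mixed[i + 1], mixed[i]

    return " ".join(mixed)
```

Because only stop-list tokens are removed or inserted, every index-worthy word of the source survives in the output, while word order and sentence structure do not.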
7. An obfuscating system for obfuscating electronic documents for indexing by a search engine system, the obfuscating system comprising:
a web server programmed for receiving a request from a search engine spider for a source HTML document at a selected URL, wherein said source HTML document is protected by a digital rights management system such that said digital rights management system does not enable said search engine spider to have full access to said source HTML document;
a tokenizer that locates tokens from said source HTML document based upon indexing characteristics of a search engine system for which said search engine spider requested said source HTML document;
a token replacer that replaces selected tokens in said source HTML document with randomly selected tokens from a reserved token list, resulting in a set of obfuscated index information;
a token order randomizer for randomizing an order of adjacent tokens in said set of obfuscated index information;
said web server further programmed for creating a second HTML document comprising a description metatag containing an intelligible excerpt regarding the content of said second HTML document, a hyperlink that links to said source HTML document, and said second HTML document comprising obfuscated index information, wherein the intelligibility of the contents of said second HTML document is reduced without interfering with the ability of said search engine system properly index and retrieve said source HTML document; and
said web server further programmed to transmit said second HTML of obfuscated index information to said search engine spider.
Dependent claims: 8, 9
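A rough Python sketch of how the token replacer, token order randomizer, and second HTML document of claim 7 might fit together is shown below. The function name `build_index_page`, the `replace_rate` parameter, and the page layout are assumptions for illustration; the claim does not specify how tokens are selected for replacement or how the page is laid out beyond the description metatag and the hyperlink.

```python
import html
import random


def build_index_page(tokens, reserved_tokens, source_url, excerpt,
                     replace_rate=0.2, seed=None):
    rng = random.Random(seed)

    # Token replacer: swap some tokens for randomly selected reserved tokens.
    replaced = [
        rng.choice(reserved_tokens) if rng.random() < replace_rate else token
        for token in tokens
    ]

    # Token order randomizer: randomly swap adjacent pairs of tokens.
    for i in range(0, len(replaced) - 1, 2):
        if rng.random() < 0.5:
            replaced[i], replaced[i + 1] = replaced[i + 1], replaced[i]

    # Second HTML document: intelligible description metatag, obfuscated
    # index information in the body, and a hyperlink to the source document.
    body = html.escape(" ".join(replaced))
    return (
        "<html><head>"
        f'<meta name="description" content="{html.escape(excerpt, quote=True)}">'
        "</head><body>"
        f"<p>{body}</p>"
        f'<a href="{html.escape(source_url, quote=True)}">Source document</a>'
        "</body></html>"
    )
```

The excerpt stays readable for search result listings, while the body carries index terms in a form that is hard to read back as the original text.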
10. A method of obfuscating the text of an electronic document for indexing by search engine systems, the method comprising:
receiving a request from a search engine spider for a first electronic document, wherein said first electronic document is protected by a digital rights management system such that said digital rights management system does not enable said search engine spider to have full-access to said source electronic document;
retrieving text from said first electronic document;
parsing said text into tokens using a delimiter based upon indexing characteristics of a search engine system for which said search engine spider requested said first electronic document, wherein when said search engine systems that recognize phrases said tokens include groups of words, thereby said parsing creating an initial set of index information;
identifying one or more words from said initial set of index information that are each a member of a selected classification of words;
discarding any identified words from said initial set of index information so as to retain index words, thereby modifying said set of index information;
generating a second electronic document from said modified set of index information thereby said second electronic document is an electronic document of obfuscated index information, such that it is difficult to reconstruct said first electronic document from said second electronic document, and such that said second electronic document adequately represents said first electronic document for use by said search engine system; and
transmitting the second electronic document to a search engine spider.
Dependent claims: 11, 12
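The identifying and discarding steps of claim 10 amount to filtering tokens by word class. A small Python sketch, assuming a hypothetical lookup table `WORD_CLASSES` and the class name "article" purely as an example of a "selected classification of words":

```python
# Hypothetical word-class lookup; a real system might use a part-of-speech tagger.
WORD_CLASSES = {"the": "article", "a": "article", "an": "article"}


def classify(token):
    """Return the word class of a token, or "other" if it is not listed."""
    return WORD_CLASSES.get(token.lower(), "other")


def discard_classified(tokens, classify_fn, selected_class):
    """Drop every token whose class matches selected_class; keep the index words."""
    return [t for t in tokens if classify_fn(t) != selected_class]
```

For instance, calling `discard_classified("the cat saw a bird".split(), classify, "article")` keeps only `cat`, `saw`, and `bird`.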
Specification