METHODS AND SYSTEMS TO FINGERPRINT TEXTUAL INFORMATION USING WORD RUNS
First Claim
1. A system to prevent unauthorized disclosure of secure information, the system comprising:
a processor;
a memory;
a processing component configured to:
receive information including a first text, wherein the first text includes a plurality of words;
normalize the first text into a first canonical text expression, the first canonical text expression including a plurality of normalized words;
generate a first word hash list for the first canonical text expression, wherein the first word hash list is generated at a word level; and
generate one or more fingerprints for the first word hash list, wherein the generation of one or more fingerprints includes:
assigning a sliding window of size W, wherein W specifies a number of word-value hashes to read from the first word hash list;
using the sliding window to read the W word-value hashes from the first word hash list;
designating an anchor word-value hash for the sliding window by selecting a distinct-valued word-value hash among the W word-value hashes; and
applying a fingerprint hash function to all words starting from a first word-value hash to the anchor word-value hash, wherein applying the fingerprint hash function generates the one or more fingerprints.
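The fingerprint-generation steps recited above can be sketched as follows. This is an illustrative reading only, not the claimed implementation: the claim does not state how the anchor is selected or how the window advances, so the sketch assumes the anchor is the smallest word-value hash in the window (leftmost on ties) and that the next window begins immediately after the anchor. The function name `fingerprint_runs` and the choice of SHA-1 as the fingerprint hash function are hypothetical.

```python
import hashlib
from typing import List


def fingerprint_runs(word_hashes: List[int], w: int) -> List[str]:
    """Fingerprint word runs from a word hash list using a window of size w.

    Assumptions (not taken from the claim text):
      * the anchor is the minimum hash in the window, leftmost on ties;
      * the next window starts right after the anchor;
      * each word-value hash fits in 8 bytes (0 <= h < 2**64).
    """
    if w < 1:
        raise ValueError("window size W must be at least 1")

    fingerprints: List[str] = []
    start = 0
    while start < len(word_hashes):
        # Read up to W word-value hashes from the current position.
        window = word_hashes[start:start + w]
        # Designate the anchor (assumed rule: smallest value in the window).
        anchor_offset = window.index(min(window))
        # The run spans the first hash in the window through the anchor.
        run = window[:anchor_offset + 1]
        # Apply the fingerprint hash function (SHA-1 here) to the whole run.
        data = b"".join(h.to_bytes(8, "big") for h in run)
        fingerprints.append(hashlib.sha1(data).hexdigest())
        # Advance past the anchor (assumed advancement rule).
        start += anchor_offset + 1
    return fingerprints
```

Under these assumptions, with W = 3 and the word hash list [5, 3, 9, 1, 7], the sketch fingerprints the runs [5, 3], [9, 1], and [7]. Because each run ends at a content-defined anchor rather than at a fixed offset, the same run of words tends to produce the same fingerprint even when the surrounding text changes.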
Abstract
The present invention provides methods and systems to enable fast, efficient, and scalable means for fingerprinting textual information using word runs. The present system receives textual information and provides algorithms to convert the information into representative fingerprints. In one embodiment, the fingerprints are recorded in a repository to maintain a database of an organization's secure data. In another embodiment, textual information entered by a user is verified against the repository of fingerprints to prevent unauthorized disclosure of secure data. This invention provides approaches to allow derivative works (e.g., different ordering of words, substitution of words with synonyms, etc.) of the original information to be detected at the sentence level or even at the paragraph level. This invention also provides methods and systems for enhancing storage and resource efficiencies by providing approaches to optimize the number of fingerprints generated for the textual information.
20 Claims
14. A computer implemented method for preventing unauthorized disclosure of secure information, the computer implemented method comprising:
storing a plurality of secure text fingerprints for a given organization, wherein each of the plurality of secure text fingerprints is generated using a fixed window word run hashing;
receiving a first text that a user desires to transmit outside of the given organization;
generating a first set of fingerprints for the first text using the fixed window word run hashing, wherein generating a first set of fingerprints includes:
converting a plurality of normalized words into a plurality of word-value hashes to create an original word hash list, wherein each word-value hash represents a specific normalized word;
assigning a sliding window of size W, wherein W specifies a number of word-value hashes to read from the original word hash list;
using said sliding window to read said W word-value hashes from the original word hash list;
designating an anchor word-value hash for the sliding window by selecting a distinct-valued word-value hash among said W word-value hashes; and
applying a fingerprint hash function to all words starting from a first word-value hash to the anchor word-value hash, wherein applying the fingerprint hash function generates the first set of fingerprints;
determining whether any of the first set of fingerprints is identical to any of the plurality of secure text fingerprints; and
taking a security action when any of the first set of fingerprints is identical to any of the plurality of secure text fingerprints.
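The storing, comparing, and security-action steps of this method can be sketched as a set lookup. This is a minimal sketch under stated assumptions, not the claimed implementation: the class `FingerprintRepository`, the function `screen_outgoing`, and the `on_match` callback are hypothetical names, fingerprints are assumed to be strings, and the "security action" is left to the caller because the claim does not specify one.

```python
from typing import Callable, Iterable, Set


class FingerprintRepository:
    """In-memory store of an organization's secure text fingerprints."""

    def __init__(self) -> None:
        self._secure: Set[str] = set()

    def register(self, fingerprints: Iterable[str]) -> None:
        # Record fingerprints generated from secure text.
        self._secure.update(fingerprints)

    def matches(self, fingerprints: Iterable[str]) -> Set[str]:
        # Return the fingerprints identical to a stored secure fingerprint.
        return self._secure.intersection(fingerprints)


def screen_outgoing(
    repo: FingerprintRepository,
    fingerprints: Iterable[str],
    on_match: Callable[[Set[str]], None],
) -> bool:
    """Return True if the outgoing text may be sent, False if it was flagged.

    `on_match` stands in for the security action (hypothetical: it might
    block the transmission, log the event, or alert an administrator).
    """
    hits = repo.matches(fingerprints)
    if hits:
        on_match(hits)
        return False
    return True
```

A set gives constant-time average lookup per fingerprint, so the cost of screening grows with the number of fingerprints in the outgoing text rather than with the size of the repository.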
15. A computer implemented method for preventing unauthorized disclosure of secure information as recited in claim 14, wherein the fixed window word run hashing comprises:
receiving information including an original text, the original text including a plurality of words;
normalizing said original text into an original canonical text expression, the original canonical text expression including a plurality of normalized words;
generating an original word hash list for the original canonical text expression, wherein the original word hash list is generated at a word level, wherein the original word hash list includes a plurality of word-value hashes; and
generating an original set of fingerprints for the original word hash list.
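The normalization and word-level hashing steps can be illustrated with the sketch below. The claim does not define what normalization involves or which hash produces the word-value hashes, so everything here is an assumption: normalization is taken to mean lower-casing and keeping only runs of ASCII letters and digits, and each word-value hash is taken to be the first four bytes of the word's SHA-256 digest. The names `normalize` and `word_hash_list` are hypothetical.

```python
import hashlib
import re
from typing import List


def normalize(text: str) -> List[str]:
    """Reduce text to a canonical list of normalized words.

    Assumed rules: lower-case the text and keep only runs of ASCII
    letters and digits, which drops punctuation and extra whitespace.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def word_hash_list(words: List[str]) -> List[int]:
    """Map each normalized word to a 32-bit word-value hash.

    Assumed hash: the first 4 bytes of the word's SHA-256 digest.
    """
    return [
        int.from_bytes(hashlib.sha256(w.encode("utf-8")).digest()[:4], "big")
        for w in words
    ]


# Differently formatted copies of the same words yield the same hash list.
a = word_hash_list(normalize("The  Quick, brown FOX!"))
b = word_hash_list(normalize("the quick brown fox"))
```

Hashing at the word level means the later windowing step operates on fixed-size integers rather than variable-length strings, and identical words always map to identical values.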
Specification