Methods and systems to fingerprint textual information using word runs
First Claim
1. A computer implemented method for preventing unauthorized disclosure of secure information, the computer implemented method comprising:
receiving information including a first text, by a computer system having at least a processor for executing instructions, said first text including a plurality of words;
normalizing, by said computer system, said first text into a first canonical text expression, said first canonical text expression including a plurality of normalized words;
generating, at said computer system, a first word hash list for said first canonical text expression, where said first word hash list is generated at a word level;
generating, at said computer system, a first set of fingerprints for said first word hash list;
wherein generating said first word hash list includes converting said plurality of normalized words into a plurality of word-value hashes, each specific one of said word-value hashes representing a specific normalized word; and
wherein said generating said first set of fingerprints includes:
assigning a sliding window of size W, wherein said sliding window is used for reading a W number of said word-value hashes from said first word hash list;
using said sliding window to read said W number of said word-level hashes from said first word hash list;
designating said word-value hash with a distinct value within said sliding window as an anchor; and
generating a fingerprint using a fingerprint hash function, wherein said fingerprint hash function is applied over all said word-value hashes contained within a start of said sliding window to where said anchor resides in said sliding window.
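The steps recited in claim 1 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation. The claim does not specify several details, so the following are assumptions: the normalization rule (lowercasing and keeping alphanumeric runs), the use of truncated SHA-1 for both the word-value hash and the fingerprint hash, choosing the minimum hash value in the window as the "distinct value" anchor, and advancing the window one word at a time. The function names normalize, word_hash, and fingerprints are hypothetical.

```python
import hashlib
import re


def normalize(text):
    """Reduce text to a canonical list of words (assumed rule: lowercase alphanumerics)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_hash(word):
    """Map one normalized word to a 32-bit word-value hash (assumed: truncated SHA-1)."""
    return int.from_bytes(hashlib.sha1(word.encode("utf-8")).digest()[:4], "big")


def fingerprints(text, w=4):
    """Return the set of fingerprints for text using a sliding window of w word hashes."""
    hashes = [word_hash(word) for word in normalize(text)]
    result = set()
    # Assumption: the window advances one word at a time; texts shorter
    # than w words produce no fingerprints in this sketch.
    for start in range(max(len(hashes) - w + 1, 0)):
        window = hashes[start:start + w]
        # Assumption: the anchor is the smallest word-value hash in the window.
        anchor = window.index(min(window))
        # Hash every word-value hash from the window start through the anchor.
        run = window[:anchor + 1]
        data = b"".join(h.to_bytes(4, "big") for h in run)
        result.add(hashlib.sha1(data).hexdigest())
    return result
```

Under these assumptions, two texts can be compared by intersecting their fingerprint sets, for example `fingerprints(secure_text) & fingerprints(outgoing_text)`; a non-empty intersection suggests a shared word run. Because normalization happens before hashing, differences in case and punctuation do not change the result.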
Abstract
The present invention provides methods and systems to enable fast, efficient, and scalable means for fingerprinting textual information using word runs. The present system receives textual information and provides algorithms to convert the information into representative fingerprints. In one embodiment, the fingerprints are recorded in a repository to maintain a database of an organization's secure data. In another embodiment, textual information entered by a user is verified against the repository of fingerprints to prevent unauthorized disclosure of secure data. This invention provides approaches to allow derivative works (e.g., different ordering of words, substitution of words with synonyms, etc.) of the original information to be detected at the sentence level or even at the paragraph level. This invention also provides means for enhancing storage and resource efficiencies by providing approaches to optimize the number of fingerprints generated for the textual information.
31 Claims
Specification