High-accuracy confidential data detection
Abstract
A method and apparatus for providing accurate detection of confidential information is described. In one embodiment, the method includes searching a text document for multiple classified data patterns associated with confidential information that is represented as personal identifiers. The method further includes finding, in the text document, one or more personal identifier candidates matching any of the classified data patterns, and validating each of the personal identifier candidates using one or more personal identifier validators to provide accurate detection of the confidential information in the text document.
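The search-then-validate flow described in the abstract can be sketched in Python as follows. This is only an illustration under stated assumptions: the two Social Security number formats, the names CLASSIFIED_PATTERNS, find_candidates, ssn_structure_validator, and detect, and the choice of a structural SSN check as the validator are all hypothetical and do not come from the patent text.

```python
import re
from typing import Callable, Iterator, List

# Hypothetical classified patterns: two common ways a U.S. SSN is written.
# These formats are assumptions for illustration, not taken from the patent.
CLASSIFIED_PATTERNS = {
    "ssn_dashed": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "ssn_spaced": re.compile(r"\d{3} \d{2} \d{4}"),
}


def find_candidates(text: str) -> Iterator["re.Match"]:
    """Yield every span whose format matches any classified pattern."""
    for pattern in CLASSIFIED_PATTERNS.values():
        yield from pattern.finditer(text)


def ssn_structure_validator(candidate: str) -> bool:
    """Assumed validator: reject digit groups that are never issued."""
    digits = re.sub(r"\D", "", candidate)
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    if area in ("000", "666") or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"


# Any number of validators can be registered; a candidate must pass all of them.
VALIDATORS: List[Callable[[str], bool]] = [ssn_structure_validator]


def detect(text: str) -> List[str]:
    """Return candidates that pass every validator; the rest are false positives."""
    return [
        m.group()
        for m in find_candidates(text)
        if all(validate(m.group()) for validate in VALIDATORS)
    ]
```

Keeping the pattern table separate from the validator list means a new format variation or a new check can be added without touching the search loop.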
19 Claims
1. A computer-implemented method comprising:
storing, in memory, a plurality of classified data patterns for personal identifiers, the plurality of classified data patterns corresponding to variations of personal identifier formats, the personal identifiers including confidential information of a plurality of entities;
identifying a text document to be searched;
searching the text document for data expressed in a format that matches any of the plurality of classified data patterns corresponding to the variations of personal identifier formats;
finding, in the text document, one or more sets of data having the format that matches any of the plurality of classified data patterns, each found set of data representing a possible candidate of a personal identifier; and
validating each of the found sets of data from the text document using one or more personal identifier validators to determine whether the found set of data is a personal identifier or a false positive, wherein validating each of the found sets of data from the text document comprises:
eliminating false positives based on data immediately preceding or following each of the personal identifier candidates, wherein the data immediately preceding or following a personal identifier candidate indicates a false positive if the personal identifier candidate is immediately preceded, or immediately followed by, a number or a letter.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9
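The boundary rule in the final element of claim 1 (a candidate is a false positive when the character directly before or after it is a number or a letter) might look like the following in Python. The pattern CANDIDATE and the function names are hypothetical, chosen only to make the sketch runnable; the claim itself does not prescribe any particular implementation.

```python
import re
from typing import List

# Hypothetical pattern for one identifier format (dashed nine-digit number).
CANDIDATE = re.compile(r"\d{3}-\d{2}-\d{4}")


def is_letter_or_number(ch: str) -> bool:
    """True for a single letter or digit; False for '' or punctuation/space."""
    return ch.isalpha() or ch.isdigit()


def boundary_false_positive(text: str, start: int, end: int) -> bool:
    """True if the character right before or right after the span is a letter or number."""
    before = text[start - 1] if start > 0 else ""
    after = text[end] if end < len(text) else ""
    return is_letter_or_number(before) or is_letter_or_number(after)


def validated_candidates(text: str) -> List[str]:
    """Keep only format matches that are not glued to a neighboring letter or digit."""
    return [
        m.group()
        for m in CANDIDATE.finditer(text)
        if not boundary_false_positive(text, m.start(), m.end())
    ]
```

A match at the very start or end of the document has no neighbor on that side, so the sketch treats that side as clean.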
10. A system comprising:
a memory to store a plurality of classified data patterns for personal identifiers, the plurality of classified data patterns corresponding to variations of personal identifier formats, the personal identifiers including confidential information of a plurality of entities;
a processor, coupled to the memory;
a search engine, executed from the memory by the processor, to identify a text document to be searched, to search the text document for data expressed in a format that matches any of the plurality of classified data patterns corresponding to the variations of personal identifier formats, and to find, in the text document, one or more sets of data having the format that matches any of the plurality of classified data patterns, each found set of data representing a possible candidate of a personal identifier; and
a validation engine, coupled to the search engine, to validate each of the found sets of data from the text document using one or more personal identifier validators to determine whether the found set of data is a personal identifier or a false positive, wherein the validation engine is to validate each of the found sets of data from the text document based on eliminating false positives based on data immediately preceding or following each of the personal identifier candidates, wherein the data immediately preceding or following a personal identifier candidate indicates a false positive if the personal identifier candidate is immediately preceded, or immediately followed by, a number or a letter.
Dependent claims: 11, 12, 13, 14
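One way to picture the coupling between the search engine and the validation engine of claim 10 is as two small Python classes, with the validation engine holding a reference to the search engine. The class names mirror the claim's terms, but the Candidate tuple, the method names, and the clean_boundaries validator are assumptions made for this sketch, not the patented design.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Candidate = Tuple[int, int, str]  # (start offset, end offset, matched text)


@dataclass
class SearchEngine:
    """Holds the stored patterns and finds format matches in a document."""
    patterns: List["re.Pattern"]

    def find(self, document: str) -> List[Candidate]:
        return [
            (m.start(), m.end(), m.group())
            for p in self.patterns
            for m in p.finditer(document)
        ]


@dataclass
class ValidationEngine:
    """Consumes the search engine's candidates and applies each validator."""
    search: SearchEngine
    validators: List[Callable[[str, Candidate], bool]] = field(default_factory=list)

    def detect(self, document: str) -> List[str]:
        return [
            c[2]
            for c in self.search.find(document)
            if all(v(document, c) for v in self.validators)
        ]


def clean_boundaries(document: str, candidate: Candidate) -> bool:
    """Hypothetical validator: pass only if no letter or digit touches the span."""
    start, end, _ = candidate
    neighbors = document[max(start - 1, 0):start] + document[end:end + 1]
    return not any(ch.isalpha() or ch.isdigit() for ch in neighbors)
```

A caller would build the pair once, for example ValidationEngine(SearchEngine([re.compile(r"\d{3}-\d{2}-\d{4}")]), [clean_boundaries]), and then call detect on each document.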
15. A non-transitory computer readable storage medium that provides instructions, which when executed on a processing system cause the processing system to perform a method comprising:
storing, in memory, a plurality of classified data patterns for personal identifiers, the plurality of classified data patterns corresponding to variations of personal identifier formats, the personal identifiers including confidential information of a plurality of entities;
identifying a text document to be searched;
searching the text document for data expressed in a format that matches any of the plurality of classified data patterns corresponding to the variations of personal identifier formats;
finding, in the text document, one or more sets of data having the format that matches any of the plurality of classified data patterns, each found set of data representing a possible candidate of a personal identifier; and
validating each of the found sets of data from the text document using one or more personal identifier validators to determine whether the found set of data is a personal identifier or a false positive, wherein validating each of the found sets of data from the text document comprises:
eliminating false positives based on data immediately preceding or following each of the personal identifier candidates, wherein the data immediately preceding or following a personal identifier candidate indicates a false positive if the personal identifier candidate is immediately preceded, or immediately followed by, a number or a letter.
Dependent claims: 16, 17, 18, 19
Specification