Layered masking of content

US 10,546,154 B2
Filed: 10/27/2017
Issued: 01/28/2020
Est. Priority Date: 03/28/2017
Status: Active Grant

Abstract

Methods, systems and computer program products for layered masking of data are described. A system receives content including personally identifiable information (PII). The system redacts the content by masking the PII. The system identifies the PII in multi-layer processing, where in each layer, the system determines a respective confidence score indicating a probability that a token is PII. If the confidence score is sufficiently high, the system masks the token. Otherwise, the system provides the token to a next layer for processing. The layers can include regular expression based processing, lookup table based processing, and machine learning based processing.
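
The layer-by-layer flow described in the abstract can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and does not come from the patent: the SSN-style pattern, the lookup entry, the `THRESHOLD` value, the `X` masking character, and in particular `model_layer`, which is a toy heuristic standing in for a trained model.

```python
import re

# Hypothetical patterns, table entries, and threshold (not from the patent).
REGEX_SCORES = [(re.compile(r"\d{3}-\d{2}-\d{4}"), 0.95)]
LOOKUP_SCORES = {"smith": 0.90}
THRESHOLD = 0.8


def regex_layer(token: str) -> float:
    """Score of the best pattern that matches the whole token, else 0.0."""
    return max((s for p, s in REGEX_SCORES if p.fullmatch(token)), default=0.0)


def lookup_layer(token: str) -> float:
    """Score stored for an exact (case-folded) table match, else 0.0."""
    return LOOKUP_SCORES.get(token.lower(), 0.0)


def model_layer(token: str) -> float:
    """Toy stand-in for a trained model: long title-case words look like names."""
    return 0.85 if token.istitle() and len(token) > 6 else 0.1


LAYERS = [regex_layer, lookup_layer, model_layer]


def mask_token(token: str) -> str:
    """Mask at the first layer whose score is high enough; otherwise fall through."""
    for layer in LAYERS:
        if layer(token) >= THRESHOLD:
            return "X" * len(token)
    return token  # no layer was confident, so the token is left as-is


def mask_content(content: str) -> str:
    """Split on whitespace, mask each token, and rejoin with single spaces."""
    return " ".join(mask_token(t) for t in content.split())


print(mask_content("PAYMENT TO Smith 123-45-6789 Johnson REF"))
```

In this sketch `Smith` is caught by the lookup layer, the number by the regular expression layer, and `Johnson` only by the final layer, which shows how a token moves down the layers until one is confident.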

17 Citations

20 Claims

1. A method comprising:
- receiving content including a token;
  
  storing one or more regular expressions for the content, wherein each regular expression comprises a sequence of symbols and characters expressing a string or pattern;
  
  determining, by a computer system based on matching the one or more regular expressions with the token, a first confidence score indicating a probability that the token includes personally identifiable information (PII), the first confidence score being associated with the regular expression;
  
  storing a lookup table that includes one or more tokens for known PII;
  
  determining, by the computer system based on matching the token with tokens in the lookup table, a second confidence score indicating a probability that the token includes PII, the second confidence score being associated with a term in the lookup table that is an exact match of the token;
  
  storing a model for determining a third confidence score indicating a probability that the token includes PII, wherein the model is generated using a machine learning training algorithm;
  
  determining, by the computer system based on inputting the token into the model, the third confidence score;
  
  masking the token by the computer system based on the first confidence score, the second confidence score and the third confidence score; and
  
  providing, by the computer system as data of improved privacy, the content including the masked token to a content consuming device.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
- - 2. The method of claim 1, wherein:
    - the content includes transaction data, the transaction data including one or more transaction records each corresponding to a respective transaction and each including a description of the transaction, and
      
      the token is a part of the description and includes at least one of a number or a word.
  - 3. The method of claim 1, wherein the one or more regular expressions are derived from historical transaction data, the first confidence score is derived from a ratio between a number of terms in the historical transaction data that match the regular expression and that are PII and a total number of terms in the historical transaction data that match the regular expression.
  - 4. The method of claim 1, wherein the lookup table is derived from historical transaction data, the second confidence score is derived from a ratio between a number of appearances of the token in the historical transaction data as PII and a total number of appearances of the token in the historical transaction data.
  - 5. The method of claim 1, wherein the model is based on a conditional random field (CRF) algorithm.
  - 6. The method of claim 5, comprising providing training data for the CRF algorithm, the training data including first data marked as PII and second data marked as non-PII, wherein the first data is labeled as noun forms for the CRF algorithm, the second data is labeled as non-noun forms.
  - 7. The method of claim 1, wherein masking the token based on the first confidence score, the second confidence score and the third confidence score comprises:
    - determining that at least one of the first confidence score, the second confidence score or the third confidence score satisfies a respective confidence threshold; and
      
      in response to the determining, masking the token.
  - 8. The method of claim 1, wherein masking the token based on the first confidence score, the second confidence score and the third confidence score comprises:
    - determining that none of the first confidence score, the second confidence score or the third confidence score satisfies a respective confidence threshold;
      
      determining whether a weighted combination of the first confidence score, the second confidence score and the third confidence score satisfies a combined confidence threshold; and
      
      masking the token upon determining that the weighted combination satisfies the combined confidence threshold.
  - 9. The method of claim 1, wherein the masking the content is performed in a micro batching mode, wherein the content is divided into a plurality of sets of strings, each set having a size limit, and the masking is performed on each set.
  - 10. The method of claim 1, wherein the masking the content is performed in a batch mode or a real time mode.
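
The micro batching mode of claim 9 can be sketched as follows. This is only an illustration under stated assumptions: the size limit is read here as a maximum number of strings per set (the claim does not say how size is measured), and `mask_digits` is a hypothetical placeholder for whatever masking routine is applied to each set.

```python
from typing import Callable, Iterator, List


def micro_batches(strings: List[str], size_limit: int) -> Iterator[List[str]]:
    """Yield consecutive sets of at most `size_limit` strings."""
    if size_limit < 1:
        raise ValueError("size_limit must be at least 1")
    for start in range(0, len(strings), size_limit):
        yield strings[start:start + size_limit]


def mask_in_micro_batches(
    strings: List[str],
    size_limit: int,
    mask_batch: Callable[[List[str]], List[str]],
) -> List[str]:
    """Apply `mask_batch` to each set in turn and collect the results in order."""
    masked: List[str] = []
    for batch in micro_batches(strings, size_limit):
        masked.extend(mask_batch(batch))
    return masked


def mask_digits(batch: List[str]) -> List[str]:
    """Hypothetical masking routine: replace every digit with 'X'."""
    return ["".join("X" if ch.isdigit() else ch for ch in s) for s in batch]


records = ["ACH 1234", "CHECK 77", "ATM", "POS 9", "FEE"]
print(list(micro_batches(records, 2)))
print(mask_in_micro_batches(records, 2, mask_digits))
```

Because each set is processed independently, the same `mask_batch` routine could also be called on the whole input at once (batch mode) or on one record at a time (real time mode), the two modes named in claim 10.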

11. A system comprising:
- one or more processors; and
  
  a non-transitory computer-readable medium storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising:
  
  receiving content including a token;
  
  storing one or more regular expressions for the content, wherein each regular expression comprises a sequence of symbols and characters expressing a string or pattern;
  
  determining, based on matching the one or more regular expressions with the token, a first confidence score indicating a probability that the token includes personally identifiable information (PII), the first confidence score being associated with the regular expression;
  
  storing a lookup table that includes one or more tokens for known PII;
  
  determining, based on matching the token with tokens in the lookup table, a second confidence score indicating a probability that the token includes PII, the second confidence score being associated with a term in the lookup table that is an exact match of the token;
  
  storing a model for determining a third confidence score indicating a probability that the token includes PII, wherein the model is generated using a machine learning training algorithm;
  
  determining, based on inputting the token into the model, the third confidence score;
  
  masking the token based on the first confidence score, the second confidence score and the third confidence score; and
  
  providing, as data of improved privacy, the content including the masked token to a content consuming device.
- Dependent claims (12, 13, 14, 15, 16)
- - 12. The system of claim 11, wherein:
    - the content includes transaction data, the transaction data including one or more transaction records each corresponding to a respective transaction and each including a description of the transaction, and
      
      the token is a part of the description and includes at least one of a number or a word.
  - 13. The system of claim 11, wherein the one or more regular expressions are derived from historical transaction data, the first confidence score is derived from a ratio between a number of terms in the historical transaction data that match the regular expression and that are PII and a total number of terms in the historical transaction data that match the regular expression.
  - 14. The system of claim 11, wherein the lookup table is derived from historical transaction data, the second confidence score is derived from a ratio between a number of appearances of the token in the historical transaction data as PII and a total number of appearances of the token in the historical transaction data.
  - 15. The system of claim 11, wherein the model is based on a conditional random field (CRF) algorithm.
  - 16. The system of claim 15, comprising providing training data for the CRF algorithm, the training data including first data marked as PII and second data marked as non-PII, wherein the first data is labeled as noun forms for the CRF algorithm, the second data is labeled as non-noun forms.
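
The ratio-derived scores of claims 13 and 14 (and of claims 3 and 4) can be worked through in a short Python sketch. The representation of historical transaction data as `(term, is_pii)` pairs, the function names, and the sample values are all assumptions made for illustration; the claims specify only the ratios themselves.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

Labeled = Tuple[str, bool]  # (term, whether it was PII in the historical data)


def regex_confidence(pattern: str, history: List[Labeled]) -> float:
    """PII terms matching the pattern divided by all terms matching the pattern."""
    compiled = re.compile(pattern)
    labels = [is_pii for term, is_pii in history if compiled.fullmatch(term)]
    return sum(labels) / len(labels) if labels else 0.0


def lookup_confidences(history: List[Labeled]) -> Dict[str, float]:
    """Per token: appearances as PII divided by total appearances."""
    total: Counter = Counter()
    as_pii: Counter = Counter()
    for term, is_pii in history:
        total[term] += 1
        if is_pii:
            as_pii[term] += 1
    # Keep only tokens that were PII at least once ("tokens for known PII").
    return {t: as_pii[t] / total[t] for t in total if as_pii[t] > 0}


# Made-up historical data for illustration only.
history = [
    ("123-45-6789", True), ("555-12-0000", True), ("800-55-1234", False),
    ("SMITH", True), ("SMITH", True), ("SMITH", True), ("SMITH", False),
    ("STORE", False),
]

print(regex_confidence(r"\d{3}-\d{2}-\d{4}", history))  # 2 of 3 matching terms are PII
print(lookup_confidences(history)["SMITH"])             # 3 of 4 appearances are PII
```

With this sample data the pattern matches three terms, two of which are PII, and `SMITH` appears four times, three of them as PII.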

17. A non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising:
- receiving content including a token;
  
  storing one or more regular expressions for the content, wherein each regular expression comprises a sequence of symbols and characters expressing a string or pattern;
  
  determining, based on matching the one or more regular expressions with the token, a first confidence score indicating a probability that the token includes personally identifiable information (PII), the first confidence score being associated with the regular expression;
  
  storing a lookup table that includes one or more tokens for known PII;
  
  determining, based on matching the token with tokens in the lookup table, a second confidence score indicating a probability that the token includes PII, the second confidence score being associated with a term in the lookup table that is an exact match of the token;
  
  storing a model for determining a third confidence score indicating a probability that the token includes PII, wherein the model is generated using a machine learning training algorithm;
  
  determining, based on inputting the token into the model, a third confidence score;
  
  masking the token based on the first confidence score, the second confidence score and the third confidence score; and
  
  providing, as data of improved privacy, the content including the masked token to a content consuming device.
- Dependent claims (18, 19, 20)
- - 18. The non-transitory computer-readable medium of claim 17, wherein the lookup table is derived from historical transaction data, the second confidence score is derived from a ratio between a number of appearances of the token in the historical transaction data as PII and a total number of appearances of the token in the historical transaction data.
  - 19. The non-transitory computer-readable medium of claim 17, wherein masking the token based on the first confidence score, the second confidence score and the third confidence score comprises:
    - determining that at least one of the first confidence score, the second confidence score or the third confidence score satisfies a respective confidence threshold; and
      
      in response to the determining, masking the token.
  - 20. The non-transitory computer-readable medium of claim 17, wherein masking the token based on the first confidence score, the second confidence score and the third confidence score comprises:
    - determining that none of the first confidence score, the second confidence score or the third confidence score satisfies a respective confidence threshold;
      
      determining whether a weighted combination of the first confidence score, the second confidence score and the third confidence score satisfies a combined confidence threshold; and
      
      masking the token upon determining that the weighted combination satisfies the combined confidence threshold.
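
The two masking decisions of claims 19 and 20 (mirroring claims 7 and 8) can be combined into one short Python sketch. It assumes that "satisfies a threshold" means greater than or equal to it, and the thresholds, weights, and scores shown are invented values rather than anything stated in the patent.

```python
from typing import Sequence


def should_mask(
    scores: Sequence[float],
    thresholds: Sequence[float],
    weights: Sequence[float],
    combined_threshold: float,
) -> bool:
    """Decide whether to mask a token from its three layer scores."""
    # Claim 19: mask if any single score satisfies its own threshold.
    if any(s >= t for s, t in zip(scores, thresholds)):
        return True
    # Claim 20: otherwise fall back to a weighted combination of all scores.
    combined = sum(w * s for w, s in zip(weights, scores))
    return combined >= combined_threshold


# Hypothetical per-layer thresholds and weights (regex, lookup table, model).
THRESHOLDS = (0.9, 0.9, 0.9)
WEIGHTS = (0.3, 0.3, 0.4)

print(should_mask((0.95, 0.1, 0.1), THRESHOLDS, WEIGHTS, 0.7))  # one layer is confident
print(should_mask((0.6, 0.7, 0.8), THRESHOLDS, WEIGHTS, 0.7))   # only the combination passes
print(should_mask((0.2, 0.3, 0.1), THRESHOLDS, WEIGHTS, 0.7))   # neither test passes
```

The second call is the case claim 20 addresses: no layer alone is confident enough, but the weighted sum of about 0.71 clears the combined threshold of 0.7.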

Current Assignee
Yodlee Incorporated (Envestnet Incorporated)
Original Assignee
Yodlee Incorporated (Envestnet Incorporated)
Inventors
Praveen, Vunnava; Hussain, Syed Abid
Primary Examiner(s)
Siddiqi, Mohammad A

Application Number

US15/796,652
Publication Number

US 20180285599A1
Time in Patent Office

823 Days
CPC Class Codes

G06F 18/22   Matching criteria, e.g. pro...

G06F 18/23   Clustering techniques

G06F 18/2415   based on parametric or prob...

G06F 21/6245   Protecting personal data, e...

G06F 21/6254   by anonymising data, e.g. d...

G06F 21/6263   during internet communicati...

G06F 2218/08   Feature extraction

G06F 2218/12   Classification; Matching

G06F 40/166   Editing, e.g. inserting or ...

G06N 20/00   Machine learning

G06N 20/20   Ensemble learning

G06N 7/01   Probabilistic graphical mod...

G06Q 20/383   Anonymous user system

G06V 10/768   using context analysis, e.g...

H04L 63/0421   Anonymous communication, i....

H04W 12/02   Protecting privacy or anony...
