Classification-based redaction in natural language text

US 8,938,386 B2
Filed: 03/15/2011
Issued: 01/20/2015
Est. Priority Date: 03/15/2011
Status: Active Grant

Abstract

When redacting natural language text, a classifier is used to provide a sensitive concepts model according to features in the natural language text, in which the classes employed are sensitive concepts reflected in the text. Similarly, the classifier is used to provide a utility concepts model based on utility concepts. Based on these models, and for one or more identified sensitive concepts and identified utility concepts, at least one feature in the natural language text is identified that implicates the at least one identified sensitive concept more than the at least one identified utility concept. At least some of the features thus identified may be perturbed, and the modified natural language text may be provided as at least one redacted document. In this manner, features are perturbed to maximize classification error for the sensitive concepts while simultaneously minimizing classification error for the utility concepts.
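The scoring idea in the abstract can be illustrated with a minimal Python sketch. It is not the patented scoring function: the ScoreLO, ScoreOR, ScoreFL and ScoreIG definitions are not reproduced in this record. The sketch assumes add-alpha smoothed class-conditional probabilities and a hypothetical implication factor equal to the largest log P(feature | class) within a concept set. All function names, the toy corpora, and the threshold value are illustrative assumptions.

```python
import math
from collections import Counter

def class_conditionals(docs_by_class, vocab, alpha=1.0):
    """Estimate P(feature | class) per class with add-alpha smoothing."""
    probs = {}
    for label, docs in docs_by_class.items():
        counts = Counter(tok for doc in docs for tok in doc)
        total = sum(counts.values()) + alpha * len(vocab)
        probs[label] = {tok: (counts[tok] + alpha) / total for tok in vocab}
    return probs

def implication(feature, probs):
    """Assumed implication factor: strongest log P(feature | class) in the set."""
    return max(math.log(p[feature]) for p in probs.values())

def feature_scores(vocab, sens_probs, util_probs):
    """Score = sensitive implication factor minus utility implication factor."""
    return {
        tok: implication(tok, sens_probs) - implication(tok, util_probs)
        for tok in vocab
    }

def redact(tokens, scores, threshold, mask="[REDACTED]"):
    """Suppress every token whose score meets the threshold."""
    return [mask if scores.get(t, float("-inf")) >= threshold else t for t in tokens]

# Toy corpora (hypothetical): tokenized documents grouped by concept.
SENSITIVE_DOCS = {
    "acme": [["acme", "widget", "merger"], ["acme", "widget", "launch"]],
    "globex": [["globex", "reactor", "merger"]],
}
UTILITY_DOCS = {
    "finance": [["merger", "revenue", "quarter"], ["revenue", "acme"]],
    "product": [["launch", "widget", "reactor"]],
}
VOCAB = sorted({
    tok
    for group in (SENSITIVE_DOCS, UTILITY_DOCS)
    for docs in group.values()
    for doc in docs
    for tok in doc
})
SCORES = feature_scores(
    VOCAB,
    class_conditionals(SENSITIVE_DOCS, VOCAB),
    class_conditionals(UTILITY_DOCS, VOCAB),
)
REDACTED = redact(["acme", "merger", "revenue", "widget"], SCORES, threshold=0.25)
print(REDACTED)
```

With these toy counts, tokens tied mainly to a sensitive concept score above the threshold, while tokens that carry utility concepts score lower and are left in place. The threshold controls the trade-off between obscuring sensitive concepts and preserving utility.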

23 Claims

1. A method for redacting natural language text, the method comprising:
- receiving, by a processing device and via a user input device operatively connected to the processing device, one or more user inputs indicating sensitive concepts and utility concepts based on a user interface that includes a visual representation of a plurality of concepts in the natural language text, the plurality of concepts including the sensitive concepts and the utility concepts, and the natural language text being in an electronic format;
  
  determining, by the processing device, the sensitive concepts based on the one or more user inputs;
  
  determining, by the processing device, the utility concepts based on the one or more user inputs;
  
  determining, by the processing device and for at least one feature in the natural language text, a sensitive concepts implication factor based on class-conditional probabilities of the at least one feature according to the sensitive concepts;
  
  determining, by the processing device and for the at least one feature, a utility concepts implication factor based on class-conditional probabilities of the at least one feature according to the utility concepts;
  
  determining, by the processing device and for the at least one feature, a feature score based on a difference between the sensitive concepts implication factor and the utility concepts implication factor;
  
  identifying, by the processing device and to obtain identified features, the at least one feature based on the feature score satisfying a threshold, the at least one feature implicating at least one identified sensitive concept, of the sensitive concepts, more than at least one identified utility concept of the utility concepts; and
  
  perturbing, by the processing device, at least some of the identified features in at least a portion of the natural language text.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10)
- - 2. The method of claim 1, further comprising:
    - providing, by the processing device and via a display operatively connected to the processing device, the user interface.
  - 3. The method of claim 1, wherein the feature score is determined according to at least one mathematical function, where the at least one mathematical function is at least one of:
    - ScoreLO(x_i), ScoreOR(x_i), ScoreFL(x_i), or ScoreIG(x_i), where:
  - 4. The method of claim 1, wherein identifying the at least one feature comprises:
    - determining, by the processing device and for a document forming a part of the natural language text, selected features of the document that numerically optimize a constrained objective function established to ensure that the selected features of the document implicate at least one identified sensitive concept for the document more than at least one utility identified concept for the document; and
      
      providing, by the processing device, the selected features as the identified features.
  - 5. The method of claim 4, where the constrained objective function is:
  - 6. The method of claim 1, further comprising:
    - determining, by the processing device and for a document forming a part of the natural language text, selected features of the document that numerically optimize a constrained objective function established to ensure that the selected features of the document implicate a sensitive concept, of the sensitive concepts, for the document more than at least k−1 other sensitive concepts, of the sensitive concepts, for the document, the constrained objective function being based on class-conditional probabilities of the selected features according to the at least one utility concept; and
      
      providing, by the processing device, the selected features as part of the identified features.
  - 7. The method of claim 6, where the constrained objective function is:
  - 8. The method of claim 1, where perturbing the at least some of the identified features comprises:
    - suppressing the at least some of the identified features.
  - 9. The method of claim 1, where perturbing the at least some of the identified features comprises:
    - generalizing the at least some of the identified features.
  - 10. The method of claim 1, further comprising:
    - providing, by the processing device, the portion of the natural language text in which the at least some of the identified features have been perturbed as at least one redacted document.
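Claims 8 and 9 name two ways of perturbing an identified feature: suppression and generalization. A minimal sketch of the difference is below. It assumes features are lower-cased tokens and that generalization uses a lookup table of broader terms. The `BROADER_TERMS` table, the mask string, and the function names are hypothetical and are not taken from the patent.

```python
def suppress(tokens, identified, mask="[REDACTED]"):
    """Replace every identified feature with a fixed mask."""
    return [mask if t.lower() in identified else t for t in tokens]

def generalize(tokens, identified, broader_terms, mask="[REDACTED]"):
    """Replace identified features with a broader term when one is known.

    Falls back to suppression when the lookup table has no entry.
    """
    out = []
    for t in tokens:
        key = t.lower()
        if key in identified:
            out.append(broader_terms.get(key, mask))
        else:
            out.append(t)
    return out

# Hypothetical lookup table mapping specific terms to broader ones.
BROADER_TERMS = {"acme": "a manufacturer", "chicago": "a US city"}

TOKENS = ["Acme", "opened", "an", "office", "in", "Chicago", "for", "Project-X"]
IDENTIFIED = {"acme", "chicago", "project-x"}

SUPPRESSED = suppress(TOKENS, IDENTIFIED)
GENERALIZED = generalize(TOKENS, IDENTIFIED, BROADER_TERMS)
print(" ".join(SUPPRESSED))
print(" ".join(GENERALIZED))
```

Generalization keeps more of the surrounding meaning than suppression, which may matter when the utility concepts depend on the kind of entity mentioned rather than its identity.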

11. An apparatus for redacting natural language text comprising a plurality of features comprising:
- a storage;
  
  a processor to:
  
  receive, via a user input device operatively connected to the processor, one or more user inputs indicating sensitive concepts and utility concepts based on a user interface that includes a visual representation of a plurality of concepts in natural language text, the plurality of concepts including the sensitive concepts and the utility concepts, and the natural language text being in an electronic format;
  
  determine the sensitive concepts based on the one or more user inputs;
  
  determine the utility concepts based on the one or more user inputs;
  
  determine, for at least one feature in the natural language text, a sensitive concepts implication factor based on class-conditional probabilities of the at least one feature according to the sensitive concepts;
  
  determine, for the at least one feature in the natural language text, a utility concepts implication factor based on class-conditional probabilities of the at least one feature according to the utility concepts;
  
  determine, for the at least one feature in the natural language text, a feature score based on a difference between the sensitive concepts implication factor and the utility concepts implication factor;
  
  identify features of the natural language text based on the feature score satisfying a threshold, the identified features including the at least one feature, and the at least one feature implicating at least one identified sensitive concept, of the sensitive concepts, more than at least one utility concept of the utility concepts; and
  
  perturb at least some of the identified features in at least a portion of the natural language text.
- Dependent claims (12, 13, 14, 15, 16, 17, 18, 19, 20)
- - 12. The apparatus of claim 11, where the processor is further to:
    - provide, via a display operatively connected to the processor, the user interface.
  - 13. The apparatus of claim 11, where the feature score is determined according to at least one mathematical function, where the at least one mathematical function is at least one of:
    - ScoreLO(x_i), ScoreOR(x_i), ScoreFL(x_i), or ScoreIG(x_i),
  - 14. The apparatus of claim 11, where, when identifying the features, the processor is to:
    - determine, for a document forming a part of the natural language text, selected features of the document that numerically optimize a constrained objective function established to ensure that the selected features of the document implicate the at least one sensitive concept for the document more than the at least one utility concept for the document; and
      
      provide the selected features as the identified features.
  - 15. The apparatus of claim 14, where the constrained objective function is:
  - 16. The apparatus of claim 11, where, when identifying the features, the processor is to:
    - determine, for a document forming a part of the natural language text, selected features of the document that numerically optimize a constrained objective function established to ensure that the selected features of the document implicate the at least one identified sensitive concept for the document more than at least k−1 other sensitive concepts, of the sensitive concepts, for the document; and
      
      provide the selected features as the identified features.
  - 17. The apparatus of claim 16, where the constrained objective function is:
  - 18. The apparatus of claim 11, where, when perturbing the at least some of the identified features, the processor is to:
    - suppress the at least some of the identified features.
  - 19. The apparatus of claim 11, where, when perturbing the at least some of the identified features, the processor is to:
    - generalize the at least some of the identified features.
  - 20. The apparatus of claim 11, where the processor is further to:
    - provide the portion of the natural language text in which the at least some of the identified features have been perturbed as at least one redacted document.
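Claim 16 selects, per document, features that implicate the document's sensitive concept more than at least k−1 other sensitive concepts. The constrained objective function itself is not reproduced in this record, so the sketch below substitutes a simple greedy heuristic and should not be read as the claimed optimization. It assumes a naive Bayes view with uniform priors and keeps selecting the token that most favors the true concept until at least k−1 rival concepts are at least as likely. The probability table, concept names, and function names are hypothetical.

```python
import math

def log_likelihoods(tokens, cond_probs):
    """Naive Bayes log-likelihood of the tokens under each class (uniform priors)."""
    return {c: sum(math.log(p[t]) for t in tokens) for c, p in cond_probs.items()}

def rivals_at_least_as_likely(tokens, true_class, cond_probs):
    """Count classes other than true_class that explain the tokens at least as well."""
    ll = log_likelihoods(tokens, cond_probs)
    return sum(1 for c in ll if c != true_class and ll[c] >= ll[true_class])

def select_features(tokens, true_class, cond_probs, k):
    """Greedily pick tokens to perturb until k-1 rivals match the true class."""
    rivals = [c for c in cond_probs if c != true_class]
    if k - 1 > len(rivals):
        raise ValueError("k-1 cannot exceed the number of rival classes")

    def margin(tok):
        # How strongly this token favors the true class over its best rival.
        best_rival = max(math.log(cond_probs[c][tok]) for c in rivals)
        return math.log(cond_probs[true_class][tok]) - best_rival

    remaining, selected = list(tokens), []
    while remaining and rivals_at_least_as_likely(remaining, true_class, cond_probs) < k - 1:
        tok = max(remaining, key=margin)
        if margin(tok) <= 0:
            break  # no remaining token favors the true class
        remaining.remove(tok)
        selected.append(tok)
    return selected, remaining

# Hypothetical class-conditional probabilities for three sensitive concepts.
COND_PROBS = {
    "acme": {"acme": 0.6, "widget": 0.2, "merger": 0.1, "revenue": 0.1},
    "globex": {"acme": 0.05, "widget": 0.15, "merger": 0.4, "revenue": 0.4},
    "initech": {"acme": 0.1, "widget": 0.3, "merger": 0.3, "revenue": 0.3},
}
DOC = ["acme", "acme", "widget", "merger"]
SELECTED, REMAINING = select_features(DOC, "acme", COND_PROBS, k=2)
print(SELECTED, REMAINING)
```

A greedy pass like this gives no optimality guarantee and ignores the utility concepts entirely. A fuller treatment would add a utility term to the selection criterion, in line with the utility-based constraint recited in claim 6.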

21. A non-transitory computer-readable medium storing instructions, the instructions comprising:
- one or more instructions that, when executed by at least one processor, cause the at least one processor to:
  
  receive, via a user input device operatively connected to the at least one processor, one or more user inputs indicating sensitive concepts and utility concepts based on a user interface that includes a visual representation of a plurality of concepts in natural language text, the plurality of concepts including the sensitive concepts and the utility concepts, and the natural language text being in an electronic format;
  
  determine the sensitive concepts based on the one or more user inputs, the sensitive concepts being concepts that are to be obscured;
  
  determine the utility concepts based on the one or more user inputs, the utility concepts being concepts that are desirable to be preserved;
  
  determine, for at least one feature in the natural language text, a sensitive concepts implication factor based on class-conditional probabilities of the at least one feature according to the sensitive concepts;
  
  determine, for the at least one feature in the natural language text, a utility concepts implication factor based on class-conditional probabilities of the at least one feature according to the utility concepts;
  
  determine, for the at least one feature in the natural language text, a feature score based on a difference between the sensitive concepts implication factor and the utility concepts implication factor; and
  
  perturb the at least one feature based on the feature score satisfying a threshold.
- Dependent claims (22, 23)
- - 22. The non-transitory computer-readable medium of claim 21, where the one or more user inputs includes a topic, and where the sensitive concepts are associated with the topic.
  - 23. The non-transitory computer-readable medium of claim 21, where the one or more instructions to perturb the at least one feature comprise:
    - one or more instructions that, when executed by the at least one processor, cause the at least one processor to:
      
      suppress the at least one feature.

Current Assignee: Accenture Global Services Limited (Accenture PLC)
Original Assignee: Accenture Global Services Limited (Accenture PLC)
Inventors: Cumby, Chad; Ghani, Rayid
Primary Examiner(s): Shah, Paras D
Application Number: US13/048,003
Publication Number: US 20120239380A1
Time in Patent Office: 1,407 Days
Field of Search: 704/1, 704/9, 704/257, 704/231, 715/255, 715/256, 707/705, 707/757, 707/776, 707/780-783, 707/813, 706/12, 706/52
US Class Current: 704/9
CPC Class Codes:
G06F 40/10   Text processing natural lan...
G06F 40/279   Recognition of textual enti...
G06F 40/30   Semantic analysis
