Classification-Based Redaction in Natural Language Text

US 20120239380A1
Filed: 03/15/2011
Published: 09/20/2012
Est. Priority Date: 03/15/2011
Status: Active Grant

Abstract

When redacting natural language text, a classifier is used to provide a sensitive concepts model according to features in the natural language text, in which the classes employed are sensitive concepts reflected in the natural language text. Similarly, the classifier is used to provide a utility concepts model based on utility concepts. Based on these models, and for one or more identified sensitive concepts and identified utility concepts, at least one feature in the natural language text is identified that implicates the at least one identified sensitive concept more than the at least one identified utility concept. At least some of the features thus identified may be perturbed such that the modified natural language text may be provided as at least one redacted document. In this manner, features are perturbed to maximize classification error for sensitive concepts while simultaneously minimizing classification error in the utility concepts.
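The identification step summarized above can be sketched in Python as follows. This is only an illustration under stated assumptions: the smoothed word-frequency estimates, the log-ratio score, the threshold value, and every name and toy document below are hypothetical and are not taken from the patent, whose own score definitions (ScoreLO, ScoreOR, ScoreFL, ScoreIG) are not reproduced in this listing.

```python
import math
from collections import Counter

def class_conditional_probs(docs_by_class, vocab, alpha=1.0):
    """Estimate P(feature | class) per class with add-alpha smoothing (assumed estimator)."""
    probs = {}
    for label, docs in docs_by_class.items():
        counts = Counter(tok for doc in docs for tok in doc)
        total = sum(counts[t] for t in vocab) + alpha * len(vocab)
        probs[label] = {t: (counts[t] + alpha) / total for t in vocab}
    return probs

def feature_scores(sensitive_model, utility_model, sensitive, utility):
    """Illustrative score: log P(x | sensitive concept) - log P(x | utility concept)."""
    s, u = sensitive_model[sensitive], utility_model[utility]
    return {t: math.log(s[t]) - math.log(u[t]) for t in s}

def identify_features(scores, threshold):
    """Keep the features whose score exceeds the threshold."""
    return {t for t, score in scores.items() if score > threshold}

# Hypothetical toy corpora: classes are sensitive concepts in one model
# and utility concepts in the other.
sensitive_docs = {
    "client_acme": [["acme", "merger", "retail"], ["acme", "pricing"]],
    "client_other": [["globex", "retail"]],
}
utility_docs = {
    "retail": [["retail", "pricing", "inventory"]],
    "finance": [["merger", "audit"]],
}
vocab = sorted({t for m in (sensitive_docs, utility_docs)
                for docs in m.values() for d in docs for t in d})

sensitive_model = class_conditional_probs(sensitive_docs, vocab)
utility_model = class_conditional_probs(utility_docs, vocab)
scores = feature_scores(sensitive_model, utility_model, "client_acme", "retail")
identified = identify_features(scores, threshold=0.6)
print(sorted(identified))
```

In this toy setup, a feature is flagged only when it is noticeably more probable under the chosen sensitive concept than under the chosen utility concept, so words that mostly serve the utility concept are left alone.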

24 Claims

1. A method for redacting natural language text comprising a plurality of features, the method comprising:
- providing, by a processing device, a sensitive concepts model according to a classification algorithm operating upon the plurality of features, wherein sensitive concepts are classes used by the classification algorithm when providing the sensitive concepts model;
  
  providing, by the processing device, a utility concepts model according to the classification algorithm operating upon the plurality of features, wherein utility concepts are classes used by the classification algorithm when providing the utility concepts model;
  
  for at least one identified sensitive concept and at least one identified utility concept, and based on the sensitive concepts model and the utility concepts model, identifying, by the processing device, at least one feature in the natural language text that implicates the at least one identified sensitive topic more than the at least one identified utility concept to provide identified features; and
  
  perturbing, by the processing device, at least some of the identified features in at least a portion of the natural language text.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12):
- - 2. The method of claim 1, further comprising:
    - analyzing, by the processing device, the natural language text to identify at least some of the sensitive concepts, at least some of the utility concepts or both.
  - 3. The method of claim 2, further comprising:
    - providing, by the processing device via a display operatively connected to the processing device, a user interface comprising a visual representation of a plurality of concepts in the natural language text; and
      
      receiving, by the processing device via a user input device operatively connected to the processing device, user inputs indicating the at least some of the sensitive concepts, the at least some of the utility concepts or both based on the user interface and the visual representation of the plurality of concepts.
  - 4. The method of claim 1, wherein identifying the identified features further comprises:
    - determining, by the processing device for each of at least some features in the natural language document, a sensitive concepts implication factor based on class-conditional probabilities of the feature according to the sensitive concepts;
      
      determining, by the processing device for each of the at least some features, a utility concepts implication factor based on class-conditional probabilities of the feature according to the utility concepts;
      
      determining, by the processing device for each of the at least some features, a feature score based on a difference between the sensitive concepts implication factor and the utility concepts implication factor; and
      
      providing, by the processing device, those features of the at least some features having feature scores above a threshold as the identified features.
  - 5. The method of claim 4, where the feature score is determined according to at least one of:
    - ScoreLO(x_i), ScoreOR(x_i), ScoreFL(x_i) and ScoreIG(x_i) where;
  - 6. The method of claim 1, wherein identifying the identified features further comprises:
    - determining, by the processing device for a document forming a part of the natural language text, selected features of the document that numerically optimize a constrained objective function established to ensure that the selected features of the document implicate a sensitive concept for the document more than at least one utility concept for the document, wherein the constrained objective function is based on class-conditional probabilities of features of the document according to the sensitive concept and class-conditional probabilities of the features according to the at least one utility concept; and
      
      providing, by the processing device, the selected features as the identified features.
  - 7. The method of claim 6, wherein the constrained objective function is:
  - 8. The method of claim 1, wherein identifying the identified features further comprises:
    - determining, by the processing device for a document forming a part of the natural language text, selected features of the document that numerically optimize a constrained objective function established to ensure that the selected features of the document implicate a sensitive concept for the document more than at least k−1 other sensitive concepts for the document, and wherein the constrained objective function is based on class-conditional probabilities of the features according to the at least one utility concept; and
      
      providing, by the processing device, the selected features as the identified features.
  - 9. The method of claim 8, wherein the constrained objective function is:
  - 10. The method of claim 1, wherein perturbing the at least some of the identified features further comprises suppressing the at least some of the identified features.
  - 11. The method of claim 1, wherein perturbing the at least some of the identified features further comprises generalizing the at least some of the identified features.
  - 12. The method of claim 1, further comprising:
    - providing, by the processing device, the portion of the natural language text in which the at least some of the identified features have been perturbed as at least one redacted document.

13. An apparatus for redacting natural language text comprising a plurality of features, comprising:
- a processor; and
  
  storage, operatively connected to the processor and having stored thereon instructions that, when executed by the processor, cause the processor to:
  
  provide a sensitive concepts model according to a classification algorithm operating upon the plurality of features, wherein sensitive concepts are classes used by the classification algorithm when providing the sensitive concepts model;
  
  provide a utility concepts model according to the classification algorithm operating upon the plurality of features, wherein utility concepts are classes used by the classification algorithm when providing the utility concepts model;
  
  for at least one identified sensitive concept and at least one identified utility concept, and based on the sensitive concepts model and the utility concepts model, identify at least one feature in the natural language text that implicates the at least one identified sensitive topic more than the at least one identified utility concept to provide identified features; and
  
  perturb at least some of the identified features in at least a portion of the natural language text.
- Dependent claims (14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24):
- - 14. The apparatus of claim 13, the storage further comprising instructions that, when executed by the processor, cause the processor to:
    - analyze the natural language text to identify at least some of the sensitive concepts, at least some of the utility concepts or both.
  - 15. The apparatus of claim 14, the storage further comprising instructions that, when executed by the processor, cause the processor to:
    - provide, via a display operatively connected to the processor, a user interface comprising a visual representation of a plurality of concepts in the natural language text; and
      
      receive, via a user input device operatively connected to the processor, user inputs indicating the at least some of the sensitive concepts, the at least some of the utility concepts or both based on the user interface and the visual representation of the plurality of concepts.
  - 16. The apparatus of claim 13, wherein those instructions that, when executed by the processor, cause the processor to identify the identified features are further operative to cause the processor to:
    - determine, for each of at least some features in the natural language document, a sensitive concepts implication factor based on class-conditional probabilities of the feature according to the sensitive concepts;
      
      determine, for each of the at least some features, a utility concepts implication factor based on class-conditional probabilities of the feature according to the utility concepts;
      
      determine, for each of the at least some features, a feature score based on a difference between the sensitive concepts implication factor and the utility concepts implication factor; and
      
      provide those features of the at least some features having feature scores above a threshold as the identified features.
  - 17. The apparatus of claim 16, wherein those instructions that, when executed by the processor, cause the processor to determine the feature score are further operative to determine the feature score according to at least one of:
    - ScoreLO(x_i), ScoreOR(x_i), ScoreFL(x_i) and ScoreIG(x_i)
  - 18. The apparatus of claim 13, wherein those instructions that, when executed by the processor, cause the processor to identify the identified features are further operative to cause the processor to:
    - determine, for a document forming a part of the natural language text, selected features of the document that numerically optimize a constrained objective function established to ensure that the selected features of the document implicate a sensitive concept for the document more than at least one utility concept for the document, wherein the constrained objective function is based on class-conditional probabilities of features of the document according to the sensitive concept and class-conditional probabilities of the features according to the at least one utility concept; and
      
      provide the selected features as the identified features.
  - 19. The apparatus of claim 18, wherein the constrained objective function is:
  - 20. The apparatus of claim 13, wherein those instructions that, when executed by the processor, cause the processor to identify the identified features are further operative to cause the processor to:
    - determine, for a document forming a part of the natural language text, selected features of the document that numerically optimize a constrained objective function established to ensure that the selected features of the document implicate a sensitive concept for the document more than at least k−1 other sensitive concepts for the document, and wherein the constrained objective function is based on class-conditional probabilities of the features according to the at least one utility concept; and
      
      provide the selected features as the identified features.
  - 21. The apparatus of claim 20, wherein the constrained objective function is:
  - 22. The apparatus of claim 13, wherein those instructions that, when executed by the processor, cause the processor to perturb the at least some of the identified features are further operative to cause the processor to suppress the at least some of the identified features.
  - 23. The apparatus of claim 13, wherein those instructions that, when executed by the processor, cause the processor to perturb the at least some of the identified features are further operative to cause the processor to generalize the at least some of the identified features.
  - 24. The apparatus of claim 13, the storage further comprising instructions that, when executed by the processor, cause the processor to:
    - provide the portion of the natural language text in which the at least some of the identified features have been perturbed as at least one redacted document.
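
The two perturbation options recited in the claims, suppressing identified features and generalizing them, can be illustrated with a short Python sketch. The placeholder string, the lookup table of broader terms, the fallback to suppression when no broader term is known, and the sample sentence are all hypothetical choices made for illustration; the claims do not specify how either operation is carried out.

```python
def suppress(tokens, identified, mask="[REDACTED]"):
    """Replace every identified feature with a placeholder (assumed mask string)."""
    return [mask if t.lower() in identified else t for t in tokens]

def generalize(tokens, identified, broader_terms, mask="[REDACTED]"):
    """Replace identified features with a broader term from a hypothetical
    lookup table, falling back to suppression when no broader term is known."""
    return [
        broader_terms.get(t.lower(), mask) if t.lower() in identified else t
        for t in tokens
    ]

# Hypothetical input: tokens of one document and the features flagged for it.
tokens = "Acme plans a merger before the retail season".split()
identified = {"acme", "merger"}
broader_terms = {"acme": "a client"}  # illustrative mapping, not from the patent

suppressed = suppress(tokens, identified)
generalized = generalize(tokens, identified, broader_terms)
print(" ".join(suppressed))
print(" ".join(generalized))
```

Suppression removes the most information, while generalization keeps some of the utility of the text by substituting a less specific term where one is available.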

Current Assignee
Accenture Global Services Limited (Accenture PLC)
Original Assignee
Accenture Global Services Limited (Accenture PLC)
Inventors
Cumby, Chad; Ghani, Rayid

Granted Patent

US 8,938,386 B2
US Class Current

704/9
CPC Class Codes

G06F 40/10   Text processing natural lan...

G06F 40/279   Recognition of textual enti...

G06F 40/30   Semantic analysis
