Classification-based redaction in natural language text
Abstract
When redacting natural language text, a classifier is used to provide a sensitive concepts model according to features in the natural language text, in which the classes employed are the sensitive concepts reflected in the natural language text. Similarly, the classifier is used to provide a utility concepts model based on utility concepts. Based on these models, and for one or more identified sensitive concepts and identified utility concepts, at least one feature in the natural language text is identified that implicates the at least one identified sensitive concept more than the at least one identified utility concept. At least some of the features thus identified may be perturbed, and the modified natural language text may be provided as at least one redacted document. In this manner, features are perturbed to maximize classification error for the sensitive concepts while simultaneously minimizing classification error for the utility concepts.
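As a rough illustration of the two models the abstract describes, the sketch below estimates class-conditional probabilities P(feature | concept) twice over the same documents: once with sensitive concepts as the classes and once with utility concepts as the classes. It assumes a multinomial, naive-Bayes-style model with add-alpha smoothing and word tokens as features; the function name `class_conditional_probs`, the toy corpus, and the labels are hypothetical and are not taken from the patent.

```python
from collections import Counter, defaultdict

def class_conditional_probs(labeled_docs, alpha=1.0):
    """Estimate P(feature | concept) with add-alpha smoothing (an assumed model).

    labeled_docs: iterable of (concept_label, list_of_tokens) pairs.
    Returns {concept: {feature: probability}} over a shared vocabulary.
    """
    counts = defaultdict(Counter)
    vocab = set()
    for concept, tokens in labeled_docs:
        counts[concept].update(tokens)
        vocab.update(tokens)
    probs = {}
    for concept, counter in counts.items():
        total = sum(counter.values()) + alpha * len(vocab)
        probs[concept] = {f: (counter[f] + alpha) / total for f in vocab}
    return probs

# Hypothetical toy corpus: (sensitive label, utility label, tokens).
docs = [
    ("hiv", "pharmacy", ["antiretroviral", "dosage", "refill"]),
    ("diabetes", "pharmacy", ["insulin", "dosage", "refill"]),
    ("hiv", "billing", ["antiretroviral", "invoice", "payment"]),
]

# Same classifier family, two class sets: sensitive concepts and utility concepts.
sensitive_model = class_conditional_probs((s, toks) for s, _, toks in docs)
utility_model = class_conditional_probs((u, toks) for _, u, toks in docs)
```

In this toy data, `antiretroviral` is far more probable under the sensitive class `hiv` than under `diabetes`, while `dosage` is about equally likely under both, which is the kind of asymmetry the feature selection in the claims relies on.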
23 Claims
-
1. A method for redacting natural language text, the method comprising:
-
receiving, by a processing device and via a user input device operatively connected to the processing device, one or more user inputs indicating sensitive concepts and utility concepts based on a user interface that includes a visual representation of a plurality of concepts in the natural language text, the plurality of concepts including the sensitive concepts and the utility concepts, and the natural language text being in an electronic format;
determining, by the processing device, the sensitive concepts based on the one or more user inputs;
determining, by the processing device, the utility concepts based on the one or more user inputs;
determining, by the processing device and for at least one feature in the natural language text, a sensitive concepts implication factor based on class-conditional probabilities of the at least one feature according to the sensitive concepts;
determining, by the processing device and for the at least one feature, a utility concepts implication factor based on class-conditional probabilities of the at least one feature according to the utility concepts;
determining, by the processing device and for the at least one feature, a feature score based on a difference between the sensitive concepts implication factor and the utility concepts implication factor;
identifying, by the processing device and to obtain identified features, the at least one feature based on the feature score satisfying a threshold, the at least one feature implicating at least one identified sensitive concept, of the sensitive concepts, more than at least one identified utility concept of the utility concepts; and
perturbing, by the processing device, at least some of the identified features in at least a portion of the natural language text.
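The scoring and thresholding steps of claim 1 can be sketched as follows. The claim only requires that each implication factor be "based on" class-conditional probabilities, so the specific definition used here (the log-ratio of P(feature | identified concept) to the mean of P(feature | other concepts)) is an assumption made for illustration, as are the function names, the probability tables, and the threshold value.

```python
import math

def implication_factor(feature, target, model):
    """Assumed definition: log of P(feature | target) over the mean
    P(feature | c) across the other concepts c in the model."""
    others = [p[feature] for c, p in model.items() if c != target]
    return math.log(model[target][feature] / (sum(others) / len(others)))

def identify_features(tokens, sensitive_model, sensitive_target,
                      utility_model, utility_target, threshold):
    """Return the features whose score (sensitive factor minus utility factor)
    exceeds the threshold."""
    identified = set()
    for f in set(tokens):
        score = (implication_factor(f, sensitive_target, sensitive_model)
                 - implication_factor(f, utility_target, utility_model))
        if score > threshold:
            identified.add(f)
    return identified

# Hypothetical class-conditional probabilities P(feature | concept).
sensitive_model = {
    "hiv":      {"antiretroviral": 0.50, "patient": 0.40, "arrhythmia": 0.10},
    "diabetes": {"antiretroviral": 0.05, "patient": 0.55, "arrhythmia": 0.40},
}
utility_model = {
    "cardiology":  {"antiretroviral": 0.05, "patient": 0.35, "arrhythmia": 0.60},
    "dermatology": {"antiretroviral": 0.05, "patient": 0.85, "arrhythmia": 0.10},
}

tokens = ["patient", "antiretroviral", "arrhythmia"]
identified = identify_features(tokens, sensitive_model, "hiv",
                               utility_model, "cardiology", threshold=1.0)

# Perturb by simple masking (one possible perturbation among several).
redacted = ["[REDACTED]" if t in identified else t for t in tokens]
```

With these numbers only `antiretroviral` is identified: it strongly implicates the sensitive concept and says nothing about the utility concept, whereas `arrhythmia` implicates the utility concept and is left intact.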
-
-
4. The method of claim 1, wherein identifying the at least one feature comprises:
-
determining, by the processing device and for a document forming a part of the natural language text, selected features of the document that numerically optimize a constrained objective function established to ensure that the selected features of the document implicate at least one identified sensitive concept for the document more than at least one identified utility concept for the document; and
providing, by the processing device, the selected features as the identified features.
-
-
5. The method of claim 4, where the constrained objective function is:
[equation not reproduced in the source text]
-
6. The method of claim 1, further comprising:
-
determining, by the processing device and for a document forming a part of the natural language text, selected features of the document that numerically optimize a constrained objective function established to ensure that the selected features of the document implicate a sensitive concept, of the sensitive concepts, for the document more than at least k−1 other sensitive concepts, of the sensitive concepts, for the document, the constrained objective function being based on class-conditional probabilities of the selected features according to the at least one utility concept; and
providing, by the processing device, the selected features as part of the identified features.
-
-
7. The method of claim 6, where the constrained objective function is:
[equation not reproduced in the source text]
-
8. The method of claim 1, where perturbing the at least some of the identified features comprises:
suppressing the at least some of the identified features.
-
9. The method of claim 1, where perturbing the at least some of the identified features comprises:
generalizing the at least some of the identified features.
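Claims 8 and 9 name two ways of perturbing an identified feature. A minimal sketch of the difference, assuming word-level features: suppression replaces the feature with a mask, while generalization replaces it with a broader term. The `hypernyms` lookup table, the mask string, and both function names are hypothetical; the claims do not specify how a generalization is chosen.

```python
def suppress(tokens, identified, mask="[REDACTED]"):
    """Replace each identified feature with a fixed mask."""
    return [mask if t in identified else t for t in tokens]

def generalize(tokens, identified, hypernyms, mask="[REDACTED]"):
    """Replace each identified feature with a broader term from a lookup
    table, falling back to the mask when no broader term is known."""
    return [hypernyms.get(t, mask) if t in identified else t for t in tokens]

tokens = ["prescribed", "antiretroviral", "at", "clinic"]
identified = {"antiretroviral"}
hypernyms = {"antiretroviral": "medication"}  # hypothetical mapping

suppressed = suppress(tokens, identified)
generalized = generalize(tokens, identified, hypernyms)
```

Here `suppressed` is `["prescribed", "[REDACTED]", "at", "clinic"]` and `generalized` is `["prescribed", "medication", "at", "clinic"]`; generalization keeps more of the sentence readable at the cost of leaking the broader category.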
-
10. The method of claim 1, further comprising:
providing, by the processing device, the portion of the natural language text in which the at least some of the identified features have been perturbed as at least one redacted document.
-
11. An apparatus for redacting natural language text comprising a plurality of features, the apparatus comprising:
-
a storage;
a processor to:
receive, via a user input device operatively connected to the processor, one or more user inputs indicating sensitive concepts and utility concepts based on a user interface that includes a visual representation of a plurality of concepts in natural language text, the plurality of concepts including the sensitive concepts and the utility concepts, and the natural language text being in an electronic format;
determine the sensitive concepts based on the one or more user inputs;
determine the utility concepts based on the one or more user inputs;
determine, for at least one feature in the natural language text, a sensitive concepts implication factor based on class-conditional probabilities of the at least one feature according to the sensitive concepts;
determine, for the at least one feature in the natural language text, a utility concepts implication factor based on class-conditional probabilities of the at least one feature according to the utility concepts;
determine, for the at least one feature in the natural language text, a feature score based on a difference between the sensitive concepts implication factor and the utility concepts implication factor;
identify features of the natural language text based on the feature score satisfying a threshold, the identified features including the at least one feature, and the at least one feature implicating at least one identified sensitive concept, of the sensitive concepts, more than at least one utility concept of the utility concepts; and
perturb at least some of the identified features in at least a portion of the natural language text.
-
-
21. A non-transitory computer-readable medium storing instructions, the instructions comprising:
-
one or more instructions that, when executed by at least one processor, cause the at least one processor to:
receive, via a user input device operatively connected to the at least one processor, one or more user inputs indicating sensitive concepts and utility concepts based on a user interface that includes a visual representation of a plurality of concepts in natural language text, the plurality of concepts including the sensitive concepts and the utility concepts, and the natural language text being in an electronic format;
determine the sensitive concepts based on the one or more user inputs, the sensitive concepts being concepts that are to be obscured;
determine the utility concepts based on the one or more user inputs, the utility concepts being concepts that are desirable to be preserved;
determine, for at least one feature in the natural language text, a sensitive concepts implication factor based on class-conditional probabilities of the at least one feature according to the sensitive concepts;
determine, for the at least one feature in the natural language text, a utility concepts implication factor based on class-conditional probabilities of the at least one feature according to the utility concepts;
determine, for the at least one feature in the natural language text, a feature score based on a difference between the sensitive concepts implication factor and the utility concepts implication factor; and
perturb the at least one feature based on the feature score satisfying a threshold.
-
Specification