Classification-Based Redaction in Natural Language Text
First Claim
1. A method for redacting natural language text comprising a plurality of features, the method comprising:
providing, by a processing device, a sensitive concepts model according to a classification algorithm operating upon the plurality of features, wherein sensitive concepts are classes used by the classification algorithm when providing the sensitive concepts model;
providing, by the processing device, a utility concepts model according to the classification algorithm operating upon the plurality of features, wherein utility concepts are classes used by the classification algorithm when providing the utility concepts model;
for at least one identified sensitive concept and at least one identified utility concept, and based on the sensitive concepts model and the utility concepts model, identifying, by the processing device, at least one feature in the natural language text that implicates the at least one identified sensitive topic more than the at least one identified utility concept to provide identified features; and
perturbing, by the processing device, at least some of the identified features in at least a portion of the natural language text.
Abstract
When redacting natural language text, a classifier is used to provide a sensitive concepts model according to features in the natural language text, in which the classes employed are sensitive concepts reflected in the text. Similarly, the classifier is used to provide a utility concepts model based on utility concepts. Based on these models, and for one or more identified sensitive concepts and identified utility concepts, at least one feature in the natural language text is identified that implicates the at least one identified sensitive topic more than the at least one identified utility concept. At least some of the features thus identified may be perturbed, and the modified natural language text may then be provided as at least one redacted document. In this manner, features are perturbed to maximize classification error for the sensitive concepts while simultaneously minimizing classification error for the utility concepts.
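The flow described in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the patented implementation: the classifier is a toy naive Bayes-style word model, features are whitespace-separated words, and every name here (`train_concept_model`, `implication`, `identify_features`, `redact`, the `margin` parameter, the `[REDACTED]` mask, and the toy documents and labels) is hypothetical. The extra requirement that a flagged word have a positive sensitive score is also an assumption of this sketch.

```python
import math
from collections import Counter


def train_concept_model(docs, labels, alpha=1.0):
    """Fit smoothed per-class word log-probabilities (naive Bayes-style)."""
    vocab = {word for doc in docs for word in doc.split()}
    counts = {}
    for doc, label in zip(docs, labels):
        counts.setdefault(label, Counter()).update(doc.split())
    model = {}
    for label, word_counts in counts.items():
        total = sum(word_counts.values()) + alpha * len(vocab)
        model[label] = {w: math.log((word_counts[w] + alpha) / total) for w in vocab}
    return model


def implication(model, concept, word):
    """log P(word | concept) minus the mean log-probability under the other classes.

    Assumes the model has at least two classes.
    """
    if word not in model[concept]:
        return 0.0  # unseen words carry no evidence either way
    others = [probs[word] for label, probs in model.items() if label != concept]
    return model[concept][word] - sum(others) / len(others)


def identify_features(text, sens_model, sens_concept, util_model, util_concept, margin=0.0):
    """Words that implicate the sensitive concept more than the utility concept."""
    flagged = set()
    for word in set(text.split()):
        s = implication(sens_model, sens_concept, word)
        u = implication(util_model, util_concept, word)
        # Assumption of this sketch: also require positive sensitive evidence.
        if s > 0 and s - u > margin:
            flagged.add(word)
    return flagged


def redact(text, features, mask="[REDACTED]"):
    """Perturb the identified features by replacing them with a mask token."""
    return " ".join(mask if word in features else word for word in text.split())


# Toy corpus: the same documents labeled once per model (hypothetical data).
docs = [
    "patient diabetes insulin invoice payment",
    "patient diabetes insulin appointment monday",
    "patient checkup invoice payment",
    "patient checkup appointment monday",
]
sensitive_model = train_concept_model(
    docs, ["condition", "condition", "no_condition", "no_condition"])
utility_model = train_concept_model(
    docs, ["billing", "scheduling", "billing", "scheduling"])

text = "patient diabetes insulin invoice payment"
flagged = identify_features(text, sensitive_model, "condition", utility_model, "billing")
redacted = redact(text, flagged)
print(redacted)  # patient [REDACTED] [REDACTED] invoice payment
```

In this toy case the words "diabetes" and "insulin" are masked because they separate the sensitive classes but not the utility classes, while "invoice" and "payment" survive because they carry the billing signal. Masking is only one possible perturbation; the text does not restrict perturbation to deletion.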
24 Claims
1. A method for redacting natural language text comprising a plurality of features, the method comprising:
providing, by a processing device, a sensitive concepts model according to a classification algorithm operating upon the plurality of features, wherein sensitive concepts are classes used by the classification algorithm when providing the sensitive concepts model;
providing, by the processing device, a utility concepts model according to the classification algorithm operating upon the plurality of features, wherein utility concepts are classes used by the classification algorithm when providing the utility concepts model;
for at least one identified sensitive concept and at least one identified utility concept, and based on the sensitive concepts model and the utility concepts model, identifying, by the processing device, at least one feature in the natural language text that implicates the at least one identified sensitive topic more than the at least one identified utility concept to provide identified features; and
perturbing, by the processing device, at least some of the identified features in at least a portion of the natural language text.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. An apparatus for redacting natural language text comprising a plurality of features, comprising:
a processor; and
storage, operatively connected to the processor and having stored thereon instructions that, when executed by the processor, cause the processor to:
provide a sensitive concepts model according to a classification algorithm operating upon the plurality of features, wherein sensitive concepts are classes used by the classification algorithm when providing the sensitive concepts model;
provide a utility concepts model according to the classification algorithm operating upon the plurality of features, wherein utility concepts are classes used by the classification algorithm when providing the utility concepts model;
for at least one identified sensitive concept and at least one identified utility concept, and based on the sensitive concepts model and the utility concepts model, identify at least one feature in the natural language text that implicates the at least one identified sensitive topic more than the at least one identified utility concept to provide identified features; and
perturb at least some of the identified features in at least a portion of the natural language text.
Dependent claims: 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
Specification