Method and apparatus for detecting sensitive content in a document
First Claim
1. A computer-executed method for detecting sensitive content in a document, the method comprising:
receiving a document;
identifying a set of terms in the document;
generating a combination of terms, based on the identified terms, that is associated with a semantic meaning and is potentially sensitive;
performing a first search through a public corpus for the combination of terms and determining a first search hit count returned for the combination;
performing additional searches through the public corpus for individual terms in the combination and determining search hit counts returned for each term in the combination;
computing a search hit ratio between the first search hit count and the average search hit count for the individual terms in the combination;
in response to the computed search hit ratio being smaller than a predetermined value, labeling the combination of terms as sensitive; and
generating a result that indicates portions of the document which contain sensitive combinations;
determining a relationship between the identified terms by processing the document with a syntactic parser to determine relationships between the terms;
wherein determining the relationship between the identified terms further comprises scoring a respective relationship based on a determined relevance to the document and a likelihood of the terms being sensitive.
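The hit-ratio test recited in the claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `hit_ratio`, `is_sensitive`, `toy_hit_count`, the space-joined query for the combination, the threshold of 0.01, and the toy hit counts are all assumptions introduced here for illustration. A real system would replace `toy_hit_count` with a call to a search engine over a public corpus.

```python
from statistics import mean
from typing import Callable, Sequence


def hit_ratio(combo: Sequence[str], hit_count: Callable[[str], int]) -> float:
    """Ratio of the combination's hit count to the average per-term hit count."""
    # Assumption: the combination is queried as its terms joined by spaces.
    combined = hit_count(" ".join(combo))
    average = mean(hit_count(term) for term in combo)
    if average == 0:
        # Assumption: if no term appears in the corpus at all, report a ratio of 0.
        return 0.0
    return combined / average


def is_sensitive(combo: Sequence[str],
                 hit_count: Callable[[str], int],
                 threshold: float = 0.01) -> bool:
    """Label the combination sensitive when the ratio falls below the threshold."""
    return hit_ratio(combo, hit_count) < threshold


# Hypothetical hit counts standing in for a public-corpus search engine.
TOY_COUNTS = {
    "acme": 1_000_000,
    "merger": 3_000_000,
    "acme merger": 200,
    "weather": 5_000_000,
    "forecast": 3_000_000,
    "weather forecast": 2_000_000,
}


def toy_hit_count(query: str) -> int:
    return TOY_COUNTS.get(query, 0)
```

With these toy counts, `is_sensitive(["acme", "merger"], toy_hit_count)` is true because the two terms are individually common but rarely appear together, while `is_sensitive(["weather", "forecast"], toy_hit_count)` is false.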
Abstract
One embodiment of the present invention provides a system that detects sensitive content in a document. In doing so, the system receives a document, identifies a set of terms in the document that are candidate sensitive terms, and generates a combination of terms based on the identified terms that is associated with a semantic meaning. Next, the system performs searches through a corpus based on the combination of terms and determines hit counts returned for each term in the combination and for the combination. The system then determines whether the combination of terms is sensitive based on the hit count for the combination and the hit counts for the individual terms in the combination, and generates a result that indicates portions of the document which contain sensitive combinations.
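The flow described in the abstract, from identifying terms to reporting the portions of the document that contain sensitive combinations, might be sketched as follows. This is a simplified illustration under stated assumptions: the stopword filter in `candidate_terms`, the use of same-sentence term pairs as a stand-in for combinations "associated with a semantic meaning", and the function names are all hypothetical. The `is_sensitive` predicate is supplied by the caller and would wrap the corpus-based hit-count comparison.

```python
import re
from itertools import combinations
from typing import Callable

# Assumption: a tiny stopword list stands in for real candidate-term selection.
STOPWORDS = frozenset(
    {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}
)

Combo = tuple[str, ...]


def candidate_terms(sentence: str) -> list[str]:
    """Return distinct lowercase non-stopword tokens in order of appearance."""
    seen: list[str] = []
    for token in re.findall(r"[a-z0-9]+", sentence.lower()):
        if token not in STOPWORDS and token not in seen:
            seen.append(token)
    return seen


def flag_portions(
    document: str, is_sensitive: Callable[[Combo], bool]
) -> list[tuple[str, list[Combo]]]:
    """Return each sentence together with the term pairs in it judged sensitive."""
    results = []
    # Assumption: sentences are the "portions" of the document that get reported.
    for sentence in re.split(r"(?<=[.!?])\s+", document.strip()):
        terms = candidate_terms(sentence)
        flagged = [pair for pair in combinations(terms, 2) if is_sensitive(pair)]
        if flagged:
            results.append((sentence, flagged))
    return results
```

Passing the predicate in as a parameter keeps the document-processing side separate from the search side, which loosely mirrors the split between an analysis mechanism and a search engine interface in the apparatus claim.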
16 Citations
12 Claims
1. A computer-executed method for detecting sensitive content in a document, the method comprising:
receiving a document;
identifying a set of terms in the document;
generating a combination of terms, based on the identified terms, that is associated with a semantic meaning and is potentially sensitive;
performing a first search through a public corpus for the combination of terms and determining a first search hit count returned for the combination;
performing additional searches through the public corpus for individual terms in the combination and determining search hit counts returned for each term in the combination;
computing a search hit ratio between the first search hit count and the average search hit count for the individual terms in the combination;
in response to the computed search hit ratio being smaller than a predetermined value, labeling the combination of terms as sensitive; and
generating a result that indicates portions of the document which contain sensitive combinations;
determining a relationship between the identified terms by processing the document with a syntactic parser to determine relationships between the terms;
wherein determining the relationship between the identified terms further comprises scoring a respective relationship based on a determined relevance to the document and a likelihood of the terms being sensitive.
Dependent claims: 2, 3, 4.
5. A non-transitory computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method for detecting sensitive content in a document, the method comprising:
receiving a document;
identifying a set of terms in the document;
generating a combination of terms, based on the identified terms, that is associated with a semantic meaning and is potentially sensitive;
performing a first search through a public corpus for the combination of terms and determining a first search hit count returned for the combination;
performing additional searches through the public corpus for the individual terms in the combination and determining search hit counts returned for each term in the combination;
computing a search hit ratio between the first search hit count and the average search hit count for the individual terms in the combination;
in response to the computed search hit ratio being smaller than a predetermined value, labeling the combination of terms as sensitive; and
generating a result that indicates portions of the document which contain sensitive combinations;
determining a relationship between the identified terms by processing the document with a syntactic parser to determine relationships between the terms;
wherein determining the relationship between the identified terms further comprises scoring a respective relationship based on a determined relevance to the document and a likelihood of the terms being sensitive.
Dependent claims: 6, 7, 8.
9. An apparatus for detecting sensitive content in a document, comprising:
a processor;
a receiving mechanism configured to receive a document;
an analysis mechanism configured to:
identify a set of terms in the document; and
generate a combination of terms, based on the identified terms, that is associated with a semantic meaning and is potentially sensitive; and
a search engine interface configured to:
perform a first search through a public corpus for the combination of terms and determine a first search hit count returned for the combination; and
perform additional searches through the public corpus for the individual terms and determine search hit counts returned for each term in the combination;
wherein the analysis mechanism is further configured to:
compute a search hit ratio between the first search hit count and the average search hit count for the individual terms in the combination;
in response to the computed search hit ratio being smaller than a predetermined value, label the combination of terms as sensitive; and
generate a result that indicates portions of the document which contain sensitive combinations;
determine a relationship between the identified terms by processing the document with a syntactic parser to determine relationships between the terms;
wherein while determining the relationship between the identified terms, the analysis mechanism is further configured to score a respective relationship based on a determined relevance to the document and a likelihood of being sensitive.
Dependent claims: 10, 11, 12.
Specification