Method and apparatus for detecting sensitive content in a document

  • US 8,271,483 B2
  • Filed: 09/10/2008
  • Issued: 09/18/2012
  • Est. Priority Date: 09/10/2008
  • Status: Active Grant
First Claim

1. A computer-executed method for detecting sensitive content in a document, the method comprising:

    receiving a document;

    identifying a set of terms in the document;

    generating a combination of terms, based on the identified terms, that is associated with a semantic meaning and is potentially sensitive;

    performing a first search through a public corpus for the combination of terms and determining a first search hit count returned for the combination;

    performing additional searches through the public corpus for individual terms in the combination and determining search hit counts returned for each term in the combination;

    computing a search hit ratio between the first search hit count and the average search hit count for the individual terms in the combination;

    in response to the computed search hit ratio being smaller than a predetermined value, labeling the combination of terms as sensitive; and

    generating a result that indicates portions of the document which contain sensitive combinations;

    determining a relationship between the identified terms by processing the document with a syntactic parser to determine relationships between the terms;

    wherein determining the relationship between the identified terms further comprises scoring a respective relationship based on a determined relevance to the document and a likelihood of the terms being sensitive.
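
The hit-ratio test recited in the claim can be illustrated with a minimal sketch. It is not the patented implementation: the toy corpus, the function names (`count_hits`, `hit_ratio`, `sensitive_combinations`), the threshold of 0.25, and the handling of a zero average are all assumptions made for illustration, and a small in-memory list of documents stands in for a search over a real public corpus. The syntactic-parsing and relationship-scoring steps are omitted.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Set, Tuple

# Hypothetical stand-in for a public corpus: each document is a set of terms.
TOY_CORPUS: List[Set[str]] = [
    {"acme", "report"},
    {"acme", "report", "quarterly"},
    {"merger", "report"},
    {"merger", "news", "report"},
    {"acme", "news"},
]


def count_hits(terms: Sequence[str]) -> int:
    """Number of toy-corpus documents containing every term (the 'search hit count')."""
    wanted = set(terms)
    return sum(1 for doc in TOY_CORPUS if wanted <= doc)


def hit_ratio(combo: Sequence[str], counter: Callable[[Sequence[str]], int]) -> float:
    """Hit count for the combination divided by the average hit count of its terms."""
    combined = counter(combo)
    individual = [counter([term]) for term in combo]
    average = sum(individual) / len(individual)
    if average == 0:
        # Assumption: the claim does not cover this case; treat the ratio as 0.
        return 0.0
    return combined / average


def sensitive_combinations(
    terms: Sequence[str],
    counter: Callable[[Sequence[str]], int],
    threshold: float,
    size: int = 2,
) -> List[Tuple[str, ...]]:
    """Label a combination sensitive when its hit ratio is below the threshold."""
    return [
        combo
        for combo in combinations(terms, size)
        if hit_ratio(combo, counter) < threshold
    ]


if __name__ == "__main__":
    found = sensitive_combinations(["acme", "merger", "report"], count_hits, threshold=0.25)
    print(found)  # [('acme', 'merger')]
```

In this toy data "acme" and "merger" each appear in several documents but never together, so their ratio is 0 and the pair is labeled sensitive, while the pairs involving "report" co-occur often enough to stay above the threshold.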

  • 6 Assignments