
Automatic lexicon generation system for detection of suspicious e-mails from a mail archive

  • US 8,321,204 B2
  • Filed: 03/16/2009
  • Issued: 11/27/2012
  • Est. Priority Date: 08/26/2008
  • Status: Active Grant
First Claim

1. An automatic lexicon generation system to identify an information leak in an archive of e-mails, said system comprises:

  • a database;

    a user console; and

    a hardware tool in communication with the user console and configured to perform the following;

    receive a set of example e-mails leaking information and documents leaking information via the user console wherein, portions of the e-mails leaking information and documents leaking information are identified as critical or not critical;

    identify a first set of important key phrases from the set of example e-mails leaking information and documents leaking information using frequency analysis, word stemming, and removal of common words and domain specific words based on the identified critical or not critical portions of the e-mails leaking information and documents leaking information;

    receive a set of example e-mails not leaking information and documents not leaking information via the user console;

    identify a second set of important key phrases from the set of example e-mails not leaking information and documents not leaking information using frequency analysis, word stemming, and removal of common words;

    create a set of relevant phrases based on the first set of important key phrases and the second set of important key phrases;

    create a frequency table of the relevant phrases;

    identify words from the frequency table which occur with higher frequency in documents or e-mails leaking information and with higher frequency in documents or e-mails not leaking information and assign a label of “very highly sensitive”, “highly sensitive”, “sensitive”, “not sensitive”, “safe”, or “sensitive” to each of the phrases of said set of relevant phrases;

    assign weights to each of said relevant phrases;

    build multiple key phrase lists and weights from said relevant phrases;

    display said multiple key phrase lists on the user console for simulation;

    receive approval of said multiple key phrase lists from the user console;

    store the approved said multiple key phrase lists as weighted category lexicon in the database; and

    apply said multiple key phrase lists on an archive of e-mails and documents for identifying any e-mail leaking information.
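
The processing steps recited in the claim can be illustrated with a minimal sketch. It is not the patented implementation: the claim does not specify a stemmer, a stop-word list, label thresholds, or a weighting formula, so `COMMON_WORDS`, `stem`, the threshold values, and the weight (the share of a term's occurrences that come from leaking examples) are all assumptions made for illustration. The sketch also works on single words rather than multi-word phrases and omits the user-console review and database storage steps.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the claim only says "common words".
COMMON_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
                "is", "for", "on", "at", "we", "please"}


def stem(word: str) -> str:
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def key_terms(text: str) -> list[str]:
    """Lower-case, tokenize, drop common words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in COMMON_WORDS]


def term_frequencies(docs: list[str]) -> Counter:
    """Frequency table of stemmed terms across a set of example documents."""
    counts: Counter = Counter()
    for doc in docs:
        counts.update(key_terms(doc))
    return counts


def label_for(weight: float) -> str:
    """Map a weight to a claim label; the thresholds are illustrative only."""
    if weight >= 0.9:
        return "very highly sensitive"
    if weight >= 0.7:
        return "highly sensitive"
    if weight >= 0.5:
        return "sensitive"
    if weight > 0.0:
        return "not sensitive"
    return "safe"


def build_lexicon(leaking: list[str], safe: list[str]) -> dict[str, tuple[str, float]]:
    """Compare both frequency tables and assign a label and weight per term."""
    leak_tf = term_frequencies(leaking)
    safe_tf = term_frequencies(safe)
    lexicon = {}
    for term in set(leak_tf) | set(safe_tf):
        # Share of this term's occurrences found in leaking examples (0.0 to 1.0).
        weight = leak_tf[term] / (leak_tf[term] + safe_tf[term])
        lexicon[term] = (label_for(weight), weight)
    return lexicon


def score(email: str, lexicon: dict[str, tuple[str, float]]) -> float:
    """Average lexicon weight of the terms in an e-mail; unknown terms count as 0."""
    terms = key_terms(email)
    if not terms:
        return 0.0
    return sum(lexicon.get(t, ("", 0.0))[1] for t in terms) / len(terms)


leaking_examples = ["Attached is the confidential merger pricing",
                    "confidential pricing for the merger"]
safe_examples = ["Lunch meeting at noon", "Meeting notes attached"]
lexicon = build_lexicon(leaking_examples, safe_examples)
```

With these toy examples, a term that appears only in the leaking set (such as "confidential") receives the top label, a term split evenly between the two sets (such as the stem of "attached") lands in the middle, and calling `score` on each e-mail in an archive gives a ranking that a reviewer could threshold.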
