Automatic lexicon generation system for detection of suspicious e-mails from a mail archive
First Claim
1. An automatic lexicon generation system to identify an information leak in an archive of e-mails, said system comprises:
a database;
a user console; and
a hardware tool in communication with the user console and configured to perform the following;
receive a set of example e-mails leaking information and documents leaking information via the user console wherein, portions of the e-mails leaking information and documents leaking information are identified as critical or not critical;
identify a first set of important key phrases from the set of example e-mails leaking information and documents leaking information using frequency analysis, word stemming, and removal of common words and domain specific words based on the identified critical or not critical portions of the e-mails leaking information and documents leaking information;
receive a set of example e-mails not leaking information and documents not leaking information via the user console;
identify a second set of important key phrases from the set of example e-mails not leaking information and documents not leaking information using frequency analysis, word stemming, and removal of common words;
create a set of relevant phrases based on the first set of important key phrases and the second set of important key phrases;
create a frequency table of the relevant phrases;
identify words from the frequency table which occur with higher frequency in documents or e-mails leaking information and with higher frequency in documents or e-mails not leaking information and assign a label of “very highly sensitive”, “highly sensitive”, “sensitive”, “not sensitive”, “safe”, or “sensitive” to each of the phrases of said set of relevant phrases;
assign weights to each of said relevant phrases;
build multiple key phrase lists and weights from said relevant phrases;
display said multiple key phrase lists on the user console for simulation;
receive approval of said multiple key phrase lists from the user console;
store the approved said multiple key phrase lists as weighted category lexicon in the database; and
apply said multiple key phrase lists on an archive of e-mails and documents for identifying any e-mail leaking information.
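The extraction and labelling steps recited in the claim can be sketched roughly as follows. This is only an illustration under stated assumptions: it works on single words rather than multi-word phrases, the stop-word list, the naive `stem` helper, the share thresholds in `label`, and the `WEIGHTS` table are all hypothetical, and the claim itself does not specify any of them.

```python
import re
from collections import Counter

# Assumption: a tiny stop-word list stands in for "removal of common words".
COMMON_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "for", "on", "at"}

def stem(word):
    """Naive suffix stripper (hypothetical; not a real stemming algorithm)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def key_terms(text):
    """Lower-case, tokenize, drop common words, and stem what remains."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in COMMON_WORDS]

def frequency_table(leaking, not_leaking):
    """Map each term to (count in leaking examples, count in non-leaking examples)."""
    leak = Counter(t for doc in leaking for t in key_terms(doc))
    safe = Counter(t for doc in not_leaking for t in key_terms(doc))
    return {t: (leak[t], safe[t]) for t in set(leak) | set(safe)}

def label(leak_count, safe_count):
    """Assumed thresholds on the share of occurrences found in leaking examples."""
    share = leak_count / (leak_count + safe_count)
    if share >= 0.9:
        return "very highly sensitive"
    if share >= 0.7:
        return "highly sensitive"
    if share >= 0.5:
        return "sensitive"
    if share >= 0.2:
        return "not sensitive"
    return "safe"

# Assumption: one fixed weight per label.
WEIGHTS = {
    "very highly sensitive": 4.0,
    "highly sensitive": 3.0,
    "sensitive": 2.0,
    "not sensitive": 0.5,
    "safe": 0.0,
}

def build_lexicon(leaking, not_leaking):
    """Return {term: (label, weight)} for every term in the frequency table."""
    table = frequency_table(leaking, not_leaking)
    return {t: (label(*counts), WEIGHTS[label(*counts)]) for t, counts in table.items()}
```

A term seen only in leaking examples ends up in the top category, while a term seen only in non-leaking examples is labelled "safe". A real system would tune the thresholds and weights, which is presumably what the approval step on the user console is for.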
Abstract
A system is provided for generating a lexicon of words, organized into weighted categories, from a user-defined set of example documents, for detecting suspicious e-mails in a mail archive. The system uses a set of example documents and e-mails supplied by the user to probabilistically find candidate lists of critical words. The resulting list is then applied to an archive of e-mails. The system generates an inverted index over the e-mails in the archive to facilitate searching for the key phrases. User feedback is taken on the results obtained, and corrections to the lexicon are made if necessary. Thus, the e-mails are scanned based on user feedback, user-defined words, and the automatically generated word list. These lists constantly adapt as the e-mails in the archive change. The system then combines all of these to present the user with several possible sets of keywords, with their relative importance, that can be used as a policy for a desired level of accuracy. The system also shows the user what changes if a set is modified. Finally, the system searches through the entire mail archive to find suspicious e-mails.
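As a rough illustration of the inverted index and the weighted search described in the abstract, the sketch below indexes a small in-memory archive and sums lexicon weights per e-mail. The function names, the dictionary-based archive, and the threshold rule are assumptions made for the example; the abstract does not say how scores are combined.

```python
import re
from collections import defaultdict

def build_inverted_index(archive):
    """Map each lower-cased word to the set of e-mail ids that contain it."""
    index = defaultdict(set)
    for mail_id, body in archive.items():
        for word in re.findall(r"[a-z]+", body.lower()):
            index[word].add(mail_id)
    return index

def score_archive(index, lexicon):
    """Sum the weights of the lexicon words found in each e-mail."""
    scores = defaultdict(float)
    for word, weight in lexicon.items():
        # index.get avoids creating empty entries for words absent from the archive
        for mail_id in index.get(word, ()):
            scores[mail_id] += weight
    return dict(scores)

def suspicious(scores, threshold):
    """Return the ids of e-mails whose score reaches the (assumed) threshold."""
    return sorted(m for m, s in scores.items() if s >= threshold)
```

With the index in place, each lexicon word costs one lookup rather than a scan of every e-mail, which is why an index helps when the lexicon is revised and re-applied repeatedly.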
6 Claims
6. A method of detecting suspicious documents from an archive comprising:
receiving via a user console a plurality of example e-mails leaking information and documents leaking information via the user console wherein, portions of the e-mails leaking information and documents leaking information are indicated as critical or not critical;
identifying, by a hardware tool, a first set of important key phrases from the plurality of example e-mails leaking information and documents leaking information using frequency analysis, word stemming, and removal of common words and domain specific words, wherein the identifying is based on the critical or not critical indications of the e-mails leaking information and documents;
receiving a plurality of example e-mails not leaking information and documents not leaking information via the user console;
identifying a second set of important key phrases from the plurality of example e-mails not leaking information and documents not leaking information using frequency analysis, word stemming, and removal of common words by the hardware tool;
creating a set of relevant phrases based on the first set of important key phrases and the second set of important key phrases by the hardware tool;
creating a frequency table of the relevant phrases by the hardware tool;
identifying words from the frequency table which occur with higher frequency in documents or e-mails leaking information and with higher frequency in documents or e-mails not leaking information by the hardware tool;
assigning one of “very highly sensitive,” “highly sensitive,” “sensitive,” “not sensitive,” “safe,” or “sensitive” to the phrases of the set of relevant phrases based on the words from the frequency table which occur with higher frequency in documents or e-mails leaking information and with higher frequency in documents or e-mails not leaking information by the hardware tool;
assigning weights to the relevant phrases after assigning a label by the hardware tool;
building multiple key phrase lists and weights from the relevant phrases by the hardware tool;
displaying the multiple key phrase lists on the user console;
receiving from the user console an indication of at least one of a modification, a deletion, or an addition to the multiple key phrase lists and creating modified multiple key phrase lists;
simulating the effect of the indication of at least one of the modification, the deletion, or the addition to the multiple key phrase lists on a pre-defined set of emails by the hardware tool;
receiving approval of the indication from the user console;
modifying the multiple key phrase lists based on the received indication from the user console;
storing the multiple key phrase lists as a weighted category lexicon by the hardware tool; and
applying the multiple key phrase lists on an archive of e-mails and documents for identifying any e-mail leaking information by the hardware tool.
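The simulation step in this claim (previewing how a modification, deletion, or addition would change the results on a pre-defined set of e-mails before approval) might look something like the sketch below. The substring matching, the threshold rule, and the convention that a weight of `None` means deletion are all assumptions made for illustration, not details taken from the claim.

```python
def flagged(emails, lexicon, threshold):
    """Return ids of e-mails whose summed phrase weights reach the threshold."""
    hits = set()
    for mail_id, body in emails.items():
        text = body.lower()
        # Assumption: a phrase matches if it occurs as a substring of the body.
        score = sum(w for phrase, w in lexicon.items() if phrase in text)
        if score >= threshold:
            hits.add(mail_id)
    return hits

def simulate_change(emails, lexicon, changes, threshold):
    """Preview proposed changes on a copy of the lexicon; the original is untouched.

    `changes` maps phrase -> new weight (add or modify) or None (delete).
    """
    modified = dict(lexicon)
    for phrase, weight in changes.items():
        if weight is None:
            modified.pop(phrase, None)
        else:
            modified[phrase] = weight
    before = flagged(emails, lexicon, threshold)
    after = flagged(emails, modified, threshold)
    return {
        "newly_flagged": sorted(after - before),
        "no_longer_flagged": sorted(before - after),
        "modified_lexicon": modified,
    }
```

Because the preview runs on a copy, the stored lexicon would only be replaced by `modified_lexicon` once the user approves the change, mirroring the order of steps in the claim.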
Specification