AUTOMATIC LEXICON GENERATION SYSTEM FOR DETECTION OF SUSPICIOUS E-MAILS FROM A MAIL ARCHIVE
First Claim
1. An automatic lexicon generation system to identify and construct a list of English phrases from a user specified set of example e-mails and documents written in English, said phrases being a set of relevant key phrases useful for identifying information leak in an archive of e-mails, said system comprises:
a) means (102) to identify a set of important key phrases from a user specified set of example e-mails leaking information and documents leaking information, written in English, using frequency analysis, word stemming, and removal of common words and domain specific words (FIG. 1, FIG. 4);
b) means (102, 506, 511) to identify a set of important key phrases from a user specified set of example e-mails not leaking information and documents not leaking information, written in English, using frequency analysis, word stemming, and removal of common words (FIG. 1, FIG. 5);
c) means (405, 511) to identify a set of relevant phrases and to assign a label, one of “very highly sensitive”, “highly sensitive”, “sensitive”, “not sensitive” or “sensitive” to each of the phrases of said set (FIG. 4, FIG. 5);
d) means (613) for assigning weights to each of said key phrases (FIG. 6);
e) means (614) for building multiple key phrase lists and weights from said important key phrases (FIG. 6);
f) means (615) for presenting said lists of key phrases to the user for simulation and for storing the final approved list as weighted category lexicon (FIG. 6);
g) means (716, 717) for using said list of phrases on an archive of e-mails and documents written in English for identifying any e-mail leaking information (FIG. 7).
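As an illustration only, the key phrase identification in elements a) and b) (frequency analysis, word stemming, removal of common words) could be sketched as below. This is not the claimed implementation: the stop-word list, the crude suffix-stripping `naive_stem`, the `max_len` and `min_count` parameters, and the rule of keeping phrases that occur more often in the leaking examples than in the non-leaking ones are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list; a real system would use a far larger one
# and would also remove domain specific words.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in",
              "for", "on", "this", "are", "be", "at"}


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokens(text):
    """Lowercase, drop common words, then stem what remains."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


def phrase_counts(docs, max_len=2):
    """Count every phrase of 1..max_len consecutive stemmed tokens."""
    counts = Counter()
    for doc in docs:
        toks = tokens(doc)
        for n in range(1, max_len + 1):
            for i in range(len(toks) - n + 1):
                counts[" ".join(toks[i:i + n])] += 1
    return counts


def key_phrases(leaking, non_leaking, min_count=2):
    """Keep phrases frequent in leaking examples and rarer in non-leaking ones."""
    leak = phrase_counts(leaking)
    clean = phrase_counts(non_leaking)
    return sorted(p for p, c in leak.items()
                  if c >= min_count and c > clean[p])
```

Calling `key_phrases(leaking_examples, non_leaking_examples)` with two lists of strings returns candidate phrases that could then be labeled and weighted as in elements c) and d).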
Abstract
A system is provided for generating a lexicon of words, organized into weighted categories, from a user defined set of example documents, for detecting suspicious e-mails in a mail archive. The system uses a set of example documents and e-mails given by the user to probabilistically find possible lists of critical words. The obtained list is then applied to an archive of e-mails. The system generates an inverted index on the mails from the archive to facilitate search for the key phrases. User feedback is taken on the results obtained, and corrections to the lexicon are made if necessary. Thus, the mails are scanned based on user feedback, user defined words and the automatically generated word list. These lists constantly adapt as the e-mails in the archive change. The system then combines all of these to present the user with several possible sets of keywords, with their relative importance, that can be used as a policy for a desired level of accuracy. The system also shows the user how the results change if a set is modified. Finally, the system searches through the entire mail archive to find suspicious e-mails.
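The inverted index and weighted search described in the abstract can be pictured with a minimal sketch such as the following. It is an assumption-laden illustration, not the patented system: the function names, the single-word lexicon, the additive scoring rule and the `threshold` parameter are all hypothetical.

```python
import re
from collections import defaultdict


def build_inverted_index(mails):
    """Map each lowercase word to the set of mail ids that contain it."""
    index = defaultdict(set)
    for mail_id, text in mails.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(mail_id)
    return index


def score_mails(index, lexicon):
    """Sum the weights of the lexicon words found in each mail (assumed rule)."""
    scores = defaultdict(float)
    for word, weight in lexicon.items():
        for mail_id in index.get(word, ()):
            scores[mail_id] += weight
    return dict(scores)


def suspicious(mails, lexicon, threshold):
    """Return ids of mails whose lexicon score reaches the threshold."""
    scores = score_mails(build_inverted_index(mails), lexicon)
    return sorted(m for m, s in scores.items() if s >= threshold)
```

Because the index is built once per archive, trying a modified lexicon or threshold only requires re-running `score_mails`, which is one way the effect of changing a keyword set could be shown to the user quickly.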
6 Claims
Specification