Method for categorizing documents into subjects using relevance normalization for documents retrieved from an information retrieval system in response to a query
First Claim
1. A subjector for selectively storing input information in an information retrieval system database, said input information being formed of a collection of English language words, comprising:
at least one information subject category within said information retrieval system database;
a first plurality of subjected documents selected from said information retrieval system database and relating to said first information subject category;
a preliminary lexicon determined in accordance with said first plurality of subjected documents wherein an information comparing unit first compares selected documents with said preliminary lexicon;
a second plurality of documents selected in accordance with said first comparing, wherein a determination is made whether documents of said second plurality of documents belong in said first subject category and documents are removed from said second plurality of documents in accordance with said determining whether said documents belong in said second plurality of documents to provide a remaining third plurality of documents, and said information comparing unit second compares said third plurality of documents to determine said first subject lexicon in accordance with said third plurality of documents;
said first subject lexicon corresponding to said information subject category and containing information representative of said information subject category, said first subject lexicon containing a plurality of classifier words, wherein generally all of said classifier words in said first subject lexicon contain at least one English language word;
an information comparing unit for third comparing said collection of English language words from said input information with said classifier words from said first subject lexicon, wherein said collection of English language words from said input information form at least one document; and
memory for storing said input information in said information subject category in accordance with said third comparing.
Abstract
A method for storing input information in an information retrieval system database wherein a plurality of information subject categories are provided. A plurality of subject lexicons are provided, each subject lexicon of the plurality of subject lexicons corresponding to an information subject category of the plurality of information categories. Each subject lexicon contains information representative of its corresponding information subject category. The input information is compared with the subject lexicons and the input information is stored in a selected information subject category according to the comparing of the input information with the subject lexicons.
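The comparison-and-storage flow described in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the single-word lexicon entries, the numeric weights, the sum-of-weights scoring rule, and the names `score`, `store`, `lexicons`, and `database` are all assumptions introduced here.

```python
from collections import defaultdict

def score(words, lexicon):
    """Sum the weights of lexicon entries found in the input words (assumed scoring rule)."""
    return sum(weight for term, weight in lexicon.items() if term in words)

def store(text, lexicons, database):
    """Compare the input with every subject lexicon and file it under the best match."""
    words = set(text.lower().split())
    best = max(lexicons, key=lambda subject: score(words, lexicons[subject]))
    database[best].append(text)
    return best

# Hypothetical lexicons: one per information subject category.
lexicons = {
    "finance": {"bank": 1.0, "loan": 1.5, "interest": 0.5},
    "medicine": {"patient": 1.0, "dose": 1.5, "clinical": 1.0},
}
database = defaultdict(list)
chosen = store("The bank raised the loan interest rate", lexicons, database)
```

Here the input shares three entries with the finance lexicon and none with the medicine lexicon, so it is stored under the finance category.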
10 Claims
3. A method for storing input information in an information retrieval system database, said input information being formed of a collection of English language words, comprising the steps of:
(A) determining an information subject category within said information retrieval system database;
(B) determining a subject lexicon corresponding to said information subject category and containing information representative of said information subject category, said subject lexicon containing a plurality of classifier words, wherein each of said plurality of classifier words contains a first English language word (a) and a second English word (b), and each of said plurality of classifier words has a discriminator weight (w(a, b)) calculated in accordance with the formula:
$$w(a, b) = \log \frac{P(ab)}{P(a)\,P(b)}$$
wherein P(a) is a probability that said first English language word (a) occurs in a document pool, P(b) is a probability that said second English language word (b) occurs in said document pool, and P(ab) is a probability that said first English language word (a) is positioned adjacent to said second English language word (b) in a subject corpus;
(C) comparing said collection of English language words from said input information with said classifier words from said subject lexicon; and
(D) storing said input information in said information subject category in accordance with said comparing of step (C).
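The discriminator weight of claim 3 can be sketched numerically as follows. The claim does not say how the probabilities are estimated, so this sketch assumes that P(a) and P(b) are the fractions of pool documents containing each word and that P(ab) is the fraction of adjacent word pairs in the subject corpus equal to (a, b). The function name `discriminator_weights` and the toy data are hypothetical.

```python
import math
from collections import Counter

def discriminator_weights(pool_docs, subject_docs):
    """Compute w(a, b) = log(P(ab) / (P(a) * P(b))) for adjacent word pairs.

    Each document is a list of tokens. The estimators are assumptions:
    document frequency in the pool, bigram frequency in the subject corpus.
    """
    n_pool = len(pool_docs)
    doc_freq = Counter()
    for doc in pool_docs:
        doc_freq.update(set(doc))          # count each word once per document

    pair_counts = Counter()
    for doc in subject_docs:
        pair_counts.update(zip(doc, doc[1:]))  # adjacent (a, b) pairs
    total_pairs = sum(pair_counts.values())

    weights = {}
    for (a, b), count in pair_counts.items():
        if doc_freq[a] == 0 or doc_freq[b] == 0:
            continue                       # word absent from the pool: weight undefined
        p_ab = count / total_pairs
        p_a = doc_freq[a] / n_pool
        p_b = doc_freq[b] / n_pool
        weights[(a, b)] = math.log(p_ab / (p_a * p_b))
    return weights

pool = [["interest", "rate"], ["interest", "group"], ["heart", "rate"], ["the", "group"]]
subject = [["interest", "rate", "rose"], ["interest", "rate"]]
w = discriminator_weights(pool, subject)
```

With this toy data, P(ab) for ("interest", "rate") is 2/3 and P(a) = P(b) = 1/2, so the weight is log(8/3), a positive value indicating that the pair is adjacent in the subject corpus more often than the pool frequencies of its words alone would suggest.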
4. A subjector for selectively storing input information in an information retrieval system database, said input information being formed of a collection of English language words, comprising:
at least one information subject category within said information retrieval system database;
a subject lexicon corresponding to said information subject category and containing information representative of said information subject category, said subject lexicon containing a plurality of classifier words, wherein each of said plurality of classifier words contains a first English language word (a) and a second English word (b), and each of said plurality of classifier words has a discriminator weight (w(a, b)) calculated in accordance with the formula:
$$w(a, b) = \log \frac{P(ab)}{P(a)\,P(b)}$$
wherein P(a) is a probability that said first English language word (a) occurs in a document pool, P(b) is a probability that said second English language word (b) occurs in said document pool, and P(ab) is a probability that said first English language word (a) is positioned adjacent to said second English language word (b) in a subject corpus;
an information comparing unit for first comparing said collection of English language words from said input information with said classifier words from said subject lexicon; and
memory for storing said input information in said information subject category in accordance with said first comparing.
5. A method for storing input information in an information retrieval system database, said input information being formed of a collection of English language words, comprising the steps of:
(a) determining an information subject category within said information retrieval system database;
(b) determining a first subject lexicon corresponding to said information subject category and containing information representative of said information subject category, said first subject lexicon containing a plurality of classifier words, wherein generally all of said classifier words in said first subject lexicon contain at least one English language word, said step of determining said first subject lexicon including:
(i) first selecting a first plurality of subjected documents from said information retrieval system database relating to said information subject category to provide first selected documents;
(ii) determining a preliminary lexicon in accordance with said first plurality of subjected documents;
(iii) comparing selected documents with said preliminary lexicon;
(iv) second selecting a second plurality of documents in accordance with said comparing of step (iii);
(v) determining whether documents of said second plurality of documents belong in said first subject category;
(vi) removing documents from said second plurality of documents in accordance with the determining of step (v) to provide a third plurality of documents; and
(vii) determining said first subject lexicon in accordance with said third plurality of documents;
(c) comparing said collection of English language words from said input information with said classifier words from said first subject lexicon, wherein said collection of English language words from said input information form at least one document; and
(d) storing said input information in said information subject category in accordance with said comparing of step (c).
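Steps (i) through (vii) of claim 5 describe a two-pass refinement of the lexicon, which can be sketched as follows. The claim does not specify how a lexicon is derived from documents, how documents are compared with it, or how membership is determined, so the frequency-based `build_lexicon`, the shared-word threshold in `matches`, and the `belongs` callback (a stand-in for whatever review decides membership) are all assumptions made for illustration.

```python
from collections import Counter

def build_lexicon(docs, size=3):
    """Hypothetical lexicon builder: the `size` most frequent words in the documents."""
    counts = Counter(word for doc in docs for word in doc.split())
    return {word for word, _ in counts.most_common(size)}

def matches(doc, lexicon, threshold=2):
    """Assumed comparison: a document matches if it shares enough words with the lexicon."""
    return len(set(doc.split()) & lexicon) >= threshold

def refine_lexicon(seed_docs, candidate_docs, belongs):
    preliminary = build_lexicon(seed_docs)                            # step (ii)
    second = [d for d in candidate_docs if matches(d, preliminary)]   # steps (iii), (iv)
    third = [d for d in second if belongs(d)]                         # steps (v), (vi)
    return third, build_lexicon(third)                                # step (vii)

seed = ["loan bank loan", "bank loan rate"]                           # step (i)
candidates = [
    "bank loan approved",
    "river bank flood",
    "loan rate bank",
    "bank loan shark movie",
]
# Stand-in membership check; in practice this could be a manual review.
third, lexicon = refine_lexicon(seed, candidates, lambda d: "movie" not in d)
```

In this toy run the preliminary lexicon pulls in three candidates, the membership check removes the off-topic one, and the final lexicon is rebuilt from the two documents that remain.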
Specification