Content filtering for electronic documents generated in multiple foreign languages
Abstract
A system for collecting and categorizing metadata about content provided via the internet or an intranet, regardless of the language in which the content was generated. The content of each document is assigned token IDs, which are the same for any given topic irrespective of the language in which the document is written. Filtering a single-language document generates a single output, whereas a multilingual document is divided into language segments, each of which is filtered by the appropriate language filter.
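As an illustration of the language-independent token IDs described in the abstract, the following minimal Python sketch maps terms from several languages to a shared token ID and then matches a document against topic categories. The lexicon, the ID values, and the names `TERM_TO_TOKEN_ID`, `TOPIC_TOKEN_IDS`, `tokenize_document`, and `match_topics` are hypothetical and are not taken from the patent; real term extraction would be far more involved than simple word splitting.

```python
import re

# Hypothetical multilingual lexicon: terms in any language share one token ID per concept.
TERM_TO_TOKEN_ID = {
    "football": 101, "fútbol": 101, "fußball": 101,
    "election": 202, "elección": 202, "wahl": 202,
}

# Hypothetical topic categories, each expressed as a set of topic token IDs.
TOPIC_TOKEN_IDS = {
    "sports": {101},
    "politics": {202},
}


def tokenize_document(text):
    """Replace document content with the token IDs of the known terms it contains."""
    words = re.findall(r"\w+", text.lower())
    return [TERM_TO_TOKEN_ID[w] for w in words if w in TERM_TO_TOKEN_ID]


def match_topics(document_token_ids):
    """Return the sorted names of topics whose token IDs appear in the document."""
    present = set(document_token_ids)
    return sorted(name for name, ids in TOPIC_TOKEN_IDS.items() if ids & present)
```

Because the Spanish and English terms resolve to the same ID, `tokenize_document("El fútbol es popular")` and `tokenize_document("Football is popular")` produce the same token list under this assumed lexicon.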
13 Claims
1. A method for categorizing documents generated in one or more languages comprising the steps of:
providing topic categories representing the terms from all of said languages for topic subject matter from documents;
assigning topic token IDs to said topic categories regardless of language of generation;
for each document to be categorized, assigning document token IDs representing the terms from all of said languages for the document subject matter, consistent with said topic categories;
replacing document content with at least one replacement document token ID for each of said topic categories; and
matching topic token IDs to said at least one replacement document token ID.
identifying documents as monolingual or multilingual; and
labelling the documents and portions thereof with language identifiers for each language found therein.
6. The method of claim 5 further comprising multiplexing each of said documents into a plurality of output streams, one for each language;
and filtering each of said plurality of output streams in a different language filter.
7. The method of claim 3 further comprising calculating the dot products of the document vectors, and sorting the products.
8. The method of claim 7 further comprising comparing dot products to determine how closely the document matches said topic category.
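One way to picture the dot-product comparison in claims 7 and 8 is the Python sketch below. It assumes that a topic and each document are represented as count vectors over a fixed list of token IDs; the vocabulary, the vector layout, and the names `VOCABULARY`, `to_vector`, `dot`, and `rank_documents` are illustrative assumptions, since the claims shown here do not specify how the vectors are built.

```python
from collections import Counter

# Hypothetical fixed vocabulary of token IDs; position i of every vector counts VOCABULARY[i].
VOCABULARY = [101, 202, 303]


def to_vector(token_ids):
    """Convert a list of token IDs into a count vector aligned with VOCABULARY."""
    counts = Counter(token_ids)
    return [counts[t] for t in VOCABULARY]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_documents(topic_vector, documents):
    """Score each document against the topic by dot product and sort, highest score first."""
    scores = [(dot(topic_vector, to_vector(ids)), name) for name, ids in documents.items()]
    return sorted(scores, reverse=True)
```

In this sketch a larger dot product is read as a closer match to the topic category; a production system might normalize the vectors first, which is a design choice outside what the claims state.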
9. The method of claim 1 further comprising the steps of:
receiving at least one user query;
assigning at least one query token ID to said at least one user query; and
matching said at least one document token ID to said at least one user query token ID.
10. The method of claim 9 further comprising the steps of:
converting said topic token IDs into topic vectors;
converting said at least one document token ID into at least one document vector;
converting said at least one user query token ID into at least one query vector; and
wherein said matching comprises vector processing.
11. The method of claim 1 further comprising the steps of:
identifying documents as monolingual or multilingual; and
labeling the documents and portions thereof with language identifiers for each language found therein.
12. The method of claim 1 further comprising multiplexing each of said documents into a plurality of output streams, one for each language;
and filtering each of said plurality of output streams in a different language filter.
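The multiplexing step in claims 6 and 12 can be sketched as routing language-labeled segments into one stream per language and passing each stream to its own filter. The Python sketch below assumes the segments already carry language identifiers (as in claim 11); the `(language, text)` segment format, the placeholder filter functions, and the names `multiplex` and `filter_streams` are hypothetical.

```python
from collections import defaultdict


def multiplex(segments):
    """Group (language, text) segments into one output stream per language, preserving order."""
    streams = defaultdict(list)
    for language, text in segments:
        streams[language].append(text)
    return dict(streams)


def filter_streams(streams, filters):
    """Apply each language's filter to its stream; languages with no filter are skipped."""
    return {
        lang: [filters[lang](text) for text in texts]
        for lang, texts in streams.items()
        if lang in filters
    }
```

A monolingual document yields a single stream here, while a document with English and Spanish segments yields two streams, each handled by a different filter function.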
13. A system for categorizing documents according to topic categories, said topic categories representing the terms from more than one language for topic subject matter, said documents having been generated in one or more languages comprising:
means for identifying languages in which said documents were generated;
means for embedding language markers in said documents where said identified languages appear;
means for assigning topic token IDs to said topic categories;
means for assigning document token IDs representing the terms from more than one language for document subject matter, consistent with each of said topic categories;
means for replacing document content with at least one replacement document token ID for each of said topic categories; and
a plurality of document filter means, one for each of said one or more languages, each of said plurality of document filter means being adapted to recognize said replacement document token IDs and match said topic token IDs to said at least one document token ID.
Specification