Content filtering for electronic documents generated in multiple foreign languages

US 6,542,888 B2
Filed: 11/26/1997
Issued: 04/01/2003
Est. Priority Date: 11/26/1997
Status: Expired due to Term

Abstract

A system for collecting and categorizing metadata about content provided via the internet or intranet, regardless of the language of generation of the content. The content of each document is assigned token IDs, which token IDs are the same for any given topic irrespective of the language in which the document is written. Filtering of single language documents will generate a single output; whereas, multilingual documents will be divided into language segments with each segment being filtered by the appropriate language filter.
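
The language-independent token IDs described in the abstract can be illustrated with a minimal sketch. The lexicon contents, the numeric IDs, and the `tokenize` helper below are hypothetical and chosen only for illustration; the patent record does not specify them.

```python
# Hypothetical shared token IDs: one ID per topic, regardless of language.
TOPIC_TOKEN_IDS = {"finance": 101, "weather": 102}

# Hypothetical lexicon: terms from English, Spanish, and French map to topics.
TERM_TO_TOPIC = {
    "bank": "finance", "banco": "finance", "banque": "finance",
    "money": "finance", "dinero": "finance", "argent": "finance",
    "rain": "weather", "lluvia": "weather", "pluie": "weather",
}

def tokenize(text: str) -> list[int]:
    """Replace document content with token IDs; terms outside the lexicon are dropped."""
    ids = []
    for word in text.lower().split():
        topic = TERM_TO_TOPIC.get(word.strip(".,;:!?"))
        if topic is not None:
            ids.append(TOPIC_TOKEN_IDS[topic])
    return ids

english = tokenize("The bank holds money.")
spanish = tokenize("El banco guarda dinero.")
```

Under these assumptions both sentences reduce to the same token ID sequence, which is the property that lets a single topic definition match documents written in different languages.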

13 Claims

1. A method for categorizing documents generated in one or more languages comprising the steps of:
- providing topic categories representing the terms from all of said languages for topic subject matter from documents;
- assigning topic token IDs to said topic categories regardless of language of generation;
- for each document to be categorized, assigning document token IDs representing the terms from all of said languages for the document subject matter, consistent with said topic categories;
- replacing document content with at least one replacement document token ID for each of said topic categories; and
- matching topic token IDs to said at least one replacement document token ID.
2. The method of claim 1 further comprising eliminating all document information not relating to said topic categories.
3. The method of claim 1 further comprising converting said topic token IDs and said at least one document token ID into normalized vector representations and wherein said matching comprises vector processing.
4. The method of claim 3 further comprising categorizing vectors that have at least one token ID in common.
5. The method of claim 1 further comprising the steps of:
6. The method of claim 5 further comprising multiplexing each of said documents into a plurality of output streams, one for each language; and filtering each of said plurality of output streams in a different language filter.
7. The method of claim 3 further comprising calculating the dot products of the document vectors, and sorting the products.
8. The method of claim 7 further comprising comparing dot products to determine how closely the document matches said topic category.
9. The method of claim 1 further comprising the steps of:
- receiving at least one user query;
- assigning at least one query token ID to said at least one user query; and
- matching said at least one document token ID to said at least one user query token ID.
10. The method of claim 9 further comprising the steps of:
- converting said topic token IDs into topic vectors;
- converting said at least one document token ID into at least one document vector;
- converting said at least one user query token ID into at least one query vector; and
- wherein said matching comprises vector processing.
11. The method of claim 1 further comprising the steps of:
- identifying documents as monolingual or multilingual; and
- labeling the documents and portions thereof with language identifiers for each language found therein.
12. The method of claim 1 further comprising multiplexing each of said documents into a plurality of output streams, one for each language; and filtering each of said plurality of output streams in a different language filter.
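
The vector processing recited in claims 3, 7, and 8 can be sketched as follows. This is one plausible reading, not the patent's stated implementation: token IDs are counted into sparse vectors, scaled to unit length, and compared to a topic vector by dot product, with the results sorted so that higher scores indicate closer matches. The function names and sample data are hypothetical.

```python
import math
from collections import Counter

def to_unit_vector(token_ids):
    """Count token IDs and scale the counts to unit length (L2 normalization)."""
    counts = Counter(token_ids)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {tid: c / norm for tid, c in counts.items()}

def dot(u, v):
    """Dot product of two sparse vectors keyed by token ID."""
    return sum(w * v[tid] for tid, w in u.items() if tid in v)

def rank_documents(topic_ids, documents):
    """Return (name, score) pairs sorted from closest to least close match."""
    topic_vec = to_unit_vector(topic_ids)
    scores = [(name, dot(topic_vec, to_unit_vector(ids)))
              for name, ids in documents.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical documents, already reduced to token IDs.
docs = {
    "doc_a": [101, 101, 102],
    "doc_b": [102, 102, 102],
    "doc_c": [103],
}
ranking = rank_documents([101], docs)
```

In this sketch a document scores above zero only when it shares at least one token ID with the topic, which is in the spirit of claim 4. A user query (claims 9 and 10) could be handled the same way by passing the query's token IDs in place of `topic_ids`.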

13. A system for categorizing documents according to topic categories, said topic categories representing the terms from more than one language for topic subject matter, said documents having been generated in one or more languages comprising:
- means for identifying languages in which said documents were generated;
- means for embedding language markers in said documents where said identified languages appear;
- means for assigning topic token IDs to said topic categories;
- means for assigning document token IDs representing the terms from more than one language for document subject matter, consistent with each of said topic categories;
- means for replacing document content with at least one replacement document token ID for each of said topic categories; and
- a plurality of document filter means, one for each of said one or more languages, each of said plurality of document filter means being adapted to recognize said replacement document token IDs and match said topic token IDs to said at least one document token ID.
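
The language markers and per-language filters of claim 13 can be sketched roughly as below. The `[[lang:xx]]` marker syntax, the lexicons, and the function names are all assumptions made for illustration; the record above does not describe the marker format or how languages are detected, so the sketch starts from text that is already marked.

```python
import re
from collections import defaultdict

# Hypothetical marker syntax for embedded language identifiers.
MARKER = re.compile(r"\[\[lang:([a-z]{2})\]\]")

# Hypothetical per-language filters: each maps its own terms to shared token IDs.
LANGUAGE_FILTERS = {
    "en": {"bank": 101, "rain": 102},
    "es": {"banco": 101, "lluvia": 102},
}

def split_into_streams(marked_text):
    """Split marked text into one output stream of segments per language."""
    streams = defaultdict(list)
    parts = MARKER.split(marked_text)
    # parts = [text before first marker, lang1, segment1, lang2, segment2, ...]
    for lang, segment in zip(parts[1::2], parts[2::2]):
        streams[lang].append(segment.strip())
    return dict(streams)

def filter_streams(streams):
    """Run each language stream through its own filter and collect token IDs."""
    token_ids = []
    for lang, segments in streams.items():
        lexicon = LANGUAGE_FILTERS.get(lang, {})
        for segment in segments:
            for word in segment.lower().split():
                tid = lexicon.get(word.strip(".,;:!?"))
                if tid is not None:
                    token_ids.append(tid)
    return token_ids

doc = "[[lang:en]] The bank closed early. [[lang:es]] La lluvia no paraba."
streams = split_into_streams(doc)
ids = filter_streams(streams)
```

Because every filter emits IDs from the same shared space, the outputs of the separate language streams can be merged and matched against topic token IDs without regard to the source language.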

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Marques, Joaquin M.
Primary Examiner(s): Kazimi, Hani M.
Assistant Examiner(s): Colbert, Ella
Application Number: US08/980,075
Publication Number: US 20010013047A1
Time in Patent Office: 1,952 Days
Field of Search: 707/536, 707/1, 707/5, 707/8, 707/9, 707/531, 707/501, 707/513, 707/501.1, 364/900, 704/9, 704/10
US Class Current: 1/1
CPC Class Codes:
G06F 16/353   into predefined classes
G06F 16/9535   Search customisation based ...
G06F 40/237   Lexical tools
Y10S 707/99935   Query augmenting and refini...
