System and method for determining a document language and refining the character set encoding based on the document language
Abstract
A system, method, and processor readable medium for determining a language in which a document is created. After receiving an electronic document, an appropriate character set encoding (or encodings) for the text of the electronic document is determined. The character set encoding(s) indicate a list of potential languages in which the electronic document is created. The potential languages may be identified using bit flags. The number of potential languages for which an electronic document is created may be increased or decreased according to predetermined criteria. The number of potential languages may be adjusted by comparing groups of characters (n-grams) included in the electronic document with entries in a look-up table. If n-grams are located in the look-up table, bit flags associated with the n-grams may be logically ANDed together. This process may be repeated until only a single bit flag remains. The remaining bit flag identifies the language in which the electronic document is created.
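The narrowing step described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Lang` flags, the `ENCODING_LANGS` and `NGRAM_LANGS` tables, and the function names are all hypothetical, and the table contents are toy values chosen only for illustration. The sketch also assumes, beyond what the abstract states, that an n-gram whose flags would eliminate every remaining candidate is skipped rather than applied.

```python
from enum import IntFlag


class Lang(IntFlag):
    """One bit per candidate language (hypothetical flag layout)."""
    ENGLISH = 1
    FRENCH = 2
    GERMAN = 4
    SPANISH = 8


# Hypothetical table: character set encoding -> languages it could carry.
ENCODING_LANGS = {
    "windows-1252": Lang.ENGLISH | Lang.FRENCH | Lang.GERMAN | Lang.SPANISH,
}

# Hypothetical look-up table: n-gram -> languages in which it is common.
NGRAM_LANGS = {
    "the": Lang.ENGLISH,
    "que": Lang.FRENCH | Lang.SPANISH,
    "les": Lang.FRENCH,
    "der": Lang.GERMAN,
}


def ngrams(text, n=3):
    """Yield every run of n consecutive characters in the lower-cased text."""
    text = text.lower()
    return (text[i:i + n] for i in range(len(text) - n + 1))


def narrow_languages(text, encoding):
    """AND the flags of matching n-grams into the candidate set.

    Starts from the languages implied by the encoding and stops as soon
    as a single bit flag remains.
    """
    candidates = ENCODING_LANGS.get(encoding.lower(), Lang(0))
    for gram in ngrams(text):
        flags = NGRAM_LANGS.get(gram)
        if flags is None:
            continue  # n-gram not in the look-up table
        narrowed = candidates & flags
        if not narrowed:
            continue  # assumption: ignore n-grams that would remove every candidate
        candidates = narrowed
        if bin(candidates.value).count("1") == 1:
            break  # exactly one bit flag left
    return candidates
```

With these toy tables, `narrow_languages("que les", "windows-1252")` first narrows the candidates to French or Spanish on `que`, then to French alone on `les`. A real look-up table would need far more entries and some tolerance for noise.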
40 Claims
1. A method for determining a language in which a document is created comprising the steps of:
a) receiving at least one electronic document;
b) identifying at least one character set encoding used in the at least one electronic document;
c) determining whether the at least one character set encoding identifies a language in which the electronic document is created; and
d) indicating the language in which the electronic document is created if a determination is made that the at least one character set encoding identifies the language in which the electronic document is created.
11. A system for determining a language in which a document is created comprising:
receiving means for receiving at least one electronic document;
identifying means for identifying at least one character set encoding used in the at least one electronic document;
determining means for determining whether the at least one character set encoding identifies a language in which the electronic document is created; and
indicating means for indicating the language in which the electronic document is created if a determination is made that the at least one character set encoding identifies the language in which the electronic document is created.
21. A system for determining a language in which a document is created comprising:
a receiving module that receives at least one electronic document;
an identifying module that identifies at least one character set encoding used in the at least one electronic document;
a determining module that determines whether the at least one character set encoding identifies a language in which the electronic document is created; and
an indicating module that indicates the language in which the electronic document is created if a determination is made that the at least one character set encoding identifies the language in which the electronic document is created.
31. A processor readable medium comprising processor readable code that causes a processor to determine a language in which a document is created, the processor readable medium comprising:
receiving code that causes a processor to receive at least one electronic document;
identifying code that causes a processor to identify at least one character set encoding used in the at least one electronic document;
determining code that causes a processor to determine whether the at least one character set encoding identifies a language in which the electronic document is created; and
indicating code that causes a processor to indicate the language in which the electronic document is created if a determination is made that the at least one character set encoding identifies the language in which the electronic document is created.
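The four elements shared by independent claims 1, 11, 21, and 31 (receive, identify the encoding, determine whether the encoding identifies a language, indicate the language) can be sketched in Python as follows. This is only an illustration under stated assumptions. The `SINGLE_LANGUAGE_ENCODINGS` table and the function name are hypothetical, the encoding is assumed to arrive as a declared charset label rather than being detected from the bytes, and the encoding-to-language pairs are example values, not taken from the patent.

```python
import codecs
from typing import Optional

# Hypothetical examples of encodings that are, in practice, used for one language.
_EXAMPLES = {
    "shift_jis": "Japanese",
    "euc-jp": "Japanese",
    "euc-kr": "Korean",
    "windows-1253": "Greek",
}

# Key the table by Python's canonical codec names so that aliases also match.
SINGLE_LANGUAGE_ENCODINGS = {
    codecs.lookup(label).name: language for label, language in _EXAMPLES.items()
}


def language_from_encoding(declared_charset: str) -> Optional[str]:
    """Return the language implied by the encoding, or None if it implies none.

    Step b) is reduced here to normalizing a declared charset label.
    Steps c) and d) are the table look-up and the returned value.
    """
    try:
        canonical = codecs.lookup(declared_charset).name
    except LookupError:
        return None  # unknown label: no encoding identified
    return SINGLE_LANGUAGE_ENCODINGS.get(canonical)
```

An encoding such as `windows-1252` serves many languages, so the function returns `None` for it, which corresponds to the case where the encoding alone does not identify the language.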
Specification