Methods and systems for selecting a language for text segmentation
First Claim
1. A computer-implemented method comprising:
accessing, by a computer system, a string of characters that are associated with a computing device;
identifying, by the computer system, a plurality of candidate languages for segmenting the string of characters, wherein the plurality of candidate languages are identified based on one or more language indicators associated with the string of characters or the computing device;
determining weights for the plurality of candidate languages based on the one or more language indicators, wherein each of the weights indicates a probability that a corresponding candidate language from the plurality of candidate languages is an appropriate language to use for interpreting the string of characters based on the string of characters or the computing device;
determining one or more segmented results from the string of characters for each of the plurality of candidate languages, wherein a segmented result comprises a plurality of tokens that are created by inserting one or more breaks into the string of characters;
identifying, from the plurality of candidate languages, an operable language for the string of characters based, at least in part, on a comparison of weighted frequencies associated with the candidate languages, wherein each of the weighted frequencies comprises a frequency with which the segmented results occur in a corpus associated with a corresponding candidate language, the frequency being weighted according to a corresponding weight from the determined weights that is associated with the corresponding candidate language; and
providing information that identifies the operable language.
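The steps of the claim can be sketched end to end in a few lines of Python. This is only an illustrative reading, not the patented implementation: the `CORPUS_FREQ` table and its numbers are invented, the language indicators are reduced to a list of language codes that are simply counted to produce weights, and the frequency of a segmented result is assumed to be the count of its rarest token. None of these choices are specified by the claim.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical per-language token frequencies (invented numbers, not real corpus data).
CORPUS_FREQ: Dict[str, Dict[str, int]] = {
    "en": {"pain": 900, "chat": 700, "pa": 5, "in": 5000},
    "fr": {"pain": 1200, "chat": 1500},
}


def language_weights(indicators: List[str]) -> Dict[str, float]:
    """Turn language indicators into per-language probabilities.

    Assumption: each indicator (browser setting, domain, and so on) has
    already been mapped to a language code; weights are normalized vote counts.
    """
    counts = Counter(indicators)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def segmentations(text: str, vocab: Dict[str, int]) -> List[List[str]]:
    """Return every split of text into tokens that all appear in vocab."""
    if not text:
        return [[]]
    results: List[List[str]] = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocab:
            for tail in segmentations(text[end:], vocab):
                results.append([head] + tail)
    return results


def select_operable_language(text: str, indicators: List[str]) -> Optional[str]:
    """Pick the candidate language with the highest weighted frequency."""
    best_lang: Optional[str] = None
    best_score = 0.0
    for lang, weight in language_weights(indicators).items():
        vocab = CORPUS_FREQ.get(lang, {})
        # A segmented result needs at least one inserted break, so two or more tokens.
        results = [s for s in segmentations(text, vocab) if len(s) >= 2]
        if not results:
            continue
        # Assumed scoring: a result is as frequent as its rarest token;
        # keep the best result for this language.
        freq = max(min(vocab[tok] for tok in seg) for seg in results)
        score = weight * freq
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


print(select_operable_language("painchat", ["en", "fr", "fr"]))
print(select_operable_language("painchat", ["en", "en", "en", "fr"]))
```

With these invented numbers, the same string "painchat" resolves to French when most indicators point to French and to English when most point to English, which is the effect the weighting step is meant to have.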
Abstract
Methods and systems for selecting a language for text segmentation are disclosed. In one embodiment, at least a first candidate language and a second candidate language associated with a string of characters are identified, at least a first segmented result associated with the first candidate language and a second segmented result associated with the second candidate language are determined, a first frequency of occurrence for the first segmented result and a second frequency of occurrence for the second segmented result are determined, and an operable language is identified from the first candidate language and the second candidate language based at least in part on the first frequency of occurrence and the second frequency of occurrence.
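As a rough illustration of the two-candidate comparison in the abstract, the sketch below counts how often each segmented result appears in a small corpus for its language and keeps the language with the higher count. The corpora, the example string "lemonde", and the decision to measure frequency as consecutive whitespace-token matches are all assumptions made for the example; the abstract does not say how the frequency of occurrence is obtained.

```python
from typing import List, Sequence, Tuple

# Hypothetical toy corpora; a real system would use far larger text collections.
FRENCH_CORPUS = ["le monde est grand", "tout le monde dort"]
ENGLISH_CORPUS = ["a lemon tree", "lemon de luxe soda"]


def occurrences(segmented: Sequence[str], corpus: List[str]) -> int:
    """Count positions where the tokens appear consecutively in the corpus."""
    target = list(segmented)
    n = len(target)
    count = 0
    for doc in corpus:
        words = doc.lower().split()
        for i in range(len(words) - n + 1):
            if words[i:i + n] == target:
                count += 1
    return count


Candidate = Tuple[str, Sequence[str], List[str]]  # (language, segmented result, corpus)


def pick_language(first: Candidate, second: Candidate) -> str:
    """Return the language whose segmented result is more frequent (ties go to first)."""
    first_freq = occurrences(first[1], first[2])
    second_freq = occurrences(second[1], second[2])
    return first[0] if first_freq >= second_freq else second[0]


# "lemonde" segmented under each candidate language.
choice = pick_language(
    ("fr", ["le", "monde"], FRENCH_CORPUS),
    ("en", ["lemon", "de"], ENGLISH_CORPUS),
)
print(choice)
```

Here the French segmentation occurs twice in its corpus and the English one once, so French is chosen as the operable language.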
98 Citations
18 Claims
7. A system comprising:
a computer system;
a segmentation engine of the computer system to access a string of characters that are associated with a computing device, to identify a plurality of candidate languages for segmenting the string of characters based on one or more language indicators associated with the string of characters or the computing device, and to determine weights for the plurality of candidate languages based on the one or more language indicators, wherein each of the weights indicates a probability that a corresponding candidate language from the plurality of candidate languages is an appropriate language to use for interpreting the string of characters based on the string of characters or the computing device;
a segmentation processor of the computer system to determine one or more segmented results from the string of characters for each of the plurality of candidate languages, wherein a segmented result comprises a plurality of tokens that are created by inserting one or more breaks into the string of characters; and
a language processor of the computer system to identify, from the plurality of candidate languages, an operable language for the string of characters based, at least in part, on a comparison of weighted frequencies associated with the candidate languages, wherein each of the weighted frequencies comprises a frequency with which the segmented results occur in a corpus associated with a corresponding candidate language, the frequency being weighted according to a corresponding weight from the determined weights that is associated with the corresponding candidate language.
13. A computer program product comprising a computer-readable storage device including instructions that, when executed, cause a computer system to perform operations comprising:
accessing a string of characters that are associated with a computing device;
identifying a plurality of candidate languages for segmenting the string of characters, wherein the plurality of candidate languages are identified based on one or more language indicators associated with the string of characters or the computing device;
determining weights for the plurality of candidate languages based on the one or more language indicators, wherein each of the weights indicates a probability that a corresponding candidate language from the plurality of candidate languages is an appropriate language to use for interpreting the string of characters based on the string of characters or the computing device;
determining one or more segmented results from the string of characters for each of the plurality of candidate languages, wherein a segmented result comprises a plurality of tokens that are created by inserting one or more breaks into the string of characters;
identifying, from the plurality of candidate languages, an operable language for the string of characters based, at least in part, on a comparison of weighted frequencies associated with the candidate languages, wherein each of the weighted frequencies comprises a frequency with which the segmented results occur in a corpus associated with a corresponding candidate language, the frequency being weighted according to a corresponding weight from the determined weights that is associated with the corresponding candidate language; and
providing information that identifies the operable language.
Specification