MINING BILINGUAL DICTIONARIES FROM MONOLINGUAL WEB PAGES
First Claim
1. A method for identifying translation pairs from web pages, the method comprising:
receiving monolingual web page data of a source language;
processing the web page data by:
detecting the occurrence of a predefined pattern in the web page data;
extracting a plurality of translation pair candidates, each of the translation pair candidates including a source language string and target language string;
determining whether each translation pair candidate is a valid transliteration;
for each translation pair that is determined not to be a valid transliteration, determining whether each translation pair candidate is a valid translation; and
adding each translation pair that is determined to be a valid translation or transliteration to a dictionary.
Abstract
Systems and methods for identifying translation pairs from web pages are provided. One disclosed method includes receiving monolingual web page data of a source language, and processing the web page data by detecting the occurrence of a predefined pattern in the web page data, and extracting a plurality of translation pair candidates. Each of the translation pair candidates may include a source language string and target language string. The method may further include determining whether each translation pair candidate is a valid transliteration. The method may also include, for each translation pair that is determined not to be a valid transliteration, determining whether each translation pair candidate is a valid translation. The method may further include adding each translation pair that is determined to be a valid translation or transliteration to a dictionary.
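The flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the regular expression assumes one possible predefined pattern (a run of Chinese characters followed by a Latin-script gloss in parentheses), and the names `PATTERN`, `extract_candidates`, and `mine_dictionary` are hypothetical. The two validity checks are passed in as functions because the abstract does not say how they are computed.

```python
import re
from typing import Callable, Dict, List, Tuple

# ASSUMPTION: the "predefined pattern" is a run of CJK characters immediately
# followed by a Latin-script gloss in ASCII or full-width parentheses.
PATTERN = re.compile(
    r"([\u4e00-\u9fff]+)\s*[（(]\s*([A-Za-z][A-Za-z .'-]*?)\s*[）)]"
)

Validator = Callable[[str, str], bool]


def extract_candidates(text: str) -> List[Tuple[str, str]]:
    """Return (source string, target string) pairs for every pattern match."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]


def mine_dictionary(
    text: str,
    is_transliteration: Validator,
    is_translation: Validator,
) -> Dict[str, str]:
    """Keep each candidate that passes either validity check."""
    dictionary: Dict[str, str] = {}
    for source, target in extract_candidates(text):
        # `or` short-circuits, so the translation check runs only for
        # candidates that were not accepted as transliterations.
        if is_transliteration(source, target) or is_translation(source, target):
            dictionary[source] = target
    return dictionary
```

Note that the greedy character class takes the whole preceding run of Chinese characters as the source string, so a real system would still need to decide where the source term actually begins.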
20 Claims
1. A method for identifying translation pairs from web pages, the method comprising:
receiving monolingual web page data of a source language;
processing the web page data by:
detecting the occurrence of a predefined pattern in the web page data;
extracting a plurality of translation pair candidates, each of the translation pair candidates including a source language string and target language string;
determining whether each translation pair candidate is a valid transliteration;
for each translation pair that is determined not to be a valid transliteration, determining whether each translation pair candidate is a valid translation; and
adding each translation pair that is determined to be a valid translation or transliteration to a dictionary.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10
11. A system for identifying translation pairs from web pages, the system comprising a computer program configured to be executed on a computing device, the computer program including:
a preprocessing module configured to detect an occurrence of a predefined pattern in monolingual web page data of a source language, and to extract a plurality of translation pair candidates, each of the translation pair candidates including a source language string and target language string; and
a transliteration module configured to process the plurality of translation pair candidates to determine whether each translation pair candidate is a valid transliteration; and
a translation module to process each translation pair candidate that is determined not to be a valid transliteration, to determine whether each translation pair candidate is a valid translation;
wherein the computer program is configured to add each translation pair candidate that is determined to be a valid transliteration or a valid translation to a dictionary.
Dependent claims: 12, 13, 14, 15, 16, 17, 18, 19
20. A method for identifying translation pairs from web pages, the method comprising:
receiving monolingual web page data of a source language;
processing the web page data to produce one or more translation pair candidates, each translation pair candidate including a source language string and a target language string;
determining whether one or more of the translation pair candidates is a valid transliteration, at least in part by applying an alignment model to determine a probability the source language string matches the target language string in the translation pair candidate, the alignment model being based on a determined probability that each of one or more source language phonemes in the source language string is a transliteration of a corresponding target language phoneme in the target language string; and
determining whether one or more of the translation pair candidate is a valid semantic translation at least in part by, ranking each of the translation pair candidates based on a plurality of predefined ranking factors, and determining whether a rank of each translation pair candidate is above a predetermined threshold.
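The two tests recited in claim 20 could be sketched as below. This is only an illustration under simplifying assumptions that the claim does not state: phonemes are aligned one-to-one in order, the string-level probability is the plain product of per-phoneme probabilities, and the rank is a weighted sum of factor values. The function names, the probability table, and the factor names used in the usage note are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]


def transliteration_probability(
    src_phonemes: Sequence[str],
    tgt_phonemes: Sequence[str],
    phoneme_prob: Dict[Pair, float],
) -> float:
    """Probability that the source string transliterates the target string.

    ASSUMPTION: a monotone one-to-one alignment; unseen phoneme pairs get 0.
    """
    if not src_phonemes or len(src_phonemes) != len(tgt_phonemes):
        return 0.0
    return math.prod(
        phoneme_prob.get((s, t), 0.0) for s, t in zip(src_phonemes, tgt_phonemes)
    )


def valid_translations(
    candidates: List[Tuple[Pair, Dict[str, float]]],
    weights: Dict[str, float],
    threshold: float,
) -> List[Pair]:
    """Score each candidate from its ranking factors and keep those whose
    score is above the threshold, best first.

    ASSUMPTION: the rank is a weighted sum; factors without a weight count 0.
    """
    scored = [
        (sum(weights.get(name, 0.0) * value for name, value in factors.items()), pair)
        for pair, factors in candidates
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for score, pair in scored if score > threshold]
```

For example, with hypothetical factors `frequency` and `length_ratio` weighted 0.5 each and a threshold of 0.5, a candidate scored 0.9 and 0.8 on those factors is kept, while one scored 0.1 and 0.2 is dropped.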
Specification