Adaptive pattern learning for bilingual data mining
Abstract
Embodiments for the adaptive learning of translation layout patterns to mine bilingual data are disclosed. In accordance with at least one embodiment, the adaptive learning of patterns to mine bilingual data includes processing a bilingual web page into a Document Object Model (DOM) tree. The embodiment further includes linking bilingual snippet pairs of each node into a plurality of translation snippet pairs. The embodiment also includes determining one or more best fit candidate patterns based on the plurality of translation snippet pairs via a Support Vector Machine (SVM) classifier. The embodiment additionally includes mining one or more translation pairs from the bilingual web page using the one or more best fit candidate patterns. The mined translation pairs are then stored in data storage. The one or more translation pairs include at least one of a term pair, a phrase pair, or a sentence pair.
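The pattern-learning and mining steps summarized above can be illustrated with a minimal sketch. It is not the disclosed implementation: the helper names (`candidate_pattern`, `best_fit_pattern`, `mine_pairs`) are hypothetical, a simple frequency count stands in for the SVM classifier, and the character classes assume a non-ASCII source language paired with an ASCII target language.

```python
import re
from collections import Counter


def candidate_pattern(line, src, tgt):
    """Turn a line containing a known seed pair into a layout pattern.

    Returns None when the pair is absent or the target precedes the source.
    """
    i = line.find(src)
    j = line.find(tgt, i + len(src)) if i != -1 else -1
    if i == -1 or j == -1:
        return None
    return line[:i] + "{src}" + line[i + len(src):j] + "{tgt}" + line[j + len(tgt):]


def best_fit_pattern(lines, seeds):
    """Pick the most frequent candidate pattern (a stand-in for SVM scoring)."""
    counts = Counter()
    for line in lines:
        for src, tgt in seeds:
            pattern = candidate_pattern(line, src, tgt)
            if pattern is not None:
                counts[pattern] += 1
    return counts.most_common(1)[0][0] if counts else None


def mine_pairs(lines, pattern):
    """Apply a learned layout pattern to every line and collect the matches."""
    # Escape the literal layout, then swap the placeholders for capture groups.
    regex = re.escape(pattern)
    regex = regex.replace(re.escape("{src}"), r"(?P<src>[^\x00-\x7f]+)")  # assumed: non-ASCII source
    regex = regex.replace(re.escape("{tgt}"), r"(?P<tgt>[A-Za-z][A-Za-z ]*)")  # assumed: ASCII target
    compiled = re.compile(regex)
    mined = []
    for line in lines:
        match = compiled.fullmatch(line)
        if match:
            mined.append((match.group("src"), match.group("tgt")))
    return mined


lines = ["你好 (hello)", "世界 (world)", "电脑 (computer)", "About us"]
seeds = [("你好", "hello"), ("世界", "world")]
pattern = best_fit_pattern(lines, seeds)
pairs = mine_pairs(lines, pattern)
```

In this toy input, two seed pairs are enough to recover the layout "source (target)", which then also captures a third pair that was not among the seeds.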
20 Claims
1. A system, comprising:
one or more processors; and
memory that includes a plurality of computer-executable components executable by the one or more processors, the plurality of computer-executable components comprising:
a pre-processing component to process a bilingual web page into a Document Object Model (DOM) tree that includes at least one node;
a seed mining component to link bilingual snippet pairs of the at least one node into a plurality of translation snippet pairs;
a pattern learning component to determine one or more best fit candidate patterns based on the plurality of translation snippet pairs via a Support Vector Machine (SVM) classifier;
a data mining component to mine one or more translation pairs from the bilingual web page using the one or more best fit candidate patterns; and
a data storage component to store the one or more translation pairs, wherein the one or more translation pairs include at least one of a term pair, a phrase pair, or a sentence pair.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
13. A method, comprising:
processing a bilingual web page into a Document Object Model (DOM) tree that includes at least one node;
linking bilingual snippet pairs of the at least one node into a plurality of translation snippet pairs using an alignment model, the alignment model including a bilingual dictionary or a transliteration model;
determining one or more best fit candidate patterns based on the plurality of translation snippet pairs via a Support Vector Machine (SVM) classifier;
mining one or more translation pairs from the bilingual web page using the one or more best fit candidate patterns; and
storing the one or more translation pairs into the alignment model, wherein the one or more translation pairs include at least one of a term pair, a phrase pair, or a sentence pair.
Dependent claims: 14, 15, 16, 17, 18, 19
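The linking step of claim 13 can be sketched as follows, assuming the alignment model is only a small bilingual dictionary (the transliteration alternative is not modeled). The function names, the coverage-based score, and the 0.5 threshold are illustrative assumptions, not details taken from the claims.

```python
def alignment_score(src, tgt, dictionary):
    """Fraction of dictionary entries found in `src` whose translation appears in `tgt`."""
    tgt_words = set(tgt.lower().split())
    entries = [(s, t) for s, t in dictionary.items() if s in src]
    if not entries:
        return 0.0
    hits = sum(1 for _, t in entries if t in tgt_words)
    return hits / len(entries)


def link_seed_pairs(snippets, dictionary, threshold=0.5):
    """Link adjacent snippets of one node whose alignment score meets the threshold."""
    pairs = []
    for src, tgt in zip(snippets, snippets[1:]):
        if alignment_score(src, tgt, dictionary) >= threshold:
            pairs.append((src, tgt))
    return pairs


# Hypothetical dictionary and node content, for illustration only.
dictionary = {"你好": "hello", "世界": "world", "电脑": "computer"}
snippets = ["你好世界", "hello world", "电脑", "keyboard"]
seed_pairs = link_seed_pairs(snippets, dictionary)
```

Storing mined pairs back into the alignment model, as the final step of the claim recites, would correspond here to adding new entries to `dictionary` before linking again.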
20. A computer readable storage device storing computer-executable instructions that, when executed, cause one or more processors to perform acts comprising:
processing a bilingual web page into a Document Object Model (DOM) tree that includes at least one attribute node and at least one content node that is a child of the at least one attribute node;
removing the at least one attribute node from the DOM tree and linking the at least one content node to a root node of the DOM tree;
linking bilingual snippet pairs of the at least one content node into a plurality of translation snippet pairs using an alignment model;
determining one or more best fit candidate patterns based on the plurality of translation snippet pairs via a Support Vector Machine (SVM) classifier;
mining one or more translation pairs from the bilingual web page using the one or more best fit candidate patterns;
storing the one or more translation pairs into the alignment model, wherein the one or more translation pairs include at least one of a term pair, a phrase pair, or a sentence pair; and
re-linking the bilingual snippet pairs of the at least one node into the plurality of translation snippet pairs using the alignment model that includes the one or more mined translation pairs.
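The tree-flattening step of claim 20 can be sketched with Python's standard `html.parser` module. The sketch assumes that an "attribute node" corresponds to an HTML tag and a "content node" to the text it encloses; the claim text shown here does not define either term. The class and function names are hypothetical, and skipping `<script>` and `<style>` content is an added assumption.

```python
from html.parser import HTMLParser


class FlatTextCollector(HTMLParser):
    """Collect text (content) nodes in document order and discard tag nodes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.content_nodes = []
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.content_nodes.append(text)


def flatten_page(html):
    """Return a one-level tree: a root whose children are the page's text nodes."""
    collector = FlatTextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return {"root": collector.content_nodes}


page = (
    "<html><body><ul><li><b>你好</b> (hello)</li>"
    "<li>世界 <i>world</i></li></ul>"
    "<script>var x=1;</script></body></html>"
)
tree = flatten_page(page)
```

With every content node attached directly to the root, neighbouring snippets such as "你好" and "(hello)" become siblings, which is the shape the linking step operates on.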
Specification