
Automatic acquisition of a parallel corpus from a network

  • US 20080168049A1
  • Filed: 01/08/2007
  • Published: 07/10/2008
  • Est. Priority Date: 01/08/2007
  • Status: Abandoned Application
First Claim

1. A method comprising:

  • identifying network pages based on whether the pages include image alternative text that indicates that the network pages contain links to pages that are translations of each other;

  • retrieving a plurality of pages and a plurality of respective uniform resource locators from a server associated with the domain name of the identified network pages;

  • using the uniform resource locators to identify a set of candidate parallel page pairs;

  • creating a set of features for each candidate parallel page pair; and

  • using the sets of features to identify parallel page pairs, wherein the pages in a parallel page pair are translations of each other.
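
As an illustration only, the first and third steps of the claim might be sketched as follows. This is not the applicant's implementation: the language-marker list, the rule that an image must sit inside a link, the URL-normalization rule, and every function and class name here (has_translation_link_alt_text, url_key, candidate_pairs) are assumptions made for the example. The feature-creation and classification steps are left out because the claim does not say which features are used.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser
from itertools import combinations

# Hypothetical language markers; the claim does not enumerate any.
LANG_MARKERS = {"en", "english", "fr", "french", "de", "german", "zh", "chinese"}


class _AltTextCollector(HTMLParser):
    """Collects the alt text of <img> tags that appear inside <a> links."""

    def __init__(self):
        super().__init__()
        self._link_depth = 0
        self.alts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1
        elif tag == "img" and self._link_depth > 0:
            alt = dict(attrs).get("alt")
            if alt:
                self.alts.append(alt.strip().lower())

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth > 0:
            self._link_depth -= 1


def has_translation_link_alt_text(html):
    """True if a linked image's alt text names a language (assumed heuristic)."""
    collector = _AltTextCollector()
    collector.feed(html)
    return any(alt in LANG_MARKERS for alt in collector.alts)


# Split a URL into alphabetic runs and the separators between them.
_TOKEN = re.compile(r"[A-Za-z]+|[^A-Za-z]+")


def url_key(url):
    """Replace every language-marker token in the URL with a placeholder."""
    tokens = _TOKEN.findall(url)
    return "".join("*" if t.lower() in LANG_MARKERS else t for t in tokens)


def candidate_pairs(urls):
    """Pair URLs that differ only in their language-marker tokens."""
    groups = defaultdict(list)
    for url in urls:
        key = url_key(url)
        if key != url:  # keep only URLs that contain a language marker
            groups[key].append(url)
    pairs = []
    for group in groups.values():
        pairs.extend(combinations(sorted(set(group)), 2))
    return sorted(pairs)


if __name__ == "__main__":
    page = '<a href="/fr/"><img src="fr.gif" alt="French"></a>'
    print(has_translation_link_alt_text(page))
    print(candidate_pairs([
        "http://example.com/en/about.html",
        "http://example.com/fr/about.html",
        "http://example.com/contact.html",
    ]))
```

In this sketch, URLs such as http://example.com/en/about.html and http://example.com/fr/about.html reduce to the same key and so form one candidate pair, while a URL with no language marker is ignored. The resulting candidates would still need the feature-based check described in the last two steps of the claim before being treated as translations.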
