Weakly supervised part-of-speech tagging with coupled token and type constraints

US 9,311,299 B1
Filed: 07/31/2013
Issued: 04/12/2016
Est. Priority Date: 07/31/2013
Status: Active Grant

Abstract

A method and system are provided for a part-of-speech tagger that may be particularly useful for resource-poor languages. Tag dictionaries manually constructed from dictionaries via bitext can be used as type constraints to overcome the scarcity of annotated data in some instances. Additional token constraints can be projected from a resource-rich source language via word-aligned bitext. Several example models are provided to demonstrate this, such as a partially observed conditional random field model in which coupled token and type constraints may provide a partial signal for training. The disclosed method achieves a significant relative error reduction over the prior state of the art.
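The "partial signal" mentioned in the abstract can be illustrated with the usual form of a partially observed conditional random field objective. This is a sketch under assumed notation, not a formula taken from this record: x is a sentence, y a tag sequence, θ the model weights, Φ a feature function, Y(x) the set of all tag sequences for x, Ŷ(x) the subset of tag sequences permitted by the coupled token and type constraints, and γ a regularization strength. The specification may define these differently.

```latex
% Conditional probability of a tag sequence y given a sentence x (assumed notation).
p_\theta(y \mid x) =
  \frac{\exp\{\theta^\top \Phi(x, y)\}}
       {\sum_{y' \in \mathcal{Y}(x)} \exp\{\theta^\top \Phi(x, y')\}}

% Training maximizes the probability mass placed on the constrained set
% \widehat{\mathcal{Y}}(x^{(i)}) rather than on a single gold tag sequence.
\mathcal{L}(\theta) =
  \sum_{i=1}^{n} \log \sum_{y \in \widehat{\mathcal{Y}}(x^{(i)})} p_\theta(y \mid x^{(i)})
  - \gamma \lVert \theta \rVert_2^2
```

When the constrained set contains exactly one sequence, this reduces to ordinary supervised training; when it contains every sequence, the first term is zero for that sentence and contributes no signal.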

20 Claims

1. A computer-implemented method comprising:
- obtaining a word in a first language;
  
  selecting a first, token-level set of one or more parts-of-speech tags to associate with the word in the first language, comprising:

  identifying a translation of the word in a second language, and

  selecting, as the first, token-level set of one or more parts-of-speech tags to associate with the word in the first language, a set of one or more parts-of-speech tags that are associated with the translation of the word in the second language;

  selecting a second, token-level set of one or more parts-of-speech tags to associate with the word in the first language, comprising:

  when the word in the first language has no associated part-of-speech tag indicated for the word in the first language in a tag dictionary, selecting, as the second, token-level set of the one or more parts of speech tags, all of one or more of the parts-of-speech tags that (i) are in the first, token-level set of one or more parts-of-speech tags, and (ii) are associated as parts-of-speech tags with words in the tag dictionary, or

  when the word in the first language has one or more associated parts-of-speech tags indicated for the word in the first language in the tag dictionary, selecting, as the second, token-level set of the one or more parts-of-speech-tags, the one or more parts-of-speech tags that (I) are in the first, token-level set of one or more parts-of-speech tags, and (II) are indicated in the tag dictionary as associated with the word in the first language; and
  
  providing the word and the second, token-level set of the one or more parts-of-speech tags as training data for training a machine-based part-of-speech tagger.
  - 2. The computer-implemented method of claim 1, wherein selecting a second, token-level set of one or more parts-of-speech tags to associate with the word in the first language, further comprises:
    - when the word in the first language has one or more associated parts-of-speech tags indicated for the word in the first language in the tag dictionary, removing, from the second, token-level set, parts-of-speech tags of the first, token-level set that are not in the tag dictionary, and

      wherein providing the word and the second, token-level set of the one or more parts-of-speech tags as training data for training a machine-based part-of-speech tagger comprises:

      providing the first, token-level set of one or more parts-of-speech tags as the training data for training the machine-based part-of-speech tagger.
  - 3. The computer-implemented method of claim 1, further comprising:
    - generating a bidirectional word alignment based on the first, token-level set and the second, token-level set; and
      
      determining a projection coupled to a parts-of-speech tag in the first, token-level set based on the bidirectional word alignment.
  - 4. The computer-implemented method of claim 3, further comprising:
    - removing a parts-of-speech tag of the first, token-level set that is not coupled to the projection.
  - 5. The computer-implemented method of claim 3, further comprising:
    - removing all parts-of-speech tags of the first, token-level set other than the parts-of-speech tag coupled to the projection.
  - 6. The computer-implemented method of claim 1, wherein the first language is a resource-poor language and the second language is a resource-rich language.
  - 7. The computer-implemented method of claim 1, wherein each tag in the first, token-level set is a tag indicating a particular use context of the word in the first language.
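
The two branches of claim 1 can be read as a set intersection whose right-hand side depends on whether the word has a tag dictionary entry. The following is a minimal sketch of that reading, not the patented implementation: the function name `select_training_tags`, the dictionary layout (word mapped to a set of tags), and the sample Swedish entries are all assumptions made for illustration.

```python
from typing import Dict, Set


def select_training_tags(
    word: str,
    projected_tags: Set[str],
    tag_dictionary: Dict[str, Set[str]],
) -> Set[str]:
    """Return the second, token-level tag set for one word token.

    projected_tags is the first, token-level set (tags associated with the
    word's translation). tag_dictionary is a hypothetical type-level
    dictionary mapping words to their allowed tags.
    """
    dictionary_tags = tag_dictionary.get(word)
    if not dictionary_tags:
        # Word has no dictionary entry: keep the projected tags that are
        # used as tags for any word in the dictionary.
        known_tags: Set[str] = set().union(*tag_dictionary.values())
        return set(projected_tags) & known_tags
    # Word has an entry: keep the projected tags that the dictionary
    # lists for this particular word.
    return set(projected_tags) & set(dictionary_tags)


# Hypothetical two-entry tag dictionary.
example_dictionary = {"hund": {"NOUN"}, "springa": {"VERB", "NOUN"}}

in_dictionary = select_training_tags("springa", {"VERB"}, example_dictionary)
not_in_dictionary = select_training_tags("katt", {"NOUN", "X"}, example_dictionary)
```

The claim language does not say what should happen when the intersection is empty, so this sketch simply returns an empty set in that case.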

8. A non-transitory computer-readable storage medium encoded with a computer program, the computer program comprising instructions that, upon execution by a computer, cause the computer to perform operations comprising:
- obtaining a word in a first language;
  
  selecting a first, token-level set of one or more parts-of-speech tags to associate with the word in the first language, comprising:

  identifying a translation of the word in a second language, and

  selecting, as the first, token-level set of one or more parts-of-speech tags to associate with the word in the first language, a set of one or more parts-of-speech tags that are associated with the translation of the word in the second language;

  selecting a second, token-level set of one or more parts-of-speech tags to associate with the word in the first language, comprising:

  when the word in the first language has no associated part-of-speech tag indicated for the word in the first language in a tag dictionary, selecting, as the second, token-level set of the one or more parts of speech tags, all of one or more of the parts-of-speech tags that (i) are in the first, token-level set of one or more parts-of-speech tags, and (ii) are associated as parts-of-speech tags with words in the tag dictionary, or

  when the word in the first language has one or more associated parts-of-speech tags indicated for the word in the first language in the tag dictionary, selecting, as the second, token-level set of the one or more parts-of-speech-tags, the one or more parts-of-speech tags that (I) are in the first, token-level set of one or more parts-of-speech tags, and (II) are indicated in the tag dictionary as associated with the word in the first language; and
  
  providing the word and the second, token-level set of the one or more parts-of-speech tags as training data for training a machine-based part-of-speech tagger.
  - 9. The non-transitory computer-readable storage medium of claim 8, wherein selecting a second, token-level set of one or more parts-of-speech tags to associate with the word in the first language, further comprises:
    - when the word in the first language has one or more associated parts-of-speech tags indicated for the word in the first language in the tag dictionary, removing, from the second, token-level set, parts-of-speech tags of the first, token-level set that are not in the tag dictionary, and

      wherein providing the word and the second, token-level set of the one or more parts-of-speech tags as training data for training a machine-based part-of-speech tagger comprises:

      providing the first, token-level set of one or more parts-of-speech tags as the training data for training the machine-based part-of-speech tagger.
  - 10. The non-transitory computer-readable storage medium of claim 8, wherein the operations further comprise:
    - generating a bidirectional word alignment based on the first, token-level set and the second, token-level set; and
      
      determining a projection coupled to a parts-of-speech tag in the first, token-level set based on the bidirectional word alignment.
  - 11. The non-transitory computer-readable storage medium of claim 10, wherein the operations further comprise:
    - removing a parts-of-speech tag of the first, token-level set that is not coupled to the projection.
  - 12. The non-transitory computer-readable storage medium of claim 10, wherein the operations further comprise:
    - removing all parts-of-speech tags of the first, token-level set other than the parts-of-speech tag coupled to the projection.
  - 13. The non-transitory computer-readable storage medium of claim 8, wherein the first language is a resource-poor language and the second language is a resource-rich language.
  - 14. The non-transitory computer-readable storage medium of claim 8, wherein each tag in the first, token-level set is a tag indicating a particular use context of the word in the first language.

15. A system comprising:
- one or more processors and one or more computer storage media storing instructions that are operable, when executed by the one or more processors, to cause the one or more processors to perform operations comprising:

  obtaining a word in a first language;

  selecting a first, token-level set of one or more parts-of-speech tags to associate with the word in the first language, comprising:

  identifying a translation of the word in a second language, and

  selecting, as the first, token-level set of one or more parts-of-speech tags to associate with the word in the first language, a set of one or more parts-of-speech tags that are associated with the translation of the word in the second language;

  selecting a second, token-level set of one or more parts-of-speech tags to associate with the word in the first language, comprising:

  when the word in the first language has no associated part-of-speech tag indicated for the word in the first language in a tag dictionary, selecting, as the second, token-level set of the one or more parts of speech tags, all of one or more of the parts-of-speech tags that (i) are in the first, token-level set of one or more parts-of-speech tags, and (ii) are associated as parts-of-speech tags with words in the tag dictionary, or

  when the word in the first language has one or more associated parts-of-speech tags indicated for the word in the first language in the tag dictionary, selecting, as the second, token-level set of the one or more parts-of-speech-tags, the one or more parts-of-speech tags that (I) are in the first, token-level set of one or more parts-of-speech tags, and (II) are indicated in the tag dictionary as associated with the word in the first language; and
  
  providing the word and the second, token-level set of the one or more parts-of-speech tags as training data for training a machine-based part-of-speech tagger.
  - 16. The system of claim 15, wherein selecting a second, token-level set of one or more parts-of-speech tags to associate with the word in the first language, further comprises:
    - when the word in the first language has one or more associated parts-of-speech tags indicated for the word in the first language in the tag dictionary, removing, from the second, token-level set, parts-of-speech tags of the first, token-level set that are not in the tag dictionary, and

      wherein providing the word and the second, token-level set of the one or more parts-of-speech tags as training data for training a machine-based part-of-speech tagger comprises:

      providing the first, token-level set of one or more parts-of-speech tags as the training data for training the machine-based part-of-speech tagger.
  - 17. The system of claim 15, wherein the operations further comprise:
    - generating a bidirectional word alignment based on the first, token-level set and the second, token-level set; and
      
      determining a projection coupled to a parts-of-speech tag in the first, token-level set based on the bidirectional word alignment.
  - 18. The system of claim 17, wherein the operations further comprise:
    - removing a parts-of-speech tag of the first, token-level set that is not coupled to the projection.
  - 19. The system of claim 17, wherein the operations further comprise:
    - removing all parts-of-speech tags of the first, token-level set other than the parts-of-speech tag coupled to the projection.
  - 20. The system of claim 15, wherein the first language is a resource-poor language and the second language is a resource-rich language, and wherein each tag in the first, token-level set is a tag indicating a particular use context of the word in the first language.
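
The dependent claims on bidirectional word alignment (claims 3 to 5, 10 to 12, and 17 to 19) can be pictured with a small sketch. The claims do not say how the alignment is produced or combined, so the code below assumes one common approach: intersect a source-to-target alignment with a target-to-source alignment, project source tags across the surviving links, and prune each target token's tag set to the projected tag. All function names and the toy data are hypothetical.

```python
from typing import List, Optional, Set, Tuple


def intersect_alignments(
    src_to_tgt: Set[Tuple[int, int]],
    tgt_to_src: Set[Tuple[int, int]],
) -> Set[Tuple[int, int]]:
    """Keep only (source, target) links found in both alignment directions."""
    return {(i, j) for (i, j) in src_to_tgt if (j, i) in tgt_to_src}


def project_tags(
    links: Set[Tuple[int, int]],
    source_tags: List[str],
    target_length: int,
) -> List[Optional[str]]:
    """Copy each source tag to its aligned target position; None if unaligned.

    Assumes the intersected alignment is one-to-one, so no target position
    receives more than one tag.
    """
    projected: List[Optional[str]] = [None] * target_length
    for src_idx, tgt_idx in links:
        projected[tgt_idx] = source_tags[src_idx]
    return projected


def prune_to_projection(
    token_tag_sets: List[Set[str]],
    projected: List[Optional[str]],
) -> List[Set[str]]:
    """Reduce a tag set to the projected tag when that tag is in the set.

    Leaving the set untouched when there is no usable projection is an
    assumption of this sketch, not something the claims specify.
    """
    pruned = []
    for tags, tag in zip(token_tag_sets, projected):
        pruned.append({tag} if tag is not None and tag in tags else set(tags))
    return pruned


# Toy example: three source tokens aligned to three target tokens.
links = intersect_alignments({(0, 0), (1, 1), (2, 1)}, {(0, 0), (1, 1)})
projected = project_tags(links, ["DET", "NOUN", "VERB"], target_length=3)
pruned = prune_to_projection(
    [{"DET", "PRON"}, {"NOUN", "VERB"}, {"VERB", "NOUN"}], projected
)
```

In the toy example the third target token has no surviving alignment link, so its tag set is left as it was.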

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Petrov, Slav; Das, Dipanjan; McDonald, Ryan; Nivre, Joakim; Tackstrom, Oscar
Primary Examiner(s): Spooner, Lamont
Assistant Examiner(s): Ogunbiyi, Oluwadamilola M
Application Number: US13/955,491
Time in Patent Office: 986 Days
Field of Search: 704/9
US Class Current: 1/1
CPC Class Codes:
- G06F 40/211   Syntactic parsing, e.g. bas...
- G06F 40/30   Semantic analysis
- G06F 40/40   Processing or translation o...
- G06F 40/45   Example-based machine trans...
