Finite state data structures with paths representing paired strings of tags and tag combinations
First Claim
1. An article of manufacture for use in tagging text in a language, comprising:
- a storage medium;
machine-accessible tagging data stored by the storage medium;
the tagging data including a finite state data structure for use in mapping strings of tag combinations to strings of tags for tokens in the language, each tag combination being a combination of tags for tokens in the language;
the finite state data structure including paths that represent pairs of strings with a first string that is a string of tag combinations and a second string that is a string of tags;
the second strings of at least one set of one or more paths with the same first string including only highly probable strings of tags for the first string;
paths that represent pairs of strings being machine accessible using the first string to obtain the second string.
9 Assignments
0 Petitions
Accused Products
Abstract
A finite state data structure includes paths that represent pairs of strings, with a first string that is a string of tag combinations and a second string that is a string of tags for tokens in a language. The second strings of a set of paths with the same first string include only highly probable strings of tags for the first string. The data structure can be an FST or a bimachine, and can be used for mapping strings of tag combinations to strings of tags. The tags can, for example, indicate parts of speech of words, and the tag combinations can be ambiguity classes or, in a bimachine, reduced ambiguity classes. An FST can be obtained by approximating a Hidden Markov Model. A bimachine can include left-to-right and right-to-left sequential FSTs obtained based on frequencies of tokens in a training corpus.
-
Citations
20 Claims
-
1. An article of manufacture for use in tagging text in a language, comprising:
-
a storage medium;
machine-accessible tagging data stored by the storage medium;
the tagging data including a finite state data structure for use in mapping strings of tag combinations to strings of tags for tokens in the language, each tag combination being a combination of tags for tokens in the language;
the finite state data structure including paths that represent pairs of strings with a first string that is a string of tag combinations and a second string that is a string of tags;
the second strings of at least one set of one or more paths with the same first string including only highly probable strings of tags for the first string;
paths that represent pairs of strings being machine accessible using the first string to obtain the second string.- View Dependent Claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17)
(A) obtaining a first set that includes possible tags for tokens in the language and a second set that includes tag combinations, each tag combination in the second set including one or more tags in the first set; and
(B) using the first and second sets to produce the finite state transducer.
-
-
4. The article of claim 3 in which (B) includes:
-
(B1) creating a third set that includes an initial state and, for each pair of a tag combination in the second set with a tag in the first set, a state labelled with the pair; and
(B2) for each state in the third set and for each tag combination in the second set, determining a destination state in the third set and creating an outgoing arc from the state to the destination state in the third set, labelled with the pair label of the destination state;
the destination state being the state whose pair label is the most probable for the tag combination according to a Hidden Markov Model (HMM) that maps sequences of ambiguity classes to tags for the language.
-
-
5. The article of claim 4 in which (B2) includes:
-
if the state is the initial state, determining the pair label that is most probable using
where π
is initial probability according to the HMM, and b is class probability for the tag combination of the pair label according to the HMM.
-
-
6. The article of claim 4 in which (B2) includes:
-
if the state is not the initial state, determining the pair label that is most probable using
where a is transition probability according to the HMM, and b is class probability for the tag combination of the pair label according to the HMM.
-
-
7. The article of claim 3 in which (B) includes:
(B3) minimizing and determinizing the finite state transducer.
-
8. The article of claim 3 in which (B) includes:
-
(B4) using the first set and the second set to obtain possible preceding subsequences up to a given look-back length and possible following subsequences up to a given look-ahead length and to obtain a third set that includes, for every tag combination, every sequence in which the tag combination is preceded by one of the possible preceding subsequences and followed by one of the possible following subsequences;
each preceding and following subsequence including one or more elements, each element including a tag combination in the second set or a tag in the first set;
(B5) modifying each sequence in the third set to include a combination-tag pair that includes the sequence'"'"'s tag combination and that includes a tag that is the most probable for the tag combination to be preceded by the sequence'"'"'s preceding subsequence and to be followed by the sequence'"'"'s following subsequence according to a Hidden Markov Model (HMM) that maps sequences of ambiguity classes to tags for the language;
(B6) combining the modified sequences in the third set to produce a preliminary finite state transducer that accepts any of the modified sequences; and
(B7) obtaining one or more constraint finite state machines representing constraints and composing the constraint finite state machines with the preliminary finite state transducer to produce the finite state transducer;
the constraints including constraints for tags, combinations and sentence boundaries and ensuring a correct concatenation of sequences.
-
-
9. The article of claim 2 in which the finite state transducer is produced by:
-
(C) obtaining a first partial set that includes tag combination sequences ending with a tag combination with only one tag, the first partial set not including all possible tag combination sequences that end with a tag combination with only one tag;
(D) using the first partial set to obtain a second partial set that includes a combination-tag pair sequence for each tag combination sequence in the first partial set, each combination-tag pair in each sequence in the second partial set including a tag combination and a tag for the tag combination;
the tags for a tag combination sequence in the first partial set being obtained using a Hidden Markov Model (HMM) that maps sequences of ambiguity classes to tags for the language; and
(E) using the second partial set to obtain a finite state transducer that maps tag combinations to tags and that approximates the HMM.
-
-
10. The article of claim 9 in which (C) includes:
-
obtaining a first set that includes possible tags for tokens in the language and a second set that includes tag combinations, each tag combination in the second set including one or more tags in the first set; and
using the first and second sets to obtain all possible combination-tag pair sequences up to a sequence length.
-
-
11. The article of claim 9 in which (C) includes:
-
annotating an untagged text corpus with tag combination labels; and
extracting tag combination sequences from the annotated text corpus.
-
-
12. The article of claim 9 in which (E) includes:
-
obtaining a preexisting finite state transducer that approximates the HMM;
using the preexisting finite state transducer to obtain a complete set that includes combination-tag pair sequences, the complete set including a combination-tag pair sequence for each tag combination sequence that can be mapped by the preexisting finite state transducer, combining the complete set and the second partial set of combination-tag pair sequences to obtain a combined set of combination-tag pair sequences; and
using the combined set of combination-tag pair sequences to produce the finite state transducer.
-
-
13. The article of claim 12 in which the preexisting finite state transducer has arcs labelled with combination-tag pairs and in which the complete set is obtained by extracting all combination-tag sequences from the preexisting finite state transducer.
-
14. The article of claim 1 in which the finite state data structure is a finite state bimachine that includes first and second finite state transducers;
-
the first finite state transducer including paths that represent pairs of strings of tag combinations, each pair of strings including first and second strings of equal length, each tag combination in the first string having a counterpart tag combination in the second string;
at least one tag combination in the first string of a pair in the first finite state transducer including two or more tags, the counterpart tag combination in the second string being a subset of the tag combination in the first string;
the second finite state transducer including paths that represent pairs of strings of equal length, each pair of strings including a first string that is a string of tag combinations and a second string that is a string of tags, each tag combination in the first string having a counterpart tag in the second string;
at least one tag combination in the first string of a pair of strings in the second finite state transducer including two or more tags, the counterpart tag in the second string being one of the tags in the first string;
the second string of each path in the first finite state transducer being the first string of a path in the second finite state transducer.
-
-
15. The article of claim 1 in which the first and second strings of a pair of strings are of equal length, each tag combination in the first string having a counterpart tag in the second string;
- at least one tag combination in the first string including two or more tags, the counterpart tag in the second string being one of the tags in the tag combination.
-
16. The article of claim 1 in which each tag is a part-of-speech tag identifying a part of speech of the language and each tag combination is a combination of the part-of-speech tags.
-
17. The article of claim 1 in which each first string is represented by only one path, each first string'"'"'s path also representing one highly probable string of tags for the first string.
-
18. A machine for use in tagging text in a language, comprising:
-
a finite state data structure for use in mapping strings of tag combinations to strings of tags for tokens in the language, each tag combination being a combination of tags for tokens in the language;
the finite state data structure including paths that represent pairs of strings with a first string that is a string of tag combinations and a second string that is a string of tags;
the second strings of at least one set of one or more paths with the same first string including only highly probable strings of tags for the first string;
paths that represent pairs of strings being machine accessible using the first string to obtain the second string; and
a processor connected for accessing the finite state data structure;
the processor operating to;
obtain a text that includes a string of tokens in the language;
use the text to obtain a string of tag combinations that is the first string of one of the paths of the finite state data structure, the string of tag combinations being obtained by obtaining, for each token in the string of tokens, a tag combination; and
access the finite state data structure using the string of tag combinations to obtain the second string of the path, the second string being a string of tags for the string of tokens.
-
-
19. A method carried out in a system that includes tagging data for use in tagging text in a language;
- the tagging data including;
a lexical resource that can be accessed with a token to obtain a tag combination that includes tags the token can have in the language;
a finite state data structure for use in mapping strings of tag combinations to strings of tags for tokens in the language, each tag combination being a combination of tags for tokens in the language;
the finite state data structure including paths that represent pairs of strings with a first string that is a string of tag combinations and a second string that is a string of tags;
the second strings of at least one set of one or more paths with the same first string including only highly probable strings of tags for the first string;
paths that represent pairs of strings being machine accessible using the first string to obtain the second string;
the method comprising;
until a sentence'"'"'s end is reached, performing a series of iterations, each handling a token in the text;
each iteration after a first iteration handling a token that follows the token of the preceding iteration;
each iteration using the iteration'"'"'s token to access the lexical resource and obtain the token'"'"'s tag combination;
the series of iterations producing a string of tag combinations that is the first string of one of the paths in the finite state data structure;
using the string of tag combinations to access the finite state data structure to obtain the path'"'"'s second string, the second string being a string of tags for the sentence.
- the tagging data including;
-
20. A method of operating a first machine to transfer data to a second machine over a network, the second machine including a memory and a processor connected for accessing the memory;
- the method comprising;
establishing a connection between the first and second machines over the network; and
operating the first machine to transfer a finite state data structure for use in tagging text in a language to the memory of the second machine;
the finite state data structure mapping strings of tag combinations to strings of tags for tokens in the language, each tag combination being a combination of tags for tokens in the language;
the finite state data structure including paths that represent pairs of strings with a first string that is a string of tag combinations and a second string that is a string of tags;
the second strings of at least one set of one or more paths with the same first string including only highly probable strings of tags for the first string;
paths that represent pairs of strings being machine accessible using the first string to obtain the second string.
- the method comprising;
Specification