Method and apparatus for providing improved HMM POS tagger for multi-word entries and factoids
Abstract
A method of calculating trigram path probabilities for an input string of text containing a multi-word-entry (MWE) or a factoid includes tokenizing the input string to create a plurality of parse leaf units (PLUs). A PosColumn is constructed for each word, MWE, factoid and character in the input string of text which has a unique first (Ft) and last (Lt) token pair. TrigramColumns are constructed which define corresponding TrigramNodes, each representing a trigram for three PosColumns. Forward and backward trigram path probabilities are calculated for each separate TrigramNode. The sums of all trigram path probabilities through each PLU are then calculated as a function of the forward and backward trigram path probabilities. Systems and computer-readable media configured to implement the methods are also provided.
22 Claims
1. A method of calculating trigram path probabilities for an input string of text, the method comprising:
- tokenizing the input string of text to create a plurality of parse leaf units (PLUs), wherein tokenizing the input string of text further comprises:
  - assigning a token number, consecutively from left to right, to each word and character in the input string of text;
  - identifying multi-word-entries (MWEs) and factoids in the input string of text; and
  - assigning parts of speech to each token, MWE and factoid;
- constructing a PosColumn for each word, MWE, factoid and character in the input string of text which has a unique first (Ft) and last (Lt) token pair associated therewith, wherein constructing the PosColumn for each word, MWE, factoid and character in the input string of text further comprises:
  - adding dummy tokens for positions immediately prior to the first word, MWE, factoid or character and for positions immediately after the last word, MWE, factoid or character of the input string of text; and
  - assigning a Begin part of speech to dummy tokens for positions immediately prior to the first word, MWE, factoid or character of the input string of text, and assigning an End part of speech for positions immediately after the last word, MWE, factoid or character of the text;
- constructing all TrigramColumns corresponding to the input string of text, wherein each TrigramColumn defines a corresponding TrigramNode representing a trigram for three PosColumns in the TrigramColumn, each TrigramNode being identifiable by a unique set of three tokens;
- determining, for each TrigramColumn, all neighboring TrigramColumns to the immediate left and to the immediate right;
- calculating a forward trigram path probability, for each separate TrigramNode of each TrigramColumn, of all forward paths from a TrigramNode in a right neighboring TrigramColumn through the separate TrigramNode;
- calculating a backward trigram path probability, for each separate TrigramNode of each TrigramColumn, of all backward paths from a TrigramNode in a left neighboring TrigramColumn through the separate TrigramNode; and
- calculating sums of all trigram path probabilities through each PLU as a function of the calculated forward and backward trigram path probabilities.
View Dependent Claims (2, 3, 9, 10, 11)
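The tokenizing and PosColumn steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `PosColumn`, `build_pos_columns`, `lexicon` and `spans` are hypothetical, and the sketch assumes that token numbers start at 1, that two dummy tokens are added on each side (the claim does not say how many), and that unlisted words receive a placeholder `Unknown` tag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PosColumn:
    """One word, MWE, factoid or dummy token covering tokens ft..lt inclusive."""
    ft: int
    lt: int
    text: str
    tags: tuple


def build_pos_columns(words, spans, lexicon):
    """Build PosColumns for a tokenized input.

    words:   token strings; token numbers are assumed to run 1..n, left to right.
    spans:   dict mapping (ft, lt) -> POS tags for MWEs and factoids already found.
    lexicon: dict mapping word -> POS tags (hypothetical lookup table).
    """
    n = len(words)
    # Assumption: two Begin dummies before the first token.
    columns = [
        PosColumn(-1, -1, "<s>", ("Begin",)),
        PosColumn(0, 0, "<s>", ("Begin",)),
    ]
    # One column per single token; unknown words get a placeholder tag.
    for i, word in enumerate(words, start=1):
        columns.append(PosColumn(i, i, word, tuple(lexicon.get(word, ("Unknown",)))))
    # One column per MWE or factoid, spanning several tokens.
    for (ft, lt), tags in spans.items():
        columns.append(PosColumn(ft, lt, " ".join(words[ft - 1:lt]), tuple(tags)))
    # Assumption: two End dummies after the last token.
    columns.append(PosColumn(n + 1, n + 1, "</s>", ("End",)))
    columns.append(PosColumn(n + 2, n + 2, "</s>", ("End",)))
    return columns


words = ["in", "spite", "of", "rain"]
for col in build_pos_columns(words, {(1, 3): ("Prep",)}, {"rain": ("Noun", "Verb")}):
    print(col.ft, col.lt, col.text, col.tags)
```

In this example the MWE "in spite of" gets its own column with the pair (1, 3) alongside the three single-token columns, so every column still has a unique (Ft, Lt) pair.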
4. A method of calculating trigram path probabilities for an input string of text, the method comprising:
- tokenizing the input string of text to create a plurality of parse leaf units (PLUs);
- constructing a PosColumn for each word, multi-word-entry (MWE), factoid and character in the input string of text which has a unique first (Ft) and last (Lt) token pair associated therewith;
- constructing all TrigramColumns corresponding to the input string of text, wherein each TrigramColumn defines a corresponding TrigramNode representing a trigram for three PosColumns in the TrigramColumn, each TrigramNode being identifiable by a unique set of three tokens, wherein each TrigramNode contains a probability of the corresponding trigram, the forward probability of all forward paths through the TrigramNode, the backward probability of all backward paths through the TrigramNode, and the sum of all trigram path probabilities of all paths through the TrigramNode;
- creating a Trigram Graph, the Trigram Graph including with each of the constructed TrigramColumns an array of associated TrigramNodes;
- determining, for each TrigramColumn, all neighboring TrigramColumns to the immediate left and to the immediate right;
- calculating a forward trigram path probability, for each separate TrigramNode of each TrigramColumn, of all forward paths from a TrigramNode in a right neighboring TrigramColumn through the separate TrigramNode;
- calculating a backward trigram path probability, for each separate TrigramNode of each TrigramColumn, of all backward paths from a TrigramNode in a left neighboring TrigramColumn through the separate TrigramNode; and
- calculating sums of all trigram path probabilities through each PLU as a function of the calculated forward and backward trigram path probabilities.
View Dependent Claims (5, 6, 12, 13, 14)
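The TrigramNode contents and the Trigram Graph recited in claim 4 map naturally onto a few small record types. The Python sketch below is one possible layout, not the patent's own data structures. The field names, the `add_column` helper and the `trigram_prob` callable are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class TrigramNode:
    """The four values claim 4 says each node contains, plus its tag triple."""
    tags: tuple              # the three POS tags this node stands for
    prob: float              # probability of the corresponding trigram
    forward: float = 0.0     # probability of all forward paths through the node
    backward: float = 0.0    # probability of all backward paths through the node
    path_sum: float = 0.0    # sum over all paths through the node


@dataclass
class TrigramColumn:
    spans: tuple                                 # three (Ft, Lt) pairs
    nodes: list = field(default_factory=list)    # array of associated nodes


@dataclass
class TrigramGraph:
    columns: list = field(default_factory=list)

    def add_column(self, spans, tag_sets, trigram_prob):
        """Create one node per combination of tags from the three PosColumns.

        trigram_prob is a hypothetical callable mapping a tag triple to a float.
        """
        nodes = [TrigramNode(t, trigram_prob(t)) for t in product(*tag_sets)]
        column = TrigramColumn(tuple(spans), nodes)
        self.columns.append(column)
        return column


graph = TrigramGraph()
column = graph.add_column(
    [(0, 0), (1, 1), (2, 2)],
    [("Begin",), ("Noun", "Verb"), ("Noun", "Verb")],
    lambda tags: 0.25,  # placeholder probability, not a trained model
)
print(len(column.nodes))
```

Keeping the forward, backward and summed values on the node itself means a later pass can fill them in place without a separate lookup table.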
7. A method of calculating trigram path probabilities for an input string of text, the method comprising:
- tokenizing the input string of text to create a plurality of parse leaf units (PLUs);
- constructing a PosColumn for each word, multi-word-entry (MWE), factoid and character in the input string of text which has a unique first (Ft) and last (Lt) token pair associated therewith;
- constructing all TrigramColumns corresponding to the input string of text, wherein each TrigramColumn defines a corresponding TrigramNode representing a trigram for three PosColumns in the TrigramColumn, each TrigramNode being identifiable by a unique set of three tokens;
- determining, for each TrigramColumn, all neighboring TrigramColumns to the immediate left and to the immediate right, wherein determining, for each TrigramColumn, all TrigramColumns to the immediate left further comprises determining that a first TrigramColumn TCI having a set of PosColumns (PCA, PCB, PCC) is to the immediate left of a second TrigramColumn TCJ having a set of PosColumns (PCX, PCY, PCZ) if PCB is equal to PCX and if PCC is equal to PCY;
- calculating a forward trigram path probability, for each separate TrigramNode of each TrigramColumn, of all forward paths from a TrigramNode in a right neighboring TrigramColumn through the separate TrigramNode;
- calculating a backward trigram path probability, for each separate TrigramNode of each TrigramColumn, of all backward paths from a TrigramNode in a left neighboring TrigramColumn through the separate TrigramNode; and
- calculating sums of all trigram path probabilities through each PLU as a function of the calculated forward and backward trigram path probabilities.
View Dependent Claims (8, 15, 16)
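The neighbor test in claim 7 is a direct comparison of PosColumns, which the following Python sketch spells out. The rule in `is_immediate_left` follows the claim wording. The `build_trigram_columns` helper is an assumption added for context: it treats three PosColumns as a valid TrigramColumn when each one starts at the token right after the previous one ends, which the claim itself does not state.

```python
def is_immediate_left(tci, tcj):
    """Claim 7 rule: (PCA, PCB, PCC) is immediately left of (PCX, PCY, PCZ)
    if PCB equals PCX and PCC equals PCY."""
    _, pcb, pcc = tci
    pcx, pcy, _ = tcj
    return pcb == pcx and pcc == pcy


def build_trigram_columns(spans):
    """Enumerate triples of (Ft, Lt) spans that follow one another.

    Assumption: span b follows span a when b's first token is a's last token + 1.
    """
    by_first_token = {}
    for span in spans:
        by_first_token.setdefault(span[0], []).append(span)
    columns = []
    for a in spans:
        for b in by_first_token.get(a[1] + 1, []):
            for c in by_first_token.get(b[1] + 1, []):
                columns.append((a, b, c))
    return columns


# "in spite of rain": tokens 1..4, MWE (1, 3), two dummies on each side.
spans = [(-1, -1), (0, 0), (1, 1), (2, 2), (3, 3), (1, 3), (4, 4), (5, 5), (6, 6)]
columns = build_trigram_columns(spans)
for left in columns:
    for right in columns:
        if is_immediate_left(left, right):
            print(left, "->", right)
```

Because the MWE column (1, 3) and the single-token column (1, 1) both start at token 1, the column ending in (0, 0), (1, 3) links only to columns that begin with that same pair, which keeps MWE paths and word-by-word paths separate.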
17. A trigram path probability calculating system for calculating trigram path probabilities for an input string of text, the system comprising:
- a tokenizer configured to tokenize the input string of text to create a plurality of parse leaf units (PLUs), wherein the tokenizer is configured to tokenize the input string of text by assigning a token number, consecutively from left to right, to each word and character in the input string of text, by further identifying multi-word entries (MWEs) and factoids in the input string of text, and by further assigning parts of speech to each token, MWE and factoid;
- a PosColumn generator configured to construct a PosColumn for each word, MWE, factoid and character in the input string of text which has a unique first (Ft) and last (Lt) token pair associated therewith, wherein the PosColumn generator is configured to construct the PosColumn for each word, MWE, factoid and character in the input string of text by adding dummy tokens for positions immediately prior to the first word, MWE, factoid or character and for positions immediately after the last word, MWE, factoid or character of the input string of text, and by assigning a Begin part of speech to dummy tokens for positions immediately prior to the first word, MWE, factoid or character of the input string of text, and assigning an End part of speech for positions immediately after the last word, MWE, factoid or character of the input string of text;
- a TrigramColumn generator configured to construct TrigramColumns corresponding to the input string of text, wherein each TrigramColumn defines a corresponding TrigramNode representing a trigram for three PosColumns in the TrigramColumn, wherein the TrigramColumn generator is configured to construct the TrigramColumns such that each TrigramNode is identifiable by a unique set of three tokens, the TrigramColumn generator also configured to determine for each TrigramColumn all neighboring TrigramColumns to the immediate left and to the immediate right; and
- a trigram path probability calculator configured to calculate a forward trigram path probability and a backward trigram path probability for each separate TrigramNode of each TrigramColumn, wherein the trigram path probability calculator is configured to calculate the forward trigram path probability for each separate TrigramNode of each TrigramColumn by calculating the forward trigram path probability, for each separate TrigramNode of each TrigramColumn, of all forward paths from a TrigramNode in a right neighboring TrigramColumn through the separate TrigramNode, and wherein the trigram path probability calculator is configured to calculate the backward trigram path probability for each separate TrigramNode of each TrigramColumn by calculating the backward path probability, for each separate TrigramNode of each TrigramColumn, of all backward paths from a TrigramNode in a left neighboring TrigramColumn through the separate TrigramNode, the trigram path probability calculator further configured to calculate sums of all trigram path probabilities through each PLU as a function of the calculated forward and backward trigram path probabilities.
View Dependent Claims (18, 19)
20. A trigram path probability calculating system for calculating trigram path probabilities for an input string of text, the system comprising:
- a tokenizer configured to tokenize the input string of text to create a plurality of parse leaf units (PLUs);
- a PosColumn generator configured to construct a PosColumn for each word, multi-word entry (MWE), factoid and character in the input string of text which has a unique first (Ft) and last (Lt) token pair associated therewith;
- a TrigramColumn generator configured to construct TrigramColumns corresponding to the input string of text, wherein each TrigramColumn defines a corresponding TrigramNode representing a trigram for three PosColumns in the TrigramColumn;
- a trigram path probability calculator configured to calculate a forward trigram path probability and a backward trigram path probability for each separate TrigramNode of each TrigramColumn, the trigram path probability calculator further configured to calculate sums of all trigram path probabilities through each PLU as a function of the calculated forward and backward trigram path probabilities;
- a Trigram Graph generator configured to construct a Trigram Graph, the Trigram Graph including with each of the constructed TrigramColumns an array of associated TrigramNodes; and
- wherein each TrigramNode contains a probability of the corresponding trigram, the forward probability of all forward paths through the TrigramNode, the backward probability of all backward paths through the TrigramNode, and the sum of all trigram path probabilities of all paths through the TrigramNode.
View Dependent Claims (21, 22)
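The role of the trigram path probability calculator can be sketched as a forward-backward style pass over the TrigramNodes. The Python below is an illustration under stated assumptions, not the claimed calculator: `path_sums` and `node_prob` are hypothetical names, a node with no neighbor on one side is treated as the start or end of a path, and each PLU is credited at the middle position of its TrigramColumn. Following the claim wording, `forward` accumulates from right neighboring columns and `backward` from left neighboring columns.

```python
from collections import defaultdict
from itertools import product


def path_sums(columns, tags, node_prob):
    """Sum the probabilities of all trigram paths through each (span, tag) PLU.

    columns:   list of triples of (Ft, Lt) spans.
    tags:      dict mapping span -> tuple of candidate POS tags.
    node_prob: hypothetical callable (spans, tag_triple) -> float.
    """
    nodes = [(col, t) for col in columns for t in product(*(tags[s] for s in col))]
    p = {n: node_prob(*n) for n in nodes}

    def links(a, b):
        # Columns must be neighbors and the two shared tags must agree.
        return a[0][1:] == b[0][:2] and a[1][1:] == b[1][:2]

    # Quadratic neighbor search, kept simple for clarity.
    left = {n: [m for m in nodes if links(m, n)] for n in nodes}
    right = {n: [m for m in nodes if links(n, m)] for n in nodes}

    # A left neighbor always starts at an earlier token, so this is a valid order.
    order = sorted(nodes, key=lambda n: n[0][0][0])
    backward, forward, right_mass = {}, {}, {}
    for n in order:
        backward[n] = p[n] * (sum(backward[m] for m in left[n]) if left[n] else 1.0)
    for n in reversed(order):
        right_mass[n] = sum(forward[m] for m in right[n]) if right[n] else 1.0
        forward[n] = p[n] * right_mass[n]

    sums = defaultdict(float)
    for n in nodes:
        col, t = n
        # backward already includes p[n], so multiply by the mass to the right only.
        sums[(col[1], t[1])] += backward[n] * right_mass[n]
    return dict(sums)


tags = {(-1, -1): ("Begin",), (0, 0): ("Begin",), (1, 1): ("Noun", "Verb"),
        (2, 2): ("Noun", "Verb"), (3, 3): ("End",), (4, 4): ("End",)}
columns = [((-1, -1), (0, 0), (1, 1)), ((0, 0), (1, 1), (2, 2)),
           ((1, 1), (2, 2), (3, 3)), ((2, 2), (3, 3), (4, 4))]
result = path_sums(columns, tags, lambda spans, t: 1.0)
print(result[((1, 1), "Noun")])
```

With every node probability set to 1.0 the sums simply count paths: the two-word example has four tag sequences, and two of them tag the first word as a noun. Dividing a PLU's sum by the total over all complete paths would give a normalized score, which is a design choice outside what the claims recite.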
Specification