Adaptive parser-centric text normalization
First Claim
1. A method comprising:
receiving at a computing node an input sequence comprising a plurality of tokens;
applying by a processor of the computing node a plurality of domain-specific generators to the input sequence to generate a set of candidate replacements of the tokens of the input sequence;
creating in a memory of the computing node a directed graph comprising a plurality of nodes and a plurality of edges, each node having an associated candidate replacement of the set of candidate replacements, and each edge connecting a first node to a second node, the second node being associated with a consistent follower of the candidate replacement associated with the first node, and creating the plurality of edges comprising determining syntactic consistency between each pair of the set of candidate replacements;
determining by the processor a plurality of paths in the directed graph, each of the plurality of paths comprising at least one of the plurality of edges;
determining by the processor a score for each of the paths;
selecting by the processor a path of the plurality of paths having the highest score;
applying by the processor each candidate replacement of the selected path to the input sequence to generate a normalized output sequence; and
evaluating a correctness of the normalized output sequence by parsing the normalized output sequence to obtain a parse result and comparing the parse result with a gold standard that is obtained by parsing a manually normalized sequence.
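The generate, graph, score, select, and apply steps recited above can be illustrated with a minimal sketch. It is not the patented implementation. Everything named here is hypothetical: the `Replacement` record, the two toy generators and their weights, the rule that one replacement is a consistent follower of another when their token spans do not overlap, the START and END sentinel nodes, and the log-odds path score. The claim itself does not define how syntactic consistency or the score is computed.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass(frozen=True)
class Replacement:
    start: int     # index of the first token replaced
    end: int       # index one past the last token replaced
    text: str      # replacement text
    weight: float  # hypothetical generator confidence, strictly between 0 and 1

Generator = Callable[[List[str]], List[Replacement]]

def slang_generator(tokens: List[str]) -> List[Replacement]:
    # Toy domain-specific generator: expands a few chat abbreviations.
    table = {"u": "you", "2": "to", "gr8": "great"}
    return [Replacement(i, i + 1, table[t.lower()], 0.9)
            for i, t in enumerate(tokens) if t.lower() in table]

def number_generator(tokens: List[str]) -> List[Replacement]:
    # Toy generator: spells out the digit "2".
    return [Replacement(i, i + 1, "two", 0.6)
            for i, t in enumerate(tokens) if t == "2"]

def is_consistent_follower(a: Replacement, b: Replacement) -> bool:
    # Assumption: b may follow a when b starts at or after the end of a.
    return b.start >= a.end

def build_graph(cands: List[Replacement],
                n_tokens: int) -> Tuple[List[Replacement], Dict[int, List[int]]]:
    # Sentinels guarantee that every START-to-END path has at least one edge.
    start = Replacement(0, 0, "", 1.0)
    end = Replacement(n_tokens, n_tokens, "", 1.0)
    ordered = sorted(set(cands), key=lambda r: (r.start, r.end, r.text))
    nodes = [start] + ordered + [end]
    # END has no outgoing edges and START is never a target.
    edges = {i: [j for j in range(1, len(nodes))
                 if j != i and is_consistent_follower(nodes[i], nodes[j])]
             for i in range(len(nodes) - 1)}
    return nodes, edges

def all_paths(edges: Dict[int, List[int]], node: int, last: int) -> List[List[int]]:
    # Enumerate every path from node to last (exponential; fine for a demo).
    if node == last:
        return [[node]]
    return [[node] + rest
            for nxt in edges[node]
            for rest in all_paths(edges, nxt, last)]

def score(path: List[int], nodes: List[Replacement]) -> float:
    # Hypothetical score: sum of log-odds of the replacements on the path.
    return sum(math.log(nodes[i].weight / (1.0 - nodes[i].weight))
               for i in path[1:-1])

def apply_path(tokens: List[str], path: List[int], nodes: List[Replacement]) -> str:
    out, pos = [], 0
    for i in path[1:-1]:
        r = nodes[i]
        out.extend(tokens[pos:r.start])  # copy untouched tokens
        out.append(r.text)               # substitute the replacement
        pos = r.end
    out.extend(tokens[pos:])
    return " ".join(out)

def normalize(tokens: List[str], generators: List[Generator]) -> str:
    cands = [r for g in generators for r in g(tokens)]
    nodes, edges = build_graph(cands, len(tokens))
    paths = all_paths(edges, 0, len(nodes) - 1)
    best = max(paths, key=lambda p: score(p, nodes))
    return apply_path(tokens, best, nodes)

print(normalize("nice 2 see u".split(), [slang_generator, number_generator]))
# nice to see you
```

In this sketch the two competing replacements for "2" overlap, so no edge joins them and no path can contain both. Exhaustive path enumeration keeps the example close to the claim wording; a practical system would more likely use dynamic programming over the graph.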
Abstract
Embodiments of the present invention relate to a customizable text normalization framework providing for domain adaptability through modular replacement generators. In one embodiment, a method of and computer program product for text normalization are provided. An input sequence comprising a plurality of tokens is received. A plurality of generators is applied to the input sequence to generate a set of candidate replacements of the tokens of the sequence. A plurality of subsets of the set of candidate replacements is determined such that the candidate replacements of each subset are syntactically consistent. A probability is determined for each of the subsets. A subset of the plurality of subsets having the highest probability is selected. Each candidate replacement of the selected subset is applied to the input sequence to generate an output sequence. The output sequence is outputted.
20 Claims
1. A method comprising: (recited in full under First Claim above)
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
14. A computer program product for text normalization, the computer program product comprising a computer readable storage medium having program code embodied therewith, the program code executable by a processor to:
receive at a computing node an input sequence comprising a plurality of tokens;
apply by a processor of the computing node a plurality of generators to the input sequence to generate a set of candidate replacements of the tokens of the input sequence;
create in a memory of the computing node a directed graph comprising a plurality of nodes and a plurality of edges, each node having an associated candidate replacement of the set of candidate replacements, and each edge connecting a first node to a second node, the second node being associated with a consistent follower of the candidate replacement associated with the first node, and creating the plurality of edges comprising determining syntactic consistency between each pair of the set of candidate replacements;
determine by the processor a plurality of paths in the directed graph, each of the plurality of paths comprising at least one of the plurality of edges;
determine by the processor a score for each of the paths;
select by the processor a path of the plurality of paths having the highest score;
apply by the processor each candidate replacement of the selected path to the input sequence to generate a normalized output sequence; and
evaluate a correctness of the normalized output sequence by parsing the normalized output sequence to obtain a parse result and comparing the parse result with a gold standard that is obtained by parsing a manually normalized sequence.
Dependent claims: 15, 16, 17, 18, 19, 20
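The final evaluation step of the independent claims (parse the normalized output, parse a manually normalized sequence, compare the two parse results) might look roughly like the sketch below. The claims do not say how parse results are represented or compared, so the representation as a set of (head, relation, dependent) triples and the precision, recall, and F1 comparison are assumptions. `toy_parse` is a hypothetical stand-in so the example runs without a real syntactic parser.

```python
from typing import Callable, Dict, Set, Tuple

Triple = Tuple[str, str, str]          # (head, relation, dependent), assumed shape
Parser = Callable[[str], Set[Triple]]  # any parser that yields such triples

def evaluate(normalized: str, manual: str, parse: Parser) -> Dict[str, float]:
    # Parse both sequences with the same parser; the manual one is the gold standard.
    result = parse(normalized)
    gold = parse(manual)
    matched = len(result & gold)
    precision = matched / len(result) if result else 0.0
    recall = matched / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def toy_parse(sentence: str) -> Set[Triple]:
    # Stand-in only: links each word to the next one.
    # A real system would call a syntactic parser here.
    words = sentence.split()
    return {(a, "next", b) for a, b in zip(words, words[1:])}

perfect = evaluate("nice to see you", "nice to see you", toy_parse)
partial = evaluate("nice 2 see you", "nice to see you", toy_parse)
print(perfect["f1"], partial["f1"])
```

Scoring against the parse of a manually normalized sequence, rather than against the manually normalized text itself, means a normalization is judged by whether it changes what the parser produces, which matches the parser-centric framing of the title.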
Specification