Applying neural network language models to weighted finite state transducers for automatic speech recognition
Abstract
Systems and processes for converting speech to text are provided. In one example process, speech input can be received. A sequence of states and arcs of a weighted finite state transducer (WFST) can be traversed to determine a first probability of a candidate word given one or more history candidate words. A negating finite state transducer (FST) can be traversed to negate that first probability. A virtual FST can be composed using a neural network language model and based on the sequence of states and arcs of the WFST. One or more virtual states of the virtual FST can be traversed to determine a second probability of the candidate word given the one or more history candidate words. Text corresponding to the speech input can be determined based on the second probability, and an output can be provided based on that text.
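The rescoring arithmetic the abstract describes can be sketched as follows: the negating FST cancels the first-pass WFST language-model weight, so that only the neural network language model's weight remains on each arc. This is a minimal sketch with made-up probabilities and a lookup table standing in for the neural LM; it is not the patent's implementation:

```python
import math

# Toy first-pass (WFST) n-gram log-probabilities -- illustrative values only.
ngram_logprob = {("i", "like"): math.log(0.2), ("like", "cats"): math.log(0.1)}

# Stand-in for a neural network language model: here just a lookup table.
nn_logprob = {("i", "like"): math.log(0.3), ("like", "cats"): math.log(0.25)}

def rescore(words):
    """Combine WFST, negating-FST, and virtual-FST weights along a word path.

    For each (history, word) step the total picks up:
        first-pass score + negated first-pass score + neural LM score,
    so the net language-model contribution is the neural score alone.
    """
    total = 0.0
    for history, word in zip(words, words[1:]):
        first_pass = ngram_logprob[(history, word)]  # WFST traversal
        negated = -first_pass                        # negating FST arc
        neural = nn_logprob[(history, word)]         # virtual FST arc
        total += first_pass + negated + neural
    return total

print(round(rescore(["i", "like", "cats"]), 4))  # -> -2.5903
```

The result equals log(0.3) + log(0.25): the n-gram scores cancel exactly, which is the point of traversing the negating FST.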
Claims
1. A non-transitory computer-readable medium having instructions stored thereon that, when executed by one or more processors, cause the one or more processors to:

receive speech input;

traverse, based on the speech input, a sequence of states and arcs of a weighted finite state transducer (WFST), wherein:
    the sequence of states and arcs represents one or more history candidate words and a current candidate word; and
    a first probability of the current candidate word given the one or more history candidate words is determined by traversing the sequence of states and arcs of the WFST;

traverse a negating finite state transducer (FST), wherein traversing the negating FST negates the first probability of the current candidate word given the one or more history candidate words;

compose a virtual FST using a neural network language model and based on the sequence of states and arcs of the WFST, wherein one or more virtual states of the virtual FST represent the current candidate word;

traverse the one or more virtual states of the virtual FST, wherein a second probability of the current candidate word given the one or more history candidate words is determined by traversing the one or more virtual states of the virtual FST;

determine, based on the second probability of the current candidate word given the one or more history candidate words, text corresponding to the speech input;

based on the determined text, perform one or more tasks to obtain a result; and

cause the result to be presented in spoken or visual form.

Dependent claims: 2-13.
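Because a neural language model conditions on arbitrarily long histories, the virtual FST in claim 1 cannot be enumerated ahead of time; its states are typically created on demand during decoding. A minimal sketch of such lazy expansion, assuming a toy uniform LM and an illustrative `VirtualFST` class (the names and vocabulary are not from the patent):

```python
import math

VOCAB = ["cats", "dogs", "</s>"]

def toy_nn_lm(history, word):
    # Uniform stand-in for a neural network language model distribution.
    return 1.0 / len(VOCAB)

class VirtualFST:
    """On-demand ('virtual') FST: one state per word history, built lazily."""

    def __init__(self, lm):
        self.lm = lm
        self.states = {}  # history tuple -> list of (word, weight, next history)

    def expand(self, history):
        # Create outgoing arcs for this virtual state only when first visited.
        if history not in self.states:
            self.states[history] = [
                (w, -math.log(self.lm(history, w)), history + (w,))
                for w in VOCAB
            ]
        return self.states[history]

fst = VirtualFST(toy_nn_lm)
arcs = fst.expand(("i", "like"))
print(len(arcs), round(arcs[0][1], 4))  # -> 3 1.0986
```

Each arc's weight is a negative log probability, so the uniform toy LM yields -log(1/3) on every arc; a real neural LM would produce one forward pass per expanded state.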
14. A non-transitory computer-readable medium having instructions stored thereon that, when executed by one or more processors, cause the one or more processors to:

receive speech input;

traverse, based on the speech input, a sequence of states and arcs of a weighted finite state transducer (WFST), wherein:
    the sequence of states and arcs represents one or more history candidate words and a non-terminal class; and
    a first probability of the non-terminal class given the one or more history candidate words is determined by traversing the sequence of states and arcs of the WFST;

traverse a negating finite state transducer (FST), wherein traversing the negating FST negates the first probability of the non-terminal class given the one or more history candidate words;

compose a virtual FST using a neural network language model and a user-specific language model FST, and based on the sequence of states and arcs of the WFST, wherein one or more virtual states of the virtual FST represent a current candidate word corresponding to the non-terminal class;

traverse the one or more virtual states of the virtual FST, wherein a probability of the current candidate word given the one or more history candidate words and the non-terminal class is determined by traversing the one or more virtual states of the virtual FST;

determine, based on the probability of the current candidate word given the one or more history candidate words and the non-terminal class, text corresponding to the speech input;

based on the determined text, perform one or more tasks to obtain a result; and

cause the result to be presented in spoken or visual form.

Dependent claims: 15-18.
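Claim 14's non-terminal class can be read as factoring the word probability through the class: P(word | history) = P(class | history) * P(word | class), where the second factor comes from the user-specific language model FST (for example, a contact list). A toy sketch under that assumption, with all names and probabilities illustrative:

```python
import math

# P(class | history) from the first-pass WFST -- illustrative values only.
p_class = {("call", "<contact>"): 0.4}

# P(word | class) from a hypothetical user-specific language model FST.
user_contacts = {"alice": 0.6, "bob": 0.4}

def class_logprob(history, cls, word):
    # P(word | history, class) = P(class | history) * P(word | class),
    # computed in log space.
    return math.log(p_class[(history, cls)]) + math.log(user_contacts[word])

print(round(class_logprob("call", "<contact>", "alice"), 4))  # -> -1.4271
```

Factoring this way lets the shared WFST stay user-independent while the user-specific vocabulary is spliced in only where the non-terminal class appears.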
19. A method for performing speech-to-text conversion, the method comprising:
at an electronic device having a processor and memory:

receiving speech input;

traversing, based on the speech input, a sequence of states and arcs of a weighted finite state transducer (WFST), wherein:
    the sequence of states and arcs represents one or more history candidate words and a current candidate word; and
    a first probability of the current candidate word given the one or more history candidate words is determined by traversing the sequence of states and arcs of the WFST;

traversing a negating finite state transducer (FST), wherein traversing the negating FST negates the first probability of the current candidate word given the one or more history candidate words;

composing a virtual FST using a neural network language model and based on the sequence of states and arcs of the WFST, wherein one or more virtual states of the virtual FST represent the current candidate word;

traversing the one or more virtual states of the virtual FST, wherein a second probability of the current candidate word given the one or more history candidate words is determined by traversing the one or more virtual states of the virtual FST;

determining, based on the second probability of the current candidate word given the one or more history candidate words, text corresponding to the speech input;

based on the determined text, performing one or more tasks to obtain a result; and

causing the result to be presented in spoken or visual form.

Dependent claims: 21-25.
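Once the second probabilities are available, the determining step reduces to selecting the candidate with the best combined weight. A trivial sketch, assuming weights are negative log probabilities (lower is better) and the values are made up for illustration:

```python
# Candidate words with combined (acoustic + rescored language-model) weights,
# expressed as negative log probabilities; lower is better. Illustrative only.
candidates = {"cats": 2.1, "cuts": 2.9, "carts": 3.4}

# Pick the candidate word with the minimum total weight as the decoded text.
best = min(candidates, key=candidates.get)
print(best)  # -> cats
```

In a full decoder this selection happens over entire lattice paths under a beam, not a flat dictionary, but the criterion is the same.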
20. An electronic device comprising:
one or more processors; and

memory having instructions stored thereon that, when executed by the one or more processors, cause the one or more processors to:

receive speech input;

traverse, based on the speech input, a sequence of states and arcs of a weighted finite state transducer (WFST), wherein:
    the sequence of states and arcs represents one or more history candidate words and a current candidate word; and
    a first probability of the current candidate word given the one or more history candidate words is determined by traversing the sequence of states and arcs of the WFST;

traverse a negating finite state transducer (FST), wherein traversing the negating FST negates the first probability of the current candidate word given the one or more history candidate words;

compose a virtual FST using a neural network language model and based on the sequence of states and arcs of the WFST, wherein one or more virtual states of the virtual FST represent the current candidate word;

traverse the one or more virtual states of the virtual FST, wherein a second probability of the current candidate word given the one or more history candidate words is determined by traversing the one or more virtual states of the virtual FST;

determine, based on the second probability of the current candidate word given the one or more history candidate words, text corresponding to the speech input;

based on the determined text, perform one or more tasks to obtain a result; and

cause the result to be presented in spoken or visual form.

Dependent claims: 26-30.
31. A method for performing speech-to-text conversion, the method comprising:
at an electronic device having a processor and memory:

receiving speech input;

traversing, based on the speech input, a sequence of states and arcs of a weighted finite state transducer (WFST), wherein:
    the sequence of states and arcs represents one or more history candidate words and a non-terminal class; and
    a first probability of the non-terminal class given the one or more history candidate words is determined by traversing the sequence of states and arcs of the WFST;

traversing a negating finite state transducer (FST), wherein traversing the negating FST negates the first probability of the non-terminal class given the one or more history candidate words;

composing a virtual FST using a neural network language model and based on the sequence of states and arcs of the WFST, wherein one or more virtual states of the virtual FST represent a current candidate word corresponding to the non-terminal class;

traversing the one or more virtual states of the virtual FST, wherein a probability of the current candidate word given the one or more history candidate words and the non-terminal class is determined by traversing the one or more virtual states of the virtual FST;

determining, based on the probability of the current candidate word given the one or more history candidate words and the non-terminal class, text corresponding to the speech input;

based on the determined text, performing one or more tasks to obtain a result; and

causing the result to be presented in spoken or visual form.

Dependent claims: 32-35.
36. An electronic device comprising:
one or more processors; and

memory having instructions stored thereon that, when executed by the one or more processors, cause the one or more processors to:

receive speech input;

traverse, based on the speech input, a sequence of states and arcs of a weighted finite state transducer (WFST), wherein:
    the sequence of states and arcs represents one or more history candidate words and a non-terminal class; and
    a first probability of the non-terminal class given the one or more history candidate words is determined by traversing the sequence of states and arcs of the WFST;

traverse a negating finite state transducer (FST), wherein traversing the negating FST negates the first probability of the non-terminal class given the one or more history candidate words;

compose a virtual FST using a neural network language model and based on the sequence of states and arcs of the WFST, wherein one or more virtual states of the virtual FST represent a current candidate word corresponding to the non-terminal class;

traverse the one or more virtual states of the virtual FST, wherein a probability of the current candidate word given the one or more history candidate words and the non-terminal class is determined by traversing the one or more virtual states of the virtual FST;

determine, based on the probability of the current candidate word given the one or more history candidate words and the non-terminal class, text corresponding to the speech input;

based on the determined text, perform one or more tasks to obtain a result; and

cause the result to be presented in spoken or visual form.

Dependent claims: 37-40.
Specification