Phoneme lattice construction and its application to speech recognition and keyword spotting
First Claim
1. A method for processing a speech signal, comprising:
using a memory, coupled to a processor, to receive an input speech signal;
using the processor to construct a phoneme lattice for the input speech signal;
determining vertices and arc parameters of the phoneme lattice for the input speech signal;
searching the phoneme lattice to produce a likelihood score for each potential path; and
determining a processing result for the input speech signal based on the likelihood score of each potential path;
wherein constructing the phoneme lattice includes:
segmenting an input speech signal into frames,
extracting acoustic features for a frame of the input speech signal,
determining K-best initial phoneme paths leading to the frame based on a first score of each potential phoneme path leading to the frame, and
calculating a second score for each of the K-best phoneme paths for the frame;
wherein searching the phoneme lattice comprises:
receiving a phoneme lattice;
traversing the phoneme lattice via potential paths;
computing a score for a traversed path based on at least one of a phoneme confusion matrix and a plurality of language models; and
modifying the score for the traversed path by allowing repetition of phonemes and allowing flexible endpoints for phonemes in a path such that at least one of a first arc that ends at a first frame and a second arc that starts at a third frame is extended so that the first arc and the second arc are directly connected at a second frame.
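The lattice-construction steps recited above (per-frame processing and selection of K-best initial phoneme paths by a first score) can be illustrated with a minimal sketch. It is not the patented method: it assumes the acoustic features have already been turned into per-frame phoneme log-likelihoods, and it uses an invented first score (cumulative log-likelihood plus a fixed penalty for switching phonemes). The names `k_best_paths`, `frame_scores` and `switch_penalty` are hypothetical.

```python
import heapq

def k_best_paths(frame_scores, k, switch_penalty=-1.0):
    """Return, for each frame, the K best phoneme paths leading to it.

    frame_scores: list of dicts mapping phoneme -> log-likelihood for a frame
    (an assumed stand-in for acoustic model output, not taken from the patent).
    Each path is a (score, tuple_of_phonemes_per_frame) pair.
    """
    beams = []
    current = [(0.0, ())]  # a single empty path with score 0
    for scores in frame_scores:
        candidates = []
        for path_score, path in current:
            for phoneme, loglik in scores.items():
                # Staying in the same phoneme is free; switching costs a penalty.
                same = not path or path[-1] == phoneme
                penalty = 0.0 if same else switch_penalty
                candidates.append((path_score + loglik + penalty, path + (phoneme,)))
        # Keep only the K highest-scoring paths leading to this frame.
        current = heapq.nlargest(k, candidates)
        beams.append(current)
    return beams

frames = [
    {"k": -0.2, "g": -1.8},
    {"k": -0.5, "ae": -1.0},
    {"ae": -0.1, "t": -2.5},
]
beams = k_best_paths(frames, k=2)
best_score, best_path = beams[-1][0]
print(best_path, round(best_score, 3))
```

A second, more global score could then be computed over the surviving paths of each frame; the sketch stops at the first pruning step.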
Abstract
An arrangement is provided for using a phoneme lattice for speech recognition and/or keyword spotting. The phoneme lattice may be constructed for an input speech signal and searched to produce a textual representation for the input speech signal and/or to determine if the input speech signal contains targeted keywords. An expectation maximization (EM) trained phoneme confusion matrix may be used when searching the phoneme lattice. The phoneme lattice may be constructed in a client and sent to a server, which may search the phoneme lattice to produce a result.
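As an illustration of how a phoneme confusion matrix might enter the search score, the sketch below sums log-probabilities P(recognized | intended) along a path. Everything in it is assumed for illustration: the probabilities are invented rather than EM-trained, the alignment is position by position, and `CONFUSION` and `confusion_log_score` are hypothetical names.

```python
import math

# Toy confusion probabilities P(recognized | intended); the values are invented.
CONFUSION = {
    ("k", "k"): 0.9, ("k", "g"): 0.1,
    ("ae", "ae"): 0.8, ("ae", "eh"): 0.2,
    ("t", "t"): 0.7, ("t", "d"): 0.3,
}

def confusion_log_score(intended, recognized, floor=1e-6):
    """Sum log P(recognized_i | intended_i) over position-aligned phonemes.

    Pairs missing from the table receive a small floor probability.
    """
    if len(intended) != len(recognized):
        raise ValueError("this sketch only handles equal-length sequences")
    return sum(
        math.log(CONFUSION.get((i, r), floor))
        for i, r in zip(intended, recognized)
    )

keyword = ["k", "ae", "t"]
exact = confusion_log_score(keyword, ["k", "ae", "t"])
confused = confusion_log_score(keyword, ["g", "ae", "d"])
unrelated = confusion_log_score(keyword, ["s", "iy", "m"])
print(exact > confused > unrelated)
```

A path whose phonemes are plausible confusions of the keyword scores lower than an exact match but far higher than an unrelated path, which is what lets a keyword spotter tolerate recognition errors in the lattice.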
33 Citations
13 Claims
1. A method for processing a speech signal, comprising:
using a memory, coupled to a processor, to receive an input speech signal;
using the processor to construct a phoneme lattice for the input speech signal;
determining vertices and arc parameters of the phoneme lattice for the input speech signal;
searching the phoneme lattice to produce a likelihood score for each potential path; and
determining a processing result for the input speech signal based on the likelihood score of each potential path;
wherein constructing the phoneme lattice includes:
segmenting an input speech signal into frames,
extracting acoustic features for a frame of the input speech signal,
determining K-best initial phoneme paths leading to the frame based on a first score of each potential phoneme path leading to the frame, and
calculating a second score for each of the K-best phoneme paths for the frame;
wherein searching the phoneme lattice comprises:
receiving a phoneme lattice;
traversing the phoneme lattice via potential paths;
computing a score for a traversed path based on at least one of a phoneme confusion matrix and a plurality of language models; and
modifying the score for the traversed path by allowing repetition of phonemes and allowing flexible endpoints for phonemes in a path such that at least one of a first arc that ends at a first frame and a second arc that starts at a third frame is extended so that the first arc and the second arc are directly connected at a second frame.
Dependent claims: 2, 3, 4
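The flexible-endpoint step at the end of the claim can be pictured with a small sketch. It assumes an arc is simply a phoneme label with integer start and end frames; the `Arc` class and the `connect_at` helper are hypothetical and only show the geometric idea of extending one or both arcs until they meet at an intermediate frame.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Arc:
    phoneme: str
    start: int  # frame at which the arc starts
    end: int    # frame at which the arc ends

def connect_at(first, second, meet_frame):
    """Extend the end of `first` and/or the start of `second` to `meet_frame`."""
    if not (first.end <= meet_frame <= second.start):
        raise ValueError("meet_frame must lie between the two arcs")
    return replace(first, end=meet_frame), replace(second, start=meet_frame)

a = Arc("k", start=0, end=10)    # ends at a first frame
b = Arc("ae", start=12, end=20)  # starts at a third frame
a2, b2 = connect_at(a, b, meet_frame=11)  # both extended to a second frame
print(a2, b2)
```

Choosing `meet_frame` equal to `first.end` or `second.start` extends only one of the two arcs, which matches the "at least one of" wording. How the path score is adjusted for the extension is not modeled here.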
5. A method for distributing speech processing, comprising:
using a memory, included in a client, to receive an input speech signal;
using a processor, included in the client and coupled to the memory, to construct a phoneme lattice for the input speech signal;
determining vertices and arc parameters of the phoneme lattice for the input speech signal;
transmitting the phoneme lattice from the client to a server; and
searching the phoneme lattice to produce a result for the input speech signal for the purpose of at least one of recognizing speech and spotting keywords, in the input speech signal;
wherein constructing the phoneme lattice includes:
segmenting an input speech signal into frames,
extracting acoustic features for a frame of the input speech signal,
determining K-best initial phoneme paths leading to the frame based on a first score of each potential phoneme path leading to the frame, and
calculating a second score for each of the K-best phoneme paths;
wherein searching the phoneme lattice comprises:
receiving a phoneme lattice;
traversing the phoneme lattice via potential paths;
computing a score for a traversed path based on at least one of a phoneme confusion matrix and a plurality of language models; and
modifying the score for the traversed path by allowing repetition of phonemes and allowing flexible endpoints for phonemes in a path such that at least one of a first arc that ends at a first frame and a second arc that starts at a third frame is extended such that the first arc and the third arc are directly connected at a second frame.
Dependent claims: 6, 7
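The client-to-server hand-off in this claim only requires that the lattice's vertices and arc parameters survive transmission. The sketch below shows one possible encoding, assuming JSON as the wire format and an arc represented as a (source vertex, destination vertex, phoneme, score) tuple. The claim does not specify any format; `encode_lattice` and `decode_lattice` are hypothetical helpers, and no actual network transport is shown.

```python
import json

def encode_lattice(vertices, arcs):
    """Client side: pack vertices (frame indices) and arcs into JSON bytes."""
    payload = {
        "vertices": sorted(vertices),
        "arcs": [
            {"from": src, "to": dst, "phoneme": ph, "score": score}
            for (src, dst, ph, score) in arcs
        ],
    }
    return json.dumps(payload).encode("utf-8")

def decode_lattice(data):
    """Server side: recover the vertex list and arc tuples from the payload."""
    payload = json.loads(data.decode("utf-8"))
    arcs = [(a["from"], a["to"], a["phoneme"], a["score"]) for a in payload["arcs"]]
    return payload["vertices"], arcs

vertices = {0, 11, 20}
arcs = [(0, 11, "k", -1.25), (0, 11, "g", -2.5), (11, 20, "ae", -0.75)]
wire = encode_lattice(vertices, arcs)
got_vertices, got_arcs = decode_lattice(wire)
print(len(wire), got_vertices)
```

Sending the lattice instead of raw audio or features keeps the acoustic work on the client and leaves the confusion-matrix and language-model search to the server.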
8. A speech processing system, comprising:
a phoneme lattice constructor to construct a phoneme lattice for an input speech signal;
a phoneme lattice search mechanism to search the phoneme lattice for the purpose of at least one of recognizing speech and spotting keywords, in the input speech signal;
a plurality of models for lattice construction; and
a plurality of models for lattice search;
wherein the phoneme lattice constructor includes:
an acoustic feature extractor to segment the input speech signal into frames and to extract acoustic features for a frame,
a phoneme path estimator to determine K-best initial phoneme paths leading to the frame,
a global score evaluator to determine M-best refined phoneme paths based on a cluster of K-best paths of at least one consecutive frame, and
a lattice parameter identifier to identify lattice vertices and arc parameters based on M-best refined phoneme paths of each frame, wherein at least one of a first arc that ends at a first frame and a second arc that starts at a third frame is extended such that the first arc and the third arc are directly connected at a second frame.
Dependent claims: 9, 10
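The role of the lattice parameter identifier can be sketched as follows, assuming each refined path is a list with one phoneme label per frame. Runs of identical labels are collapsed into arcs, and the arc boundaries become the lattice vertices. This is an illustrative simplification with hypothetical function names (`path_to_arcs`, `identify_lattice`), and it leaves out arc scores entirely.

```python
from itertools import groupby

def path_to_arcs(path):
    """Collapse a per-frame phoneme path into (start, end, phoneme) arcs.

    `end` is exclusive, so consecutive arcs share a boundary frame index.
    """
    arcs, frame = [], 0
    for phoneme, run in groupby(path):
        length = len(list(run))
        arcs.append((frame, frame + length, phoneme))
        frame += length
    return arcs

def identify_lattice(paths):
    """Union the arcs of several paths; vertices are the arc boundaries."""
    arcs = set()
    for path in paths:
        arcs.update(path_to_arcs(path))
    vertices = {frame for start, end, _ in arcs for frame in (start, end)}
    return sorted(vertices), sorted(arcs)

m_best = [
    ["k", "k", "ae", "ae", "t"],
    ["k", "ae", "ae", "ae", "t"],
]
vertices, arcs = identify_lattice(m_best)
print(vertices)
print(arcs)
```

Arcs shared by several paths, such as the final "t" here, are stored once, which is what makes a lattice more compact than a list of alternative paths.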
11. An article comprising:
a machine accessible medium having content stored thereon, wherein the content is accessed by a processor, the content provides for processing a speech signal by:
receiving an input speech signal;
constructing a phoneme lattice for the input speech signal;
determining arc parameters of the phoneme lattice;
receiving a phoneme lattice;
traversing the phoneme lattice via potential paths;
computing a score for a traversed path based on at least one of a phoneme confusion matrix and a plurality of language models; and
modifying the score for the traversed path based on flexible endpoints for phonemes in the traversed path; and
determining a processing result for the input speech signal based on the modified score.
12. An article comprising:
a machine accessible medium having content stored thereon, wherein when the content is accessed by a processor, the content provides for distributing speech processing by:
receiving an input speech signal by a client;
constructing a phoneme lattice for the input speech signal by the client;
determining vertices and arc parameters of the phoneme lattice for the input speech signal;
transmitting the phoneme lattice from the client to a server; and
searching the phoneme lattice to produce a result for the input speech signal for the purpose of at least one of recognizing speech and spotting keywords, in the input speech signal;
wherein constructing the phoneme lattice includes:
segmenting an input speech signal into frames,
extracting acoustic features for a frame of the input speech signal,
determining K-best initial phoneme paths leading to the frame based on a first score of each potential phoneme path leading to the frame, and
calculating a second score for each of the K-best phoneme paths;
wherein searching the phoneme lattice comprises:
receiving a phoneme lattice;
traversing the phoneme lattice via potential paths;
computing a score for a traversed path based on at least one of a phoneme confusion matrix and a plurality of language models; and
modifying the score for the traversed path by allowing flexible endpoints for phonemes in a path such that, based on the flexible endpoints, at least one of a first arc that ends at a first frame and a second arc that starts at a third frame is extended so that the first arc and the second arc are directly connected at a second frame.
Dependent claims: 13
Specification