Keyword detection for speech recognition
First Claim
1. A method of recognizing a keyword in a speech, comprising:
on an electronic device;
receiving a sequence of audio frames comprising a current frame and a subsequent frame that follows the current frame;
determining a candidate keyword for the current frame using a predetermined decoding network that comprises keywords and filler words of multiple languages, associating the audio frame sequence with a confidence score that is partially determined according to the candidate keyword;
identifying a word option for the subsequent frame using the candidate keyword and the predetermined decoding network;
when the candidate keyword and the word option are associated with two distinct types of languages, updating the confidence score of the audio frame sequence based on a penalty factor that is predetermined according to the two distinct types of languages, the word option and an acoustic model of the subsequent frame; and
determining that the audio frame sequence includes both the candidate keyword and the word option by evaluating the updated confidence score according to a keyword determination criterion.
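The score update and final decision recited in the steps above can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the claim does not give a formula, so the additive log-domain scoring, the names `Word`, `LANGUAGE_SWITCH_PENALTY`, `update_confidence`, and `meets_criterion`, and every numeric value are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    language: str     # language tag, e.g. "zh" or "en"
    is_keyword: bool  # False marks a filler word


# Hypothetical penalty table, predetermined per ordered language pair.
# Values are log-domain offsets chosen only for illustration.
LANGUAGE_SWITCH_PENALTY = {
    ("zh", "en"): -2.0,
    ("en", "zh"): -1.5,
}


def update_confidence(score, candidate, option, acoustic_log_likelihood):
    """Update the sequence score for the subsequent frame.

    Assumption: the acoustic model contributes a log-likelihood for the
    word option, and the penalty is added only on a language switch.
    """
    updated = score + acoustic_log_likelihood
    if candidate.language != option.language:
        updated += LANGUAGE_SWITCH_PENALTY[(candidate.language, option.language)]
    return updated


def meets_criterion(score, threshold=-5.0):
    """Hypothetical keyword determination criterion: a fixed threshold."""
    return score >= threshold
```

In this sketch, a candidate keyword tagged "zh" followed by a word option tagged "en" lowers the running score by the zh-to-en penalty in addition to the acoustic term, while a same-language option adds only the acoustic term.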
Abstract
This application discloses a method of recognizing a keyword in speech that includes a sequence of audio frames, the sequence including a current frame and a subsequent frame. A candidate keyword is determined for the current frame using a decoding network that includes keywords and filler words of multiple languages, and the candidate keyword is used to determine a confidence score for the audio frame sequence. A word option is also determined for the subsequent frame based on the decoding network. When the candidate keyword and the word option are associated with two distinct types of languages, the confidence score of the audio frame sequence is updated based at least on a penalty factor associated with the two distinct types of languages. The audio frame sequence is then determined to include both the candidate keyword and the word option by evaluating the updated confidence score according to a keyword determination criterion.
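One way to picture a decoding network that holds keywords and filler words of multiple languages is sketched below. The abstract does not describe the network's topology, so the flat "any entry may follow any other" layout, the names `Entry`, `build_network`, and `cross_language_options`, and the sample vocabulary are all assumptions made for illustration.

```python
from collections import namedtuple

# kind is either "keyword" or "filler"
Entry = namedtuple("Entry", ["text", "language", "kind"])


def build_network(vocab):
    """Build a toy decoding network.

    vocab maps a language tag to {"keyword": [...], "filler": [...]}.
    Assumption: a flat loop topology in which every entry can be
    followed by every entry, across all languages.
    """
    entries = [
        Entry(text, language, kind)
        for language, kinds in vocab.items()
        for kind, words in kinds.items()
        for text in words
    ]
    return {entry: list(entries) for entry in entries}


def cross_language_options(network, candidate):
    """Word options for the next frame whose language differs from the candidate's."""
    return [e for e in network[candidate] if e.language != candidate.language]
```

Under these assumptions, the options returned by `cross_language_options` are exactly the transitions to which a language-pair penalty factor would apply.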
22 Citations
20 Claims
1. A method of recognizing a keyword in a speech, comprising:
on an electronic device;
receiving a sequence of audio frames comprising a current frame and a subsequent frame that follows the current frame;
determining a candidate keyword for the current frame using a predetermined decoding network that comprises keywords and filler words of multiple languages, associating the audio frame sequence with a confidence score that is partially determined according to the candidate keyword;
identifying a word option for the subsequent frame using the candidate keyword and the predetermined decoding network;
when the candidate keyword and the word option are associated with two distinct types of languages, updating the confidence score of the audio frame sequence based on a penalty factor that is predetermined according to the two distinct types of languages, the word option and an acoustic model of the subsequent frame; and
determining that the audio frame sequence includes both the candidate keyword and the word option by evaluating the updated confidence score according to a keyword determination criterion.
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14
15. An electronic device, comprising:
one or more processors; and
memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform operations comprising:
receiving a sequence of audio frames comprising a current frame and a subsequent frame that follows the current frame;
determining a candidate keyword for the current frame using a predetermined decoding network that comprises keywords and filler words of multiple languages, associating the audio frame sequence with a confidence score that is partially determined according to the candidate keyword;
identifying a word option for the subsequent frame using the candidate keyword and the predetermined decoding network;
when the candidate keyword and the word option are associated with two distinct types of languages, updating the confidence score of the audio frame sequence based on a penalty factor that is predetermined according to the two distinct types of languages, the word option and an acoustic model of the subsequent frame; and
determining that the audio frame sequence includes both the candidate keyword and the word option by evaluating the updated confidence score according to a keyword determination criterion.
Dependent claims: 16, 17, 18
19. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform operations comprising:
receiving a sequence of audio frames comprising a current frame and a subsequent frame that follows the current frame;
determining a candidate keyword for the current frame using a predetermined decoding network that comprises keywords and filler words of multiple languages, associating the audio frame sequence with a confidence score that is partially determined according to the candidate keyword;
identifying a word option for the subsequent frame using the candidate keyword and the predetermined decoding network;
when the candidate keyword and the word option are associated with two distinct types of languages, updating the confidence score of the audio frame sequence based on a penalty factor that is predetermined according to the two distinct types of languages, the word option and an acoustic model of the subsequent frame; and
determining that the audio frame sequence includes both the candidate keyword and the word option by evaluating the updated confidence score according to a keyword determination criterion.
Dependent claims: 20
Specification