Keyword detection for speech recognition

US 9,230,541 B2
Filed: 12/11/2014
Issued: 01/05/2016
Est. Priority Date: 08/15/2013
Status: Active Grant

1 Assignment

0 Petitions

Abstract

This application discloses a method of recognizing a keyword in speech that includes a sequence of audio frames, the sequence including a current frame and a subsequent frame. A candidate keyword is determined for the current frame using a decoding network that includes keywords and filler words of multiple languages, and is used to determine a confidence score for the audio frame sequence. A word option is also determined for the subsequent frame based on the decoding network, and when the candidate keyword and the word option are associated with two distinct types of languages, the confidence score of the audio frame sequence is updated based at least on a penalty factor associated with the two distinct types of languages. The audio frame sequence is then determined to include both the candidate keyword and the word option by evaluating the updated confidence score according to a keyword determination criterion.
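
The scoring step summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not give a formula, so everything here is an assumption made for illustration: the confidence score is treated as a running log-likelihood, the acoustic contribution of the subsequent frame is added to it, and the penalty factor is applied as an added `log(penalty_factor)` only when the two words belong to different languages. The names `Word`, `update_confidence`, and `meets_criterion` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A keyword or word option together with its language tag (hypothetical type)."""
    text: str
    language: str  # e.g. "en", "zh"


def update_confidence(score, candidate, option, acoustic_log_likelihood, penalty_factor):
    """Return an updated log-domain confidence score for the audio frame sequence.

    Assumption: scores are log-likelihoods, so the acoustic term is added and
    the penalty factor (0 < factor <= 1) is applied as log(factor).
    """
    updated = score + acoustic_log_likelihood
    if candidate.language != option.language:
        # Cross-language transition: lower the score by the penalty.
        updated += math.log(penalty_factor)
    return updated


def meets_criterion(score, keyword_threshold):
    """One possible keyword determination criterion: score above a threshold."""
    return score > keyword_threshold
```

With a candidate keyword tagged `"en"` and a word option tagged `"zh"`, the score is reduced by the penalty; with two `"en"` words it is left at the sum of the prior score and the acoustic term.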

22 Citations

20 Claims

1. A method of recognizing a keyword in a speech, comprising:
- on an electronic device:
  
  receiving a sequence of audio frames comprising a current frame and a subsequent frame that follows the current frame;
  
  determining a candidate keyword for the current frame using a predetermined decoding network that comprises keywords and filler words of multiple languages, associating the audio frame sequence with a confidence score that is partially determined according to the candidate keyword;
  
  identifying a word option for the subsequent frame using the candidate keyword and the predetermined decoding network;
  
  when the candidate keyword and the word option are associated with two distinct types of languages, updating the confidence score of the audio frame sequence based on a penalty factor that is predetermined according to the two distinct types of languages, the word option and an acoustic model of the subsequent frame; and
  
  determining that the audio frame sequence includes both the candidate keyword and the word option by evaluating the updated confidence score according to a keyword determination criterion.
- - 2. The method of claim 1, wherein a plurality of candidate keywords, including the candidate keyword, are determined for the current frame of the audio frame sequence, and each candidate keyword is associated with at least one word option, and wherein a subset of the plurality of candidate keywords are determined to be included in the audio frame sequence together with their respective at least one word options based on the keyword determination criterion.
  - 3. The method of claim 2, wherein the subsequent frame is the last frame of the audio frame sequence, and in accordance with the keyword determination criterion, a candidate keyword that is associated with a preferred confidence score is selected from the plurality of candidate keywords as a keyword associated with the current frame of the audio frame sequence.
  - 4. The method of claim 2, wherein in accordance with the keyword determination criterion, each of the plurality of candidate keywords is associated with a respective confidence score of the audio frame sequence, and the respective confidence score is larger than a keyword threshold value.
  - 5. The method of claim 2, wherein after the subset of the candidate keywords are determined to be included in the audio frame sequence together with their respective at least one word options, the corresponding confidence score is updated and is determined to exceed a keyword threshold value in accordance with the keyword determination criterion.
  - 6. The method of claim 1, wherein in accordance with the keyword determination criterion, the confidence score of the audio frame sequence is larger than a keyword threshold.
  - 7. The method of claim 1, wherein the predetermined decoding network is associated with two or more languages of English, Chinese, Japanese, Russian, French, German and the like, and includes a subset of keywords and a subset of filler words for each of the two or more languages.
  - 8. The method of claim 1, wherein each keyword of the predetermined decoding network comprises one or more triphones.
  - 9. The method of claim 1, wherein in accordance with a decoding structure of the predetermined decoding network, each keyword in the predetermined decoding network is associated with at least one word that is used together with the respective keyword in real speech and included in the decoding network.
  - 10. The method of claim 9, wherein in accordance with a decoding structure of the predetermined decoding network, each keyword in a subset of keywords and the respective at least one word that is used together with the respective keyword originate from two distinct languages.
  - 11. The method of claim 1, further comprising:
    - establishing a penalty factor table including a plurality of penalty factors each associated with two different languages, wherein the penalty factor used for updating the confidence score of the audio frame sequence is identified by looking up the penalty factor table based on the two distinct language types of the candidate keyword and the word option.
  - 12. The method of claim 1, further comprising:
    - establishing the predetermined decoding network, wherein the keywords and filler words of multiple languages are grouped according to their language types, further comprising:
      
      creating a start node and an end node;
      
      creating a plurality of language nodes each representing a type of language;
      
      linking each language node with the start node;
      
      associating each language node with a subset of respective keywords and a subset of respective filler words both originating from the corresponding language;
      
      for each keyword:
      
      converting the respective keyword to a sequence of triphones,
      
      creating a respective triphone node for each triphone of the sequence of triphones of the respective keyword,
      
      linking the triphone nodes of the sequence of triphones together to form a sequence of triphone nodes including a head triphone node and a tail triphone node, and
      
      linking a respective head triphone node to a corresponding language node and a respective tail triphone node to the end node;
      
      for each filler word, creating a respective filler node and coupling the respective filler node between the corresponding language node and the end node; and
      
      linking the start node and the end node.
  - 13. The method of claim 12, wherein the candidate keyword and the word option are determined to be associated with two distinct types of languages, when one of the plurality of language nodes is linked between the candidate keyword and the word option on the predetermined decoding network.
  - 14. The method of claim 12, wherein in accordance with a decoding structure of the predetermined decoding network, each keyword in the decoding network is linked on the predetermined decoding network to at least one word that is used together with the respective keyword in real speech.
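
The network construction recited in claim 12 can be sketched as a small directed graph in Python. This is only an illustration under stated assumptions, not the patented implementation: the graph is a plain adjacency dictionary, the node naming scheme (`START`, `END`, `LANG:`, `TRI:`, `FILLER:`) is invented here, the pronunciation `lexicon` is a hypothetical input, triphones are written in the common `left-center+right` notation with `sil` as boundary context, and the link between the start node and the end node is drawn from `END` back to `START` because the claim does not state a direction.

```python
from collections import defaultdict


def to_triphones(phones):
    """Expand a phone sequence into left-center+right triphones (sil-padded)."""
    padded = ["sil"] + list(phones) + ["sil"]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def build_network(vocab, lexicon):
    """Build an adjacency map for a multilingual keyword/filler decoding network.

    vocab:   {language: {"keywords": [...], "fillers": [...]}}
    lexicon: {keyword: [phone, ...]}  (hypothetical pronunciation dictionary)
    """
    edges = defaultdict(list)
    start, end = "START", "END"
    for lang, words in vocab.items():
        lang_node = f"LANG:{lang}"
        edges[start].append(lang_node)  # link each language node with the start node
        for kw in words["keywords"]:
            tri_nodes = [
                f"TRI:{lang}:{kw}:{i}:{t}"
                for i, t in enumerate(to_triphones(lexicon[kw]))
            ]
            edges[lang_node].append(tri_nodes[0])  # head triphone <- language node
            for a, b in zip(tri_nodes, tri_nodes[1:]):
                edges[a].append(b)  # chain the triphone nodes together
            edges[tri_nodes[-1]].append(end)  # tail triphone -> end node
        for filler in words["fillers"]:
            node = f"FILLER:{lang}:{filler}"
            edges[lang_node].append(node)  # filler sits between language node and end
            edges[node].append(end)
    edges[end].append(start)  # assumed direction for the start/end link
    return dict(edges)
```

In this layout every path from one word to the next passes through a language node, which is one way the cross-language test of claim 13 could be checked on the graph.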

15. An electronic device, comprising:
- one or more processors; and
  
  memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform operations comprising:
  
  receiving a sequence of audio frames comprising a current frame and a subsequent frame that follows the current frame;
  
  determining a candidate keyword for the current frame using a predetermined decoding network that comprises keywords and filler words of multiple languages, associating the audio frame sequence with a confidence score that is partially determined according to the candidate keyword;
  
  identifying a word option for the subsequent frame using the candidate keyword and the predetermined decoding network;
  
  when the candidate keyword and the word option are associated with two distinct types of languages, updating the confidence score of the audio frame sequence based on a penalty factor that is predetermined according to the two distinct types of languages, the word option and an acoustic model of the subsequent frame; and
  
  determining that the audio frame sequence includes both the candidate keyword and the word option by evaluating the updated confidence score according to a keyword determination criterion.
- - 16. The electronic device of claim 15, wherein in accordance with the keyword determination criterion, the confidence score of the audio frame sequence is larger than a keyword threshold.
  - 17. The electronic device of claim 15, wherein the operations performed by the processors further comprise:
    - establishing a penalty factor table including a plurality of penalty factors each associated with two different languages, wherein the penalty factor used for updating the confidence score of the audio frame sequence is identified by looking up the penalty factor table based on the two distinct language types of the candidate keyword and the word option.
  - 18. The electronic device of claim 15, wherein the predetermined decoding network is associated with two or more languages of English, Chinese, Japanese, Russian, French, German and the like, and includes a subset of keywords and a subset of filler words for each of the two or more languages.

19. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform operations comprising:
- receiving a sequence of audio frames comprising a current frame and a subsequent frame that follows the current frame;
  
  determining a candidate keyword for the current frame using a predetermined decoding network that comprises keywords and filler words of multiple languages, associating the audio frame sequence with a confidence score that is partially determined according to the candidate keyword;
  
  identifying a word option for the subsequent frame using the candidate keyword and the predetermined decoding network;
  
  when the candidate keyword and the word option are associated with two distinct types of languages, updating the confidence score of the audio frame sequence based on a penalty factor that is predetermined according to the two distinct types of languages, the word option and an acoustic model of the subsequent frame; and
  
  determining that the audio frame sequence includes both the candidate keyword and the word option by evaluating the updated confidence score according to a keyword determination criterion.
- - 20. The non-transitory computer-readable medium of claim 19, wherein the operations performed by the processors further comprise:
    - establishing a penalty factor table including a plurality of penalty factors each associated with two different languages, wherein the penalty factor used for updating the confidence score of the audio frame sequence is identified by looking up the penalty factor table based on the two distinct language types of the candidate keyword and the word option.
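
The penalty factor table recited in claims 11, 17 and 20 could be represented as a simple lookup keyed by a pair of languages. The following Python sketch is illustrative only: the language codes and numeric factors are made-up placeholder values, the table is assumed to be symmetric (the pair is unordered), and a neutral factor of 1.0 for same-language pairs is an assumption, since the claims apply the penalty only across two distinct languages.

```python
# Hypothetical penalty factors; the patent text gives no concrete values.
PENALTY_TABLE = {
    frozenset({"en", "zh"}): 0.6,
    frozenset({"en", "ja"}): 0.5,
    frozenset({"zh", "ja"}): 0.7,
}


def lookup_penalty(lang_a, lang_b, table=PENALTY_TABLE):
    """Return the penalty factor for a transition between two languages.

    Same-language transitions get a neutral factor of 1.0 (assumption).
    Raises KeyError if the language pair is missing from the table.
    """
    if lang_a == lang_b:
        return 1.0
    return table[frozenset({lang_a, lang_b})]
```

Keying on a `frozenset` makes `lookup_penalty("zh", "en")` and `lookup_penalty("en", "zh")` return the same factor; an ordered tuple key would be the alternative if the penalty were meant to depend on direction.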

Current Assignee: Tencent Technology Company Limited (Tencent Holdings Limited)
Original Assignee: Tencent Technology Company Limited (Tencent Holdings Limited)
Inventors: Lu, Li; Ma, Jianxiong; Kong, Linghui; Rao, Feng; Yue, Shuai; Zhang, Xiang; Liu, Haibo; Wang, Eryu; Chen, Bo; Li, Lu
Primary Examiner(s): He, Jialong
Application Number: US14/567,969
Publication Number: US 20150095032A1
Time in Patent Office: 390 Days
Field of Search: 704/231, 704/242, 704/251
US Class Current: 1/1
CPC Class Codes:
G10L 15/08   Speech classification or se...
G10L 15/083   Recognition networks G10L15...
G10L 2015/088   Word spotting
