Method and system of automatic speech recognition with dynamic vocabularies

US 9,740,678 B2
Filed: 06/25/2015
Issued: 08/22/2017
Est. Priority Date: 06/25/2015
Status: Active Grant

Abstract

A system, article, and method of automatic speech recognition with dynamic vocabularies are described herein.

25 Claims

1. A computer-implemented method of automatic speech recognition, comprising:
- obtaining, via at least one acoustic signal receiving unit, audio data including human speech;
  
  generating, via a decoder, a static vocabulary weighted finite state transducer (WFST) having nodes connected by arcs to propagate at least one token through the static vocabulary WFST and at least one dynamic vocabulary trigger marker at at least one of the arcs;
  
  propagating, via the decoder, a token through at least one dynamic vocabulary WFST upon the at least one token reaching the trigger marker;
  
  propagating, via the decoder, a token through at least one grammar WFST having at least one dynamic vocabulary class marker that indicates a type of dynamic vocabulary and is associated with the dynamic vocabulary of at least one of the dynamic vocabulary WFSTs with a propagating token;
  
  providing, via the decoder, a hypothetical word or phrase based at least in part on the obtained human speech and depending, at least in part, on the WFSTs and comprising terms in the static vocabulary, dynamic vocabulary, or both vocabularies;
  
  determining, via an interpretation engine, user intent based at least in part on output from the decoder based at least in part on the hypothetical word or phrase; and
  
  initiating, via the interpretation engine, a response or action based at least in part on the determined user intent, the initiated response or action being implemented via speech output from a speaker component, via visual output from display component, and/or via other action from one or more end devices.
- - 2. The method of claim 1 comprising generating multiple dynamic vocabulary WFSTs each for a different class of dynamic vocabulary and when the at least one token reaches the dynamic vocabulary trigger marker.
  - 3. The method of claim 2 wherein a dynamic vocabulary WFST is formed for each dynamic vocabulary class that is available.
  - 4. The method of claim 1 wherein a dynamic vocabulary is a list of data provided for at least one of:
    - name, email address, phone number, or a characteristic associated with contact information, audio or video description information, dining or entertainment data, geographical location information, and search engine search terms information.
  - 5. The method of claim 1 comprising matching at least one of the dynamic vocabulary WFSTs with a dynamic vocabulary class marker on the grammar WFST when a token is created in either an initial node or a final node of the dynamic vocabulary WFST.
  - 6. The method of claim 5 wherein a grammar WFST comprises a plurality of symbols each being a dynamic vocabulary class marker each formed for a single dynamic vocabulary WFST.
  - 7. The method of claim 1 wherein dynamic vocabulary outputs from the dynamic WFST are not placed on the grammar WFST.
  - 8. The method of claim 1 comprising forming a hypothetical word or phrase from the output of the static vocabulary WFST and words or phoneme from output of the at least one dynamic vocabulary WFST propagating a token to an end state.
  - 9. The method of claim 8 wherein a new token is created at the initial state of the lexicon-based WFST upon the dynamic vocabulary WFST forming the end state.
  - 10. The method of claim 8 wherein the static vocabulary weighted finite state or the dynamic vocabulary weighted finite state transducer or both is a context-sensitive lexicon (CL)-type WFST and comprising cross-word arcs at at least one boundary between static vocabulary terms and dynamic vocabulary terms and providing alternative arcs of multiple possible phonetic or word context inputs or outputs.
  - 11. The method of claim 10 wherein an alternative is provided for each possible phonetic or word contexts.
  - 12. The method of claim 1 wherein the dynamic vocabulary trigger marker of an arc on the static vocabulary WFST comprises a non-acoustic input label of the arc.
  - 13. The method of claim 1 wherein the grammar WFST has a non-acoustic symbol for each dynamic vocabulary class marker being used that may be both the input and output symbol of an arc on the grammar WFST, and wherein the symbols do not appear on the static vocabulary WFST.
  - 14. The method of claim 1 wherein at least one dynamic vocabulary first class is related to a sub-class of a second class, and wherein decoding a word in the first class causes use of a dynamic vocabulary WFST for the sub-class rather than the entire second class.
  - 15. The method of claim 1 comprising generating multiple dynamic vocabulary WFSTs each for a different class of dynamic vocabulary and when the at least one token reaches the dynamic vocabulary trigger marker;
    - wherein a dynamic vocabulary WFST is formed for each dynamic vocabulary class that is available;
      
      wherein a dynamic vocabulary is a list of data provided for at least one of;
      
      name, email address, phone number, or a characteristic associated with contact information, audio or video description information, dining or entertainment data, geographical location information, and search engine search terms information;
      
      the method comprising matching at least one of the dynamic vocabulary WFSTs with a dynamic vocabulary class marker on the grammar WFST when a token is created in either an initial node or a final node of the dynamic vocabulary WFST, wherein a grammar WFST comprises a plurality of symbols each being a dynamic vocabulary class marker each formed for a single dynamic vocabulary WFST;
      
      wherein dynamic vocabulary outputs from the dynamic WFST are not placed on the grammar WFST;
      
      the method comprising forming a hypothetical word or phrase from the output of the static vocabulary WFST and words or phoneme from output of the at least one dynamic vocabulary WFST propagating a token to an end state, wherein a new token is created at the initial state of the lexicon-based WFST upon the dynamic vocabulary WFST forming the end state, and wherein the static vocabulary weighted finite state or the dynamic vocabulary weighted finite state transducer or both is a context-sensitive lexicon (CL)-type WFST and comprising cross-word arcs at at least one boundary between static vocabulary terms and dynamic vocabulary terms and providing alternative arcs of multiple possible phonetic or word context inputs or outputs, wherein an alternative is provided for each possible phonetic or word contexts;
      
      wherein the dynamic vocabulary trigger marker of an arc on the static vocabulary WFST comprises a non-acoustic input label of the arc;
      
      wherein the grammar WFST has a non-acoustic symbol for each dynamic vocabulary class marker being used that may be both the input and output symbol of an arc on the grammar WFST, and wherein the symbols do not appear on the static vocabulary WFST; and
      
      wherein at least one dynamic vocabulary first class is related to a sub-class of a second class, and wherein decoding a word in the first class causes use of a dynamic vocabulary WFST for the sub-class rather than the entire second class.
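
The token flow recited in claims 1, 9, and 12 can be illustrated with a toy decoder. This is only a sketch under strong assumptions and not the patented implementation: acoustic scoring is replaced by an exact phoneme sequence, the grammar WFST is left out, and every name here (`TRIGGER`, `Arc`, `Wfst`, `Token`, `follow_markers`, `decode`) and every weight is hypothetical. A token in the static WFST that sits on a state with a trigger arc spawns a token at the start of the dynamic WFST, and a dynamic token that reaches a final state spawns a new token at the initial state of the static WFST.

```python
from dataclasses import dataclass

TRIGGER = "<DYN>"  # hypothetical non-acoustic trigger label, not taken from the patent


@dataclass(frozen=True)
class Arc:
    ilabel: str    # input symbol: a phoneme, or TRIGGER
    olabel: str    # output word, "" when the arc emits nothing
    weight: float
    dest: int


@dataclass
class Wfst:
    arcs: dict     # state -> list of Arc
    start: int
    finals: set


@dataclass(frozen=True)
class Token:
    net: str       # "static" or "dynamic"
    state: int
    cost: float
    words: tuple


def follow_markers(tokens, static, dynamic):
    """Apply the non-acoustic moves: enter the dynamic WFST at a trigger arc,
    and restart at the static initial state when a dynamic token ends.
    Assumes the dynamic start state is not itself final."""
    seen, work = set(tokens), list(tokens)
    while work:
        t = work.pop()
        spawned = []
        if t.net == "static":
            for arc in static.arcs.get(t.state, []):
                if arc.ilabel == TRIGGER:
                    spawned.append(Token("dynamic", dynamic.start, t.cost + arc.weight, t.words))
        elif t.state in dynamic.finals:
            spawned.append(Token("static", static.start, t.cost, t.words))
        for s in spawned:
            if s not in seen:
                seen.add(s)
                work.append(s)
    return list(seen)


def decode(phonemes, static, dynamic):
    """Return the lowest-cost word sequence, or None if no token ends in a final state."""
    tokens = [Token("static", static.start, 0.0, ())]
    for p in phonemes:
        advanced = []
        for t in follow_markers(tokens, static, dynamic):
            net = static if t.net == "static" else dynamic
            for arc in net.arcs.get(t.state, []):
                if arc.ilabel == p:
                    words = t.words + ((arc.olabel,) if arc.olabel else ())
                    advanced.append(Token(t.net, arc.dest, t.cost + arc.weight, words))
        tokens = advanced
    done = [t for t in follow_markers(tokens, static, dynamic)
            if t.net == "static" and t.state in static.finals]
    return min(done, key=lambda t: t.cost).words if done else None


# Static vocabulary: the word "call" plus a trigger arc on the start state.
static = Wfst(arcs={0: [Arc("k", "call", 1.0, 1), Arc(TRIGGER, "", 0.5, 0)],
                    1: [Arc("ao", "", 0.0, 2)],
                    2: [Arc("l", "", 0.0, 0)]},
              start=0, finals={0})
# Dynamic vocabulary: a one-entry contact list.
contacts = Wfst(arcs={0: [Arc("b", "bob", 1.0, 1)],
                      1: [Arc("aa", "", 0.0, 2)],
                      2: [Arc("b", "", 0.0, 3)]},
                start=0, finals={3})

result = decode(["k", "ao", "l", "b", "aa", "b"], static, contacts)
print(result)
```

Keeping the contact list in its own small transducer means it can be rebuilt when the contacts change without recompiling the static vocabulary WFST, which appears to be the motivation behind the trigger marker.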

16. A computer-implemented system of automatic speech recognition comprising:
- at least one acoustic signal receiving unit to obtain audio data including human speech;
  
  at least one processor communicatively connected to the acoustic signal receiving unit;
  
  at least one memory communicatively coupled to the at least one processor; and
  
  a WFST decoder operated by the at least one processor and to;
  
  generate a static vocabulary weighted finite state transducer (WFST) having nodes connected by arcs to propagate at least one token through the static vocabulary WFST and at least one dynamic vocabulary trigger marker at at least one of the arcs;
  
  propagate a token through at least one dynamic vocabulary WFST upon the at least one token reaching the trigger marker;
  
  propagate a token through at least one grammar WFST having at least one dynamic vocabulary class marker that indicates a type of dynamic vocabulary and is associated with the dynamic vocabulary of at least one of the dynamic vocabulary WFSTs with a propagating token;
  
  provide a hypothetical word or phrase based at least in part on the obtained human speech and depending, at least in part, on the WFSTs and comprising terms in the static vocabulary, dynamic vocabulary, or both vocabularies; and
  
  an interpretation engine to;
  
  determine user intent based at least in part on output from the decoder based at least in part on the hypothetical word or phrase; and
  
  initiate a response or action based at least in part on the determined user intent, the initiated response or action being implemented via speech output from a speaker component, via visual output from display component, and/or via other action from one or more end devices.
- - 17. The system of claim 16, wherein the WFST decoder is to generate multiple dynamic vocabulary WFSTs each for a different class of dynamic vocabulary and when the at least one token reaches the dynamic vocabulary trigger marker.
  - 18. The system of claim 17 wherein a dynamic vocabulary WFST is formed for each dynamic vocabulary class that is available.
  - 19. The system of claim 16, wherein the WFST decoder is to match at least one of the dynamic vocabulary WFSTs with a dynamic vocabulary class marker on the grammar WFST when a token is created in either an initial node or a final node of the dynamic vocabulary WFST.
  - 20. The system of claim 19 wherein a grammar WFST comprises a plurality of symbols each being a dynamic vocabulary class marker each formed for a single dynamic vocabulary WFST.
  - 21. The system of claim 16 wherein dynamic vocabulary outputs from the dynamic WFST are not placed on the grammar WFST.
  - 22. The system of claim 16 wherein the WFST decoder is to form a hypothetical word or phrase from the output of the static vocabulary WFST and words or phoneme from the output of the at least one dynamic vocabulary WFST propagating a token to an end state.
  - 23. The system of claim 18 wherein the WFST decoder is to generate multiple dynamic vocabulary WFSTs each for a different class of dynamic vocabulary and when the at least one token reaches the dynamic vocabulary trigger marker;
    - wherein a dynamic vocabulary WFST is formed for each dynamic vocabulary class that is available;
      
      wherein a dynamic vocabulary is a list of data provided for at least one of;
      
      name, email address, phone number, or a characteristic associated with contact information, audio or video description information, dining or entertainment data, geographical location information, and search engine search terms information;
      
      wherein the WFST decoder is to match at least one of the dynamic vocabulary WFSTs with a dynamic vocabulary class marker on the grammar WFST when a token is created in either an initial node or a final node of the dynamic vocabulary WFST, wherein a grammar WFST comprises a plurality of symbols each being a dynamic vocabulary class marker each formed for a single dynamic vocabulary WFST;
      
      wherein dynamic vocabulary outputs from the dynamic WFST are not placed on the grammar WFST;
      
      the WFST decoder to form a hypothetical word or phrase from the output of the static vocabulary WFST and words or phoneme from output of the at least one dynamic vocabulary WFST propagating a token to an end state, wherein a new token is created at the initial state of the lexicon-based WFST upon the dynamic vocabulary WFST forming the end state, and wherein the static vocabulary weighted finite state or the dynamic vocabulary weighted finite state transducer or both is a context-sensitive lexicon (CL)-type WFST and comprising cross-word arcs at at least one boundary between static vocabulary terms and dynamic vocabulary terms and providing alternative arcs of multiple possible phonetic or word context inputs or outputs, wherein an alternative is provided for each possible phonetic or word contexts;
      
      wherein the dynamic vocabulary trigger marker of an arc on the static vocabulary WFST comprises a non-acoustic input label of the arc;
      
      wherein the grammar WFST has a non-acoustic symbol for each dynamic vocabulary class marker being used that may be both the input and output symbol of an arc on the grammar WFST, and wherein the symbols do not appear on the static vocabulary WFST; and
      
      wherein at least one dynamic vocabulary first class is related to a sub-class of a second class, and wherein decoding a word in the first class causes use of a dynamic vocabulary WFST for the sub-class rather than the entire second class.
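
The division of labor in claim 16 between the decoder and the interpretation engine can be sketched as two small functions. This is an illustrative assumption, not the engine described in the specification: the hypothesis format (a list of `(word, class_marker)` pairs), the marker strings, the intent names, and the rule table are all hypothetical. The first function determines user intent from the decoder's hypothesis, and the second picks one of the three output routes named in the claim (speaker, display, or end device).

```python
from dataclasses import dataclass, field

CONTACT, SONG = "<CONTACT>", "<SONG>"  # hypothetical class-marker symbols


@dataclass
class Intent:
    name: str
    slots: dict = field(default_factory=dict)


def determine_intent(hypothesis):
    """Map a decoder hypothesis to an intent.
    hypothesis: list of (word, class_marker); class_marker is None for static words."""
    words = [word for word, _ in hypothesis]
    dynamic = {marker: word for word, marker in hypothesis if marker is not None}
    if words and words[0] == "call" and CONTACT in dynamic:
        return Intent("call_contact", {"contact": dynamic[CONTACT]})
    if words and words[0] == "play" and SONG in dynamic:
        return Intent("play_song", {"song": dynamic[SONG]})
    return Intent("unknown")


def initiate_action(intent):
    """Return (output route, payload) for the determined intent."""
    if intent.name == "call_contact":
        return ("end_device", f"dial {intent.slots['contact']}")
    if intent.name == "play_song":
        return ("speaker", f"playing {intent.slots['song']}")
    return ("display", "Sorry, I did not understand.")


intent = determine_intent([("call", None), ("bob", CONTACT)])
action = initiate_action(intent)
print(intent, action)
```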

24. At least one non-transitory computer readable medium comprising a plurality of instructions that in response to being executed on a computing device, causes the computing device to perform automatic speech recognition, comprising:
- obtain audio data including human speech;
  
  generate, via a decoder, a static vocabulary weighted finite state transducer (WFST) having nodes connected by arcs to propagate at least one token through the static vocabulary WFST and at least one dynamic vocabulary trigger marker at at least one of the arcs;
  
  propagate, via the decoder, a token through at least one dynamic vocabulary WFST upon the at least one token reaching the trigger marker;
  
  propagate, via the decoder, a token through at least one grammar WFST having at least one dynamic vocabulary class marker that indicates a type of dynamic vocabulary and is associated with the dynamic vocabulary of at least one of the dynamic vocabulary WFSTs with a propagating token; and
  
  provide, via the decoder, a hypothetical word or phrase based at least in part on the obtained human speech and depending, at least in part, on the WFSTs and comprising terms in the static vocabulary, dynamic vocabulary, or both vocabularies;
  
  determine, via an interpretation engine, user intent based at least in part on output from the decoder based at least in part on the hypothetical word or phrase; and
  
  initiate, via the interpretation engine, a response or action based at least in part on the determined user intent, the initiated response or action being implemented via speech output from a speaker component, via visual output from display component, and/or via other action from one or more end devices.
- - 25. The medium of claim 24 wherein the computing device is caused to generate multiple dynamic vocabulary WFSTs each for a different class of dynamic vocabulary and when the at least one token reaches the dynamic vocabulary trigger marker;
    - wherein a dynamic vocabulary WFST is formed for each dynamic vocabulary class that is available;
      
      wherein a dynamic vocabulary is a list of data provided for at least one of;
      
      name, email address, phone number, or a characteristic associated with contact information, audio or video description information, dining or entertainment data, geographical location information, and search engine search terms information;
      
      wherein the computing device is caused to match at least one of the dynamic vocabulary WFSTs with a dynamic vocabulary class marker on the grammar WFST when a token is created in either an initial node or a final node of the dynamic vocabulary WFST, wherein a grammar WFST comprises a plurality of symbols each being a dynamic vocabulary class marker each formed for a single dynamic vocabulary WFST;
      
      wherein dynamic vocabulary outputs from the dynamic WFST are not placed on the grammar WFST;
      
      the computing device being caused to form a hypothetical word or phrase from the output of the static vocabulary WFST and words or phoneme from output of the at least one dynamic vocabulary WFST propagating a token to an end state, wherein a new token is created at the initial state of the lexicon-based WFST upon the dynamic vocabulary WFST forming the end state, and wherein the static vocabulary weighted finite state or the dynamic vocabulary weighted finite state transducer or both is a context-sensitive lexicon (CL)-type WFST and comprising cross-word arcs at at least one boundary between static vocabulary terms and dynamic vocabulary terms and providing alternative arcs of multiple possible phonetic or word context inputs or outputs, wherein an alternative is provided for each possible phonetic or word contexts;
      
      wherein the dynamic vocabulary trigger marker of an arc on the static vocabulary WFST comprises a non-acoustic input label of the arc;
      
      wherein the grammar WFST has a non-acoustic symbol for each dynamic vocabulary class marker being used that may be both the input and output symbol of an arc on the grammar WFST, and wherein the symbols do not appear on the static vocabulary WFST; and
      
      wherein at least one dynamic vocabulary first class is related to a sub-class of a second class, and wherein decoding a word in the first class causes use of a dynamic vocabulary WFST for the sub-class rather than the entire second class.
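
The role of the class markers on the grammar WFST (the clauses above on class markers and on dynamic vocabulary outputs not being placed on the grammar WFST) can be sketched as follows. The grammar here is a hypothetical four-state acceptor with made-up weights and marker strings, and `grammar_cost` is an illustrative helper, not a function from the patent. The point it shows is that the grammar token advances on the class marker rather than on the dynamic word itself, so the grammar never needs to contain individual contact names or song titles.

```python
CONTACT, SONG = "<CONTACT>", "<SONG>"  # hypothetical class-marker symbols

# Hypothetical grammar WFST: state -> {symbol: (weight, next_state)}.
# Class markers serve as both input and output symbol of their arcs.
GRAMMAR = {
    0: {"call": (0.5, 1), "play": (0.7, 2)},
    1: {CONTACT: (0.2, 3)},
    2: {SONG: (0.3, 3)},
}
GRAMMAR_FINALS = {3}


def grammar_cost(hypothesis, grammar=GRAMMAR, finals=GRAMMAR_FINALS, start=0):
    """Propagate one token through the grammar and return its cost.
    hypothesis: list of (word, class_marker); class_marker is None for static words.
    Returns None when the grammar rejects the sequence."""
    state, cost = start, 0.0
    for word, class_marker in hypothesis:
        # Dynamic words are looked up by their class marker, never by the word.
        symbol = class_marker if class_marker is not None else word
        arcs = grammar.get(state, {})
        if symbol not in arcs:
            return None
        weight, state = arcs[symbol]
        cost += weight
    return cost if state in finals else None


accepted = grammar_cost([("call", None), ("bob", CONTACT)])
rejected = grammar_cost([("call", None), ("yesterday", SONG)])
print(accepted, rejected)
```

Because the grammar only sees `<CONTACT>` or `<SONG>`, adding a new contact changes the dynamic vocabulary WFST for that class but leaves the grammar untouched.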

Current Assignee: Intel Corporation
Original Assignee: Intel Corporation
Inventors: Joachim Hofer, Georg Stemmer, Josef Bauer
Primary Examiner(s): Marivelisse Santiago Cordero
Assistant Examiner(s): Stephen M. Brinich
Application Number: US14/750,185
Publication Number: US 20160379629A1
Time in Patent Office: 789 Days
Field of Search: 704/1-10, 704/243-251, 704/254-256, 704/270, 704/230-231, 704/275
CPC Class Codes:
G06F 40/253   Grammatical analysis; Style...
G06F 40/284   Lexical analysis, e.g. toke...
G06F 40/289   Phrasal analysis, e.g. fini...
G10L 15/083   Recognition networks G10L15...
G10L 15/19   Grammatical context, e.g. d...
