Method and system for controlling home assistant devices

US 10,796,702 B2
Filed: 12/21/2018
Issued: 10/06/2020
Est. Priority Date: 12/31/2017
Status: Active Grant

Abstract

System and method for controlling a home assistant device include: receiving an audio input; performing speaker recognition on the audio input; in accordance with a determination that the audio input includes a voice input from a first user that is authorized to control the home assistant device: performing speech-to-text conversion on the audio input to obtain a textual string; and searching for a predefined trigger word for activating the home assistant device in the textual string; and in accordance with a determination that the audio input includes a voice input from the home assistant device: forgoing performance of speech-to-text conversion on the audio input; and forgoing search for the predefined trigger word.
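
The gating order described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `handle_audio`, `GateResult`, and `DEVICE_VOICE`, and the callables passed in for speaker recognition and speech-to-text, are hypothetical. Skipping speakers who are neither authorized nor the device is an added assumption; the abstract does not address that case.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Set

# Hypothetical label returned by the speaker recognizer when the audio is
# the assistant's own speech (or that of a neighboring assistant).
DEVICE_VOICE = "device"


@dataclass
class GateResult:
    speaker: str
    text: Optional[str]  # None when speech-to-text was skipped
    triggered: bool


def handle_audio(
    audio: bytes,
    identify_speaker: Callable[[bytes], str],
    speech_to_text: Callable[[bytes], str],
    authorized_users: Set[str],
    trigger_word: str,
) -> GateResult:
    """Run speaker recognition first, then decide whether to transcribe."""
    speaker = identify_speaker(audio)

    if speaker == DEVICE_VOICE:
        # Forgo speech-to-text and the trigger-word search entirely.
        return GateResult(speaker, None, False)

    if speaker not in authorized_users:
        # Assumption: unrecognized speakers are also skipped.
        return GateResult(speaker, None, False)

    # Authorized user: transcribe, then search the text for the trigger word.
    text = speech_to_text(audio)
    triggered = trigger_word.lower() in text.lower()
    return GateResult(speaker, text, triggered)
```

Because the speaker check runs before transcription, the speech-to-text step is never invoked for the device's own output in this sketch.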

17 Claims

1. A method of controlling a home assistant device, comprising:
- at a computing system having one or more processors and memory:
  
  receiving an audio input;
  
  performing speaker recognition on the audio input;
  
  in accordance with a determination from performing speaker recognition that the audio input includes a voice input from a first user that is authorized to control the home assistant device:
  
  performing, using speech recognition, speech-to-text conversion on the audio input to obtain a textual string;
  
  searching for a predefined trigger word for activating the home assistant device in the textual string;
  
  selecting, from a plurality of task domains of the home assistant device, one or more first task domains that the first user is authorized to control, to perform intent deduction on the textual string; and
  
  forgoing using one or more second task domains among the plurality of task domains that the first user is not authorized to control to process the textual string; and
  
  in accordance with a determination from performing speaker recognition that the audio input includes a voice input from the home assistant device:
  
  forgoing performance of speech-to-text conversion on the audio input; and
  
  forgoing search for the predefined trigger word, so that the home assistant device avoids being triggered by the home assistant device's own speech or a speech output of a neighboring home assistant device, wherein the speaker recognition uses less resources than the speech recognition.
- - 2. The method of claim 1, wherein searching for the predefined trigger word in the textual string includes:
    - selecting a respective trigger word that corresponds to the first user from a plurality of preset trigger words that correspond to different users among a plurality of users that include the first user; and
      
      using the respective trigger word that corresponds to the first user as the predefined trigger word that is to be searched.
  - 3. The method of claim 1, including:
    - obtaining a default speech-to-text model corresponding to the home assistant device; and
      
      in accordance with a determination that a plurality of recorded speech samples provided by the first user are available, adjusting the default speech-to-text model in accordance with the plurality of recorded speech samples provided by the first user to generate a first user-specific speech-to-text model for the first user, wherein performing speech-to-text conversion on the audio input to obtain the textual string includes performing speech-to-text conversion on the audio input using the first user-specific speech-to-text model for the first user.
  - 4. The method of claim 3, including:
    - in accordance with a determination that a plurality of recorded speech samples provided by the first user are not available, performing the speech-to-text conversion on the audio input using the default speech-to-text model.
  - 5. The method of claim 4, including:
    - in accordance with a determination that a plurality of recorded speech samples provided by the first user are available, setting a first confidence threshold for recognizing the trigger word in the audio input when the first user-specific speech-to-text model is used to perform the speech-to-text conversion on the audio input; and
      
      in accordance with a determination that a plurality of recorded speech samples provided by the first user are not available, setting a second confidence threshold for recognizing the trigger word in the audio input when the default speech-to-text model is used to perform the speech-to-text conversion on the audio input.
  - 6. The method of claim 5, wherein the first confidence threshold that is used for the first user-specific speech-to-text model is higher than the second confidence threshold that is used for the default speech-to-text model.
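
The selection logic in claims 2 to 6 can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: `SttConfig`, `select_stt_config`, `trigger_detected`, the model-name strings, and the numeric thresholds are all hypothetical. The claims require only that the threshold for the user-specific model be higher than the one for the default model. The sketch reads "a plurality of recorded speech samples" as at least two samples and does not perform any model adaptation itself.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SttConfig:
    model_name: str
    trigger_word: str
    confidence_threshold: float


# Hypothetical values; only their ordering comes from claim 6.
USER_MODEL_THRESHOLD = 0.9
DEFAULT_MODEL_THRESHOLD = 0.7


def select_stt_config(
    user: str,
    trigger_words: Dict[str, str],
    speech_samples: Dict[str, List[bytes]],
) -> SttConfig:
    """Pick the trigger word, model, and threshold for an identified user."""
    trigger = trigger_words[user]  # per-user trigger word (claim 2)
    samples = speech_samples.get(user, [])
    if len(samples) >= 2:
        # Samples available: user-specific model, stricter threshold.
        return SttConfig(f"user-specific:{user}", trigger, USER_MODEL_THRESHOLD)
    # No samples: fall back to the default model and its lower threshold.
    return SttConfig("default", trigger, DEFAULT_MODEL_THRESHOLD)


def trigger_detected(text: str, confidence: float, config: SttConfig) -> bool:
    """Accept the trigger word only at or above the configured confidence."""
    found = config.trigger_word.lower() in text.lower()
    return found and confidence >= config.confidence_threshold
```

A recognition result with confidence 0.8 would therefore pass for a user on the default model but be rejected for a user whose adapted model is in use.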

7. A system for controlling a home assistant device, comprising:
- one or more processors; and
  
  memory storing instructions, the instructions, when executed by the processors, cause the processors to perform operations comprising:
  
  receiving an audio input;
  
  performing speaker recognition on the audio input;
  
  in accordance with a determination from performing speaker recognition that the audio input includes a voice input from a first user that is authorized to control the home assistant device:
  
  performing, using speech recognition, speech-to-text conversion on the audio input to obtain a textual string;
  
  searching for a predefined trigger word for activating the home assistant device in the textual string;
  
  selecting, from a plurality of task domains of the home assistant device, one or more first task domains that the first user is authorized to control, to perform intent deduction on the textual string; and
  
  forgoing using one or more second task domains among the plurality of task domains that the first user is not authorized to control to process the textual string; and
  
  in accordance with a determination from performing speaker recognition that the audio input includes a voice input from the home assistant device:
  
  forgoing performance of speech-to-text conversion on the audio input; and
  
  forgoing search for the predefined trigger word, so that the home assistant device avoids being triggered by the home assistant device's own speech or a speech output of a neighboring home assistant device, wherein the speaker recognition uses less resources than the speech recognition.
- - 8. The system of claim 7, wherein searching for the predefined trigger word in the textual string includes:
    - selecting a respective trigger word that corresponds to the first user from a plurality of preset trigger words that correspond to different users among a plurality of users that include the first user; and
      
      using the respective trigger word that corresponds to the first user as the predefined trigger word that is to be searched.
  - 9. The system of claim 7, wherein the operations include:
    - obtaining a default speech-to-text model corresponding to the home assistant device; and
      
      in accordance with a determination that a plurality of recorded speech samples provided by the first user are available, adjusting the default speech-to-text model in accordance with the plurality of recorded speech samples provided by the first user to generate a first user-specific speech-to-text model for the first user, wherein performing speech-to-text conversion on the audio input to obtain the textual string includes performing speech-to-text conversion on the audio input using the first user-specific speech-to-text model for the first user.
  - 10. The system of claim 9, wherein the operations include:
    - in accordance with a determination that a plurality of recorded speech samples provided by the first user are not available, performing the speech-to-text conversion on the audio input using the default speech-to-text model.
  - 11. The system of claim 10, wherein the operations include:
    - in accordance with a determination that a plurality of recorded speech samples provided by the first user are available, setting a first confidence threshold for recognizing the trigger word in the audio input when the first user-specific speech-to-text model is used to perform the speech-to-text conversion on the audio input; and
      
      in accordance with a determination that a plurality of recorded speech samples provided by the first user are not available, setting a second confidence threshold for recognizing the trigger word in the audio input when the default speech-to-text model is used to perform the speech-to-text conversion on the audio input.
  - 12. The system of claim 11, wherein the first confidence threshold that is used for the first user-specific speech-to-text model is higher than the second confidence threshold that is used for the default speech-to-text model.
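
The task-domain step recited in the independent claims (select the domains the first user is authorized to control, forgo the rest) can be sketched as a permission filter in front of intent deduction. The function `deduce_intent`, the `IntentParser` type, and the permission mapping are hypothetical names introduced for illustration; the claims do not specify how domains or permissions are represented.

```python
from typing import Callable, Dict, Optional, Set, Tuple

# A domain parser returns an intent name, or None if the text does not match.
IntentParser = Callable[[str], Optional[str]]


def deduce_intent(
    text: str,
    user: str,
    domains: Dict[str, IntentParser],
    permissions: Dict[str, Set[str]],
) -> Optional[Tuple[str, str]]:
    """Try only the task domains the user is authorized to control."""
    allowed = permissions.get(user, set())
    for name, parser in domains.items():
        if name not in allowed:
            # Forgo this domain: its parser never sees the text.
            continue
        intent = parser(text)
        if intent is not None:
            return name, intent
    return None
```

Under this sketch, a request that matches only an unauthorized domain produces no intent at all, rather than being parsed and then rejected.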

13. A non-transitory computer-readable storage medium storing instructions, the instructions, when executed by one or more processors, cause the processors to perform operations comprising:
- receiving an audio input;
  
  performing speaker recognition on the audio input;
  
  in accordance with a determination from performing speaker recognition that the audio input includes a voice input from a first user that is authorized to control a home assistant device:
  
  performing, using speech recognition, speech-to-text conversion on the audio input to obtain a textual string;
  
  searching for a predefined trigger word for activating the home assistant device in the textual string;
  
  selecting, from a plurality of task domains of the home assistant device, one or more first task domains that the first user is authorized to control, to perform intent deduction on the textual string; and
  
  forgoing using one or more second task domains among the plurality of task domains that the first user is not authorized to control to process the textual string; and
  
  in accordance with a determination from performing speaker recognition that the audio input includes a voice input from the home assistant device:
  
  forgoing performance of speech-to-text conversion on the audio input; and
  
  forgoing search for the predefined trigger word, so that the home assistant device avoids being triggered by the home assistant device's own speech or a speech output of a neighboring home assistant device, wherein the speaker recognition uses less resources than the speech recognition.
- - 14. The computer-readable storage medium of claim 13, wherein searching for the predefined trigger word in the textual string includes:
    - selecting a respective trigger word that corresponds to the first user from a plurality of preset trigger words that correspond to different users among a plurality of users that include the first user; and
      
      using the respective trigger word that corresponds to the first user as the predefined trigger word that is to be searched.
  - 15. The computer-readable storage medium of claim 13, wherein the operations include:
    - obtaining a default speech-to-text model corresponding to the home assistant device; and
      
      in accordance with a determination that a plurality of recorded speech samples provided by the first user are available, adjusting the default speech-to-text model in accordance with the plurality of recorded speech samples provided by the first user to generate a first user-specific speech-to-text model for the first user, wherein performing speech-to-text conversion on the audio input to obtain the textual string includes performing speech-to-text conversion on the audio input using the first user-specific speech-to-text model for the first user.
  - 16. The computer-readable storage medium of claim 15, wherein the operations include:
    - in accordance with a determination that a plurality of recorded speech samples provided by the first user are not available, performing the speech-to-text conversion on the audio input using the default speech-to-text model.
  - 17. The computer-readable storage medium of claim 16, wherein the operations include:
    - in accordance with a determination that a plurality of recorded speech samples provided by the first user are available, setting a first confidence threshold for recognizing the trigger word in the audio input when the first user-specific speech-to-text model is used to perform the speech-to-text conversion on the audio input; and
      
      in accordance with a determination that a plurality of recorded speech samples provided by the first user are not available, setting a second confidence threshold for recognizing the trigger word in the audio input when the default speech-to-text model is used to perform the speech-to-text conversion on the audio input.

Current Assignee: Midea Group Co., Ltd
Original Assignee: Midea Group Co., Ltd
Inventors: Li, Baojie; Pan, Yingbin; Gu, Haisong
Primary Examiner(s): Azad, Abul K

Application Number: US16/230,835
Publication Number: US 20190206412A1
Time in Patent Office: 655 Days
Field of Search: None
US Class Current
CPC Class Codes

G10L 15/07   to the speaker

G10L 15/08   Speech classification or se...

G10L 15/22   Procedures used during a sp...

G10L 17/00   Speaker identification or v...

G10L 17/04   Training, enrolment or mode...

G10L 17/22   Interactive procedures; Man...

G10L 2015/088   Word spotting

G10L 2015/223   Execution procedure of a sp...

G10L 2015/227   of the speaker; Human-fact...

H04L 12/2816   Controlling appliance servi...
