Voice search device, voice search method, and non-transitory recording medium

US 9,437,187 B2
Filed: 01/23/2015
Issued: 09/06/2016
Est. Priority Date: 03/05/2014
Status: Active Grant

Abstract

A search string acquiring unit acquires a search string. A converting unit converts the search string into a phoneme sequence. A time length deriving unit derives the spoken time length of the voice corresponding to the search string. A zone designating unit designates a likelihood acquisition zone in a target voice signal. A likelihood acquiring unit acquires a likelihood indicating how likely the likelihood acquisition zone is to be a zone in which voice corresponding to the search string is spoken. A repeating unit changes the likelihood acquisition zone designated by the zone designating unit, and repeats the processes of the zone designating unit and the likelihood acquiring unit. An identifying unit identifies, from the target voice signal, estimated zones in which the voice corresponding to the search string is estimated to be spoken, on the basis of the likelihoods acquired for each of the likelihood acquisition zones.
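
As a rough illustration of the loop the abstract describes, the following Python sketch sums per-phoneme durations to get a spoken length (the approach named in claim 1), slides a zone of that length across the signal, and ranks the zones by likelihood. The duration table, the `shift` step, the `score` callable, and all function names are assumptions made for illustration; the source does not specify them.

```python
from typing import Callable, Dict, List, Tuple

Zone = Tuple[int, int]  # (start_frame, end_frame), end exclusive

# Hypothetical average phoneme durations in frames (illustrative values only).
AVG_DURATION_FRAMES: Dict[str, int] = {"k": 6, "a": 9, "t": 5}


def spoken_length(phonemes: List[str], durations: Dict[str, int]) -> int:
    """Estimate the spoken length, in frames, as the sum of phoneme durations."""
    return sum(durations[p] for p in phonemes)


def designate_zones(total_frames: int, zone_len: int, shift: int = 1) -> List[Zone]:
    """Slide a fixed-length zone over the signal, moving `shift` frames each time."""
    return [(s, s + zone_len) for s in range(0, total_frames - zone_len + 1, shift)]


def rank_zones(
    total_frames: int, zone_len: int, score: Callable[[Zone], float], top_n: int
) -> List[Tuple[float, Zone]]:
    """Score every designated zone and keep the `top_n` most likely ones."""
    scored = [(score(z), z) for z in designate_zones(total_frames, zone_len)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```

Here `score` stands in for whatever likelihood computation the device performs on a zone; any callable that maps a zone to a number can be plugged in to exercise the loop.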

18 Claims

1. A voice search device comprising:
- a processor; and
  
  a memory storing instructions that, when executed by the processor, control the processor to:
  
  convert a search string into a phoneme sequence;
  
  acquire durations of respective phonemes included in the phoneme sequence;
  
  derive a spoken time length of voice corresponding to the search string based on the durations;
  
  designate a plurality of designated zones having time lengths in a target voice signal;
  
  acquire, using an acoustic model that does not depend on adjacent phonemes, a first group of likelihoods indicating how likely each zone from among the plurality of designated zones is a zone in which voice corresponding to the search string is spoken;
  
  specify, based on the first group of likelihoods, a plurality of estimated zones from among the plurality of designated zones, wherein each estimated zone is a zone in which the voice corresponding to the search string is estimated to be spoken, and wherein a number of the estimated zones is less than a number of the plurality of designated zones; and
  
  acquire, using an acoustic model that depends on adjacent phonemes, a second group of likelihoods indicating how likely each of the plurality of estimated zones is a zone in which the voice corresponding to the search string is spoken.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The voice search device according to claim 1, wherein the instructions, when executed by the processor, further control the processor to:
    - acquire, for each frame, an output probability of a feature quantity of the target voice signal being output from each phoneme included in the phoneme sequence; and
      
      wherein the first group of likelihoods are acquired based on the acquired output probability.
  - 3. The voice search device according to claim 2, wherein the instructions, when executed by the processor, further control the processor to:
    - calculate, for each frame, a feature quantity of the target voice signal in the plurality of designated zones;
      
      wherein the output probability is acquired based on the calculated feature quantity.
  - 4. The voice search device according to claim 2, further comprising:
    - an output probability storage in which each phoneme of an acoustic model is stored in association with the output probability of the feature quantity of the target voice signal being output from the each phoneme, for every frame included in the target voice signal;
      
      wherein, after the search string is converted into the phoneme sequence, an output probability stored in association with each phoneme included in the phoneme sequence is acquired from among the output probabilities stored in the output probability storage.
  - 5. The voice search device according to claim 2, wherein the instructions, when executed by the processor, further control the processor to:
    - replace the output probability acquired for each frame with an output probability of maximum value from among a plurality of output probabilities acquired in a plurality of consecutive frames that includes the output probability;
      
      wherein the first group of likelihoods are acquired based on a replaced output probability.
  - 6. The voice search device according to claim 1, wherein the plurality of estimated zones are specified by selecting a designated zone of maximum likelihood one by one from among designated zones included in a zone of a predetermined selection time length, for every predetermined selection time length.
  - 7. The voice search device according to claim 1, wherein the instructions, when executed by the processor, further control the processor to:
    - derive a plurality of mutually different time lengths as spoken time lengths of voice corresponding to one search string,
      
      designate the plurality of designated zones having the time lengths in the target voice signal for each of the plurality of mutually different time lengths,
      
      acquire the first group of likelihoods for each of the plurality of mutually different time lengths, and
      
      specify, based on the first group of likelihoods, the plurality of estimated zones for each of the plurality of time lengths.
  - 8. The voice search device according to claim 1, wherein the instructions, when executed by the processor, further control the processor to:
    - specify a zone corresponding to the search word from among the plurality of estimated zones based on the second group of likelihoods.
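
Claims 5 and 6 can be pictured with two small helpers, sketched below in Python. The first replaces each frame's output probability with the maximum over a run of consecutive frames containing it; the second keeps the single best-scoring zone in each selection window. The symmetric `radius`, the partitioning of zone start positions into fixed blocks, and both function names are assumptions: the claims do not fix how the consecutive frames are chosen or how windows are laid out, so this is only one possible reading.

```python
from typing import List, Tuple


def max_filter(probs: List[float], radius: int) -> List[float]:
    """Replace each value with the max over frames [i - radius, i + radius]."""
    n = len(probs)
    return [max(probs[max(0, i - radius): min(n, i + radius + 1)]) for i in range(n)]


def select_per_window(likelihoods: List[float], window: int) -> List[Tuple[int, float]]:
    """Keep the highest-likelihood zone in each block of `window` start positions.

    `likelihoods[i]` is assumed to be the likelihood of the zone starting at i.
    Returns (start_index, likelihood) pairs, one per block.
    """
    picks = []
    for start in range(0, len(likelihoods), window):
        block = likelihoods[start:start + window]
        best = max(range(len(block)), key=block.__getitem__)
        picks.append((start + best, block[best]))
    return picks
```

Taking a local maximum can only raise a frame's value, so a small timing mismatch between the assumed phoneme durations and the actual speech is less likely to push a true match out of the candidate set. Selecting one zone per window keeps the candidates spread over the signal instead of clustered around one peak.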

9. A voice search method comprising:
- converting a search string into a phoneme sequence;
  
  acquiring durations of respective phonemes included in the phoneme sequence;
  
  deriving a spoken time length of voice corresponding to the search string based on the durations;
  
  designating a plurality of designated zones having time lengths in a target voice signal;
  
  acquiring, using an acoustic model that does not depend on adjacent phonemes, a first group of likelihoods indicating how likely each zone from among the plurality of designated zones is a zone in which voice corresponding to the search string is spoken;
  
  specifying, based on the first group of likelihoods, a plurality of estimated zones from among the plurality of designated zones, wherein each estimated zone is a zone in which the voice corresponding to the search string is estimated to be spoken, and wherein a number of the estimated zones is less than a number of the plurality of designated zones; and
  
  acquiring, using an acoustic model that depends on adjacent phonemes, a second group of likelihoods indicating how likely each of the plurality of estimated zones is a zone in which the voice corresponding to the search string is spoken.
- Dependent claims (10, 11, 12, 13, 14, 15, 16):
  - 10. The voice search method according to claim 9, further comprising:
    - acquiring, for each frame, an output probability of a feature quantity of the target voice signal being output from each phoneme included in the phoneme sequence;
      
      wherein the first group of likelihoods are acquired based on the acquired output probability.
  - 11. The voice search method according to claim 10, further comprising:
    - calculating, for each frame, a feature quantity of the target voice signal in the plurality of designated zones;
      
      wherein the output probability is acquired based on the calculated feature quantity.
  - 12. The voice search method according to claim 10, further comprising:
    - storing each phoneme of an acoustic model in association with the output probability of the feature quantity of the target voice signal being output from the each phoneme, for every frame included in the target voice signal;
      
      wherein after the search string is converted into the phoneme sequence, an output probability stored in association with each phoneme included in the phoneme sequence is acquired from among the stored output probabilities.
  - 13. The voice search method according to claim 10, further comprising:
    - replacing the output probability acquired for each frame with an output probability of maximum value from among a plurality of output probabilities acquired in a plurality of consecutive frames that includes the output probability;
      
      wherein the first group of likelihoods are acquired based on a replaced output probability.
  - 14. The voice search method according to claim 13, wherein the plurality of estimated zones are specified by selecting a designated zone of maximum likelihood one by one from among designated zones included in a zone of a predetermined selection time length, for every predetermined selection time length.
  - 15. The voice search method according to claim 9, further comprising:
    - deriving a plurality of mutually different time lengths as spoken time lengths of voice corresponding to one search string,
      
      designating the plurality of designated zones having the time lengths in the target voice signal for each of the plurality of mutually different time lengths,
      
      acquiring the first group of likelihoods for each of the plurality of mutually different time lengths, and
      
      specifying, based on the first group of likelihoods, the plurality of estimated zones for each of the plurality of time lengths.
  - 16. The voice search method according to claim 9, further comprising:
    - specifying a zone corresponding to the search word from among the plurality of estimated zones based on the second group of likelihoods.
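
The multiple-time-length variant of claim 15 might look like the Python sketch below, which derives several candidate lengths for one search string and designates a separate set of zones for each. Scaling a base length by speech-rate factors is an assumption, as are the factor values, the `shift` step, and the function names; the claim only requires that the time lengths be mutually different.

```python
from typing import Dict, List, Sequence, Tuple

Zone = Tuple[int, int]  # (start_frame, end_frame), end exclusive


def candidate_lengths(base_len: int, rates: Sequence[float] = (0.8, 1.0, 1.2)) -> List[int]:
    """Derive mutually different lengths by scaling a base length (hypothetical rates)."""
    return sorted({max(1, round(base_len * r)) for r in rates})


def zones_by_length(
    total_frames: int, lengths: Sequence[int], shift: int = 1
) -> Dict[int, List[Zone]]:
    """Designate sliding zones separately for each candidate length."""
    return {
        n: [(s, s + n) for s in range(0, total_frames - n + 1, shift)]
        for n in lengths
    }
```

The first-pass likelihoods and the estimated zones would then be computed per length, so that fast and slow utterances of the same string each have zones of a suitable size.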

17. A non-transitory recording medium having a program recorded thereon that is executable to control a computer to:
- convert a search string into a phoneme sequence;
  
  acquire durations of respective phonemes included in the phoneme sequence;
  
  derive a spoken time length of voice corresponding to the search string based on the durations;
  
  designate a plurality of designated zones having time lengths in a target voice signal;
  
  acquire, using an acoustic model that does not depend on adjacent phonemes, a first group of likelihoods indicating how likely each zone from among the plurality of designated zones is a zone in which voice corresponding to the search string is spoken;
  
  specify, based on the first group of likelihoods, a plurality of estimated zones from among the plurality of designated zones, wherein each estimated zone is a zone in which the voice corresponding to the search string is estimated to be spoken, and wherein a number of the estimated zones is less than a number of the plurality of designated zones; and
  
  acquire, using an acoustic model that depends on adjacent phonemes, a second group of likelihoods indicating how likely each of the plurality of estimated zones is a zone in which the voice corresponding to the search string is spoken.
- Dependent claims (18):
  - 18. The non-transitory recording medium according to claim 17, wherein the program is executable to further control the processor to:
    - specify a zone corresponding to the search word from among the plurality of estimated zones based on the second group of likelihoods.
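
The two-pass structure shared by the independent claims, together with the final selection in claims 8, 16, and 18, can be sketched in Python as follows. The `coarse` and `fine` callables are hypothetical stand-ins for scoring with an acoustic model that ignores adjacent phonemes and one that depends on them; keeping the top `keep` zones by first-pass score is an assumed pruning rule, not something the claims prescribe.

```python
from typing import Callable, List, Tuple

Zone = Tuple[int, int]  # (start_frame, end_frame), end exclusive
Scorer = Callable[[Zone], float]


def two_pass_search(
    zones: List[Zone], coarse: Scorer, fine: Scorer, keep: int
) -> Tuple[Zone, float]:
    """Prune with a cheap first pass, then rescore the survivors.

    `keep` must be at least 2 (a plurality of estimated zones) and smaller
    than the number of designated zones, mirroring the claim language.
    """
    if not 2 <= keep < len(zones):
        raise ValueError("keep must satisfy 2 <= keep < len(zones)")
    # First group of likelihoods: score every designated zone, keep the best few.
    estimated = sorted(zones, key=coarse, reverse=True)[:keep]
    # Second group of likelihoods: rescore only the estimated zones.
    rescored = [(z, fine(z)) for z in estimated]
    # Final selection based on the second group.
    return max(rescored, key=lambda pair: pair[1])
```

The point of the split is cost: the more expensive scorer runs on `keep` zones rather than on all of them. The trade-off is that a zone pruned by the first pass is never rescored, even if the second scorer would have ranked it highest.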

Current Assignee: Casio Computer Company Limited
Original Assignee: Casio Computer Company Limited
Inventors: Ide, Hiroyasu
Primary Examiner(s): Pullias, Jesse
Application Number: US14/604,345
Publication Number: US 20150255059A1
Time in Patent Office: 592 Days
Field of Search: 704/231-257, 704/270-275
US Class Current: 1/1
CPC Class Codes:
G06F 16/632   Query formulation
G06F 16/683   using metadata automaticall...
G10L 15/02   Feature extraction for spee...
G10L 15/08   Speech classification or se...
G10L 2015/025   Phonemes, fenemes or fenone...
G10L 2015/081   Search algorithms, e.g. Bau...
G10L 25/54   for retrieval
G10L 25/87   Detection of discrete point...
