Voice search device, voice search method, and non-transitory recording medium
First Claim
1. A voice search device comprising:
a processor; and
a memory storing instructions that, when executed by the processor, control the processor to:
convert a search string into a phoneme sequence;
acquire durations of respective phonemes included in the phoneme sequence;
derive a spoken time length of voice corresponding to the search string based on the durations;
designate a plurality of designated zones having time lengths in a target voice signal;
acquire, using an acoustic model that does not depend on adjacent phonemes, a first group of likelihoods indicating how likely each zone from among the plurality of designated zones is a zone in which voice corresponding to the search string is spoken;
specify, based on the first group of likelihoods, a plurality of estimated zones from among the plurality of designated zones, wherein each estimated zone is a zone in which the voice corresponding to the search string is estimated to be spoken, and wherein a number of the estimated zones is less than a number of the plurality of designated zones; and
acquire, using an acoustic model that depends on adjacent phonemes, a second group of likelihoods indicating how likely each of the plurality of estimated zones is a zone in which the voice corresponding to the search string is spoken.
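The first four steps of the claim (phoneme conversion, duration lookup, spoken time length, zone designation) can be sketched as follows. This is a minimal illustration, not the patented implementation: the lexicon, the per-phoneme mean durations, the frame units, and all function names are hypothetical, and the sketch assumes each designated zone has the derived spoken time length.

```python
# Hypothetical pronunciation lexicon and mean phoneme durations (in frames).
# Both tables are illustrative assumptions, not data from the patent.
LEXICON = {"cat": ["k", "ae", "t"]}
MEAN_FRAMES = {"k": 8, "ae": 12, "t": 7}


def to_phonemes(search_string):
    """Convert a search string into a flat phoneme sequence."""
    phonemes = []
    for word in search_string.lower().split():
        phonemes.extend(LEXICON[word])
    return phonemes


def spoken_length(phonemes):
    """Derive the spoken time length as the sum of the phoneme durations."""
    return sum(MEAN_FRAMES[p] for p in phonemes)


def designate_zones(total_frames, zone_len, shift=1):
    """List (start, end) zones of length zone_len inside the target signal."""
    return [(start, start + zone_len)
            for start in range(0, total_frames - zone_len + 1, shift)]
```

For the toy entry "cat", the derived length is 8 + 12 + 7 = 27 frames, so a 30-frame target signal yields four candidate zones at a shift of one frame.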
Abstract
A search string acquiring unit acquires a search string. A converting unit converts the search string into a phoneme sequence. A time length deriving unit derives the spoken time length of the voice corresponding to the search string. A zone designating unit designates a likelihood acquisition zone in a target voice signal. A likelihood acquiring unit acquires a likelihood indicating how likely the likelihood acquisition zone is a zone in which voice corresponding to the search string is spoken. A repeating unit changes the likelihood acquisition zone designated by the zone designating unit, and repeats the process of the zone designating unit and the likelihood acquiring unit. An identifying unit identifies, from the target voice signal, estimated zones for which the voice corresponding to the search string is estimated to be spoken, on the basis of the likelihoods acquired for each of the likelihood acquisition zones.
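The repeat-and-identify loop described in the abstract can be sketched as below. This is an assumption-laden illustration: `score_zone` stands in for the likelihood acquiring unit and is a hypothetical callable supplied by the caller, and the choice to keep the `top_n` highest-scoring zones is one plausible way to identify estimated zones, not a rule stated in the abstract.

```python
def scan_zones(signal_len, zone_len, score_zone, shift=1, top_n=3):
    """Slide a zone over the signal, score each position, keep the best zones.

    score_zone(start, end) is a hypothetical likelihood function; higher
    values mean the search string is more likely spoken in that zone.
    """
    scored = []
    start = 0
    # Repeating unit: change the zone and score it until the signal ends.
    while start + zone_len <= signal_len:
        scored.append((score_zone(start, start + zone_len), start))
        start += shift
    # Identifying unit: rank by likelihood and keep the top candidates.
    scored.sort(reverse=True)
    return [(s, s + zone_len) for _, s in scored[:top_n]]
```

With a toy scorer that peaks at frame 5, the best-ranked zone starts at frame 5 and the remaining candidates are its immediate neighbours.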
18 Claims
1. A voice search device comprising:
a processor; and
a memory storing instructions that, when executed by the processor, control the processor to:
convert a search string into a phoneme sequence;
acquire durations of respective phonemes included in the phoneme sequence;
derive a spoken time length of voice corresponding to the search string based on the durations;
designate a plurality of designated zones having time lengths in a target voice signal;
acquire, using an acoustic model that does not depend on adjacent phonemes, a first group of likelihoods indicating how likely each zone from among the plurality of designated zones is a zone in which voice corresponding to the search string is spoken;
specify, based on the first group of likelihoods, a plurality of estimated zones from among the plurality of designated zones, wherein each estimated zone is a zone in which the voice corresponding to the search string is estimated to be spoken, and wherein a number of the estimated zones is less than a number of the plurality of designated zones; and
acquire, using an acoustic model that depends on adjacent phonemes, a second group of likelihoods indicating how likely each of the plurality of estimated zones is a zone in which the voice corresponding to the search string is spoken.
9. A voice search method comprising:
converting a search string into a phoneme sequence;
acquiring durations of respective phonemes included in the phoneme sequence;
deriving a spoken time length of voice corresponding to the search string based on the durations;
designating a plurality of designated zones having time lengths in a target voice signal;
acquiring, using an acoustic model that does not depend on adjacent phonemes, a first group of likelihoods indicating how likely each zone from among the plurality of designated zones is a zone in which voice corresponding to the search string is spoken;
specifying, based on the first group of likelihoods, a plurality of estimated zones from among the plurality of designated zones, wherein each estimated zone is a zone in which the voice corresponding to the search string is estimated to be spoken, and wherein a number of the estimated zones is less than a number of the plurality of designated zones; and
acquiring, using an acoustic model that depends on adjacent phonemes, a second group of likelihoods indicating how likely each of the plurality of estimated zones is a zone in which the voice corresponding to the search string is spoken.
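The two-stage scoring recited in the method can be sketched as a coarse pass over every designated zone followed by a rescoring pass over a smaller set. This is only an illustrative sketch: `coarse_score` and `fine_score` are hypothetical callables standing in for the context-independent and context-dependent acoustic models, and `keep` is an assumed parameter for how many estimated zones survive the first pass.

```python
def two_pass_search(zones, coarse_score, fine_score, keep):
    """Score all zones coarsely, then rescore only the best `keep` zones.

    coarse_score and fine_score are hypothetical stand-ins for a model that
    ignores adjacent phonemes and one that depends on them, respectively.
    """
    if keep >= len(zones):
        # The claim requires fewer estimated zones than designated zones.
        raise ValueError("keep must be smaller than the number of zones")
    # First group of likelihoods: one per designated zone.
    first = {zone: coarse_score(zone) for zone in zones}
    estimated = sorted(zones, key=first.get, reverse=True)[:keep]
    # Second group of likelihoods: only for the estimated zones.
    second = {zone: fine_score(zone) for zone in estimated}
    return sorted(second.items(), key=lambda item: item[1], reverse=True)
```

The design intent suggested by the claim is that the cheaper model prunes the search space, so the more expensive model runs on only `keep` zones instead of all of them.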
17. A non-transitory recording medium having a program recorded thereon that is executable to control a computer to:
convert a search string into a phoneme sequence;
acquire durations of respective phonemes included in the phoneme sequence;
derive a spoken time length of voice corresponding to the search string based on the durations;
designate a plurality of designated zones having time lengths in a target voice signal;
acquire, using an acoustic model that does not depend on adjacent phonemes, a first group of likelihoods indicating how likely each zone from among the plurality of designated zones is a zone in which voice corresponding to the search string is spoken;
specify, based on the first group of likelihoods, a plurality of estimated zones from among the plurality of designated zones, wherein each estimated zone is a zone in which the voice corresponding to the search string is estimated to be spoken, and wherein a number of the estimated zones is less than a number of the plurality of designated zones; and
acquire, using an acoustic model that depends on adjacent phonemes, a second group of likelihoods indicating how likely each of the plurality of estimated zones is a zone in which the voice corresponding to the search string is spoken.
Specification