Audio signal time offset estimation algorithm and measuring normalizing block algorithms for the perceptually-consistent comparison of speech signals
1 Assignment
0 Petitions
Accused Products
Abstract
An audio signal time offset estimation algorithm estimates the time offset between two audio signals. The algorithm can measure this delay even when the audio equipment causes severe distortion and the signal coming out of the equipment sounds very different from the signal going in. Measuring normalizing block algorithms provide perceptually consistent comparison of speech signals: they compare the sounds of two speech signals in a way that agrees with human auditory perception. This means, for example, that when these algorithms indicate that two speech signals sound identical, it is very likely that persons listening to those speech signals would describe them as identical, and when these algorithms indicate that two speech signals sound similar, it is very likely that those persons would describe them as similar.
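The claims below do not recite the offset-estimation details, so as a rough illustration only, the standard cross-correlation approach to estimating a time offset between two signals is sketched here. This is an illustrative stand-in, not the patented algorithm; the helper name `estimate_offset` and the use of NumPy are assumptions.

```python
import numpy as np

def estimate_offset(x, y):
    """Estimate the sample offset of y relative to x by locating the
    peak of their full cross-correlation."""
    corr = np.correlate(y, x, mode="full")
    # Shift the peak index so that 0 means the signals are aligned.
    return int(np.argmax(corr)) - (len(x) - 1)

# A zero-padded ramp, and a copy of it delayed by 25 samples:
x = np.concatenate([np.zeros(100), np.arange(50.0), np.zeros(100)])
y = np.roll(x, 25)
print(estimate_offset(x, y))  # → 25
```

Plain cross-correlation degrades under the severe distortion the abstract describes, which is presumably why a more robust estimator is claimed.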
66 Claims
1. A method for measuring differences between two speech signals consistent with human auditory perception and judgment, said method comprising the steps of:

preparing, using a digital signal processor element programmed with a speech signal preparation algorithm, digital representations of two speech signals for further processing,

transforming the digital representations of the two speech signals using a digital signal processor element programmed with a frequency domain transformation algorithm to segment the digital representations of the two speech signals into respective groups of frames, and transforming the respective groups of frames into the frequency domain,

selecting frames using a digital signal processor element programmed with a frame selection algorithm to select frequency-domain frames for further processing,

measuring perceived loudness of selected frames using a digital signal processor element programmed with a perceived loudness approximation algorithm, and

comparing, using a digital signal processor element programmed with an auditory distance algorithm to compare measured loudness values for at least two selected frequency-domain frames each corresponding to a respective one of the two speech signals and generate a numerical result representing auditory distance;

wherein the auditory distance value is directly proportional to human auditory perception of the difference between the two speech signals,

wherein said step of preparing comprises the steps of:

converting a first of the two speech signals from analog to digital form and storing the digital form as a first vector x, and

converting a second of the two speech signals from analog to digital form and storing the digital form as a second vector y,

wherein said transforming step comprises the steps of:

generating a plurality of frames for each of the x and y vectors, respectively,

transforming each frame to a frequency domain vector, and

storing each frequency domain vector in respective matrices X and Y,

wherein said step of selecting frames comprises the steps of:

selecting only frames that meet or exceed predetermined energy thresholds, and

wherein said step of selecting only frames that meet or exceed predetermined energy thresholds comprises the steps of:

for matrix X, selecting only frames which meet or exceed an energy threshold x_threshold of substantially 15 dB below an energy level x_energy of a peak frame in matrix X;

##EQU59##

for matrix Y, selecting only frames which meet or exceed an energy threshold y_threshold of substantially 35 dB below an energy level y_energy of a peak frame in matrix Y;

##EQU60##
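The frame-selection step of claim 1 (keep only frames within substantially 15 dB of the peak frame's energy for matrix X, and within 35 dB for matrix Y) can be sketched as below. The energy definition here (summed squared magnitudes per frequency-domain frame) is an assumption, since the patent's exact formulas sit behind the ##EQU59## and ##EQU60## placeholders.

```python
import numpy as np

def select_frames(F, threshold_db):
    """Return indices of frames whose energy is within threshold_db
    of the peak frame's energy. F holds one frequency-domain frame
    per row, as in matrices X and Y of claim 1."""
    energy = np.sum(np.abs(F) ** 2, axis=1)      # per-frame energy (assumed definition)
    energy_db = 10.0 * np.log10(energy + 1e-12)  # convert to dB, guarding log(0)
    peak_db = energy_db.max()
    return np.nonzero(energy_db >= peak_db - threshold_db)[0]

# Per claim 1: frames of X within 15 dB of the peak, frames of Y within 35 dB.
X = np.fft.rfft(np.random.randn(200, 256), axis=1)
Y = np.fft.rfft(np.random.randn(200, 256), axis=1)
x_selected = select_frames(X, 15.0)
y_selected = select_frames(Y, 35.0)
```

The looser 35 dB threshold on the second (degraded) signal keeps quiet frames that distortion may have attenuated, so fewer frames of Y than of X are discarded.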
2. A method for measuring differences between two speech signals consistent with human auditory perception and judgment, said method comprising the steps of:

preparing, using a digital signal processor element programmed with a speech signal preparation algorithm, digital representations of two speech signals for further processing,

transforming the digital representations of the two speech signals using a digital signal processor element programmed with a frequency domain transformation algorithm to segment the digital representations of the two speech signals into respective groups of frames, and transforming the respective groups of frames into the frequency domain,

selecting frames using a digital signal processor element programmed with a frame selection algorithm to select frequency-domain frames for further processing,

measuring perceived loudness of selected frames using a digital signal processor element programmed with a perceived loudness approximation algorithm, and

comparing, using a digital signal processor element programmed with an auditory distance algorithm to compare measured loudness values for at least two selected frequency-domain frames each corresponding to a respective one of the two speech signals and generate a numerical result representing auditory distance;

wherein the auditory distance value is directly proportional to human auditory perception of the difference between the two speech signals,

wherein said step of preparing comprises the steps of:

converting a first of the two speech signals from analog to digital form and storing the digital form as a first vector x; and

converting a second of the two speech signals from analog to digital form and storing the digital form as a second vector y,

wherein said transforming step comprises the steps of:

generating a plurality of frames for each of the x and y vectors, respectively,

transforming each frame to a frequency domain vector, and

storing each frequency domain vector in respective matrices X and Y, and

wherein said comparing step comprises the step of applying a frequency measuring normalizing block to matrices X and Y.

- View Dependent Claims (3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31)
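The transforming step shared by claims 1 and 2 (segment each digitized vector into frames, transform each frame to a frequency-domain vector, store the vectors as rows of matrices X and Y) can be sketched as follows. Frame length, hop size, and windowing are not recited in the claim text, so the values and the Hann window here are illustrative assumptions.

```python
import numpy as np

def to_frequency_matrix(signal, frame_len=256, hop=128):
    """Segment a digitized signal into overlapping frames and transform
    each frame to the frequency domain, one frame per matrix row."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    window = np.hanning(frame_len)  # illustrative choice; not in the claim
    return np.fft.rfft(frames * window, axis=1)

# Vectors x and y from the preparation step become matrices X and Y:
x = np.random.randn(8000)
y = np.random.randn(8000)
X = to_frequency_matrix(x)
Y = to_frequency_matrix(y)
```

Each row of X and Y is then a candidate frame for the energy-threshold selection step.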
32. An apparatus for measuring differences between two speech signals consistent with human auditory perception and judgment, said apparatus comprising:

first means for preparing digital representations of two speech signals for further processing;

second means, coupled to the first means, for transforming the digital representations of the two speech signals to segment the digital representations of the two speech signals into respective groups of frames, and transforming the respective groups of frames into the frequency domain;

third means, coupled to the second means, for selecting frequency-domain frames for further processing; and

fourth means, coupled to the third means, for measuring perceived loudness of selected frames, and

fifth means, coupled to the fourth means, for comparing measured loudness values for at least two selected frequency-domain frames each corresponding to a respective one of the two speech signals and generate a numerical result representing auditory distance;

wherein the auditory distance value is directly proportional to human auditory perception of the difference between the two speech signals,

wherein said first means comprises:

means for converting a first of the two speech signals from analog to digital form and storing the digital form as a first vector x; and

means for converting a second of the two speech signals from analog to digital form and storing the digital form as a second vector y,

wherein said second means comprises:

means for generating a plurality of frames for each of the x and y vectors, respectively;

means for transforming each frame to a frequency domain vector; and

means for storing each frequency domain vector in respective matrices X and Y,

wherein said third means selects only frames that meet or exceed predetermined energy thresholds, and

wherein said third means selects only frames that meet or exceed predetermined energy thresholds determined as:

for matrix X, selecting only frames which meet or exceed an energy threshold x_threshold of substantially 15 dB below an energy level x_energy of a peak frame in matrix X;

##EQU97##

for matrix Y, selecting only frames which meet or exceed an energy threshold y_threshold of substantially 35 dB below an energy level y_energy of a peak frame in matrix Y;

##EQU98##
33. An apparatus for measuring differences between two speech signals consistent with human auditory perception and judgment, said apparatus comprising:

first means for preparing digital representations of two speech signals for further processing;

second means, coupled to the first means, for transforming the digital representations of the two speech signals to segment the digital representations of the two speech signals into respective groups of frames, and transforming the respective groups of frames into the frequency domain;

third means, coupled to the second means, for selecting frequency-domain frames for further processing; and

fourth means, coupled to the third means, for measuring perceived loudness of selected frames, and

fifth means, coupled to the fourth means, for comparing measured loudness values for at least two selected frequency-domain frames each corresponding to a respective one of the two speech signals and generate a numerical result representing auditory distance;

wherein the auditory distance value is directly proportional to human auditory perception of the difference between the two speech signals,

wherein said first means comprises:

means for converting a first of the two speech signals from analog to digital form and storing the digital form as a first vector x; and

means for converting a second of the two speech signals from analog to digital form and storing the digital form as a second vector y,

wherein said second means comprises:

means for generating a plurality of frames for each of the x and y vectors, respectively;

means for transforming each frame to a frequency domain vector; and

means for storing each frequency domain vector in respective matrices X and Y, and

wherein said fifth means comprises means for applying a frequency measuring normalizing block to matrices X and Y.

- View Dependent Claims (34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62)
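The perceived loudness approximation algorithm and the measuring normalizing block recited in these claims are not detailed in the claim text. As a loose illustration of the measure-then-compare structure only (emphatically not the patented algorithms), loudness can be crudely approximated by log magnitudes and compared by a mean absolute difference; the function name `auditory_distance` is a hypothetical stand-in.

```python
import numpy as np

def auditory_distance(X, Y):
    """Illustrative stand-in for the loudness-measurement and comparison
    steps: approximate perceived loudness as the log magnitude of each
    frequency-domain frame, then average the elementwise distances."""
    loud_x = np.log10(np.abs(X) + 1e-12)  # crude loudness approximation
    loud_y = np.log10(np.abs(Y) + 1e-12)
    return float(np.mean(np.abs(loud_x - loud_y)))

X = np.fft.rfft(np.random.randn(50, 256), axis=1)
print(auditory_distance(X, X))  # → 0.0 for identical signals
```

The key property the claims require, and which any real implementation must validate against listening tests, is that the resulting number grow in proportion to the difference human listeners perceive.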
63. A computer readable memory for directing a computer to measure differences between two speech signals consistent with human auditory perception and judgment, said computer readable memory comprising:

a first memory portion containing instructions for preparing digital representations of two speech signals for further processing,

a second memory portion containing instructions for transforming the digital representations of the two speech signals to segment the digital representations of the two speech signals into respective groups of frames, and transforming the respective groups of frames into the frequency domain,

a third memory portion containing instructions for selecting frequency-domain frames for further processing,

a fourth memory portion containing instructions for measuring perceived loudness of selected frames, and

a fifth memory portion containing instructions for comparing measured loudness values for at least two selected frequency-domain frames each corresponding to a respective one of the two speech signals and generate a numerical result representing auditory distance;

wherein the auditory distance value is directly proportional to human auditory perception of the difference between the two speech signals,

wherein said first memory portion comprises:

instructions for converting a first of the two speech signals from analog to digital form and storing the digital form as a first vector x, and

instructions for converting a second of the two speech signals from analog to digital form and storing the digital form as a second vector y,

wherein said transforming step comprises:

instructions for generating a plurality of frames for each of the x and y vectors, respectively,

instructions for transforming each frame to a frequency domain vector, and

instructions for storing each frequency domain vector in respective matrices X and Y,

wherein said instructions for selecting frames comprise:

instructions for selecting only frames that meet or exceed predetermined energy thresholds, and

wherein said instructions for selecting only frames that meet or exceed predetermined energy thresholds comprise instructions for:

for matrix X, selecting only frames which meet or exceed an energy threshold x_threshold of substantially 15 dB below an energy level x_energy of a peak frame in matrix X;

##EQU135##

for matrix Y, selecting only frames which meet or exceed an energy threshold y_threshold of substantially 35 dB below an energy level y_energy of a peak frame in matrix Y;

##EQU136##
64. A computer readable memory for directing a computer to measure differences between two speech signals consistent with human auditory perception and judgment, said computer readable memory comprising:

a first memory portion containing instructions for preparing digital representations of two speech signals for further processing,

a second memory portion containing instructions for transforming the digital representations of the two speech signals to segment the digital representations of the two speech signals into respective groups of frames, and transforming the respective groups of frames into the frequency domain,

a third memory portion containing instructions for selecting frequency-domain frames for further processing,

a fourth memory portion containing instructions for measuring perceived loudness of selected frames, and

a fifth memory portion containing instructions for comparing measured loudness values for at least two selected frequency-domain frames each corresponding to a respective one of the two speech signals and generate a numerical result representing auditory distance;

wherein the auditory distance value is directly proportional to human auditory perception of the difference between the two speech signals,

wherein said first memory portion comprises:

instructions for converting a first of the two speech signals from analog to digital form and storing the digital form as a first vector x, and

instructions for converting a second of the two speech signals from analog to digital form and storing the digital form as a second vector y,

wherein said transforming step comprises:

instructions for generating a plurality of frames for each of the x and y vectors, respectively,

instructions for transforming each frame to a frequency domain vector, and

instructions for storing each frequency domain vector in respective matrices X and Y, and

wherein said comparing step comprises instructions for applying a frequency measuring normalizing block to matrices X and Y.

- View Dependent Claims (65, 66)
Specification