Method and device for audio recognition
First Claim
1. A method of performing audio recognition, comprising:
at a device having one or more processors and memory:
collecting a first audio document to be recognized in response to an audio recognition request;
determining first characteristic information of the first audio document by:
calculating a short-term Fourier transform (STFT) for the first audio document, the STFT having M phase channels producing M sub-graphs in a frequency domain, each of the M sub-graphs corresponding to a distinct range of time in the first audio document, wherein M is a positive integer greater than or equal to two;
for each sub-graph of the M sub-graphs in the frequency domain, extracting a respective sequence of one or more peak frequencies at which the sub-graph has a peak;
in accordance with preset pairing criteria, pairing respective peak frequencies in the M sequences of one or more peak frequencies with another, distinct, peak frequency in the M sequences of one or more peak frequencies to produce a sequence of peak frequency pair values for each of the M sub-graphs;
wherein the first characteristic information includes information corresponding to the M sequences of peak frequency pair values; and
in accordance with preset matching criteria, matching the first characteristic information of the first audio document to second characteristic information of a second audio document to obtain a recognition result.
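The feature-extraction steps recited in claim 1 (STFT, peak extraction per sub-graph, peak pairing) can be illustrated with a minimal Python sketch. It is not the patentee's implementation. Several details are assumptions made only for illustration: the M sub-graphs are taken as M contiguous time ranges of the spectrogram, each frame contributes its single strongest frequency bin as its peak, and the "preset pairing criteria" are modeled as pairing each peak with later peaks at most `max_dt` frames away. All function names and parameters (`stft_magnitudes`, `split_into_subgraphs`, `peak_sequence`, `pair_peaks`, `extract_fingerprint`, `frame_size`, `hop`, `max_dt`) are hypothetical.

```python
import cmath
import math


def stft_magnitudes(samples, frame_size=64, hop=64):
    """Return one magnitude spectrum per frame (Hann window, naive DFT)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        chunk = [samples[start + n] * window[n] for n in range(frame_size)]
        spectrum = []
        for k in range(frame_size // 2 + 1):
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


def split_into_subgraphs(frames, m):
    """Assumption: each sub-graph is a contiguous time range of frames."""
    size = math.ceil(len(frames) / m)
    return [frames[i * size:(i + 1) * size] for i in range(m)]


def peak_sequence(subgraph):
    """Assumption: one peak per frame, the strongest frequency bin."""
    return [(t, max(range(len(spec)), key=lambda k: spec[k]))
            for t, spec in enumerate(subgraph)]


def pair_peaks(peaks, max_dt=1):
    """Pair each peak with later peaks at most max_dt frames away."""
    pairs = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:]:
            if t2 - t1 > max_dt:
                break
            pairs.append((f1, f2, t2 - t1))  # (first bin, second bin, frame gap)
    return pairs


def extract_fingerprint(samples, m, frame_size=64, hop=64, max_dt=1):
    """Return M sequences of peak frequency pair values, one per sub-graph."""
    frames = stft_magnitudes(samples, frame_size, hop)
    return [pair_peaks(peak_sequence(sub), max_dt)
            for sub in split_into_subgraphs(frames, m)]


# Synthetic test signal: eight 64-sample frames, each a pure tone at a known bin.
tones = [4, 10, 4, 10, 6, 12, 6, 12]
samples = [math.sin(2 * math.pi * k * n / 64) for k in tones for n in range(64)]
fingerprint = extract_fingerprint(samples, m=2)
```

With M = 2, `fingerprint` holds two sequences of `(f1, f2, dt)` tuples, one per sub-graph. A real system would likely use an FFT library and a more selective peak picker; the naive DFT here only keeps the sketch dependency-free.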
Abstract
A method and device for performing audio recognition, including: collecting a first audio document to be recognized; initiating calculation of first characteristic information of the first audio document, including: conducting time-frequency analysis for the first audio document to generate a first preset number of phase channels; and extracting at least one peak value characteristic point from each phase channel of the first preset number of phase channels, where the at least one peak value characteristic point of each phase channel constitutes the peak value characteristic point sequence of said each phase channel; and obtaining a recognition result for the first audio document, wherein the recognition result is identified based on the first characteristic information, and wherein the first characteristic information is calculated based on the respective peak value characteristic point sequences of the preset number of phase channels.
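As a rough illustration of how a recognition result might be identified from characteristic information, the Python sketch below scores a query fingerprint against stored reference fingerprints by counting shared peak-pair values. The scoring rule, the pooling of pairs across phase channels, the `min_score` threshold standing in for the "preset matching criteria", and all names and data (`match_score`, `recognize`, `track_a`, `track_b`) are assumptions for illustration only, not taken from the patent.

```python
from collections import Counter


def match_score(query, reference):
    """Count pair values common to both fingerprints (multiset intersection).

    Each fingerprint is a list of per-channel sequences of
    (first bin, second bin, frame gap) tuples; channels are pooled here.
    """
    q = Counter(pair for channel in query for pair in channel)
    r = Counter(pair for channel in reference for pair in channel)
    return sum((q & r).values())


def recognize(query, database, min_score=2):
    """Return the best-scoring reference name, or None below the threshold."""
    best_name, best_score = None, 0
    for name, reference in database.items():
        score = match_score(query, reference)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_score else None


# Hypothetical reference fingerprints, two phase channels each.
database = {
    "track_a": [[(4, 10, 1), (10, 4, 1), (4, 10, 1)], [(6, 12, 1), (12, 6, 1)]],
    "track_b": [[(3, 7, 1)], [(6, 12, 1)]],
}
query = [[(4, 10, 1), (10, 4, 1)], [(6, 12, 1)]]
result = recognize(query, database)
```

Pooling ignores where in time a pair occurred, so a production matcher would normally also check that matched pairs agree on a consistent time offset; that step is omitted here for brevity.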
9 Citations
20 Claims
1. A method of performing audio recognition, comprising:
at a device having one or more processors and memory:
collecting a first audio document to be recognized in response to an audio recognition request;
determining first characteristic information of the first audio document by:
calculating a short-term Fourier transform (STFT) for the first audio document, the STFT having M phase channels producing M sub-graphs in a frequency domain, each of the M sub-graphs corresponding to a distinct range of time in the first audio document, wherein M is a positive integer greater than or equal to two;
for each sub-graph of the M sub-graphs in the frequency domain, extracting a respective sequence of one or more peak frequencies at which the sub-graph has a peak;
in accordance with preset pairing criteria, pairing respective peak frequencies in the M sequences of one or more peak frequencies with another, distinct, peak frequency in the M sequences of one or more peak frequencies to produce a sequence of peak frequency pair values for each of the M sub-graphs;
wherein the first characteristic information includes information corresponding to the M sequences of peak frequency pair values; and
in accordance with preset matching criteria, matching the first characteristic information of the first audio document to second characteristic information of a second audio document to obtain a recognition result.
Dependent claims: 2, 3, 4, 5, 6, 7
8. A system for performing audio recognition, comprising:
one or more processors; and
memory storing instructions that, when executed by the one or more processors, cause the processors to perform operations comprising:
collecting a first audio document to be recognized in response to an audio recognition request;
determining first characteristic information of the first audio document by:
calculating a short-term Fourier transform (STFT) for the first audio document, the STFT having M phase channels producing M sub-graphs in a frequency domain, each of the M sub-graphs corresponding to a distinct range of time in the first audio document, wherein M is a positive integer greater than or equal to two;
for each sub-graph of the M sub-graphs in the frequency domain, extracting a respective sequence of one or more peak frequencies at which the sub-graph has a peak;
in accordance with preset pairing criteria, pairing respective peak frequencies in the M sequences of one or more peak frequencies with another, distinct, peak frequency in the M sequences of one or more peak frequencies to produce a sequence of peak frequency pair values for each of the M sub-graphs;
wherein the first characteristic information includes information corresponding to the M sequences of peak frequency pair values; and
in accordance with preset matching criteria, matching the first characteristic information of the first audio document to second characteristic information of a second audio document to obtain a recognition result.
Dependent claims: 9, 10, 11, 12, 13, 14
15. A non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the processors to perform operations comprising:
collecting a first audio document to be recognized in response to an audio recognition request;
determining first characteristic information of the first audio document by:
calculating a short-term Fourier transform (STFT) for the first audio document, the STFT having M phase channels producing M sub-graphs in a frequency domain, each of the M sub-graphs corresponding to a distinct range of time in the first audio document, wherein M is a positive integer greater than or equal to two;
for each sub-graph of the M sub-graphs in the frequency domain, extracting a respective sequence of one or more peak frequencies at which the sub-graph has a peak;
in accordance with preset pairing criteria, pairing respective peak frequencies in the M sequences of one or more peak frequencies with another, distinct, peak frequency in the M sequences of one or more peak frequencies to produce a sequence of peak frequency pair values for each of the M sub-graphs;
wherein the first characteristic information includes information corresponding to the M sequences of peak frequency pair values; and
in accordance with preset matching criteria, matching the first characteristic information of the first audio document to second characteristic information of a second audio document to obtain a recognition result.
Dependent claims: 16, 17, 18, 19, 20
Specification