Method and device for audio recognition

US 9,373,336 B2
Filed: 12/11/2013
Issued: 06/21/2016
Est. Priority Date: 02/04/2013
Status: Active Grant

Abstract

A method and device for performing audio recognition, including: collecting a first audio document to be recognized; initiating calculation of first characteristic information of the first audio document, including: conducting time-frequency analysis for the first audio document to generate a first preset number of phase channels; and extracting at least one peak value characteristic point from each phase channel of the first preset number of phase channels, where the at least one peak value characteristic point of each phase channel constitutes the peak value characteristic point sequence of said each phase channel; and obtaining a recognition result for the first audio document, wherein the recognition result is identified based on the first characteristic information, and wherein the first characteristic information is calculated based on the respective peak value characteristic point sequences of the preset number of phase channels.
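
The time-frequency analysis and peak extraction described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: it assumes each sub-graph is simply the magnitude spectrum of one non-overlapping time segment, uses a naive DFT with a Hann window, and keeps the strongest local maxima. The function names and the `max_peaks` parameter are hypothetical.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Hann-windowed naive DFT magnitudes for bins 0..N/2 (short frames only)."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


def peak_frequencies(mags, bin_hz, max_peaks=3):
    """Return up to max_peaks local-maximum frequencies in Hz, ascending."""
    local_maxima = [k for k in range(1, len(mags) - 1)
                    if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]]
    # Keep the strongest peaks (assumed selection rule, not from the patent).
    strongest = sorted(local_maxima, key=lambda k: mags[k], reverse=True)[:max_peaks]
    return [k * bin_hz for k in sorted(strongest)]


def extract_peak_sequences(samples, m, sample_rate, max_peaks=3):
    """Split samples into m time segments and extract peaks from each one."""
    seg_len = len(samples) // m
    sequences = []
    for j in range(m):
        segment = samples[j * seg_len:(j + 1) * seg_len]
        mags = magnitude_spectrum(segment)
        sequences.append(peak_frequencies(mags, sample_rate / seg_len, max_peaks))
    return sequences
```

Called with `m=2` on a signal whose first half is one pure tone and whose second half is another, the function returns one peak sequence per half. A practical system would use an FFT and overlapping frames instead of the naive transform shown here.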

20 Claims

1. A method of performing audio recognition, comprising:
- at a device having one or more processors and memory:
  
  collecting a first audio document to be recognized in response to an audio recognition request;
  
  determining first characteristic information of the first audio document by:
  
  calculating a short-term Fourier transform (STFT) for the first audio document, the STFT having M phase channels producing M sub-graphs in a frequency domain, each of the M sub-graphs corresponding to a distinct range of time in the first audio document, wherein M is a positive integer greater than or equal to two;
  
  for each sub-graph of the M sub-graphs in the frequency domain, extracting a respective sequence of one or more peak frequencies at which the sub-graph has a peak;
  
  in accordance with preset pairing criteria, pairing respective peak frequencies in the M sequences of one or more peak frequencies with another, distinct, peak frequency in the M sequences of one or more peak frequencies to produce a sequence of peak frequency pair values for each of the M sub-graphs;
  
  wherein the first characteristic information includes information corresponding to the M sequences of peak frequency pair values; and
  
  in accordance with preset matching criteria, matching the first characteristic information of the first audio document to second characteristic information of a second audio document to obtain a recognition result.
  - 2. The method of claim 1, further comprising:
    - sending, to a server, the respective sequences of the one or more peak frequencies for the M sub-graphs, wherein the server completes the determination of the first characteristic information based on the respective sequences of the one or more peak frequencies for the M sub-graphs.
  - 3. The method of claim 2, further including:
    - before sending, to the server, the respective sequences of the one or more peak frequencies for the M sub-graphs:
      
      performing a first type of compression on respective time values corresponding to each of the M sub-graphs and a second type of compression on respective frequency values in the respective sequences of the one or more peak frequencies for each of the M sub-graphs.
  - 4. The method of claim 1, further comprising:
    - establishing a database of a plurality of known audio documents by, for each known audio document of the plurality of known audio documents:
      
      calculating a collection of audio fingerprint sequences comprising one or more audio fingerprints;
      
      calculating a hashcode for the collection of audio fingerprint sequences; and
      
      storing, as respective characteristic information for the known audio document, the collection of audio fingerprints in a hash table according to the hashcode.
  - 5. The method of claim 4, wherein determining the first characteristic information further includes:
    - generating a collection of audio fingerprint sequences for the first audio document by calculating a hashcode for each peak frequency pair value of the M sequences of peak frequency pair values for the first audio document;
      
      wherein the first characteristic information includes the collection of audio fingerprint sequences for the first audio document.
  - 6. The method of claim 5, further comprising:
    - comparing the first characteristic information with the respective characteristic information of one or more of the known audio documents in the database;
      
      weighting the one or more known audio documents according to the respective comparison result; and
      
      in accordance with the weights for the one or more known audio documents, selecting a preset number of the one or more known audio documents to construct a document candidate list.
  - 7. The method of claim 6, further comprising:
    - calculating a time dependency between the first characteristic information and the second characteristic information; and
      
      in accordance with a determination that the time dependency between the second characteristic information and the first characteristic information exceeds a preset threshold value, selecting the second characteristic information as matching the first characteristic information.
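
The pairing, hashing, and hash-table steps of claims 1, 4, and 5 can be sketched as follows. The claims leave the "preset pairing criteria" and the hashcode layout open, so this sketch assumes that each peak is paired with every peak in the next `max_dt` sub-graphs, that peaks are integer frequency-bin indices, and that a pair is packed into one integer with 12 bits per bin and 8 bits for the time gap. All names here are hypothetical.

```python
from collections import defaultdict


def pair_peaks(peak_sequences, max_dt=2):
    """Return, per sub-graph, a list of (f1, f2, dt) peak pairs.

    Assumed pairing criteria: an anchor peak in sub-graph t is paired with
    every peak found in sub-graphs t+1 .. t+max_dt.
    """
    pairs_per_subgraph = []
    for t, anchors in enumerate(peak_sequences):
        pairs = []
        for dt in range(1, max_dt + 1):
            if t + dt >= len(peak_sequences):
                break
            for f1 in anchors:
                for f2 in peak_sequences[t + dt]:
                    pairs.append((f1, f2, dt))
        pairs_per_subgraph.append(pairs)
    return pairs_per_subgraph


def hashcode(f1, f2, dt):
    """Pack one pair into an integer (assumed layout: 12 + 12 + 8 bits)."""
    return ((f1 & 0xFFF) << 20) | ((f2 & 0xFFF) << 8) | (dt & 0xFF)


def fingerprints(peak_sequences, max_dt=2):
    """Return (hashcode, anchor_time) fingerprints for every peak pair."""
    result = []
    for t, pairs in enumerate(pair_peaks(peak_sequences, max_dt)):
        for f1, f2, dt in pairs:
            result.append((hashcode(f1, f2, dt), t))
    return result


def add_to_database(database, doc_id, peak_sequences):
    """Store a known document's fingerprints in a hash table keyed by hashcode."""
    for code, t in fingerprints(peak_sequences):
        database[code].append((doc_id, t))


# Usage: index one known document whose sub-graphs peak at bins 8, 16, 8.
database = defaultdict(list)
add_to_database(database, "song_a", [[8], [16], [8]])
```

Keeping the anchor time next to each hashcode is a design choice of this sketch; it lets a later matching step compare time offsets between a query and a stored document.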

8. A system for performing audio recognition, comprising:
- one or more processors; and
  
  memory storing instructions that, when executed by the one or more processors, cause the processors to perform operations comprising:
  
  collecting a first audio document to be recognized in response to an audio recognition request;
  
  determining first characteristic information of the first audio document by:
  
  calculating a short-term Fourier transform (STFT) for the first audio document, the STFT having M phase channels producing M sub-graphs in a frequency domain, each of the M sub-graphs corresponding to a distinct range of time in the first audio document, wherein M is a positive integer greater than or equal to two;
  
  for each sub-graph of the M sub-graphs in the frequency domain, extracting a respective sequence of one or more peak frequencies at which the sub-graph has a peak;
  
  in accordance with preset pairing criteria, pairing respective peak frequencies in the M sequences of one or more peak frequencies with another, distinct, peak frequency in the M sequences of one or more peak frequencies to produce a sequence of peak frequency pair values for each of the M sub-graphs;
  
  wherein the first characteristic information includes information corresponding to the M sequences of peak frequency pair values; and
  
  in accordance with preset matching criteria, matching the first characteristic information of the first audio document to second characteristic information of a second audio document to obtain a recognition result.
  - 9. The system of claim 8, wherein the operations further comprise:
    - sending, to a server, the respective sequences of the one or more peak frequencies for the M sub-graphs, wherein the server completes the determination of the first characteristic information based on the respective sequences of the one or more peak frequencies for the M sub-graphs.
  - 10. The system of claim 9, wherein the operations further comprise:
    - before sending, to the server, the respective sequences of the one or more peak frequencies for the M sub-graphs:
      
      performing a first type of compression on respective time values corresponding to each of the M sub-graphs and a second type of compression on respective frequency values in the respective sequences of the one or more peak frequencies for each of the M sub-graphs.
  - 11. The system of claim 8, wherein the operations further comprise:
    - establishing a database of a plurality of known audio documents by, for each known audio document of the plurality of known audio documents:
      
      calculating a collection of audio fingerprint sequences comprising one or more audio fingerprints;
      
      calculating a hashcode for the collection of audio fingerprint sequences;
      
      storing, as respective characteristic information for the known audio document, the collection of audio fingerprints in a hash table according to the hashcode.
  - 12. The system of claim 11, wherein determining the first characteristic information further includes:
    - generating a collection of audio fingerprint sequences for the first audio document by calculating a hashcode for each peak frequency pair value of the M sequences of peak frequency pair values for the first audio document;
      
      wherein the first characteristic information includes the collection of audio fingerprint sequences for the first audio document.
  - 13. The system of claim 12, wherein the operations further comprise:
    - comparing the first characteristic information with the respective characteristic information of one or more of the known audio documents in the database;
      
      weighting the one or more known audio documents according to the respective comparison result; and
      
      in accordance with the weights for the one or more known audio documents, selecting a preset number of the one or more known audio documents to construct a document candidate list.
  - 14. The system of claim 13, wherein the operations further comprise:
    - calculating a time dependency between the first characteristic information and the second characteristic information; and
      
      in accordance with a determination that the time dependency between the second characteristic information and the first characteristic information exceeds a preset threshold value, selecting the second characteristic information as matching the first characteristic information.
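
The candidate weighting and time-dependency test of claims 13 and 14 (mirroring claims 6 and 7) can be sketched as below. The claims do not define the weighting or the time-dependency measure, so this sketch assumes a document's weight is its number of shared hashcodes and that time dependency is the largest count of matching fingerprints sharing a single time offset. The database layout (hashcode mapped to a list of `(doc_id, time)` entries) and all function names are hypothetical.

```python
from collections import Counter


def candidate_list(database, query, top_n=3):
    """Weight known documents by shared hashcodes and keep the top_n."""
    weights = Counter()
    for code, _ in query:
        # Count each document at most once per query fingerprint.
        for doc_id in {d for d, _ in database.get(code, [])}:
            weights[doc_id] += 1
    return [doc_id for doc_id, _ in weights.most_common(top_n)]


def time_dependency(database, query, doc_id):
    """Largest number of matching fingerprints that share one time offset."""
    offsets = Counter()
    for code, t_query in query:
        for d, t_doc in database.get(code, []):
            if d == doc_id:
                offsets[t_doc - t_query] += 1
    return max(offsets.values(), default=0)


def recognize(database, query, threshold=2, top_n=3):
    """Return the best candidate whose score exceeds threshold, else None."""
    best_id, best_score = None, threshold
    for doc_id in candidate_list(database, query, top_n):
        score = time_dependency(database, query, doc_id)
        if score > best_score:
            best_id, best_score = doc_id, score
    return best_id


# Usage: song_a matches the query at a constant offset of 5; song_b does not.
database = {
    101: [("song_a", 5), ("song_b", 0)],
    102: [("song_a", 6)],
    103: [("song_a", 7), ("song_b", 9)],
}
query = [(101, 0), (102, 1), (103, 2)]
result = recognize(database, query, threshold=2)
```

A consistent offset indicates that the query aligns with one stretch of the stored document, which is why song_b's scattered hits do not pass the threshold in this example.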

15. A non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the processors to perform operations comprising:
- collecting a first audio document to be recognized in response to an audio recognition request;
  
  determining first characteristic information of the first audio document by:
  
  calculating a short-term Fourier transform (STFT) for the first audio document, the STFT having M phase channels producing M sub-graphs in a frequency domain, each of the M sub-graphs corresponding to a distinct range of time in the first audio document, wherein M is a positive integer greater than or equal to two;
  
  for each sub-graph of the M sub-graphs in the frequency domain, extracting a respective sequence of one or more peak frequencies at which the sub-graph has a peak;
  
  in accordance with preset pairing criteria, pairing respective peak frequencies in the M sequences of one or more peak frequencies with another, distinct, peak frequency in the M sequences of one or more peak frequencies to produce a sequence of peak frequency pair values for each of the M sub-graphs;
  
  wherein the first characteristic information includes information corresponding to the M sequences of peak frequency pair values; and
  
  in accordance with preset matching criteria, matching the first characteristic information of the first audio document to second characteristic information of a second audio document to obtain a recognition result.
  - 16. The non-transitory computer-readable medium of claim 15, wherein the operations further comprise:
    - sending, to a server, the respective sequences of the one or more peak frequencies for the M sub-graphs, wherein the server completes the determination of the first characteristic information based on the respective sequences of the one or more peak frequencies for the M sub-graphs.
  - 17. The non-transitory computer-readable medium of claim 15, wherein the operations further comprise:
    - establishing a database of a plurality of known audio documents by, for each known audio document of the plurality of known audio documents:
      
      calculating a collection of audio fingerprint sequences comprising one or more audio fingerprints;
      
      calculating a hashcode for the collection of audio fingerprint sequences;
      
      storing, as respective characteristic information for the known audio document, the collection of audio fingerprints in a hash table according to the hashcode.
  - 18. The non-transitory computer-readable medium of claim 17, wherein determining the first characteristic information further includes:
    - generating a collection of audio fingerprint sequences for the first audio document by calculating a hashcode for each peak frequency pair value of the M sequences of peak frequency pair values for the first audio document;
      
      wherein the first characteristic information includes the collection of audio fingerprint sequences for the first audio document.
  - 19. The non-transitory computer-readable medium of claim 18, wherein the operations further comprise:
    - comparing the first characteristic information with the respective characteristic information of one or more of the known audio documents in the database;
      
      weighting the one or more known audio documents according to the respective comparison result; and
      
      in accordance with the weights for the one or more known audio documents, selecting a preset number of the one or more known audio documents to construct a document candidate list.
  - 20. The non-transitory computer-readable medium of claim 19, wherein the operations further comprise:
    - calculating a time dependency between the first characteristic information and the second characteristic information; and
      
      in accordance with a determination that the time dependency between the second characteristic information and the first characteristic information exceeds a preset threshold value, selecting the second characteristic information as matching the first characteristic information.

Current Assignee: Tencent Technology Company Limited (Tencent Holdings Limited)
Original Assignee: Tencent Technology Company Limited (Tencent Holdings Limited)
Inventors: Liu, Hailong; Xie, Dadong; Hou, Jie; Xiao, Bin; Liu, Xiao; Chen, Bo
Primary Examiner(s): Jamal, Alexander
Application Number: US14/103,753
Publication Number: US 20140219461A1
Time in Patent Office: 923 Days
Field of Search: 381/56, 381/30, 704/231
US Class Current: 1/1
CPC Class Codes:
G06F 16/683   using metadata automaticall...
G10L 19/018   Audio watermarking, i.e. em...
G10L 19/02   using spectral analysis, e....
G10L 25/18   the extracted parameters be...
G10L 25/54   for retrieval
