System and Method for Audio Scene Understanding of Physical Object Sound Sources
First Claim
1. A method of training an audio monitoring system comprising:
receiving with a processor in the audio monitoring system first registration information for a first object in a first scene around a sound sensor in the audio monitoring system;
training with the processor a first classifier for a first predetermined action of the first object in the first scene, the first predetermined action generating sound detected by the sound sensor;
receiving with the processor second registration information for a second object in the first scene around the sound sensor;
training with the processor a second classifier for a second predetermined action of the second object in the first scene, the second predetermined action generating sound detected by the sound sensor;
receiving with the processor object relationship data corresponding to a relationship between the first object and the second object in the first scene;
generating with the processor a specific scene grammar including a first sound event formed with reference to a predetermined general scene grammar stored in a memory, the first registration information, the second registration information, and the object relationship data; and
storing with the processor the specific scene grammar in the memory in association with the first classifier and the second classifier for identification of a subsequent occurrence of the first sound event including the first predetermined action of the first object and the second predetermined action of the second object.
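The training method above can be sketched in code: register objects and their sound-producing actions, train a classifier per object/action pair, then specialize a general scene grammar using the registrations and object relationship data. This is a minimal illustrative sketch; every class and function name here is an assumption, not the patent's actual implementation.

```python
# Hypothetical sketch of the claimed training flow. All names below
# (Registration, SceneGrammar, train_classifier, ...) are illustrative
# assumptions, not the patent's implementation.
from dataclasses import dataclass, field


@dataclass
class Registration:
    object_type: str   # e.g. "door"
    actions: list      # e.g. ["open", "close"]


@dataclass
class SceneGrammar:
    events: dict = field(default_factory=dict)


def train_classifier(object_type, action, audio_clips):
    """Stand-in for per-action classifier training on recorded clips."""
    return f"classifier:{object_type}:{action}"


def generate_specific_grammar(general_grammar, registrations, relationships):
    """Specialize a general grammar using registered objects and relationships."""
    specific = SceneGrammar()
    for (first, second), relation in relationships.items():
        # Each relationship yields a compound sound event in the grammar.
        specific.events[f"{first}_{relation}_{second}"] = (first, second)
    return specific


general = SceneGrammar(events={"entry": None})
regs = [Registration("door", ["open"]), Registration("footsteps", ["walk"])]
classifiers = {(r.object_type, act): train_classifier(r.object_type, act, [])
               for r in regs for act in r.actions}
grammar = generate_specific_grammar(general, regs, {("door", "footsteps"): "then"})
print(grammar.events)   # {'door_then_footsteps': ('door', 'footsteps')}
```

The specific grammar and the trained classifiers would then be stored together, so that a later occurrence of "door opens, then footsteps" can be recognized as one compound sound event.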
Abstract
A method of operating an audio monitoring system includes generating with a sound sensor audio data corresponding to a sound event generated by an object in a scene around the sound sensor, identifying with a processor a type and action of the object in the scene that generated the sound with reference to the audio data, generating with the processor a timestamp corresponding to a time of the detection of the sound event, and updating a scene state model corresponding to sound events generated by a plurality of objects in the scene with reference to the identified type of object, action taken by the object, and the timestamp. The method further includes identifying a sound event in the scene with reference to the scene state model and a predetermined scene grammar stored in a memory, and generating with the processor an output corresponding to the sound event.
19 Citations
20 Claims
1. A method of training an audio monitoring system comprising:
receiving with a processor in the audio monitoring system first registration information for a first object in a first scene around a sound sensor in the audio monitoring system;
training with the processor a first classifier for a first predetermined action of the first object in the first scene, the first predetermined action generating sound detected by the sound sensor;
receiving with the processor second registration information for a second object in the first scene around the sound sensor;
training with the processor a second classifier for a second predetermined action of the second object in the first scene, the second predetermined action generating sound detected by the sound sensor;
receiving with the processor object relationship data corresponding to a relationship between the first object and the second object in the first scene;
generating with the processor a specific scene grammar including a first sound event formed with reference to a predetermined general scene grammar stored in a memory, the first registration information, the second registration information, and the object relationship data; and
storing with the processor the specific scene grammar in the memory in association with the first classifier and the second classifier for identification of a subsequent occurrence of the first sound event including the first predetermined action of the first object and the second predetermined action of the second object.
- View Dependent Claims (2, 3, 4, 5, 6, 7)
8. A method of operating an audio monitoring system comprising:
generating with a sound sensor audio data corresponding to sound produced by an action performed by an object in a first scene around the sound sensor;
identifying with a processor a type of object in the first scene that generated the sound with reference to the audio data;
identifying with the processor the action taken by the object to generate a sound event with reference to the audio data;
generating with the processor a timestamp corresponding to a time of the detection of the sound;
updating with the processor a scene state model corresponding to a plurality of sound events generated by a plurality of objects in the first scene around the sound sensor with reference to the identified type of object, action taken by the object, and the timestamp;
identifying with the processor one sound event in the plurality of sound events for the first scene with reference to the scene state model and a predetermined scene grammar stored in a memory; and
generating with the processor an output corresponding to the one sound event.
- View Dependent Claims (9, 10, 11, 12)
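The operating method of claim 8 amounts to a monitoring loop: classify the sound source and its action, timestamp the detection, update the scene state model, and match the model against the stored scene grammar. The sketch below is a minimal illustration of that loop; the classifier stand-in, the dictionary-based grammar, and the in-order subsequence match are all assumptions made for the example.

```python
# Hypothetical sketch of the claimed monitoring loop. classify() stands in
# for the trained classifiers; the grammar format is an assumption.
import time


def classify(audio_frame):
    """Stand-in for the trained classifiers; returns (object_type, action)."""
    return audio_frame["truth"]   # placeholder for a real acoustic model


def update_state(state, object_type, action, ts):
    """Append one detection, with its timestamp, to the scene state model."""
    state.append({"object": object_type, "action": action, "time": ts})


def match_grammar(state, grammar):
    """Report a sound event when a grammar sequence occurs in order."""
    observed = [(s["object"], s["action"]) for s in state]
    for event_name, sequence in grammar.items():
        it = iter(observed)
        # `step in it` consumes the iterator: an in-order subsequence check.
        if all(step in it for step in sequence):
            return event_name
    return None


grammar = {"person_enters": [("door", "open"), ("footsteps", "walk")]}
state = []
for frame in [{"truth": ("door", "open")}, {"truth": ("footsteps", "walk")}]:
    obj, act = classify(frame)
    update_state(state, obj, act, time.time())   # timestamp of the detection
    event = match_grammar(state, grammar)
    if event:
        print("detected:", event)   # -> detected: person_enters
```

Keeping the state model as a timestamped history is what lets the grammar recognize compound events spanning multiple detections, rather than reacting to each sound in isolation.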
13. An audio monitoring system comprising:
a sound sensor configured to generate audio data corresponding to sound produced by an action performed by an object in a first scene around the sound sensor;
an output device; and
a processor operatively connected to the sound sensor, the output device, and a memory, the processor being configured to:
identify a type of object in the first scene that generated the sound with reference to the audio data;
identify the action taken by the object to generate a sound event with reference to the audio data;
generate a timestamp corresponding to a time of the detection of the sound;
update a scene state model corresponding to a plurality of sound events generated by a plurality of objects in the first scene around the sound sensor with reference to the identified type of object, action taken by the object, and the timestamp;
identify one sound event in the plurality of sound events for the first scene with reference to the scene state model and a predetermined scene grammar stored in the memory; and
generate an output corresponding to the one sound event.
- View Dependent Claims (14, 15, 16, 17, 18, 19, 20)
Specification