Behavior recognition system and method by combining image and speech

US 8,487,867 B2
Filed: 12/09/2009
Issued: 07/16/2013
Est. Priority Date: 11/10/2009
Status: Expired due to Fees

Abstract

A behavior recognition system and method by combining an image and a speech are provided. The system includes a data analyzing module, a database, and a calculating module. A plurality of image-and-speech relation modules is stored in the database. Each image-and-speech relation module includes a feature extraction parameter and an image-and-speech relation parameter. The data analyzing module obtains a gesture image and a speech data corresponding to each other, and substitutes the gesture image and the speech data into each feature extraction parameter to generate image feature sequences and speech feature sequences. The data analyzing module uses each image-and-speech relation parameter to calculate image-and-speech status parameters. The calculating module uses the image-and-speech status parameters, the image feature sequences, and the speech feature sequences to calculate a recognition probability corresponding to each image-and-speech relation parameter, so as to take a maximum value among the recognition probabilities as a target parameter.
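
The flow in the abstract can be sketched as a loop over the stored modules followed by a maximum selection. This is a minimal illustration, not the patented implementation: the `RelationModule` fields, the `recognize` function, and in particular the product used to combine the three quantities into a recognition probability are assumptions made for the example, since the abstract does not state the formula.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Frames = Sequence[float]


@dataclass
class RelationModule:
    """Hypothetical stand-in for one image-and-speech relation module."""
    name: str
    # Stands in for the feature extraction parameter.
    extract: Callable[[Frames], List[float]]
    # Hypothetical per-modality scoring functions for the extracted sequences.
    image_likelihood: Callable[[List[float]], float]
    speech_likelihood: Callable[[List[float]], float]
    # Stands in for the image-and-speech relation parameter.
    relation: Callable[[List[float], List[float]], float]


def recognize(modules: Sequence[RelationModule],
              gesture_image: Frames,
              speech_data: Frames) -> Tuple[str, float]:
    """Score every module and return the one with the highest probability."""
    scored = []
    for m in modules:
        image_features = m.extract(gesture_image)
        speech_features = m.extract(speech_data)
        # Image-and-speech status parameter for this module.
        status = m.relation(image_features, speech_features)
        # Assumption: combine the three terms as a simple product.
        prob = (m.image_likelihood(image_features)
                * m.speech_likelihood(speech_features)
                * status)
        scored.append((m.name, prob))
    # The target parameter is the maximum recognition probability.
    return max(scored, key=lambda pair: pair[1])
```

Calling `recognize` with a list of modules returns the name of the best-scoring module together with its score, which corresponds to taking the maximum recognition probability as the target parameter.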

17 Claims

1. A behavior recognition system by combining an image and a speech, comprising:
- a database, for storing a plurality of image-and-speech relation modules, wherein each of the image-and-speech relation modules comprises a feature extraction parameter and an image-and-speech relation parameter;
  
  a data analyzing module, for substituting a gesture image and a speech data corresponding to each other into each feature extraction parameter to obtain a plurality of image feature sequences and a plurality of speech feature sequences, and substituting each image feature sequence and each speech feature sequence corresponding to a same image-and-speech relation module into each image-and-speech relation parameter, so as to calculate a plurality of image-and-speech status parameters, wherein each image feature sequence comprises a plurality of image frame data, and the image frame data forms a plurality of image frame status combinations;
  
  each speech feature sequence comprises a plurality of speech frame data, and the speech frame data forms a plurality of speech frame status combinations, when the data analyzing module calculates each one of the image-and-speech status parameters, the data analyzing module substitutes each image frame status combination and each speech frame status combination into the image-and-speech relation parameter corresponding to the same image-and-speech relation module to calculate a plurality of image-and-speech sub-status parameters and selects one image-and-speech sub-status parameter from the plurality of image-and-speech sub-status parameters to serve as the image-and-speech status parameter corresponding to the image-and-speech relation module; and
  
  a calculating module, for using the image feature sequences, the speech feature sequences, and the image-and-speech status parameters to calculate a recognition probability corresponding to each of the image-and-speech relation modules, and taking a target parameter from the recognition probabilities.
- Dependent claims (2, 3, 4, 5, 6, 7, 8)
  - 2. The behavior recognition system by combining an image and a speech according to claim 1, wherein the data analyzing module utilizes a hidden Markov model (HMM) to perform training on the speech feature sequence and the image feature sequence to form the speech frame status combinations and the image frame status combinations respectively.
  - 3. The behavior recognition system by combining an image and a speech according to claim 1, wherein the image-and-speech status parameter is one image-and-speech sub-status parameter with a maximum value among the plurality of image-and-speech sub-status parameters.
  - 4. The behavior recognition system by combining an image and a speech according to claim 1, wherein each image feature sequence comprises a plurality of image frame status groups, each speech feature sequence comprises a plurality of speech frame status groups, and the feature extraction parameter records a probability parameter for mapping each image frame status group to each speech frame status group and a probability parameter for mapping each speech frame status group to each image frame status group under a condition of corresponding to the same image-and-speech relation module.
  - 5. The behavior recognition system by combining an image and a speech according to claim 4, wherein a frame mapping relation exists between each image frame status group and each speech frame status group, and in one relation calculation of substituting the image feature sequence and the speech feature sequence into the image-and-speech relation parameter, the data analyzing module substitutes the image frame status groups and the speech frame status groups into the image-and-speech relation parameter corresponding to the same image-and-speech relation module to calculate a plurality of image-and-speech recognition probabilities according to types of the frame mapping relation, and selects one image-and-speech recognition probability having a maximum value from the plurality of image-and-speech recognition probabilities to serve as the image-and-speech sub-status parameter corresponding to the relation calculation.
  - 6. The behavior recognition system by combining an image and a speech according to claim 1, wherein the gesture image comprises a plurality of image frame data, each image frame data comprises an image feature value, and the data analyzing module uses the image feature values to determine that the gesture image comprises a repetitive image data and extracts any one of the repetitive image data to generate each image feature sequence.
  - 7. The behavior recognition system by combining an image and a speech according to claim 1, wherein the speech data comprises a plurality of speech frame data, each speech frame data comprises a speech feature value, and the data analyzing module uses the speech feature values to determine that the speech data comprises a repetitive speech data and extracts any one of the repetitive speech data to generate each speech feature sequence.
  - 8. The behavior recognition system by combining an image and a speech according to claim 1, wherein the target parameter is the recognition probability with a maximum value among the recognition probabilities.
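
Claims 6 and 7 describe detecting repetitive data from per-frame feature values and keeping any one repetition. The claims do not say how repetition is detected, so the sketch below assumes a simple exact-period check with a tolerance; `find_period`, `extract_feature_sequence`, and the `tol` parameter are hypothetical names introduced for illustration only.

```python
from typing import List, Optional, Sequence


def find_period(values: Sequence[float],
                tol: float = 1e-6,
                min_repeats: int = 2) -> Optional[int]:
    """Return the smallest period p such that values[i] matches values[i + p]
    (within tol) for every valid i, or None if the data is not repetitive."""
    n = len(values)
    # A period only counts if it fits at least min_repeats times.
    for p in range(1, n // min_repeats + 1):
        if all(abs(values[i] - values[i + p]) <= tol for i in range(n - p)):
            return p
    return None


def extract_feature_sequence(values: Sequence[float],
                             tol: float = 1e-6) -> List[float]:
    """Keep one repetition if the input is periodic, else keep everything."""
    period = find_period(values, tol)
    if period is not None:
        # Any one repetition would do; the first is the simplest choice.
        return list(values[:period])
    return list(values)
```

The same function can be applied to image feature values and to speech feature values, since both claims follow the same pattern. Real gesture or speech data would likely need a looser similarity measure than a fixed tolerance.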

9. A behavior recognition method by combining an image and a speech, comprising:
- obtaining a gesture image and a speech data corresponding to each other;
  
  providing a plurality of image-and-speech relation modules, wherein each of the image-and-speech relation modules comprises a feature extraction parameter and an image-and-speech relation parameter;
  
  obtaining a plurality of image feature sequences and a plurality of speech feature sequences, wherein the gesture image and the speech data are individually substituted into the feature extraction parameters, so as to calculate the image feature sequences and the speech feature sequences, wherein each image feature sequence comprises a plurality of image frame data, and the image frame data forms a plurality of image frame status combinations;
  
  each speech feature sequence comprises a plurality of speech frame data, and the speech frame data forms a plurality of speech frame status combinations;
  
  calculating a plurality of image-and-speech status parameters, wherein each image feature sequence and each speech feature sequence corresponding to a same image-and-speech relation module are substituted into each image-and-speech relation parameter, so as to obtain the image-and-speech status parameters, wherein the step of calculating each one of the image-and-speech status parameters comprises:
  
  obtaining a plurality of image-and-speech sub-status parameters, wherein each image frame status combination and each speech frame status combination are substituted into the image-and-speech relation parameter corresponding to the same image-and-speech relation module, so as to calculate the image-and-speech sub-status parameters; and
  
  selecting one image-and-speech sub-status parameter from the image-and-speech sub-status parameters to serve as the image-and-speech status parameter corresponding to the image-and-speech relation module;
  
  calculating a plurality of recognition probabilities, wherein the image feature sequences, the speech feature sequences, and the image-and-speech status parameters are used to calculate a recognition probability corresponding to each of the image-and-speech relation modules; and
  
  taking a target parameter from the recognition probabilities.
- Dependent claims (10, 11, 12, 13, 14, 15, 16, 17)
  - 10. The behavior recognition method by combining an image and a speech according to claim 9, wherein the speech frame status combinations and the image frame status combinations are generated through hidden Markov model (HMM) training.
  - 11. The behavior recognition method by combining an image and a speech according to claim 9, wherein the image-and-speech status parameter is the image-and-speech sub-status parameter with a maximum value among the image-and-speech sub-status parameters.
  - 12. The behavior recognition method by combining an image and a speech according to claim 9, wherein each image feature sequence comprises a plurality of image frame status groups, each speech feature sequence comprises a plurality of speech frame status groups, and each feature extraction parameter records a probability parameter for mapping each image frame status group to each speech frame status group.
  - 13. The behavior recognition method by combining an image and a speech according to claim 12, wherein a mapping relation exists between each image frame status group and each speech frame status group, and the step of obtaining a plurality of image-and-speech sub-status parameters further comprises:
    - obtaining a plurality of image-and-speech recognition probabilities, wherein in one relation calculation of substituting the image feature sequence and the speech feature sequence into the image-and-speech relation parameter, the image frame status groups and the speech frame status groups are substituted into the image-and-speech relation parameter corresponding to the same image-and-speech relation module to calculate the image-and-speech recognition probabilities according to types of the frame mapping relation; and
      
      selecting one image-and-speech recognition probability from the image-and-speech recognition probabilities to serve as the image-and-speech sub-status parameter corresponding to the relation calculation, wherein the image-and-speech sub-status parameter is the image-and-speech recognition probability with a maximum value among the image-and-speech recognition probabilities.
  - 14. The behavior recognition method by combining an image and a speech according to claim 9, wherein the step of obtaining a plurality of image feature sequences and a plurality of speech feature sequences comprises:
    - parsing a plurality of image frame data contained in the gesture image to obtain an image feature value contained in each image frame data;
      
      determining whether the gesture image comprises a plurality of repetitive image data by using the image feature values;
      
      if yes, extracting any one of the repetitive image data to generate each image feature sequence; and
      
      if no, converting the gesture image into each image feature sequence.
  - 15. The behavior recognition method by combining an image and a speech according to claim 9, wherein the step of obtaining a plurality of image feature sequences and a plurality of speech feature sequences comprises:
    - parsing a plurality of speech frame data contained in the speech data to obtain a speech feature value contained in each speech frame data;
      
      determining whether the speech data comprises a plurality of repetitive speech data by using the speech feature values;
      
      if yes, extracting any one of the repetitive speech data to generate each speech feature sequence; and
      
      if no, converting the speech data into each speech feature sequence.
  - 16. The behavior recognition method by combining an image and a speech according to claim 9, wherein the target parameter is the recognition probability with a maximum value among the recognition probabilities.
  - 17. The behavior recognition method by combining an image and a speech according to claim 9, wherein a process for establishing any one of the image-and-speech relation modules comprises:
    - obtaining a training image and a training speech corresponding to each other;
      
      converting the training image and the training speech to generate an image training sequence and a speech training sequence, wherein the image training sequence comprises a plurality of image frame data, and the speech training sequence comprises a plurality of speech frame data;
      
      dividing the image training sequence and the speech training sequence individually by using a plurality of division manners, so as to form a plurality of image division sequences and a plurality of speech division sequences;
      
      deriving mapping relations between the image division sequences and the speech division sequences, so as to generate the image-and-speech relation parameter corresponding to the any one of the image-and-speech relation modules;
      
      recording a feature extraction mode of the training image and the training speech as a feature extraction parameter of the any one of the image-and-speech relation modules; and
      
      recording the feature extraction parameter and the image-and-speech relation parameter to form the any one of the image-and-speech relation modules.
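
The module-building process in claim 17 can be sketched roughly as follows. The claim does not define the "division manners" or how mapping relations are derived, so this example assumes that each division manner is a split into a given number of contiguous parts and that a mapping relation is the normalized time overlap between image parts and speech parts. All function names, the `part_counts` parameter, and the returned dictionary layout are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def divide(seq: Sequence[float], parts: int) -> List[List[float]]:
    """Split seq into `parts` contiguous, near-equal segments."""
    n = len(seq)
    bounds = [i * n // parts for i in range(parts + 1)]
    return [list(seq[bounds[i]:bounds[i + 1]]) for i in range(parts)]


def overlap_mapping(m: int, n: int) -> List[List[float]]:
    """Row i gives the share of image segment i that overlaps each of the
    n speech segments on a normalized [0, 1] timeline (assumed measure)."""
    table = []
    for i in range(m):
        row = []
        for j in range(n):
            lo = max(i / m, j / n)
            hi = min((i + 1) / m, (j + 1) / n)
            row.append(max(0.0, hi - lo))
        total = sum(row)  # always 1/m, so never zero
        table.append([w / total for w in row])
    return table


def train_module(image_seq: Sequence[float],
                 speech_seq: Sequence[float],
                 part_counts: Sequence[int] = (1, 2, 3)) -> Dict[str, dict]:
    """Build a toy relation module from one paired training example."""
    image_divs = {m: divide(image_seq, m)
                  for m in part_counts if m <= len(image_seq)}
    speech_divs = {n: divide(speech_seq, n)
                   for n in part_counts if n <= len(speech_seq)}
    relation: Dict[Tuple[int, int], List[List[float]]] = {
        (m, n): overlap_mapping(m, n)
        for m in image_divs for n in speech_divs
    }
    return {"image_divisions": image_divs,
            "speech_divisions": speech_divs,
            "relation": relation}
```

A trained model in practice would estimate these mappings from many training pairs; the overlap table here only shows the shape of the data that a relation parameter might hold.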

Current Assignee: Institute For Information Industry
Original Assignee: Institute For Information Industry
Inventors: Wu, Chung-Hsien; Chu, Chia-Te; Hsu, Chin-Shun; Lin, Jen-Chun; Wei, Wen-Li; Lin, Red-Tom
Primary Examiner(s): SIMPSON, LIXI CHOW
Application Number: US12/634,148
Publication Number: US 20110109539A1
Time in Patent Office: 1,315 Days
Field of Search: None
US Class Current: 345/156
CPC Class Codes:
G06F 2203/011   Emotion or mood input deter...
G06F 2203/0381   Multimodal input, i.e. inte...
G06F 3/011   Arrangements for interactio...
G06F 3/017   Gesture based interaction, ...
G06F 3/038   Control and interface arran...
G06V 40/20   Movements or behaviour, e.g...
G10L 15/24   Speech recognition using no...
