Apparatus and method for speech segment detection and system for speech recognition

US 7,860,718 B2
Filed: 12/04/2006
Issued: 12/28/2010
Est. Priority Date: 12/08/2005
Status: Active Grant

Abstract

Provided are an apparatus and method for speech segment detection, and a system for speech recognition. The apparatus is equipped with a sound receiver and an image receiver and includes: a lip motion signal detector for detecting a motion region from image frames output from the image receiver, applying lip motion image feature information to the detected motion region, and detecting a lip motion signal; and a speech segment detector for detecting a speech segment using sound frames output from the sound receiver and the lip motion signal detected from the lip motion signal detector. Since lip motion image information is checked in a speech segment detection process, it is possible to prevent dynamic noise from being misrecognized as speech.

12 Claims

1. An apparatus for speech segment detection including a sound receiver and an image receiver, the apparatus comprising:
- a lip motion signal detector for detecting a motion region from image frames output from the image receiver, applying lip motion image feature information to the detected motion region, and detecting a lip motion signal for determining whether or not a sound frame is a speech frame; and
- a speech segment detector for detecting a speech segment using sound frames output from the sound receiver and the lip motion signal detected from the lip motion signal detector, the speech segment detector determining whether the sound frame is a potential speech frame or is stationary noise, and when it is determined that the sound frame is a potential speech frame, determining whether the potential speech frame is a speech frame or dynamic noise according to the lip motion signal.

2. The apparatus of claim 1, wherein the lip motion signal detector compares the image frames output from the image receiver with each other, detects a motion region, obtains information on an area, width, length, and position of the detected motion region, compares the obtained features of the motion region with previously stored lip motion image feature information, and detects the lip motion signal.

3. The apparatus of claim 1, wherein the speech segment detector determines whether or not each sound frame input from the sound receiver is a potential speech frame using absolute energy and zero-crossing rate of the sound frame, determines whether or not the lip motion signal is detected from the image frames at a point of time when the determined potential speech frame is detected, and detects the speech segment.
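
As an illustration of the potential-speech test recited in claim 3, the following minimal Python sketch computes the absolute energy and zero-crossing rate of one sound frame. The claims give no numeric thresholds or decision rule, so `ENERGY_THRESHOLD`, `ZCR_THRESHOLD`, and the rule "high energy and low zero-crossing rate" are assumptions chosen only for this example, as are all function names.

```python
from typing import Sequence

# Hypothetical thresholds: the claims do not specify any numeric values.
ENERGY_THRESHOLD = 0.01
ZCR_THRESHOLD = 0.5


def absolute_energy(frame: Sequence[float]) -> float:
    """Mean absolute amplitude of the samples in one sound frame."""
    if not frame:
        return 0.0
    return sum(abs(s) for s in frame) / len(frame)


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def is_potential_speech(frame: Sequence[float]) -> bool:
    """Assumed rule: enough energy and not too many zero crossings."""
    return (
        absolute_energy(frame) > ENERGY_THRESHOLD
        and zero_crossing_rate(frame) < ZCR_THRESHOLD
    )
```

A frame that passes this test is only a candidate. Under claim 1, the lip motion signal still decides whether it is a speech frame or dynamic noise.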

4. A method for speech segment detection in a speech recognition system including a sound receiver and an image receiver, the method comprising the steps of:
- removing stationary noise from a sound frame output from the sound receiver, and determining whether or not the sound frame from which the noise is removed is a potential speech frame;
- when it is determined that the sound frame is a potential speech frame, determining whether or not a lip motion signal is detected from image frames at a point of time when the potential speech frame is detected;
- when it is determined that the lip motion signal is detected from the image frames, determining that the potential speech frame is a speech frame, storing the speech frame, and determining whether or not the number of speech frames is at least a predetermined number; and
- when it is determined that the number of speech frames is at least the predetermined number, detecting the speech frames as a speech segment.

5. The method of claim 4, wherein in the step of removing stationary noise from a sound frame output from the sound receiver, low-pass filtering is performed for the sound frame and a high frequency component is removed.

6. The method of claim 4, wherein in the step of determining whether or not the sound frame from which noise is removed is a potential speech frame, a level of absolute energy of the sound frame from which the noise is removed and a zero-crossing rate of the sound frame are analyzed, and it is determined whether the sound frame is a potential speech frame or a noise frame.

7. The method of claim 4, wherein the step of determining whether or not a lip motion signal is detected from image frames at a point of time when the potential speech frame is detected comprises the steps of:
- respectively comparing pixel values of a previous frame with pixel values of a current frame among the continuously received image frames, and detecting a motion region;
- obtaining information on an area, width, length, and location of each detected motion region; and
- applying lip motion image feature information to the obtained features of the motion region, determining whether or not the motion region is a lip motion region, and generating the lip motion signal according to the determination result.

8. The method of claim 7, wherein the lip motion image feature information comprises a shape of lips and a change in the shape.

9. The method of claim 7, wherein the step of applying lip motion image feature information to the obtained features of the motion region, determining whether or not the motion region is a lip motion region, and generating the lip motion signal according to the determination result comprises the steps of:
- comparing the obtained features of the motion region with the lip motion image feature information and calculating a degree of similarity; and
- when the calculated degree of similarity is at least a predetermined value, determining that the motion region is the lip motion region and generating the lip motion signal.

10. The method of claim 4, when it is determined that a lip motion signal is not detected from the image frames, further comprising the step of determining that the potential speech frame is dynamic noise.
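
The frame-by-frame flow of claims 4 and 10 can be sketched roughly as follows. This is a simplified illustration, not the patented implementation: it takes the per-frame decisions (potential speech, lip motion detected) as precomputed booleans, and it assumes that the "predetermined number" of speech frames must be consecutive, which the claim text does not state. `MIN_SPEECH_FRAMES` and the function name are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical value for the claim's "predetermined number" of speech frames.
MIN_SPEECH_FRAMES = 3


def detect_speech_segments(
    potential_speech: Sequence[bool],
    lip_motion: Sequence[bool],
    min_frames: int = MIN_SPEECH_FRAMES,
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Return (segments, dynamic_noise_frames).

    segments holds inclusive (start, end) frame indices of detected speech
    segments; dynamic_noise_frames holds indices of potential speech frames
    for which no lip motion signal was detected (claim 10).
    """
    segments: List[Tuple[int, int]] = []
    dynamic_noise: List[int] = []
    run_start = None  # index where the current run of speech frames began
    count = min(len(potential_speech), len(lip_motion))

    for i in range(count):
        if potential_speech[i] and lip_motion[i]:
            # Potential speech confirmed by lip motion: a speech frame.
            if run_start is None:
                run_start = i
            continue
        if potential_speech[i]:
            # Sound without lip motion is treated as dynamic noise.
            dynamic_noise.append(i)
        # Any non-speech frame closes the current run (assumed behavior).
        if run_start is not None and i - run_start >= min_frames:
            segments.append((run_start, i - 1))
        run_start = None

    # Close a run that extends to the last frame.
    if run_start is not None and count - run_start >= min_frames:
        segments.append((run_start, count - 1))
    return segments, dynamic_noise
```

For example, with `min_frames=3`, a run of three confirmed speech frames is reported as a segment, while a run of two is discarded.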

11. A system for speech recognition, comprising:
- a sound receiver for converting a sound signal input by a user into a digital signal and framing the digital signal;
- an image receiver for framing an image signal obtained by an image recorder;
- a lip motion signal detector for detecting a motion region from the image frames output from the image receiver, applying lip motion image feature information to the detected motion region, and detecting a lip motion signal;
- a speech segment detector for determining whether the sound frame output from the sound receiver is a potential speech frame or is stationary noise, and when it is determined that the sound frame is a potential speech frame, detecting a speech segment according to the lip motion signal detected by the lip motion signal detector;
- a feature vector extractor for extracting a feature vector from the speech segment detected by the speech segment detector; and
- a speech recognizer for performing speech recognition using the feature vector extracted by the feature vector extractor to convert the sound signal to characters.

12. The system of claim 11, wherein the image recorder is a camera.
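
To make the lip motion signal detector recited in claim 11 more concrete, here is a minimal Python sketch that compares two grayscale image frames, derives area, width, length, and position for the changed pixels, and scores them against stored reference features. Several things here are assumptions for illustration only: all changed pixels are treated as a single motion region, the similarity measure (mean of per-feature min/max ratios) is invented for this example, and `DIFF_THRESHOLD`, `SIMILARITY_THRESHOLD`, and every function name are hypothetical.

```python
from typing import Dict, List, Optional

Frame = List[List[int]]  # grayscale image as rows of pixel intensities

# Hypothetical thresholds: the claims specify no numeric values.
DIFF_THRESHOLD = 30
SIMILARITY_THRESHOLD = 0.8


def motion_region(prev: Frame, curr: Frame,
                  diff_threshold: int = DIFF_THRESHOLD
                  ) -> Optional[Dict[str, float]]:
    """Features of the bounding box around all pixels that changed."""
    changed = [
        (r, c)
        for r, (prev_row, curr_row) in enumerate(zip(prev, curr))
        for c, (p, q) in enumerate(zip(prev_row, curr_row))
        if abs(q - p) > diff_threshold
    ]
    if not changed:
        return None
    rows = [r for r, _ in changed]
    cols = [c for _, c in changed]
    top, bottom, left, right = min(rows), max(rows), min(cols), max(cols)
    return {
        "area": float(len(changed)),
        "width": float(right - left + 1),
        "length": float(bottom - top + 1),
        "x": (left + right) / 2,
        "y": (top + bottom) / 2,
    }


def similarity(features: Dict[str, float],
               reference: Dict[str, float]) -> float:
    """Assumed measure: mean min/max ratio over the reference features."""
    scores = []
    for key, ref in reference.items():
        high = max(features[key], ref)
        scores.append(1.0 if high == 0 else min(features[key], ref) / high)
    return sum(scores) / len(scores)


def lip_motion_signal(prev: Frame, curr: Frame,
                      reference: Dict[str, float]) -> bool:
    """True when the motion region is similar enough to the reference."""
    region = motion_region(prev, curr)
    return (region is not None
            and similarity(region, reference) >= SIMILARITY_THRESHOLD)
```

A practical detector would likely separate multiple motion regions and use features that capture lip shape and its change over time. This sketch only shows the compare, measure, and threshold sequence.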

Current Assignee: Hyundai Motor Company (Hyundai Motor Group), Kia Corp.
Original Assignee: Electronics and Telecommunications Research Institute
Inventors: Kim, Eung Kyeu; Lee, Young Jik; Lee, Soo Jong; Kim, Sang Hun
Primary Examiner(s): Chawan, Vijay B
Application Number: US11/633,270
Publication Number: US 20070136071A1
Time in Patent Office: 1,485 Days
Field of Search: 704/270, 704/276, 704/208, 704/253, 704/241, 704/254, 704/223, 704/211, 704/233
US Class Current: 704/276
CPC Class Codes:
G10L 15/25 using position of the lips,...
G10L 25/87 Detection of discrete point...