Automatic generation of a database for speech recognition from video captions
Abstract
A system and method for automatic generation of a database for speech recognition, comprising: a source of text signals; a source of audio signals comprising an audio representation of said text signals; a text words separation module configured to separate said text into a string of text words; an audio words separation module configured to separate said audio signal into a string of audio words; and a matching module configured to receive said string of text words and said string of audio words and store each pair of matching text word and audio word in a database.
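The module decomposition described in the abstract can be sketched in a few lines of Python. This is a minimal illustration only, not the patented implementation: the whitespace word splitting, the equal-length audio segmentation, and the in-memory dictionary standing in for the database are all simplifying assumptions.

```python
# Illustrative sketch of the abstract's pipeline: a text signal and a
# synchronized audio signal are each split into word strings, then the
# matching module pairs them up and stores the pairs in a database.

def separate_text_words(text):
    """Text words separation module: split caption text into words."""
    return text.split()

def separate_audio_words(audio, n_words):
    """Audio words separation module (stub): cut the audio signal into
    equal-length segments, one per word. A real system would segment on
    silence or energy boundaries instead."""
    seg = max(1, len(audio) // n_words)
    return [audio[i * seg:(i + 1) * seg] for i in range(n_words)]

def match_and_store(text, audio, database):
    """Matching module: pair each text word with its audio word and
    store the matched pair in the database."""
    text_words = separate_text_words(text)
    audio_words = separate_audio_words(audio, len(text_words))
    for word, samples in zip(text_words, audio_words):
        database.setdefault(word, []).append(samples)
    return database

db = {}
match_and_store("hello world", [0.1] * 10, db)
# db now maps each caption word to one buffered audio segment
```

Keying the database by text word, as above, lets later recognition lookups retrieve every stored audio rendition of a given word.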
9 Citations
3 Claims
1. A system for automatic generation of a database for speech recognition, comprising:
a text subsystem;
an audio subsystem configured to operate in synchronization with said text subsystem;
a matching module; and
a database of matching audio signals and text words;
wherein said text subsystem comprises:
a source of video frames comprising text;
a text detection module configured to receive a first video frame, detect the text therein by looking for text patterns and generate a first timestamp if the detected text in said first video frame is different than text detected in a previous video frame, said text detection module further configured to receive a second video frame, detect the text therein by looking for text patterns and generate a second timestamp if the detected text in said second video frame is different than text detected in said first video frame; and
an Optical Character Recognition module configured to produce a string of text words representing said detected text;
wherein said audio subsystem comprises:
a source of audio signals comprising an audio representation of said detected text;
an audio buffering module configured to receive and store said audio signal between said first and second timestamps; and
an audio words separation module configured to separate said stored audio signal into a string of audio words;
said matching module configured to receive said string of text words and said string of audio words and store each pair of matching text word and audio word in said database.
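The text detection behavior claimed above, emitting a timestamp only when the caption text changes between frames, might look like the following sketch. OCR is stubbed out: representing each frame as a (time, detected_text) pair is an assumption for illustration.

```python
# Sketch of the claimed text detection module: a timestamp is generated
# only when the caption detected in the current frame differs from the
# caption detected in the previous frame.

class TextDetectionModule:
    def __init__(self):
        self.previous_text = None  # caption seen in the previous frame

    def process_frame(self, frame_time, detected_text):
        """Return frame_time as a new timestamp if the detected text
        changed since the previous frame, else None."""
        if detected_text != self.previous_text:
            self.previous_text = detected_text
            return frame_time
        return None

detector = TextDetectionModule()
frames = [(0.0, "hello"), (0.04, "hello"), (0.08, "world")]
timestamps = [t for t in (detector.process_frame(ft, txt) for ft, txt in frames)
              if t is not None]
# timestamps == [0.0, 0.08]: the caption changed at the first and third frames
```

Consecutive timestamps produced this way delimit the audio interval that the audio buffering module stores for each caption.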
2. A method of automatic generation of a database for speech recognition, comprising:
a. producing in synchronization a string of text words and a corresponding string of audio words;
b. matching pairs of text word and audio word in said respective strings; and
c. storing said matched pairs in a database;
wherein said producing in synchronization a string of text words and a corresponding string of audio words comprises:
(i) receiving a first video frame comprising text;
(ii) detecting the text in said first video frame by looking for text patterns;
(iii) generating a first timestamp if the text detected in said first video frame is different than text detected in a previous video frame and storing said generated first timestamp in an audio signals buffer;
(iv) producing a string of text words representing said detected text;
(v) receiving a second video frame comprising text;
(vi) detecting the text in said second video frame by looking for text patterns;
(vii) generating a second timestamp if the text detected in said second video frame is different than text detected in said first video frame;
(viii) receiving audio signals comprising an audio representation of said detected text between said first and second timestamps;
(ix) storing said received audio signals and said second timestamp in said buffer; and
(x) separating said audio signal stored in said buffer between said first and second timestamps into a string of audio words.
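The method steps above can be sketched end to end as follows. This is an illustration under simplifying assumptions, not the patented implementation: OCR and audio capture are simulated with plain Python data, and the audio-word separation of step (x) is an even split per word rather than a real silence-based segmentation.

```python
# End-to-end sketch of the claimed method: walk video frames, note when
# the detected caption changes (the timestamps), buffer the audio that
# arrives between consecutive timestamps, then split each buffered
# segment into audio words matched to the caption's text words.

def build_database(frames, database):
    """frames: list of (caption_text, audio_samples_for_frame)."""
    prev_caption = None
    buffer = []                                   # audio signals buffer, steps (iii)/(ix)
    for caption, audio in frames + [(None, [])]:  # sentinel flushes the last caption
        if caption != prev_caption:               # caption changed: steps (ii), (vi)-(vii)
            if prev_caption is not None and buffer:
                words = prev_caption.split()      # step (iv): string of text words
                seg = max(1, len(buffer) // len(words))
                for i, w in enumerate(words):     # step (x) plus matching and storing
                    database.setdefault(w, []).append(buffer[i * seg:(i + 1) * seg])
            buffer = []
            prev_caption = caption
        buffer.extend(audio)                      # step (viii): audio between timestamps
    return database

db = build_database([("hi there", [1, 2]),
                     ("hi there", [3, 4]),
                     ("bye", [5, 6])], {})
# db == {"hi": [[1, 2]], "there": [[3, 4]], "bye": [[5, 6]]}
```

Note that frames repeating the same caption simply extend the buffer, so each caption accumulates all audio shown while it was on screen before being split into words.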
3. A non-transitory computer-readable medium encoding instructions that, when executed by data processing apparatus, cause the data processing apparatus to perform operations comprising:
a. producing in synchronization a string of text words and a corresponding string of audio words;
b. matching pairs of text word and audio word in said respective strings; and
c. storing said matched pairs in a database;
wherein said producing in synchronization a string of text words and a corresponding string of audio words comprises:
(i) receiving a first video frame comprising text;
(ii) detecting the text in said first video frame by looking for text patterns;
(iii) generating a first timestamp if the text detected in said first video frame is different than text detected in a previous video frame and storing said generated first timestamp in an audio signals buffer;
(iv) producing a string of text words representing said detected text;
(v) receiving a second video frame comprising text;
(vi) detecting the text in said second video frame by looking for text patterns;
(vii) generating a second timestamp if the text detected in said second video frame is different than text detected in said first video frame;
(viii) receiving audio signals comprising an audio representation of said detected text between said first and second timestamps;
(ix) storing said received audio signals and said second timestamp in said buffer; and
(x) separating said audio signal stored in said buffer between said first and second timestamps into a string of audio words.
Specification