System for grasping keyword extraction based speech content on recorded voice data, indexing method using the system, and method for grasping speech content
First Claim
1. A system for grasping speech content, comprising:
an indexing unit, executed by a processor, for receiving voice data, performing per-frame voice recognition with reference to a phoneme to form a phoneme lattice, and generating divided indexing information for a frame of a limited time configured with a plurality of frames, the divided indexing information including a phoneme lattice formed for each frame of the limited time;
an indexing database, executed by a processor, for storing the divided indexing information generated by the indexing unit so as to be indexed by respective divided indexing information;
a searcher, executed by a processor, for using a keyword input by a user as a search word, performing a comparison on the divided indexing information stored in the indexing database with reference to a phoneme, and searching a phoneme string matching the search word; and
a grasper, executed by a processor, for grasping a representative word through a search result searched by the searcher and outputting it to the user so as to retrieve and display on a display device, speech content of the voice data corresponding to the keyword input together with the keyword input,
wherein the indexing unit includes:
a featuring vector extractor for extracting a featuring vector from per-frame voice data;
a phoneme recognizer for performing phoneme recognition with reference to frame synchronization by use of the featuring vector extracted by the featuring vector extractor, and generating a phoneme string;
a candidate group forming unit for receiving the phoneme string generated by the phoneme recognizer, and generating candidate groups of phoneme recognition with respect to time for each frame;
a phoneme lattice forming unit for performing an operation in reverse order of time on the phoneme string candidate groups formed by the candidate group forming unit to select one phoneme string candidate group and form a corresponding phoneme lattice; and
an indexing controller for controlling the featuring vector extractor, the phoneme recognizer, the candidate group forming unit, and the phoneme lattice forming unit to perform control so as to form a phoneme based lattice for each limited time for the entire voice data and for each frame within the limited time and to perform control so as to store the phoneme lattice formed in this manner in the indexing database as divided indexing information for each limited time and thereby allow the same to be indexed for each limited time.
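The indexing steps recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the `switch_penalty` parameter, and the representation of each frame as a dict of phoneme log-scores are all assumptions made for illustration. The forward pass keeps the candidate groups per frame, the backward pass walks them in reverse order of time to select one phoneme string, and the result is stored per limited-time segment.

```python
def best_phoneme_path(frame_scores, switch_penalty=1.0):
    """Select one phoneme per frame with a Viterbi-style search.

    frame_scores: list of dicts {phoneme: log_score}, one dict per frame
    (an assumed stand-in for the candidate groups of phoneme recognition).
    """
    # Forward pass: for each frame and phoneme keep (best total, back-pointer).
    prev = {p: (s, None) for p, s in frame_scores[0].items()}
    trellis = [prev]
    for scores in frame_scores[1:]:
        cur = {}
        for p, s in scores.items():
            best_q, best_total = None, float("-inf")
            for q, (total, _) in prev.items():
                cand = total + s - (0.0 if q == p else switch_penalty)
                if cand > best_total:
                    best_q, best_total = q, cand
            cur[p] = (best_total, best_q)
        trellis.append(cur)
        prev = cur
    # Backward pass: follow back-pointers in reverse order of time.
    last = max(trellis[-1], key=lambda p: trellis[-1][p][0])
    path = [last]
    for t in range(len(trellis) - 1, 0, -1):
        last = trellis[t][last][1]
        path.append(last)
    path.reverse()
    return path


def build_divided_index(frame_scores, frames_per_segment):
    """Store one lattice per limited-time segment, keyed by segment number."""
    index = {}
    for start in range(0, len(frame_scores), frames_per_segment):
        segment = frame_scores[start:start + frames_per_segment]
        index[start // frames_per_segment] = {
            "start_frame": start,
            # Per-frame candidates, best-scoring first (the assumed lattice).
            "lattice": [sorted(f, key=f.get, reverse=True) for f in segment],
            "best_path": best_phoneme_path(segment),
        }
    return index
```

Keying the index by segment number is one simple way to let each limited-time portion be looked up on its own, which is the property the claim attributes to the divided indexing information.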
Abstract
Disclosed are a system for grasping keyword extraction based speech content on recorded voice data, an indexing method using the system, and a method for grasping speech content. An indexing unit receives voice data, performs per-frame voice recognition with reference to a phoneme to form a phoneme lattice, generates divided indexing information for a frame of a limited time configured with a plurality of frames, and stores the same in an indexing database, the divided indexing information including a phoneme lattice formed for each frame of the limited time. A searcher uses a keyword input by a user as a search word, performs a comparison on the divided indexing information stored in the indexing database with reference to a phoneme, searches for a phoneme string matching the search word, and finds the voice portion corresponding to the search word through a precise acoustic analysis of the matching phoneme string. A grasper grasps a representative word through the search result returned by the searcher and outputs it to the user so as to grasp speech content of the voice data.
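The phoneme-level comparison performed by the searcher might look roughly like the following Python sketch. The data layout (a dict mapping a segment number to a list of per-slot candidate phonemes) and both function names are assumptions, not details taken from the patent, and the precise acoustic analysis that follows the match is not modeled here.

```python
def find_phoneme_matches(lattice, pronunciation):
    """Return slot offsets where every phoneme of the search word
    appears among the candidates of consecutive lattice slots."""
    hits = []
    n = len(pronunciation)
    for start in range(len(lattice) - n + 1):
        if all(pronunciation[i] in lattice[start + i] for i in range(n)):
            hits.append(start)
    return hits


def search_index(index, pronunciation):
    """Scan every limited-time segment; return (segment, offset) pairs."""
    results = []
    for segment_id, lattice in index.items():
        for start in find_phoneme_matches(lattice, pronunciation):
            results.append((segment_id, start))
    return results
```

For simplicity this sketch does not find matches that straddle a segment boundary; a fuller version would need overlapping segments or a carry-over between neighbours.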
14 Citations
17 Claims
1. A system for grasping speech content, as set out under First Claim above. Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14.
15. In a method for a speech content grasping system to grasp speech content of voice data, a method for grasping speech content comprising:
receiving a search word from a user;
generating a pronunciation string with reference to a phoneme corresponding to the search word;
searching a phoneme string matching divided indexing information stored in an indexing database by use of the pronunciation string, and selecting a voice section of a first candidate, the indexing database storing the phoneme lattice formed by performing voice recognition on the voice data for each frame with reference to a phoneme as divided indexing information for a plurality of respective frames of a limited time;
determining a matching state of the voice section of the first candidate through an acoustic model to determine one voice section;
additionally searching whether there is a phoneme string matching a pronunciation string of a context keyword corresponding to the search word within a predetermined time range with reference to the one voice section; and
grasping a representative word for the voice data through the search word and the context keyword, by retrieving and displaying on a display device, speech content of the voice data corresponding to the keyword input, together with the keyword input to the user;
wherein the phoneme lattice formed by performing voice recognition on the voice data for each frame with reference to a phoneme includes the steps of:
extracting a featuring vector from per-frame voice data;
performing phoneme recognition with reference to frame synchronization by use of the featuring vector extracted and generating a phoneme string;
receiving the phoneme string generated and generating candidate groups of phoneme recognition with respect to time for each frame; and
performing an operation in reverse order of time on the phoneme string candidate groups generated to select one phoneme string candidate group and form a corresponding phoneme lattice;
performing control so as to form a phoneme based lattice for each limited time for the entire voice data and for each frame within the limited time and to perform control so as to store the phoneme lattice formed in this manner in the indexing database as divided indexing information for each limited time and thereby allow the same to be indexed for each limited time.
Dependent claims: 16, 17.
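The context keyword step of claim 15 can be pictured with a short Python sketch. Everything named here is hypothetical: hit times are assumed to be in seconds, `context_hits` is assumed to map each context keyword to the times at which its pronunciation string was found, and `topic_table` is an invented lookup from representative words to the keywords that suggest them. The patent does not specify how the representative word is chosen, so the overlap count below is only one plausible rule.

```python
def nearby_context_keywords(anchor_time, context_hits, window_seconds):
    """List context keywords found within the time range around the
    voice section determined for the search word."""
    found = []
    for keyword, times in context_hits.items():
        if any(abs(t - anchor_time) <= window_seconds for t in times):
            found.append(keyword)
    return sorted(found)


def representative_word(search_word, found_context, topic_table):
    """Pick the topic sharing the most keywords with what was observed;
    fall back to the search word itself when nothing overlaps."""
    observed = {search_word, *found_context}
    best_topic, best_overlap = search_word, 0
    for topic, keywords in topic_table.items():
        overlap = len(observed & set(keywords))
        if overlap > best_overlap:
            best_topic, best_overlap = topic, overlap
    return best_topic
```

For example, a search word "cancel" found at 10 seconds with the context keyword "refund" found at 12 seconds would, under a 30 second window and a table that lists both words under "refund request", yield "refund request" as the representative word.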
Specification