System for grasping keyword extraction based speech content on recorded voice data, indexing method using the system, and method for grasping speech content

US 10,304,441 B2
Filed: 09/18/2014
Issued: 05/28/2019
Est. Priority Date: 11/06/2013
Status: Active Grant

Abstract

Disclosed are a system for grasping keyword extraction based speech content on recorded voice data, an indexing method using the system, and a method for grasping speech content. An indexing unit receives voice data, performs per-frame voice recognition with reference to a phoneme to form a phoneme lattice, generates divided indexing information for a frame of a limited time configured with a plurality of frames, and stores the same in an indexing database, the divided indexing information including a phoneme lattice formed for each frame of the limited time. A searcher uses a keyword input by a user as a search word, performs a comparison on the divided indexing information stored in the indexing database with reference to a phoneme, searches a phoneme string matching the search word, and finds a voice portion corresponding to a search word through a precise acoustic analysis regarding the matching phoneme string, and the grasper grasps a representative word through a search result searched by the searcher and outputs it to the user so as to grasp speech content of the voice data.
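
The "divided indexing information for a frame of a limited time" can be pictured as cutting the frame sequence into fixed-length windows, each of which gets its own phoneme lattice. The following is a minimal sketch of that division only, not the patented implementation. The function name, the window length in frames, and the optional overlap parameter (modeled on the overlap described in claim 3) are illustrative assumptions.

```python
from typing import List, Tuple


def split_into_windows(
    num_frames: int, window_frames: int, overlap_frames: int = 0
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, covering all frames.

    window_frames plays the role of the "limited time"; overlap_frames is a
    hypothetical parameter that re-uses the tail of the previous window.
    """
    if window_frames <= 0:
        raise ValueError("window_frames must be positive")
    if not 0 <= overlap_frames < window_frames:
        raise ValueError("overlap_frames must be in [0, window_frames)")

    step = window_frames - overlap_frames
    windows: List[Tuple[int, int]] = []
    start = 0
    while start < num_frames:
        end = min(start + window_frames, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break  # the last window reached the end of the voice data
        start += step
    return windows
```

Each returned range would correspond to one unit of divided indexing information stored in the indexing database, so a search can address a single limited-time segment rather than the whole recording.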

17 Claims

1. A system for grasping speech content, comprising:
- an indexing unit, executed by a processor, for receiving voice data, performing per-frame voice recognition with reference to a phoneme to form a phoneme lattice, and generating divided indexing information for a frame of a limited time configured with a plurality of frames, the divided indexing information including a phoneme lattice formed for each frame of the limited time;
  
  an indexing database, executed by a processor, for storing the divided indexing information generated by the indexing unit so as to be indexed by respective divided indexing information;
  
  a searcher, executed by a processor, for using a keyword input by a user as a search word, performing a comparison on the divided indexing information stored in the indexing database with reference to a phoneme, and searching a phoneme string matching the search word; and
  
  a grasper, executed by a processor, for grasping a representative word through a search result searched by the searcher and outputting it to the user so as to retrieve and display on a display device, speech content of the voice data corresponding to the keyword input together with the keyword input, wherein the indexing unit includes:
  
  a featuring vector extractor for extracting a featuring vector from per-frame voice data;
  
  a phoneme recognizer for performing phoneme recognition with reference to frame synchronization by use of the featuring vector extracted by the featuring vector extractor, and generating a phoneme string;
  
  a candidate group forming unit for receiving the phoneme string generated by the phoneme recognizer, and generating candidate groups of phoneme recognition with respect to time for each frame;
  
  a phoneme lattice forming unit for performing an operation in reverse order of time on the phoneme string candidate groups formed by the candidate group forming unit to select one phoneme string candidate group and form a corresponding phoneme lattice; and
  
  an indexing controller for controlling the featuring vector extractor, the phoneme recognizer, the candidate group forming unit, and the phoneme lattice forming unit to perform control so as to form a phoneme based lattice for each limited time for the entire voice data and for each frame within the limited time and to perform control so as to store the phoneme lattice formed in this manner in the indexing database as divided indexing information for each limited time and thereby allow the same to be indexed for each limited time.
- - 2. The system of claim 1, wherein the indexing controller includes:
    - a voice detector for indicating whether a voice is detected from voice data;
      
      a time counter for counting a temporal position of a phoneme for performing phoneme recognition on the voice data;
      
      a time limiter for counting the time starting from a time when a corresponding voice section is detected by the time counter when the voice section is detected by the voice detector, and counting the limited time; and
      
      an operation controller for performing control for performing per-frame phoneme recognition on a valid voice section detected by the voice detector within a limited time counted by the time limiter, forming a phoneme lattice, and storing the same in the indexing database as divided indexing information.
  - 3. The system of claim 2, wherein the operation controller performs control so as to overlap a voice section of a specific time or a specific frame from among a previous voice section and perform phoneme recognition with a voice section corresponding to a new limited time starting from the corresponding frame when the limited time counted by the time limiter lapses regarding the valid voice section detected by the voice detector.
  - 4. The system of claim 3, wherein the searcher includes:
    - a search result history database for storing a search result searched by the searcher, and when there is a search result after having processed the search word input by the user, transmitting the same to the grasper;
      
      a pronunciation string generator for generating a per-phoneme pronunciation string corresponding to the search word input by the user;
      
      a search word database for storing the search word and a plurality of context keywords corresponding to the search word;
      
      a dynamic time warping processor for searching a phoneme string matching divided indexing information stored in the indexing database by using a pronunciation string generated by the pronunciation string generator, and selecting a voice section of a first candidate; and
      
      a verifier for determining a matching state on the voice section of the first candidate selected by the dynamic time warping processor through an acoustic model to determine one voice section, storing the determined voice section and information relating to the voice section in the search result history database, and simultaneously outputting the same to the grasper.
  - 5. The system of claim 4, wherein the dynamic time warping processor determines whether the phoneme string of the divided indexing information matches the pronunciation string through the dynamic time warping algorithm, and in the case of determination through the dynamic time warping algorithm, it determines that they match each other when their time warping degree is equal to or greater than a threshold value.
  - 6. The system of claim 4, wherein regarding the voice section that has become a candidate, the verifier allocates the voice section for each frame according to the phoneme string with state information of a phoneme model with reference to a tri-phone model for the phoneme string of the search word, finds an accumulated value of a ratio of an observation probability value on the tri-phone model and an observation probability value on a mono-phone model, normalizes the same to calculate a reliability value, and determines whether to output the same as a finally searched result for the voice section according to the normalized reliability value.
  - 7. The system of claim 4, wherein when the one voice section is determined, the verifier performs an additional search to find whether there is a phoneme string matching the pronunciation string of the context keyword extracted from the search database corresponding to the search word within a predetermined time range with respect to the one voice section.
  - 8. The system of claim 4, wherein information relating to the voice section stored in the search result history database includes a file name including the one voice section, a starting position and an ending position of voice data, the search word, a normalized reliability value on a searched section, a matching context keyword, and a sex of a speaker.
  - 9. The system of claim 4, wherein the grasper includes:
    - a representative word database for setting a representative word for each search word and context keyword corresponding to the search word;
      
      a representative word grasper for extracting a search word and a context keyword from among search result information output by the searcher, and searching a corresponding representative word through the representative word database; and
      
      an output unit for receiving the search result information and the representative word from the representative word grasper and displaying the same to the user.
  - 10. The system of claim 4, wherein the context keyword sets a plurality of words carrying a same meaning according to a category method.
  - 11. The system of claim 4, wherein the searcher generates a pronunciation string for a byname with the same meaning as the search word and simultaneously searches the same.
  - 12. The system of claim 1, wherein the phoneme recognizer performs a Viterbi algorithm and a token passing algorithm for each phoneme to generate a corresponding phoneme string.
  - 13. The system of claim 1, wherein the phoneme lattice forming unit forms information including a starting point and an ending point of a phoneme and a duration of the corresponding phoneme string.
  - 14. The system of claim 1, wherein the divided indexing information includes a number of frames, a number of phonemes, a featuring vector, observation probability values for respective states of phonemes, a time stamp, a phoneme string, and durations of respective phonemes.
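
Claims 4 and 5 describe matching a pronunciation string against indexed phoneme strings with dynamic time warping and accepting a match when a "time warping degree" meets a threshold. The claims do not define that degree, so the sketch below makes several assumptions: phonemes are plain string labels, the local cost is 0 for equal labels and 1 otherwise, and the degree is a similarity score of 1 minus the DTW distance divided by the longer sequence length. All function names and the default threshold are hypothetical.

```python
from typing import List, Sequence, Tuple


def dtw_distance(query: Sequence[str], candidate: Sequence[str]) -> float:
    """Classic DTW distance with a 0/1 local cost between phoneme labels."""
    n, m = len(query), len(candidate)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if query[i - 1] == candidate[j - 1] else 1.0
            # Extend the cheapest of: stretch query, stretch candidate, advance both.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def match_score(query: Sequence[str], candidate: Sequence[str]) -> float:
    """Assumed similarity in [0, 1]; 1.0 means the strings align perfectly."""
    if not query or not candidate:
        return 0.0
    return 1.0 - dtw_distance(query, candidate) / max(len(query), len(candidate))


def find_candidates(
    query: Sequence[str], indexed: Sequence[str], threshold: float = 0.8
) -> List[Tuple[int, float]]:
    """Slide a query-length window over an indexed phoneme string and
    keep (start_index, score) pairs whose score meets the threshold."""
    width = len(query)
    hits: List[Tuple[int, float]] = []
    for start in range(max(len(indexed) - width + 1, 0)):
        score = match_score(query, indexed[start:start + width])
        if score >= threshold:
            hits.append((start, score))
    return hits
```

Because DTW lets one query phoneme align with several consecutive candidate entries, a phoneme that is held across frames (for example `["k", "a", "a", "t"]` against `["k", "a", "t"]`) still scores as a full match. In the claimed system, candidates found this way are then passed to the verifier, which the sketch does not model.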

15. In a method for a speech content grasping system to grasp speech content of voice data, a method for grasping speech content comprising:
- receiving a search word from a user;
  
  generating a pronunciation string with reference to a phoneme corresponding to the search word;
  
  searching a phoneme string matching divided indexing information stored in an indexing database by use of the pronunciation string, and selecting a voice section of a first candidate, the indexing database storing the phoneme lattice formed by performing voice recognition on the voice data for each frame with reference to a phoneme as divided indexing information for a plurality of respective frames of a limited time;
  
  determining a matching state of the voice section of the first candidate through an acoustic model to determine one voice section;
  
  additionally searching whether there is a phoneme string matching a pronunciation string of a context keyword corresponding to the search word within a predetermined time range with reference to the one voice section; and
  
  grasping a representative word for the voice data through the search word and the context keyword, by retrieving and displaying on a display device, speech content of the voice data corresponding to the keyword input, together with the keyword input to the user;
  
  wherein the phoneme lattice formed by performing voice recognition on the voice data for each frame with reference to a phoneme includes the steps of:
  
  extracting a featuring vector from per-frame voice data;
  
  performing phoneme recognition with reference to frame synchronization by use of the featuring vector extracted and generating a phoneme string;
  
  receiving the phoneme string generated and generating candidate groups of phoneme recognition with respect to time for each frame; and
  
  performing an operation in reverse order of time on the phoneme string candidate groups generated to select one phoneme string candidate group and form a corresponding phoneme lattice;
  
  performing control so as to form a phoneme based lattice for each limited time for the entire voice data and for each frame within the limited time and to perform control so as to store the phoneme lattice formed in this manner in the indexing database as divided indexing information for each limited time and thereby allow the same to be indexed for each limited time.
- - 16. The method of claim 15, wherein in the grasping of a representative word and providing the same to the user, the grasping of a representative word is performed through a representative word database in which representative words are set for respective search words and corresponding context keywords.
  - 17. The method of claim 15, wherein the search words are plural, a plurality of search words are used by a logical operator, and a search is performed by a combination of the plurality of search words.
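
Two steps of the method lend themselves to a small illustration: the additional search for a context keyword "within a predetermined time range" of the determined voice section (claim 15), and the combination of several search words with a logical operator (claim 17). The sketch below is one possible reading, not the claimed implementation. The `Hit` record, its field names, the symmetric time gap, and the choice to combine results at file granularity are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Hit:
    """Hypothetical search result: one matched voice section."""
    file_name: str
    start_sec: float
    end_sec: float
    word: str


def context_hits_near(main: Hit, context: Iterable[Hit], max_gap_sec: float) -> List[Hit]:
    """Keep context-keyword hits in the same file that lie within
    max_gap_sec before or after the main search-word hit."""
    return [
        c for c in context
        if c.file_name == main.file_name
        and c.start_sec <= main.end_sec + max_gap_sec
        and c.end_sec >= main.start_sec - max_gap_sec
    ]


def files_and(hits_a: Iterable[Hit], hits_b: Iterable[Hit]) -> Set[str]:
    """Files in which both search words were found (logical AND)."""
    return {h.file_name for h in hits_a} & {h.file_name for h in hits_b}


def files_or(hits_a: Iterable[Hit], hits_b: Iterable[Hit]) -> Set[str]:
    """Files in which at least one search word was found (logical OR)."""
    return {h.file_name for h in hits_a} | {h.file_name for h in hits_b}
```

A search word paired with a nearby context keyword could then be looked up in a representative word database, as claim 16 describes; that lookup is omitted here because the claims do not specify its structure.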

Current Assignee: Systran International Co., Ltd.
Original Assignee: Systran International Co., Ltd.
Inventors: Ji, Chang Jin
Primary Examiner(s): Adesanya, Olujimi A
Application Number: US15/033,959
Publication Number: US 20160284345A1
Time in Patent Office: 1,713 Days

CPC Class Codes

G06F 16/328   Management therefor
G06F 16/61   Indexing; Data structures t...
G06F 16/638   Presentation of query results
G06F 16/683   using metadata automaticall...
G10L 15/01   Assessment or evaluation of...
G10L 15/08   Speech classification or se...
G10L 15/12   using dynamic programming t...
G10L 15/187   Phonemic context, e.g. pron...
G10L 2015/025   Phonemes, fenemes or fenone...
G10L 2015/088   Word spotting
