Method and system for adding punctuation to voice files

US 9,442,910 B2
Filed: 03/19/2014
Issued: 09/13/2016
Est. Priority Date: 05/24/2013
Status: Active Grant

Abstract

A method and system for adding punctuation to a voice file is disclosed. The method includes: utilizing silence or pause duration detection to divide a voice file into a plurality of speech segments for processing, the voice file includes a plurality of features units; identifying all features units that appear in the voice file according to every term or expression and semantics features of the every term or expression that form each of the plurality of speech segments; using a linguistic model to determine a sum of weight of various punctuation modes in the voice file according to all the feature units, the linguistic model is built upon semantics features of various parsed out terms or expressions from a body text of a spoken sentence according to a language library; and adding punctuations to the voice file based on the determined sum of weight of the various punctuation modes.
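
The segmentation step can be illustrated with a minimal sketch. It assumes the audio has already been reduced to per-frame energy values; the function name `split_on_silence` and the parameters `energy_floor` and `min_pause_ms` are hypothetical and do not come from the patent, which does not say how silence is measured.

```python
from typing import List, Tuple


def split_on_silence(energies: List[float], frame_ms: int,
                     energy_floor: float, min_pause_ms: int) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) speech spans, end exclusive.

    Frames with energy below energy_floor count as silence (an assumption).
    The audio is cut only where a silent run lasts at least min_pause_ms.
    """
    min_pause_frames = max(1, min_pause_ms // frame_ms)
    segments: List[Tuple[int, int]] = []
    seg_start = None    # first voiced frame of the segment being built
    last_voiced = None  # most recent voiced frame
    silent_run = 0      # length of the current run of silent frames
    for i, energy in enumerate(energies):
        if energy >= energy_floor:
            if seg_start is None:
                seg_start = i
            last_voiced = i
            silent_run = 0
        else:
            silent_run += 1
            # Close the segment once the pause is long enough.
            if seg_start is not None and silent_run >= min_pause_frames:
                segments.append((seg_start, last_voiced + 1))
                seg_start = None
    if seg_start is not None:
        segments.append((seg_start, last_voiced + 1))
    return segments


frames = [0.9, 0.8, 0.0, 0.7, 0.0, 0.0, 0.0, 0.6, 0.5]
print(split_on_silence(frames, frame_ms=100, energy_floor=0.1, min_pause_ms=300))
```

With a 300 ms threshold the one-frame dip at frame 2 is ignored and only the three-frame pause splits the audio. Lowering the threshold produces more, shorter segments, which is why the threshold would be tuned per application scenario.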

18 Claims

1. An improved method for adding punctuations to a voice file, comprising:
- executing by a processor, program codes stored in a memory to configure a computing device to add punctuations to a voice file, comprising performing the following steps;
  
  utilizing silence or pause duration detection to divide the voice file into a plurality of speech segments for processing, wherein respective speech segments form respective sentences within the voice file, and each respective sentence of the voice file comprising a plurality of features units, wherein each feature unit comprises a single term or multi-terms expression having semantic features corresponding to the single term or multi-terms expression;
  
  identifying the plurality of features units that appear in the voice file according to every term or expression, and according to the semantic features corresponding to the every single term or multi-terms expression that form each of the plurality of speech segments, the semantic features comprising a word attribute and a composition within each respective sentence and wherein identifying the plurality of feature units is based on taking the respective location of each term as the current reference location, determine a single term whose relative location relationship with the current reference location comprises the semantic features of the single term feature or expression template according to the single term feature template and further continuing the identifying for multi-terms expression comprising the term based on each of the identified feature units;
  
  assigning a corresponding weight to each punctuation mode which is associated to the single term or multi-terms expression in each respective identified feature unit, wherein a punctuation mode being either no punctuation used or a particular punctuation being used in the single term or multi-terms expression;
  
  using a linguistic model to determine a maximum sum of weight as ultimate punctuation modes for the respective speech segments which form the respective sentences within the voice file, wherein a sum of weight is determined by summing all corresponding weights on occurrences of each of various possible punctuation modes in the voice file and according to all the respective identified feature units, wherein the linguistic model is built upon the semantic features of parsed out various single terms or multi-terms expressions from a body text of a spoken sentence according to a language library;
  
  adding respective punctuations to form respective punctuated sentences within the voice file based on the determined maximum sum of weight of the various punctuation modes; and
  
  transcribing the voice file with the added respective punctuations to output the punctuated sentences as text.
- Dependent claims (2, 3, 4, 5, 6)
- - 2. The method according to claim 1, wherein the silence or pause detection comprises:
    - determining a silence or pause duration threshold according to a current application scenario;
      
      detecting the silence or pause duration in the voice file to be processed, and when the silence or pause duration is longer than the silence threshold;
      
      separating the speech segments in the voice file at locations that correspond to the silence or pause duration.
  - 3. The method according to claim 1, wherein the identifying of the plurality of features units that appears in the voice file, comprising:
    - gathering into a set, the respective identified feature units that appear in the plurality of speech segments.
  - 4. The method according to claim 1, wherein the building of the linguistic model comprises:
    - parsing out the various single terms or multi-terms expressions from the body text of the spoken sentence, wherein punctuations have already been added in advance to the spoken sentence according to the language library;
      
      searching the respective identified feature unit according to the semantic features of each parsed out single term or multi-terms expression, and according to a preset feature template;
      
      recording a number of occurrences of each punctuation mode in each respective identified feature unit in the body text of the spoken sentence, according to the punctuation mode that follows the single term or multi-terms expression in the respective identified feature unit;
      
      determining a corresponding weight of each punctuation mode of each respective identified feature unit according to the number of occurrences of each punctuation mode of each respective identified feature unit; and
      
      building the linguistic model which comprises every respective identified feature unit and its respective punctuation mode with a corresponding weight relationship.
  - 5. The method according to claim 1, wherein the single term feature unit is acquired according to a single term feature template, and the multi-term expression feature unit is acquired according to a multi-term expression feature template;
    - the single term or multi-terms expression feature template perform functions, comprising;
      
      acquiring the single term or multi-terms expression whose current reference location in relation to its relative location fulfills a predetermined requirement, and the semantic features of the single term or multi-terms expression, and the acquiring of the single term feature unit which is based on the single term feature template, comprising;
      
      taking a respective location of each single term as the current reference location, determining the single term whose relative location relationship with the current reference location fulfills the requirements of the single term feature template according to the single term feature template;
      
      identifying the single term feature unit according to the semantic features of the single term, the single term feature unit comprises the single term, the semantic features of the single term and the relative location relationship of the single term with the current reference location; and
      
      the multi-terms expression feature template includes acquiring the multi-terms expression whose relative location relationship with the current reference location fulfills the predetermined requirements and the semantics features of each of the multi-terms expression, and the multi-terms expression feature units perform functions, comprising;
      
      acquiring the respective location of each multi-terms expression as the current reference location, determining the multi-terms expression whose relative location relationship with the current reference location fulfills the requirements of the multi-terms expression feature template according to the multi-terms expression feature template;
      
      identifying the multi-terms expression feature unit according to the semantic features of each multi-terms expression, the multi-term expression feature unit comprises the multi-terms expression the semantic features of the multi-terms expression and the relative location relationship of the multi-terms expression with the current reference location.
  - 6. The method according to claim 1, wherein the determining of the maximum sum of weight on each of the various possible punctuation modes in the voice file and according to all the respective identified feature units, comprises:
    - acquiring from the linguistic model corresponding relationships between each respective identified feature unit among all the respective identified feature units and the corresponding weights of the respective various possible punctuation modes;
      
      determining the corresponding weight of the punctuation mode of each single term or multi-terms expression in the voice file to be processed according to the acquired corresponding relationships, and determining the maximum sum of weight of the various possible punctuation modes of the voice file to be processed according to the corresponding weight of the punctuation mode of each single term or multi-terms expression.
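
The model-building steps of claim 4 can be sketched as a counting pass over already-punctuated text. This is a simplified illustration, not the patented implementation: each feature unit is reduced to a bare term, and the weight is taken to be the relative frequency of each punctuation mode. The claim only says the weight is determined "according to the number of occurrences", so the relative-frequency formula, the `PUNCT` set, and the names `read_modes` and `build_model` are all assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

PUNCT = {",", ".", "?", "!"}  # assumed punctuation inventory
NONE = "NONE"                 # the "no punctuation used" mode


def read_modes(tokens: List[str]) -> List[Tuple[str, str]]:
    """Pair every term with the punctuation mode that follows it."""
    pairs = []
    for i, tok in enumerate(tokens):
        if tok in PUNCT:
            continue
        nxt = tokens[i + 1] if i + 1 < len(tokens) else NONE
        pairs.append((tok, nxt if nxt in PUNCT else NONE))
    return pairs


def build_model(corpus: List[List[str]]) -> Dict[str, Dict[str, float]]:
    """Map each feature unit to {punctuation mode: weight}."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in corpus:
        for unit, mode in read_modes(sentence):
            counts[unit][mode] += 1  # record occurrences per mode
    model = {}
    for unit, c in counts.items():
        total = sum(c.values())
        # Assumed weighting: relative frequency of each mode.
        model[unit] = {mode: n / total for mode, n in c.items()}
    return model


corpus = [["hello", ",", "world", "."], ["hello", "world", "."]]
print(build_model(corpus))
```

In this toy corpus "hello" is followed by a comma once and by nothing once, so both modes receive weight 0.5, while "world" is always followed by a period.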

7. A system for adding punctuations to a voice file, comprises at least a processor which executes program codes stored in a memory to configure a computing device to add punctuations to a voice file, wherein the computing device is configured to:
- divide the voice file into a plurality of speech segments to be processed based on silence or pause detection for processing, wherein respective speech segments form respective sentences within the voice file, and each respective sentence of the voice file comprising a plurality of features units, wherein each feature unit comprises a single term or multi-terms expression having semantic features corresponding to the single term or multi-terms expression;
  
  identify the plurality of features units that appear in the voice file according to every term or expression, and according to the semantic features corresponding to the every single term or multi-terms expression that form each of the plurality of speech segments, the semantic features comprising a word attribute and a composition within each respective sentence and wherein identifying the plurality of feature units is based on taking the respective location of each term as the current reference location, determine a single term whose relative location relationship with the current reference location comprises the semantic features of the single term feature or expression template according to the single term feature template and further continuing the identifying for multi-terms expression comprising the term based on each of the identified feature units;
  
  assign a corresponding weight to each punctuation mode which is associated to the single term or multi-terms expression in each respective identified feature unit, wherein a punctuation mode being either no punctuation used or a particular punctuation being used in the single term or multi-terms expression;
  
  use a linguistic model to determine a maximum sum of weight as ultimate punctuation modes for the respective speech segments which form the respective sentences within the voice file, wherein a sum of weight is determined by summing all corresponding weights on occurrences of each of various possible punctuation modes in the voice file and according to all the respective identified feature units, wherein the linguistic model is built upon the semantic features of parsed out various single terms and multi-terms expressions from a body text of a spoken sentence according to a language library;
  
  add respective punctuations to form respective punctuated sentences within the voice file based on the determined maximum sum of weight of the various punctuation modes; and
  
  transcribe the voice file with the added respective punctuations to output the punctuated sentences as text.
- Dependent claims (8, 9, 10, 11, 12)
- - 8. The system according to claim 7, wherein:
    - the computing device is further configured to;
      
      determine the silence or pause duration threshold according to a current application scenario;
      
      detect the silence or pause duration in the voice file to be processed, and when the silence duration is longer than the silence threshold; and
      
      separate the voice segments to be processed in the voice file at locations that correspond to the silence or pause duration.
  - 9. The system according to claim 7, wherein:
    - the computing device is further configured to;
      
      gather into a set, the feature units that appear in the plurality of speech segments.
  - 10. The system according to claim 7, wherein the computing device is further configured to utilize a linguistic model which is built through performing the following steps by the processor:
    - parsing out the various single terms or multi-terms expressions from the body text of the spoken sentence, wherein punctuations have already been added in advance to the spoken sentence according to the language library;
      
      searching the respective identified feature unit according to the semantic features of each parsed out single term or multi-terms expression, and according to a preset feature template;
      
      recording a number of occurrences of each punctuation mode in each respective identified feature unit in the body text of the spoken sentence, according to the punctuation mode that follows the single term or multi-terms expression in the respective identified feature unit;
      
      determining a corresponding weight of each punctuation mode of each respective identified feature unit according to the number of occurrences of each punctuation mode of each feature unit; and
      
      building the linguistic model which comprises every respective identified feature unit and its respective punctuation mode with a corresponding weight relationship.
  - 11. The system according to claim 7, wherein the computing device is further configured to:
    - take a respective location of each single term as the current reference location, determining the single term whose relative location relationship with the current reference location fulfills the requirements of the single term feature template according to the single term feature template;
      
      identify the single term feature unit according to the semantic features of the single term, the single term feature unit comprises the single term, the semantic features of the single term and the relative location relationship of the single term with the current reference location.
  - 12. The system according to claim 7, wherein the computing device is further configured to:
    - acquire from the linguistic model corresponding relationships between each respective identified feature unit among all the respective identified feature units and the corresponding weights of the respective various possible punctuation modes;
      
      determine the corresponding weight of the punctuation mode of each single term or multi-terms expression in the voice file to be processed according to the acquired corresponding relationships, and determining the maximum sum of weight of the various possible punctuation modes of the voice file to be processed according to the corresponding weight of the punctuation mode of each single term or multi-terms expression.
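
The template-driven identification of feature units can be pictured as sliding a reference location over the sentence and collecting the terms found at fixed relative offsets. The sketch below is one possible reading of the claims, not the patented implementation: the `Term` record, the offset-list representation of a template, and the function `extract_units` are hypothetical. The `attribute` and `composition` fields stand in for the "word attribute" and "composition" semantic features named in the claims.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Term(NamedTuple):
    text: str
    attribute: str    # word attribute, e.g. part of speech
    composition: str  # role within the sentence, e.g. subject


def extract_units(sentence: List[Term],
                  single_offsets: Sequence[int],
                  multi_offsets: Sequence[Tuple[int, ...]]):
    """Return (reference location, feature unit) pairs.

    A feature unit is a tuple of (relative offset, text, attribute,
    composition) entries, one per term the template selects.
    """
    templates = [(o,) for o in single_offsets] + list(multi_offsets)
    units = []
    for ref in range(len(sentence)):  # each term in turn is the reference
        for template in templates:
            positions = [ref + o for o in template]
            # Skip templates that reach outside the sentence.
            if all(0 <= p < len(sentence) for p in positions):
                unit = tuple((o, *sentence[ref + o]) for o in template)
                units.append((ref, unit))
    return units


sentence = [Term("we", "pronoun", "subject"), Term("left", "verb", "predicate")]
for ref, unit in extract_units(sentence, single_offsets=[0], multi_offsets=[(-1, 0)]):
    print(ref, unit)
```

Here the single-term template looks only at the reference term itself (offset 0), and the multi-term template pairs it with the preceding term, so the two-term sentence yields three feature units.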

13. A non-transitory computer-readable medium having stored thereon, a computer program having at least one code section being executable by a machine for causing the machine to add punctuations to a voice file by performing steps comprising:
- utilizing silence or pause duration detection to divide the voice file into a plurality of speech segments for processing, wherein respective speech segments form respective sentences within the voice file, and each respective sentence of the voice file comprising a plurality of features units, wherein each feature unit comprises a single term or multi-terms expression having semantic features corresponding to the single term or multi-terms expression;
  
  identifying the plurality of features units that appear in the voice file according to every term or expression, and according to the semantic features corresponding to the every single term or multi-terms expression that form each of the plurality of speech segments, the semantic features comprising a word attribute and a composition within each respective sentence and wherein identifying the plurality of feature units is based on taking the respective location of each term as the current reference location, determine a single term whose relative location relationship with the current reference location comprises the semantic features of the single term feature or expression template according to the single term feature template and further continuing the identifying for multi-terms expression comprising the term based on each of the identified feature units;
  
  assigning a corresponding weight to each punctuation mode which is associated to the single term or multi-terms expression in each respective identified feature unit, wherein a punctuation mode being either no punctuation used or a particular punctuation being used in the single term or multi-terms expression;
  
  using a linguistic model to determine a maximum sum of weight as ultimate punctuation modes for the respective speech segments which form the respective sentences within the voice file, wherein a sum of weight is determined by summing all corresponding weights on occurrences of each of various possible punctuation modes in the voice file and according to all the respective identified feature units, wherein the linguistic model is built upon the semantic features of parsed out various single terms and multi-terms expressions from a body text of a spoken sentence according to a language library;
  
  adding respective punctuations to form respective punctuated sentences within the voice file based on the determined maximum sum of weight of the various punctuation modes; and
  
  transcribing the voice file with the added respective punctuations to output the punctuated sentences as text.
- Dependent claims (14, 15, 16, 17, 18)
- - 14. The non-transitory computer-readable medium according to claim 13, wherein the silence or pause detection comprises:
    - determining a silence or pause duration threshold according to a current application scenario;
      
      detecting the silence or pause duration in the voice file to be processed, and when the silence or pause duration is longer than the silence threshold;
      
      separating the speech segments in the voice file at locations that correspond to the silence or pause duration.
  - 15. The non-transitory computer-readable medium according to claim 13, wherein the identifying of the plurality of features units that appears in the voice file, comprises:
    - gathering into a set, the respective identified feature units that appear in the plurality of speech segments.
  - 16. The non-transitory computer-readable medium according to claim 13, wherein the building of the linguistic model comprises:
    - parsing out the various single terms or multi-terms expressions from the body text of the spoken sentence, wherein punctuations have already been added in advance to the spoken sentence according to the language library;
      
      searching the respective identified feature unit according to the semantic features of each parsed out single term or multi-terms expression, and according to a preset feature template;
      
      recording a number of occurrences of each punctuation mode in each respective identified feature unit in the body text of the spoken sentence, according to the punctuation mode that follows the single term or multi-terms expression in the respective identified feature unit;
      
      determining a corresponding weight of each punctuation mode of each respective identified feature unit according to the number of occurrences of each punctuation mode of each respective identified feature unit; and
      
      building the linguistic model which comprises every respective identified feature unit and its respective punctuation mode with a corresponding weight relationship.
  - 17. The non-transitory computer-readable medium according to claim 13, wherein the single term feature unit is acquired according to a single term feature template, and the multi-term expression feature unit is acquired according to a multi-term expression feature template;
    - the single term or multi-terms expression feature template perform functions, comprising;
      
      acquiring the single term or multi-terms expression whose current reference location in relation to its relative location fulfills a predetermined requirement, and the semantic features of the single term or multi-terms expression, and the acquiring of the single term feature unit which is based on the single term feature template, comprising;
      
      taking a respective location of each single term as the current reference location, determining the single term whose relative location relationship with the current reference location fulfills the requirements of the single term feature template according to the single term feature template;
      
      identifying the single term feature unit according to the semantic features of the single term, the single term feature unit comprises the single term, the semantic features of the single term and the relative location relationship of the single term with the current reference location; and
      
      the multi-terms expression feature template includes acquiring the multi-terms expression whose relative location relationship with the current reference location fulfills the predetermined requirements and the semantics features of each of the multi-terms expression, and the multi-terms expression feature units perform functions, comprising;
      
      acquiring the respective location of each multi-terms expression as the current reference location, determining the multi-terms expression whose relative location relationship with the current reference location fulfills the requirements of the multi-terms expression feature template according to the multi-terms expression feature template;
      
      identifying the multi-terms expression feature unit according to the semantic features of each multi-terms expression, the multi-term expression feature unit comprises the multi-terms expression the semantic features of the multi-terms expression and the relative location relationship of the multi-terms expression with the current reference location.
  - 18. The non-transitory computer-readable medium according to claim 13, wherein the determining of the maximum sum of weight on each of the various possible punctuation modes in the voice file and according to all the respective identified feature units, comprises:
    - acquiring from the linguistic model corresponding relationships between each respective identified feature unit among all the respective identified feature units and the corresponding weights of the respective various possible punctuation modes;
      
      determining the corresponding weight of the punctuation mode of each single term or multi-terms expression in the voice file to be processed according to the acquired corresponding relationships, and determining the maximum sum of weight of the various possible punctuation modes of the voice file to be processed according to the corresponding weight of the punctuation mode of each single term or multi-terms expression.
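
The "maximum sum of weight" determination can be illustrated with a small decoding sketch. It assumes that each term's punctuation mode is scored independently, so picking the best mode at every position also maximizes the overall sum. The claims do not spell out the search procedure, so that independence, the string keys used for feature units, and the names `choose_modes` and `render` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

NONE = "NONE"  # the "no punctuation used" mode


def choose_modes(n_terms: int,
                 units: List[Tuple[int, str]],
                 model: Dict[str, Dict[str, float]]) -> List[str]:
    """Pick, for each term, the mode with the largest summed weight.

    units holds (reference location, feature unit key) pairs; model maps
    a feature unit key to its {punctuation mode: weight} table.
    """
    totals = [defaultdict(float) for _ in range(n_terms)]
    for ref, unit in units:
        for mode, weight in model.get(unit, {}).items():
            totals[ref][mode] += weight  # sum weights per mode
    # Terms with no matching feature unit default to no punctuation.
    return [max(t, key=t.get) if t else NONE for t in totals]


def render(terms: List[str], modes: List[str]) -> str:
    """Attach the chosen punctuation to each term and join as text."""
    return " ".join(t if m == NONE else t + m for t, m in zip(terms, modes))


terms = ["hello", "world"]
units = [(0, "hello"), (0, "hello|world"), (1, "world")]
model = {
    "hello": {",": 0.4, NONE: 0.6},
    "hello|world": {",": 0.5},
    "world": {".": 1.0},
}
print(render(terms, choose_modes(len(terms), units, model)))
```

Taken alone, the single-term unit "hello" favors no punctuation, but the multi-term unit adds enough weight to the comma that it wins once the sums are compared.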

Current Assignee: Tencent Technology Company Limited (Tencent Holdings Limited)
Original Assignee: Tencent Technology Company Limited (Tencent Holdings Limited)
Inventors: Liu, Haibo; Wang, Eryu; Zhang, Xiang; Lu, Li; Yue, Shuai; Chen, Bo; Li, Lou; Liu, Jian
Primary Examiner(s): Shah, Paras D
Application Number: US14/219,704
Publication Number: US 20140350918A1
Time in Patent Office: 909 Days
Field of Search: 704/1, 704/9, 704/257
US Class Current: 1/1
CPC Class Codes:
G06F 40/166   Editing, e.g. inserting or ...
G06F 40/30   Semantic analysis
G10L 15/04   Segmentation; Word boundary...
