Method and system of adding punctuation and establishing language model using a punctuation weighting applied to Chinese speech recognized text

US 9,811,517 B2
Filed: 01/06/2014
Issued: 11/07/2017
Est. Priority Date: 01/29/2013
Status: Active Grant

Abstract

A method of processing information content based on a Chinese language model is performed at a computer, the method including: identifying a plurality of expressions in the information content extracted from a speech input through speech recognition that is queued to be processed; dividing the expressions into a plurality of characteristic units according to semantic features and predetermined characteristics associated with each characteristic unit, each including a subset of the expressions and the predetermined characteristics at least including a respective integer number of expressions that are included in the characteristic unit; extracting, from the Chinese language model, a plurality of probabilities for punctuation marks associated with each characteristic unit; and in accordance with the probabilities, associating a respective punctuation mark with each characteristic unit included in the information content. The method further comprises adding punctuation marks based on a weight determined for each punctuation mark.
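
The grouping step described in the abstract can be pictured with a small sketch. It assumes each characteristic template is a tuple of relative offsets around the current expression, so the template length is the "integer number of expressions" in the unit. The names `TEMPLATES` and `build_units`, the three templates, and the example segmentation are illustrative assumptions, not taken from the patent.

```python
from typing import List, Tuple

# Hypothetical templates: each lists the relative offsets (from the current
# expression) of the expressions that form one characteristic unit.
TEMPLATES: List[Tuple[int, ...]] = [
    (0,),      # single-expression unit
    (-1, 0),   # previous expression + current expression
    (0, 1),    # current expression + next expression
]

Unit = Tuple[int, Tuple[int, ...], Tuple[str, ...]]

def build_units(expressions: List[str]) -> List[Unit]:
    """Return (position, template, expressions) for every template that fits."""
    units: List[Unit] = []
    for pos in range(len(expressions)):
        for template in TEMPLATES:
            indices = [pos + off for off in template]
            # Skip templates that would reach outside the sentence.
            if all(0 <= i < len(expressions) for i in indices):
                units.append((pos, template, tuple(expressions[i] for i in indices)))
    return units

# Assumed segmentation of an unpunctuated, speech-recognized sentence.
sentence = ["你好", "今天", "天气", "不错"]
units = build_units(sentence)
```

Each resulting unit would then serve as a lookup key into the language model to retrieve the punctuation probabilities associated with it.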

16 Claims

1. A computer-implemented method of adding punctuation marks to a Chinese sentence based on a Chinese language punctuation model, wherein the Chinese language punctuation model was pre-generated from a training corpus of Chinese sentences having punctuation marks and includes multiple predefined characteristic units, each predefined characteristic unit including a series of Chinese expressions, possible punctuation marks present in the series of Chinese expressions and their respective probabilities, the method comprising:
- at a computer having one or more processors and memory for storing programs to be executed by the one or more processors;
  
  extracting the Chinese sentence from a speech input through speech recognition;
  
  identifying a plurality of expressions in the Chinese sentence by segmenting the Chinese sentence according to their semantic features, each of the plurality of expressions including one or more Chinese characters;
  
  grouping the plurality of expressions in the Chinese sentence into a plurality of characteristic units according to the semantic features of the plurality of expressions using one or more predefined characteristic templates;
  
  extracting, from the Chinese language punctuation model, a plurality of possible punctuation marks appearing in the corresponding series of Chinese expressions and their respective probabilities for each of the plurality of characteristic units;
  
  determining a punctuation mark and its weight for each of the plurality of expressions in the Chinese sentence according to the plurality of possible punctuation marks extracted from the Chinese language punctuation model;
  
  calculating an overall weight for each possible arrangement of punctuation marks in the Chinese sentence based on the weights of punctuation marks at each of the plurality of expressions in the Chinese sentence; and
  
  adding the punctuation marks corresponding to an arrangement of a maximum overall weight into the Chinese sentence.
- Dependent claims (2, 3, 4, 5, 6)
  - 2. The method of claim 1, wherein the plurality of probabilities are determined according to one method selected from the group consisting of a Newton's method, a conventional Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, a limited-memory (L-BFGS) method, and a quasi-Newton method.
  - 3. The method of claim 1, wherein each characteristic unit defines, for each of the corresponding series of Chinese expressions, a respective position in the Chinese sentence and a respective relative position of at least one other expression in the series of Chinese expressions.
  - 4. The method of claim 1, wherein the series of Chinese expressions associated with a respective characteristic unit only include a single Chinese expression, and the corresponding characteristic unit defines a position of the single Chinese expression in the Chinese sentence and semantic features of the single Chinese expression.
  - 5. The method of claim 1, wherein the semantic features of a respective characteristic unit are determined based on at least one of semantic properties, syntactic properties and grammatical category of the corresponding series of Chinese expressions.
  - 6. The method of claim 1, wherein grouping the plurality of Chinese expressions into a plurality of characteristic units according to the semantic features of the plurality of expressions using one or more predefined characteristic templates further comprises:
    - determining the semantic features of each of the plurality of Chinese expressions based on a meaning of the respective expression in the Chinese sentence;
      
      identifying a punctuation state for each expression in the plurality of Chinese expressions; and
      
      in accordance with the predefined characteristic templates, determining the plurality of characteristic units based on the semantic features and the punctuation state of each of the plurality of Chinese expressions, each predefined characteristic template further defining a number of Chinese expressions in a corresponding characteristic unit.
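
The last three steps of claim 1 (per-expression weights, an overall weight for each arrangement, and selection of the maximum) can be sketched as follows. This is a minimal illustration under stated assumptions: the overall weight is taken to be the product of the per-position weights, which the claim wording does not fix, and the names `best_arrangement` and `punctuate` and the weight values are hypothetical.

```python
from itertools import product
from math import prod
from typing import Dict, List, Tuple

NONE = ""  # assumed marker for "no punctuation after this expression"

def best_arrangement(weights: List[Dict[str, float]]) -> Tuple[Tuple[str, ...], float]:
    """Enumerate every arrangement and return the one with the largest overall weight.

    weights[i] maps each candidate mark after expression i to its weight.
    The overall weight is assumed to be the product of per-position weights.
    """
    best: Tuple[str, ...] = ()
    best_weight = float("-inf")
    for arrangement in product(*(w.keys() for w in weights)):
        overall = prod(w[m] for w, m in zip(weights, arrangement))
        if overall > best_weight:
            best, best_weight = arrangement, overall
    return best, best_weight

def punctuate(expressions: List[str], weights: List[Dict[str, float]]) -> str:
    """Insert the marks of the maximum-weight arrangement after each expression."""
    marks, _ = best_arrangement(weights)
    return "".join(e + m for e, m in zip(expressions, marks))

expressions = ["你好", "今天", "天气", "不错"]
# Made-up weights, standing in for values derived from the punctuation model.
weights = [
    {NONE: 0.2, "，": 0.8},
    {NONE: 0.9, "，": 0.1},
    {NONE: 0.7, "，": 0.3},
    {NONE: 0.1, "。": 0.9},
]
result = punctuate(expressions, weights)
```

Exhaustive enumeration mirrors the claim language but grows exponentially with sentence length. With independent per-position weights the maximum reduces to the best mark at each position, and a dynamic-programming search would be the usual choice if weights depended on neighboring marks.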

7. A computer-implemented method of establishing a Chinese language punctuation model from a training corpus of Chinese sentences having a plurality of punctuation marks, comprising:
- at a computer having one or more processors and memory for storing programs to be executed by the one or more processors;
  
  identifying, within the training corpus of Chinese sentences having the plurality of punctuation marks, a plurality of expressions, each of the plurality of expressions including one or more Chinese characters, wherein the plurality of expressions are separated and grouped by a plurality of punctuation marks that are located at predetermined locations in the training corpus of Chinese sentences;
  
  grouping the plurality of expressions in the training corpus of Chinese sentences into a plurality of characteristic units according to semantic features of the plurality of expressions and predefined characteristic templates of the plurality of characteristic units, each characteristic unit including a respective series of expressions and a unique identification number assigned to the characteristic unit; and
  
  for each of the plurality of characteristic units, recording a respective frequency of occurrence for each of the plurality of punctuation marks appearing in the corresponding series of expressions found in the training corpus of Chinese sentences; and
  
  wherein a plurality of probabilities for a plurality of punctuation marks associated with each of the plurality of characteristic units are used to determine a punctuation mark for a corresponding characteristic unit found in a Chinese sentence extracted from a speech input through speech recognition and that is not yet segmented by punctuation marks;
  
  determining a punctuation mark and its weight for each of a plurality of expressions in a Chinese sentence according to a plurality of possible punctuation marks extracted from the Chinese language punctuation model;
  
  calculating an overall weight for each possible arrangement of punctuation marks in the Chinese sentence based on the weights of punctuation marks at each of the plurality of expressions in the Chinese sentence; and
  
  adding the punctuation marks corresponding to an arrangement of a maximum overall weight into the Chinese sentence.
- Dependent claims (8, 9, 10, 11, 12, 13)
  - 8. The method of claim 7, wherein grouping the plurality of expressions in the training corpus of Chinese sentences into a plurality of characteristic units further comprises:
    - determining the semantic features of each of the plurality of Chinese expressions based on a meaning of the respective expression within a corresponding Chinese sentence;
      
      identifying a punctuation state for each of the plurality of Chinese expressions; and
      
      in accordance with the predefined characteristic templates, determining the plurality of characteristic units based on the semantic features and the punctuation state of each of the plurality of Chinese expressions, each predefined characteristic template further defining a number of Chinese expressions in a corresponding characteristic unit.
  - 9. The method of claim 7, wherein recording the respective frequency of occurrence for each of the plurality of punctuation marks appearing in the corresponding series of expressions found in the training corpus of Chinese sentences further comprises:
    - searching for each of the plurality of characteristic units in the training corpus of Chinese sentences; and
      
      when a respective characteristic unit is identified, defining a punctuation state for the respective characteristic unit, wherein the punctuation state comprises a subset of punctuation states for the series of expressions included in the respective characteristic unit; and
      
      in accordance with the punctuation state, updating the respective frequency of occurrence for the respective characteristic unit.
  - 10. The method of claim 7, wherein each characteristic unit defines, for each of the corresponding series of Chinese expressions, a respective position in the training corpus of Chinese sentences, and a relative position of at least one other expression in the series of Chinese expressions.
  - 11. The method of claim 7, wherein the series of Chinese expressions associated with a characteristic unit only includes a single Chinese expression, and the corresponding characteristic unit defines a position of the single Chinese expression in the training corpus of Chinese sentences and semantic features of the single Chinese expression.
  - 12. The method of claim 7, wherein the semantic features of a respective characteristic unit are determined based on at least one of semantic properties, syntactic properties and grammatical category of the corresponding series of Chinese expressions.
  - 13. The method of claim 7, wherein in the language model, the plurality of probabilities are determined according to one method selected from a group that consists of a Newton's method, a conventional Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, a limited-memory (L-BFGS) method, and a quasi-Newton method.
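
The model-building steps of claim 7 (grouping into characteristic units with unique identification numbers, then recording punctuation frequencies per unit) can be sketched roughly as below. The sketch assumes the training corpus is already segmented into (expression, following mark) pairs, uses only single-expression and previous-plus-current units, and turns counts into relative frequencies. All of these choices, along with the names `train` and `probabilities`, are illustrative assumptions. The claims instead name optimization methods for determining the probabilities.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# (expression, punctuation mark that follows it; "" if none)
Token = Tuple[str, str]

def train(corpus: List[List[Token]]):
    """Count, for each characteristic unit, the punctuation marks seen after it."""
    unit_ids: Dict[Tuple[str, ...], int] = {}
    counts: Dict[int, Counter] = defaultdict(Counter)
    for sentence in corpus:
        for i, (expr, mark) in enumerate(sentence):
            units = [(expr,)]                            # single-expression unit
            if i > 0:
                units.append((sentence[i - 1][0], expr))  # previous + current
            for unit in units:
                # Assign the next unused integer as the unit's unique ID.
                uid = unit_ids.setdefault(unit, len(unit_ids))
                counts[uid][mark] += 1
    return unit_ids, counts

def probabilities(counter: Counter) -> Dict[str, float]:
    """Relative frequency of each mark for one unit (a simplifying assumption)."""
    total = sum(counter.values())
    return {mark: n / total for mark, n in counter.items()}

corpus = [
    [("你好", "，"), ("今天", ""), ("天气", ""), ("不错", "。")],
    [("你好", "。")],
    [("今天", ""), ("不错", "。")],
]
unit_ids, counts = train(corpus)
probs = probabilities(counts[unit_ids[("你好",)]])
```

Here `probs` holds the estimated mark distribution after the single-expression unit ("你好",), which is the kind of entry a later punctuation-adding pass would look up.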

14. A computer system, comprising:
- one or more processors; and
  
  memory having instructions and a Chinese language punctuation model stored thereon, wherein the Chinese language punctuation model was pre-generated from a training corpus of Chinese sentences having punctuation marks and includes multiple predefined characteristic units, each predefined characteristic unit including a series of Chinese expressions, possible punctuation marks present in the series of Chinese expressions and their respective probabilities, the instructions when executed by the one or more processors cause the processors to perform operations, comprising:
  
  extracting a Chinese sentence from a speech input through speech recognition;
  
  identifying a plurality of expressions in the Chinese sentence by segmenting the Chinese sentence according to their semantic features, each of the plurality of expressions including one or more Chinese characters;
  
  grouping the plurality of expressions in the Chinese sentence into a plurality of characteristic units according to the semantic features of the plurality of expressions using one or more predefined characteristic templates;
  
  extracting, from the Chinese language punctuation model, a plurality of possible punctuation marks appearing in the corresponding series of Chinese expressions and their respective probabilities for each of the plurality of characteristic units;
  
  determining a punctuation mark and its weight for each of the plurality of expressions in the Chinese sentence according to the plurality of possible punctuation marks extracted from the Chinese language punctuation model;
  
  calculating an overall weight for each possible arrangement of punctuation marks in the Chinese sentence based on the weights of punctuation marks at each of the plurality of expressions in the Chinese sentence; and
  
  adding the punctuation marks corresponding to an arrangement of a maximum overall weight into the Chinese sentence.
- Dependent claims (15, 16)
  - 15. The computer system of claim 14, wherein in the language model, the plurality of probabilities are determined according to one method selected from a group that consists of a Newton's method, a conventional Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, a limited-memory (L-BFGS) method, and a quasi-Newton method.
  - 16. The computer system of claim 14, wherein each characteristic unit defines, for each of the corresponding series of Chinese expressions, a respective position in the Chinese sentence and a respective relative position of at least one other expression in the series of Chinese expressions.
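
Claims 2, 13 and 15 list Newton's method among the options for determining the probabilities, without stating the objective being optimized. As a toy illustration only, the sketch below assumes a one-parameter log-likelihood for a single punctuation mark that followed a characteristic unit `k` times out of `n` occurrences, and fits it with Newton's method. The function name `fit_logit_newton` and the counts are hypothetical.

```python
import math

def fit_logit_newton(k: int, n: int, iters: int = 25, tol: float = 1e-12) -> float:
    """Newton's method for theta maximizing k*log(p) + (n-k)*log(1-p),
    where p = sigmoid(theta). Requires 0 < k < n so the maximum is finite."""
    theta = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + math.exp(-theta))
        grad = k - n * p              # first derivative of the log-likelihood
        hess = -n * p * (1.0 - p)     # second derivative (always negative)
        step = grad / hess
        theta -= step                 # Newton update: theta - grad / hess
        if abs(step) < tol:
            break
    return theta

# Assumed counts: the mark followed the unit 8 times out of 10 occurrences.
theta = fit_logit_newton(k=8, n=10)
p_mark = 1.0 / (1.0 + math.exp(-theta))
```

For this one-parameter case the fitted probability converges to the relative frequency `k / n`. Iterative methods such as those listed in the claims become relevant when many feature weights are fitted jointly.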

Current Assignee
Tencent Technology Shenzhen Company Limited (Tencent Holdings Limited)
Original Assignee
Tencent Technology Shenzhen Company Limited (Tencent Holdings Limited)
Inventors
Liu, Haibo; Wang, Eryu; Zhang, Xiang; Lu, Li; Yue, Shuai; Liu, Qiuge; Chen, Bo; Liu, Jian; Li, Lu
Primary Examiner(s)
Kazeminezhad, Farzad

Application Number

US14/148,579
Publication Number

US 20140214406A1
Time in Patent Office

1,401 Days
Field of Search

704/9, 704/235, 704/2
US Class Current
CPC Class Codes

G06F 40/232   Orthographic correction, e....

G06F 40/289   Phrasal analysis, e.g. fini...

G06F 40/30   Semantic analysis

G06F 40/58   Use of machine translation,...

G10L 15/26   Speech to text systems G10L...
