Monte Carlo method for natural language understanding and speech recognition language models

US 7,039,579 B2
Filed: 09/14/2001
Issued: 05/02/2006
Est. Priority Date: 09/14/2001
Status: Active Grant

Abstract

A Monte Carlo method for use with natural language understanding and speech recognition language models can include a series of steps. The steps can include identifying at least one phrase embedded in a body of text wherein the phrase can belong to a phrase class. An additional attribute corresponding to the identified phrase can be determined. The body of text can be copied and the identified phrase can be replaced with a different phrase selected from a plurality of phrases. The different phrase can belong to the phrase class and correspond to the attribute.
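
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Phrase` record, the `PHRASES` inventory, the class and attribute labels, and the function names are all hypothetical, and phrase identification is reduced to a plain substring search for brevity.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Phrase:
    text: str
    phrase_class: str  # hypothetical class label, e.g. "DATE"
    attribute: str     # hypothetical additional attribute, e.g. "weekday"


# Hypothetical plurality of phrases; a real system might draw these
# from a grammar or a list.
PHRASES = [
    Phrase("Monday", "DATE", "weekday"),
    Phrase("Tuesday", "DATE", "weekday"),
    Phrase("Friday", "DATE", "weekday"),
    Phrase("March 3rd", "DATE", "calendar_day"),
    Phrase("Boston", "CITY", "us_city"),
    Phrase("Denver", "CITY", "us_city"),
]


def find_phrase(text, phrases):
    """Step 1: return the first inventory phrase embedded in text, or None."""
    for phrase in phrases:
        if phrase.text in text:
            return phrase
    return None


def augment(text, phrases, n_copies, rng):
    """Copy text n_copies times, swapping the identified phrase each time."""
    found = find_phrase(text, phrases)
    if found is None:
        return []
    # Step 2: the attribute of the identified phrase restricts the candidates
    # to different phrases of the same class with the same attribute.
    candidates = [
        p for p in phrases
        if p.phrase_class == found.phrase_class
        and p.attribute == found.attribute
        and p.text != found.text
    ]
    if not candidates:
        return []
    # Step 3: copy the body of text and replace the phrase with a random pick.
    return [
        text.replace(found.text, rng.choice(candidates).text, 1)
        for _ in range(n_copies)
    ]
```

Calling `augment("book a flight on Monday", PHRASES, 5, random.Random(0))` would yield five copies of the sentence in which "Monday" is replaced by another weekday from the inventory, never by "March 3rd" or a city name.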

25 Claims

1. A Monte Carlo method of developing a training corpus for use with natural language understanding or speech recognition language models, said method comprising:
- identifying at least one phrase embedded in a body of text, said phrase belonging to a phrase class;
  
  determining at least one subject matter attribute corresponding to said identified phrase; and
  
  augmenting the training corpus by copying said body of text and replacing said identified phrase with a different phrase selected from a plurality of phrases, said different phrase belonging to said phrase class and having said determined subject matter attribute.
- Dependent claims (2, 3, 4, 5, 6):
  - 2. The method of claim 1, wherein said plurality of phrases are included within a single data source selected from the group consisting of a grammar, selected non-terminal within a grammar, and a list.
  - 3. The method of claim 1, wherein said plurality of phrases are included within at least two data sources wherein at least one of said data sources is selected from the group consisting of a grammar, selected non-terminals within a grammar, and a list.
  - 4. The method of claim 1, wherein said subject matter attribute is selected from the group comprising at least one phrase category and at least one boundary condition.
  - 5. The method of claim 1, wherein said subject matter attribute corresponds to at least one of a date attribute, a time attribute, a geographical attribute, and a name attribute.
  - 6. The method of claim 1, wherein said different phrase has a probability value which exceeds a predetermined threshold value.
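
Claims 2 through 6 mention phrase data sources (a grammar, selected non-terminals, a list) and a probability threshold. The following Python sketch shows one way these could fit together. The data sources, their probability values, and the function names are hypothetical, and drawing the replacement in proportion to its probability is an assumption; the claims only require that the replacement's probability exceed a predetermined threshold.

```python
import random

# Hypothetical list data source: phrase -> probability value.
CITY_LIST = {"Boston": 0.40, "Denver": 0.35, "Tulsa": 0.02}

# Hypothetical grammar fragment: non-terminal -> alternatives with
# rule probabilities.
GRAMMAR = {"<weekday>": {"Monday": 0.50, "Friday": 0.45, "Whitsunday": 0.05}}


def eligible(source, current, threshold):
    """Keep phrases whose probability exceeds threshold, minus the current one."""
    return {p: w for p, w in source.items() if w > threshold and p != current}


def sample_replacement(source, current, threshold, rng):
    """Draw one replacement phrase, or return None if none qualifies."""
    pool = eligible(source, current, threshold)
    if not pool:
        return None
    phrases = list(pool)
    weights = [pool[p] for p in phrases]
    # Assumption: sample in proportion to probability among eligible phrases.
    return rng.choices(phrases, weights=weights, k=1)[0]
```

With a threshold of 0.1, rare entries such as "Tulsa" or "Whitsunday" are never chosen, which keeps improbable phrases from being over-represented in the generated corpus.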

7. A Monte Carlo method of developing a training corpus for use with natural language understanding or speech recognition language models, said method comprising:
- identifying at least one phrase embedded within a body of text;
  
  locating a second phrase within a plurality of phrases, said second phrase identically matching said identified phrase, wherein said second phrase belongs to a phrase class and has at least one subject matter attribute corresponding to said phrase class; and
  
  copying said body of text and replacing said identified phrase with a different phrase selected from said plurality of phrases, said different phrase having a subject matter attribute that matches the subject matter attribute of said second phrase.
- Dependent claims (8, 9, 10, 11, 12):
  - 8. The method of claim 7, wherein said plurality of phrases are included within a single data source selected from the group consisting of a grammar, selected non-terminals within a grammar, and a list.
  - 9. The method of claim 7, wherein said plurality of phrases are included within at least two data sources selected from the group consisting of a grammar, selected non-terminals within a grammar, and a list.
  - 10. The method of claim 7, wherein said subject matter attribute is selected from the group comprising at least one phrase category and at least one boundary condition.
  - 11. The method of claim 7, wherein said subject matter attribute corresponds to at least one of a date attribute, a time attribute, a geographical attribute, and a name attribute.
  - 12. The method of claim 7, wherein said different phrase has a probability value which exceeds a predetermined threshold value.
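
Claim 7 differs from claim 1 in that the phrase class and attribute are obtained by locating an identical entry in the plurality of phrases rather than being determined from the text. A minimal Python sketch of that lookup variant follows. The `INVENTORY` table, its labels, and the function names are hypothetical and serve only as an illustration.

```python
import random

# Hypothetical plurality of phrases:
# text -> (phrase class, subject matter attribute).
INVENTORY = {
    "nine a m": ("TIME", "morning"),
    "ten thirty a m": ("TIME", "morning"),
    "six p m": ("TIME", "evening"),
    "Paris": ("CITY", "europe"),
    "Rome": ("CITY", "europe"),
}


def locate(identified, inventory):
    """Find the entry that identically matches the identified phrase."""
    return inventory.get(identified)


def replace_by_lookup(text, identified, inventory, rng):
    """Copy text with the identified phrase swapped for a matching-attribute one."""
    match = locate(identified, inventory)
    if match is None or identified not in text:
        return None
    _, attribute = match
    # Candidates: different phrases whose attribute matches the located entry.
    others = [
        phrase for phrase, (_, attr) in inventory.items()
        if attr == attribute and phrase != identified
    ]
    if not others:
        return None
    return text.replace(identified, rng.choice(others), 1)
```

Here `replace_by_lookup("wake me at nine a m", "nine a m", INVENTORY, random.Random(1))` can only return the sentence with "ten thirty a m", because that is the sole other phrase sharing the "morning" attribute in this toy inventory.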

13. A machine readable storage, having stored thereon a computer program having a plurality of code sections executable by a machine for causing the machine to perform the steps of:
- identifying at least one phrase embedded in a body of text, said phrase belonging to a phrase class;
  
  determining at least one subject matter attribute corresponding to said identified phrase; and
  
  augmenting the training corpus by copying said body of text and replacing said identified phrase with a different phrase selected from a plurality of phrases, said different phrase belonging to said phrase class and having said determined subject matter attribute.
- Dependent claims (14, 15, 16, 17, 18):
  - 14. The machine-readable storage of claim 13, wherein said plurality of phrases are included within a single data source selected from the group consisting of a grammar, selected non-terminals within a grammar, and a list.
  - 15. The machine-readable storage of claim 13, wherein said plurality of phrases are included within at least two data sources wherein at least one of said data sources is selected from the group consisting of a grammar, selected non-terminal within a grammar, and a list.
  - 16. The machine-readable storage of claim 13, wherein said subject matter attribute is selected from the group comprising at least one phrase category and at least one boundary condition.
  - 17. The machine-readable storage of claim 13, wherein said subject matter attribute corresponds to at least one of a date attribute, a time attribute, a geographical attribute, and a name attribute.
  - 18. The machine-readable storage of claim 13, wherein said different phrase has a probability value which exceeds a predetermined threshold value.

19. A machine readable storage, having stored thereon a computer program having a plurality of code sections executable by a machine for causing the machine to perform the steps of:
- identifying at least one phrase embedded within a body of text;
  
  locating a second phrase within a plurality of phrases, said second phrase identically matching said identified phrase, wherein said second phrase belongs to a phrase class and has at least one subject matter attribute corresponding to said phrase class; and
  
  copying said body of text and replacing said identified phrase with a different phrase selected from said plurality of phrases, said different phrase having a subject matter attribute that matches the subject matter attribute of said second phrase.
- Dependent claims (20, 21, 22, 23, 24):
  - 20. The machine-readable storage of claim 19, wherein said plurality of phrases are included within a single data source selected from the group consisting of a grammar, selected non-terminals within a grammar, and a list.
  - 21. The machine-readable storage of claim 19, wherein said plurality of phrases are included within at least two data sources selected from the group consisting of a grammar, selected non-terminals within a grammar, and a list.
  - 22. The machine-readable storage of claim 19, wherein said subject matter attribute is selected from the group comprising at least one phrase category and at least one boundary condition.
  - 23. The machine-readable storage of claim 19, wherein said subject matter attribute corresponds to at least one of a date attribute, a time attribute, a geographical attribute, and a name attribute.
  - 24. The machine-readable storage of claim 19, wherein said different phrase has a probability value which exceeds a predetermined threshold value.

25. A Monte Carlo method of developing a training corpus for use with natural language understanding or speech recognition language models, said method comprising:
- identifying at least one phrase embedded in a body of text, said phrase belonging to a phrase class;
  
  determining at least one syntax-independent and semantics-independent subject matter attribute corresponding to said identified phrase; and
  
  augmenting the training corpus by copying said body of text and replacing said identified phrase with a different phrase selected from a plurality of phrases, said different phrase belonging to said phrase class and having said determined subject matter attribute.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: International Business Machines Corporation
Inventors: Epstein, Mark E.; Smith, Kevin B.; Marcadet, Jean-Christophe
Primary Examiner(s): Young, W. R.
Assistant Examiner(s): Sked, Matthew J.
Application Number: US09/952,974
Publication Number: US 20030055623A1
Time in Patent Office: 1,691 Days
Field of Search: 704/9, 704/1
US Class Current: 704/9
CPC Class Codes:
G06F 40/289   Phrasal analysis, e.g. fini...
G10L 15/183   using context dependencies,...
G10L 15/197   Probabilistic grammars, e.g...
