Systems and methods for improving the accuracy of a transcription using auxiliary data such as personal data

US 9,626,969 B2
Filed: 04/13/2015
Issued: 04/18/2017
Est. Priority Date: 07/26/2011
Status: Active Grant

Abstract

A method is described for improving the accuracy of a transcription generated by an automatic speech recognition (ASR) engine. A personal vocabulary is maintained that includes replacement words. The replacement words in the personal vocabulary are obtained from personal data associated with a user. A transcription is received of an audio recording. The transcription is generated by an ASR engine using an ASR vocabulary and includes a transcribed word that represents a spoken word in the audio recording. Data is received that is associated with the transcribed word. A replacement word from the personal vocabulary is identified, which is used to re-score the transcription and replace the transcribed word.
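As a rough illustration of how a personal vocabulary might be obtained from personal data, the sketch below collects words from a user's text sources that do not appear in a general vocabulary. The function name `build_personal_vocabulary`, the sample sources, and the `common_words` set are all hypothetical; the abstract does not specify how the vocabulary is built.

```python
import re
from collections import Counter


def build_personal_vocabulary(sources, common_words):
    """Count words in personal data that a general vocabulary lacks.

    Assumption: a word is 'personal' if it is absent from common_words.
    """
    counts = Counter()
    for text in sources:
        # Tokenize on letters, allowing internal apostrophes and hyphens.
        for token in re.findall(r"[A-Za-z][A-Za-z'-]*", text):
            word = token.lower()
            if word not in common_words:
                counts[word] += 1
    return counts


# Hypothetical personal data: an address-book entry and an SMS message.
sources = [
    "Katarzyna Wojcik",
    "Lunch with Katarzyna at noon",
]
common_words = {"lunch", "with", "at", "noon"}

vocab = build_personal_vocabulary(sources, common_words)
```

The counts could serve as a simple weighting for candidate replacement words, though the weighting scheme here is only an assumption.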

59 Citations

24 Claims

1. A method of generating a personalized transcription from an audio recording, wherein the method is performed by a mobile device in communication with a server, wherein computational resources of the server are greater than computational resources of the mobile device, the method comprising:
- maintaining a personal vocabulary of words on the mobile device associated with a user of the mobile device, wherein the personal vocabulary is based on personal data associated with the user;
  
  receiving, from the server, a first transcription of an audio recording, wherein the first transcription is generated by a server automatic speech recognition (ASR) engine at the server and using an ASR vocabulary associated with a population of users, wherein the first transcription includes a first word list and confidence scores associated with a plurality of words in the first word list, and wherein the first transcription includes both words that the server ASR engine identified as most likely spoken as well as alternatives to those words;
  
  receiving, from the server, audio data corresponding to at least the portion of the audio recording;
  
  generating a second transcription, wherein the second transcription is of the received audio data, wherein the second transcription comprises a second word list and confidence scores associated with a plurality of words in the second word list, and wherein the second transcription is generated by a mobile device ASR engine located on the mobile device using the maintained personal vocabulary and an acoustic model associated with the user of the mobile device;
  
  re-scoring the first transcription, the re-scoring comprising:
  
  comparing the first transcription with the second transcription, and modifying a confidence score associated with an alternative word in the first word list when the mobile device ASR engine indicates a higher confidence score for the alternative word than the confidence score attributed by the server ASR engine to the alternative word; and
  
  generating a final transcription based on the re-scored first transcription, the final transcription including a combination of most likely spoken words identified by the server ASR engine as well as the re-scored alternative words identified by the mobile device ASR engine.
- Dependent Claims (2, 3)
  - 2. The method of claim 1, wherein the personal data associated with the user includes data from at least one of an address book of the user, an SMS message sent or received by the user, an email sent or received by the user, a social network of the user, or a website visited by the user.
  - 3. The method of claim 1, wherein the audio recording is of a second user, the first transcription includes metadata associated with the second user, and the word from the second word list is added to the first word list based on the metadata.
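The re-scoring step of claim 1 can be pictured with a minimal sketch. It assumes each transcription is a list of word slots, that the server and device slots are already aligned one to one, and that each slot maps candidate words to confidence scores. These data shapes and the names `rescore` and `final_transcription` are illustrative assumptions, not taken from the patent.

```python
def rescore(first, second):
    """Raise a server alternative's score when the device ASR scores it higher."""
    rescored = []
    for server_slot, device_slot in zip(first, second):
        new_slot = dict(server_slot)
        # The server's most likely word for this slot; the rest are alternatives.
        best = max(server_slot, key=server_slot.get)
        for word, server_conf in server_slot.items():
            if word == best:
                continue
            device_conf = device_slot.get(word, 0.0)
            if device_conf > server_conf:
                new_slot[word] = device_conf
        rescored.append(new_slot)
    return rescored


def final_transcription(rescored):
    """Pick the highest-scoring word in each re-scored slot."""
    return [max(slot, key=slot.get) for slot in rescored]


# Hypothetical server output: most likely word plus alternatives per slot.
first = [
    {"call": 0.9, "cull": 0.1},
    {"john": 0.6, "joan": 0.3},
]
# Hypothetical device output using a personal vocabulary that contains "joan".
second = [
    {"call": 0.8},
    {"joan": 0.85, "john": 0.2},
]

final = final_transcription(rescore(first, second))
```

In this example the server's top word survives in the first slot, while the alternative "joan" overtakes "john" in the second slot because the device ASR scored it higher than the server did.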

4. A non-transitory computer-readable medium encoded with instructions that, when executed by a processor, perform a method in a computing system of generating a personalized transcription from an audio recording, wherein the method is performed by a mobile device in communication with a server, wherein computational resources of the server are greater than computational resources of the mobile device, the method comprising:
- maintaining a personal vocabulary of words on the mobile device associated with a user, wherein the personal vocabulary is based on personal data associated with the user;
  
  receiving, from the server, a first transcription of an audio recording, wherein the first transcription is generated by a server automatic speech recognition (ASR) engine at the server and using an ASR vocabulary associated with a population of users, wherein the first transcription includes a first word list and confidence scores associated with a plurality of words in the first word list, and wherein the first transcription includes both words that the ASR engine identified as most likely spoken as well as alternatives to those words;
  
  receiving, from the server, audio data corresponding to at least the portion of the audio recording;
  
  generating a second transcription, wherein the second transcription is of the received audio data, wherein the second transcription comprises a second word list and confidence scores associated with a plurality of words in the second word list, and wherein the second transcription is generated by a mobile device ASR engine located on the mobile device using the maintained personal vocabulary and an acoustic model associated with the user of the mobile device;
  
  re-scoring the first transcription, the re-scoring comprising:
  
  comparing the first transcription with the second transcription, and modifying a confidence score associated with an alternative word in the first word list when the mobile device ASR engine indicates a higher confidence score for the alternative word than the confidence score attributed by the server ASR engine to the alternative word; and
  
  generating a final transcription based on the re-scored first transcription, the final transcription including a combination of most likely spoken words identified by the server ASR engine as well as the re-scored alternative words identified by the mobile device ASR engine.
- Dependent Claims (5, 6)
  - 5. The non-transitory computer-readable medium of claim 4, wherein the personal data associated with the user includes data from at least one of an address book of the user, an SMS message sent or received by the user, an email sent or received by the user, a social network of the user, or a website visited by the user.
  - 6. The non-transitory computer-readable medium of claim 4, wherein the audio recording is of a second user, the first transcription includes metadata associated with the second user, and the word from the second word list is added to the first word list based on the metadata.

7. A method of replacing a word in a transcription of an audio recording, wherein the method is performed by a mobile device in communication with a server, wherein computational resources of the server are greater than computational resources of the mobile device, the method comprising:
- maintaining a personal vocabulary of words on the mobile device associated with a user of the mobile device, wherein the personal vocabulary is based on personal data associated with the user and includes an acoustic model associated with the user of the mobile device;
  
  receiving, from the server, a first transcription of an audio recording, wherein the first transcription data is generated by a server automatic speech recognition (ASR) engine at the server using an ASR vocabulary associated with a population of users that does not include the personal vocabulary of the user of the mobile device, wherein the first transcription includes confidence scores associated with certain words in the transcription;
  
  receiving, from the server, audio data corresponding to the first transcription;
  
  identifying, at the mobile device, a replaceable word from the first transcription;
  
  generating a second transcription of a portion of the received audio data corresponding to the replaceable word, wherein the second transcription includes phonetic data, and wherein the second transcription is generated by a mobile device ASR engine on the mobile device using the maintained personal vocabulary and an acoustic model associated with the user of the mobile device; and
  
  identifying a replacement word for the replaceable word, wherein the replacement word is identified based on a comparison between the phonetic data of the second transcription and the personal vocabulary, and wherein the replacement word is from the personal vocabulary;
  
  identifying, at the mobile device, a non-replaceable word from the first transcription partially based on the maintained personal vocabulary;
  
  producing a modified confidence score associated with the portion of the received first transcript based at least in part on the comparison; and
  
  generating a final transcription using the modified confidence score and the non-replaceable word, wherein the replacement word appears in the final transcription in place of at least one word from the first transcription, and wherein the non-replaceable word appears in the final transcription.
- Dependent Claims (8, 9, 10, 11, 12, 13, 14, 15)
  - 8. The method of claim 7, wherein the personal data associated with the user includes data from at least one of an address book of the user, an SMS message sent or received by the user, an email sent or received by the user, a social network of the user, or a website visited by the user.
  - 9. The method of claim 7, wherein the audio recording is of a second user, the first transcription includes metadata associated with the second user, and the replacement word is based on the metadata.
  - 10. The method of claim 7, wherein identifying a replaceable word comprises identifying a word from the first transcription having a confidence score that is below a threshold level.
  - 11. The method of claim 10, wherein the threshold level is based on a weighting associated with the replacement word or based on a word in the personal vocabulary having a similar phonetic spelling to the replaceable word.
  - 12. The method of claim 7, wherein identifying a replaceable word comprises identifying a word from the first transcription that has a similar phonetic spelling to a word in the personal vocabulary.
  - 13. The method of claim 7, wherein a confidence score associated with the replacement word is greater than a confidence score associated with the replaceable word.
  - 14. The method of claim 7, further comprising generating a report based on the identified replacement word, and wherein identifying the replacement word is further based on a previously generated report.
  - 15. The method of claim 7, wherein identifying the non-replaceable word further comprises determining the non-replaceable word has an identical transcription as in the first transcription based on a local transcription generated by the mobile device ASR engine.
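The replacement flow of claims 7 and 10 can be sketched under some loud assumptions: the first transcription is a list of (word, confidence) pairs, a word is replaceable when its confidence falls below a threshold, the device ASR yields a phoneme string for each replaceable word, and the personal vocabulary maps words to phonetic spellings. The function name, the default thresholds, and the use of Python's `difflib.SequenceMatcher` as the phonetic comparison are illustrative choices, not the patent's method.

```python
from difflib import SequenceMatcher


def replace_low_confidence(first, phonetics, vocabulary,
                           threshold=0.5, min_similarity=0.8):
    """Swap low-confidence words for phonetically close personal-vocabulary words."""
    final = []
    for index, (word, confidence) in enumerate(first):
        if confidence >= threshold:
            final.append(word)  # treated as non-replaceable
            continue
        heard = phonetics.get(index)  # phonetic data from the device ASR
        best_word, best_score = word, 0.0
        if heard is not None:
            for candidate, spelling in vocabulary.items():
                score = SequenceMatcher(None, heard, spelling).ratio()
                if score > best_score:
                    best_word, best_score = candidate, score
        # Keep the server's word unless a candidate is similar enough.
        final.append(best_word if best_score >= min_similarity else word)
    return final


# Hypothetical server transcription with per-word confidence scores.
first = [("meet", 0.95), ("voice", 0.3), ("tomorrow", 0.9)]
# Hypothetical phoneme string the device ASR produced for word index 1.
phonetics = {1: "V OY CH IY K"}
# Hypothetical personal vocabulary with phonetic spellings.
vocabulary = {"Wojcik": "V OY CH IH K", "Katarzyna": "K AA T AH ZH IH N AH"}

final = replace_low_confidence(first, phonetics, vocabulary)
```

Here "voice" is flagged as replaceable because its confidence is below the threshold, and "Wojcik" replaces it because its phonetic spelling is the closest match in the personal vocabulary; the high-confidence words pass through unchanged.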

16. A non-transitory computer-readable medium encoded with instructions that, when executed by a processor, perform a method in a computing system of replacing a word in a transcription of an audio recording, wherein the method is performed by a mobile device in communication with a server, wherein computational resources of the server are greater than computational resources of the mobile device, the method comprising:
- maintaining a personal vocabulary of words on the mobile device associated with a user of the mobile device, wherein the personal vocabulary is based on personal data associated with the user and includes an acoustic model associated with the user of the mobile device;
  
  receiving, from the server, a first transcription of an audio recording, wherein the first transcription data is generated by a server automatic speech recognition (ASR) engine at the server using an ASR vocabulary associated with a population of users, and wherein the first transcription includes confidence scores associated with certain words in the transcription;
  
  receiving, from the server, audio data corresponding to the first transcription;
  
  identifying, at the mobile device, a replaceable word from the first transcription;
  
  generating a second transcription of a portion of the received audio data corresponding to the replaceable word, wherein the second transcription includes phonetic data, and wherein the second transcription is generated by a mobile device ASR engine on the mobile device using the maintained personal vocabulary; and
  
  identifying a replacement word for the replaceable word, wherein the replacement word is identified based on a comparison between the phonetic data of the second transcription and the personal vocabulary, and wherein the replacement word is from the personal vocabulary;
  
  identifying, at the mobile device, a non-replaceable word from the first transcription partially based on the maintained personal vocabulary and the acoustic model associated with the user of the mobile device;
  
  producing a modified confidence score associated with the portion of the received first transcript based at least in part on the comparison; and
  
  generating a final transcription using the modified confidence score and the non-replaceable word, wherein the replacement word appears in the final transcription in place of at least one word from the first transcription, and wherein the non-replaceable word appears in the final transcription.
- Dependent Claims (17, 18, 19, 20, 21, 22, 23, 24)
  - 17. The non-transitory computer-readable medium of claim 16, wherein the personal data associated with the user includes data from at least one of an address book of the user, an SMS message sent or received by the user, an email sent or received by the user, a social network of the user, or a website visited by the user.
  - 18. The non-transitory computer-readable medium of claim 16, wherein the audio recording is of a second user, the first transcription includes metadata associated with the second user, and the replacement word is based on the metadata.
  - 19. The non-transitory computer-readable medium of claim 16, wherein identifying a replaceable word comprises identifying a word from the first transcription having a confidence score that is below a threshold level.
  - 20. The non-transitory computer-readable medium of claim 19, wherein the threshold level is based on a weighting associated with the replacement word or based on a word in the personal vocabulary having a similar phonetic spelling to the replaceable word.
  - 21. The non-transitory computer-readable medium of claim 16, wherein identifying the replaceable word comprises identifying a word from the first transcription that has a phonetic spelling similar to a word in the personal vocabulary.
  - 22. The non-transitory computer-readable medium of claim 16, wherein a confidence score associated with the replacement word is greater than a confidence score associated with the replaceable word.
  - 23. The non-transitory computer-readable medium of claim 16, further comprising instructions for generating a report based on the identified replacement word, and wherein identifying the replacement word is further based on a previously generated report.
  - 24. The non-transitory computer-readable medium of claim 16, wherein identifying the non-replaceable word further comprises determining the non-replaceable word has an identical transcription as in the first transcription based on a local transcription generated by the mobile device ASR engine.

Current Assignee: Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee: Nuance Communications, Inc. (Microsoft Corporation)
Inventors: Zavaliagkos, George; Ganong, III, William F.; Jost, Uwe H.; Madhavapeddi, Shreedhar; Clayton, Gary B.
Primary Examiner(s): Sirjani, Fariba
Application Number: US14/685,364
Publication Number: US 20150221306A1
Time in Patent Office: 736 Days
CPC Class Codes

G10L 15/065   Adaptation

G10L 15/08   Speech classification or se...

G10L 15/24   Speech recognition using no...

G10L 15/26   Speech to text systems G10L...

G10L 15/30   Distributed recognition, e....

G10L 2015/227   of the speaker; Human-fact...
