SYSTEM AND METHOD FOR COGNITIVE MULTILINGUAL SPEECH TRAINING AND RECOGNITION

US 20190303797A1
Filed: 03/30/2018
Published: 10/03/2019
Est. Priority Date: 03/30/2018
Status: Active Grant

Abstract

The present invention provides a method and system for analyzing human speech during natural language processing interactions between humans and computers to aid in computer learning. The method processes human language tutorial videos each having a visual track, an audio track and captions. Multiple videos are simultaneously processed in parallel using stream processing to identify spoken words or phrases in the videos by comparing them with benchmark words/phrases stored on a computer. Confidence scores are determined for each of the spoken words/phrases, which are assigned to a list of the benchmark words/phrases on the computer when a threshold value is met. A system administrator can identify spoken words/phrases for which the threshold value is not met.
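
The scoring and threshold behavior summarized in the abstract can be illustrated with a minimal sketch. The source does not say how a confidence score is computed, so the string-similarity measure below (`difflib.SequenceMatcher`) is a stand-in assumption, as are the `THRESHOLD` value, the function names, and the phoneme-like sample strings.

```python
from difflib import SequenceMatcher

# Hypothetical value: the source says only that the threshold is predetermined.
THRESHOLD = 0.8


def confidence(pronunciation, benchmark_pronunciations):
    """Return the best similarity (0.0 to 1.0) against any benchmark entry.

    Assumption: similarity of phoneme-like strings stands in for the
    unspecified comparison against the benchmark list.
    """
    if not benchmark_pronunciations:
        return 0.0
    return max(
        SequenceMatcher(None, pronunciation, known).ratio()
        for known in benchmark_pronunciations
    )


def update_benchmarks(identified, benchmarks, threshold=THRESHOLD):
    """Add pronunciations that meet the threshold; return those that do not."""
    below = []
    for word, pronunciation in identified:
        score = confidence(pronunciation, benchmarks.get(word, []))
        if score >= threshold:
            entries = benchmarks.setdefault(word, [])
            if pronunciation not in entries:
                entries.append(pronunciation)
        else:
            # Left for an administrator or curator to inspect.
            below.append((word, pronunciation, score))
    return below


benchmarks = {"tomato": ["t ah m ey t ow"]}
identified = [
    ("tomato", "t ah m ey t ow"),
    ("tomato", "t ah m aa t ow"),
    ("apple", "ae p ah l"),
]
below = update_benchmarks(identified, benchmarks)
```

In this sketch a word with no benchmark entries always scores 0.0, so it is returned in `below` rather than added automatically.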

20 Claims

1. A method of analyzing human speech during natural language processing interactions between humans and computers, the method comprising:
- (A) selecting, by a computer system, multiple human language tutorial videos from a plurality of human language tutorial videos, each of the plurality of human language tutorial videos having a visual track, a corresponding audio track and captions, wherein the visual track contains visual information, the audio track contains pronunciations of words or phrases spoken by humans regarding the visual information, and the captions include text of the spoken words or phrases of the audio track;
  
  (B) stream processing simultaneously in parallel channels, by the computer system, the selected multiple human language tutorial videos by chunking frames of the multiple videos into processor modules, said processor modules each including a video recognition module, a text recognition module, and an audio recognition module, wherein for each said frame of the multiple videos said video recognition module analyzes and identifies the visual information of the video track, said audio recognition module analyzes and identifies the pronunciations of the spoken words or phrases of the audio track, and said text recognition module analyzes and identifies the captions using optical character recognition;
  
  (C) correlating, by correlation modules of the computer system for each said frame of the multiple videos, the identified visual information with the identified captions and the identified pronunciations of the spoken words or phrases;
  
  (D) determining, by determination modules of the computer system, confidence scores of accuracy in the identifying the pronunciations of the spoken words or phrases, by comparing the identified audio pronunciations of the spoken words or phrases with a list of pronunciations of benchmark words or phrases stored in files on the computer;
  
  (E) assigning, by the computer system, the identified pronunciations of the words or phrases having confidence scores equal to or above a predetermined threshold value to the list of pronunciations of benchmark words or phrases stored in the files on the computer; and
  
  (F) selecting, by the computer system, different human language tutorial videos from the plurality of human language tutorial videos, then repeating steps (B) through (E) on the selected different human language tutorial videos.
- Dependent claims (2, 3, 4, 5, 6, 7):
  - 2. The method of claim 1, wherein the different human language tutorial videos include videos having same said visual information and same said captions as previously processed videos, and having audio tracks different from the audio tracks of the previously processed videos, said different audio tracks including different voices, different pronunciations, different language dialects, or different combinations of languages of said words or phrases spoken by the humans.
  - 3. The method of claim 1, wherein the different human language tutorial videos include audio tracks having different audio background noise from the previously processed videos.
  - 4. The method of claim 1, wherein the plurality of human language tutorial videos are elementary school videos for learning language skills.
  - 5. The method of claim 1, further comprising:
    - sending an alert, by the computer, to request a human curator to assist in understanding pronunciations of the identified words or phrases having confidence scores below the predetermined threshold value; and
      
      assigning, by the human curator, the identified pronunciations of the spoken words or phrases having confidence scores below the predetermined threshold value to the list of pronunciations of benchmark words or phrases stored in the files on the computer.
  - 6. The method of claim 1, wherein the audio track for one of the selected human language tutorial videos includes pronunciations of words or phrases from multiple human languages, dialects, or human accents.
  - 7. The method of claim 1, wherein the captions are included on the visual track with the visual information.
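
Steps (B) and (C) of claim 1 can be sketched as follows, with one worker per video acting as a parallel channel. This is only an illustration under stated assumptions: the `Frame` fields are plain strings standing in for pixel data, audio samples, and caption imagery, and the three `recognize_*` functions are hypothetical placeholders, not real video, speech, or OCR recognition.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import islice


@dataclass
class Frame:
    visual: str   # stands in for the frame's pixel data
    audio: str    # stands in for the frame's audio samples
    caption: str  # stands in for caption text that OCR would read


def chunked(frames, size):
    """Yield successive chunks of at most `size` frames."""
    it = iter(frames)
    while chunk := list(islice(it, size)):
        yield chunk


# Placeholder recognizers (assumptions, not real recognition modules).
def recognize_video(frame):
    return frame.visual.lower()


def recognize_audio(frame):
    return frame.audio.lower()


def recognize_text(frame):
    return frame.caption.strip().lower()


def process_video(frames, chunk_size=2):
    """One channel: chunk the frames, run all three recognizers on each
    frame, and correlate the three results in a single record."""
    records = []
    for chunk in chunked(frames, chunk_size):
        for frame in chunk:
            records.append({
                "visual": recognize_video(frame),
                "caption": recognize_text(frame),
                "pronunciation": recognize_audio(frame),
            })
    return records


def process_in_parallel(videos):
    """Process every selected video at once, one worker per video."""
    with ThreadPoolExecutor(max_workers=max(1, len(videos))) as pool:
        return dict(zip(videos, pool.map(process_video, videos.values())))


videos = {
    "a": [Frame("Apple", "AE P AH L", " Apple ")],
    "b": [Frame("Dog", "D AO G", "dog"),
          Frame("Cat", "K AE T", "cat"),
          Frame("Bird", "B ER D", "bird")],
}
result = process_in_parallel(videos)
```

Threads are used here only for brevity; a real stream processor would more likely distribute chunks across processes or machines.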

8. A computer program product, comprising one or more computer readable hardware storage devices having computer readable program code stored therein, said program code containing instructions executable by a computing device to implement a method of analyzing human speech during natural language processing interactions between humans and computers, the method comprising:
- (A) selecting, by a computer system, multiple human language tutorial videos from a plurality of human language tutorial videos, each of the plurality of human language tutorial videos having a visual track, a corresponding audio track and captions, wherein the visual track contains visual information, the audio track contains pronunciations of words or phrases spoken by humans regarding the visual information, and the captions include text of the spoken words or phrases of the audio track;
  
  (B) stream processing simultaneously in parallel channels, by the computer system, the selected multiple human language tutorial videos by chunking frames of the multiple videos into processor modules, said processor modules each including a video recognition module, a text recognition module, and an audio recognition module, wherein for each said frame of the multiple videos said video recognition module analyzes and identifies the visual information of the video track, said audio recognition module analyzes and identifies the pronunciations of the spoken words or phrases of the audio track, and said text recognition module analyzes and identifies the captions using optical character recognition;
  
  (C) correlating, by correlation modules of the computer system for each said frame of the multiple videos, the identified visual information with the identified captions and the identified pronunciations of the spoken words or phrases;
  
  (D) determining, by determination modules of the computer system, confidence scores of accuracy in the identifying the pronunciations of the spoken words or phrases, by comparing the identified audio pronunciations of the spoken words or phrases with a list of pronunciations of benchmark words or phrases stored in files on the computer;
  
  (E) assigning, by the computer system, the identified pronunciations of the words or phrases having confidence scores equal to or above a predetermined threshold value to the list of pronunciations of benchmark words or phrases stored in the files on the computer; and
  
  (F) selecting, by the computer system, different human language tutorial videos from the plurality of human language tutorial videos, then repeating steps (B) through (E) on the selected different human language tutorial videos.
- Dependent claims (9, 10, 11, 12, 13, 14):
  - 9. The computer program product of claim 8, wherein the different human language tutorial videos include videos having same said visual information and same said captions as previously processed videos, and having audio tracks different from the audio tracks of the previously processed videos, said different audio tracks including different voices, different pronunciations, different language dialects, or different combinations of languages of said words or phrases spoken by the humans.
  - 10. The computer program product of claim 8, wherein the different human language tutorial videos include audio tracks having different audio background noise from the previously processed videos.
  - 11. The computer program product of claim 8, wherein the plurality of human language tutorial videos are elementary school videos for learning language skills.
  - 12. The computer program product of claim 8, wherein the method further comprises:
    - sending an alert, by the computer, to request a human curator to assist in understanding pronunciations of the identified words or phrases having confidence scores below the predetermined threshold value; and
      
      assigning, by the human curator, the identified pronunciations of the spoken words or phrases having confidence scores below the predetermined threshold value to the list of pronunciations of benchmark words or phrases stored in the files on the computer.
  - 13. The computer program product of claim 8, wherein the audio track for one of the selected human language tutorial videos includes pronunciations of words or phrases from multiple human languages, dialects, or human accents.
  - 14. The computer program product of claim 8, wherein the captions are included on the visual track with the visual information.
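
The human curator steps recited in claim 12 (and likewise in claims 5 and 19) can be sketched as a simple alert queue. The `Alert` and `CuratorQueue` types, the function names, and the `approve` callable that stands in for the curator's judgment are all hypothetical; the source does not describe how the alert is delivered or how the curator records a decision.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    word: str
    pronunciation: str
    score: float


@dataclass
class CuratorQueue:
    pending: list = field(default_factory=list)

    def send_alert(self, word, pronunciation, score):
        """Record a request for curator help (delivery is out of scope)."""
        self.pending.append(Alert(word, pronunciation, score))


def flag_low_confidence(scored, threshold, queue):
    """Send an alert for every pronunciation scored below the threshold."""
    for word, pronunciation, score in scored:
        if score < threshold:
            queue.send_alert(word, pronunciation, score)


def curator_assign(queue, benchmarks, approve):
    """Add curator-approved pronunciations to the benchmark list.

    `approve` is an assumption: a callable standing in for the human
    curator's decision on each alert.
    """
    still_pending = []
    for alert in queue.pending:
        if approve(alert):
            benchmarks.setdefault(alert.word, []).append(alert.pronunciation)
        else:
            still_pending.append(alert)
    queue.pending = still_pending


benchmarks = {"water": ["w ao t er"]}
scored = [
    ("water", "w ao t er", 0.95),
    ("water", "w oo t ah", 0.55),
    ("water", "zzz", 0.10),
]
queue = CuratorQueue()
flag_low_confidence(scored, threshold=0.8, queue=queue)
alerts_sent = len(queue.pending)
# Stand-in for a human decision: accept the plausible variant only.
curator_assign(queue, benchmarks, approve=lambda a: a.pronunciation != "zzz")
```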

15. A system, comprising a computing device, said computing device comprising one or more processors, one or more memories, and one or more computer readable hardware storage devices, said one or more hardware storage devices containing program code executable by the one or more processors via the one or more memories to implement a method of analyzing human speech during natural language processing interactions between humans and computers, the method comprising:
- (A) selecting, by a computer system, multiple human language tutorial videos from a plurality of human language tutorial videos, each of the plurality of human language tutorial videos having a visual track, a corresponding audio track and captions, wherein the visual track contains visual information, the audio track contains pronunciations of words or phrases spoken by humans regarding the visual information, and the captions include text of the spoken words or phrases of the audio track;
  
  (B) stream processing simultaneously in parallel channels, by the computer system, the selected multiple human language tutorial videos by chunking frames of the multiple videos into processor modules, said processor modules each including a video recognition module, a text recognition module, and an audio recognition module, wherein for each said frame of the multiple videos said video recognition module analyzes and identifies the visual information of the video track, said audio recognition module analyzes and identifies the pronunciations of the spoken words or phrases of the audio track, and said text recognition module analyzes and identifies the captions using optical character recognition;
  
  (C) correlating, by correlation modules of the computer system for each said frame of the multiple videos, the identified visual information with the identified captions and the identified pronunciations of the spoken words or phrases;
  
  (D) determining, by determination modules of the computer system, confidence scores of accuracy in the identifying the pronunciations of the spoken words or phrases, by comparing the identified audio pronunciations of the spoken words or phrases with a list of pronunciations of benchmark words or phrases stored in files on the computer;
  
  (E) assigning, by the computer system, the identified pronunciations of the words or phrases having confidence scores equal to or above a predetermined threshold value to the list of pronunciations of benchmark words or phrases stored in the files on the computer; and
  
  (F) selecting, by the computer system, different human language tutorial videos from the plurality of human language tutorial videos, then repeating steps (B) through (E) on the selected different human language tutorial videos.
- Dependent claims (16, 17, 18, 19, 20):
  - 16. The system of claim 15, wherein the different human language tutorial videos include videos having same said visual information and same said captions as previously processed videos, and having audio tracks different from the audio tracks of the previously processed videos, said different audio tracks including different voices, different pronunciations, different language dialects, or different combinations of languages of said words or phrases spoken by the humans.
  - 17. The system of claim 15, wherein the different human language tutorial videos of the plurality of human language tutorial videos include audio tracks having different audio background noise from the previously processed videos.
  - 18. The system of claim 15, wherein the plurality of human language tutorial videos are elementary school videos for learning language skills.
  - 19. The system of claim 15, further comprising:
    - sending an alert, by the computer, to request a human curator to assist in understanding pronunciations of the identified words or phrases having confidence scores below the predetermined threshold value; and
      
      assigning, by the human curator, the identified pronunciations of the spoken words or phrases having confidence scores below the predetermined threshold value to the list of pronunciations of benchmark words or phrases stored in the files on the computer.
  - 20. The system of claim 15, wherein the audio track for one of the selected human language tutorial videos includes pronunciations of words or phrases from multiple human languages, dialects, or human accents.
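
Step (F), as narrowed by claim 16, selects further videos that share visuals and captions with videos already processed but carry different audio. A minimal sketch of that selection loop follows. The `content_key` field is a hypothetical identifier shared by videos with the same visual information and captions, and `process` stands in for steps (B) through (E); neither comes from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Video:
    video_id: str
    content_key: str    # hypothetical: same visuals and captions share a key
    audio_variant: str  # e.g. a voice, dialect, or background-noise label


def select_next_round(library, processed_ids):
    """Pick unprocessed videos whose visuals and captions were already
    seen in a processed video, so that only the audio track is new."""
    seen = {v.content_key for v in library if v.video_id in processed_ids}
    return [
        v for v in library
        if v.video_id not in processed_ids and v.content_key in seen
    ]


def run_rounds(library, first_round_ids, process):
    """Process the first selection, then keep selecting different videos
    and repeating until no matching video remains. Returns the round count."""
    processed = set()
    current = [v for v in library if v.video_id in first_round_ids]
    rounds = 0
    while current:
        for video in current:
            process(video)  # stands in for steps (B) through (E)
            processed.add(video.video_id)
        rounds += 1
        current = select_next_round(library, processed)
    return rounds


library = [
    Video("a1", "fruit-lesson", "US English"),
    Video("a2", "fruit-lesson", "UK English"),
    Video("b1", "animal-lesson", "US English"),
]
order = []
rounds = run_rounds(library, {"a1"}, lambda v: order.append(v.video_id))
```

Claim 15 itself requires only that the later videos be different; restricting selection to matching `content_key` values reflects the narrower case of claim 16.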

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Javali, Praveen

Granted Patent: US 11,443,227 B2
CPC Class Codes

G06F 40/20   Natural language analysis s...

G06F 40/42   Data-driven translation

G06N 20/00   Machine learning

G09B 19/06   Foreign languages with audi...

G09B 5/065   Combinations of audio and v...

G10L 15/063   Training

G10L 15/187   Phonemic context, e.g. pron...

G10L 15/22   Procedures used during a sp...

G10L 25/51   for comparison or discrimin...
