METHOD FOR AUTOMATICALLY IDENTIFYING SENTENCE BOUNDARIES IN NOISY CONVERSATIONAL DATA

US 20090063150A1
Filed: 08/27/2007
Published: 03/05/2009
Est. Priority Date: 08/27/2007
Status: Active Grant

First Claim

1. A method for automatically identifying sentence boundaries in noisy conversational transcription data, comprising:

pre-processing the conversational data for removing transcription symbols and noise;

marking sentence boundaries based on long silences in the transcription data or using manually marked sentence boundaries in the transcription data, wherein said marked transcription data forms a training set;

determining frequencies of head and tail n-grams that occur at the beginning and ending of sentences in the training set;

filtering out from the training set n-grams that occur a significant number of times in the middle of sentences in relation to the frequencies at which the n-grams occur at the beginning or ending of sentences;

marking a boundary in the conversational data before every head n-gram and after every tail n-gram that occurs in the conversational data and that also remains in the training set after filtering;

identifying turns occurring in the conversational data indicating a speaker change in the conversational data;

marking a boundary in the conversational data after each turn, unless the turn ends with an impermissible tail word or includes a word indicating an incomplete turn; and

removing false boundaries from the conversational data, wherein the steps of marking identify sentence boundaries in the conversational data.
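The n-gram steps of the claim (counting head and tail n-grams, filtering those that are frequent mid-sentence, then marking boundaries) can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the function names, the bigram default `n=2`, and the `max_mid_ratio` threshold are assumptions, since the claim does not define what counts as "a significant number of times".

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def learn_boundary_ngrams(sentences, n=2, max_mid_ratio=0.5):
    """Count head/tail n-grams in training sentences and filter them.

    max_mid_ratio is a hypothetical threshold: an n-gram is dropped when
    its mid-sentence count exceeds this fraction of its head (or tail) count.
    """
    head, tail, middle = Counter(), Counter(), Counter()
    for sent in sentences:
        grams = ngrams(sent, n)
        if not grams:
            continue
        head[grams[0]] += 1          # n-gram that begins the sentence
        tail[grams[-1]] += 1         # n-gram that ends the sentence
        middle.update(grams[1:-1])   # everything strictly in between
    heads = {g for g, c in head.items() if middle[g] <= max_mid_ratio * c}
    tails = {g for g, c in tail.items() if middle[g] <= max_mid_ratio * c}
    return heads, tails

def mark_boundaries(tokens, heads, tails, n=2):
    """Return sorted cut positions; position i lies between tokens[i-1] and tokens[i]."""
    cuts = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in heads and i > 0:
            cuts.add(i)              # boundary before a head n-gram
        if gram in tails and i + n < len(tokens):
            cuts.add(i + n)          # boundary after a tail n-gram
    return sorted(cuts)

# Toy training set (made up for illustration).
training = [
    "thank you for calling".split(),
    "i can help you".split(),
    "yes i can do that".split(),
    "sure i can check that".split(),
]
heads, tails = learn_boundary_ngrams(training)
cuts = mark_boundaries("okay thank you for calling i can help you bye".split(), heads, tails)
```

In this toy data, "i can" starts one sentence but appears twice mid-sentence, so it is filtered out of the head set and no boundary is placed before it.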

Abstract

Sentence boundaries in noisy conversational transcription data are automatically identified. Noise and transcription symbols are removed, and a training set is formed with sentence boundaries marked based on long silences or on manual markings in the transcribed data. Frequencies of head and tail n-grams that occur at the beginning and ending of sentences are determined from the training set. N-grams that occur a significant number of times in the middle of sentences in relation to their occurrences at the beginning or ending of sentences are filtered out. A boundary is marked before every head n-gram and after every tail n-gram occurring in the conversational data and remaining after filtering. Turns are identified. A boundary is marked after each turn, unless the turn ends with an impermissible tail word or is an incomplete turn. The marked boundaries in the conversational data identify sentence boundaries.
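The turn-based rule in the abstract (mark a boundary after each turn unless it ends with an impermissible tail word or is incomplete) might look like the sketch below. The word lists, the `(speaker, text)` input format, and the function names are hypothetical; the source does not enumerate impermissible tail words or incompleteness markers.

```python
# Hypothetical word lists, chosen only for illustration.
IMPERMISSIBLE_TAIL_WORDS = {"the", "a", "and", "to", "of"}
INCOMPLETE_MARKERS = {"um", "uh"}

def split_into_turns(utterances):
    """Group consecutive (speaker, text) pairs; a speaker change starts a new turn."""
    turns = []
    for speaker, text in utterances:
        words = text.lower().split()
        if turns and turns[-1][0] == speaker:
            turns[-1][1].extend(words)   # same speaker: extend the current turn
        else:
            turns.append((speaker, list(words)))
    return turns

def turn_end_boundaries(turns):
    """Return one flag per turn: True if a boundary is marked after that turn."""
    flags = []
    for _speaker, words in turns:
        if not words:
            flags.append(False)
            continue
        bad_tail = words[-1] in IMPERMISSIBLE_TAIL_WORDS
        incomplete = any(w in INCOMPLETE_MARKERS for w in words)
        flags.append(not (bad_tail or incomplete))
    return flags

# Made-up dialogue for illustration.
dialogue = [
    ("agent", "how can i help you"),
    ("customer", "i want to"),
    ("agent", "yes"),
    ("agent", "go ahead"),
    ("customer", "um change my plan"),
]
flags = turn_end_boundaries(split_into_turns(dialogue))
```

Here the second turn ends in "to" and the last turn contains "um", so neither receives a turn-end boundary, while the two agent utterances in a row are merged into a single turn.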

42 Citations

2 Claims


2-4. (canceled)

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Nasukawa, Tetsuya; Punjani, Diwakar; Roy, Shourya; Subramaniam, L. Venkata; Takeuchi, Hironori

Granted Patent: US 8,364,485 B2
US Class Current: 704/253
CPC Class Codes: G10L 15/26 Speech to text systems G10L...
