Method and apparatus for adaptively generating field of application dependent language models for use in intelligent systems
First Claim
1. An improved method for constructing a target field dependent model in the form of a decision tree for an intelligent machine, the operation of said machine being based on statistical approaches for converting input data from a source type of information into a target type of information using said decision tree, said method including:
storing in a data base a set of application field dependent files including words and symbols, thereby constituting a corpus;
performing a vocabulary selection by deriving from said corpus, a list of most frequent words and symbols;
scanning said words and symbols, and deriving therefrom a plurality of frequencies of occurrence of n-grams, which are sequences of a predefined number "n" of words and symbols, and storing said plurality of frequencies into an n-grams table;
constructing said decision tree by:
a) putting all selected vocabulary words and symbols into a first unique class C, said class initially constituting the only element of a set of classes;
then, b) splitting each class of said set of classes into two subclasses C1 and C2, and assigning, through an iterative process, each word and symbol to one of said subclasses C1 and C2, based on the plurality of frequencies in said n-grams table;
c) computing, for each word and symbol "x" of subclasses C1 and C2, a distance d1 and a distance d2 relative to subclasses C1 and C2, respectively, wherein said distances d1 and d2 are derived as follows:
##EQU11## wherein V is the number of words in the vocabulary, and ##EQU12## wherein C is a counter of all n-grams among x1, . . . xn-1, y, and where the summation is taken over all contexts (x1 . . . xn-1) such that xj = x, and NTotal is the size of the class to be partitioned; ##EQU13## the summation in the numerator being taken over all contexts (x1 . . . xn-1) where xj belongs to C1, and the summation in the denominator being taken over all contexts (x1 . . . xn-1) where xj belongs to C1 and over all possible values of z from 0 to V-1; ##EQU14## the summation in the numerator being taken over all contexts (x1 . . . xn-1) where xj belongs to C2, and the summation in the denominator being taken over all contexts (x1 . . . xn-1) where xj belongs to C2 and over all possible values of z from 0 to V-1;
Φ[p] = Log2 (p) if p > ε
Φ[p] = (p/ε) - 1 + Log2 (ε) if p < ε
with ε = [min p(x,y)]^2, where the minimum is taken over all non-zero values of p(x,y), in which case
Φ[0] = 2 Log2 [min p(x,y)] - 1;
d) reclassifying "x" based on the shorter of distances d1 and d2; and
e) testing each subclass C1 and C2 and deciding, based on a predefined criterion, whether each class of the set of classes should be split any further; and
in case of any further split requirement, repeating said steps b) through e), thus increasing the number of elements in said set of classes.
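The splitting loop of steps a) through e) can be sketched in code. The sketch below is ours, not the patent's: it assumes a simple bigram table (n = 2), uses the Φ smoothing defined in the claim, and takes each distance as a cross-entropy-style sum -Σ p(y|x) Φ[p(y|Ci)] over the vocabulary, since the ##EQU## figures themselves are not reproduced in this text; all function and variable names are illustrative.

```python
import math
from collections import defaultdict

def phi(p, eps):
    """Smoothed log2 from the claim: logarithmic above eps, linear below it."""
    if p > eps:
        return math.log2(p)
    return (p / eps) - 1 + math.log2(eps)

def class_distribution(members, bigrams, vocab):
    """p(y | class): counts of bigrams whose left word is in the class, normalized."""
    counts = defaultdict(float)
    total = 0.0
    for (x, y), c in bigrams.items():
        if x in members:
            counts[y] += c
            total += c
    return {y: (counts[y] / total if total else 0.0) for y in vocab}

def split_class(members, bigrams, vocab, max_iters=20):
    """Steps b) through d): split one class into C1/C2 and reassign each word
    to the subclass at the shorter distance, iterating until stable."""
    members = list(members)
    c1, c2 = set(members[::2]), set(members[1::2])  # arbitrary initial split
    total = sum(bigrams.values())
    eps = (min(bigrams.values()) / total) ** 2      # eps = [min p(x,y)]^2
    for _ in range(max_iters):
        p1 = class_distribution(c1, bigrams, vocab)
        p2 = class_distribution(c2, bigrams, vocab)
        moved = False
        for x in members:
            px = class_distribution({x}, bigrams, vocab)  # p(y | x)
            d1 = -sum(px[y] * phi(p1[y], eps) for y in vocab)
            d2 = -sum(px[y] * phi(p2[y], eps) for y in vocab)
            src, dst = (c2, c1) if d1 <= d2 else (c1, c2)
            if x not in dst:
                src.discard(x)
                dst.add(x)
                moved = True
        if not moved:
            break
    return c1, c2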
Abstract
A system architecture for providing human intelligible information by processing a flow of input data, e.g., converting speech (source information) into printable data (target information) based on target-dependent probabilistic models, and for enabling efficient switching from one target field of information to another. To that end, the system is provided with a language modeling device including a data base loadable, through a workstation, with an application-dependent corpus of words and/or symbols; and a language modeling processor programmed to refresh a tree-organized model efficiently, with no blocking situations, and at reasonable cost.
Claims (12)
1. (Set forth above under "First Claim".)
Dependent claims: 2, 3, 4, 5, 6, 7, 8, 9.
10. In a speech recognition system to convert speech source information into displayable target information, said speech recognition system including an acoustic processor (AP) for converting the speech signal into a string of labels, a stack decoder (SD) connected to said acoustic processor, a fast match (FM) processor connected to said stack decoder, a detailed match (DM) processor connected to said stack decoder, and a language modeling (LM) device connected to said stack decoder, said language modeling device comprising:
a data base;
a language modeling processor connected to said data base;
storage means connected to said language modeling processor;
a workstation or terminal connected to said data base;
means for storing into said data base a set of application field dependent files including words and symbols, thereby constituting a corpus;
means for performing, within said language modeling processor, a vocabulary selection by deriving from said corpus a list of most frequent words and symbols;
means for scanning said words and symbols to derive therefrom a plurality of frequencies of occurrence of n-grams, which are sequences of a predefined number "n" of words and symbols, and means for storing said plurality of frequencies into an n-grams table within said language modeling storage means;
decision tree generating means within said language modeling processor for generating, and storing into said language modeling storage means, a tree-based construction derived from said n-grams table, said tree-based construction including:
a) putting all selected vocabulary words and symbols into a first unique class C, said class initially constituting the only element of a set of classes;
then, b) splitting each class of said set of classes into two subclasses C1 and C2, and assigning, through an iterative process, each word and symbol to one of said subclasses C1 and C2, based on the plurality of frequencies in said n-grams table;
c) computing, for each word and symbol "x" of subclasses C1 and C2, a distance d1 and a distance d2 relative to subclasses C1 and C2, respectively, wherein said distances d1 and d2 are derived as follows:
##EQU15## wherein V is the number of words in the vocabulary, and ##EQU16## wherein c is a counter of all n-grams among x1, . . . xn-1, y, and where the summation is taken over all contexts (x1 . . . xn-1) such that xj = x, and NTotal is the size of the class to be partitioned; ##EQU17## the summation in the numerator being taken over all contexts (x1 . . . xn-1) where xj belongs to C1, and the summation in the denominator being taken over all contexts (x1 . . . xn-1) where xj belongs to C1 and over all possible values of z from 0 to V-1; ##EQU18## the summation in the numerator being taken over all contexts (x1 . . . xn-1) where xj belongs to C2, and the summation in the denominator being taken over all contexts (x1 . . . xn-1) where xj belongs to C2 and over all possible values of z from 0 to V-1;
Φ[p] = Log2 (p) if p > ε
Φ[p] = (p/ε) - 1 + Log2 (ε) if p < ε
with ε = [min p(x,y)]^2;
d) reclassifying "x" based on the shorter of distances d1 and d2; and
e) testing each subclass C1 and C2 and deciding, based on a predefined criterion, whether the considered class should be split any further; and
in case of any further split requirement, repeating steps b) through e), thus increasing the number of elements in said set of classes.
Dependent claims: 11, 12.
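The data flow among the components recited in claim 10 can be illustrated schematically. This is purely our sketch of the recited connections (an acoustic processor feeding a stack decoder that consults fast match, detailed match, and language model scorers); the class name, parameter names, and scoring rule are hypothetical, not from the patent:

```python
from typing import Callable, Iterable

class StackDecoder:
    """Toy stack decoder: the fast match proposes candidate words, then the
    detailed match and the language model jointly rescore them."""

    def __init__(self,
                 acoustic_processor: Callable[[bytes], list],
                 fast_match: Callable[[list], Iterable[str]],
                 detailed_match: Callable[[list, str], float],
                 language_model: Callable[[str], float]):
        self.ap = acoustic_processor
        self.fm = fast_match
        self.dm = detailed_match
        self.lm = language_model

    def decode(self, speech: bytes) -> str:
        labels = self.ap(speech)                      # speech signal -> label string
        candidates = self.fm(labels)                  # coarse candidate word list
        scored = {w: self.dm(labels, w) * self.lm(w)  # fine acoustic x language score
                  for w in candidates}
        return max(scored, key=scored.get)
```

In this picture, swapping in a different language model function, i.e., a tree-based model rebuilt from a new application-field corpus per claim 1, is what switching from one target field to another amounts to.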
Specification