Method and apparatus for generating and managing a language model data structure

US 7,020,587 B1
Filed: 06/30/2000
Issued: 03/28/2006
Est. Priority Date: 06/30/2000
Status: Expired due to Fees

Abstract

The generation and management of a language model data structure include assigning each segment of a received corpus to a node in a data structure that denotes dependencies between the respective nodes. A transitional probability between each of the nodes in the data structure is calculated. A frequency of occurrence is calculated for each item of the respective segments, and those nodes of the data structure associated with items that do not meet a minimum frequency of occurrence threshold are removed. The data structure may be managed across a system memory of a computer system and an extended memory of the computer system.
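The steps summarized in the abstract can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the names `Node`, `build_tree`, `prune`, and `transition_probability` are hypothetical, segments are assumed to be plain strings whose characters are the items, the transitional probability is assumed to be a simple ratio of counts, and the path count stored at each node stands in for the per-item frequency of occurrence.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of the tree: an item plus how often this path was seen."""
    item: str
    count: int = 0
    children: dict = field(default_factory=dict)


def build_tree(segments):
    """Insert every segment (a sequence of items) as a path from the root."""
    root = Node(item="")
    for segment in segments:
        root.count += 1
        node = root
        for item in segment:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root


def prune(node, min_count):
    """Remove child sub-trees seen fewer than min_count times (assumed rule)."""
    node.children = {
        key: child for key, child in node.children.items()
        if child.count >= min_count
    }
    for child in node.children.values():
        prune(child, min_count)


def transition_probability(parent, item):
    """Estimate P(item | path to parent) as child count over parent count."""
    child = parent.children.get(item)
    if child is None or parent.count == 0:
        return 0.0
    return child.count / parent.count


root = build_tree(["ab", "ab", "ab", "ac"])
a_node = root.children["a"]
print(transition_probability(a_node, "b"))  # 3 of the 4 paths through "a"
prune(root, min_count=3)                    # threshold of three, as in claim 4
print(sorted(a_node.children))              # the rare "c" branch is gone
```

Keeping the counts on the nodes means the probabilities never need to be stored separately; they can be recomputed from counts after any pruning pass.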

34 Citations

26 Claims

1. A method comprising:
- assigning each of a plurality of segments comprising a received corpus to a node in a data structure denoting dependencies between nodes;
  
  calculating a transitional probability between each of the nodes in the data structure; and
  
  managing storage of the data structure across a system memory of a computer system and an extended memory of the computer system such that at least one said node is stored in the system memory and another said node is stored in the extended memory simultaneously.
- - 2. A method according to claim 1, further comprising:
    - calculating a frequency of occurrence for each elemental item of the segment; and
      
      removing nodes of the data structure associated with items which do not meet a minimum threshold for the frequency of occurrence.
  - 3. A method according to claim 2, wherein the frequency of the item is calculated by counting item occurrences throughout the subset and/or corpus.
  - 4. A method according to claim 2, wherein the minimum threshold is three (3).
  - 5. A method according to claim 1, wherein the step of managing storage of the data structure comprises:
    - identifying least recently used nodes of the data structure; and
      
      storing the least recently used nodes of the data structure in the extended memory of the computer system when the data structure is too large to store completely within the system memory.
  - 6. A method according to claim 5, wherein the extended memory of the computer system comprises one or more files on an accessible mass storage device.
  - 7. A method according to claim 6, wherein the data structure represents a language model, spread across one or more elements of a computing system memory subsystem.
  - 8. A method according to claim 1, wherein calculating a transition probability includes calculating a Markov transitional probability between nodes.
  - 9. A storage medium comprising a plurality of executable instructions including at least a subset of which that, when executed by a processor, implement a method according to claim 1.
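The storage management of claims 5 and 6 (least recently used nodes moved to files on mass storage when the structure outgrows system memory) can be illustrated roughly as follows. This is a sketch under stated assumptions: the class `NodeStore`, its `capacity` parameter, the one-pickle-file-per-node layout, and the use of integer keys are all hypothetical choices made for the example, not details taken from the patent.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class NodeStore:
    """Keeps at most `capacity` nodes in memory and spills the rest to files."""

    def __init__(self, capacity, directory):
        self.capacity = capacity
        self.directory = directory
        self.in_memory = OrderedDict()  # least recently used entry comes first

    def _path(self, key):
        # Assumed layout: one pickle file per spilled node.
        return os.path.join(self.directory, f"{key}.pkl")

    def _evict(self):
        # Write least recently used nodes to "extended memory" until we fit.
        while len(self.in_memory) > self.capacity:
            old_key, old_node = self.in_memory.popitem(last=False)
            with open(self._path(old_key), "wb") as handle:
                pickle.dump(old_node, handle)

    def put(self, key, node):
        self.in_memory[key] = node
        self.in_memory.move_to_end(key)  # mark as most recently used
        self._evict()

    def get(self, key):
        if key not in self.in_memory:
            # Fault the node back in from its file.
            with open(self._path(key), "rb") as handle:
                self.in_memory[key] = pickle.load(handle)
            os.remove(self._path(key))
        self.in_memory.move_to_end(key)
        node = self.in_memory[key]
        self._evict()
        return node


with tempfile.TemporaryDirectory() as demo_dir:
    store = NodeStore(capacity=2, directory=demo_dir)
    for key in (1, 2, 3):
        store.put(key, {"item": key})
    print(list(store.in_memory))  # node 1 was least recently used and spilled
    print(store.get(1))           # read back from disk; node 2 is spilled now
```

At any moment some nodes live in the in-memory map while others live in files, which is the simultaneous split that claim 1 describes.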

10. A method for predicting a likelihood of an item in a corpus comprised of a plurality of items, the method comprising:
- building a data structure, across a system memory of a computer system and an extended memory of the computer system, of corpus segments representing a dynamic context of item dependencies within the segments;
  
  calculating the likelihood of each item based, at least in part, on a likelihood of preceding items within the dynamic context;
  
  iteratively re-segmenting the corpus; and
  
  predicting a likelihood of an item in the re-segmented corpus.
- - 11. A method according to claim 10, wherein the method of building a dynamic context of preceding dependent items comprises:
    - analyzing the data structure representing the language model;
      
      identifying all items with dependencies to or from the item; and
      
      using all items with dependencies to or from the item as the dynamic context.
  - 12. A method according to claim 10, wherein the language model includes frequency information for each item within the model.
  - 13. A method according to claim 12, wherein calculating the likelihood of the item comprises:
    - calculating a Markov transition probability for the item based, at least in part, on the frequency of the items comprising the dynamic context.
  - 14. A method according to claim 10, wherein calculating the likelihood of the item comprises:
    - calculating a Markov transition probability for the item given the dynamic context of items.
  - 15. A storage medium having stored thereon a plurality of executable instructions including instructions which, when executed by a host computer, implement a method according to claim 10.
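As a rough illustration of calculating the likelihood of an item from the items that precede it (claims 10, 13 and 14), the sketch below estimates a Markov transition probability from frequency counts. The patent's "dynamic context" is not reproduced here; as a stand-in, the example uses the longest suffix of the preceding items that was actually observed, shortening it until a match is found. The functions `count_ngrams` and `likelihood` and the `max_order` parameter are hypothetical.

```python
from collections import Counter


def count_ngrams(items, max_order):
    """Count every run of 1..max_order consecutive items."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(items) - n + 1):
            counts[tuple(items[i:i + n])] += 1
    return counts


def likelihood(item, preceding, counts, max_order):
    """P(item | longest observed suffix of `preceding`), from raw counts."""
    context = tuple(preceding[-(max_order - 1):]) if max_order > 1 else ()
    while context:
        seen = counts[context + (item,)]
        # Total occurrences of this context followed by any item.
        total = sum(
            c for gram, c in counts.items()
            if len(gram) == len(context) + 1 and gram[:-1] == context
        )
        if total:
            return seen / total
        context = context[1:]  # shorten the context and retry
    unigram_total = sum(c for gram, c in counts.items() if len(gram) == 1)
    return counts[(item,)] / unigram_total if unigram_total else 0.0


counts = count_ngrams(list("abababac"), max_order=2)
print(likelihood("b", ["a"], counts, max_order=2))  # "a" is followed by "b" 3 times out of 4
print(likelihood("a", ["z"], counts, max_order=2))  # unseen context, falls back to item frequency
```

The fallback to shorter contexts is one simple way to let the amount of history vary per item; the patent's own criterion for choosing the context may differ.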

16. A storage medium comprising executable instructions that are configured to generate, from a corpus, a data structure representing a statistical language model, the data structure for storage across a system memory and an extended memory, the data structure including:
- one or more root nodes; and
  
  a plurality of subordinate nodes, ultimately linked to a root node, cumulatively comprising one or more sub-trees, wherein each node of a sub-tree represents one or more items of a corpus and includes a measure of a Markov transition probability between the node and another linked node.
- - 17. A storage medium according to claim 16, wherein the root node represents a common root item for all subordinate nodes in the one or more sub-trees.
  - 18. A storage medium according to claim 16, wherein the Markov transition probability is a measure of the likelihood of a transition from one node to another node based, at least in part, on the one or more items represented by each of the nodes.
  - 19. A storage medium according to claim 16, wherein the items include one or more of a character, a letter, a number, and combinations thereof.
  - 20. A storage medium according to claim 16, wherein the data structure represents a dynamic order Markov model (DOMM) language model of the textual source.
  - 21. A computer system having the storage medium and a processor configured to interpret the computer executable instructions according to claim 16.

22. A modeling agent comprising:
- a controller, to receive a corpus; and
  
  a data structure generator, responsive to and selectively invoked by the controller, to assign each of a plurality of segments comprising the received corpus to a node in a data structure denoting dependencies between nodes;
  
  wherein the modeling agent calculates a transitional probability between each of the nodes of the data structure to determine a predictive capability of a language model represented by the data structure and iteratively re-segments the received corpus until a threshold predictive capability is reached.
- - 23. A modeling agent according to claim 22, the data structure generator comprising:
    - a dynamic segmentation function, to iteratively re-segment the received corpus.
  - 24. A modeling agent according to claim 22, the data structure generator comprising:
    - a frequency analysis function, to analyze a frequency of occurrence of segments within the corpus.
  - 25. A modeling agent according to claim 24, wherein segments that do not meet a frequency of occurrence threshold are removed from the data structure, thus reducing data structure size.
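The control loop of claim 22 (re-segment the corpus repeatedly until a threshold predictive capability is reached) might look like the sketch below. Two substitutions are made for the sake of a runnable example and should not be read as the patent's method: predictive capability is measured as bits per character under a unigram model over segments, and re-segmentation is done by merging the most frequent adjacent pair of segments, in the style of byte-pair encoding. All function names and the `target_bits` and `max_rounds` parameters are hypothetical.

```python
import math
from collections import Counter


def bits_per_character(segments):
    """Cross-entropy of a unigram model over segments, per source character."""
    if not segments:
        return 0.0
    counts = Counter(segments)
    total = len(segments)
    bits = -sum(c * math.log2(c / total) for c in counts.values())
    return bits / sum(len(s) for s in segments)


def merge_most_frequent_pair(segments):
    """Re-segment by joining the most frequent adjacent pair everywhere."""
    pairs = Counter(zip(segments, segments[1:]))
    if not pairs:
        return segments
    (left, right), freq = pairs.most_common(1)[0]
    if freq < 2:
        return segments  # nothing repeats, so stop merging
    merged, i = [], 0
    while i < len(segments):
        if i + 1 < len(segments) and segments[i] == left and segments[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(segments[i])
            i += 1
    return merged


def resegment_until(corpus, target_bits, max_rounds=50):
    """Start from single characters and merge until the target is met."""
    segments = list(corpus)
    for _ in range(max_rounds):
        if bits_per_character(segments) <= target_bits:
            break
        updated = merge_most_frequent_pair(segments)
        if updated == segments:
            break  # no further re-segmentation is possible
        segments = updated
    return segments


print(resegment_until("abababab", target_bits=0.5))
```

The `max_rounds` cap and the early exit when no pair repeats keep the loop from running forever if the target is never reached.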

26. A storage medium comprising a plurality of executable instructions including at least a subset of which, when executed, implement a language modeling agent to assign each of a plurality of segments of a received corpus to a node in a data structure denoting dependencies between nodes, and to calculate a transitional probability between each of the nodes in the data structure to determine a predictive capability of a language model denoted by the data structure, wherein the modeling agent dynamically re-segments the received corpus to remove segments which do not meet a minimum frequency threshold.

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Chen, Zheng; Chien, Lee-Feng; Gao, Jianfeng; Di, Shuo; Lee, Kai-Fu
Primary Examiner(s)
Thomson, William
Assistant Examiner(s)
Stevens, Thomas H

Application Number

US09/608,526
Time in Patent Office

2,097 Days
Field of Search

703/2, 704/10, 704/235, 704/245, 704/251, 704/231, 704/256
US Class Current

703/2
CPC Class Codes

G06F 40/20 Natural language analysis s...

G10L 15/285 Memory allocation or algori...
