Linguistic disambiguation system and method using string-based pattern training to learn to resolve ambiguity sites

US 6,684,201 B1
Filed: 03/31/2000
Issued: 01/27/2004
Est. Priority Date: 03/31/2000
Status: Expired due to Term

First Claim

Patent Images

1. A computer-implemented method comprising:

defining a set of reduced regular expressions for particular patterns in strings, wherein the set of reduced regular expressions are defined over a finite alphabet Σ

as follows;

(1) ∀

aε

Σ

;

“

a”

is a reduced regular expression and denotes a set {a};

“

a+”

is a reduced regular expression and denotes a positive closure of the set {a};

“

a*”

is a reduced regular expression and denotes a Kleene closure of the set {a};

“

˜

a”

is a reduced regular expression and denotes a set Σ

-a;

“

˜

a+”

is a reduced regular expression and denotes the positive closure of the set Σ

-a;

“

˜

a*”

is a reduced regular expression and denotes the Kleene closure of the set Σ

-a;

(2) “

.”

is a reduced regular expression denoting a set Σ

;

(3) “

.+”

is a reduced regular expression denoting the positive closure of the set Σ

;

(4) “

.*”

is a reduced regular expression denoting the Kleene closure of the set Σ

; and

(5) if r and s are reduced regular expressions denoting languages R and S, respectively, then “

rs”

is a reduced regular expression denoting a set RS; and

learning, from a training set, a knowledge base that uses the reduced regular expressions to resolve ambiguity based upon the strings in which the ambiguity occurs.

View all claims

2 Assignments

Timeline View

Assignment View

0 Petitions

Accused Products

Abstract

A linguistic disambiguation system and method creates a knowledge base by training on patterns in strings that contain ambiguity sites. The string patterns are described by a set of reduced regular expressions (RREs) or very reduced regular expressions (VRREs). The knowledge base utilizes the RREs or VRREs to resolve ambiguity based upon the strings in which the ambiguity occurs. The system is trained on a training set, such as a properly labeled corpus. Once trained, the system may then apply the knowledge base to raw input strings that contain ambiguity sites. The system uses the RRE- and VRRE-based knowledge base to disambiguate the sites.

96 Citations

View as Search Results

23 Claims

1. A computer-implemented method comprising:
- defining a set of reduced regular expressions for particular patterns in strings, wherein the set of reduced regular expressions are defined over a finite alphabet Σ
  
  as follows;
  
  (1) ∀
  
  aε
  
  Σ
  
  ;
  
  “
  
  a”
  
  is a reduced regular expression and denotes a set {a};
  
  “
  
  a+”
  
  is a reduced regular expression and denotes a positive closure of the set {a};
  
  “
  
  a*”
  
  is a reduced regular expression and denotes a Kleene closure of the set {a};
  
  “
  
  ˜
  
  a”
  
  is a reduced regular expression and denotes a set Σ
  
  -a;
  
  “
  
  ˜
  
  a+”
  
  is a reduced regular expression and denotes the positive closure of the set Σ
  
  -a;
  
  “
  
  ˜
  
  a*”
  
  is a reduced regular expression and denotes the Kleene closure of the set Σ
  
  -a;
  
  (2) “
  
  .”
  
  is a reduced regular expression denoting a set Σ
  
  ;
  
  (3) “
  
  .+”
  
  is a reduced regular expression denoting the positive closure of the set Σ
  
  ;
  
  (4) “
  
  .*”
  
  is a reduced regular expression denoting the Kleene closure of the set Σ
  
  ; and
  
  (5) if r and s are reduced regular expressions denoting languages R and S, respectively, then “
  
  rs”
  
  is a reduced regular expression denoting a set RS; and
  
  learning, from a training set, a knowledge base that uses the reduced regular expressions to resolve ambiguity based upon the strings in which the ambiguity occurs.

2. A computer-implemented method comprising:
- defining a set of reduced regular expressions for particular patterns in strings, wherein the defining includes defining a set of very reduced regular expressions that are a proper subset of the reduced regular expressions; and
  
  learning, from a training set a knowledge base that uses the very reduced regular expressions to resolve ambiguity based upon the strings in which the ambiguity occurs.
- View Dependent Claims (3)
- - 3. A computer-implemented method as recited in claim 2, wherein the set of very reduced regular expressions are defined over a finite alphabet Σ
    - as follows;

4. A method implemented by an apparatus, the method comprising:
- defining a set of reduced regular expressions for particular patterns in strings, such that given a finite alphabet Σ
  
  , the set of reduced regular expressions over the alphabet is defined as;
  
  (1) ∀
  
  aε
  
  Σ
  
  ;
  
  “
  
  a”
  
  is a reduced regular expression and denotes a set {a};
  
  “
  
  a+”
  
  is a reduced regular expression and denotes a positive closure of the set {a};
  
  “
  
  a*”
  
  is a reduced regular expression and denotes a Kleene closure of the set {a};
  
  “
  
  ˜
  
  a”
  
  is a reduced regular expression and denotes a set Σ
  
  -a;
  
  “
  
  ˜
  
  a+”
  
  is a reduced regular expression and denotes the positive closure of the set Σ
  
  -a;
  
  “
  
  ˜
  
  a*”
  
  is a reduced regular expression and denotes the Kleene closure of the set Σ
  
  -a;
  
  (2) “
  
  .”
  
  is a reduced regular expression denoting a set Σ
  
  ;
  
  (3) “
  
  .+”
  
  is a reduced regular expression denoting the positive closure of the set Σ
  
  ;
  
  (4) “
  
  .*”
  
  is a reduced regular expression denoting the Kleene closure of the set Σ
  
  ; and
  
  (5) if r and s are reduced regular expressions denoting languages R and S, respectively, then “
  
  rs”
  
  is a reduced regular expression denoting a set RS; and
  
  learning, from a training set, rules that use the reduce regular expressions to resolve ambiguity based upon the strings in which the ambiguity occurs.
- View Dependent Claims (5, 6, 7, 8, 9)
- - 5. A method implemented by an apparatus as recited in claim 4, wherein the defining comprises defining the reduced regular expressions over a finite alphabet Σ
    - that is a union of two sets Σ
      
      ′ and
      
      Σ
      
      ″
      
      , wherein the set Σ
      
      ′
      
      is a set of symbols in the alphabet and the set Σ
      
      ″
      
      is a set of sets of symbols from the set Σ
      
      ′
      
      .
  - 6. A method implemented by an apparatus as recited in claim 4, wherein the learning comprises transformation sequence learning to create a set of rules that use the reduced regular expressions to resolve ambiguity based upon the strings in which the ambiguity occurs.
  - 7. A method implemented by an apparatus as recited in claim 6, wherein the transformation sequence learning comprises:
8. A method implemented by an apparatus as recited in claim 6, wherein the transformation sequence learning comprises:
- constructing a graph having a root node that contains a primary position set of the training set and multiple paths from the root node to secondary nodes that represents a reduced regular expression, the secondary node containing a secondary position set to which the reduced regular expression maps; and
  
  scoring the secondary nodes to identify a particular secondary node; and
  
  identifying the reduced regular expression that maps the path from the root node to the particular secondary node.
9. A method implemented by an apparatus as recited in claim 4, wherein the training set comprises a labeled corpus.

10. A computer readable medium having computer-executable instructions that, when executed on a processor, perform a method comprising:
- defining a set of reduced regular expressions for particular patterns in strings, such that given a finite alphabet Σ
  
  , the set of reduced regular expressions over the alphabet is defined as;
  
  (1) ∀
  
  aε
  
  Σ
  
  ;
  
  “
  
  a”
  
  is a reduced regular expression and denotes a set {a};
  
  “
  
  a+”
  
  is a reduced regular expression and denotes a positive closure of the set {a};
  
  “
  
  a*”
  
  is a reduced regular expression and denotes a Kleene closure of the set {a};
  
  “
  
  ˜
  
  a”
  
  is a reduced regular expression and denotes a set Σ
  
  -a;
  
  “
  
  ˜
  
  a+”
  
  is a reduced regular expression and denotes the positive closure of the set Σ
  
  -a;
  
  “
  
  ˜
  
  a*”
  
  is a reduced regular expression and denotes the Kleene closure of the set Σ
  
  -a;
  
  (2) “
  
  .”
  
  is a reduced regular expression denoting a set Σ
  
  ;
  
  (3) “
  
  .+”
  
  is a reduced regular expression denoting the positive closure of the set Σ
  
  ;
  
  (4) “
  
  .*”
  
  is a reduced regular expression denoting the Kleene closure of the set Σ
  
  ; and
  
  (5) if r and s are reduced regular expressions denoting languages R and S, respectively, then “
  
  rs”
  
  is a reduced regular expression denoting a set RS; and
  
  learning, from a training set, rules that use the reduce regular expressions to resolve ambiguity based upon the strings in which the ambiguity occurs.

11. A method implemented by an apparatus, the method comprising:
- defining a set of very reduced regular expressions for particular patterns in strings, such that given an alphabet Σ
  
  , the set of very reduced regular expressions over the alphabet is defined as;
  
  (1) ∀
  
  aε
  
  Σ
  
  ;
  
  “
  
  a”
  
  is a very reduced regular expression and denotes a set {a};
  
  (2) “
  
  .”
  
  is a very reduced regular expression denoting a set Σ
  
  ;
  
  (3) “
  
  .*”
  
  is a very reduced regular expression denoting a Kleene closure of the set Σ
  
  ;
  
  (4) if r and s are very reduced regular expressions denoting languages R and S, respectively, then “
  
  rs”
  
  is a very reduced regular expression denoting a set RS; and
  
  learning, from a training set, rules that use the very reduce regular expressions to resolve ambiguity based up on the strings in which the ambiguity occurs.
- View Dependent Claims (12, 13, 14, 15)
- - 12. A method implemented by an apparatus as recited in claim 11, wherein the learning comprises transformation sequence learning to create a set of rules that use the very reduce regular expressions to resolve ambiguity based upon the strings in which the ambiguity occurs.
  - 13. A method implemented by an apparatus as recited in claim 12, wherein the transformation sequence learning comprises:
14. A method implemented by an apparatus as recited in claim 12, wherein the transformation sequence learning comprises:
- constructing a graph having a root node that contains a primary position set of the training set and multiple paths from the root node to secondary nodes that represents a very reduced regular expression, the secondary node containing a secondary position set to which the reduced regular expression maps; and
  
  scoring the secondary nodes to identify a particular secondary node; and
  
  identifying the very reduced regular expression that maps the path from the root node to the particular secondary node.
15. A method implemented by an apparatus as recited in claim 11, wherein the training set comprises a labeled corpus.

16. A computer readable medium having computer-executable instructions that, when executed on a processor, perform a method comprising:
- defining a set of very reduced regular expressions for particular patterns in strings, such that given an alphabet Σ
  
  , the set of very reduced regular expressions over the alphabet is defined as;
  
  (1) ∀
  
  aε
  
  Σ
  
  ;
  
  “
  
  a”
  
  is a very reduced regular expression and denotes a set {a};
  
  (2) “
  
  .”
  
  is a very reduced regular expression denoting a set Σ
  
  ;
  
  (3) “
  
  .*”
  
  is a very reduced regular expression denoting a Kleene closure of the set Σ
  
  ;
  
  (4) if r and s are very reduced regular expressions denoting languages R and S, respectively, then “
  
  rs”
  
  is a very reduced regular expression denoting a set RS; and
  
  learning, from a training set, rules that use the very reduce regular expressions to resolve ambiguity based up on the strings in which the ambiguity occurs.

17. A computer-implemented method comprising:
- receiving a string with an ambiguity site;
  
  applying reduced regular expressions to describe a pattern in the string, wherein the reduced regular expressions are defined over a finite alphabet Σ
  
  as follows;
  
  (1) ∀
  
  aε
  
  Σ
  
  ;
  
  “
  
  a”
  
  is a reduced regular expression and denotes a set {a};
  
  “
  
  a+”
  
  is a reduced regular expression and denotes a positive closure of the set {a};
  
  “
  
  a*”
  
  is a reduced regular expression and denotes a Kleene closure of the set {a};
  
  “
  
  ˜
  
  a”
  
  is a reduced regular expression and denotes a set Σ
  
  -a;
  
  “
  
  ˜
  
  a+”
  
  is a reduced regular expression and denotes the positive closure of the set Σ
  
  -a;
  
  “
  
  ˜
  
  a*”
  
  is a reduced regular expression and denotes the Kleene closure of the set Σ
  
  -a;
  
  (2) “
  
  .”
  
  is a reduced regular expression denoting a set Σ
  
  ;
  
  (3) “
  
  .+”
  
  is a reduced regular expression denoting the positive closure of the set Σ
  
  ;
  
  (4) “
  
  .*”
  
  is a reduced regular expression denoting the Kleene closure of the set Σ
  
  ; and
  
  (5) if r and s are reduced regular expressions denoting languages R and S, respectively, then “
  
  rs”
  
  is a reduced regular expression denoting a set RS; and
  
  selecting one of the reduced regular expressions to resolve the ambiguity site.

18. A computer-implemented method comprising:
- receiving a string with an ambiguity site;
  
  applying reduced regular expressions to describe a pattern in the string; and
  
  selecting one of the reduced regular expressions to resolve the ambiguity site, wherein the applying comprises applying a set of very reduced regular expressions that are a proper subset of the reduced regular expressions, and wherein the set of very reduced regular expressions are defined over a finite alphabet Σ
  
  as follows;
  
  (1) ∀
  
  aε
  
  Σ
  
  ;
  
  “
  
  a”
  
  is a very reduced regular expression and denotes a set {a};
  
  (2) “
  
  .”
  
  is a very reduced regular expression denoting a set Σ
  
  ;
  
  (3) “
  
  .*”
  
  is a very reduced regular expression denoting a Kleene closure of the set Σ
  
  ; and
  
  (4) if r and s are very reduced regular expressions denoting languages R and S, respectively, then “
  
  rs”
  
  is a very reduced regular expression denoting a set RS.

19. A computer readable medium having computer-executable instructions that, when executed on a processor, perform a method comprising:
- receiving a string with an ambiguity site;
  
  applying reduced regular expressions to describe a pattern in the string, wherein the reduced regular expression is selected from a set of reduced regular expressions defined over a finite alphabet Σ
  
  as follows;
  
  (1) ∀
  
  aε
  
  Σ
  
  ;
  
  “
  
  a”
  
  is a reduced regular expression and denotes a set {a};
  
  “
  
  a+”
  
  is a reduced regular expression and denotes a positive closure of the set {a};
  
  “
  
  a*”
  
  is a reduced regular expression and denotes a Kleene closure of the set {a};
  
  “
  
  ˜
  
  a”
  
  is a reduced regular expression and denotes a set Σ
  
  -a;
  
  “
  
  ˜
  
  a+”
  
  is a reduced regular expression and denotes the positive closure of the set Σ
  
  -a;
  
  “
  
  ˜
  
  a*”
  
  is a reduced regular expression and denotes the Kleene closure of the set Σ
  
  -a;
  
  (2) “
  
  .”
  
  is a reduced regular expression denoting a set Σ
  
  ;
  
  (3) “
  
  .+”
  
  is a reduced regular expression denoting the positive closure of the set Σ
  
  ;
  
  (4) “
  
  .*”
  
  is a reduced regular expression denoting the Kleene closure of the set Σ
  
  ; and
  
  (5) if r and s are reduced regular expressions denoting languages R and S, respectively, then “
  
  rs”
  
  is a reduced regular expression denoting a set RS; and
  
  selecting one of the reduced regular expressions to resolve the ambiguity site.

20. A computer readable medium having computer-executable instructions that, when executed on a processor, perform a method comprising:
- receiving a string with an ambiguity site;
  
  applying reduced regular expressions to describe a pattern in the string; and
  
  selecting one of the reduced regular expressions to resolve the ambiguity site, wherein the reduced regular expression is selected from a set of very reduced regular expressions defined over a finite alphabet Σ
  
  as follows;
  
  (1) ∀
  
  aε
  
  Σ
  
  ;
  
  “
  
  a”
  
  is a very reduced regular expression and denotes a set {a};
  
  (2) “
  
  .”
  
  is a very reduced regular expression denoting a set Σ
  
  ;
  
  (3) “
  
  .*”
  
  is a very reduced regular expression denoting a Kleene closure of the set Σ
  
  ; and
  
  (4) if r and s are very reduced regular expressions denoting languages R and S, respectively, then “
  
  rs”
  
  is a very reduced regular expression denoting a set RS.

21. A training system comprising:
- a memory to store a training set;
  
  a processing unit; and
  
  a disambiguation trainer, executable on the processing unit, to define a set of reduced regular expressions for particular patterns in strings of the training set and learn a knowledge base that uses the reduced regular expressions to describe the strings, to resolve an ambiguity site wherein the reduced regular expressions are defined over a finite alphabet Σ
  
  as follows;
  
  (1) ∀
  
  aε
  
  Σ
  
  ;
  
  “
  
  a”
  
  is a reduced regular expression and denotes a set {a};
  
  “
  
  a+”
  
  is a reduced regular expression and denotes a positive closure of the set {a};
  
  “
  
  a*”
  
  is a reduced regular expression and denotes a Kleene closure of the set {a};
  
  “
  
  ˜
  
  a”
  
  is a reduced regular expression and denotes a set Σ
  
  -a;
  
  “
  
  ˜
  
  a+”
  
  is a reduced regular expression and denotes the positive closure of the set Σ
  
  -a;
  
  “
  
  ˜
  
  a*”
  
  is a reduced regular expression and denotes the Kleene closure of the set Σ
  
  -a;
  
  (2) “
  
  .”
  
  is a reduced regular expression denoting a set Σ
  
  ;
  
  (3) “
  
  .+”
  
  is a reduced regular expression denoting the positive closure of the set Σ
  
  ;
  
  (4) “
  
  .*”
  
  is a reduced regular expression denoting the Kleene closure of the set Σ
  
  ; and
  
  (5) if r and s are reduced regular expressions denoting languages R and S, respectively, then “
  
  rs”
  
  is a reduced regular expression denoting a set RS.

22. A training system comprising:
- a memory to store a training set;
  
  a processing unit; and
  
  a disambiguation trainer, executable on the processing unit, to define a set of reduced regular expressions for particular patterns in strings of the training set to and learn a knowledge base that uses the reduced regular expressions to describe the strings, wherein the reduced regular expressions comprise very reduced regular expressions, to resolve an ambiguity site which are defined over a finite alphabet Σ
  
  as follows;
  
  (1) ∀
  
  aε
  
  Σ
  
  ;
  
  “
  
  a”
  
  is a very reduced regular expression and denotes a set {a};
  
  (2) “
  
  .”
  
  is a very reduced regular expression denoting a set Σ
  
  ;
  
  (3) “
  
  .*”
  
  is a very reduced regular expression denoting a Kleene closure of the set Σ
  
  ; and
  
  (4) if r and s are very reduced regular expressions denoting languages R and S, respectively, then “
  
  rs”
  
  is a very reduced regular expression denoting a set RS.

23. A training system comprising:
- a memory to store a training set;
  
  a processing unit; and
  
  a disambiguation trainer executable on the processing unit, to define a set of reduced regular expressions for particular patterns in strings to resolve an ambiguity site of the training set and learn a knowledge base that uses the reduced regular expressions to describe the strings, wherein;
  
  the disambiguator trainer is configured to construct a graph having a root node that contains a primary position set of the training set, paths from the root node that introduce transformations from the root node based on elements in the training set or operations on the elements, and secondary nodes that contain secondary position sets that result from the transformations along associated paths; and
  
  the disambiguator trainer is firer configured to identify a reduced regular expression that is represented by a node in the graph.

Specification

Resources

Litigation Campaign Assessment

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Brill, Eric D.
Primary Examiner(s)
Khatri, Anil
Assistant Examiner(s)
Hirl, Joseph P.

Application Number

US09/539,356
Time in Patent Office

1,397 Days
Field of Search

706/45, 706/46, 706/47, 704/3
US Class Current

706/45
CPC Class Codes

G06F 40/20   Natural language analysis s...

G06F 40/30   Semantic analysis

G06N 20/00   Machine learning

Linguistic disambiguation system and method using string-based pattern training to learn to resolve ambiguity sites

First Claim

2 Assignments

0 Petitions

Accused Products

Abstract

96 Citations

23 Claims

Specification

Use Cases

Quick Links

Others

Linguistic disambiguation system and method using string-based pattern training to learn to resolve ambiguity sites

First Claim

2 Assignments

Subscription Required

Subscription Required

0 Petitions

Subscription Required

Accused Products

Subscription Required

Abstract

96 Citations

23 Claims

Specification

Subscription Required

Use Cases

Quick Links

Others