Linguistic disambiguation system and method using string-based pattern training to learn to resolve ambiguity sites
First Claim
Patent Images
1. A computer-implemented method comprising:
- defining a set of reduced regular expressions for particular patterns in strings, wherein the set of reduced regular expressions are defined over a finite alphabet Σ
as follows;
(1) ∀
aε
Σ
;
“
a”
is a reduced regular expression and denotes a set {a};
“
a+”
is a reduced regular expression and denotes a positive closure of the set {a};
“
a*”
is a reduced regular expression and denotes a Kleene closure of the set {a};
“
˜
a”
is a reduced regular expression and denotes a set Σ
-a;
“
˜
a+”
is a reduced regular expression and denotes the positive closure of the set Σ
-a;
“
˜
a*”
is a reduced regular expression and denotes the Kleene closure of the set Σ
-a;
(2) “
.”
is a reduced regular expression denoting a set Σ
;
(3) “
.+”
is a reduced regular expression denoting the positive closure of the set Σ
;
(4) “
.*”
is a reduced regular expression denoting the Kleene closure of the set Σ
; and
(5) if r and s are reduced regular expressions denoting languages R and S, respectively, then “
rs”
is a reduced regular expression denoting a set RS; and
learning, from a training set, a knowledge base that uses the reduced regular expressions to resolve ambiguity based upon the strings in which the ambiguity occurs.
2 Assignments
0 Petitions
Accused Products
Abstract
A linguistic disambiguation system and method creates a knowledge base by training on patterns in strings that contain ambiguity sites. The string patterns are described by a set of reduced regular expressions (RREs) or very reduced regular expressions (VRREs). The knowledge base utilizes the RREs or VRREs to resolve ambiguity based upon the strings in which the ambiguity occurs. The system is trained on a training set, such as a properly labeled corpus. Once trained, the system may then apply the knowledge base to raw input strings that contain ambiguity sites. The system uses the RRE- and VRRE-based knowledge base to disambiguate the sites.
96 Citations
23 Claims
-
1. A computer-implemented method comprising:
-
defining a set of reduced regular expressions for particular patterns in strings, wherein the set of reduced regular expressions are defined over a finite alphabet Σ
as follows;
(1) ∀
aε
Σ
;
“
a”
is a reduced regular expression and denotes a set {a};
“
a+”
is a reduced regular expression and denotes a positive closure of the set {a};
“
a*”
is a reduced regular expression and denotes a Kleene closure of the set {a};
“
˜
a”
is a reduced regular expression and denotes a set Σ
-a;
“
˜
a+”
is a reduced regular expression and denotes the positive closure of the set Σ
-a;
“
˜
a*”
is a reduced regular expression and denotes the Kleene closure of the set Σ
-a;
(2) “
.”
is a reduced regular expression denoting a set Σ
;
(3) “
.+”
is a reduced regular expression denoting the positive closure of the set Σ
;
(4) “
.*”
is a reduced regular expression denoting the Kleene closure of the set Σ
; and
(5) if r and s are reduced regular expressions denoting languages R and S, respectively, then “
rs”
is a reduced regular expression denoting a set RS; and
learning, from a training set, a knowledge base that uses the reduced regular expressions to resolve ambiguity based upon the strings in which the ambiguity occurs.
-
-
2. A computer-implemented method comprising:
-
defining a set of reduced regular expressions for particular patterns in strings, wherein the defining includes defining a set of very reduced regular expressions that are a proper subset of the reduced regular expressions; and
learning, from a training set a knowledge base that uses the very reduced regular expressions to resolve ambiguity based upon the strings in which the ambiguity occurs. - View Dependent Claims (3)
(1) ∀
aε
Σ
;
“
a”
is a very reduced regular expression and denotes a set {a};
(2) “
.”
is a very reduced regular expression denoting a set Σ
;
(3) “
.*”
is a very reduced regular expression denoting a Kleene closure of the set Σ
; and
(4) if r and s are very reduced regular expressions denoting languages R and S, respectively, then “
rs”
is a very reduced regular expression denoting a set RS.
-
-
4. A method implemented by an apparatus, the method comprising:
-
defining a set of reduced regular expressions for particular patterns in strings, such that given a finite alphabet Σ
, the set of reduced regular expressions over the alphabet is defined as;
(1) ∀
aε
Σ
;
“
a”
is a reduced regular expression and denotes a set {a};
“
a+”
is a reduced regular expression and denotes a positive closure of the set {a};
“
a*”
is a reduced regular expression and denotes a Kleene closure of the set {a};
“
˜
a”
is a reduced regular expression and denotes a set Σ
-a;
“
˜
a+”
is a reduced regular expression and denotes the positive closure of the set Σ
-a;
“
˜
a*”
is a reduced regular expression and denotes the Kleene closure of the set Σ
-a;
(2) “
.”
is a reduced regular expression denoting a set Σ
;
(3) “
.+”
is a reduced regular expression denoting the positive closure of the set Σ
;
(4) “
.*”
is a reduced regular expression denoting the Kleene closure of the set Σ
; and
(5) if r and s are reduced regular expressions denoting languages R and S, respectively, then “
rs”
is a reduced regular expression denoting a set RS; and
learning, from a training set, rules that use the reduce regular expressions to resolve ambiguity based upon the strings in which the ambiguity occurs. - View Dependent Claims (5, 6, 7, 8, 9)
annotating a string with an initial start-state;
identifying a rule from the set of rules that tests whether the start-state should be changed; and
applying the identified rule to the string.
-
-
8. A method implemented by an apparatus as recited in claim 6, wherein the transformation sequence learning comprises:
-
constructing a graph having a root node that contains a primary position set of the training set and multiple paths from the root node to secondary nodes that represents a reduced regular expression, the secondary node containing a secondary position set to which the reduced regular expression maps; and
scoring the secondary nodes to identify a particular secondary node; and
identifying the reduced regular expression that maps the path from the root node to the particular secondary node.
-
-
9. A method implemented by an apparatus as recited in claim 4, wherein the training set comprises a labeled corpus.
-
10. A computer readable medium having computer-executable instructions that, when executed on a processor, perform a method comprising:
-
defining a set of reduced regular expressions for particular patterns in strings, such that given a finite alphabet Σ
, the set of reduced regular expressions over the alphabet is defined as;
(1) ∀
aε
Σ
;
“
a”
is a reduced regular expression and denotes a set {a};
“
a+”
is a reduced regular expression and denotes a positive closure of the set {a};
“
a*”
is a reduced regular expression and denotes a Kleene closure of the set {a};
“
˜
a”
is a reduced regular expression and denotes a set Σ
-a;
“
˜
a+”
is a reduced regular expression and denotes the positive closure of the set Σ
-a;
“
˜
a*”
is a reduced regular expression and denotes the Kleene closure of the set Σ
-a;
(2) “
.”
is a reduced regular expression denoting a set Σ
;
(3) “
.+”
is a reduced regular expression denoting the positive closure of the set Σ
;
(4) “
.*”
is a reduced regular expression denoting the Kleene closure of the set Σ
; and
(5) if r and s are reduced regular expressions denoting languages R and S, respectively, then “
rs”
is a reduced regular expression denoting a set RS; and
learning, from a training set, rules that use the reduce regular expressions to resolve ambiguity based upon the strings in which the ambiguity occurs.
-
-
11. A method implemented by an apparatus, the method comprising:
-
defining a set of very reduced regular expressions for particular patterns in strings, such that given an alphabet Σ
, the set of very reduced regular expressions over the alphabet is defined as;
(1) ∀
aε
Σ
;
“
a”
is a very reduced regular expression and denotes a set {a};
(2) “
.”
is a very reduced regular expression denoting a set Σ
;
(3) “
.*”
is a very reduced regular expression denoting a Kleene closure of the set Σ
;
(4) if r and s are very reduced regular expressions denoting languages R and S, respectively, then “
rs”
is a very reduced regular expression denoting a set RS; and
learning, from a training set, rules that use the very reduce regular expressions to resolve ambiguity based up on the strings in which the ambiguity occurs. - View Dependent Claims (12, 13, 14, 15)
annotating a string with an initial start-state;
identifying a rule from the set of rules that tests whether the start-state should be changed; and
applying the identified rule to the string.
-
-
14. A method implemented by an apparatus as recited in claim 12, wherein the transformation sequence learning comprises:
-
constructing a graph having a root node that contains a primary position set of the training set and multiple paths from the root node to secondary nodes that represents a very reduced regular expression, the secondary node containing a secondary position set to which the reduced regular expression maps; and
scoring the secondary nodes to identify a particular secondary node; and
identifying the very reduced regular expression that maps the path from the root node to the particular secondary node.
-
-
15. A method implemented by an apparatus as recited in claim 11, wherein the training set comprises a labeled corpus.
-
16. A computer readable medium having computer-executable instructions that, when executed on a processor, perform a method comprising:
-
defining a set of very reduced regular expressions for particular patterns in strings, such that given an alphabet Σ
, the set of very reduced regular expressions over the alphabet is defined as;
(1) ∀
aε
Σ
;
“
a”
is a very reduced regular expression and denotes a set {a};
(2) “
.”
is a very reduced regular expression denoting a set Σ
;
(3) “
.*”
is a very reduced regular expression denoting a Kleene closure of the set Σ
;
(4) if r and s are very reduced regular expressions denoting languages R and S, respectively, then “
rs”
is a very reduced regular expression denoting a set RS; and
learning, from a training set, rules that use the very reduce regular expressions to resolve ambiguity based up on the strings in which the ambiguity occurs.
-
-
17. A computer-implemented method comprising:
-
receiving a string with an ambiguity site;
applying reduced regular expressions to describe a pattern in the string, wherein the reduced regular expressions are defined over a finite alphabet Σ
as follows;
(1) ∀
aε
Σ
;
“
a”
is a reduced regular expression and denotes a set {a};
“
a+”
is a reduced regular expression and denotes a positive closure of the set {a};
“
a*”
is a reduced regular expression and denotes a Kleene closure of the set {a};
“
˜
a”
is a reduced regular expression and denotes a set Σ
-a;
“
˜
a+”
is a reduced regular expression and denotes the positive closure of the set Σ
-a;
“
˜
a*”
is a reduced regular expression and denotes the Kleene closure of the set Σ
-a;
(2) “
.”
is a reduced regular expression denoting a set Σ
;
(3) “
.+”
is a reduced regular expression denoting the positive closure of the set Σ
;
(4) “
.*”
is a reduced regular expression denoting the Kleene closure of the set Σ
; and
(5) if r and s are reduced regular expressions denoting languages R and S, respectively, then “
rs”
is a reduced regular expression denoting a set RS; and
selecting one of the reduced regular expressions to resolve the ambiguity site.
-
-
18. A computer-implemented method comprising:
-
receiving a string with an ambiguity site;
applying reduced regular expressions to describe a pattern in the string; and
selecting one of the reduced regular expressions to resolve the ambiguity site, wherein the applying comprises applying a set of very reduced regular expressions that are a proper subset of the reduced regular expressions, and wherein the set of very reduced regular expressions are defined over a finite alphabet Σ
as follows;
(1) ∀
aε
Σ
;
“
a”
is a very reduced regular expression and denotes a set {a};
(2) “
.”
is a very reduced regular expression denoting a set Σ
;
(3) “
.*”
is a very reduced regular expression denoting a Kleene closure of the set Σ
; and
(4) if r and s are very reduced regular expressions denoting languages R and S, respectively, then “
rs”
is a very reduced regular expression denoting a set RS.
-
-
19. A computer readable medium having computer-executable instructions that, when executed on a processor, perform a method comprising:
-
receiving a string with an ambiguity site;
applying reduced regular expressions to describe a pattern in the string, wherein the reduced regular expression is selected from a set of reduced regular expressions defined over a finite alphabet Σ
as follows;
(1) ∀
aε
Σ
;
“
a”
is a reduced regular expression and denotes a set {a};
“
a+”
is a reduced regular expression and denotes a positive closure of the set {a};
“
a*”
is a reduced regular expression and denotes a Kleene closure of the set {a};
“
˜
a”
is a reduced regular expression and denotes a set Σ
-a;
“
˜
a+”
is a reduced regular expression and denotes the positive closure of the set Σ
-a;
“
˜
a*”
is a reduced regular expression and denotes the Kleene closure of the set Σ
-a;
(2) “
.”
is a reduced regular expression denoting a set Σ
;
(3) “
.+”
is a reduced regular expression denoting the positive closure of the set Σ
;
(4) “
.*”
is a reduced regular expression denoting the Kleene closure of the set Σ
; and
(5) if r and s are reduced regular expressions denoting languages R and S, respectively, then “
rs”
is a reduced regular expression denoting a set RS; and
selecting one of the reduced regular expressions to resolve the ambiguity site.
-
-
20. A computer readable medium having computer-executable instructions that, when executed on a processor, perform a method comprising:
-
receiving a string with an ambiguity site;
applying reduced regular expressions to describe a pattern in the string; and
selecting one of the reduced regular expressions to resolve the ambiguity site, wherein the reduced regular expression is selected from a set of very reduced regular expressions defined over a finite alphabet Σ
as follows;
(1) ∀
aε
Σ
;
“
a”
is a very reduced regular expression and denotes a set {a};
(2) “
.”
is a very reduced regular expression denoting a set Σ
;
(3) “
.*”
is a very reduced regular expression denoting a Kleene closure of the set Σ
; and
(4) if r and s are very reduced regular expressions denoting languages R and S, respectively, then “
rs”
is a very reduced regular expression denoting a set RS.
-
-
21. A training system comprising:
-
a memory to store a training set;
a processing unit; and
a disambiguation trainer, executable on the processing unit, to define a set of reduced regular expressions for particular patterns in strings of the training set and learn a knowledge base that uses the reduced regular expressions to describe the strings, to resolve an ambiguity site wherein the reduced regular expressions are defined over a finite alphabet Σ
as follows;
(1) ∀
aε
Σ
;
“
a”
is a reduced regular expression and denotes a set {a};
“
a+”
is a reduced regular expression and denotes a positive closure of the set {a};
“
a*”
is a reduced regular expression and denotes a Kleene closure of the set {a};
“
˜
a”
is a reduced regular expression and denotes a set Σ
-a;
“
˜
a+”
is a reduced regular expression and denotes the positive closure of the set Σ
-a;
“
˜
a*”
is a reduced regular expression and denotes the Kleene closure of the set Σ
-a;
(2) “
.”
is a reduced regular expression denoting a set Σ
;
(3) “
.+”
is a reduced regular expression denoting the positive closure of the set Σ
;
(4) “
.*”
is a reduced regular expression denoting the Kleene closure of the set Σ
; and
(5) if r and s are reduced regular expressions denoting languages R and S, respectively, then “
rs”
is a reduced regular expression denoting a set RS.
-
-
22. A training system comprising:
-
a memory to store a training set;
a processing unit; and
a disambiguation trainer, executable on the processing unit, to define a set of reduced regular expressions for particular patterns in strings of the training set to and learn a knowledge base that uses the reduced regular expressions to describe the strings, wherein the reduced regular expressions comprise very reduced regular expressions, to resolve an ambiguity site which are defined over a finite alphabet Σ
as follows;
(1) ∀
aε
Σ
;
“
a”
is a very reduced regular expression and denotes a set {a};
(2) “
.”
is a very reduced regular expression denoting a set Σ
;
(3) “
.*”
is a very reduced regular expression denoting a Kleene closure of the set Σ
; and
(4) if r and s are very reduced regular expressions denoting languages R and S, respectively, then “
rs”
is a very reduced regular expression denoting a set RS.
-
-
23. A training system comprising:
-
a memory to store a training set;
a processing unit; and
a disambiguation trainer executable on the processing unit, to define a set of reduced regular expressions for particular patterns in strings to resolve an ambiguity site of the training set and learn a knowledge base that uses the reduced regular expressions to describe the strings, wherein;
the disambiguator trainer is configured to construct a graph having a root node that contains a primary position set of the training set, paths from the root node that introduce transformations from the root node based on elements in the training set or operations on the elements, and secondary nodes that contain secondary position sets that result from the transformations along associated paths; and
the disambiguator trainer is firer configured to identify a reduced regular expression that is represented by a node in the graph.
-
Specification