Method for named-entity recognition and verification

US 7,171,350 B2
Filed: 08/26/2002
Issued: 01/30/2007
Est. Priority Date: 05/03/2002
Status: Expired due to Fees

Abstract

A method for named-entity (NE) recognition and verification is provided. The method extracts at least one to-be-tested segment from an article according to a text window, and uses a predefined grammar to parse the to-be-tested segments and remove ill-formed ones. A statistical verification model then calculates a confidence measure for each remaining to-be-tested segment to determine whether the segment has a named-entity. If the confidence measure is less than a predefined threshold, the to-be-tested segment is rejected; otherwise, it is accepted.
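
The flow described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the patented implementation: the names `segment_windows` and `recognize`, the window sizes, and the toy grammar check and confidence function in the usage lines are all hypothetical stand-ins for the predefined grammar and the statistical verification model.

```python
from typing import Callable, Iterator, List, Tuple

# A to-be-tested segment: (left context, candidate, right context).
Segment = Tuple[List[str], List[str], List[str]]


def segment_windows(tokens: List[str], x: int, y_max: int, z: int) -> Iterator[Segment]:
    """Slide a text window over the tokens and yield every segment."""
    for start in range(len(tokens)):
        for y in range(1, y_max + 1):
            end = start + y
            if end > len(tokens):
                break
            left = tokens[max(0, start - x):start]   # up to x symbols before
            cand = tokens[start:end]                 # y candidate symbols
            right = tokens[end:end + z]              # up to z symbols after
            yield left, cand, right


def recognize(tokens: List[str],
              is_well_formed: Callable[[List[str]], bool],
              confidence: Callable[[List[str], List[str], List[str]], float],
              threshold: float,
              x: int = 2, y_max: int = 3, z: int = 2) -> List[List[str]]:
    """Drop ill-formed candidates, then reject those scoring below the threshold."""
    accepted = []
    for left, cand, right in segment_windows(tokens, x, y_max, z):
        if not is_well_formed(cand):
            continue  # removed by the (hypothetical) grammar check
        if confidence(left, cand, right) < threshold:
            continue  # rejected by the verification step
        accepted.append(cand)
    return accepted


# Toy stand-ins (assumptions): capitalised tokens are "well formed", and a
# candidate is trusted only when the token just before it is the title "Mr".
tokens = ["Mr", "John", "Smith", "visited", "Taipei"]
well_formed = lambda cand: all(t[0].isupper() for t in cand)
title_cue = lambda left, cand, right: 1.0 if left and left[-1] == "Mr" else 0.0
result = recognize(tokens, well_formed, title_cue, threshold=0.5)
```

In a real system the confidence function would be the log likelihood ratio defined in the claims, and the grammar check would be an actual parser rather than a capitalisation test.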

129 Citations

18 Claims

1. A method for named-entity recognition and verification, comprising the steps of:

  (A) segmenting text data from an article into at least one to-be-tested segments according to a text window;

  (B) parsing the to-be-tested segments to remove ill-formed segments from the to-be-tested segments according to a predefined grammar;

  (C) using a hypothesis test to assess a confidence measure of each to-be-tested segment, wherein the confidence measure is determined from dividing a probability $P (o_{L, 1}^{L, x}, o_{C, 1}^{C, y}, o_{R, 1}^{R, z} | H_{0})$ of assuming that the to-be-tested segment has a named-entity by a probability $P (o_{L, 1}^{L, x}, o_{C, 1}^{C, y}, o_{R, 1}^{R, z} | H_{1})$ of assuming that the to-be-tested segment doesn't have a named-entity, where $o_{C, 1}^{C, y}$ is a candidate, $o_{L, 1}^{L, x}$ is the left context of the candidate, and $o_{R, 1}^{R, z}$ is the right context of the candidate; and

  (D) determining that the to-be-tested segment has a named-entity if the confidence measure is greater than a predefined threshold, wherein the confidence measure is expressed by a log likelihood ratio, $LLR (o_{L, 1}^{L, x}, o_{C, 1}^{C, y}, o_{R, 1}^{R, z}) = \log \frac{P (o_{L, 1}^{L, x}, o_{C, 1}^{C, y}, o_{R, 1}^{R, z} | H_{0})}{P (o_{L, 1}^{L, x}, o_{C, 1}^{C, y}, o_{R, 1}^{R, z} | H_{1})} .$
2. The method as claimed in claim 1, wherein the text window has a plurality of random variables.
3. The method as claimed in claim 2, wherein the random variables have the candidate and its left and right contexts, and the named-entity of the to-be-tested segment corresponds to the candidate.
4. The method as claimed in claim 3, wherein the text window is
5. The method as claimed in claim 4, wherein in step (D), the confidence measure is determined by using Neyman-Pearson Lemma.
6. The method as claimed in claim 1, wherein a named-entity model (NE model) is used to determine $\log P (o_{L, 1}^{L, x}, o_{C, 1}^{C, y}, o_{R, 1}^{R, z} | H_{0}),$ where $P (o_{L, 1}^{L, x}, o_{C, 1}^{C, y}, o_{R, 1}^{R, z} | H_{0})$ approximates to $P_{0} (o_{L, 1}^{L, x}, o_{C, 1}^{C, y}, o_{R, 1}^{R, z}),$ and $P_{0} (o_{L, 1}^{L, x}, o_{C, 1}^{C, y}, o_{R, 1}^{R, z})$ approximates to $P_{0} (o_{L, 1}^{L, x}) P_{0} (o_{C, 1}^{C, y}) P_{0} (o_{R, 1}^{R, z}) .$
7. The method as claimed in claim 6, wherein $P_{0} (o_{L, 1}^{L, x})$ approximates to $\prod_{i = 1}^{x} P_{0} (o_{L, i} | o_{L, i - N + 1}^{L, i - 1}),$ and $P_{0} (o_{L, i} | o_{L, i - N + 1}^{L, i - 1})$ equals to $\begin{cases} P_{0} (o_{L, i} | o_{L, 1}^{L, i - 1}), & \text{if } N > 1 \text{ and } i > 1 \text{ and } i - N \leq 0 \\ P_{0} (o_{L, i}), & \text{if } N = 1 \text{ or } i = 1 \end{cases},$ where N is a positive integer.
8. The method as claimed in claim 6, wherein $P_{0} (o_{R, 1}^{R, z})$ approximates to $\prod_{i = 1}^{z} P_{0} (o_{R, i} | o_{R, i - N + 1}^{R, i - 1}),$ and $P_{0} (o_{R, i} | o_{R, i - N + 1}^{R, i - 1})$ equals to $\begin{cases} P_{0} (o_{R, i} | o_{R, 1}^{R, i - 1}), & \text{if } N > 1 \text{ and } i > 1 \text{ and } i - N \leq 0 \\ P_{0} (o_{R, i}), & \text{if } N = 1 \text{ or } i = 1 \end{cases},$ where N is a positive integer.
9. The method as claimed in claim 6, wherein $P_{0} (o_{C, 1}^{C, y})$ equals to $\sum_{T} P_{0} (T),$ and $\sum_{T} P_{0} (T)$ approximates to $\max_{T} P_{0} (T) = \max_{T} \prod_{A \to \alpha \in T} P_{0} (\alpha | A),$ where T is a possible parsing tree, and $A \to \alpha$ is a rule in the parsing tree T.
10. The method as claimed in claim 9, wherein the NE model $S_{NE} (o_{L, 1}^{L, x}, o_{C, 1}^{C, y}, o_{R, 1}^{R, z})$ is $\sum_{i = 1}^{x} \log P_{0} (o_{L, i} | o_{L, i - N + 1}^{L, i - 1}) + \sum_{i = 1}^{z} \log P_{0} (o_{R, i} | o_{R, i - N + 1}^{R, i - 1}) + \max_{T} \sum_{A \to \alpha \in T} \log P_{0} (\alpha | A) .$
11. The method as claimed in claim 1, wherein an anti-named-entity model (anti-NE model) is used to determine $P (o_{L, 1}^{L, x}, o_{C, 1}^{C, y}, o_{R, 1}^{R, z} | H_{1}),$ where $P (o_{L, 1}^{L, x}, o_{C, 1}^{C, y}, o_{R, 1}^{R, z} | H_{1})$ is $P_{1} (o_{L, 1}^{L, x}, o_{C, 1}^{C, y}, o_{R, 1}^{R, z}),$ $P_{1} (o_{L, 1}^{L, x}, o_{C, 1}^{C, y}, o_{R, 1}^{R, z})$ approximates to $\prod_{i = 1}^{x} P_{1} (o_{L, i} | o_{L, i - N + 1}^{L, i - 1}) \times \prod_{i = 1}^{y} P_{1} (o_{C, i} | o_{C, i - N + 1}^{C, i - 1}) \times \prod_{i = 1}^{z} P_{1} (o_{R, i} | o_{R, i - N + 1}^{R, i - 1}),$ and N is a positive integer.
12. The method as claimed in claim 11, wherein $o_{R, j}$ equals to $o_{C, y + j}$ if $j = 0, -1, -2, \ldots,$ $o_{C, j}$ equals to $o_{L, x + j}$ if $j = 0, -1, -2, \ldots,$ and $P_{1} (o_{L, i} | o_{L, i - N + 1}^{L, i - 1})$ equals to $\begin{cases} P_{1} (o_{L, i} | o_{L, 1}^{L, i - 1}), & \text{if } N > 1 \text{ and } i > 1 \text{ and } i - N \leq 0 \\ P_{1} (o_{L, i}), & \text{if } N = 1 \text{ or } i = 1 \end{cases} .$
13. The method as claimed in claim 11, wherein the anti-NE model $S_{anti - NE} (o_{L, 1}^{L, x}, o_{C, 1}^{C, y}, o_{R, 1}^{R, z})$ is $\sum_{i = 1}^{x} \log P_{1} (o_{L, i} | o_{L, i - N + 1}^{L, i - 1}) + \sum_{i = 1}^{y} \log P_{1} (o_{C, i} | o_{C, i - N + 1}^{C, i - 1}) + \sum_{i = 1}^{z} \log P_{1} (o_{R, i} | o_{R, i - N + 1}^{R, i - 1}) .$
14. The method as claimed in claim 1, wherein the candidate $o_{C, 1}^{C, y}$ is composed of random variables $o_{C, 1}, o_{C, 2}, \ldots,$ and $o_{C, y},$ where y is the number of characters of the candidate.
15. The method as claimed in claim 1, wherein the left context $o_{L, 1}^{L, x}$ is composed of random variables $o_{L, 1}, o_{L, 2}, \ldots,$ and $o_{L, x},$ where x is the number of characters of the left context.
16. The method as claimed in claim 1, wherein the right context $o_{R, 1}^{R, z}$ is composed of random variables $o_{R, 1}, o_{R, 2}, \ldots,$ and $o_{R, z},$ where z is the number of characters of the right context.
17. The method as claimed in claim 2, wherein each random variable is a Chinese character.
18. The method as claimed in claim 2, wherein each random variable is an English word.
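
As a rough illustration of how claims 1, 6 to 8 and 10 to 13 fit together, the following Python sketch computes the log likelihood ratio as the NE-model score minus the anti-NE-model score and compares it with a threshold. It is a sketch under stated assumptions, not the patented implementation. The function names, the dictionary layout of the N-gram tables, the floor probability for unseen events, and the toy probabilities are all hypothetical. The `candidate_score` callable stands in for the best-parse term $\max_{T} \sum_{A \to \alpha \in T} \log P_{0} (\alpha | A)$ of claim 10, which would need a probabilistic grammar parser that is not shown here. Claim 12 is read here as letting candidate and right-context histories reach back into the preceding symbols, so the anti-NE score is computed as a single N-gram pass over the concatenated window; that reading is an interpretation.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical table layout: (history..., symbol) -> conditional probability.
NGramTable = Dict[Tuple[str, ...], float]


def ngram_logprob(seq: Sequence[str], table: NGramTable, n: int,
                  floor: float = 1e-6) -> float:
    """Sum log P(o_i | previous n-1 symbols), truncating history at the start."""
    total = 0.0
    for i, sym in enumerate(seq):
        history = tuple(seq[max(0, i - n + 1):i])
        # Unseen events fall back to a small floor probability (an assumption).
        total += math.log(table.get(history + (sym,), floor))
    return total


def ne_score(left: Sequence[str], cand: Sequence[str], right: Sequence[str],
             p0: NGramTable, n: int,
             candidate_score: Callable[[Sequence[str]], float]) -> float:
    """NE-model score: context N-grams under P0 plus the candidate's parse score."""
    return (ngram_logprob(left, p0, n)
            + ngram_logprob(right, p0, n)
            + candidate_score(cand))


def anti_ne_score(left: Sequence[str], cand: Sequence[str], right: Sequence[str],
                  p1: NGramTable, n: int) -> float:
    """Anti-NE score: one N-gram pass under P1 over left + candidate + right."""
    window = list(left) + list(cand) + list(right)
    return ngram_logprob(window, p1, n)


def log_likelihood_ratio(left, cand, right, p0, p1, n, candidate_score) -> float:
    """LLR = log P(window | H0) - log P(window | H1)."""
    return (ne_score(left, cand, right, p0, n, candidate_score)
            - anti_ne_score(left, cand, right, p1, n))


def has_named_entity(llr: float, threshold: float) -> bool:
    """Step (D): accept when the confidence measure exceeds the threshold."""
    return llr > threshold


# Toy unigram tables (N = 1) with made-up probabilities.
p0 = {("Mr",): 0.5, ("said",): 0.5}
p1 = {("Mr",): 0.1, ("John",): 0.1, ("said",): 0.1}
toy_parse_score = lambda cand: math.log(0.2)  # stand-in for the grammar term
llr = log_likelihood_ratio(["Mr"], ["John"], ["said"], p0, p1, 1, toy_parse_score)
accepted = has_named_entity(llr, threshold=0.0)
```

Working in log space turns the ratio of claim 1 into a difference of sums, which avoids numerical underflow when the window is long.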

Current Assignee: Industrial Technology Research Institute
Original Assignee: Industrial Technology Research Institute
Inventors: Lin, Yi-Chung; Hung, Peng-Hsiang
Primary Examiner(s): Hudspeth, David
Assistant Examiner(s): Albertalli, Brian Louis
Application Number: US10/227,470
Publication Number: US 20030208354A1
Time in Patent Office: 1,618 Days
Field of Search: None
US Class Current: 704/9
CPC Class Codes: G06F 40/295 Named entity recognition
