Method and device for automatic error detection and correction for computerized text files

US 6,167,367 A
Filed: 08/09/1997
Issued: 12/26/2000
Est. Priority Date: 08/09/1997
Status: Expired due to Fees

Abstract

A method and device for automatic error detection and correction for computerized text files use a two-step segmentation method. A sentence of the computerized text file is first segmented, in the first segmentation step, into an original format and then converted into a correct sentence in the second segmentation step. In the first segmentation step, the original sentence is segmented into a series of characters and the characters are analyzed so that the original phonetic or pictographic codes of the characters are revealed. The sentence in the original format is then converted into a series of phonetic representative codes and/or pictographic representative codes. Words constituting the sentence are then selected from a lexicon to reconstruct the sentence. The reconstructed sentence is then segmented again so that the errors in the original sentence are detected and corrections thereof are suggested.

14 Claims

1. A device for the detection and correction of errors contained in an original text file that includes text consisting essentially of non-alphabetic blocked characters, comprising:
- an original text file obtaining means, comprising an original text file memory and, said text file obtaining means being used to obtain a series of non-alphabetic blocked characters from a text file to be processed as a "sentence" and to store said sentence in said original text file memory.

2. The device according to claim 1 wherein said representative-to-target code converter calculates the scores according to the factors comprising:
- |W_i|: number of characters of character stream W_i, i = 1 to n;
- Prob(W_i): frequency with which W_i occurs in articles;
- Prob(POS_i | POS_{i-1}): probability of a term with the POS (part-of-speech) of W_i following a term with the POS of W_{i-1} in articles; and
- C_i: number of characters in the picked-up character stream that are different from the corresponding characters in the original character stream.
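
The four factors above can be illustrated with a small scoring sketch. The combining equation referred to in claim 3 is not reproduced in this text, so the log-linear form, the weights `w_len` and `w_change`, the probability floor, and the names `Term` and `score_series` below are all assumptions made for illustration, not the claimed equation.

```python
import math
from dataclasses import dataclass


@dataclass
class Term:
    text: str     # picked-up character stream W_i
    prob: float   # Prob(W_i): relative frequency of W_i in articles
    pos: str      # part-of-speech tag of W_i
    changed: int  # C_i: characters that differ from the original stream


def score_series(terms, pos_bigram, w_len=1.0, w_change=2.0, floor=1e-6):
    """Hypothetical log-linear score of one series of terms; higher is better.

    pos_bigram maps (previous POS, current POS) to Prob(POS_i | POS_{i-1}).
    """
    total, prev = 0.0, "<s>"  # "<s>" is an assumed sentence-start tag
    for t in terms:
        total += w_len * len(t.text)                             # |W_i|
        total += math.log(max(t.prob, floor))                    # Prob(W_i)
        total += math.log(pos_bigram.get((prev, t.pos), floor))  # POS bigram
        total -= w_change * t.changed                            # C_i penalty
        prev = t.pos
    return total
```

Under these assumptions, longer and more frequent terms raise the score, while every character that departs from the original text lowers it by `w_change`.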

3. The device according to claim 2 wherein said representative-to-target code converter calculates the scores according to the following equation:
- [Equation EQU7 is not reproduced in this text]

4. The device according to any one of claims 1, 2 and 3, wherein said representative-to-target code converter further comprises an interface means which comprises a display such that the character streams with higher scores are displayed in said display.

5. The device according to any one of claims 1, 2 and 3, wherein said representative code table is a phonetic code table having characters encoded according to their pronunciations wherein characters with the same pronunciation are encoded with the same code.

6. The device according to any one of claims 1, 2 and 3, wherein said representative code table is a pictorial code table having characters encoded according to their stroke structures wherein characters with similar stroke structure are encoded with the same code.

7. The device according to any one of claims 1, 2 and 3, wherein said representative code table is a conversional code table having characters encoded according to their relation between the complex Chinese system and the simplified Chinese system wherein characters that are corresponding in the complex Chinese system and the simplified Chinese system are encoded with the same code.
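
The three kinds of representative code table described in claims 5 to 7 share one idea: characters that are easily confused with each other receive the same code. The sketch below illustrates that idea only. The helper names `build_code_table` and `encode`, the choice of the first character of each group as its code, and the tiny character groups are assumptions made for illustration; the claims do not specify how the tables are built.

```python
def build_code_table(groups):
    """Give every character in a group the same representative code.

    The code used here is simply the first character of the group,
    an arbitrary convention chosen for this sketch.
    """
    table = {}
    for group in groups:
        code = group[0]
        for ch in group:
            table[ch] = code
    return table


# Toy equivalence classes (illustrative and far from complete).
PHONETIC = build_code_table(["在再", "見件建"])   # same pronunciation
PICTORIAL = build_code_table(["己已巳", "未末"])  # similar stroke structure
CONVERSIONAL = build_code_table(["體体", "國国"])  # traditional/simplified pairs


def encode(text, table):
    """Replace each character by its code; unknown characters map to themselves."""
    return [table.get(ch, ch) for ch in text]
```

With such a table, a mistyped character and the intended one encode identically, which is what allows a later lookup step to retrieve the intended term.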

8. A method for the detection and correction of errors contained in a text file that includes text consisting essentially of non-alphabetic blocked characters, comprising:
- obtaining a series of blocked characters from a text file to be processed and treating said series of blocked characters as a "sentence";
- fractionating said sentence into a series of non-overlapping character streams;
- converting, according to a representative code table, codes of the blocked characters of said sentence into representative codes;
- fractionating said representative coded sentence into at least one series of non-overlapping character streams according to a representative coded term book by picking up terms with character streams having the same representative codes from said representative coded term book;
- calculating a score of a combination of said at least one series of character streams; and
- outputting the series of character streams with the highest score.
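
The converting, fractionating, scoring, and outputting steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the phonetic codes, the term frequencies, the simplified score (log frequency minus a per-character change penalty, with no part-of-speech factor), the one-character fallback for unknown text, and the names `encode`, `CODED_BOOK`, and `correct` are all hypothetical. The initial fractionation of the original sentence is omitted.

```python
import math
from functools import lru_cache

# Toy phonetic code table (assumed codes); homophones share a code.
CODE = {"在": "zai4", "再": "zai4", "見": "jian4", "件": "jian4",
        "我": "wo3", "們": "men5"}
# Toy term book: term -> relative frequency (made-up numbers).
TERM_BOOK = {"再見": 0.02, "我們": 0.05, "我": 0.03,
             "在": 0.04, "見": 0.005, "再": 0.01}


def encode(text):
    """Convert characters to representative codes (unknown ones map to themselves)."""
    return tuple(CODE.get(ch, ch) for ch in text)


# Representative coded term book: code sequence -> terms sharing that sequence.
CODED_BOOK = {}
for term in TERM_BOOK:
    CODED_BOOK.setdefault(encode(term), []).append(term)


def correct(sentence, change_penalty=1.0, unknown_logp=-12.0):
    """Return (terms, score) for the best-scoring non-overlapping series."""
    codes = encode(sentence)
    max_len = max(len(key) for key in CODED_BOOK)

    @lru_cache(maxsize=None)
    def best_from(i):
        """Best (score, terms) covering codes[i:]."""
        if i == len(codes):
            return 0.0, ()
        # Fallback: keep the original character as an unknown one-character term.
        candidates = [(unknown_logp, sentence[i])]
        for n in range(1, min(max_len, len(codes) - i) + 1):
            for term in CODED_BOOK.get(codes[i:i + n], []):
                # Count characters that differ from the original text.
                changed = sum(a != b for a, b in zip(term, sentence[i:i + n]))
                logp = math.log(TERM_BOOK[term])
                candidates.append((logp - change_penalty * changed, term))
        options = []
        for s, term in candidates:
            rest_score, rest_terms = best_from(i + len(term))
            options.append((s + rest_score, (term,) + rest_terms))
        return max(options)

    score, terms = best_from(0)
    return list(terms), score
```

With these toy tables, `correct("我們在見")` returns the series 我們, 再見: the mistyped 在 and the intended 再 share a code, so the term 再見 is picked up from the coded term book and outscores the character-by-character reading. Comparing the joined output with the input then marks the position of the detected error.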

9. The method according to claim 8 wherein said score is calculated according to at least the following factors:
- |W_i|: number of characters of character stream W_i, i = 1 to n;
- Prob(W_i): frequency with which W_i occurs in articles;
- Prob(POS_i | POS_{i-1}): probability of a term with the POS (part-of-speech) of W_i following a term with the POS of W_{i-1} in articles; and
- C_i: number of characters in the picked-up character stream that are different from the corresponding characters in the original character stream.

10. The method according to claim 9 wherein said score is calculated according to the following equation:
- [Equation EQU8 is not reproduced in this text]

11. The method according to any one of claims 8, 9, and 10, further comprising a step of displaying series of character streams with highest scores.

12. The method according to any one of claims 8, 9, and 10, wherein said representative code table is a phonetic code table having characters encoded according to their pronunciations wherein characters with the same pronunciation are encoded with the same code.

13. The method according to any one of claims 8, 9, and 10, wherein said representative code table is a pictorial code table having characters encoded according to their stroke structures wherein characters with similar stroke structure are encoded with the same code.

14. The method according to any one of claims 8, 9, and 10, wherein said representative code table is a conversional code table having characters encoded according to their relation between the complex Chinese system and the simplified Chinese system wherein characters that are corresponding in the complex Chinese system and the simplified Chinese system are encoded with the same code.

Current Assignee: Galaxy Software Services Limited, National Tsing Hua University
Original Assignee: Galaxy Software Services Limited, National Tsing Hua University
Inventors: Chang, Jyun-Sheng; Lin, Tsuey-Fen
Primary Examiner(s): Chang, Vivian
Assistant Examiner(s): Edouard, Patrick Nestor
Application Number: US08/914,225
Time in Patent Office: 1,235 Days
Field of Search: 704/1, 704/8, 704/9, 704/10, 707/535, 707/536, 707/532, 382/185, 382/178, 382/229, 382/230, 341/28
US Class Current: 704/8
CPC Class Codes:
- G06F 40/232   Orthographic correction, e....
- G06F 40/53   Processing of non-Latin tex...
- G06V 30/10   Character recognition
- G06V 30/262   using context analysis, e.g...
