Method and device for automatic error detection and correction for computerized text files
Abstract
A method and device for automatic error detection and correction in computerized text files use a two-step segmentation method. A sentence of the computerized text file is segmented into an original format in the first segmentation step and converted into a corrected sentence in the second segmentation step. In the first segmentation step, the original sentence is segmented into a series of characters, and the characters are analyzed so that their original phonetic or pictographic codes are revealed. The sentence in the original format is then converted into a series of phonetic and/or pictographic representative codes. Words constituting the sentence are then selected from a lexicon to reconstruct the sentence. The reconstructed sentence is then segmented again so that errors in the original sentence are detected and corrections are suggested.
14 Claims
-
1. A device for the detection and correction of errors contained in an original text file that includes text consisting essentially of non-alphabetic blocked characters, comprising:
an original text file obtaining means, comprising an original text file memory and, said text file obtaining means being used to obtain a series of non-alphabetic blocked characters from a text file to be processed as a "sentence" and to store said sentence in said original text file memory.
-
2. The device according to claim 1 wherein said representative-to-target code converter calculates the scores according to the factors comprising:
-
|Wi|: Number of characters of character stream Wi, i=1 to n;
Prob(Wi): Frequency of Wi to exist in articles;
Prob(POSi | POSi-1): Probability of a term with POS (part-of-speech) of Wi to follow a term with POS of Wi-1 in articles; and
Ci: Number of characters in the picked-up character stream that are different from its corresponding character in the original character stream.
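The four factors listed above can be combined in many ways, and the claimed equation is not reproduced in this text. The following Python sketch therefore uses an assumed log-linear combination purely for illustration. The `Term` record, the function name `score_segmentation`, and the weights `w_len`, `w_freq`, `w_pos` and `w_diff` are hypothetical and do not come from the patent.

```python
import math
from dataclasses import dataclass


@dataclass
class Term:
    text: str         # picked-up character stream Wi
    prob: float       # Prob(Wi): relative frequency of Wi in articles
    pos_trans: float  # Prob(POSi | POSi-1)
    diff: int         # Ci: characters that differ from the original stream


def score_segmentation(terms, w_len=1.0, w_freq=1.0, w_pos=1.0, w_diff=1.0):
    """Score one candidate series of character streams.

    ASSUMPTION: a weighted sum of length, log-probabilities and a
    penalty for changed characters. This is not the patent's equation.
    """
    total = 0.0
    for t in terms:
        total += w_len * len(t.text)            # |Wi|: longer terms score higher
        total += w_freq * math.log(t.prob)      # frequent terms score higher
        total += w_pos * math.log(t.pos_trans)  # likely POS sequences score higher
        total -= w_diff * t.diff                # each changed character costs
    return total
```

Under this assumed combination, a candidate that changes fewer characters of the original sentence outranks an otherwise identical candidate that changes more.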
-
-
3. The device according to claim 2 wherein said representative-to-target code converter calculates the scores according to the following equation:
- ##EQU7##
-
4. The device according to any one of claims 1, 2 and 3, wherein said representative-to-target code converter further comprises an interface means which comprises a display such that the character streams with higher scores are displayed in said display.
-
5. The device according to any one of claims 1, 2 and 3, wherein said representative code table is a phonetic code table having characters encoded according to their pronunciations wherein characters with same pronunciation are encoded with the same code.
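A phonetic code table of the kind recited in claim 5 can be pictured as a mapping from each character to a pronunciation code, so that homophones collapse to the same code. The Python sketch below is only an illustration: the table name `PHONETIC_CODES`, the pinyin-with-tone strings used as codes, and the helper names `encode` and `same_code_groups` are assumptions, not the encoding used by the patent.

```python
from collections import defaultdict

# Hypothetical phonetic code table. Pinyin-with-tone strings stand in
# for whatever code values a real table would use.
PHONETIC_CODES = {
    "在": "zai4",
    "再": "zai4",
    "他": "ta1",
    "她": "ta1",
    "见": "jian4",
}


def encode(sentence, table=PHONETIC_CODES):
    """Replace each character by its code; unknown characters pass through."""
    return [table.get(ch, ch) for ch in sentence]


def same_code_groups(table=PHONETIC_CODES):
    """Group characters that share a code, i.e. candidate substitutions."""
    groups = defaultdict(set)
    for ch, code in table.items():
        groups[code].add(ch)
    return dict(groups)
```

With this table, the mistyped "在见" and the intended "再见" encode to the same code sequence, which is what lets a later lookup step propose the correct term.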
-
6. The device according to any one of claims 1, 2 and 3, wherein said representative code table is a pictorial code table having characters encoded according to their stroke structures wherein characters with similar stroke structure are encoded with the same code.
-
7. The device according to any one of claims 1, 2 and 3, wherein said representative code table is a conversional code table having characters encoded according to their relation between the complex Chinese system and the simplified Chinese system wherein characters that are corresponding in the complex Chinese system and the simplified Chinese system are encoded with the same code.
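As a rough illustration of the conversional code table in claim 7, the Python sketch below builds a table from pairs of corresponding complex and simplified characters and assigns both members of a pair the same code. The pair list `PAIRS`, the integer codes, and the function names are hypothetical. The sketch also assumes one-to-one pairs, which does not hold for every character (for example, several complex characters can share one simplified form).

```python
# Hypothetical sample of (complex, simplified) character pairs.
PAIRS = [("體", "体"), ("國", "国"), ("學", "学")]


def build_conversional_table(pairs):
    """Give both members of each pair the same integer code.

    ASSUMPTION: pairs are one-to-one; a character appearing in two
    pairs would keep only the code of the later pair.
    """
    table = {}
    for code, (complex_ch, simplified_ch) in enumerate(pairs):
        table[complex_ch] = code
        table[simplified_ch] = code
    return table


def same_character(a, b, table):
    """True if a and b are identical or share a conversional code."""
    if a == b:
        return True
    return a in table and table.get(a) == table.get(b)
```

A comparison based on such a table treats "體" and "体" as the same character, so a mixed-system sentence is not flagged as erroneous merely because of the writing system used.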
-
8. A method for the detection and correction of errors contained in a text file that includes text consisting essentially of non-alphabetic blocked characters, comprising:
- obtaining a series of blocked characters from a text file to be processed and treating said series of character stream as a "sentence";
- fractionating said sentence into a series of non-overlapping character streams;
- converting, according to a representative code table, codes of the blocked characters of said sentence into representative codes;
- fractionating said representative coded sentence into at least one series of non-overlapping character streams according to a representative coded term book by picking up terms with character streams having same representative codes from said representative coded term book;
- calculating a score of a combination of said at least one series of character streams; and
- outputting the series of character streams with the highest score.
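The converting, second fractionating, scoring and outputting steps of claim 8 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the tables `CODE_TABLE` and `TERM_BOOK`, the frequencies in them, the function names, and the scoring rule (term length plus log frequency, minus one per changed character) are all hypothetical. The first fractionating step on the original sentence is omitted, and every character is assumed to be present in the code table.

```python
import math

# Hypothetical representative code table (pinyin-like phonetic codes).
CODE_TABLE = {"在": "zai4", "再": "zai4", "见": "jian4"}

# Hypothetical representative coded term book:
# code sequence -> list of (term, made-up relative frequency).
TERM_BOOK = {
    ("zai4", "jian4"): [("再见", 0.01)],
    ("zai4",): [("在", 0.02), ("再", 0.01)],
    ("jian4",): [("见", 0.005)],
}


def to_codes(sentence):
    """Convert each blocked character to its representative code."""
    return tuple(CODE_TABLE[ch] for ch in sentence)


def candidates(codes, start=0):
    """Yield every split of codes[start:] into non-overlapping term-book entries."""
    if start == len(codes):
        yield []
        return
    for end in range(start + 1, len(codes) + 1):
        for term, prob in TERM_BOOK.get(codes[start:end], []):
            for rest in candidates(codes, end):
                yield [(term, prob)] + rest


def score(segmentation, original):
    """ASSUMED score: length + log frequency per term, minus changed characters."""
    joined = "".join(term for term, _ in segmentation)
    changed = sum(a != b for a, b in zip(joined, original))
    return sum(len(t) + math.log(p) for t, p in segmentation) - changed


def correct(sentence):
    """Return the highest-scoring series of character streams."""
    codes = to_codes(sentence)
    best = max(candidates(codes), key=lambda seg: score(seg, sentence))
    return [term for term, _ in best]
```

With these made-up tables, `correct("在见")` selects the single two-character term "再见" over the character-by-character readings, because the longer term outweighs the penalty for the one changed character.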
-
9. The method according to claim 8 wherein said score is calculated according to at least the following factors:
-
|Wi|: Number of characters of character stream Wi, i=1 to n;
Prob(Wi): Frequency of Wi to exist in articles;
Prob(POSi | POSi-1): Probability of a term with POS (part-of-speech) of Wi to follow a term with POS of Wi-1 in articles; and
Ci: Number of characters in the picked-up character stream that are different from its corresponding character in the original character stream.
-
-
10. The method according to claim 9 wherein said score is calculated according to the following equation:
- ##EQU8##
-
11. The method according to any one of claims 8, 9, and 10, further comprising a step of displaying series of character streams with highest scores.
-
12. The method according to any one of claims 8, 9, and 10, wherein said representative code table is a phonetic code table having characters encoded according to their pronunciations wherein characters with same pronunciation are encoded with the same code.
-
13. The method according to any one of claims 8, 9, and 10, wherein said representative code table is a pictorial code table having characters encoded according to their stroke structures wherein characters with similar stroke structure are encoded with the same code.
-
14. The method according to any one of claims 8, 9, and 10, wherein said representative code table is a conversional code table having characters encoded according to their relation between the complex Chinese system and the simplified Chinese system wherein characters that are corresponding in the complex Chinese system and the simplified Chinese system are encoded with the same code.
Specification