Method and system for processing text

US 8,566,080 B2
Filed: 04/29/2010
Issued: 10/22/2013
Est. Priority Date: 04/30/2009
Status: Expired due to Fees

Abstract

The present invention provides a method and system for text processing. The method comprises determining at least a part of characters in a text; dividing the text into a plurality of text segments by using the at least a part of characters as separators; and decoding the plurality of text segments respectively.
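
The abstract's three steps (pick separator characters, split the text on them, decode each segment on its own) can be sketched as follows. This is a minimal illustration, not the patented implementation. It assumes GBK-encoded input and ASCII punctuation as separators, and the function name `decode_segments` is hypothetical.

```python
import re
from typing import List

# Assumption: ASCII punctuation bytes act as separators. In GBK, bytes below
# 0x40 never appear inside a double-byte character, so a byte-level split on
# these values cannot cut a character in half.
SEPARATORS = re.compile(rb"([,.;!?])")


def decode_segments(data: bytes, encoding: str = "gbk") -> List[str]:
    """Split raw bytes on punctuation and decode every piece independently."""
    pieces = [p for p in SEPARATORS.split(data) if p]
    # Each piece gets its own decode call, so a damaged piece cannot disturb
    # the decoding of its neighbours.
    return [p.decode(encoding, errors="replace") for p in pieces]
```

With a damaged first segment (one byte missing), only that segment contains a replacement character; the segment after the separator still decodes cleanly.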

18 Claims

1. A method for text processing, comprising:
- determining a plurality of characters in a text, wherein the text comprises double-byte coded characters;
  
  determining whether a number of bytes included in each text segment is even or odd;
  
  detecting which of the plurality of characters represent punctuations;
  
  dividing the text into a plurality of different text segments using the detected punctuations as separators between the different text segments; and
  
  performing a plurality of discrete decoding operations, one for each of the plurality of different text segments, wherein one or more of the plurality of different text segments comprise at least one occurrence of unrecognizable codes that are unable to be successfully decoded as comprehensible characters without inferences being made, wherein decoding operations on text segments lacking unrecognizable codes are unaffected by other decoding operations on text segments including unrecognizable codes; and
  
  when performing the plurality of discrete decoding operations and only when the number of word segments included in one of the text segments is odd, decoding from a head of the text segment rearward, as a first decoding result of the text segment, and decoding from a tail of the text segment frontward, as a second decoding result of the text segment.
  - 2. A method according to claim 1, further comprising:
    - determining a first part and a second part of punctuations in the text;
      
      dividing the text into a first segment based on the first part of the punctuations and a second segment based on the second part of punctuations;
      
      decoding the first segment to obtain a first decoding result of the text;
      
      decoding the second segment to obtain a second decoding result; and
      
      comparing the first decoding result to the second decoding result to determine a decoding difference.
  - 3. A method according to claim 1, wherein each of the discrete decoding operations comprises:
    - decoding the text segment using a first decoding method, to obtain a first decoding result of the text; and
      
      decoding the text segment using a second decoding method, to obtain a second decoding result of the text; and
      
      comparing the first decoding result of the text segment to the second decoding result of the text segment, indicating a problem with decoding the text segment when the first decoding result differs from the second decoding result.
  - 4. A method according to claim 1, wherein each of the discrete decoding operations comprises:
    - decoding the text segment from a beginning to an ending of each text segment to obtain a first decoding result; and
      
      decoding the text segment from the ending to the beginning of the text segment to obtain a second decoding result; and
      
      determining a decoding problem exists, when the first decoding result is different than the second decoding result.
  - 5. The method of claim 1, further comprising:
    - when a decoding problem exists for a text segment and when each of the first decoding result and the second decoding result comprise unrecognizable characters;
      
      adding to a combined result recognizable characters from the first decoding result from the beginning of the first decoding result until an unrecognizable character is detected in the first decoding result; and
      
      adding in reverse order to the combined result characters from the second decoding result from the ending of the second decoding result until an unrecognizable character is detected in the second decoding result; and
      
      using the combined result as a decoding result for the text segment for which a decoding problem exists.
  - 6. A method according to claim 1, wherein decoding operations performed for each of the text segments comprises:
    - decoding from a head of a text segment rearward, as a first decoding result of the text segment, and decoding from a tail of a text segment frontward, as a second decoding result of the text segment.
  - 7. A method according to claim 1, further comprising:
    - determining a front part element from the first decoding result;
      
      determining a rear part element from the second decoding result; and
      
      combining the front part element and the rear part element into a final decoding result of the text segment.
  - 8. A method according to claim 1, wherein the decoding operations performed for each text segments further comprises:
    - determining whether the text segment comprises an ASCII coded character;
      
      when the text segment comprises an ASCII coded character, dividing the text segment further into two sub-text segments by using the ASCII coded character as a separator, and decoding the two sub-text segments using independent decoding operations.
  - 9. A method according to claim 1, wherein the punctuations comprise at least one of “,” “.” “;” “;” “!” “?” or “\”; and the text comprises at least one of Chinese text, Japanese text, and Korean text.
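
The head-to-tail and tail-to-head decoding of claims 1 and 4, and the combination step of claim 5, can be sketched as below. This is a toy model under stated assumptions: the code table `TABLE` recognizes only four double-byte codes, every character is exactly two bytes, and all function names are hypothetical. A real decoder would use a full code table.

```python
from typing import List, Optional

# Hypothetical toy code table: only these four double-byte codes count as
# recognizable; every other byte pair is treated as an unrecognizable code.
TABLE = {ch.encode("gbk"): ch for ch in "你好世界"}


def decode_forward(seg: bytes) -> List[Optional[str]]:
    """Pair bytes starting at the head; None marks an unrecognizable code."""
    return [TABLE.get(seg[i:i + 2]) for i in range(0, len(seg), 2)]


def decode_backward(seg: bytes) -> List[Optional[str]]:
    """Pair bytes starting at the tail; the result is in reading order."""
    out = []
    end = len(seg)
    while end > 0:
        start = max(end - 2, 0)
        out.append(TABLE.get(seg[start:end]))
        end = start
    return out[::-1]


def combine(first: List[Optional[str]], second: List[Optional[str]]) -> str:
    """Keep the recognizable head of `first` and recognizable tail of `second`."""
    head = []
    for ch in first:
        if ch is None:
            break
        head.append(ch)
    tail = []
    for ch in reversed(second):
        if ch is None:
            break
        tail.append(ch)
    return "".join(head) + "".join(reversed(tail))


def decode_segment(seg: bytes) -> str:
    first = decode_forward(seg)
    if len(seg) % 2 == 0:
        # Even byte count: both directions pair the bytes identically.
        return "".join(ch if ch is not None else "\ufffd" for ch in first)
    # Odd byte count: the two directions disagree, so merge them.
    return combine(first, decode_backward(seg))
```

In this sketch, dropping one byte from the middle of a segment makes the forward pass correct only before the damage and the backward pass correct only after it, which is why the two partial results are merged.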

10. A system for text processing, comprising:
- a character determining module for determining a plurality of characters in a text and for detecting which of the plurality of characters represent punctuations, wherein the text comprises at least one of double-byte coded characters and multi-byte coded characters;
  
  a text segment dividing module for dividing the text into a plurality of different text segments using the punctuations detected by the character determination module as separators between the different text segments wherein the text segment dividing module is further configured to:
  
  determine a first part and a second part of punctuations in the text;
  
  divide the text into a first segment based on the first part of the punctuations and a second segment based on the second part of punctuations; and
  
  a decoding module for performing a plurality of discrete decoding operations on the text, one for each of the plurality of different text segments, wherein one or more of the plurality of different text segments comprises at least one occurrence of unrecognizable codes that are unable to be successfully decoded as comprehensible characters without inferences being made, wherein decoding operations on text segments lacking unrecognizable codes are unaffected by other decoding operations on text segments including unrecognizable codes, wherein the decoding module is further configured to:
  
  decode the first segment to obtain a first decoding result of the text, decode the second segment to obtain a second decoding result; and
  
  compare the first decoding result to the second decoding result to determine a decoding difference.
  - 11. A system according to claim 10, wherein the decoding module is configured to:
    - decode at least a portion of the different text segments using a first decoding method, to obtain a first decoding result of the text segment; and
      
      decoding the portion of the text segments using a second decoding method, to obtain a second decoding result of the text segment; and
      
      comparing the first decoding result of the text segment to the second decoding result of the text segment, indicating a problem with decoding the text segment when the first decoding result differs from the second decoding result.
  - 12. A system according to claim 11, wherein the decoding module decodes the plurality of text segments from a head of each text segment, so as to obtain a first decoding result of the text;
    - and the decoding module decodes the plurality of text segments from a tail of each text segment, so as to obtain a second decoding result of the text.
  - 13. A system according to claim 10, wherein the system is further configured such that:
    - when a decoding problem exists for a text segment and when each of the first decoding result and the second decoding result comprise unrecognizable characters;
      
      adding to a combined result recognizable characters from the first decoding result from the beginning of the first decoding result until an unrecognizable character is detected in the first decoding result; and
      
      adding in reverse order to the combined result characters from the second decoding result from the ending of the second decoding result until an unrecognizable character is detected in the second decoding result; and
      
      using the combined result as a decoding result for the text segment for which a decoding problem exists.
  - 14. A system according to claim 10, wherein the decoding module is further configured to:
    - decode from a head of a text segment rearward, as a first decoding result of the text segment, and decode from a tail of a text segment frontward, as a second decoding result of the text segment.
  - 15. A system according to claim 10, wherein the character determining module is further configured to:
    - determine whether the text segment comprises an ASCII coded character;
      
      when the text segment comprises an ASCII coded character, divide the text segment further into two sub-text segments by using the ASCII coded character as a separator, and decoding the two sub-text segments using independent decoding operations.
  - 16. A system according to claim 10, wherein the text comprises a double-byte coded characters, the system further comprising:
    - a byte number determining module for determining whether the number of bytes comprised in the double-byte coded text segment is odd; and
      
      only when the number of word segments included in the text segment is odd, decoding from a head of the text segment to the tail, as a first decoding result of the text segment, and decoding from a tail of the text segment to the head, as a second decoding result of the text segment.
  - 17. A system according to claim 16, further comprising:
    - a text segment front part element determining module for determining the front part element of the text segment from the first decoding result;
      
      a text segment rear part element determining module for determining the rear part element of the text segment from the second decoding result; and
      
      an element combining module for combining the front part element and the rear part element into a final decoding result of the text segment.
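
The ASCII-based sub-division of claims 8 and 15 can be illustrated with a short sketch. It assumes GB2312 (EUC-CN) input, where both bytes of a double-byte character are 0xA1 or above, so any byte below 0x80 is an ASCII character. The function name `decode_with_ascii_split` is hypothetical, and the sketch generalizes the claim's single separator to runs of ASCII bytes.

```python
import re
from typing import List

# Assumption: in GB2312 (EUC-CN) every byte below 0x80 is a standalone ASCII
# character, so runs of such bytes are safe split points.
ASCII_RUN = re.compile(rb"([\x00-\x7f]+)")


def decode_with_ascii_split(seg: bytes, encoding: str = "gb2312") -> List[str]:
    """Split a segment at ASCII runs and decode each sub-segment separately."""
    pieces = [p for p in ASCII_RUN.split(seg) if p]
    # One decode call per sub-segment keeps damage on one side of an ASCII
    # character from leaking into the other side.
    return [p.decode(encoding, errors="replace") for p in pieces]
```

If the bytes before an ASCII character are damaged, the sub-segment after it is still decoded from a known character boundary.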

18. A method for text processing, comprising:
- determining a plurality of characters in a text, wherein the text comprises at least one of double-byte coded characters and multi-byte coded characters;
  
  detecting which of the plurality of characters represent punctuations;
  
  determining a first part and a second part of punctuations in the text;
  
  dividing the text into a first segment based on the first part of the punctuations and a second segment based on the second part of punctuations;
  
  dividing the text into a plurality of different text segments using the detected punctuations as separators between the different text segments; and
  
  performing a plurality of discrete decoding operations, one for each of the plurality of different text segments, wherein one or more of the plurality of different text segments comprise at least one occurrence of unrecognizable codes that are unable to be successfully decoded as comprehensible characters without inferences being made, wherein decoding operations on text segments lacking unrecognizable codes are unaffected by other decoding operations on text segments including unrecognizable codes, wherein the performing of the discrete coding operations comprises:
  
  decoding the first segment to obtain a first decoding result of the text;
  
  decoding the second segment to obtain a second decoding result; and
  
  comparing the first decoding result to the second decoding result to determine a decoding difference.
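
The comparison in claim 18 (split the same text on two different subsets of the punctuation, decode both ways, and look for a difference) can be sketched as below. The decoder here is a deliberately naive stand-in, assumed for illustration only: it consumes two bytes whenever it sees a lead byte, so a damaged character can swallow the byte that follows it. The names `naive_decode` and `decode_partition` are hypothetical.

```python
import re


def naive_decode(seg: bytes, encoding: str = "gb2312") -> str:
    """Toy decoder: a byte >= 0x80 always consumes the next byte as well."""
    out, i = [], 0
    while i < len(seg):
        width = 1 if seg[i] < 0x80 else 2
        chunk = seg[i:i + width]
        try:
            out.append(chunk.decode(encoding))
        except UnicodeDecodeError:
            out.append("\ufffd")  # unrecognizable code
        i += width
    return "".join(out)


def decode_partition(data: bytes, separators: bytes) -> str:
    """Split on the given separator bytes, then decode each piece on its own."""
    pattern = re.compile(b"([" + re.escape(separators) + b"])")
    return "".join(naive_decode(p) for p in pattern.split(data))


# A text whose second character has lost its trailing byte.
data = ("你".encode("gb2312") + "好".encode("gb2312")[:1] + b","
        + "世界".encode("gb2312") + b".")

first = decode_partition(data, b",")   # first part of the punctuations
second = decode_partition(data, b".")  # second part of the punctuations
has_difference = first != second
```

In this example the comma survives only when it is used as a separator, so the two results differ and the difference marks the damaged region.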

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Li, Bin; Zuo, Zhi Bo; Pang, Li Qun; Sha, Zhi Qiang
Primary Examiner(s): Opsasnick, Michael N
Application Number: US12/770,439
Publication Number: US 20100278427A1
Time in Patent Office: 1,272 Days
Field of Search: 715/256, 715/259, 707/758, 707/760, 707/776, 707/778, 707/674-677, 707/690, 704/6, 704/9
US Class Current: 704/6
CPC Class Codes:
G06F 40/129   Handling non-Latin characte...
G06F 40/284   Lexical analysis, e.g. toke...
G06F 40/289   Phrasal analysis, e.g. fini...
