Method, apparatus, and storage medium for text information processing

US 10,262,059 B2
Filed: 06/06/2016
Issued: 04/16/2019
Est. Priority Date: 03/14/2014
Status: Active Grant

Abstract

Method, apparatus, and storage medium for text information processing are provided. The method includes: performing word segmentation on a target text according to a preset fixed word segmentation policy, and comparing a word segmentation result with a preset word segmentation list, to obtain a new word; adding the new word to the preset word segmentation list, to obtain a test word segmentation list; classifying a test text according to the preset word segmentation list, to obtain a first text, and classifying the test text according to the test word segmentation list, to obtain a second text; comparing classification accuracy of the first text with classification accuracy of the second text, and determining a target new word from the new word according to a comparison result; and adding the target new word to the preset word segmentation list, and classifying the target text.

15 Claims

1. A text information processing method, applied to a terminal, the terminal comprising one or more processors, a memory, and program instructions stored in the memory, the program instructions being executed by the one or more processors, and the method comprising:
- performing word segmentation on a target text according to a preset fixed word segmentation policy, to obtain a word segmentation result;
  
  comparing the word segmentation result with a preset word segmentation list, and obtaining a word, which is not in the preset word segmentation list, as a new word;
  
  adding the new word to the preset word segmentation list, to obtain a test word segmentation list;
  
  classifying a test text according to the preset word segmentation list, to obtain a first text, and classifying the test text according to the test word segmentation list, to obtain a second text;
  
  calculating classification accuracy of the first text and classification accuracy of the second text;
  
  comparing the classification accuracy of the first text with the classification accuracy of the second text, and determining a target new word from the new word according to a comparison result;
  
  adding the target new word to the preset word segmentation list, to obtain a target preset word segmentation list; and
  
  classifying the target text according to the target preset word segmentation list, wherein the classifying a test text according to the preset word segmentation list, to obtain a first text, and classifying the test text according to the test word segmentation list, to obtain a second text comprises:
  
  classifying the test text according to a preset classification algorithm, to obtain the first text, wherein the preset classification algorithm is associated with the preset word segmentation list; and
  
  classifying the test text according to the preset classification algorithm, to obtain the second text, wherein the preset classification algorithm is associated with the test word segmentation list; and
  
  the classifying the target text according to the target preset word segmentation list comprises:
  
  calibrating the preset classification algorithm according to the target preset word segmentation list, and classifying the target text according to the calibrated preset classification algorithm.
- - 2. The method according to claim 1, wherein the calculating classification accuracy of the first text and classification accuracy of the second text comprises:
    - separately calculating, for each new word, classification accuracy of a first text corresponding to the each new word and classification accuracy of a second text corresponding to the each new word.
  - 3. The method according to claim 2, wherein the comparing the classification accuracy of the first text with the classification accuracy of the second text, and determining a target new word from the new word according to a comparison result comprises:
    - subtracting the classification accuracy of the first text corresponding to the each new word from the classification accuracy of the second text corresponding to the new word, to obtain a difference;
      
      determining whether the difference meets a preset difference; and
      
      determining the new word as the target new word if the difference meets the preset difference.
  - 4. The method according to claim 1, wherein the comparing the word segmentation result with a preset word segmentation list, to obtain a word segmentation result, which is not in the preset word segmentation list, as a new word comprises:
    - determining whether a word in the word segmentation result matches a word in the preset word segmentation list, and if not, collecting statistics on an eigenvalue of a word, which does not match the word in the preset word segmentation list, in the word segmentation result, wherein the eigenvalue comprises a frequency at which the mismatched word appears in the target text; and
      
      determining the mismatched word as the new word if the eigenvalue of the mismatched word meets a preset eigenvalue.
  - 5. The method according to claim 1, wherein the performing word segmentation on a target text according to a preset fixed word segmentation policy comprises:
    - intercepting the target text every N characters from the first character, to obtain multiple word strings, wherein a character quantity of each word string is N, and N is a positive integer greater than 1.
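
The fixed segmentation of claim 5 and the frequency filter of claim 4 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `segment_fixed` and `find_new_words`, the `step` parameter, and the reading of "meets a preset eigenvalue" as "frequency is at least a threshold" are all assumptions. In particular, the claim wording does not settle whether the N-character word strings overlap, so `step=1` (overlapping windows) and `step=n` (back-to-back chunks) are both offered.

```python
from collections import Counter


def segment_fixed(text: str, n: int = 2, step: int = 1) -> list[str]:
    """Cut `text` into word strings of exactly `n` characters.

    step=1 yields overlapping windows; step=n yields back-to-back chunks.
    Which of the two the claim intends is an assumption of this sketch.
    """
    if n <= 1:
        raise ValueError("n must be an integer greater than 1")
    if step < 1:
        raise ValueError("step must be a positive integer")
    # Stop early enough that every slice has the full n characters.
    return [text[i:i + n] for i in range(0, len(text) - n + 1, step)]


def find_new_words(segments, known_words, min_count: int = 2) -> set[str]:
    """Return word strings absent from `known_words` that occur often enough.

    The frequency threshold stands in for the claim's "preset eigenvalue"
    (hypothetical interpretation).
    """
    counts = Counter(s for s in segments if s not in known_words)
    return {word for word, count in counts.items() if count >= min_count}


segments = segment_fixed("abcabcab", n=2)
print(find_new_words(segments, known_words={"bc"}, min_count=3))  # {'ab'}
```

Here `known_words` plays the role of the preset word segmentation list, and the returned set is the pool of candidate new words that later steps evaluate.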

6. A text information processing apparatus, the apparatus comprising:
- one or more processors;
  
  a memory; and
  
  one or more program modules, stored in the memory, executed by the one or more processors, and the one or more program modules comprising:
  
  a new-word processing module, configured to perform word segmentation on a target text according to a preset fixed word segmentation policy, to obtain a word segmentation result; and
  
  compare the word segmentation result with a preset word segmentation list, to obtain a word segmentation result, which is not in the preset word segmentation list, as a new word;
  
  an adding module, configured to add the new word to the preset word segmentation list, to obtain a test word segmentation list;
  
  a test-text classification module, configured to classify a test text according to the preset word segmentation list, to obtain a first text, and classify the test text according to the test word segmentation list, to obtain a second text;
  
  a target-new-word determining module, configured to calculate classification accuracy of the first text and classification accuracy of the second text, compare the classification accuracy of the first text with the classification accuracy of the second text, and determine a target new word from the new word according to a comparison result; and
  
  a target-text classification module, configured to add the target new word to the preset word segmentation list, to obtain a target preset word segmentation list; and
  
  classify the target text according to the target preset word segmentation list, wherein the test-text classification module comprises:
  
  a first classification unit, configured to classify the test text according to a preset classification algorithm, to obtain the first text, wherein the preset classification algorithm is associated with the preset word segmentation list; and
  
  a second classification unit, configured to classify the test text according to the preset classification algorithm, to obtain the second text, wherein the preset classification algorithm is associated with the test word segmentation list; and
  
  the target-text classification module classifies the target text according to the target preset word segmentation list comprises:
  
  calibrating the preset classification algorithm according to the target preset word segmentation list, and classifying the target text according to the calibrated preset classification algorithm.
- - 7. The apparatus according to claim 6, wherein the target-new-word determining module comprises:
    - a calculation unit, configured to separately calculate, for each new word, classification accuracy of a first text corresponding to the each new word and classification accuracy of a second text corresponding to the each new word.
  - 8. The apparatus according to claim 7, wherein the target-new-word determining module comprises:
    - a first judging unit, configured to subtract the classification accuracy of the first text corresponding to the each new word from the classification accuracy of the second text corresponding to the new word, to obtain a difference; and
      
      determine whether the difference meets a preset difference; and
      
      a first determining unit, configured to determine the new word as the target new word when a determining result of the first judging unit is yes.
  - 9. The apparatus according to claim 6, wherein the new-word processing module comprises:
    - a second judging unit, configured to determine whether a word in the word segmentation result matches a word in the preset word segmentation list;
      
      a statistics collecting unit, configured to collect statistics on an eigenvalue of a word, which does not match the word in the preset word segmentation list, in the word segmentation result when a determining result of the second judging unit is no, wherein the eigenvalue comprises a frequency at which the mismatched word appears in the target text; and
      
      a second determining unit, configured to determine the mismatched word as the new word if the eigenvalue of the mismatched word meets a preset eigenvalue.
  - 10. The apparatus according to claim 6, wherein that the new-word processing module performs word segmentation on a target text according to a preset fixed word segmentation policy specifically comprises:
    - intercepting the target text every N characters from the first character, to obtain multiple word strings, wherein a character quantity of each word string is N, and N is a positive integer greater than 1.
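
The work of the calculation unit and the first judging unit (claims 7 and 8, mirroring method claims 2 and 3) can be sketched as below. The sketch assumes that the "first text" accuracy is measured with the preset list alone, that the "second text" accuracy is measured with the preset list plus one candidate word, and that "meets a preset difference" means the gain exceeds a threshold. The names `accuracy`, `select_target_words`, `min_gain`, and the toy classifier are hypothetical and exist only to make the example runnable.

```python
from typing import Callable, Iterable, Sequence

# A classifier here is any callable that labels a text given a word list.
Classifier = Callable[[str, frozenset], str]


def accuracy(classify: Classifier, word_list: frozenset,
             test_set: Sequence[tuple[str, str]]) -> float:
    """Fraction of labeled test texts that `classify` labels correctly."""
    correct = sum(1 for text, label in test_set
                  if classify(text, word_list) == label)
    return correct / len(test_set)


def select_target_words(new_words: Iterable[str], preset_list: Iterable[str],
                        classify: Classifier,
                        test_set: Sequence[tuple[str, str]],
                        min_gain: float = 0.0) -> set[str]:
    """Keep each new word whose addition raises accuracy by more than `min_gain`."""
    base = frozenset(preset_list)
    first_acc = accuracy(classify, base, test_set)         # preset list only
    targets = set()
    for word in new_words:
        second_acc = accuracy(classify, base | {word}, test_set)  # test list
        if second_acc - first_acc > min_gain:              # assumed criterion
            targets.add(word)
    return targets


# Toy keyword classifier, purely illustrative.
KEYWORD_LABELS = {"goal": "sports", "vote": "politics"}


def toy_classify(text: str, word_list: frozenset) -> str:
    for word in sorted(word_list):
        if word in text and word in KEYWORD_LABELS:
            return KEYWORD_LABELS[word]
    return "other"


test_set = [("late goal wins match", "sports"), ("vote count delayed", "politics")]
print(select_target_words({"vote", "zzzz"}, {"goal"}, toy_classify, test_set))  # {'vote'}
```

Evaluating one candidate at a time keeps the effect of each word separable, at the cost of one extra pass over the test set per candidate.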

11. A non-transitory computer readable storage medium, having computer executable instructions stored therein, and when these executable instructions run in a terminal, the terminal executing a text information processing method, comprising:
- performing word segmentation on a target text according to a preset fixed word segmentation policy, to obtain a word segmentation result;
  
  comparing the word segmentation result with a preset word segmentation list, to obtain a word segmentation result, which is not in the preset word segmentation list, as a new word;
  
  adding the new word to the preset word segmentation list, to obtain a test word segmentation list;
  
  classifying a test text according to the preset word segmentation list, to obtain a first text, and classifying the test text according to the test word segmentation list, to obtain a second text;
  
  calculating classification accuracy of the first text and classification accuracy of the second text;
  
  comparing the classification accuracy of the first text with the classification accuracy of the second text, and determining a target new word from the new word according to a comparison result;
  
  adding the target new word to the preset word segmentation list, to obtain a target preset word segmentation list; and
  
  classifying the target text according to the target preset word segmentation list, wherein the classifying a test text according to the preset word segmentation list, to obtain a first text, and classifying the test text according to the test word segmentation list, to obtain a second text comprises:
  
  classifying the test text according to a preset classification algorithm, to obtain the first text, wherein the preset classification algorithm is associated with the preset word segmentation list; and
  
  classifying the test text according to the preset classification algorithm, to obtain the second text, wherein the preset classification algorithm is associated with the test word segmentation list; and
  
  the classifying the target text according to the target preset word segmentation list comprises:
  
  calibrating the preset classification algorithm according to the target preset word segmentation list, and classifying the target text according to the calibrated preset classification algorithm.
- - 12. The storage medium according to claim 11, wherein the calculating classification accuracy of the first text and classification accuracy of the second text comprises:
    - separately calculating, for each new word, classification accuracy of a first text corresponding to the each new word and classification accuracy of a second text corresponding to the each new word.
  - 13. The storage medium according to claim 12, wherein the comparing the classification accuracy of the first text with the classification accuracy of the second text, and determining a target new word from the new word according to a comparison result comprises:
    - subtracting the classification accuracy of the first text corresponding to the each new word from the classification accuracy of the second text corresponding to the new word, to obtain a difference;
      
      determining whether the difference meets a preset difference; and
      
      determining the new word as the target new word if the difference meets the preset difference.
  - 14. The storage medium according to claim 11, wherein the comparing the word segmentation result with a preset word segmentation list, to obtain a word segmentation result, which is not in the preset word segmentation list, as a new word comprises:
    - determining whether a word in the word segmentation result matches a word in the preset word segmentation list, and if not, collecting statistics on an eigenvalue of a word, which does not match the word in the preset word segmentation list, in the word segmentation result, wherein the eigenvalue comprises a frequency at which the mismatched word appears in the target text; and
      
      determining the mismatched word as the new word if the eigenvalue of the mismatched word meets a preset eigenvalue.
  - 15. The storage medium according to claim 11, wherein the performing word segmentation on a target text according to a preset fixed word segmentation policy comprises:
    - intercepting the target text every N characters from the first character, to obtain multiple word strings, wherein a character quantity of each word string is N, and N is a positive integer greater than 1.
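
The final limitation of the independent claims, calibrating the preset classification algorithm according to the target preset word segmentation list and then classifying the target text, leaves the classifier itself unspecified. The sketch below assumes one possible reading: the word list defines the feature vocabulary, and calibration means swapping in the enlarged list and refitting on labeled texts. The class `WordListClassifier` and its methods are hypothetical, not taken from the patent.

```python
from collections import Counter, defaultdict


class WordListClassifier:
    """Toy classifier whose features are restricted to a segmentation list."""

    def __init__(self, word_list):
        self.word_list = set(word_list)
        self.class_counts = defaultdict(Counter)

    def _features(self, text: str) -> Counter:
        # Count occurrences of listed words only; unlisted words are invisible.
        return Counter({w: text.count(w) for w in self.word_list if w in text})

    def fit(self, labeled_texts):
        self.class_counts = defaultdict(Counter)
        for text, label in labeled_texts:
            self.class_counts[label].update(self._features(text))
        return self

    def calibrate(self, target_word_list, labeled_texts):
        """Adopt the target word list, then refit (assumed meaning of 'calibrating')."""
        self.word_list = set(target_word_list)
        return self.fit(labeled_texts)

    def classify(self, text: str):
        feats = self._features(text)
        scores = {label: sum(counts[w] * c for w, c in feats.items())
                  for label, counts in self.class_counts.items()}
        if not scores or max(scores.values()) == 0:
            return None  # no listed word gives any evidence
        return max(scores, key=scores.get)


train = [("the striker scored a goal", "sports"),
         ("parliament held a vote", "politics")]
clf = WordListClassifier({"goal"}).fit(train)
before = clf.classify("a vote tomorrow")   # "vote" is not yet in the list
clf.calibrate({"goal", "vote"}, train)     # target preset word segmentation list
after = clf.classify("a vote tomorrow")
print(before, after)  # None politics
```

The point of the example is only that the same algorithm behaves differently once its associated word list includes the target new word; a production classifier would use a proper feature model.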

Current Assignee: Tencent Technology Company Limited (Tencent Holdings Limited)
Original Assignee: Tencent Technology Company Limited (Tencent Holdings Limited)
Inventors: Liu, Jie; Li, Yinghui
Primary Examiner(s): Ruiz, Angelica
Application Number: US15/174,607
Publication Number: US 20160283583A1
Time in Patent Office: 1,044 Days
Field of Search: 707600-831, 707899, 707999001-999206
CPC Class Codes:
G06F 16/3344   using natural language anal...
G06F 16/35   Clustering; Classification
G06F 40/284   Lexical analysis, e.g. toke...
G06F 40/40   Processing or translation o...
G06N 5/02   Knowledge representation; S...
