Broad-coverage normalization system for social media language

US 9,164,983 B2
Filed: 02/27/2013
Issued: 10/20/2015
Est. Priority Date: 05/27/2011
Status: Active Grant

Abstract

A method for identification of a standard text token in a dictionary that corresponds to a non-standard token identified in text includes identification of a first standard token that is associated with the non-standard token using a predetermined conditional random field (CRF) model and identification of a second standard token that is associated with the non-standard token using a spell checker. The method further includes identification of noisy channel scores using data from the CRF model and the spell checker for the first standard token and the second standard token, respectively. The method further includes presentation of one of the first and second standard tokens having the greatest identified noisy channel score to a user with a user interface device.

180 Citations

13 Claims

1. A method of selecting data for an automated training process comprising:
- identifying a plurality of occurrences of a non-standard token in a text corpus stored in a memory;
  
  identifying a first plurality of tokens in the text corpus that are located proximate to at least one of the occurrences of the non-standard token;
  
  identifying a plurality of occurrences of a candidate standard token in the text corpus;
  
  identifying a second plurality of tokens in the text corpus that are located proximate to at least one of the occurrences of the candidate standard token;
  
  identifying a contextual similarity between the first plurality of tokens and the second plurality of tokens;
  
  generating a statistical model for correction of non-standard tokens with the non-standard token in association with the standard token for generation of a statistical model only in response to the identified contextual similarity being greater than a predetermined threshold; and
  
  storing the generated statistical model in the memory for use in identification of another standard token that corresponds to another non-standard token identified in text data that are not included in the text corpus.
- - 2. The method of claim 1, the identification of the contextual similarity further comprising:
    - identifying a first weight vector including a first plurality of weights with each weight in the first plurality of weights being an identified weight for one token in the first plurality of tokens;
      
      identifying a second weight vector including a second plurality of weights with each weight in the second plurality of weights being an identified weight for one token in the second plurality of tokens;
      
      identifying a sum of products of the first plurality of weights in the first weight vector multiplied by corresponding weights in the second plurality of weights in the second weight vector;
      
      identifying a first square root of a sum of squared values of the first plurality of weights in the first weight vector;
      
      identifying a second square root of a sum of squared values of the second plurality of weights in the second weight vector; and
      
      identifying the contextual similarity with reference to the identified sum divided by a product of the first square root multiplied by the second square root.
  - 3. The method of claim 2, the identification of one weight in the first plurality of weights for one token in the first plurality of tokens further comprising:
    - identifying a first number of occurrences of the one token in the text corpus in a location that is proximate to the non-standard token;
      
      identifying a total number of occurrences of the one token in the text corpus;
      
      identifying a number of messages in the text corpus that include at least one occurrence of the one token; and
      
      identifying the one weight in the first plurality of weights for the one token in the first plurality of tokens with reference to a product of a ratio of the first number of occurrences of the one token to the total number of occurrences of the first token multiplied by a logarithm of a ratio of a predetermined total number of messages in the text corpus to the identified number of messages in the text corpus that include the at least one occurrence of the one token.
  - 4. The method of claim 1, the generation of the statistical model further comprising:
    - generating a conditional random field (CRF) model with the non-standard token in association with the standard token only in response to the identified contextual similarity being greater than a predetermined threshold.
  - 5. The method of claim 1, the generation of the statistical model further comprising:
    - generating a Hidden Markov Model (HMM) with the non-standard token in association with the standard token only in response to the identified contextual similarity being greater than a predetermined threshold.
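
The selection step of claims 1 to 3 can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the context window of two tokens, the threshold of 0.5, and the representation of the corpus as a list of tokenized messages are all assumptions made for illustration. Each context token is weighted as in claim 3 (its share of occurrences near the target, times the logarithm of the message count over the number of messages containing it), and the two weight vectors are compared with the cosine formula of claim 2.

```python
import math
from collections import Counter

def context_counts(messages, target, window=2):
    """Count tokens within `window` positions of each occurrence of `target`."""
    counts = Counter()
    for msg in messages:
        for i, tok in enumerate(msg):
            if tok == target:
                lo, hi = max(0, i - window), min(len(msg), i + window + 1)
                counts.update(t for j, t in enumerate(msg[lo:hi], lo) if j != i)
    return counts

def weight_vector(messages, target, window=2):
    """Claim 3 style weight: (near count / total count) * log(N / message count)."""
    total = Counter(t for msg in messages for t in msg)
    msg_freq = Counter(t for msg in messages for t in set(msg))
    n = len(messages)
    near = context_counts(messages, target, window)
    return {t: (c / total[t]) * math.log(n / msg_freq[t]) for t, c in near.items()}

def cosine(u, v):
    """Claim 2: sum of products divided by the product of the two vector norms."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def select_training_pair(messages, nonstandard, candidate, threshold=0.5, window=2):
    """Return (keep, similarity); the pair is kept only above the threshold."""
    sim = cosine(weight_vector(messages, nonstandard, window),
                 weight_vector(messages, candidate, window))
    return sim > threshold, sim

# Hypothetical toy corpus: "u" and "you" appear in the same contexts.
corpus = [
    ["see", "u", "tomorrow"],
    ["see", "you", "tomorrow"],
    ["love", "u", "lots"],
    ["love", "you", "lots"],
    ["the", "cat", "sat"],
]
keep_pair, similarity = select_training_pair(corpus, "u", "you")
```

In this toy corpus the pair ("u", "you") shares every context token and is kept for training, while a pair such as ("u", "cat") shares none and is rejected. A token that occurs in every message receives a weight of zero because the logarithm term vanishes.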

6. A method of identifying a standard token in a dictionary that corresponds to a non-standard token identified in data including a plurality of text tokens comprising:
- identifying a candidate token in a plurality of standard tokens stored in a memory;
  
  identifying a longest common sequence (LCS) of features in the candidate token corresponding to at least one feature in the candidate token that is present in the non-standard token;
  
  identifying a number of features in the LCS;
  
  identifying a frequency of the candidate token in a text corpus stored in a memory;
  
  identifying a similarity score between the non-standard token and the standard token with reference to a ratio of the identified number of features in the LCS to a total number of features in the non-standard token multiplied by a logarithm of the identified frequency of the candidate token; and
  
  presenting with a user interface device the standard candidate token to a user in replacement of the non-standard token or in association with the non-standard token in response to the identified similarity score exceeding a predetermined threshold.
- - 7. The method of claim 6, the identification of the frequency of the candidate token in the text corpus further comprising:
    - identifying the frequency with reference to a total number of occurrences of the candidate token in the text corpus.
  - 8. The method of claim 6, the identification of the longest common sequence of features further comprising:
    - identifying a common series of characters that are included in the non-standard token and in the candidate token.
  - 9. The method of claim 6, the identification of the candidate token further comprising:
    - identifying a first character in one standard token in the plurality of standard tokens stored in the memory;
      
      identifying a first character in the non-standard token; and
      
      selecting the one standard token as the candidate token only in response to the first character in the one standard token corresponding to the first character in the non-standard token.
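
A short sketch of the scoring in claims 6 to 9 follows. It assumes that the features are characters (as in claim 8), that "longest common sequence" can be read as the longest common subsequence, and that the score is the LCS ratio multiplied by the natural logarithm of the corpus frequency. The function names, the frequency table, and the threshold are hypothetical.

```python
import math

def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            if ch == bj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def similarity_score(nonstandard, candidate, frequency):
    """Claim 6: (LCS length / length of non-standard token) * log(frequency)."""
    return lcs_length(nonstandard, candidate) / len(nonstandard) * math.log(frequency)

def best_candidate(nonstandard, frequencies, threshold):
    """Pick the highest-scoring dictionary word that passes the threshold.

    Only words sharing the first character with the non-standard token are
    considered, following claim 9. Returns None if nothing passes.
    """
    scored = [
        (similarity_score(nonstandard, word, freq), word)
        for word, freq in frequencies.items()
        if word and freq > 0 and word[0] == nonstandard[0]
    ]
    if not scored:
        return None
    score, word = max(scored)
    return word if score > threshold else None

# Hypothetical corpus frequencies for three dictionary words.
freqs = {"tomorrow": 100, "tram": 50, "mirror": 1000}
suggestion = best_candidate("tmrw", freqs, threshold=3.0)
```

Here "mirror" is never scored because its first character differs, and "tomorrow" wins because every character of "tmrw" appears in it in order. The logarithm damps the influence of very frequent words so that frequency alone does not decide the ranking.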

10. A method of identifying a plurality of standard tokens that correspond to a non-standard token comprising:
- identifying a first standard token corresponding to the non-standard token, the standard token being included in a dictionary having a plurality of standard tokens stored in a memory, the identification of the first standard token being made through transformation of a first plurality of features in the first standard token into a corresponding second plurality of features in the non-standard token using a conditional random field (CRF) model;
  
  identifying a second standard token in the dictionary of standard tokens corresponding to the non-standard token, the identification of the second standard token being made with reference to a comparison of the non-standard token with the standard tokens stored in the dictionary;
  
  identifying a first noisy channel score for the first standard token with reference to a first conditional probability value and a probability of the first standard token occurring in a text corpus stored in the memory, the first conditional probability value corresponding to the first standard token given the non-standard token from the CRF model;
  
  identifying a second noisy channel score for the second standard token with reference to a second conditional probability value and a probability of the second standard token occurring in the text corpus, the second conditional probability value corresponding to the second standard token given the non-standard token from the comparison;
  
  presenting with a user interface device the first standard token to a user in replacement of the non-standard token or in association with the non-standard token in response to the first noisy channel score being greater than the second noisy channel score; and
  
  presenting with the user interface device the second standard token to the user in replacement of the non-standard token or in association with the non-standard token in response to the second noisy channel score being greater than the first noisy channel score.
- - 11. The method of claim 10 further comprising:
    - presenting with the user interface device the first standard token and the second standard token to the user in association with the non-standard token, the first standard token and the second standard token being presented in a descending order with the first standard token being displayed first in response to the first noisy channel score being higher than the second noisy channel score and the second standard token being displayed first in response to the second noisy channel score being higher than the first noisy channel score.
  - 12. The method of claim 11 further comprising:
    - identifying a third standard token corresponding to the non-standard token, the identification of the third standard token further comprising;
      
      identifying a first similarity score with reference to a ratio of a longest common sequence (LCS) of features between the third standard token and the non-standard token to a total number of features in the non-standard token multiplied by a logarithm of a number of occurrences of the third standard token in the text corpus;
      
      identifying a fourth standard token corresponding to the non-standard token, the identification of the fourth standard token further comprising;
      
      identifying a second similarity score with reference to a ratio of a longest common sequence (LCS) of features between the fourth standard token and the non-standard token to a total number of features in the non-standard token multiplied by a logarithm of a number of occurrences of the fourth standard token in the text corpus;
      
      presenting with the user interface device the third standard token to the user in association with or in replacement of the non-standard token in response to the first similarity score exceeding the second similarity score; and
      
      presenting with the user interface device the fourth standard token to the user in association with or in replacement of the non-standard token in response to the second similarity score exceeding the first similarity score.
  - 13. The method of claim 11 further comprising:
    - identifying a plurality of tokens in a message including the non-standard token;
      
      applying a Viterbi decoding process to the plurality of tokens in the message using a predetermined language model corresponding to a language of the message with the first standard token used in place of the non-standard token to identify a first posterior probability of the first standard token in the message;
      
      applying the Viterbi decoding process to the plurality of tokens in the message using the predetermined language model corresponding to the language of the message with the second standard token used in place of the non-standard token to identify a second posterior probability of the first standard token in the message;
      
      presenting to the user with the user interface device the message with the first standard token used as a replacement for the non-standard token in response to the first posterior probability being greater than the second posterior probability; and
      
      presenting to the user with the user interface device the message with the second standard token used as a replacement for the non-standard token in response to the second posterior probability being greater than the first posterior probability.
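
The comparison in claims 10 and 11 can be sketched as follows. The claims only say that each noisy channel score is identified "with reference to" a conditional probability and a corpus probability; taking their product is an assumption of this sketch, as are the `Candidate` type, the `source` labels, and the probability values. The CRF model and the dictionary comparison are not implemented here; their outputs are supplied as plain numbers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    token: str        # standard token proposed for the non-standard token
    cond_prob: float  # conditional probability reported by the proposing component
    source: str       # hypothetical label, e.g. "crf" or "spell"

def noisy_channel_score(candidate, unigram_prob):
    """Assumed form: conditional probability times corpus probability."""
    return candidate.cond_prob * unigram_prob.get(candidate.token, 0.0)

def rank_candidates(candidates, unigram_prob):
    """Order candidates by descending noisy channel score (claim 11)."""
    return sorted(
        candidates,
        key=lambda c: noisy_channel_score(c, unigram_prob),
        reverse=True,
    )

# Hypothetical values for the non-standard token "tmrw".
unigram = {"tomorrow": 0.002, "tram": 0.0001}
proposals = [
    Candidate("tram", 0.6, "spell"),
    Candidate("tomorrow", 0.3, "crf"),
]
ranked = rank_candidates(proposals, unigram)
```

With these made-up numbers the CRF proposal "tomorrow" ranks first even though its conditional probability is lower, because it is far more probable in the corpus. A token missing from the probability table scores zero and falls to the end of the list.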

Current Assignee: Robert Bosch GmbH
Original Assignee: Robert Bosch GmbH
Inventors: Liu, Fei; Weng, Fuliang
Primary Examiner(s): BRYAR, JEREMIAH A
Application Number: US13/779,083
Publication Number: US 20130173258A1
Time in Patent Office: 965 Days
Field of Search: None
US Class Current: 1/1
CPC Class Codes:
G06F 40/126   Character encoding
G06F 40/232   Orthographic correction, e....
G06F 40/274   Converting codes to words; ...
G06F 40/40   Processing or translation o...
