Broad-coverage normalization system for social media language
Abstract
A method for identification of a standard text token in a dictionary that corresponds to a non-standard token identified in text includes identification of a first standard token that is associated with the non-standard token using a predetermined conditional random field (CRF) model and identification of a second standard token that is associated with the non-standard token using a spell checker. The method further includes identification of noisy channel scores using data from the CRF model and the spell checker for the first standard token and the second standard token, respectively. The method further includes presentation of one of the first and second standard tokens having the greatest identified noisy channel score to a user with a user interface device.
13 Claims
1. A method of selecting data for an automated training process comprising:
identifying a plurality of occurrences of a non-standard token in a text corpus stored in a memory;
identifying a first plurality of tokens in the text corpus that are located proximate to at least one of the occurrences of the non-standard token;
identifying a plurality of occurrences of a candidate standard token in the text corpus;
identifying a second plurality of tokens in the text corpus that are located proximate to at least one of the occurrences of the candidate standard token;
identifying a contextual similarity between the first plurality of tokens and the second plurality of tokens;
generating a statistical model for correction of non-standard tokens with the non-standard token in association with the standard token for generation of a statistical model only in response to the identified contextual similarity being greater than a predetermined threshold; and
storing the generated statistical model in the memory for use in identification of another standard token that corresponds to another non-standard token identified in text data that are not included in the text corpus.
Dependent claims: 2, 3, 4, 5
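As an illustration only, the data-selection steps of claim 1 can be sketched as follows. The claim does not say how contextual similarity is measured or what "proximate" means, so this sketch assumes cosine similarity over bag-of-words context counts and a fixed token window. The function names, the window size, and the threshold value are hypothetical.

```python
from collections import Counter
from math import sqrt


def context_counts(corpus, target, window=2):
    """Count tokens within `window` positions of each occurrence of `target`.

    `corpus` is a list of tokenized sentences (lists of strings).
    The window-based notion of "proximate" is an assumption of this sketch.
    """
    counts = Counter()
    for sentence in corpus:
        for i, tok in enumerate(sentence):
            if tok == target:
                lo = max(0, i - window)
                hi = min(len(sentence), i + window + 1)
                counts.update(sentence[lo:i] + sentence[i + 1:hi])
    return counts


def cosine(a, b):
    """Cosine similarity between two Counters (an assumed similarity measure)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_training_pairs(corpus, candidate_pairs, threshold=0.5, window=2):
    """Keep only (non-standard, standard) pairs whose contexts are similar enough."""
    selected = []
    for nonstd, std in candidate_pairs:
        sim = cosine(context_counts(corpus, nonstd, window),
                     context_counts(corpus, std, window))
        if sim > threshold:
            selected.append((nonstd, std))
    return selected
```

Pairs that pass the threshold would then serve as training data for the statistical model; the model training itself is outside the scope of this sketch.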
6. A method of identifying a standard token in a dictionary that corresponds to a non-standard token identified in data including a plurality of text tokens comprising:
identifying a candidate token in a plurality of standard tokens stored in a memory;
identifying a longest common sequence (LCS) of features in the candidate token corresponding to at least one feature in the candidate token that is present in the non-standard token;
identifying a number of features in the LCS;
identifying a frequency of the candidate token in a text corpus stored in a memory;
identifying a similarity score between the non-standard token and the standard token with reference to a ratio of the identified number of features in the LCS to a total number of features in the non-standard token multiplied by a logarithm of the identified frequency of the candidate token; and
presenting with a user interface device the standard candidate token to a user in replacement of the non-standard token or in association with the non-standard token in response to the identified similarity score exceeding a predetermined threshold.
Dependent claims: 7, 8, 9
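The similarity score recited in claim 6 can be sketched as below. This is a minimal illustration, assuming that "features" are individual characters, that the LCS is computed as a longest common subsequence, and that the logarithm is the natural logarithm; the claim fixes none of these. The function names and the threshold are hypothetical.

```python
from math import log


def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity_score(nonstd, candidate, frequency):
    """(LCS length / length of the non-standard token) * log(candidate frequency)."""
    if not nonstd or frequency <= 0:
        return 0.0
    return lcs_length(candidate, nonstd) / len(nonstd) * log(frequency)


def best_candidate(nonstd, freqs, threshold):
    """Return the highest-scoring candidate if its score exceeds `threshold`, else None.

    `freqs` maps each standard token to its corpus frequency (assumed input shape).
    """
    scored = [(similarity_score(nonstd, c, f), c) for c, f in freqs.items()]
    score, cand = max(scored, default=(0.0, None))
    return cand if score > threshold else None
```

The frequency term favors common dictionary words when several candidates share the same characters with the non-standard token.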
10. A method of identifying a plurality of standard tokens that correspond to a non-standard token comprising:
identifying a first standard token corresponding to the non-standard token, the standard token being included in a dictionary having a plurality of standard tokens stored in a memory, the identification of the first standard token being made through transformation of a first plurality of features in the first standard token into a corresponding second plurality of features in the non-standard token using a conditional random field (CRF) model;
identifying a second standard token in the dictionary of standard tokens corresponding to the non-standard token, the identification of the second standard token being made with reference to a comparison of the non-standard token with the standard tokens stored in the dictionary;
identifying a first noisy channel score for the first standard token with reference to a first conditional probability value and a probability of the first standard token occurring in a text corpus stored in the memory, the first conditional probability value corresponding to the first standard token given the non-standard token from the CRF model;
identifying a second noisy channel score for the second standard token with reference to a second conditional probability value and a probability of the second standard token occurring in the text corpus, the second conditional probability value corresponding to the second standard token given the non-standard token from the comparison;
presenting with a user interface device the first standard token to a user in replacement of the non-standard token or in association with the non-standard token in response to the first noisy channel score being greater than the second noisy channel score; and
presenting with the user interface device the second standard token to the user in replacement of the non-standard token or in association with the non-standard token in response to the second noisy channel score being greater than the first noisy channel score.
Dependent claims: 11, 12, 13
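The comparison of noisy channel scores in claim 10 might look like the following sketch. The claim says only that each score is identified "with reference to" a conditional probability and a corpus probability; combining them as a product is an assumption here, as are the function names, the input shapes, and the tie-breaking rule. The CRF model and the dictionary comparison that produce the conditional probabilities are not implemented.

```python
def noisy_channel_score(cond_prob, prior):
    """Combine a conditional probability with a corpus prior (product is an assumption)."""
    return cond_prob * prior


def choose_standard_token(crf_candidate, dict_candidate, unigram_prob):
    """Pick the candidate with the greater noisy channel score.

    Each candidate is a (token, conditional_probability) pair, assumed to come
    from a CRF model and a dictionary comparison respectively.
    `unigram_prob` maps tokens to their probability of occurring in the corpus.
    Ties go to the CRF candidate; the claim does not address ties.
    """
    crf_token, crf_cond = crf_candidate
    dict_token, dict_cond = dict_candidate
    crf_score = noisy_channel_score(crf_cond, unigram_prob.get(crf_token, 0.0))
    dict_score = noisy_channel_score(dict_cond, unigram_prob.get(dict_token, 0.0))
    return crf_token if crf_score >= dict_score else dict_token
```

Weighting by the corpus prior lets a frequent word win even when the other source reports a somewhat higher conditional probability for a rare word.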
Specification