Broad-coverage normalization system for social media language

  • US 9,164,983 B2
  • Filed: 02/27/2013
  • Issued: 10/20/2015
  • Est. Priority Date: 05/27/2011
  • Status: Active Grant
First Claim

1. A method of selecting data for an automated training process comprising:

    identifying a plurality of occurrences of a non-standard token in a text corpus stored in a memory;

    identifying a first plurality of tokens in the text corpus that are located proximate to at least one of the occurrences of the non-standard token;

    identifying a plurality of occurrences of a candidate standard token in the text corpus;

    identifying a second plurality of tokens in the text corpus that are located proximate to at least one of the occurrences of the candidate standard token;

    identifying a contextual similarity between the first plurality of tokens and the second plurality of tokens;

    generating a statistical model for correction of non-standard tokens with the non-standard token in association with the standard token for generation of a statistical model only in response to the identified contextual similarity being greater than a predetermined threshold; and

    storing the generated statistical model in the memory for use in identification of another standard token that corresponds to another non-standard token identified in text data that are not included in the text corpus.
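The selection steps recited in the claim can be illustrated with a minimal sketch. This is not the patented implementation: the claim does not say how "proximate" tokens or "contextual similarity" are computed, so the fixed context window, the bag-of-words context counts, the cosine measure, the threshold value, and every function name below are assumptions chosen for illustration. The sketch stops at selecting (non-standard, standard) pairs; generating and storing the statistical model is left out.

```python
import math
from collections import Counter


def context_counts(corpus, target, window=2):
    """Count tokens within `window` positions of each occurrence of `target`.

    `corpus` is a list of tokenized sentences (lists of strings).
    The window size is an illustrative assumption.
    """
    counts = Counter()
    for sentence in corpus:
        for i, token in enumerate(sentence):
            if token != target:
                continue
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[sentence[j]] += 1
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_training_pairs(corpus, candidate_pairs, threshold=0.5, window=2):
    """Keep only pairs whose context similarity exceeds `threshold`."""
    selected = []
    for nonstandard, standard in candidate_pairs:
        similarity = cosine_similarity(
            context_counts(corpus, nonstandard, window),
            context_counts(corpus, standard, window),
        )
        if similarity > threshold:
            selected.append((nonstandard, standard, similarity))
    return selected


# Toy corpus, invented for illustration only.
corpus = [
    ["see", "you", "tmrw", "at", "noon"],
    ["see", "you", "tomorrow", "at", "noon"],
    ["call", "me", "tmrw", "morning"],
    ["call", "me", "tomorrow", "morning"],
    ["the", "table", "is", "made", "of", "wood"],
]
candidate_pairs = [("tmrw", "tomorrow"), ("tmrw", "table")]

selected = select_training_pairs(corpus, candidate_pairs)
for nonstandard, standard, similarity in selected:
    print(f"{nonstandard} -> {standard} (similarity {similarity:.2f})")
```

In this toy corpus, "tmrw" and "tomorrow" appear in identical contexts and the pair is kept, while "tmrw" and "table" share no context tokens and the pair is discarded. In a pipeline following the claim, only the retained pairs would be passed on as training data for the statistical model.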
