Classification of Offensive Words

US 20150309987A1
Filed: 04/29/2014
Published: 10/29/2015
Est. Priority Date: 04/29/2014
Status: Abandoned Application

2 Assignments

0 Petitions

Abstract

A computer-implemented method can include identifying a first set of text samples that include a particular potentially offensive term. Labels can be obtained for the first set of text samples that indicate whether the particular potentially offensive term is used in an offensive manner. A classifier can be trained based at least on the first set of text samples and the labels, the classifier being configured to use one or more signals associated with a text sample to generate a label that indicates whether a potentially offensive term in the text sample is used in an offensive manner in the text sample. The method can further include providing, to the classifier, a first text sample that includes the particular potentially offensive term, and in response, obtaining, from the classifier, a label that indicates whether the particular potentially offensive term is used in an offensive manner in the first text sample.
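The steps in the abstract (select samples containing a term, obtain labels, train, then classify a new sample) can be sketched roughly as follows. This is a minimal illustration, not the applicants' implementation: the multinomial naive Bayes model, the term `bloody`, the sample sentences, and the function names `tokenize`, `train`, and `classify` are all assumptions made for the example.

```python
import math
from collections import Counter

TERM = "bloody"  # hypothetical potentially offensive term (assumption)

def tokenize(text):
    return text.lower().split()

def train(samples, labels):
    """Fit a tiny multinomial naive Bayes model; label True means offensive use."""
    words = {True: Counter(), False: Counter()}
    for text, label in zip(samples, labels):
        words[label].update(tokenize(text))
    vocab = set(words[True]) | set(words[False])
    return {"words": words, "classes": Counter(labels), "vocab": vocab}

def classify(model, text):
    """Return True if the offensive class has the higher smoothed log score."""
    total = sum(model["classes"].values())
    scores = {}
    for label in (True, False):
        counts = model["words"][label]
        denom = sum(counts.values()) + len(model["vocab"]) + 1
        score = math.log((model["classes"][label] + 1) / (total + 2))
        for tok in tokenize(text):
            score += math.log((counts[tok] + 1) / denom)  # add-one smoothing
        scores[label] = score
    return scores[True] > scores[False]

# Obtain text samples and keep only those that contain the term.
text_samples = [
    "that bloody idiot cut me off",
    "shut your bloody mouth",
    "the bandage was bloody after the game",
    "a bloody nose needs ice",
    "what time is it",
]
first_set = [s for s in text_samples if TERM in tokenize(s)]

# Labels for the first set (made up here; True = offensive use).
labels = [True, True, False, False]

model = train(first_set, labels)
label = classify(model, "he came home with a bloody nose")
print(label)
```

Any classifier that accepts per-sample signals could stand in for the naive Bayes model here; the abstract does not name a specific model.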

61 Citations

21 Claims

1. A computer-implemented method comprising:
- obtaining a plurality of text samples;
  
  identifying, from among the plurality of text samples, a first set of text samples that each includes a particular potentially offensive term;
  
  obtaining labels for the first set of text samples that indicate whether the particular potentially offensive term is used in an offensive manner in respective ones of the text samples in the first set of text samples;
  
  training, based at least on the first set of text samples and the labels for the first set of text samples, a classifier that is configured to use one or more signals associated with a text sample to generate a label that indicates whether a potentially offensive term in the text sample is used in an offensive manner in the text sample; and
  
  providing, to the classifier, a first text sample that includes the particular potentially offensive term, and in response, obtaining, from the classifier, a label that indicates whether the particular potentially offensive term is used in an offensive manner in the first text sample.
  - 2. The computer-implemented method of claim 1, further comprising:
    - identifying, from among the plurality of text samples, a second set of text samples that each includes the particular potentially offensive term;
      
      providing the second set of text samples to the classifier, and in response, obtaining labels for the second set of text samples that were generated by the classifier and that indicate whether the particular potentially offensive term is used in an offensive manner in respective ones of the text samples in the second set of text samples, wherein training the classifier is further based on the second set of text samples and the labels for the second set of text samples that were generated by the classifier.
  - 3. The computer-implemented method of claim 1, further comprising iteratively training the classifier by performing multiple training iterations, each training iteration comprising providing a particular set of text samples to the classifier, obtaining labels for the particular set of text samples that were generated by the classifier in response, and re-training the classifier based at least on the particular set of text samples and the labels for the particular set of text samples that were generated by the classifier.
  - 4. The computer-implemented method of claim 3, wherein the particular set of text samples in a first of the training iterations includes more text samples than the particular set of text samples in a training iteration that preceded the first of the training iterations.
  - 5. The computer-implemented method of claim 3, further comprising, for each of at least some of the multiple training iterations, determining a measure of accuracy of the classifier by comparing the labels generated by the classifier for a subset of the particular set of text samples with a control set of labels for the subset of the particular set of text samples that are known to be accurate.
  - 6. The computer-implemented method of claim 1, wherein training the classifier comprises using information from the first set of text samples in an expectation-maximization algorithm.
  - 7. The computer-implemented method of claim 1, further comprising obtaining, in response to providing the first text sample to the classifier, a label confidence score that indicates a confidence that the label correctly indicates whether the particular potentially offensive term is used in an offensive manner in the first text sample.
  - 8. The computer-implemented method of claim 1, wherein the one or more signals associated with the text sample used by the classifier to generate the label comprise information determined based on content of the text sample.
  - 9. The computer-implemented method of claim 8, wherein the information determined based on content of the text sample comprises n-gram data for an n-gram in the text sample that includes the particular potentially offensive term.
  - 10. The computer-implemented method of claim 8, wherein the information determined based on content of the text sample comprises bag-of-words data that indicates a distribution of terms in the text sample.
  - 11. The computer-implemented method of claim 1, wherein the one or more signals associated with the text sample used by the classifier to generate the label comprise contextual data associated with the text sample that is not determined based on content of the text sample.
  - 12. The computer-implemented method of claim 11, wherein the text sample is a transcription of an utterance, and wherein the contextual data associated with the text sample comprises an indication of user satisfaction with the transcription of the utterance.
  - 13. The computer-implemented method of claim 11, wherein the text sample is a transcription of an utterance, and wherein the contextual data associated with the text sample comprises a transcription confidence score that indicates a likelihood that the text sample is an accurate transcription of the utterance.
  - 14. The computer-implemented method of claim 1, wherein the one or more signals associated with the text sample used by the classifier to generate the label comprise both information determined based on content of the text sample and contextual data associated with the text sample that is not determined based on the content of the text sample.
  - 15. The computer-implemented method of claim 1, wherein the plurality of text samples includes text samples obtained from at least one of records of transcribed speech and records of search queries.
  - 16. The computer-implemented method of claim 1, wherein the labels for at least some of the first set of text samples that indicate whether the particular potentially offensive term is used in an offensive manner in respective ones of the text samples in the first set of text samples were manually determined by one or more users.
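The dependent claims above combine several ideas: re-training on labels the classifier generated itself (claims 2 to 4), checking accuracy against a control set (claim 5), returning a confidence score (claim 7), and using n-gram and bag-of-words signals (claims 9 and 10). The sketch below shows one way these could fit together. It is an assumption-laden illustration: the naive Bayes scoring, the hard-label self-training loop (a simpler stand-in, not the expectation-maximization algorithm of claim 6), the term `bloody`, the sample data, and all function names are hypothetical. Contextual signals such as a transcription confidence score (claims 11 to 13) are omitted.

```python
import math
from collections import Counter

TERM = "bloody"  # hypothetical term, not taken from the application

def signals(text):
    """Content signals: bag-of-words tokens plus bigrams that contain TERM."""
    toks = text.lower().split()
    feats = ["w=" + t for t in toks]
    for a, b in zip(toks, toks[1:]):
        if TERM in (a, b):
            feats.append("bg=" + a + "_" + b)
    return feats

def train(samples, labels):
    counts = {True: Counter(), False: Counter()}
    for text, label in zip(samples, labels):
        counts[label].update(signals(text))
    vocab = set(counts[True]) | set(counts[False])
    return {"counts": counts, "prior": Counter(labels), "vocab": vocab}

def predict(model, text):
    """Return (label, confidence); confidence is the posterior of the label."""
    logp = {}
    for label in (True, False):
        c = model["counts"][label]
        denom = sum(c.values()) + len(model["vocab"]) + 1
        lp = math.log(model["prior"][label] + 1)
        for f in signals(text):
            lp += math.log((c[f] + 1) / denom)
        logp[label] = lp
    p_true = 1 / (1 + math.exp(logp[False] - logp[True]))
    label = p_true >= 0.5
    return label, (p_true if label else 1 - p_true)

def accuracy(model, control):
    """Fraction of control samples whose generated label matches the known one."""
    return sum(predict(model, t)[0] == y for t, y in control) / len(control)

def self_train(seed, unlabeled_batches, control):
    samples = [t for t, _ in seed]
    labels = [y for _, y in seed]
    model = train(samples, labels)
    history = [accuracy(model, control)]
    for batch in unlabeled_batches:
        for text in batch:
            label, _confidence = predict(model, text)  # classifier-generated label
            samples.append(text)
            labels.append(label)
        model = train(samples, labels)  # re-train on seed plus generated labels
        history.append(accuracy(model, control))
    return model, history

seed = [
    ("shut your bloody mouth", True),
    ("that bloody idiot cut me off", True),
    ("a bloody nose needs ice", False),
    ("the bandage was bloody", False),
]
# Later batches are larger than earlier ones, as in claim 4.
batches = [
    ["you bloody idiot", "his bloody nose"],
    ["bloody idiot driver", "my nose was bloody", "shut your bloody trap"],
]
control = [("what a bloody idiot", True), ("ice for a bloody nose", False)]

model, history = self_train(seed, batches, control)
print(history, predict(model, "you bloody idiot"))
```

A practical variant might keep only generated labels whose confidence exceeds a threshold before re-training; the claims do not require that, so it is left out here.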

17. One or more computer-readable devices having instructions stored thereon that, when executed by one or more processors, cause performance of operations comprising:
- obtaining a plurality of text samples;
  
  identifying, from among the plurality of text samples, a first set of text samples that each includes a particular potentially offensive term;
  
  obtaining labels for the first set of text samples that indicate whether the particular potentially offensive term is used in an offensive manner in respective ones of the text samples in the first set of text samples;
  
  training, based at least on the first set of text samples and the labels for the first set of text samples, a classifier that is configured to use one or more signals associated with a text sample to generate a label that indicates whether a potentially offensive term in the text sample is used in an offensive manner in the text sample; and
  
  providing, to the classifier, a first text sample that includes the particular potentially offensive term, and in response, obtaining, from the classifier, a label that indicates whether the particular potentially offensive term is used in an offensive manner in the first text sample.
  - 18. The one or more computer-readable devices of claim 17, wherein the operations further comprise:
    - identifying, from among the plurality of text samples, a second set of text samples that each includes the particular potentially offensive term;
      
      providing the second set of text samples to the classifier, and in response, obtaining labels for the second set of text samples that were generated by the classifier and that indicate whether the particular potentially offensive term is used in an offensive manner in respective ones of the text samples in the second set of text samples, wherein training the classifier is further based on the second set of text samples and the labels for the second set of text samples that were generated by the classifier.
  - 19. The one or more computer-readable devices of claim 17, wherein the operations further comprise iteratively training the classifier by performing multiple training iterations, each training iteration comprising providing a particular set of text samples to the classifier, obtaining labels for the particular set of text samples that were generated by the classifier in response, and re-training the classifier based at least on the particular set of text samples and the labels for the particular set of text samples that were generated by the classifier, wherein different particular sets of text samples are used among particular ones of the multiple training iterations.

20. A system comprising:
- one or more computers configured to provide:
  
  a repository of potentially offensive terms;
  
  a repository of labeled text samples that includes a first set of labeled text samples for which one or more potentially offensive terms from the repository of potentially offensive terms have been labeled in the first set of text samples so as to indicate likelihoods that the potentially offensive terms are used in offensive manners in particular ones of the text samples in the first set of labeled text samples;
  
  a repository of non-labeled text samples that includes a first set of non-labeled text samples that include one or more potentially offensive terms from the repository of potentially offensive terms;
  
  a classifier that labels the one or more potentially offensive terms in the first set of non-labeled text samples to generate a second set of labeled text samples that are labeled so as to indicate a likelihood that the one or more potentially offensive terms in the text samples are used in offensive manners; and
  
  a training engine that trains the classifier based at least on the first set of labeled text samples and the second set of labeled text samples that were labeled by the classifier.
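The components recited in claim 20 (a term repository, labeled and non-labeled sample repositories, a classifier that emits likelihoods, and a training engine) could be wired together along the following lines. This is only a structural sketch under stated assumptions: the class names `LabeledSample` and `OffensiveTermSystem`, the toy `fit_term_average` trainer, and the sample data are hypothetical, and the "classifier" is deliberately trivial so the data flow stays visible.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

@dataclass
class LabeledSample:
    text: str
    likelihoods: Dict[str, float]  # term -> likelihood of offensive use

@dataclass
class OffensiveTermSystem:
    terms: Set[str]                                   # repository of terms
    labeled: List[LabeledSample] = field(default_factory=list)
    unlabeled: List[str] = field(default_factory=list)
    # Hypothetical classifier: maps (term, text) to a likelihood in [0, 1].
    score: Callable[[str, str], float] = lambda term, text: 0.5

    def label_unlabeled(self) -> List[LabeledSample]:
        """Use the current classifier to build the second set of labeled samples."""
        out = []
        for text in self.unlabeled:
            toks = text.lower().split()
            found = {t: self.score(t, text) for t in self.terms if t in toks}
            if found:
                out.append(LabeledSample(text, found))
        return out

    def train(self, fit) -> List[LabeledSample]:
        """Training engine: re-fit on the first set plus classifier-labeled set."""
        second_set = self.label_unlabeled()
        self.score = fit(self.labeled + second_set)
        return second_set

def fit_term_average(samples):
    """Toy trainer (assumption): score each term by its mean labeled likelihood."""
    totals, counts = {}, {}
    for s in samples:
        for term, p in s.likelihoods.items():
            totals[term] = totals.get(term, 0.0) + p
            counts[term] = counts.get(term, 0) + 1
    means = {t: totals[t] / counts[t] for t in totals}
    return lambda term, text: means.get(term, 0.5)

system = OffensiveTermSystem(
    terms={"bloody"},
    labeled=[
        LabeledSample("shut your bloody mouth", {"bloody": 0.9}),
        LabeledSample("a bloody nose needs ice", {"bloody": 0.1}),
    ],
    unlabeled=["bloody traffic again", "what time is it"],
)
second_set = system.train(fit_term_average)
print(second_set)
```

Note that the labels here are likelihoods, as in the claim, not the binary labels used in claims 1 to 19.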

21. A computer-implemented method comprising:
- obtaining a plurality of text samples;
  
  identifying, from among the plurality of text samples, a first set of text samples that each includes a particular potentially offensive term;
  
  obtaining labels for the first set of text samples that indicate whether a particular user considers the particular potentially offensive term to be used in an offensive manner in respective ones of the text samples in the first set of text samples;
  
  training, based at least on the first set of text samples and the labels for the first set of text samples, a user-specific classifier for the particular user, wherein the user-specific classifier is configured to use one or more signals associated with a text sample to generate a label that indicates whether a potentially offensive term in the text sample is likely to be considered by the particular user to be used in an offensive manner in the text sample; and
  
  providing, to the user-specific classifier, a first text sample that includes the particular potentially offensive term, and in response, obtaining, from the user-specific classifier, a label that indicates whether the particular potentially offensive term is likely to be considered by the particular user to be used in an offensive manner in the first text sample.
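Claim 21 differs from claim 1 mainly in that the labels and the resulting classifier belong to one particular user. A minimal sketch of that idea keeps one independent model per user, so the same text can receive different labels for different users. The class `UserSpecificClassifier`, the context-word counting rule, the user names, and the samples are illustrative assumptions, not details from the application.

```python
from collections import Counter, defaultdict

TERM = "bloody"  # hypothetical potentially offensive term (assumption)

class UserSpecificClassifier:
    """Counts the words one user has seen around TERM in offensive vs. benign uses."""

    def __init__(self):
        self.offensive = Counter()
        self.benign = Counter()

    @staticmethod
    def _context(text):
        return [t for t in text.lower().split() if t != TERM]

    def train(self, samples, labels):
        for text, is_offensive in zip(samples, labels):
            target = self.offensive if is_offensive else self.benign
            target.update(self._context(text))

    def label(self, text):
        # Positive score: context words lean toward this user's offensive labels.
        # Ties and unseen contexts default to "not offensive".
        score = sum(self.offensive[t] - self.benign[t] for t in self._context(text))
        return score > 0

# One classifier per user, created on first access.
classifiers = defaultdict(UserSpecificClassifier)

samples = ["bloody hell that hurt", "a bloody nose needs ice"]
classifiers["alice"].train(samples, [True, False])   # alice finds the first offensive
classifiers["bob"].train(samples, [False, False])    # bob finds neither offensive

print(classifiers["alice"].label("bloody hell"))
print(classifiers["bob"].label("bloody hell"))
```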

Current Assignee: Google LLC (Alphabet Inc.)
Original Assignee: Google Inc. (Alphabet Inc.)
Inventors: Epstein, Mark Edward; Mengibar, Pedro J. Moreno
Application Number: US14/264,617
Publication Number: US 20150309987A1
US Class Current: 1/1
CPC Class Codes:
- G06F 40/205: Parsing
- G06F 40/253: Grammatical analysis; Style...
- G06F 40/279: Recognition of textual enti...
- G06F 40/30: Semantic analysis
