Multi-label content recategorization

US 10,691,739 B2
Filed: 12/22/2015
Issued: 06/23/2020
Est. Priority Date: 12/22/2015
Status: Active Grant


Abstract

In an example, there is disclosed a computing apparatus, including one or more logic elements, including at least one hardware logic element, comprising a classification engine to: receive a clean multi-labeled dataset comprising a plurality of documents, each assigned to one or more of a plurality of categories; receive an unclean multi-labeled dataset; and produce a recategorized and cleansed dataset from the unclean multi-labeled dataset, comprising predicting a number of labels l̂ for a document j, and comparing l̂ to an existing number of labels l. There is also disclosed a method of providing a classification engine.
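
The count comparison in the abstract can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `find_count_mismatches`, the keyword-based `toy_predictor` (a stand-in for whatever trained model predicts l̂), and the sample documents. The abstract does not say what happens on a mismatch, so the sketch only reports the documents whose predicted and existing label counts differ.

```python
def find_count_mismatches(documents, predict_label_count):
    """Return {doc_id: (existing_count, predicted_count)} where the counts differ.

    documents maps a document id j to (text, existing_labels).
    predict_label_count is any callable returning the predicted count for a text.
    """
    mismatches = {}
    for doc_id, (text, labels) in documents.items():
        predicted = predict_label_count(text)  # predicted number of labels
        existing = len(labels)                 # existing number of labels
        if predicted != existing:
            mismatches[doc_id] = (existing, predicted)
    return mismatches


def toy_predictor(text):
    # Hypothetical stand-in for a trained model: one label per keyword found.
    keywords = ("sports", "finance", "travel")
    return max(1, sum(word in text.lower() for word in keywords))


docs = {
    "j1": ("Sports and finance roundup", {"sports"}),
    "j2": ("Travel tips", {"travel"}),
}
mismatches = find_count_mismatches(docs, toy_predictor)
```

Here only `j1` is reported, because the toy predictor expects two labels while the document carries one.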


25 Claims

1. A computing apparatus, comprising:
- a hardware platform comprising a processor and a memory; and
  
  one or more tangible, non-transitory computer-readable mediums having instructions to provide a two-phase classification engine to:
  
  in a first phase, receive a clean multi-labeled dataset comprising a plurality of documents, each assigned to one or more categories from a set of fixed categories;
  
  receive an unclean multi-labeled dataset, wherein at least some objects of the unclean multi-labeled dataset belong to overlapping classes, wherein the probability that a document belongs to the overlapping classes is approximately equal;
  
  produce a recategorized and cleansed dataset from the unclean multi-labeled dataset, comprising predicting a number of labels l̂ for a document j, and comparing l̂ to an existing number of labels l; and
  
  in a second phase, compute from the recategorized and cleansed dataset a probability difference between l and l̂ for j, and take l to be correct if the difference is less than or equal to a threshold.
  - 2. The computing apparatus of claim 1, wherein the two-phase classification engine is further to divide at least part of the clean multi-labeled dataset into a training dataset.
  - 3. The computing apparatus of claim 2, wherein the two-phase classification engine is further to use the training dataset to build a support vector regression model to predict a number of labels to associate with document j.
  - 4. The computing apparatus of claim 3, wherein the two-phase classification engine is further to divide at least part of the clean multi-labeled dataset into a validation set, and to use the validation set to tune the two-phase classification engine.
  - 5. The computing apparatus of claim 1, wherein producing the recategorized and cleansed dataset from the unclean multi-labeled dataset further comprises:
    - receiving a probability threshold α for a number of labels;
      
      computing a probability for l̂; and
      
      determining that the probability for l̂ is greater than α.
  - 6. The computing apparatus of claim 1, wherein producing the recategorized and cleansed dataset from the unclean multi-labeled dataset further comprises computing a set of predicted labels Ŝ for document j.
  - 7. The computing apparatus of claim 6, wherein producing the recategorized and cleansed dataset from the unclean multi-labeled dataset further comprises comparing Ŝ to a set of existing labels S.
  - 8. The computing apparatus of claim 7, wherein producing the recategorized and cleansed dataset from the unclean multi-labeled dataset further comprises determining that Ŝ is partly but not fully contained in S, and replacing S with labels unique to Ŝ that have a probability greater than a threshold T¹.
  - 9. The computing apparatus of claim 7, wherein producing the recategorized and cleansed dataset from the unclean multi-labeled dataset further comprises determining that Ŝ is fully contained in S, and replacing S with Ŝ.
  - 10. The computing apparatus of claim 7, wherein producing the recategorized and cleansed dataset from the unclean multi-labeled dataset further comprises determining that Ŝ is not contained in S, and replacing S with labels common to Ŝ and S, along with labels unique to Ŝ that have a probability greater than a threshold T¹.
  - 11. The computing apparatus of claim 1, wherein the two-phase classification engine is further to build a classifier from the recategorized and cleansed dataset.
  - 12. The computing apparatus of claim 11, wherein the two-phase classification engine is further to compare a precision of the classifier to a precision of a prior classifier.
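
Claims 8 to 10 describe three ways of reconciling the predicted label set Ŝ with the existing set S. The following Python sketch shows one literal reading of those rules. The function name `reconcile_labels`, the `prob` mapping of label to probability, and the interpretation of "not contained" as "no overlap at all" are assumptions made for illustration and do not come from the claims.

```python
def reconcile_labels(existing, predicted, prob, t1):
    """Return a replacement for the existing label set S, given predicted set Ŝ.

    prob maps each predicted label to its probability; t1 plays the role of T¹.
    """
    existing, predicted = set(existing), set(predicted)
    # Labels only in Ŝ whose probability exceeds the threshold T¹.
    confident_new = {
        label for label in predicted - existing if prob.get(label, 0.0) > t1
    }
    if predicted <= existing:
        # Claim 9: Ŝ fully contained in S, so S is replaced with Ŝ.
        return predicted
    if predicted & existing:
        # Claim 8: partial overlap, so S is replaced with the
        # confident labels unique to Ŝ.
        return confident_new
    # Claim 10 (assumed to mean no overlap): common labels plus the
    # confident labels unique to Ŝ. The common part is empty under this reading.
    return (predicted & existing) | confident_new


probs = {"sports": 0.9, "finance": 0.8, "travel": 0.3}
contained = reconcile_labels({"news", "sports"}, {"sports"}, probs, 0.5)
partial = reconcile_labels({"news", "sports"}, {"sports", "finance", "travel"}, probs, 0.5)
disjoint = reconcile_labels({"news"}, {"finance", "travel"}, probs, 0.5)
```

Under these assumptions `contained` is `{"sports"}`, while `partial` and `disjoint` are both `{"finance"}`, since `travel` falls below the threshold of 0.5.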

13. One or more tangible, non-transitory computer-readable mediums having stored thereon executable instructions for providing a two-phase classification engine to:
- in a first phase, receive a clean multi-labeled dataset comprising a plurality of documents, each assigned to one or more categories from a set of fixed categories;
  
  receive an unclean multi-labeled dataset, wherein at least some objects of the unclean multi-labeled dataset belong to overlapping classes, wherein the probability that a document belongs to the overlapping classes is approximately equal;
  
  produce a recategorized and cleansed dataset from the unclean multi-labeled dataset, comprising predicting a number of labels l̂ for a document j, and comparing l̂ to an existing number of labels l; and
  
  in a second phase, compute from the recategorized and cleansed dataset a probability difference between l and l̂ for j, and take l to be correct if the difference is less than or equal to a threshold.
  - 14. The one or more tangible, non-transitory computer-readable mediums of claim 13, wherein the two-phase classification engine is further to divide at least part of the clean multi-labeled dataset into a training dataset.
  - 15. The one or more tangible, non-transitory computer-readable mediums of claim 14, wherein the two-phase classification engine is further to use the training dataset to build a support vector regression model to predict a number of labels to associate with document j.
  - 16. The one or more tangible, non-transitory computer-readable mediums of claim 15, wherein the two-phase classification engine is further to divide at least part of the clean multi-labeled dataset into a validation set, and to use the validation set to tune the two-phase classification engine.
  - 17. The one or more tangible, non-transitory computer-readable mediums of claim 13, wherein producing the recategorized and cleansed dataset from the unclean multi-labeled dataset further comprises:
    - receiving a probability threshold α for a number of labels;
      
      computing a probability for l̂; and
      
      determining that the probability for l̂ is greater than α.
  - 18. The one or more tangible, non-transitory computer-readable mediums of claim 13, wherein producing the recategorized and cleansed dataset from the unclean multi-labeled dataset further comprises computing a set of predicted labels Ŝ for document j.
  - 19. The one or more tangible, non-transitory computer-readable mediums of claim 18, wherein producing the recategorized and cleansed dataset from the unclean multi-labeled dataset further comprises comparing Ŝ to a set of existing labels S.
  - 20. The one or more tangible, non-transitory computer-readable mediums of claim 19, wherein producing the recategorized and cleansed dataset from the unclean multi-labeled dataset further comprises determining that Ŝ is partly but not fully contained in S, and replacing S with labels unique to Ŝ that have a probability greater than a threshold T¹.
  - 21. The one or more tangible, non-transitory computer-readable mediums of claim 19, wherein producing the recategorized and cleansed dataset from the unclean multi-labeled dataset further comprises determining that Ŝ is fully contained in S, and replacing S with Ŝ.
  - 22. The one or more tangible, non-transitory computer-readable mediums of claim 19, wherein producing the recategorized and cleansed dataset from the unclean multi-labeled dataset further comprises determining that Ŝ is not contained in S, and replacing S with labels common to Ŝ and S, along with labels unique to S that have a probability greater than a threshold T¹.

23. A computer-implemented method of providing two-phase multi-label content recategorization, comprising:
- in a first phase, receiving a clean multi-labeled dataset comprising a plurality of documents, each assigned to one or more categories from a set of fixed categories;
  
  receiving an unclean multi-labeled dataset, wherein at least some objects of the unclean multi-labeled dataset belong to overlapping classes, wherein the probability that a document belongs to the overlapping classes is approximately equal;
  
  producing a recategorized and cleansed dataset from the unclean multi-labeled dataset, comprising predicting a number of labels l̂ for a document j, and comparing l̂ to an existing number of labels l; and
  
  in a second phase, computing from the recategorized and cleansed dataset a probability difference between l and l̂ for j, and take l to be correct if the difference is less than or equal to a threshold.
  - 24. The method of claim 23, wherein producing the recategorized and cleansed dataset from the unclean multi-labeled dataset further comprises:
    - receiving a probability threshold α for a number of labels;
      
      computing a probability for l̂; and
      
      determining that the probability for l̂ is greater than α.
  - 25. The method of claim 23, wherein producing the recategorized and cleansed dataset from the unclean multi-labeled dataset further comprises comparing Ŝ to a set of existing labels S.
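
The second phase recited in the independent claims (take l to be correct if the probability difference between l and l̂ is at most a threshold) might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: `resolve_label_count` and `count_probs` are hypothetical names, the "probability difference" is read as the absolute difference between the probabilities assigned to the two counts, and falling back to l̂ when the threshold is exceeded is an assumption, since the claims only state when l is taken to be correct.

```python
def resolve_label_count(existing_count, predicted_count, count_probs, threshold):
    """Keep the existing count l when its probability is close to that of l̂.

    count_probs maps a candidate number of labels to its estimated probability
    for document j (hypothetical input; the claims do not fix its form).
    """
    p_existing = count_probs.get(existing_count, 0.0)
    p_predicted = count_probs.get(predicted_count, 0.0)
    difference = abs(p_predicted - p_existing)
    if difference <= threshold:
        return existing_count   # l is taken to be correct
    return predicted_count      # assumed fallback: use the predicted count


kept = resolve_label_count(2, 3, {2: 0.40, 3: 0.45}, threshold=0.1)
changed = resolve_label_count(1, 3, {1: 0.10, 3: 0.70}, threshold=0.25)
```

With these sample probabilities `kept` stays at 2, because the two counts are nearly equally likely, and `changed` becomes 3, because the predicted count is far more probable than the existing one.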


Current Assignee: Musarubra US LLC (Musarubra US SellCo LLC)
Original Assignee: McAfee, LLC
Inventors: Singh, Nidhi; Olinsky, Craig Philip; Paramasivam, Thamizhannal
Primary Examiner(s): Morrison, Jay A
Assistant Examiner(s): Hoang, Ken
Application Number: US14/977,875
Publication Number: US 20170177627A1
Time in Patent Office: 1,645 Days
Field of Search: 707769
CPC Class Codes: G06F 16/353 (into predefined classes)
