Managing a set of data

US 9,705,972 B2
Filed: 10/31/2014
Issued: 07/11/2017
Est. Priority Date: 10/31/2014
Status: Active Grant

Abstract

Aspects of the disclosure include managing a set of data associated with a corpus. By analyzing the corpus, a domain is established to characterize the subject matter of the set of data. A user identifier is generated for a portion of the set of data. Based upon a credibility computation, a quality factor for a portion of the set of data is determined. The credibility computation includes using both the domain and the user identifier to determine the quality factor for the portion of the set of data. The quality factor for the portion of the set of data is compared with a threshold. In response to a quality factor for a portion of the set of data exceeding the threshold, the portion of the set of data is selected.
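
The selection step described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the `Portion` class, the `CREDIBILITY` table, the sample values, and the choice to use the credibility value directly as the quality factor are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Portion:
    text: str
    user_id: str  # identifier generated for the portion's likely author


# Hypothetical credibility values keyed by (domain, user identifier).
CREDIBILITY = {
    ("cardiology", "user-17"): 0.9,
    ("cardiology", "user-42"): 0.3,
}


def quality_factor(domain, portion, credibility=CREDIBILITY):
    # Assumption: the quality factor is simply the credibility value for the
    # (domain, user identifier) pair; unknown pairs get 0.0.
    return credibility.get((domain, portion.user_id), 0.0)


def select_portions(domain, portions, threshold):
    # Keep only portions whose quality factor strictly exceeds the threshold.
    return [p for p in portions if quality_factor(domain, p) > threshold]


portions = [
    Portion("Beta blockers lower heart rate.", "user-17"),
    Portion("Miracle cure found in my kitchen.", "user-42"),
    Portion("Unattributed remark.", "user-99"),
]
selected = select_portions("cardiology", portions, threshold=0.5)
print([p.user_id for p in selected])  # prints ['user-17']
```

Because both the domain and the user identifier form the lookup key, the same author can be credible in one subject area and not in another, which matches the role the abstract gives to the credibility computation.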

19 Claims

1. A computer implemented method for generating a qualified set of data, the method comprising:
- receiving, by at least one processor, an input set of data;
  
  determining, by the at least one processor analyzing the input set of data, a domain that characterizes a subject matter of the input set of data;
  
  computing, by extracting a common feature from the input set of data by the at least one processor, a probability that a specific user created a first portion of the input set of data;
  
  identifying, by the at least one processor, the first portion of the input set of data based, at least in part, on the first portion of the input set of data having the common feature;
  
  generating, by the at least one processor, based, at least in part, on the domain, on the probability and on the first portion of the input set of data having the common feature, a user identifier associated with the first portion of the input set of data;
  
  storing, by the at least one processor, the user identifier in a data repository;
  
  computing, by the at least one processor, based at least in part on the domain and the user identifier, a credibility measure;
  
  computing, by the at least one processor, based at least in part on the credibility measure, a quality factor associated with the first portion of the input set of data;
  
  generating, by the at least one processor, based at least in part on the quality factor exceeding a quality factor threshold, the qualified set of data comprising data, among the first portion of the input data, that exceeds the quality threshold; and
  
  outputting, by the at least one processor, the qualified set of data.
- Dependent claims (2 to 17):
  - 2. The method of claim 1, wherein the input set of data includes user-provided content within a corpus.
  - 3. The method of claim 2, wherein the user-provided content includes at least one of commentary or commentator names.
  - 4. The method of claim 1, wherein the input set of data includes a user-generated file.
  - 5. The method of claim 4, wherein the user generated file comprises at least one of a news document, a published document, a patent document and a social media document.
  - 6. The method of claim 1, wherein the analyzing the input set of data includes using at least one of a web crawler technique, a pattern recognition technique, and a natural language processing technique.
  - 7. The method of claim 1, wherein the domain includes a topic to classify a specific data element among the input set of data.
  - 8. The method of claim 1, wherein the first portion of the input set of data includes a social network identifier associated with the common feature.
  - 9. The method of claim 1, wherein the computing the probability comprises the at least one processor using a natural language processing technique.
  - 10. The method of claim 1, wherein the storing the user identifier in the data repository includes the at least one processor mapping the user identifier to the first portion of the input set of data.
  - 11. The method of claim 1, wherein the computing the credibility measure includes:
    - the at least one processor computing, by awarding points to the user identifier, a score value for the user identifier wherein the points are associated with a previous internet activity, associated with the domain, corresponding to the user identifier;
      
      the at least one processor comparing the score value with a score value threshold; and
      
      the at least one processor computing the credibility measure based on the comparison of the score value with the score value threshold.
  - 12. The method of claim 1, wherein the quality factor is based on a quality score which indicates a level of credibility for the user identifier within the subject matter of the input set of data.
  - 13. The method of claim 12, wherein the at least one processor computes the quality score by awarding points to the user identifier based upon historical data.
  - 14. The method of claim 13, wherein the historical data comprises at least one of:
    - commentator activity, social network activity, publication activity, and group association activity.
  - 15. The method of claim 1, wherein the at least one processor outputting the qualified set of data comprises at least one of the at least one processor marking and the at least one processor displaying data among the qualified set of data.
  - 16. The method of claim 1, further comprising the at least one processor:
    - storing, in the data repository, the first portion of the input set of data;
      
      computing a relevancy score associated with the first portion of the input set of data;
      
      determining that the relevancy score is below a relevancy score value threshold; and
      
      removing, based on the relevancy score being below the threshold, the first portion of the input set of data from the data repository.
  - 17. The method of claim 16, wherein the determining the relevancy score to be below the relevancy score value threshold includes the at least one processor using a technique comprising at least one of machine learning techniques, keyword techniques, or embedded link analysis techniques.
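
As a rough illustration of claims 11 and 16, the sketch below awards points for prior activity in a domain, compares the resulting score with a threshold, and prunes low-relevancy portions from a repository. Every name and value here is hypothetical. The claims do not state how the credibility measure follows from the comparison, so the binary outcome is an assumption, and the keyword-fraction relevancy score is only one possible reading of the "keyword techniques" named in claim 17.

```python
def score_value(user_id, domain, activity_log, points):
    # Sum the points awarded for the user's previous activity in the domain.
    return sum(
        points.get(kind, 0)
        for uid, dom, kind in activity_log
        if uid == user_id and dom == domain
    )


def credibility_measure(score, score_threshold):
    # Assumption: a binary measure derived from the threshold comparison.
    return 1.0 if score >= score_threshold else 0.0


def keyword_relevancy(text, keywords):
    # Fraction of words in the text that match a domain keyword.
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    return sum(w in keywords for w in words) / len(words) if words else 0.0


def prune_repository(repository, relevancy, relevancy_threshold):
    # Remove stored portions whose relevancy score is below the threshold.
    stale = [k for k, text in repository.items()
             if relevancy(text) < relevancy_threshold]
    for key in stale:
        del repository[key]


POINTS = {"publication": 5, "comment": 1}  # hypothetical point values
LOG = [
    ("user-17", "cardiology", "publication"),
    ("user-17", "cardiology", "comment"),
    ("user-17", "cooking", "comment"),
    ("user-42", "cardiology", "comment"),
]
score_17 = score_value("user-17", "cardiology", LOG, POINTS)  # 5 + 1 = 6
score_42 = score_value("user-42", "cardiology", LOG, POINTS)  # 1

repository = {"p1": "Statins lower cholesterol.", "p2": "Great weather today."}
keywords = {"statins", "cholesterol"}
prune_repository(repository, lambda t: keyword_relevancy(t, keywords), 0.2)
print(sorted(repository))  # prints ['p1']
```

Activity logged under a different domain (the "cooking" entry) does not contribute to the cardiology score, reflecting the claim language that the points are associated with activity in the domain.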

18. A computer program for generating a qualified set of data, the computer program product comprising a computer readable storage medium having instructions embodied therewith, the program instructions executable by a processor to cause the processor to:
- receive an input set of data;
  
  determine, by analyzing the input set of data, a domain that characterizes a subject matter of the input set of data;
  
  compute, by extracting a common feature from the input set of data, a probability that a specific user created a first portion of the input set of data;
  
  identify the first portion of the input set of data based, at least in part, on the first portion of the input set of data having the common feature;
  
  generate, based, at least in part, on the domain, on the probability and on the first portion of the input set of data having the common feature, a user identifier associated with the first portion of the input set of data;
  
  store the user identifier in a data repository;
  
  compute a credibility measure, based at least in part on the domain and the user identifier;
  
  compute, based at least in part on the credibility measure, a quality factor associated with the first portion of the input set of data; and
  
  generate, based at least in part on the quality factor exceeding a quality factor threshold, the qualified set of data comprising data, among the first portion of the input data, that exceeds the quality threshold; and
  
  output the qualified set of data.

19. A computer system for generating a qualified set of data, the computer system comprising a processor configured to:
- receive an input set of data;
  
  determine, by analyzing the input set of data, a domain that characterizes a subject matter of the input set of data;
  
  compute, by extracting a common feature from the input set of data, a probability that a specific user created a first portion of the input set of data;
  
  identify the first portion of the input set of data based, at least in part, on the first portion of the input set of data having the common feature;
  
  generate, based, at least in part, on the domain, the probability and the first portion of the input set of data having the common feature, a user identifier associated with the first portion of the input set of data;
  
  store the user identifier in a data repository;
  
  compute, based at least in part on the domain and the user identifier, a credibility measure;
  
  compute, based at least in part on the credibility measure, a quality factor associated with the first portion of the input set of data;
  
  generate, based at least in part on the quality factor exceeding a quality factor threshold, the qualified set of data comprising data, among the first portion of the input data, that exceeds the quality threshold; and
  
  output the qualified set of data.
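
The authorship steps shared by the independent claims (extracting a common feature, computing a probability that a specific user created a portion, and generating a user identifier) might look like the sketch below. It is an illustration only. The signature string used as the common feature, the frequency-based probability estimate over hypothetical labeled samples, the 0.5 cutoff, and the hash-based identifier format are all assumptions that the claims do not specify.

```python
import hashlib


def authorship_probability(feature, user, labeled_samples):
    # Estimate P(user | feature) as the share of feature-bearing samples
    # that are known to come from the given user.
    authors = [author for author, text in labeled_samples if feature in text]
    if not authors:
        return 0.0
    return authors.count(user) / len(authors)


def identify_portion(feature, input_data):
    # The "first portion": input items that carry the common feature.
    return [item for item in input_data if feature in item]


def make_user_identifier(domain, user, probability, min_probability=0.5):
    # Assumption: issue an identifier only when authorship is likely enough.
    if probability < min_probability:
        return None
    digest = hashlib.sha256(f"{domain}:{user}".encode("utf-8")).hexdigest()
    return f"{domain}-{digest[:8]}"


FEATURE = "-- A. Smith, MD"  # hypothetical common feature (a signature line)
labeled_samples = [
    ("alice", "Dosage looks right. -- A. Smith, MD"),
    ("alice", "See the 2012 trial. -- A. Smith, MD"),
    ("bob", "Quoting a reply: -- A. Smith, MD"),
    ("bob", "I disagree with this."),
]
input_data = ["New guidance is out. -- A. Smith, MD", "Anonymous note."]

probability = authorship_probability(FEATURE, "alice", labeled_samples)
first_portion = identify_portion(FEATURE, input_data)
user_identifier = make_user_identifier("cardiology", "alice", probability)

# Store the identifier in a data repository, mapped to the portion it covers.
repository = {user_identifier: first_portion}
```

Including the domain in the hashed input means the same person receives distinct identifiers in distinct domains, which is one way to let the later credibility computation stay domain-specific.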

Current Assignee: International Business Machines Corporation
Original Assignee: International Business Machines Corporation
Inventors: Branch, Joel W.; Mazzella, Daniel A.; Guo, Shang Q.; Nelson, John C.; Lenchner, Jonathan; Mukherjee, Maharaj; Aaron, Andrew S.
Primary Examiner(s): ROBINSON, GRETA LEE
Application Number: US14/529,653
Publication Number: US 20160124946A1
Time in Patent Office: 984 Days
Field of Search: 707/600, 707/602, 707/603, 707/723, 707/726, 707/727, 707/732, 707/748-751, 707/755, 707/758, 705/1.1, 705/7.11, 705/7.29, 705/7.31, 705/7.32, 705/7.37, 705/7.38, 705/7.39, 705/7.41, 705/319, 715/230, 715/233, 715/968
CPC Class Codes:
G06F 16/951 Indexing; Web crawling tech...
H04L 67/10 in which an application is ...