Systems and methods for identifying anomalous data in large structured data sets and querying the data sets

US 9,965,524 B2
Filed: 04/03/2014
Issued: 05/08/2018
Est. Priority Date: 04/03/2013
Status: Active Grant

Abstract

The technology disclosed relates to automatic generation of tuples from a record set for outlier analysis. With this technology, a user need not specify which 1-tuples to combine into n-tuples. The tuples are generated from structured records organized into features (which could also be fields, objects, or attributes). Tuples are generated from combinations of feature values in the records. Thresholding is applied to manage the number of tuples generated. The technology disclosed further relates to indexing and searching high-dimensional tuple spaces in a computer-implemented system.
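
The tuple generation and thresholding summarized above can be sketched in a few lines. This is only an illustration under stated assumptions, not the patented implementation: the record layout (a list of dicts), the function name `expand_tuples`, its parameters, and the sample data are all hypothetical. The sketch widens tuples one feature at a time, counts each feature-value combination, and keeps only combinations whose count meets the threshold.

```python
from collections import Counter
from itertools import combinations


def expand_tuples(records, features, threshold=2, max_width=None):
    """Return {tuple of (feature, value) pairs: count} for retained combinations.

    Hypothetical helper. A combination is only counted if its prefix
    (the same tuple minus its last feature) survived the previous level.
    """
    max_width = max_width or len(features)

    # Level 1: count every single feature value (the 1-tuples).
    counts = Counter()
    for rec in records:
        for f in features:
            if f in rec:
                counts[((f, rec[f]),)] += 1
    level = {t: c for t, c in counts.items() if c >= threshold}
    retained = dict(level)

    width = 1
    while level and width < max_width:
        width += 1
        counts = Counter()
        for rec in records:
            items = [(f, rec[f]) for f in features if f in rec]
            for combo in combinations(items, width):
                # Skip combinations whose narrower prefix was not retained.
                if combo[:-1] in level:
                    counts[combo] += 1
        # Thresholding: drop combinations seen fewer than `threshold` times.
        level = {t: c for t, c in counts.items() if c >= threshold}
        retained.update(level)
    return retained


# Made-up sample records for illustration only.
records = [
    {"state": "CA", "title": "CEO", "industry": "tech"},
    {"state": "CA", "title": "CEO", "industry": "tech"},
    {"state": "CA", "title": "CTO", "industry": "retail"},
    {"state": "NY", "title": "CEO", "industry": "tech"},
]
retained = expand_tuples(records, ["state", "title", "industry"], threshold=2)
```

Because every retained combination must extend a retained narrower combination, values that appear only once (such as `NY` or `CTO` in the sample) never seed wider tuples, which is how thresholding keeps the number of generated tuples manageable.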

160 Citations

19 Claims

1. A system that identifies anomalous data in a record set by comparing frequencies of unique elements obtained from the record set and frequencies of the unique elements in a reference data set, the system including:
- a computer including memory; and
  
  computer instructions causing the computer to implement:
  
  creating an expanded tuple set by automatically expanding an existing first tuple set of a first feature from the record set to include a second tuple set of a second feature from the record set, the existing first tuple set being expanded by (i) adding the second tuple set to the existing first tuple set and (ii) creating unique elements with elements from the first feature from the record set and the second feature from the record set, wherein the unique elements in the expanded tuple set enumerate permutations of unique values of the second feature from the record set that are combined with values of the first feature from the record set to form the expanded tuple set;
  
  identifying a count of how often each feature value combination of the unique elements is found in the expanded tuple set;
  
  limiting the unique elements in the expanded tuple set to inhabited feature value combinations by (i) applying a threshold count criterion of 2 or more to the identified counts of how often the feature value combinations of the unique elements are found in the expanded tuple set and (ii) not retaining unique elements in the expanded tuple set that do not satisfy the threshold count criterion;
  
  after expanding the existing first tuple set into the expanded tuple set and applying the threshold count criterion, comparing frequencies of the unique elements in the expanded tuple set to frequencies of the unique elements in the reference data set to identify anomalous frequencies of the unique elements in the expanded tuple set with respect to the frequencies of the unique elements in the reference data set; and
  
  spotting outliers from the expanded tuple set with respect to the reference data set based on the identified anomalous frequencies.
  - 2. The system of claim 1, wherein the threshold count criterion is in a range of 2 to 20.
  - 3. The system of claim 1, wherein a number of features in the expanded tuple set is in a range of 4 to 40.
  - 4. The system of claim 1, wherein a number of features in the expanded tuple set is in a range of 5 to 20.
  - 5. The system of claim 1, further including, before combining a unique value of the second feature from the record set with an element of or applying the threshold count criterion to a resulting expanded tuple set element, qualifying the unique value of the second feature as satisfying the threshold count criterion.
  - 6. The system of claim 1, wherein:
    - the record set includes elements of a first type that are being tested for frequency of anomalies; and
      
      the reference data set includes between 10 and one billion times as many elements of the first type as the record set.
  - 7. The system of claim 6, applied repeatedly to distinct groups of elements of the first type, wherein there are between 10 and one million of the distinct groups of the first type.
  - 8. The system of claim 1, wherein the computer instructions further cause the computer to implement reporting the outliers for analysis.
  - 9. The system of claim 1, applied to identifying valued sources of contacts, wherein:
    - the record set and the reference data set both include sales of contact objects;
      
      the record set includes contact objects from identified sources that are being tested for frequency of contact resale;
      
      the record set and the reference data set both include or can be counted to produce frequencies of contact object sales;
      
      the comparing of the frequencies includes comparing the frequencies of the contact object sales for the expanded tuple set generated from the record set to tuples generated from the reference data set; and
      
      the outliers are the identified sources whose contact objects have been sold with an anomalous frequency.
  - 10. The system of claim 9, wherein categories of the identified valued sources of contacts further comprise company name, contact title, and contact location.
  - 11. The system of claim 1, applied to screening insurance claims, wherein:
    - the record set and the reference data set both include insurance claims submitted from service providers;
      
      the record set includes objects from at least one identified service provider whose claims are being tested; and
      
      the comparing of the frequencies includes comparing frequencies of insurance claim feature tuples generated from the record set for an identified service provider to insurance claim feature tuples generated from the reference data set.
  - 12. The system of claim 11, further including:
    - submissions of insurance claims having object features that match the insurance claim feature tuples generated from the record set to the insurance claim feature tuples generated from the reference data set; and
      
      the outliers are identified sources whose insurance claims have been submitted with an anomalous frequency.
  - 13. The system of claim 1, applied to customer service call center routing wherein:
    - the record set and the reference data set both include completed call summaries submitted from incoming customer calls;
      
      the record set includes objects from at least one identified call center whose incoming customer calls are being evaluated; and
      
      the comparing of the frequencies includes comparing frequencies of customer complaint feature tuples generated from the record set for an identified call center to customer complaint feature tuples from the reference data set.
  - 14. The system of claim 13, further including:
    - completed call summaries having object features that match the expanded tuple set generated from the record set to the tuples generated from the reference data set; and
      
      the outliers are customer service agents whose completed call summaries have been resolved with an anomalous frequency.
  - 15. The system of claim 14, wherein resolved customer complaints with anomalous frequency correlate to customer service agents who handled incoming service calls with high rates of success.
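
The comparing and outlier-spotting steps of claim 1, as applied in claims 9 through 15, can be illustrated with a short sketch. The claims do not name a comparison statistic, so the ratio of relative frequencies used here, the `min_ratio` cutoff, the function names, and the insurance-style sample counts are all assumptions made for illustration. Relative frequencies are used because, per claim 6, the reference data set may be far larger than the record set.

```python
from collections import Counter


def relative_frequencies(counts):
    """Convert raw tuple counts into fractions of the total count."""
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()} if total else {}


def spot_outliers(record_counts, reference_counts, min_ratio=3.0):
    """Return {tuple: ratio} where the record-set frequency differs from the
    reference frequency by at least a factor of `min_ratio` (hypothetical rule).
    """
    rec_freq = relative_frequencies(record_counts)
    ref_freq = relative_frequencies(reference_counts)
    outliers = {}
    for tup, freq in rec_freq.items():
        ref = ref_freq.get(tup)
        if ref is None:
            continue  # no baseline in the reference set; this sketch skips it
        ratio = freq / ref
        if ratio >= min_ratio or ratio <= 1 / min_ratio:
            outliers[tup] = ratio
    return outliers


# Made-up counts of (procedure, diagnosis) tuples for one service provider
# and for a much larger reference population.
record_counts = Counter({("proc_A", "diag_X"): 8, ("proc_B", "diag_Y"): 2})
reference_counts = Counter({
    ("proc_A", "diag_X"): 100,
    ("proc_B", "diag_Y"): 300,
    ("proc_C", "diag_Z"): 600,
})
outliers = spot_outliers(record_counts, reference_counts)
```

In this sample the provider submits the `("proc_A", "diag_X")` combination at roughly eight times the reference rate, so only that tuple is flagged. A production system would likely substitute a statistical test for the fixed ratio.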

16. A non-transitory computer readable media, including instructions that, when executed on a processor, cause the processor to execute a method for identifying anomalous data in a record set by comparing frequencies of unique elements obtained from the record set and frequencies of the unique elements in a reference data set, the method comprising:
- creating an expanded tuple set by automatically expanding an existing first tuple set of a first feature from the record set to include a second tuple set of a second feature from the record set, the existing first tuple set being expanded by (i) adding the second tuple set to the existing first tuple set, and (ii) creating unique elements with elements from the first feature from the record set and the second feature from the record set, wherein the unique elements in the expanded tuple set enumerate permutations of unique values of the second feature from the record set that are combined with values of the first feature from the record set to form the expanded tuple set;
  
  identifying a count of how often each feature value combination of the unique elements is found in the expanded tuple set;
  
  limiting the unique elements in the expanded tuple set to inhabited feature value combinations by (i) applying a threshold count criterion of 2 or more to the identified counts of how often the feature value combinations of the unique elements are found in the expanded tuple set and (ii) not retaining unique elements in the expanded tuple set that do not satisfy the threshold count criterion;
  
  after expanding the existing first tuple set into the expanded tuple set and applying the threshold count criterion, comparing frequencies of the unique elements in the expanded tuple set to frequencies of the unique elements in the reference data set to identify anomalous frequencies of the unique elements in the expanded tuple set with respect to the frequencies of the unique elements in the reference data set; and
  
  spotting outliers from the expanded tuple set with respect to the reference data set based on the identified anomalous frequencies.

17. A method of identifying anomalous data in a record set by comparing frequencies of unique elements obtained from the record set and frequencies of the unique elements in a reference data set, the method including:
- creating an expanded tuple set by automatically expanding an existing first tuple set of a first feature from the record set to include a second tuple set of a second feature from the record set, the existing first tuple set being expanded by (i) adding the second tuple set to the existing first tuple set and (ii) creating unique elements with elements from the first feature from the record set and the second feature from the record set, wherein the unique elements in the expanded tuple set enumerate permutations of unique values of the second feature from the record set that are combined with values of the first feature from the record set to form the expanded tuple set;
  
  identifying a count of how often each feature value combination of the unique elements is found in the expanded tuple set;
  
  limiting the unique elements in the expanded tuple set to inhabited feature value combinations by (i) applying a threshold count criterion of 2 or more to the identified counts of how often the feature value combinations of the unique elements are found in the expanded tuple set and (ii) not retaining unique elements in the expanded tuple set that do not satisfy the threshold count criterion;
  
  after expanding the existing first tuple set into the expanded tuple set and applying the threshold count criterion, comparing frequencies of the unique elements in the expanded tuple set to frequencies of the unique elements in the reference data set to identify anomalous frequencies of the unique elements in the expanded tuple set with respect to the frequencies of the unique elements in the reference data set; and
  
  spotting outliers from the expanded tuple set with respect to the reference data set based on the identified anomalous frequencies.
  - 18. The method of claim 17, wherein the threshold count criterion is in a range of 2 to 20.
  - 19. The method of claim 18, wherein a number of features in the expanded tuple set is in a range of 4 to 40.

Current Assignee
Salesforce.com, Inc.
Original Assignee
Salesforce.com, Inc.
Inventors
Fuchs, Matthew; Georgiev, Stanislav
Primary Examiner(s)
Hasan, Syed
Assistant Examiner(s)
Phan, Tuan-Khanh

Application Number
US14/244,146
Publication Number
US 20140304279A1
Time in Patent Office
1,496 Days
Field of Search
None
US Class Current
CPC Class Codes

G06F 16/2465 Query processing support fo...

H04B 17/00 Monitoring; Testing of line...
