SYSTEM FOR CLUSTERING AND AGGREGATING DATA FROM MULTIPLE SOURCES

US 20150199744A1
Filed: 01/12/2015
Published: 07/16/2015
Est. Priority Date: 01/10/2014
Status: Active Grant


6 Assignments

0 Petitions

Abstract

Systems and methods are provided for receiving, aggregating, and analyzing data to develop caregiver rankings, recommendations, and other information that care seekers may use to connect with caregivers for services, or for caregivers to use to connect with care seekers. Sample data can be obtained from a plurality of data sources, processed to form data clusters, aggregated to form data records, and provided to a care seeker searching for a caregiver or medical facility.

58 Citations

20 Claims

1. A method of aggregating entity data from a plurality of sources, the method comprising:
- obtaining sample data from a plurality of data sources, the sample data corresponding to a plurality of entities, wherein samples from multiple data sources correspond to a same entity;
  
  processing the samples to identify a plurality of fields corresponding to each sample, the fields including a name and a geographical indicator;
  
  identifying a first cluster of the samples as corresponding to a first entity based on a first set of rules, the first cluster including a first sample, wherein identifying the first cluster includes:
  
  determining whether a second sample is in the first cluster by:
  
  determining a first field distance between a first field of the first sample and the first field of the second sample;
  
  calculating a first metric based on the first field distance; and
  
  adding the second sample to the first cluster when the first metric is within a first threshold;
  
  comparing the fields of at least a portion of the samples in the first cluster to determine the name and the geographical indicator for the first entity; and
  
  storing the name and the geographical indicator of the first entity into a first record of a database.
- - 2. The method of claim 1, wherein comparing the fields of at least a portion of the samples in the first cluster includes:
    - identifying which data source a sample is from;
      
      determining confidence values corresponding to the data sources;
      
      using the confidence values to determine the name and the geographical indicator for the first entity.
  - 3. The method of claim 2, wherein at least one field of a sample from a first data source is not used when the data source has a confidence value below a confidence threshold.
  - 4. The method of claim 2, further comprising:
    - receiving feedback from the first entity regarding the first record;
      
      computing the confidence value for a first data source based on the feedback.
  - 5. The method of claim 1, wherein the first threshold, the first field, and the first metric are specified by the first set of rules.
  - 6. The method of claim 1, wherein the first set of rules specifies multiple fields for clustering the samples, each of the specified fields having a corresponding field distance.
  - 7. The method of claim 6, wherein calculating the first metric based on the first field distance includes:
    - calculating a weighted average of the first field distance and one or more other field distances of the corresponding field distances, wherein weights of at least two field distances are different.
  - 8. The method of claim 1, further comprising:
    - for each of one or more additional sets of rules:
      
      identifying an additional cluster of the samples as corresponding to the first entity;
      
      comparing the samples of the first cluster and the one or more additional clusters to determine samples in a first optimum cluster; and
      
      determining the name and the geographical indicator for the first entity by comparing the fields of samples in the first optimum cluster.
  - 9. The method of claim 8, wherein at least two different sets of rules specify different fields to use for clustering the samples.
  - 10. The method of claim 1, wherein identifying the first cluster of the samples as corresponding to the first entity comprises:
    - identifying a plurality of different sets of partial clusters that are clustered by different fields in the plurality of fields;
      
      determining one or more field distances between corresponding fields of two partial clusters of the different sets of clusters to obtain a cluster distance between the two partial clusters; and
      
      combining two partial clusters from two different sets of partial clusters to form the first cluster based on the cluster distance being below a cluster threshold.
  - 11. The method of claim 10, wherein the field distances are determined using one of:
    - string matching, fuzzy logic, fuzzy feature contrast (FCC), or local sequence comparison that identifies similarities between corresponding fields in the different sets of clusters.
  - 12. The method of claim 10, wherein the at least two clusters are generated based on the same data from the plurality of data sources.
  - 13. The method of claim 10, further comprising:
    - determining first representative fields of a first partial cluster of the two partial clusters by comparing the fields of the samples in the first partial cluster; and
      
      determining second representative fields of a second partial cluster of the two partial clusters by comparing the fields of the samples in the second partial cluster, wherein the one or more field distances are determined using one or more corresponding fields of the first representative fields and the second representative fields.
  - 14. The method of claim 1, further comprising:
    - determining that a first data source is associated with a higher confidence value than a second data source, the first data source and second data source included in the plurality of data sources;
      
      determining that the first data source is an origin of data for the first record in the database based in part on the higher confidence value;
      
      calculating a first decay rate for a first confidence value for the first data source and a second decay rate of a second confidence value for the second data source;
      
      determining that the first data source is less accurate after a time than the second data source based in part on the first confidence value being less than the second confidence value; and
      
      altering the first record in the database to correspond with the second data source in the database based on the first and second confidence values.
  - 15. The method of claim 1, further comprising:
    - identifying the first record of the database to include first feedback;
      
      comparing the first feedback associated with the first record of the database with second feedback associated with other records of the database;
      
      generating a ranking of the records in the database based in part on the first feedback associated with the first record and the second feedback associated with the other records; and
      
      enabling the ranking to display on a graphical user interface.
  - 16. The method of claim 1, further comprising:
    - identifying the first record of the database, wherein the first record includes the first field from a first data source in the plurality of data sources;
      
      obtaining new sample data from a second data source in the plurality of data sources, wherein the new sample data includes a different value for the first field compared to the first record;
      
      determining a new confidence value associated with the new sample data, wherein an existing confidence value is associated with the first data source or the first field corresponding with the first record of the database;
      
      determining that the new confidence value associated with the new sample data is less than the existing confidence value associated with the data source or first field corresponding with the first record of the database; and
      
      maintaining the first record as unchanged despite the new sample data.
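
The clustering step of claim 1, combined with the weighted average of claim 7, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the `RULES` dictionary, its weights and threshold, the `zip` field standing in for the geographical indicator, and the use of `difflib.SequenceMatcher` as the field distance. The claims do not prescribe any of these choices.

```python
from difflib import SequenceMatcher

# Hypothetical rule set: the fields, weights, and threshold are illustrative
# values chosen for this sketch, not taken from the patent.
RULES = {"weights": {"name": 0.7, "zip": 0.3}, "threshold": 0.25}


def field_distance(a: str, b: str) -> float:
    """Distance in [0, 1]: 0 for identical strings, 1 for no similarity."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def metric(s1: dict, s2: dict, rules: dict) -> float:
    """Weighted average of the per-field distances named in the rule set."""
    weights = rules["weights"]
    total = sum(weights.values())
    return sum(w * field_distance(s1[f], s2[f]) for f, w in weights.items()) / total


def build_cluster(first: dict, samples: list, rules: dict) -> list:
    """Start from `first`; add each other sample whose metric is within the threshold."""
    cluster = [first]
    for sample in samples:
        if sample is not first and metric(first, sample, rules) <= rules["threshold"]:
            cluster.append(sample)
    return cluster


if __name__ == "__main__":
    a = {"name": "Dr. Jane Smith", "zip": "94103"}
    b = {"name": "Jane Smith", "zip": "94103"}
    c = {"name": "Robert Jones", "zip": "10001"}
    for member in build_cluster(a, [a, b, c], RULES):
        print(member["name"])
```

In this sketch every candidate is compared only against the first sample. Comparing against a cluster representative, or running several rule sets as in claim 8, would be alternative designs.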

17. A computer product comprising a non-transitory computer readable medium embodying thereon a set of instructions, which when executed by a computer system cause the computer system to perform the steps of:
- obtaining sample data from a plurality of data sources, the sample data corresponding to a plurality of entities, wherein samples from multiple data sources correspond to a same entity;
  
  processing the samples to identify a plurality of fields corresponding to each sample, the fields including a name and a geographical indicator;
  
  identifying a first cluster of the samples as corresponding to a first entity based on a first set of rules, the first cluster including a first sample, wherein identifying the first cluster includes:
  
  determining whether a second sample is in the first cluster by:
  
  determining a first field distance between a first field of the first sample and the first field of the second sample;
  
  calculating a first metric based on the first field distance; and
  
  adding the second sample to the first cluster when the first metric is within a first threshold;
  
  comparing the fields of at least a portion of the samples in the first cluster to determine the name and the geographical indicator for the first entity; and
  
  storing the name and the geographical indicator of the first entity into a first record of a database.
- - 18. The computer product of claim 17, wherein comparing the fields of at least a portion of the samples in the first cluster includes:
    - identifying which data source a sample is from;
      
      determining confidence values corresponding to the data sources;
      
      using the confidence values to determine the name and the geographical indicator for the first entity.
  - 19. The computer product of claim 17, wherein at least one field of a sample from a first data source is not used when the data source has a confidence value below a confidence threshold.
  - 20. The computer product of claim 17, further comprising:
    - receiving feedback from the first entity regarding the first record;
      
      computing the confidence value for a first data source based on the feedback.
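
The confidence-based comparison recited in claims 2, 3, 18, and 19 can be sketched as a weighted vote over the samples of one cluster. This is only one possible reading: the source names, the confidence values, the cutoff, and the helper names `resolve_field` and `build_record` are all hypothetical, and the claims do not say how confidence values are combined.

```python
from collections import defaultdict

# Hypothetical per-source confidence values and cutoff; illustrative only.
SOURCE_CONFIDENCE = {"registry": 0.9, "directory": 0.6, "scrape": 0.2}
CONFIDENCE_THRESHOLD = 0.3


def resolve_field(cluster, field, confidence=SOURCE_CONFIDENCE,
                  threshold=CONFIDENCE_THRESHOLD):
    """Return the field value with the highest summed source confidence."""
    votes = defaultdict(float)
    for sample in cluster:
        conf = confidence.get(sample["source"], 0.0)
        if conf < threshold:
            continue  # fields from low-confidence sources are not used
        votes[sample[field]] += conf
    if not votes:
        return None  # no sufficiently trusted source supplied this field
    return max(votes, key=votes.get)


def build_record(cluster):
    """Assemble the record (name plus geographical indicator) for one entity."""
    return {
        "name": resolve_field(cluster, "name"),
        "zip": resolve_field(cluster, "zip"),
    }


if __name__ == "__main__":
    cluster = [
        {"source": "registry", "name": "Jane Smith, MD", "zip": "94103"},
        {"source": "directory", "name": "Jane Smith", "zip": "94103"},
        {"source": "scrape", "name": "J Smith", "zip": "94110"},
        {"source": "scrape", "name": "J Smith", "zip": "94110"},
    ]
    print(build_record(cluster))
```

Summing confidences lets two moderately trusted sources outvote one highly trusted source. Taking the value from the single most trusted source would be an equally plausible design.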

Current Assignee: Quest Analytics LLC
Original Assignee: BetterDoctor, Inc. (Quest Analytics LLC)
Inventors: Tolvanen, Tapio; Tulla, Ari; Hao, Tele
Granted Patent: US 10,026,114 B2
US Class Current: 1/1
CPC Class Codes:

G06F 16/9537   Spatial or temporal depende...

G06Q 30/0631   Item recommendations

G16H 10/60   for patient-specific data, ...
