DETERMINING A DEGREE OF SIMILARITY OF A SUBSET OF TABULAR DATA ARRANGEMENTS TO SUBSETS OF GRAPH DATA ARRANGEMENTS AT INGESTION INTO A DATA-DRIVEN COLLABORATIVE DATASET PLATFORM

US 20190095472A1
Filed: 09/20/2018
Published: 03/28/2019
Est. Priority Date: 03/09/2017
Status: Active Grant

Abstract

Various embodiments relate generally to data science and data analysis, computer software and systems, and wired and wireless network communications to interface among repositories of disparate datasets and computing machine-based entities configured to access datasets. More specifically, according to at least some examples, they relate to a computing and data storage platform that determines degrees of similarity between at least a subset of data associated with an ingested dataset and one or more equivalent or similar subsets of data associated with one or more graph-based data arrangements. The degrees of similarity facilitate preferences or priorities in joining one or more graph-based data arrangements to the ingested dataset. For example, a method may include generating similarity matrices to join an ingested dataset (e.g., a tabular dataset) to one or more graph-based datasets in accordance with a determined degree of similarity that indicates a dataset with which to join.
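
As a rough illustration of how a degree of similarity could set join priorities, the sketch below ranks candidate graph-side datasets against one ingested column using an exact intersection-over-union ratio. It is a simplification under stated assumptions: the abstract speaks of similarity matrices rather than raw value sets, and the function names (`overlap_ratio`, `rank_join_candidates`) and the sample data are hypothetical, not taken from the patent.

```python
def overlap_ratio(column_values, dataset_values):
    """Exact intersection-over-union of two collections of values."""
    a, b = set(column_values), set(dataset_values)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_join_candidates(column_values, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = {name: overlap_ratio(column_values, values)
              for name, values in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical ingested column and hypothetical graph-side value sets.
zip_column = ["78701", "78702", "78703", "78704"]
graph_datasets = {
    "austin_zip_codes": ["78701", "78702", "78703", "78704", "78705"],
    "state_codes": ["TX", "CA", "NY"],
    "mixed_zips": ["78701", "10001", "94105"],
}

ranking = rank_join_candidates(zip_column, graph_datasets)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The highest-scoring dataset would be the first one offered as a join target; a production system would likely replace the exact sets with compact summaries so that full columns need not be compared at ingestion time.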

20 Claims

1. A method comprising:
- identifying subsets of data as columnar data associated with a data arrangement, the data arrangement being a tabular data arrangement including each of the subsets of data as a column of data;
- generating a similarity matrix of data associated with a subset of data for each column of data, the similarity matrix of data being configured to determine a degree of similarity to other datasets with which to join;
- accessing a plurality of similarity matrices each formed to identify an amount of relevant data associated with a dataset disposed in a graph data arrangement;
- analyzing the similarity matrix of data in view of the plurality of similarity matrices;
- identifying a subset of the plurality of similarity matrices to form a subset of relevant similarity matrices;
- generating links among the column of data and a subset of the other datasets associated with the subset of relevant similarity matrices; and
- forming a subset of the links between the column of data and at least one of the other datasets.
- Dependent claims (2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17):
  - 2. The method of claim 1 wherein generating the similarity matrix of data further comprises:
    - generating a plurality of compressed data representations for each column of data.
  - 3. The method of claim 1 wherein analyzing the similarity matrix of data in view of the plurality of similarity matrices further comprises:
    - computing the degree of similarity as a function of an approximated overlap based on a first ratio.
  - 4. The method of claim 3 further comprising:
    - determining the first ratio between an amount of common data attributes and a combined set of data attributes.
  - 5. The method of claim 4 wherein the amount of common data attributes includes a number of data attribute values in the subset of data as a column of data and a subset of the dataset disposed in the graph data arrangement, and the combined set of data attributes includes a combined number of data attribute values in the subset of data and the subset of the dataset.
  - 6. The method of claim 3 further comprising:
    - determining the first ratio between an intersection of data attributes and a union of the data attributes.
  - 7. The method of claim 1 wherein analyzing the similarity matrix of data in view of the plurality of similarity matrices further comprises:
    - computing the degree of similarity as a function of an approximated coverage based on a second ratio.
  - 8. The method of claim 7 further comprising:
    - determining the second ratio between an amount of data attributes and a combined set of data attributes.
  - 9. The method of claim 8 wherein the amount of data attributes includes a number of data attribute values in the subset of data as a column of data, and the combined set of data attributes includes a combined number of data attribute values in the subset of data and the subset of the dataset.
  - 10. The method of claim 7 further comprising:
    - determining the second ratio between data attribute values in the subset of data and a union of a combined set of data attribute values.
  - 11. The method of claim 1 wherein generating the similarity matrix of data further comprises:
    - generating a plurality of compressed data representations via a plurality of hash functions for each column of data.
  - 12. The method of claim 1 further comprising:
    - determining a classification type associated with the subset of data.
  - 13. The method of claim 1 wherein generating the similarity matrix of data comprises:
    - determining a classification type associated with the subset of data.
  - 14. The method of claim 13 wherein determining the classification type comprises:
    - receiving data specifying the classification type for the subset of data.
  - 15. The method of claim 1 wherein identifying the subset of the plurality of similarity matrices further comprises:
    - identifying a ratio between a number of matched hash-derived attributes and a combined number of hash-derived attributes.
  - 16. The method of claim 1 further comprising:
    - presenting in a user interface data representations for a selection of the other datasets in the graph data arrangement with which to join via links to the tabular data arrangement.
  - 17. The method of claim 16 further comprising:
    - detecting one of the selections to form a subset of the links to join the tabular data arrangement and the at least one of the other datasets.
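
The dependent claims above mention compressed data representations produced by a plurality of hash functions, an approximated overlap based on an intersection-to-union ratio, a coverage ratio, and a ratio of matched hash-derived attributes. One well-known technique that fits this description is MinHash; the claims do not name it, so the sketch below is an assumption about how such ratios could be computed, not the patented implementation. All function names, the choice of BLAKE2b as the hash, and the 128-slot signature size are hypothetical.

```python
import hashlib


def hash_value(value, seed):
    """Deterministic 64-bit hash of a value; each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        str(value).encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(values, num_hashes=128):
    """Compress a column into one minimum hash per hash function."""
    distinct = set(values)
    if not distinct:
        raise ValueError("cannot build a signature for an empty column")
    return [min(hash_value(v, seed) for v in distinct)
            for seed in range(num_hashes)]


def estimated_overlap(sig_a, sig_b):
    """Matched slots over compared slots: an estimate of intersection over union."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matched = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matched / len(sig_a)


def exact_overlap(a, b):
    """Reference value: |A intersect B| / |A union B| on the raw values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def exact_coverage(a, b):
    """One reading of the coverage ratio: |A| / |A union B| on the raw values."""
    a, b = set(a), set(b)
    return len(a) / len(a | b)


# Hypothetical data: 800 column values, 800 dataset values, 400 shared.
column = [f"id-{i}" for i in range(0, 800)]
dataset = [f"id-{i}" for i in range(400, 1200)]

sig_col = minhash_signature(column)
sig_ds = minhash_signature(dataset)

print("estimated overlap:", estimated_overlap(sig_col, sig_ds))
print("exact overlap:    ", exact_overlap(column, dataset))
print("exact coverage:   ", exact_coverage(column, dataset))
```

The estimate converges on the exact overlap as the number of hash functions grows, while the signature stays a fixed size regardless of column length, which is the practical reason to compare compressed representations instead of full columns.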

18. An apparatus comprising:
- a memory including executable instructions; and
- a processor that, responsive to executing the instructions, is configured to:
  - identify subsets of data as columnar data associated with a data arrangement, the data arrangement being a tabular data arrangement including each of the subsets of data as a column of data;
  - generate a similarity matrix of data associated with a subset of data for each column of data, the similarity matrix of data being configured to determine a degree of similarity to other datasets with which to join;
  - access a plurality of similarity matrices each formed to identify an amount of relevant data associated with a dataset disposed in a graph data arrangement;
  - analyze the similarity matrix of data in view of the plurality of similarity matrices;
  - identify a subset of the plurality of similarity matrices to form a subset of relevant similarity matrices;
  - generate links among the column of data and a subset of the other datasets associated with the subset of relevant similarity matrices; and
  - form a subset of the links between the column of data and at least one of the other datasets.
- Dependent claims (19, 20):
  - 19. The apparatus of claim 18 wherein the processor is further configured to:
    - generate a plurality of compressed data representations for each column of data; and
    - compute the degree of similarity as a function of a ratio of compressed data representations to a combination of the compressed data representations.
  - 20. The apparatus of claim 18 wherein the processor is further configured to:
    - identify a ratio between a number of matched hash-derived attributes and a combined number of hash-derived attributes.
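
The identify, generate-links, and form-links steps recited for the processor can be pictured as a small filtering pipeline. The sketch below assumes signatures have already been computed and stored for each graph dataset; the `Link` record, the 0.5 relevance threshold, the tiny hand-written signatures, and the idea of passing in a set of selected dataset names are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    column: str
    dataset: str
    score: float


def matched_ratio(sig_a, sig_b):
    """Number of matching signature slots over the number of slots compared."""
    if len(sig_a) != len(sig_b) or not sig_a:
        raise ValueError("signatures must be non-empty and of equal length")
    matched = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matched / len(sig_a)


def propose_links(column, column_sig, stored_sigs, threshold=0.5):
    """Score every stored signature and keep those at or above the threshold, best first."""
    links = [Link(column, name, matched_ratio(column_sig, sig))
             for name, sig in stored_sigs.items()]
    relevant = [link for link in links if link.score >= threshold]
    return sorted(relevant, key=lambda link: link.score, reverse=True)


def form_links(proposed, selected):
    """Keep only the proposed links whose dataset was selected (for example, by a user)."""
    return [link for link in proposed if link.dataset in selected]


# Hypothetical precomputed signatures, eight slots each.
column_sig = [3, 8, 1, 9, 4, 7, 2, 6]
stored = {
    "graph_dataset_a": [3, 8, 1, 9, 4, 7, 0, 5],  # 6 of 8 slots match
    "graph_dataset_b": [3, 0, 1, 0, 4, 0, 2, 0],  # 4 of 8 slots match
    "graph_dataset_c": [0, 0, 0, 0, 0, 0, 0, 6],  # 1 of 8 slots match
}

proposed = propose_links("zip_code", column_sig, stored)
links = form_links(proposed, {"graph_dataset_a"})
print(proposed)
print(links)
```

Separating the proposal step from the final selection mirrors the claim structure, in which links are first generated for the relevant candidates and only a subset of them is then formed.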

Current Assignee
Data.World, Inc.
Original Assignee
Data.World, Inc.
Inventors
Griffith, David Lee

Granted Patent

US 11,068,453 B2
CPC Class Codes

G06F 16/221   Column-oriented storage; Ma...

G06F 16/2456   Join operations

G06F 16/24578   using ranking

G06F 16/252   between a Database Manageme...

G06F 16/256   in federated or virtual dat...

G06F 16/258   Data format conversion from...

G06F 16/285   Clustering or classification

G06F 16/9024   Graphs; Linked lists G06F16...
