Automated database schema annotation

US 10,452,661 B2
Filed: 06/18/2015
Issued: 10/22/2019
Est. Priority Date: 06/18/2015
Status: Active Grant

Abstract

Techniques and constructs improve the annotation of target columns of a target database by automatically annotating the target columns using sources. The techniques include calculating a similarity score between a target column and columns extracted from a table that is included in a source. The similarity score is based at least in part on two similarities: one between a value in the target column of the target database and a column value of the extracted column from the table, and one between an identity of the target column of the target database and the column identities of the extracted columns from the table. In some examples, the techniques calculate similarity scores for one or more extracted columns and annotate the target column based on those scores.
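
The scoring described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the function names, the use of set overlap for the value-related score, the use of Jaccard similarity over column names for the context-related score, the linear weighting (`value_weight`), and the acceptance `threshold`. The patent only requires that the similarity score be a numerical combination of the two scores.

```python
def value_score(target_values, source_values):
    """Fraction of distinct target values that also occur in the source column
    (an assumed measure of value similarity)."""
    target, source = set(target_values), set(source_values)
    if not target:
        return 0.0
    return len(target & source) / len(target)


def context_score(target_headers, source_headers):
    """Jaccard similarity between the two tables' column-name sets
    (an assumed measure of context similarity)."""
    a = {h.strip().lower() for h in target_headers}
    b = {h.strip().lower() for h in source_headers}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_score(value, context, value_weight=0.7):
    """Numerical combination of the two scores; the linear form and the
    default weight are hypothetical."""
    return value_weight * value + (1.0 - value_weight) * context


def annotate(target_values, target_headers,
             source_values, source_headers, source_name, threshold=0.5):
    """Return the source column's identity as the annotation if the combined
    score reaches the (hypothetical) threshold, otherwise None."""
    score = similarity_score(
        value_score(target_values, source_values),
        context_score(target_headers, source_headers),
    )
    return source_name if score >= threshold else None


# Hypothetical data: a cryptically named target column and a source table
# whose "State" column shares most of its values.
label = annotate(
    target_values=["WA", "OR", "CA", "NY"],
    target_headers=["col1", "city", "zip"],
    source_values=["WA", "CA", "TX", "NY"],
    source_headers=["City", "State", "Zip"],
    source_name="State",
)
print(label)
```

Keeping the two scores separate until the final combination makes it easy to swap in a different value or context measure without touching the annotation step.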

20 Claims

1. A device comprising:
- a processor; and
  
  a computer-readable medium including modules, the modules, when executed by the processor, configure the device to generate annotations, the modules comprising:
  
  a column discovery module configured to retrieve a table; and
  
  a column annotation module configured to annotate a target column of a target table from a target database by:
  
  calculating a value-related score between the target column of the target table and a column of the table, the value-related score based at least in part on similarities between one or more values in the target column of the target table and one or more column values extracted from the column of the table, the value-related score being a numerical value-related score;
  
  calculating a context-related score between the target column of the target table and the column of the table, the context-related score based at least in part on similarities between identities of one or more columns of the target table and column identities of one or more columns of the table, the context-related score being a numerical context-related score;
  
  calculating a similarity score based on a numerical value comprising a numerical combination of the value-related score and the context-related score, the similarity score being a numerical similarity score; and
  
  annotating, based at least in part on the similarity score, the target column of the target table using a column identity of the column of the table.
- Dependent claims (2, 3, 4, 5, 6, 7, 8):
  - 2. The device of claim 1, wherein the value-related score is a weighted combination of a Jaccard Containment of the column in the target column and a Jaccard Containment of the target column in the column.
  - 3. The device of claim 1, wherein the column annotation module is further configured to:
    - determine a first value of a first annotation based at least in part on the similarity;
      
      determine a second value of a second annotation based at least in part on another similarity between the target column of the target table and another column of the table, the another similarity based at least in part on similarities between the one or more values in the target column of the target table and one or more column values extracted from the another column of the table;
      
      rank the first annotation and the second annotation based at least in part on the first value and the second value; and
      
      annotate the target column based at least in part on the ranking of the first annotation and the second annotation.
  - 4. The device of claim 1, wherein retrieving the table comprises the discovery module accessing a database of sources to discover a source, the source including the table.
  - 5. The device of claim 4, wherein the modules further comprise an extraction module configured to identify the table for extraction from the source based at least in part on an identification of a header row included in the source.
  - 6. The device of claim 4, wherein the modules further comprise an extraction module configured to identify the table for extraction from the source based at least in part on at least one of an identification of a border around a group of cells included in the source, or a group of cells included in the source that is surrounded on at least two sides by blank or empty cells.
  - 7. The device of claim 1, wherein the modules further comprise an indexing module configured to generate an index for the table, the index comprising column values for the table mapped to individual column identities of columns included in the table.
  - 8. The device of claim 7, wherein the index further comprises the individual column identities of the columns included in the table mapped to the column values for the table.
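
The weighted combination recited in claim 2 can be sketched as follows. The sketch assumes the common definition of Jaccard containment of a set A in a set B, |A ∩ B| / |A|; the function names and the default `weight` of 0.5 are hypothetical, since the claim does not specify how the two containments are weighted.

```python
def jaccard_containment(a, b):
    """Containment of a in b: |a ∩ b| / |a| (0.0 when a is empty)."""
    a, b = set(a), set(b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


def value_related_score(target_values, column_values, weight=0.5):
    """Weighted combination of the two directional containments.
    `weight` applies to the containment of the column in the target column;
    its default is an assumption, not a value from the patent."""
    column_in_target = jaccard_containment(column_values, target_values)
    target_in_column = jaccard_containment(target_values, column_values)
    return weight * column_in_target + (1.0 - weight) * target_in_column


# Hypothetical example: a small target column fully covered by a larger
# source column.
target = ["WA", "OR", "CA"]
column = ["WA", "OR", "CA", "NY", "TX", "FL"]
print(jaccard_containment(column, target))  # 3 of 6 column values are in the target
print(jaccard_containment(target, column))  # 3 of 3 target values are in the column
print(value_related_score(target, column))
```

Using both directions matters because containment is asymmetric: a short target column can be fully contained in a long source column even when the source column holds many values the target never uses.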

9. A processor implemented method comprising:
- retrieving a table, under control of one or more processors;
  
  calculating, using the one or more processors, a value-related score between a target column of a target table from a target database and a column of the table, the value-related score based at least in part on similarities between one or more values in the target column of the target table and one or more column values extracted from the column of the table, the value-related score being a numerical value-related score;
  
  calculating, using the one or more processors, a context-related score between the target column of the target table and the column of the table, the context-related score based at least in part on similarities between identities of one or more columns of the target table and column identities of one or more columns of the table, the context-related score being a numerical context-related score;
  
  calculating, using the one or more processors, a similarity score based on a numerical value comprising a numerical combination of the value-related score and the context-related score, the similarity score being a numerical similarity score;
  
  annotating, using the one or more processors and based at least in part on the similarity score, the target column of the target table using a column identity of the column of the table; and
  
  storing, using the one or more processors, the annotated target column.
- Dependent claims (10, 11, 12, 13, 14, 15, 16):
  - 10. The processor implemented method of claim 9, wherein the value-related score is a weighted combination of a Jaccard Containment of the column in the target column and a Jaccard Containment of the target column in the column.
  - 11. The processor implemented method of claim 9, further comprising:
    - determining a first value of a first annotation based at least in part on the similarity;
      
      determining a second value of a second annotation based at least in part on another similarity between the target column of the target table and another column of the table, the another similarity based at least in part on similarities between the one or more values in the target column of the target table and one or more column values extracted from the another column of the table;
      
      ranking the first annotation and the second annotation based at least in part on the first value and the second value; and
      
      annotating the target column based at least in part on the ranking of the first annotation and the second annotation.
  - 12. The processor implemented method of claim 9, wherein the table is included in a spreadsheet, and the method further comprising discovering the spreadsheet.
  - 13. The processor implemented method of claim 12, further comprising extracting the table from the spreadsheet based at least in part on an identification of a header row within the spreadsheet.
  - 14. The processor implemented method of claim 12, further comprising extracting the table from the spreadsheet based at least in part on at least one of an identification of a border around a group of cells included in the spreadsheet, or a group of cells included in the spreadsheet that is surrounded on at least two sides by blank or empty cells.
  - 15. The processor implemented method of claim 9, further comprising generating an index for the table, the index comprising column values for the table mapped to individual column identities of columns included in the table.
  - 16. The processor implemented method of claim 15, wherein the index further comprises the individual column identities of the columns included in the table mapped to the column values for the table.
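
The two-way index of claims 15 and 16 (and of device claims 7 and 8) might look like the following sketch. The dictionary-of-sets representation, the function names, and the sample table are assumptions for illustration; the claims only state that column values are mapped to column identities and column identities to column values.

```python
from collections import defaultdict


def build_index(table):
    """Build both mappings for one table.

    `table` is assumed to be a dict of column identity -> list of cell values.
    Returns (value_to_columns, column_to_values), each a dict of sets.
    """
    value_to_columns = defaultdict(set)
    column_to_values = defaultdict(set)
    for column_id, values in table.items():
        for value in values:
            value_to_columns[value].add(column_id)
            column_to_values[column_id].add(value)
    return dict(value_to_columns), dict(column_to_values)


def candidate_columns(target_values, value_to_columns):
    """Count how many distinct target values hit each indexed column, so that
    only columns sharing at least one value need to be scored in full."""
    hits = defaultdict(int)
    for value in set(target_values):
        for column_id in value_to_columns.get(value, ()):
            hits[column_id] += 1
    return dict(hits)


# Hypothetical source table.
table = {"State": ["WA", "OR"], "Capital": ["Olympia", "Salem"]}
value_to_columns, column_to_values = build_index(table)
print(candidate_columns(["WA", "Salem", "WA"], value_to_columns))
```

The value-to-column direction supports fast candidate lookup, while the column-to-value direction supplies the full value set needed to compute a score for each candidate.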

17. A non-transitory computer storage medium having computer-executable instructions to program a computer to perform operations comprising:
- receiving a table;
  
  identifying a column included in the table;
  
  identifying a target column in a target table from a target database;
  
  calculating a value-related score between the target column of the target table and the column of the table, the value-related score based at least in part on similarities between one or more values in the target column of the target table and one or more column values extracted from the column of the table, the value-related score being a numerical value-related score;
  
  calculating a context-related score between the target column of the target table and the column of the table, the context-related score based at least in part on similarities between identities of one or more columns of the target table and column identities of one or more columns of the table, the context-related score being a numerical context-related score;
  
  calculating a similarity score based on a numerical value comprising a numerical combination of the value-related score and the context-related score, the similarity score being a numerical similarity score; and
  
  annotating, based at least in part on the similarity score, the target column included in the target table using an identity of the column included in the table.
- Dependent claims (18, 19, 20):
  - 18. The non-transitory computer storage medium of claim 17, the operations further comprising ranking the identity of the column included in the table, the ranking based at least in part on at least one of:
    - a similarity between one or more values in the target column of the target table and one or more column values of the column included in the table; and
      
      a similarity between identities of one or more columns of the target table that contains the target column and the identities of one or more columns included in the table.
  - 19. The non-transitory computer storage medium of claim 17, wherein the table includes a first table and the column includes a first column, the operation further comprising:
    - receiving a second table;
      
      identifying a second column included in the second table; and
      
      annotating the target column included in the target table using an identity of the second column included in the second table.
  - 20. The non-transitory computer storage medium of claim 19, the operations further comprising:
    - determining a first similarity score based at least in part on similarities between the first column and the target column;
      
      determining a second similarity score based at least in part on similarities between the second column and the target column; and
      
      ranking the identity of the first column and the identity of the second column based at least in part on the first similarity score and the second similarity score, and wherein annotating the target column using the identity of the first column and the identity of the second column is based at least in part on the ranking.
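
The ranking step of claims 19 and 20 can be sketched as below, assuming the similarity scores for each candidate column have already been computed. The function name, the rule of keeping the best score per identity when several tables propose the same identity, the alphabetical tie-break, and `top_k` are all hypothetical choices, not details from the patent.

```python
def rank_annotations(scored_candidates, top_k=None):
    """Rank candidate column identities by similarity score, highest first.

    `scored_candidates` is an iterable of (column_identity, similarity_score)
    pairs, possibly drawn from several tables. When the same identity appears
    more than once, only its best score is kept (an assumed policy).
    """
    best = {}
    for identity, score in scored_candidates:
        if identity not in best or score > best[identity]:
            best[identity] = score
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked if top_k is None else ranked[:top_k]


# Hypothetical scores: "State" is proposed by two different tables.
candidates = [("State", 0.8), ("Province", 0.3), ("State", 0.6)]
ranking = rank_annotations(candidates)
print(ranking)
print(rank_annotations(candidates, top_k=1))
```

Returning the full ranked list, not just the winner, allows the target column to be annotated with several identities in order of confidence.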

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Inventors
Bernstein, Philip A.; He, Yeye; Cortez Custodio Vilarinho, Eli; Novik, Lev
Primary Examiner(s)
Pyo, Monica M

Application Number

US14/743,510
Publication Number

US 20160371275A1
Time in Patent Office

1,587 Days
Field of Search

707/741, 707/748, 707/749
US Class Current
CPC Class Codes

G06F 16/20   of structured data, e.g. re...

G06F 16/24573   using data annotations, e.g...

G06F 16/24578   using ranking

G06F 40/169   Annotation, e.g. comment da...

G06F 40/18   of spreadsheets form-fillin...
