Detecting duplicate records in databases

US 20050262044A1
Filed: 07/14/2005
Published: 11/24/2005
Est. Priority Date: 06/28/2002
Status: Active Grant

First Claim


1. In a database having records stored on a medium, a method for identifying possible duplicate data records comprising:

a) providing multiple records in one or more tables that include multiple fields; and

b) identifying two or more records within a table as possible duplicates by measuring a co-occurrence of data in two or more hierarchically related fields of the table.
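
One way to picture step b) is the minimal sketch below. It is an illustration under stated assumptions, not the method of the patent: the field names (`state`, `city`), the Jaccard overlap used as the co-occurrence measure, and the 0.5 threshold are all hypothetical choices made for the example. The idea shown is that two values of a parent field are flagged as possible duplicates when they co-occur with largely the same values of a child field.

```python
from collections import defaultdict


def child_sets(records, parent_field, child_field):
    """Map each parent value to the set of child values it co-occurs with."""
    groups = defaultdict(set)
    for rec in records:
        groups[rec[parent_field]].add(rec[child_field])
    return dict(groups)


def co_occurrence(a, b):
    """Jaccard overlap of two child sets (an assumed, illustrative measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def possible_duplicates(records, parent_field, child_field, threshold=0.5):
    """Return (left, right, score) for parent values whose child sets overlap."""
    groups = child_sets(records, parent_field, child_field)
    values = sorted(groups)
    pairs = []
    for i, left in enumerate(values):
        for right in values[i + 1:]:
            score = co_occurrence(groups[left], groups[right])
            if score >= threshold:
                pairs.append((left, right, score))
    return pairs


# Hypothetical table with two hierarchically related fields: state -> city.
records = [
    {"state": "WA", "city": "Seattle"},
    {"state": "WA", "city": "Redmond"},
    {"state": "Washington", "city": "Seattle"},
    {"state": "Washington", "city": "Redmond"},
    {"state": "WI", "city": "Madison"},
]

print(possible_duplicates(records, "state", "city"))
```

In this toy data, "WA" and "Washington" share every city and are reported as a candidate pair, while "WI" shares none and is not. The pairwise loop is quadratic in the number of distinct parent values, so it is suitable only as an illustration.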


Abstract

The invention concerns the detection of duplicate tuples in a database. Previous domain-independent detection of duplicate tuples relied on standard similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such prior-art approaches produce large numbers of false positives when they are used to identify domain-specific abbreviations and conventions. In accordance with the invention, a process for duplicate detection is implemented based on interpreting records from multiple dimensional tables in a data warehouse, which are associated with hierarchies specified through key-foreign key relationships in a snowflake schema. The invention exploits the extra knowledge available from the table hierarchy to develop a high-quality, scalable duplicate detection process.
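
The contrast drawn in the abstract can be sketched roughly as follows. This is a hedged illustration only: the two dimension tables, their contents, the use of Python's `difflib.SequenceMatcher` as a stand-in for a standard textual similarity function, and the Jaccard overlap of child names are assumptions made for the example, not details taken from the patent.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical dimension tables in a snowflake-style layout:
# each city row references a state row through a foreign key.
states = {1: "WA", 2: "Washington", 3: "WI"}
cities = [
    ("Seattle", 1), ("Redmond", 1),
    ("Seattle", 2), ("Redmond", 2),
    ("Madison", 3), ("Milwaukee", 3),
]


def text_similarity(a, b):
    """Textual similarity only (stand-in for edit distance or cosine metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def children_by_parent(child_rows):
    """Group child names by the parent key they reference."""
    children = defaultdict(set)
    for name, parent_id in child_rows:
        children[parent_id].add(name)
    return children


def child_overlap(children, id_a, id_b):
    """Jaccard overlap of the child sets of two parent rows (assumed measure)."""
    a, b = children[id_a], children[id_b]
    union = a | b
    return len(a & b) / len(union) if union else 0.0


children = children_by_parent(cities)
for id_a, id_b in [(1, 3), (1, 2)]:
    print(
        states[id_a],
        states[id_b],
        round(text_similarity(states[id_a], states[id_b]), 2),
        child_overlap(children, id_a, id_b),
    )
```

With this data, the textual score rates "WA" closer to "WI" than to "Washington", which would be a false positive, while the overlap of child rows points to "WA" and "Washington" instead. That is the kind of extra knowledge from the table hierarchy the abstract refers to.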

Current Assignee
Microsoft Technology Licensing LLC (Microsoft Corporation)
Original Assignee
Microsoft Corporation
Inventors
Chaudhuri, Surajit; Ganti, Venkatesh; Ananthakrishna, Rohit

Granted Patent

US 7,685,090 B2
US Class Current

1/1
CPC Class Codes

G06F 16/215   Improving data quality; Dat...

Y10S 707/99931   Database or file accessing

Y10S 707/99942   Manipulating data structure...
