Primary keys are essential to relational databases: they ensure that every row can be identified by a unique key. The snippets below show how to find the rows that violate that expectation.

Sometimes, when loading data into another table, you hit duplicates on columns you expected to be distinct.
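As a quick illustration, assuming Oracle syntax and a hypothetical target table TARGET1 with a primary key on (COL1, COL2), a straight insert from TABLE1 will fail with a unique constraint violation if the source rows aren't unique on those two columns:

-- Hypothetical target table with a two-column primary key
CREATE TABLE TARGET1 (
    COL1 NUMBER,
    COL2 NUMBER,
    COL3 VARCHAR2(100),
    CONSTRAINT TARGET1_PK PRIMARY KEY (COL1, COL2)
);

-- Fails with a unique constraint violation if TABLE1 has
-- more than one row with the same (COL1, COL2)
INSERT INTO TARGET1 (COL1, COL2, COL3)
SELECT COL1, COL2, COL3
  FROM TABLE1;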

Finding duplicates in a table

You might have a table in your database that doesn't have any unique constraints. Let's assume the table is called TABLE1 and we want to check for duplicates on COL1 and COL2. The following query returns the rows that are duplicated on those columns.

SELECT *
  FROM (
      SELECT
        T1.*,
        -- count how many rows share this (COL1, COL2) pair
        COUNT(*) OVER (PARTITION BY COL1, COL2) ROW_COUNT
      FROM TABLE1 T1
  )
 WHERE ROW_COUNT > 1  -- keep only rows that occur more than once

How this works

This query uses COUNT as a window (analytic) function, counting the rows within each group defined by the PARTITION BY clause. That count is attached to every row of TABLE1 as ROW_COUNT, and the outer query keeps only the rows where ROW_COUNT is greater than 1, i.e. the duplicates.
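To make that concrete, here's a made-up example. Suppose TABLE1 contains:

COL1  COL2  COL3
1     A     x
1     A     y
2     B     z

The inner query attaches the group count to every row:

COL1  COL2  COL3  ROW_COUNT
1     A     x     2
1     A     y     2
2     B     z     1

The outer WHERE ROW_COUNT > 1 then returns just the first two rows, which share the same (COL1, COL2).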

Using it on a query rather than a table

More often, though, you want to know whether a query you're running contains duplicate records that would make a load into a table with a primary key constraint fail.

In that case you can wrap the query in a WITH clause (also known as a subquery factoring clause) and then reference it in place of the table in the snippet above:

WITH MY_TABLE AS (
    -- the query you actually want to load
    SELECT COL1, COL2, COL3
    FROM TABLE1
    WHERE COL3 != 'SOMETHING'
)
SELECT *
  FROM (
      SELECT
        T1.*,
        COUNT(*) OVER (PARTITION BY COL1, COL2) ROW_COUNT
      FROM MY_TABLE T1
  )
 WHERE ROW_COUNT > 1
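
If that returns no rows, the wrapped query is free of duplicates on (COL1, COL2) and can be loaded into a table keyed on those columns. As a sketch, reusing the hypothetical TARGET1 table from earlier (Oracle allows a WITH clause inside the subquery of an INSERT):

INSERT INTO TARGET1 (COL1, COL2, COL3)
WITH MY_TABLE AS (
    SELECT COL1, COL2, COL3
    FROM TABLE1
    WHERE COL3 != 'SOMETHING'
)
SELECT COL1, COL2, COL3
  FROM MY_TABLE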