Monday, August 22, 2005

Magic Exceptions Into

One of our developers sent me a procedure code that removes duplicate records from a table. The code was something like this:

Cursor c1 is select * from tab1;
Open c1;
Loop
Fetch c1 into r1;
delete from tab1
where col1=r1.col1
and rownum=1;
commit;
end loop;

This was supposed to delete about 600M rows. Surely this is not the way to go for removing duplicate records from a table.

There are more than one way to remove duplicate records. Here is a simplified test case.

Create table t_dups as select * from all_objects;
insert into t_dups as select * from t_dups;

DELETE FROM t_dups a
WHERE ROWID<>(SELECT MAX(ROWID)
FROM t_dups b
WHERE a.object_id=b.object_id);

This does the work in a reasonable time if we have an index on object_id. But in this case we did not have an index on the related column. Without an index on object_id, here are the stats:

DELETE FROM t_dups a
WHERE ROWID<>(SELECT MAX(ROWID) FROM t_dups b WHERE a.object_id=b.object_id)



call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.01 0.06 0 0 0 0
Execute 1 342.16 335.17 0 2306810 7113 6396
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 2 342.17 335.23 0 2306810 7113 6396



Instead of creating an index on a big table just for the purpose of removing duplicates, we can use the exceptions into clause. The EXCEPTIONS table can be created using $ORACLE_HOME/rdbms/admin/utlexcpt.sql.

alter table t_dups add constraint cons01 unique(object_id) exceptions into exceptions;
*
ERROR at line 1:
ORA-02299: cannot validate (CONS01) - duplicate keys found

This puts the duplicate rows into the EXCEPTIONS table. The thing to be careful about is that it puts rowids of all the duplicates including the rows we want to keep. So, if you have two object_id’s that are the same, both are in EXCEPTIONS not just one of them.

As we have the rowids of the duplicate rows, we can remove one of each duplicate by using the following sql:

delete from t_dups
where rowid in (select minrid from (SELECT object_id,MIN(e.row_id) minrid
FROM EXCEPTIONS e,t_dups t

WHERE t.ROWID=e.row_id
GROUP BY object_id)
)



call count cpu elapsed disk query current rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 1 0.03 0.10 0 1 0 0
Execute 1 2.27 2.27 0 19429 7247 6396
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 2 2.30 2.38 0 19430 7247 6396


2.38 seconds compared to 335.23 seconds! So, using the exceptions into clause for removing duplicates is a good alternative when we do not have an index on the related columns.

5 comments:

  1. Hi!

    The below was used to delete duplicate records. Please look into it and let me know if this sql statement can be improved upon or written in a different manner.

    Thanks

    DPS


    delete trn_timechk where rowid in
    (select row_id from
    (
    select * from trn_timechk where rowid in
    (
    select rowt from
    (
    select t.emp_id, t.timechk, max(t.rowid) rowt, a.t from trn_timechk t,
    ( select emp_id, timechk, count(timechk) t
    from trn_timechk
    where timechk > '30-JUN-2006'
    group by emp_id, timechk
    )a
    where a.emp_id = t.emp_id
    and a.timechk = t.timechk
    and a.t > 1
    group by t.emp_id, t.timechk, a.t
    )a
    )
    )

    ReplyDelete
  2. interesting idea good catch.

    ReplyDelete
  3. delete from t_dups
    where rowid in (select minrid from (SELECT object_id,MIN(e.row_id) minrid
    FROM EXCEPTIONS e,t_dups t
    WHERE t.ROWID=e.row_id
    GROUP BY object_id)
    )


    This as you mentioned deletes ONE of each duplicate.

    What happens if --
    1. There are multiple duplicates rows for the same value of object_id
    2. There are duplicate rows with null object_id

    Would this be better --


    delete from t_dups
    where rowid not in (select minrid from (SELECT object_id,MIN(t.rowid) minrid
    FROM EXCEPTIONS e,t_dups t
    WHERE t.ROWID=e.row_id(+)
    GROUP BY object_id )
    );

    -- with due caution for the not in clause

    ReplyDelete
  4. Infact why even join with the exceptions table

    use --
    delete from t_dups
    where rowid not in (SELECT MIN(rowid)
    FROM t_dups
    GROUP BY object_id );

    ReplyDelete
  5. create table exceptions(row_id rowid,
    owner varchar2(30),
    table_name varchar2(30),
    constraint varchar2(30));

    ReplyDelete