Count CLOB Duplicates in a large Oracle Table

Question

I have an Oracle database table LOG_MESSAGES with a CLOB column called MESSAGE. Some of the rows contain the same MESSAGE.

For each MESSAGE that has at least one duplicate, I'd like to know the number of duplicates. Quite a number of these CLOBs are huge (> 100 kB), so converting to VARCHAR2 is out of the question. Since many traditional methods such as GROUP BY do not work with CLOBs, could someone please enlighten me?

For information, the table is very large (around 1 TB). So an optimised method would very much be appreciated.

Thank you in advance for your help.

Answer 1:

I think this question gets asked a lot but unfortunately there doesn't seem to be a perfect way of doing this. There are ways that work just fine though.

Search for "clob group by" or "clob distinct" and you will see several hits just on this website.

One way would be to write a PL/SQL script that does a DBMS_LOB.COMPARE between all CLOBs in the table, but its cost would probably be on the order of O(n^2), which would make it really slow for your purpose.
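To make the quadratic cost concrete, here is a rough sketch of the pairwise comparison, written as a plain SQL self-join rather than a PL/SQL loop. It assumes a hypothetical primary-key column named ID (not mentioned in the question) so that each pair is checked only once, and it lists duplicate pairs rather than counts.

```sql
-- Sketch only: every row is compared with every later row, hence O(n^2).
-- ID is a hypothetical primary-key column; adjust to the real key.
-- DBMS_LOB.COMPARE returns 0 when the two CLOBs are identical.
SELECT a.id AS first_id,
       b.id AS duplicate_id
FROM   log_messages a
JOIN   log_messages b
  ON   a.id < b.id
       -- cheap length check first, so most pairs skip the full comparison
 AND   DBMS_LOB.GETLENGTH(a.message) = DBMS_LOB.GETLENGTH(b.message)
 AND   DBMS_LOB.COMPARE(a.message, b.message) = 0;
```

On a table of around 1 TB this is unlikely to finish in reasonable time; it is shown mainly as a baseline for comparison.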

Another well-accepted approach is to take a hash of the CLOB using DBMS_CRYPTO (I think it allows hashing on CLOBs) and then GROUP BY the hash values. A hash collision is possible, but the probability is minute. I read a figure of around 2^80 somewhere (the number might be wrong, though). This won't be as slow as the first approach, but calculating a hash would also take non-negligible time.
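A minimal sketch of the hash-and-group query, using the table and column names from the question. It assumes the session has EXECUTE privilege on DBMS_CRYPTO and picks SHA-1 as the hash; the literal 3 is the value of DBMS_CRYPTO.HASH_SH1, written as a number because package constants cannot be referenced from plain SQL. NULL messages are skipped as a precaution.

```sql
-- Sketch: count rows per MESSAGE hash and keep only hashes seen more than once.
-- Assumes EXECUTE on DBMS_CRYPTO; 3 = DBMS_CRYPTO.HASH_SH1.
SELECT DBMS_CRYPTO.HASH(message, 3) AS message_hash,
       COUNT(*)                     AS occurrences
FROM   log_messages
WHERE  message IS NOT NULL
GROUP  BY DBMS_CRYPTO.HASH(message, 3)
HAVING COUNT(*) > 1
ORDER  BY occurrences DESC;
```

Each row returned stands for one group of identical messages (barring a collision), and occurrences minus one is the number of extra copies.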

I would suggest trying the hash approach first and, if that seems too slow, looking for alternatives.

Answer 2:

dbms_crypto.hash can accept a CLOB and compute a hash. You can then group by the hash. Of course, computing a hash on a large CLOB is going to be an expensive process in terms of CPU consumption. If you have a large number of rows, it may take quite some time. You may want to compute and store the hash in one step and do a GROUP BY in a separate step.
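The two-step variant might look roughly like the following. The column name MESSAGE_HASH is hypothetical, SHA-1 is an assumed choice of hash (3 is the value of DBMS_CRYPTO.HASH_SH1), and the single UPDATE is only a sketch: on a table of this size it would probably need to be run in batches.

```sql
-- Step 1a: add a column to hold the hash.
-- MESSAGE_HASH is a hypothetical name; 20 bytes fits a SHA-1 digest.
ALTER TABLE log_messages ADD (message_hash RAW(20));

-- Step 1b: compute and store the hash once per row (the expensive part).
-- Assumes EXECUTE on DBMS_CRYPTO; 3 = DBMS_CRYPTO.HASH_SH1.
UPDATE log_messages
SET    message_hash = DBMS_CRYPTO.HASH(message, 3)
WHERE  message IS NOT NULL
AND    message_hash IS NULL;
COMMIT;

-- Step 2: group on the stored hash; the MESSAGE column is not referenced here.
SELECT message_hash,
       COUNT(*) AS occurrences
FROM   log_messages
WHERE  message_hash IS NOT NULL
GROUP  BY message_hash
HAVING COUNT(*) > 1;
```

Storing the hash means the CPU cost is paid once, and the GROUP BY can be rerun or refined without hashing the CLOBs again.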

Source: https://stackoverflow.com/questions/28907644/count-clob-duplicates-in-a-large-oracle-table

Tags

sql

Oracle

plsql

duplicates

clob