问题
I have an Oracle database table LOG_MESSAGES
with a CLOB column called MESSAGE
. Some of the rows contain the same MESSAGE
.
For each MESSAGE
which has at least a duplicate, I'd like to know the number of duplicates. Quite a number of these CLOBs are huge (> 100 kB), so converting to VARCHAR2
is out of question. Since many traditional methods such as GROUP BY
do not work with CLOB
, could someone please enlighten me?
For information, the table is very large (around 1 TB). So an optimised method would very much be appreciated.
Thank you in advance for your help.
回答1:
I think this question gets asked a lot but unfortunately there doesn't seem to be a perfect way of doing this. There are ways that work just fine though.
Search for "clob group by" or "clob distinct" and you will see several hits just on this website.
One way would be to write a PL/SQL script that does a DBMS_LOB.COMPARE
between all clobs in the table but the efficiency would probably be in the order of O(n^2) which would make it really slow for your purpose.
Another approach that is well accepted is to take a hash value of the clob using dbms_crypto (i think that allows hashing on clobs) and then group by on the hash values. There is a possibility of hash collision, but the probability is minute. I read somewhere around 2^80 (number might be wrong though). This won't be as slow as the first approach but calculating a hash would also take non-negligible time.
I would suggest try the hash approach first and if that seems too slow, look for alternatives.
回答2:
dbms_crypto.hash can accept a CLOB
and compute a hash. You can then group by the hash. Of course, computing a hash on a large CLOB
is going to be an expensive process in terms of CPU consumption. If you have a large number of rows, it may take quite some time. You may want to compute and store the hash in one step and do a GROUP BY
in a separate step.
来源:https://stackoverflow.com/questions/28907644/count-clob-duplicates-in-a-large-oracle-table