Export large amounts of data from Cassandra to CSV


Question


I'm using Cassandra 2.0.9 to store fairly large amounts of data, let's say 100 GB, in one column family. I would like to export this data to CSV quickly. I tried:

  • sstable2json - it produces very large JSON files that are hard to parse, because the tool puts the data in one row and uses a complicated schema (e.g. a 300 MB data file becomes ~2 GB of JSON); it takes a long time to dump, and Cassandra likes to change source file names according to its internal mechanism
  • COPY - times out on fairly fast EC2 instances for large numbers of records
  • CAPTURE - same as above, causes timeouts
  • reads with pagination - I used a timeuuid for it, but it returns only about 1.5k records per second (a sketch of this pattern follows the list)
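For reference, manual pagination like that in the last bullet usually pages by partition-key token with the DataStax Java driver, roughly as sketched below. The keyspace, table (`events`), and columns are placeholder assumptions, not the asker's actual schema:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class TokenRangePager {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");   // placeholder keyspace

        // With Murmur3Partitioner, tokens span the full signed-long range.
        long lastToken = Long.MIN_VALUE;
        boolean firstPage = true;
        int pageSize = 1000;

        while (true) {
            // ">=" on the first page only, so rows at the starting token
            // are not skipped; assumes no partition spans a page boundary.
            String cql = "SELECT id, payload, token(id) FROM events "
                    + "WHERE token(id) " + (firstPage ? ">=" : ">") + " ? LIMIT " + pageSize;
            ResultSet rs = session.execute(cql, lastToken);

            int fetched = 0;
            for (Row row : rs) {
                lastToken = row.getLong(2);   // remember where this page ended
                // ... write the row out as CSV here ...
                fetched++;
            }
            if (fetched < pageSize) break;    // a short page means we are done
            firstPage = false;
        }
        cluster.close();
    }
}
```

One round trip per page is what caps throughput at a few thousand records per second, which matches the rate reported above.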

I use an Amazon EC2 instance with fast storage, 15 GB of RAM, and 4 cores.

Is there any better option for exporting gigabytes of data from Cassandra to CSV?


Answer 1:


Because using COPY is quite challenging when you are trying to export a table with millions of rows from Cassandra, what I did was create a simple tool to fetch the data chunk by chunk (paginated) from the Cassandra table and export it to CSV.

Look at my example solution using the Java driver from DataStax.
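The tool itself was linked from the original answer; as a rough idea of the approach, here is a minimal sketch using the DataStax Java driver's built-in paging via setFetchSize. The contact point, keyspace, and the `users` table with its columns are placeholder assumptions:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

import java.io.FileWriter;
import java.io.PrintWriter;

public class CassandraToCsv {
    public static void main(String[] args) throws Exception {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");   // placeholder keyspace

        Statement stmt = new SimpleStatement("SELECT id, name, email FROM users");
        stmt.setFetchSize(1000);   // pull 1000 rows per page instead of the whole table

        try (PrintWriter out = new PrintWriter(new FileWriter("users.csv"))) {
            out.println("id,name,email");
            ResultSet rs = session.execute(stmt);
            for (Row row : rs) {   // the driver fetches further pages transparently
                // naive CSV output; real values may need quoting/escaping
                out.println(row.getUUID("id") + "," + row.getString("name")
                        + "," + row.getString("email"));
            }
        }
        cluster.close();
    }
}
```

Because the server pages the result set, memory use stays bounded and timeouts apply per page rather than to the whole scan, which is what makes this workable where a single COPY is not.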




Answer 2:


I also gave up after trying different solutions, especially when the data is clustered and huge.
I used a Spark job to export all the data to a file (e.g. on S3), and it worked well.
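As a rough sketch of that approach (the keyspace, table, and S3 bucket names are placeholders), a Spark job using the DataStax spark-cassandra-connector can read the table as a DataFrame and write CSV directly:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CassandraSparkExport {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CassandraToCsv")
                .config("spark.cassandra.connection.host", "127.0.0.1")
                .getOrCreate();

        // Requires the spark-cassandra-connector package on the classpath.
        Dataset<Row> df = spark.read()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "my_keyspace")    // placeholder
                .option("table", "my_table")          // placeholder
                .load();

        // Spark scans token ranges in parallel across executors,
        // so the export scales out instead of timing out.
        df.write()
                .option("header", "true")
                .csv("s3a://my-bucket/cassandra-export/");   // placeholder bucket

        spark.stop();
    }
}
```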



Source: https://stackoverflow.com/questions/24896336/export-large-amount-of-data-from-cassandra-to-csv
