bigdata

What is the basic difference between JobConf and Job?

无人久伴 submitted on 2019-11-30 11:37:58
Hi, I wanted to know the basic difference between JobConf and Job objects. Currently I am submitting my job like this: JobClient.runJob(jobconf); I saw another way of submitting jobs, like this: Configuration conf = getConf(); Job job = new Job(conf, "secondary sort"); job.waitForCompletion(true); return 0; And how can I specify the sort comparator class for the job using JobConf? Can anyone explain this concept to me? In short: JobConf and everything else in the org.apache.hadoop.mapred package is part of the old API used to write Hadoop jobs; Job and everything in the org.apache.hadoop.mapreduce
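A rough sketch of how the comparator settings look in both APIs is below. The driver, comparator, and grouping-comparator class names (MyDriver, MyKeyComparator, MyGroupComparator) are placeholders, and the old-API and new-API mapper/reducer classes would extend different base classes:

// Old API (org.apache.hadoop.mapred) - everything is configured on the JobConf:
JobConf jobConf = new JobConf(MyDriver.class);
jobConf.setJobName("secondary sort");
jobConf.setOutputKeyComparatorClass(MyKeyComparator.class);        // controls sort order of keys
jobConf.setOutputValueGroupingComparator(MyGroupComparator.class); // controls grouping at the reducer
JobClient.runJob(jobConf);

// New API (org.apache.hadoop.mapreduce) - the same settings live on the Job object:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "secondary sort");
job.setJarByClass(MyDriver.class);
job.setSortComparatorClass(MyKeyComparator.class);       // controls sort order of keys
job.setGroupingComparatorClass(MyGroupComparator.class); // controls grouping at the reducer
System.exit(job.waitForCompletion(true) ? 0 : 1);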

Export a large amount of data from Cassandra to CSV

只愿长相守 submitted on 2019-11-30 11:03:24
I'm using Cassandra 2.0.9 to store quite big amounts of data, let's say 100 GB, in one column family. I would like to export this data to CSV in a fast way. I tried:
sstable2json - it produces quite big JSON files which are hard to parse, because the tool puts data in one row and uses a complicated schema (e.g. a 300 MB data file = ~2 GB of JSON); it takes a lot of time to dump, and Cassandra likes to change source file names according to its internal mechanism
COPY - causes timeouts on quite fast EC2 instances for a big number of records
CAPTURE - like above, causes timeouts
reads with pagination - I used
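A minimal sketch of the paginated-read route, assuming a 2.x-era DataStax Java driver; the contact point, keyspace, table, and column names are placeholders, and the fetch size is just a starting value to tune:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CassandraCsvExport {
    public static void main(String[] args) throws Exception {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace");
             BufferedWriter out = Files.newBufferedWriter(Paths.get("export.csv"))) {
            // The driver pages through the result set transparently: only fetchSize
            // rows are held in memory at a time, which avoids loading 100 GB at once.
            Statement stmt = new SimpleStatement("SELECT key, col_a, col_b FROM my_table");
            stmt.setFetchSize(5000);
            ResultSet rs = session.execute(stmt);
            for (Row row : rs) {
                out.write(row.getString("key") + "," + row.getString("col_a") + "," + row.getLong("col_b"));
                out.newLine();
            }
        }
    }
}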

Spark RDDs - how do they work

时光怂恿深爱的人放手 submitted on 2019-11-30 11:00:47
I have a small Scala program that runs fine on a single node. However, I am scaling it out so it runs on multiple nodes. This is my first such attempt. I am just trying to understand how RDDs work in Spark, so this question is based around theory and may not be 100% correct. Let's say I create an RDD: val rdd = sc.textFile(file) Now once I've done that, does that mean that the file at file is now partitioned across the nodes (assuming all nodes have access to the file path)? Secondly, I want to count the number of objects in the RDD (simple enough); however, I need to use that number in a
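The question's snippet is Scala; the sketch below expresses the same idea with Spark's Java API to separate the lazily partitioned RDD, the count() action (which brings a single number back to the driver), and broadcasting that number so later transformations on the executors can use it. The input path and the per-line computation are placeholders.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class RddCountExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-count-example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // textFile only records *how* to read the file; the data is split into
        // partitions and read on the executors when an action runs.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");

        // count() is an action: it runs the job and returns one number to the driver.
        long total = lines.count();

        // Broadcast the driver-side number so every executor can read it cheaply
        // inside later transformations.
        Broadcast<Long> totalBc = sc.broadcast(total);
        JavaRDD<Double> fractions = lines.map(line -> (double) line.length() / totalBc.value());
        fractions.take(5).forEach(System.out::println);

        sc.stop();
    }
}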

Is there something like Redis DB, but not limited by RAM size? [closed]

和自甴很熟 submitted on 2019-11-30 10:16:25
Question (closed as off-topic on Stack Overflow 6 years ago): I'm looking for a database matching these criteria:
May be non-persistent
Almost all keys of the DB need to be updated once every 3-6 hours (100M+ keys with a total size of 100 GB)
Ability to quickly select data by key (or primary key)
This needs to be a DBMS (so LevelDB doesn't fit)
When data is written, the DB cluster must

Generating a very large matrix of string combinations using combn() and bigmemory package

主宰稳场 submitted on 2019-11-30 07:14:55
Question: I have a vector x of 1,344 unique strings. I want to generate a matrix that gives me all possible groups of three values, regardless of order, and export that to a CSV. I'm running R on EC2 on an m1.large instance with 64-bit Ubuntu. When using combn(x, 3) I get an out-of-memory error: Error: cannot allocate vector of size 9.0 Gb. The size of the resulting matrix is C(1344,3) = 403,716,544 rows and three columns - which is the transpose of the result of the combn() function. I thought of using the
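The question itself is in R; the sketch below illustrates the general way around the memory limit in Java: stream each 3-combination straight to the CSV file instead of materializing the 403,716,544-row matrix. The input strings and output path are placeholders.

import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TripleCombinations {
    public static void main(String[] args) throws Exception {
        // Stand-in for the 1,344 unique strings from the question.
        String[] x = {"a", "b", "c", "d", "e"};
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("triples.csv"))) {
            // Each combination is written immediately, so memory use stays constant
            // no matter how large C(n, 3) gets.
            for (int i = 0; i < x.length; i++) {
                for (int j = i + 1; j < x.length; j++) {
                    for (int k = j + 1; k < x.length; k++) {
                        out.write(x[i] + "," + x[j] + "," + x[k]);
                        out.newLine();
                    }
                }
            }
        }
    }
}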

What is the actual difference between Data Warehouse & Big Data?

◇◆丶佛笑我妖孽 submitted on 2019-11-30 06:50:40
I know what a Data Warehouse is and what Big Data is. But I am confused about Data Warehouse vs Big Data: are they the same thing with different names, or are they different (conceptually and physically)? I know that this is an older thread, but there have been some developments in the last year or so. Comparing the data warehouse to Hadoop is like comparing apples to oranges. The data warehouse is a concept: clean, integrated data of high quality. I don't think the need for a data warehouse will go away anytime soon. Hadoop, on the other hand, is a technology. It is a distributed compute framework to process large

Inserting a big array of objects in MongoDB from Node.js

谁说我不能喝 submitted on 2019-11-30 05:31:05
Question: I need to insert a big array of objects (about 1.5-2 million) into MongoDB from Node.js. How can I improve my inserting? This is my code:

var sizeOfArray = arrayOfObjects.length; // about 1.5-2 million entries
for (var i = 0; i < sizeOfArray; ++i) {
    newKey = {
        field_1: arrayOfObjects[i][1],
        field_2: arrayOfObjects[i][2],
        field_3: arrayOfObjects[i][3]
    };
    collection.insert(newKey, function(err, data) {
        if (err) {
            log.error('Error insert: ' + err);
        }
    });
}

Answer 1: You can use bulk inserts. There are two types of bulk
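The question is Node.js; as a sketch of the same batching idea with the 3.x-era MongoDB Java driver (host, database, collection, field names, and the generated data are all placeholders), grouping documents and sending each group with one unordered insertMany avoids a round trip per document:

import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.InsertManyOptions;
import org.bson.Document;
import java.util.ArrayList;
import java.util.List;

public class BulkInsertExample {
    public static void main(String[] args) {
        try (MongoClient client = new MongoClient("localhost", 27017)) {
            MongoCollection<Document> collection =
                client.getDatabase("mydb").getCollection("mycollection");

            List<Document> batch = new ArrayList<>();
            for (int i = 0; i < 2_000_000; i++) {   // stand-in for the source array
                batch.add(new Document("field_1", i).append("field_2", "v" + i).append("field_3", i * 2));
                if (batch.size() == 10_000) {       // send in chunks, not one document at a time
                    collection.insertMany(batch, new InsertManyOptions().ordered(false));
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                collection.insertMany(batch, new InsertManyOptions().ordered(false));
            }
        }
    }
}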

How to view an Apache Parquet file in Windows?

限于喜欢 submitted on 2019-11-30 04:59:36
I couldn't find any plain-English explanations regarding Apache Parquet files, such as: What are they? Do I need Hadoop or HDFS to view/create/store them? How can I create Parquet files? How can I view Parquet files? Any help regarding these questions is appreciated. What is Apache Parquet? Apache Parquet is a binary file format that stores data in a columnar fashion. Data inside a Parquet file is similar to an RDBMS-style table where you have columns and rows. But instead of accessing the data one row at a time, you typically access it one column at a time. Apache Parquet is one of the modern
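To make the "how can I view one" part concrete, here is a small sketch that reads a local Parquet file with the parquet-avro library; the file path is a placeholder, no Hadoop cluster or HDFS is needed, although the Hadoop client libraries must be on the classpath and on Windows they may additionally expect winutils.exe to be present:

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

public class ParquetPeek {
    public static void main(String[] args) throws Exception {
        // Read a local file record by record and print each row as an Avro record.
        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(new Path("C:/data/example.parquet")).build()) {
            GenericRecord record;
            while ((record = reader.read()) != null) {
                System.out.println(record);
            }
        }
    }
}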

What is the status on Neo4j's horizontal scalability project Rassilon?

倾然丶 夕夏残阳落幕 submitted on 2019-11-30 04:57:36
Just wondering if anyone has any information on the status of project Rassilon, Neo4j's side project which focuses on improving Neo4j's horizontal scalability? It was first announced in January 2013 here. I'm particularly interested in knowing more about when the graph size limitation will be removed and when sharding across clusters will become available. Philip Rathle: The node & relationship limits are going away in 2.1, which is the next release after 2.0 (which now has a release candidate). Rassilon is definitely still in the mix. That said, that work is not taking precedence over things