bigdata

Functions for creating and reshaping big data in R using the FF package

Submitted by 拈花ヽ惹草 on 2019-12-19 10:23:14

Question: I'm new to R and the ff package, and am trying to better understand how ff lets users work with large datasets (>4 GB). I have spent a considerable amount of time trawling the web for tutorials, but the ones I could find generally go over my head. I learn best by doing, so as an exercise I would like to know how to create a long-format time-series dataset, similar to R's built-in "Indometh" dataset, using arbitrary values. Then I would like to reshape it into wide format. Then I would …

Big Satellite Image Processing

Submitted by 我只是一个虾纸丫 on 2019-12-19 10:11:35

Question: I'm trying to run Mort Canty's (http://mcanty.homepage.t-online.de/) Python iMAD implementation on bitemporal RapidEye multispectral images, which basically computes the canonical correlation of the two images and then subtracts them. The problem I'm having is that the images are 5000 x 5000 pixels with 5 bands. If I try to run this on the whole image I get a memory error. Would something like PyTables help me with this? What Mort Canty's code tries to do is load the …

Subtract all pairs of values from two arrays

Submitted by 筅森魡賤 on 2019-12-19 09:43:58

Question: I have two vectors, v1 and v2. I'd like to subtract each value of v2 from each value of v1 and store the results in another vector. I also would like to work with very large vectors (e.g. of size 1e6), so I think I should be using numpy for performance. Up until now I have:

import numpy
v1 = numpy.array(numpy.random.uniform(-1, 1, size=1e2))
v2 = numpy.array(numpy.random.uniform(-1, 1, size=1e2))
vdiff = []
for value in v1:
    vdiff.extend([value - v2])

This creates a list with 100 entries, each …

What happens if an RDD can't fit into memory in Spark? [duplicate]

Submitted by 一笑奈何 on 2019-12-19 07:54:33

Question: This question already has answers here: What will spark do if I don't have enough memory? (3 answers). Closed 2 years ago. As far as I know, Spark tries to do all computation in memory unless you call persist with a disk storage option. If, however, we don't use persist at all, what does Spark do when an RDD doesn't fit in memory? What if we have very large data? How will Spark handle it without crashing? Answer 1: From the Apache Spark FAQ: Spark's operators spill data to disk if it does not fit in …
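
A minimal Java sketch of the persist route the question mentions, assuming Spark's Java API; the application name and input path are placeholders. StorageLevel.MEMORY_AND_DISK makes Spark keep partitions in memory when they fit and spill the rest to local disk instead of dropping them:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class SpillExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spill-example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // "input.txt" is a placeholder path; any large text file works.
        JavaRDD<String> lines = sc.textFile("input.txt");

        // MEMORY_AND_DISK keeps partitions in memory when they fit and
        // spills the ones that do not to local disk; the default
        // MEMORY_ONLY level would drop them and recompute on access.
        lines.persist(StorageLevel.MEMORY_AND_DISK());

        System.out.println(lines.count());
        sc.stop();
    }
}
```

This complements the FAQ excerpt above: even without persist, shuffle operators spill intermediate data to disk on their own, and cached partitions that do not fit under the default level are recomputed from lineage when needed.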

Is it a good practice to do sync database query or restful call in Kafka streams jobs?

Submitted by 心不动则不痛 on 2019-12-19 05:00:46

Question: I use Kafka Streams to process real-time data. Inside the Kafka Streams tasks I need to query MySQL and also call another RESTful service, and all of these operations are synchronous. I'm afraid the sync calls will reduce the processing capacity of the streams tasks. Is this a good practice, or is there a better way to do it? Answer 1: A better way to do it would be to stream your MySQL table(s) into Kafka and access the data there. This has the advantage of decoupling your streams app …
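
A rough Java sketch of the approach the answer suggests, assuming the MySQL table is mirrored into a compacted Kafka topic (for example by a CDC connector); the topic names, value types, and string join are made up for illustration. Each record is then enriched from the KTable's local state store instead of a synchronous MySQL or REST call per record:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class EnrichmentTopology {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Hypothetical topics: "orders" carries the real-time events and
        // "customers" is the changelog of the MySQL table. Default String
        // serdes are assumed in the application config.
        KStream<String, String> orders = builder.stream("orders");
        KTable<String, String> customers = builder.table("customers");

        // The join reads from the local state store that materializes the
        // KTable, so no blocking network call is made per record.
        orders.join(customers, (order, customer) -> order + "|" + customer)
              .to("enriched-orders");

        return builder;
    }
}
```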

Scalable way to access every element of ConcurrentHashMap<Element, Boolean> exactly once

Submitted by こ雲淡風輕ζ on 2019-12-19 04:04:52

Question: I have 32 machine threads and one ConcurrentHashMap<Key,Value> map, which contains a lot of keys. Key defines a public method visit(). I want to visit() every element of map exactly once, using the processing power I have available and possibly some sort of thread pooling. Things I could try: I could use the method map.keys(). The resulting Enumeration<Key> could be iterated over using nextElement(), but since a call to key.visit() is very brief I won't manage to keep the threads busy. The …
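
One possible answer, sketched under the assumption that Java 8 or later is available: ConcurrentHashMap's bulk forEachKey operation traverses the map in the common ForkJoinPool once the map is larger than the given parallelism threshold, so the brief visit() calls are spread across cores without hand-rolled thread pooling. Key below is a stand-in for the poster's class:

```java
import java.util.concurrent.ConcurrentHashMap;

public class VisitAll {
    // Stand-in for the poster's Key type.
    static class Key {
        public void visit() {
            // brief per-key work goes here
        }
    }

    public static void main(String[] args) {
        ConcurrentHashMap<Key, Boolean> map = new ConcurrentHashMap<>();
        for (int i = 0; i < 1_000_000; i++) {
            map.put(new Key(), Boolean.TRUE);
        }

        // Runs in parallel in the common ForkJoinPool whenever the map
        // holds more than 10_000 elements; each key is visited exactly
        // once as long as the map is not modified concurrently.
        map.forEachKey(10_000, Key::visit);
    }
}
```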

What is the basic difference between JobConf and Job?

Submitted by 元气小坏坏 on 2019-12-18 14:15:23

Question: Hi, I wanted to know the basic difference between JobConf and Job objects. Currently I am submitting my job like this: JobClient.runJob(jobconf); I saw another way of submitting jobs, like this: Configuration conf = getConf(); Job job = new Job(conf, "secondary sort"); job.waitForCompletion(true); return 0; Also, how can I specify the sort comparator class for the job using JobConf? Can anyone explain this concept to me? Answer 1: In short: JobConf and everything else in the org.apache.hadoop.mapred package …
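
A hedged sketch of the newer org.apache.hadoop.mapreduce style used in the second snippet, showing where the sort comparator is configured on Job. The reverse-order comparator is a made-up example; with the old JobConf API the corresponding call is jobConf.setOutputKeyComparatorClass(...):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;

public class SecondarySortDriver {

    // Made-up comparator that sorts Text keys in reverse order, purely to
    // show where a custom sort comparator plugs in.
    public static class ReverseTextComparator extends WritableComparator {
        protected ReverseTextComparator() {
            super(Text.class, true);
        }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return -super.compare(a, b);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // New API: the job name, jar, and sort comparator are all set on Job.
        Job job = Job.getInstance(conf, "secondary sort");
        job.setJarByClass(SecondarySortDriver.class);
        job.setSortComparatorClass(ReverseTextComparator.class);

        // Mapper, reducer, input/output formats and paths would be
        // configured here before submitting.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```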