bigdata

Functions for creating and reshaping big data in R using the FF package

Submitted by 拈花ヽ惹草 on 2019-12-19 10:23:14

Question: I'm new to R and the ff package, and am trying to better understand how ff lets users work with large datasets (>4 GB). I have spent a considerable amount of time trawling the web for tutorials, but the ones I could find generally go over my head. I learn best by doing, so as an exercise I would like to know how to create a long-format time-series dataset, similar to R's built-in "Indometh" dataset, using arbitrary values. Then I would like to reshape it into wide format. Then I would …

Big Satellite Image Processing

Submitted by 我只是一个虾纸丫 on 2019-12-19 10:11:35

Question: I'm trying to run Mort Canty's (http://mcanty.homepage.t-online.de/) Python iMAD implementation on bitemporal RapidEye multispectral images, which basically computes the canonical correlation of the two images and then subtracts them. The problem I'm having is that the images are 5000 x 5000 pixels with 5 bands. If I try to run this on the whole image I get a memory error. Would something like PyTables help me with this? What Mort Canty's code tries to do is load the …

Subtract all pairs of values from two arrays

Submitted by 筅森魡賤 on 2019-12-19 09:43:58

Question: I have two vectors, v1 and v2. I'd like to subtract each value of v2 from each value of v1 and store the results in another vector. I also would like to work with very large vectors (e.g. of size 1e6), so I think I should be using numpy for performance. Up until now I have:

import numpy
v1 = numpy.array(numpy.random.uniform(-1, 1, size=1e2))
v2 = numpy.array(numpy.random.uniform(-1, 1, size=1e2))
vdiff = []
for value in v1:
    vdiff.extend([value - v2])

This creates a list with 100 entries, each …

What happens if an RDD can't fit into memory in Spark? [duplicate]

Submitted by 一笑奈何 on 2019-12-19 07:54:33

Question: This question already has answers here: What will spark do if I don't have enough memory? (3 answers). Closed 2 years ago. As far as I know, Spark tries to do all computation in memory unless you call persist with a disk storage option. If, however, we don't use persist at all, what does Spark do when an RDD doesn't fit in memory? What if we have very large data? How will Spark handle it without crashing? Answer 1: From the Apache Spark FAQ: Spark's operators spill data to disk if it does not fit in …
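
A minimal Java sketch of the persist route the question mentions, assuming Spark's Java API; the application name and input path are placeholders. StorageLevel.MEMORY_AND_DISK makes Spark keep partitions in memory when they fit and spill the rest to local disk instead of dropping them:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class SpillExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spill-example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // "input.txt" is a placeholder path; any large text file works.
        JavaRDD<String> lines = sc.textFile("input.txt");

        // MEMORY_AND_DISK keeps partitions in memory when they fit and
        // spills the ones that do not to local disk; the default
        // MEMORY_ONLY level would drop them and recompute on access.
        lines.persist(StorageLevel.MEMORY_AND_DISK());

        System.out.println(lines.count());
        sc.stop();
    }
}
```

This complements the FAQ excerpt above: even without persist, shuffle operators spill intermediate data to disk on their own, and cached partitions that do not fit under the default level are recomputed from lineage when needed.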

Is it a good practice to do sync database query or restful call in Kafka streams jobs?

Submitted by 心不动则不痛 on 2019-12-19 05:00:46

Question: I use Kafka Streams to process real-time data. Inside the Kafka Streams tasks I need to query MySQL and also call another RESTful service, and all of these operations are synchronous. I'm afraid the sync calls will reduce the processing capacity of the streams tasks. Is this a good practice, or is there a better way to do it? Answer 1: A better way to do it would be to stream your MySQL table(s) into Kafka and access the data there. This has the advantage of decoupling your streams app …
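
A rough Java sketch of the approach the answer suggests, assuming the MySQL table is mirrored into a compacted Kafka topic (for example by a CDC connector); the topic names, value types, and string join are made up for illustration. Each record is then enriched from the KTable's local state store instead of a synchronous MySQL or REST call per record:

```java
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class EnrichmentTopology {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();

        // Hypothetical topics: "orders" carries the real-time events and
        // "customers" is the changelog of the MySQL table. Default String
        // serdes are assumed in the application config.
        KStream<String, String> orders = builder.stream("orders");
        KTable<String, String> customers = builder.table("customers");

        // The join reads from the local state store that materializes the
        // KTable, so no blocking network call is made per record.
        orders.join(customers, (order, customer) -> order + "|" + customer)
              .to("enriched-orders");

        return builder;
    }
}
```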

Scalable way to access every element of ConcurrentHashMap<Element, Boolean> exactly once

Submitted by こ雲淡風輕ζ on 2019-12-19 04:04:52

Question: I have 32 machine threads and one ConcurrentHashMap<Key,Value> map, which contains a lot of keys. Key defines a public method visit(). I want to visit() every element of map exactly once, using the processing power I have available and possibly some sort of thread pooling. Things I could try: I could use the method map.keys(). The resulting Enumeration<Key> could be iterated over using nextElement(), but since a call to key.visit() is very brief I won't manage to keep the threads busy. The …
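
One possible answer, sketched under the assumption that Java 8 or later is available: ConcurrentHashMap's bulk forEachKey operation traverses the map in the common ForkJoinPool once the map is larger than the given parallelism threshold, so the brief visit() calls are spread across cores without hand-rolled thread pooling. Key below is a stand-in for the poster's class:

```java
import java.util.concurrent.ConcurrentHashMap;

public class VisitAll {
    // Stand-in for the poster's Key type.
    static class Key {
        public void visit() {
            // brief per-key work goes here
        }
    }

    public static void main(String[] args) {
        ConcurrentHashMap<Key, Boolean> map = new ConcurrentHashMap<>();
        for (int i = 0; i < 1_000_000; i++) {
            map.put(new Key(), Boolean.TRUE);
        }

        // Runs in parallel in the common ForkJoinPool whenever the map
        // holds more than 10_000 elements; each key is visited exactly
        // once as long as the map is not modified concurrently.
        map.forEachKey(10_000, Key::visit);
    }
}
```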

What is the basic difference between JobConf and Job?

Submitted by 元气小坏坏 on 2019-12-18 14:15:23

Question: Hi, I wanted to know the basic difference between JobConf and Job objects. Currently I am submitting my job like this: JobClient.runJob(jobconf); I saw another way of submitting jobs, like this: Configuration conf = getConf(); Job job = new Job(conf, "secondary sort"); job.waitForCompletion(true); return 0; Also, how can I specify the sort comparator class for the job using JobConf? Can anyone explain this concept to me? Answer 1: In short: JobConf and everything else in the org.apache.hadoop.mapred package …
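
A hedged sketch of the newer org.apache.hadoop.mapreduce style used in the second snippet, showing where the sort comparator is configured on Job. The reverse-order comparator is a made-up example; with the old JobConf API the corresponding call is jobConf.setOutputKeyComparatorClass(...):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;

public class SecondarySortDriver {

    // Made-up comparator that sorts Text keys in reverse order, purely to
    // show where a custom sort comparator plugs in.
    public static class ReverseTextComparator extends WritableComparator {
        protected ReverseTextComparator() {
            super(Text.class, true);
        }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            return -super.compare(a, b);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // New API: the job name, jar, and sort comparator are all set on Job.
        Job job = Job.getInstance(conf, "secondary sort");
        job.setJarByClass(SecondarySortDriver.class);
        job.setSortComparatorClass(ReverseTextComparator.class);

        // Mapper, reducer, input/output formats and paths would be
        // configured here before submitting.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```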