bigdata

SparkR Job 100 Minutes Timeout

|▌冷眼眸甩不掉的悲伤 submitted on 2020-01-01 15:50:13
Question: I have written a somewhat complex SparkR script and run it using spark-submit. What the script basically does is read a big Hive/Impala Parquet-based table row by row and generate a new Parquet file with the same number of rows. But the job seems to stop after almost exactly 100 minutes, which looks like some timeout. For up to 500K rows the script works perfectly (because it needs less than 100 minutes). For 1, 2, 3 or more million rows the script exits after 100 minutes. I checked all possible parameters having values …
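
For reference, a minimal PySpark sketch of the kind of job described and the timeout-related settings that are usually checked first (the question uses SparkR; the configuration values and table paths here are assumptions, not the poster's actual settings):

    from pyspark.sql import SparkSession

    # Assumption: raising the network timeout and heartbeat interval is a common
    # first step when a long-running job dies at a round wall-clock limit.
    spark = (SparkSession.builder
             .appName("parquet-rewrite")
             .config("spark.network.timeout", "600s")
             .config("spark.executor.heartbeatInterval", "60s")
             .getOrCreate())

    # Hypothetical paths: read the Parquet-backed table and write a new Parquet file.
    df = spark.read.parquet("/warehouse/source_table")
    df.write.mode("overwrite").parquet("/warehouse/target_table")
    spark.stop()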

How to subtract months from date in HIVE

不羁岁月 submitted on 2020-01-01 06:36:30
Question: I am looking for a way to subtract months from a date in Hive. I have the date 2015-02-01 and now need to subtract 2 months from it, so that the result is 2014-12-01. Can you guys help me out here?
Answer 1:

    select add_months('2015-02-01', -2);

If you need to go back to the first day of the resulting month:

    select add_months(trunc('2015-02-01', 'MM'), -2);

Answer 2: Please try the add_months date function and pass -2 as the months argument. Internally, add_months uses the Java Calendar.add method, which …
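
The same HiveQL can be sanity-checked through Spark SQL; a minimal sketch, assuming a Hive-enabled SparkSession is available (the session setup is an assumption, the queries are the ones from the answers):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("add-months-check").enableHiveSupport().getOrCreate()

    # add_months('2015-02-01', -2) should return 2014-12-01
    spark.sql("SELECT add_months('2015-02-01', -2) AS two_months_back").show()

    # trunc(..., 'MM') first snaps the date to the first day of its month
    spark.sql("SELECT add_months(trunc('2015-02-01', 'MM'), -2) AS from_month_start").show()

    spark.stop()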

convert data.frame to ff

荒凉一梦 submitted on 2020-01-01 05:44:05
Question: I would like to convert a data.frame to an ff object with as.ffdf, as described here:

    df.apr = as.data.frame(df.apr)  # from data.table to data.frame
    cols = df.apr[1, ]
    cols = sapply(cols, class)
    df_apr = as.ffdf(df.apr, vmode = cols)

This gives an error: Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : vmode 'numeric' not implemented. Without the 'vmode' argument, the following error is given: Error in ff(initdata = initdata, length = length, levels = levels, ordered = …

Does a flatMap in spark cause a shuffle?

别来无恙 submitted on 2020-01-01 05:05:32
Question: Does flatMap in Spark behave like the map function and therefore cause no shuffling, or does it trigger a shuffle? I suspect it does cause shuffling. Can someone confirm it?
Answer 1: There is no shuffling with either map or flatMap. The operations that cause a shuffle are: repartition operations (repartition, coalesce), ByKey operations except for counting (groupByKey, reduceByKey), and join operations (cogroup, join). Although the set of elements in each partition of newly shuffled data will be …
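
One way to see the difference is to inspect an RDD's lineage: stage boundaries only appear where a shuffle happens. A minimal PySpark sketch (the sample data and local master are assumptions):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "shuffle-check")
    lines = sc.parallelize(["a b", "b c", "c d"])

    # map/flatMap are narrow transformations: the lineage stays within a single stage
    words = lines.flatMap(lambda line: line.split())
    print(words.toDebugString().decode())

    # reduceByKey is a ByKey operation: the lineage shows a shuffle boundary
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    print(counts.toDebugString().decode())

    sc.stop()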

Reverse Sorting Reducer Keys

妖精的绣舞 submitted on 2020-01-01 03:48:06
Question: What is the best approach to get the map output keys to a reducer in reverse order? By default the reducer receives all keys in ascending order. Any help or comments widely appreciated. In simple words: in the normal scenario, if a map emits the keys 1, 4, 3, 5, 2, the reducer receives them as 1, 2, 3, 4, 5. I would like the reducer to receive 5, 4, 3, 2, 1 instead.
Answer 1: In Hadoop 1.x, you can specify a custom comparator class for your outputs using JobConf.setOutputKeyComparatorClass. Your …

What are the limitations of implementing MySQL NDB Cluster?

老子叫甜甜 submitted on 2019-12-31 22:20:33
Question: I want to implement NDB Cluster for MySQL Cluster 6. I want to do it for a very large data set with a minimum of 2 million records. I would like to know whether there are any limitations to implementing NDB Cluster, for example RAM size, number of databases, or size of a database for NDB Cluster.
Answer 1: 2 million databases? I assume you meant "rows". Anyway, concerning limitations: one of the most important things to keep in mind is that NDB/MySQL Cluster is not a general-purpose database. Most notably, …

Numpy efficient big matrix multiplication

99封情书 submitted on 2019-12-31 10:51:59
Question: To store a big matrix on disk I use numpy.memmap. Here is some sample code to test big matrix multiplication:

    import numpy as np
    import time

    rows = 10000  # it can be large, for example 1kk (1,000,000)
    cols = 1000

    # create some data in memory
    data = np.arange(rows * cols, dtype='float32')
    data.resize((rows, cols))

    # create files on disk
    fp0 = np.memmap('C:/data_0', dtype='float32', mode='w+', shape=(rows, cols))
    fp1 = np.memmap('C:/data_1', dtype='float32', mode='w+', shape=(rows, cols))
    fp0[:] = data[:]
    fp1[:] = data[:]
    …
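
The excerpt cuts off before the multiplication itself; continuing from the memmaps above, one block-wise sketch of how the product could be streamed into a third memmap (the output path, block size, and the choice of computing fp0 @ fp1.T are assumptions):

    # Continuing from fp0, fp1, rows defined above.
    # Assumed goal: C = fp0 @ fp1.T, written block by block so the full
    # (rows x rows) result never has to sit in RAM at once.
    out = np.memmap('C:/data_out', dtype='float32', mode='w+', shape=(rows, rows))
    block = 1000
    for i in range(0, rows, block):
        out[i:i + block, :] = np.dot(fp0[i:i + block, :], fp1.T)
    out.flush()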

How to Serialize object in hadoop (in HDFS)

◇◆丶佛笑我妖孽 submitted on 2019-12-31 06:53:49
Question: I have a HashMap<String, ArrayList<Integer>>. I want to serialize my HashMap object (hmap) to an HDFS location and later deserialize it in the mapper and reducers to use it there. To serialize my HashMap object to HDFS I used the normal Java object serialization code as follows, but got an error (permission denied):

    try {
        FileOutputStream fileOut = new FileOutputStream("hashmap.ser");
        ObjectOutputStream out = new ObjectOutputStream(fileOut);
        out.writeObject(hm);
        out.close();
    } catch (Exception e) {
        e…
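
The question's code is Java; as an illustration of one common pattern (serialize to a local file first, then copy it into an HDFS path the submitting user can write to), here is a rough Python analogue; the dictionary contents and target path are assumptions:

    import pickle
    import subprocess
    import tempfile

    # Hypothetical stand-in for the HashMap<String, ArrayList<Integer>> in the question
    hm = {"k1": [1, 2, 3], "k2": [4, 5]}

    # Serialize to a local temporary file first
    with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as f:
        pickle.dump(hm, f)
        local_path = f.name

    # Copy the local file into HDFS with the standard CLI; -f overwrites an existing file.
    # A permission-denied error usually points at a directory the current user cannot write to.
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, "/user/myuser/hashmap.pkl"], check=True)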