bigdata

SparkR Job 100 Minutes Timeout

|▌冷眼眸甩不掉的悲伤 submitted on 2020-01-01 15:50:13
Question: I have written a somewhat complex SparkR script and run it using spark-submit. What the script basically does is read a big Hive/Impala Parquet-based table row by row and generate a new Parquet file with the same number of rows. But the job seems to stop after almost exactly 100 minutes, which looks like some timeout. For up to 500K rows the script works perfectly (because it needs less than 100 minutes). For 1, 2, 3 or more million rows the script exits after 100 minutes. I checked all possible parameters having values …
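
For reference, a minimal PySpark sketch of the kind of job described and the timeout-related settings that are usually checked first (the question uses SparkR; the configuration values and table paths here are assumptions, not the poster's actual settings):

    from pyspark.sql import SparkSession

    # Assumption: raising the network timeout and heartbeat interval is a common
    # first step when a long-running job dies at a round wall-clock limit.
    spark = (SparkSession.builder
             .appName("parquet-rewrite")
             .config("spark.network.timeout", "600s")
             .config("spark.executor.heartbeatInterval", "60s")
             .getOrCreate())

    # Hypothetical paths: read the Parquet-backed table and write a new Parquet file.
    df = spark.read.parquet("/warehouse/source_table")
    df.write.mode("overwrite").parquet("/warehouse/target_table")
    spark.stop()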

How to subtract months from date in HIVE

不羁岁月 submitted on 2020-01-01 06:36:30
Question: I am looking for a way to subtract months from a date in Hive. I have the date 2015-02-01 and now need to subtract 2 months from it, so that the result is 2014-12-01. Can you guys help me out here?
Answer 1:

    select add_months('2015-02-01', -2);

If you need to go back to the first day of the resulting month:

    select add_months(trunc('2015-02-01', 'MM'), -2);

Answer 2: Please try the add_months date function and pass -2 as the months argument. Internally, add_months uses the Java Calendar.add method, which …
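
The same HiveQL can be sanity-checked through Spark SQL; a minimal sketch, assuming a Hive-enabled SparkSession is available (the session setup is an assumption, the queries are the ones from the answers):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("add-months-check").enableHiveSupport().getOrCreate()

    # add_months('2015-02-01', -2) should return 2014-12-01
    spark.sql("SELECT add_months('2015-02-01', -2) AS two_months_back").show()

    # trunc(..., 'MM') first snaps the date to the first day of its month
    spark.sql("SELECT add_months(trunc('2015-02-01', 'MM'), -2) AS from_month_start").show()

    spark.stop()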

convert data.frame to ff

荒凉一梦 submitted on 2020-01-01 05:44:05
Question: I would like to convert a data.frame to an ff object with as.ffdf, as described here:

    df.apr = as.data.frame(df.apr)  # from data.table to data.frame
    cols = df.apr[1, ]
    cols = sapply(cols, class)
    df_apr = as.ffdf(df.apr, vmode = cols)

This gives an error: Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered, : vmode 'numeric' not implemented. Without the 'vmode' argument, the following error is given: Error in ff(initdata = initdata, length = length, levels = levels, ordered = …

Does a flatMap in spark cause a shuffle?

别来无恙 submitted on 2020-01-01 05:05:32
Question: Does flatMap in Spark behave like the map function and therefore cause no shuffling, or does it trigger a shuffle? I suspect it does cause shuffling. Can someone confirm it?
Answer 1: There is no shuffling with either map or flatMap. The operations that cause a shuffle are: repartition operations (repartition, coalesce), ByKey operations except for counting (groupByKey, reduceByKey), and join operations (cogroup, join). Although the set of elements in each partition of newly shuffled data will be …
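
One way to see the difference is to inspect an RDD's lineage: stage boundaries only appear where a shuffle happens. A minimal PySpark sketch (the sample data and local master are assumptions):

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "shuffle-check")
    lines = sc.parallelize(["a b", "b c", "c d"])

    # map/flatMap are narrow transformations: the lineage stays within a single stage
    words = lines.flatMap(lambda line: line.split())
    print(words.toDebugString().decode())

    # reduceByKey is a ByKey operation: the lineage shows a shuffle boundary
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    print(counts.toDebugString().decode())

    sc.stop()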

Reverse Sorting Reducer Keys

妖精的绣舞 submitted on 2020-01-01 03:48:06
Question: What is the best approach to get the map output keys to a reducer in reverse order? By default the reducer receives all keys in ascending order. Any help or comments widely appreciated. In simple words: in the normal scenario, if a map emits the keys 1, 4, 3, 5, 2, the reducer receives them as 1, 2, 3, 4, 5. I would like the reducer to receive 5, 4, 3, 2, 1 instead.
Answer 1: In Hadoop 1.x, you can specify a custom comparator class for your outputs using JobConf.setOutputKeyComparatorClass. Your …

What are the limitations of implementing MySQL NDB Cluster?

老子叫甜甜 submitted on 2019-12-31 22:20:33
Question: I want to implement NDB Cluster for MySQL Cluster 6. I want to do it for a very large data set with a minimum of 2 million records. I would like to know whether there are any limitations to implementing NDB Cluster, for example RAM size, number of databases, or size of a database for NDB Cluster.
Answer 1: 2 million databases? I assume you meant "rows". Anyway, concerning limitations: one of the most important things to keep in mind is that NDB/MySQL Cluster is not a general-purpose database. Most notably, …

Numpy efficient big matrix multiplication

99封情书 submitted on 2019-12-31 10:51:59
Question: To store a big matrix on disk I use numpy.memmap. Here is some sample code to test big matrix multiplication:

    import numpy as np
    import time

    rows = 10000  # it can be large, for example 1kk (1,000,000)
    cols = 1000

    # create some data in memory
    data = np.arange(rows * cols, dtype='float32')
    data.resize((rows, cols))

    # create files on disk
    fp0 = np.memmap('C:/data_0', dtype='float32', mode='w+', shape=(rows, cols))
    fp1 = np.memmap('C:/data_1', dtype='float32', mode='w+', shape=(rows, cols))
    fp0[:] = data[:]
    fp1[:] = data[:]
    …
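
The excerpt cuts off before the multiplication itself; continuing from the memmaps above, one block-wise sketch of how the product could be streamed into a third memmap (the output path, block size, and the choice of computing fp0 @ fp1.T are assumptions):

    # Continuing from fp0, fp1, rows defined above.
    # Assumed goal: C = fp0 @ fp1.T, written block by block so the full
    # (rows x rows) result never has to sit in RAM at once.
    out = np.memmap('C:/data_out', dtype='float32', mode='w+', shape=(rows, rows))
    block = 1000
    for i in range(0, rows, block):
        out[i:i + block, :] = np.dot(fp0[i:i + block, :], fp1.T)
    out.flush()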

How to Serialize object in hadoop (in HDFS)

◇◆丶佛笑我妖孽 submitted on 2019-12-31 06:53:49
Question: I have a HashMap<String, ArrayList<Integer>>. I want to serialize my HashMap object (hmap) to an HDFS location and later deserialize it in the mapper and reducers to use it there. To serialize my HashMap object to HDFS I used the normal Java object serialization code as follows, but got an error (permission denied):

    try {
        FileOutputStream fileOut = new FileOutputStream("hashmap.ser");
        ObjectOutputStream out = new ObjectOutputStream(fileOut);
        out.writeObject(hm);
        out.close();
    } catch (Exception e) {
        e…
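
The question's code is Java; as an illustration of one common pattern (serialize to a local file first, then copy it into an HDFS path the submitting user can write to), here is a rough Python analogue; the dictionary contents and target path are assumptions:

    import pickle
    import subprocess
    import tempfile

    # Hypothetical stand-in for the HashMap<String, ArrayList<Integer>> in the question
    hm = {"k1": [1, 2, 3], "k2": [4, 5]}

    # Serialize to a local temporary file first
    with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as f:
        pickle.dump(hm, f)
        local_path = f.name

    # Copy the local file into HDFS with the standard CLI; -f overwrites an existing file.
    # A permission-denied error usually points at a directory the current user cannot write to.
    subprocess.run(["hdfs", "dfs", "-put", "-f", local_path, "/user/myuser/hashmap.pkl"], check=True)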