bigdata

Job queue for Hive action in Oozie

拜拜、爱过 submitted on 2019-11-26 22:11:20
Question: I have an Oozie workflow. I am submitting all the Hive actions with <name>mapred.job.queue.name</name> <value>${queueName}</value>, but for a few Hive actions the launched job is not in the specified queue; it is invoked in the default queue. Please suggest the cause of this behavior and a solution. Answer 1: A. Oozie specifics. Oozie propagates the "regular" Hadoop properties to a "regular" MapReduce action. But for other types of action (Shell, Hive, Java, etc.), where Oozie runs a single Mapper task in

Unbalanced factor of KMeans?

时光总嘲笑我的痴心妄想 submitted on 2019-11-26 22:07:40
Question: Edit: The answer to this question is discussed at length in: Sum in Spark gone bad. In Compute Cost of Kmeans, we saw how one can compute the cost of a KMeans model. I was wondering whether we can compute the unbalanced factor as well. If there is no such functionality provided by Spark, is there any easy way to implement it? I was not able to find a reference for the unbalanced factor, but it should be similar to Yael's unbalanced_factor (my comments): // @hist: the number of points assigned to a
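The excerpt cuts off before the definition is complete, so the PySpark sketch below is only a hedged illustration. It assumes one plausible definition of the unbalanced factor, k * sum(size_i^2) / (sum(size_i))^2, which is 1.0 when clusters are perfectly balanced and grows with skew; the toy data, k, and the formula itself are assumptions, not taken from the answer. hist mirrors the @hist comment above.

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext.getOrCreate()

# Hypothetical toy data; in practice `points` is your RDD of feature vectors.
points = sc.parallelize([[0.0, 0.0], [0.1, 0.1], [9.0, 9.0], [9.1, 9.1]])
model = KMeans.train(points, k=2, maxIterations=10)

assignments = model.predict(points)     # RDD of cluster indices, one per point
hist = assignments.countByValue()       # {cluster index: number of points assigned}

sizes = list(hist.values())
k, total = len(sizes), sum(sizes)       # k here is the number of non-empty clusters

# Assumed definition: k * sum(size_i^2) / total^2 -> 1.0 when perfectly balanced.
unbalanced_factor = k * sum(s * s for s in sizes) / float(total * total)
print(unbalanced_factor)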

How can I calculate exact median with Apache Spark?

試著忘記壹切 submitted on 2019-11-26 21:31:20
Question: This page contains some statistics functions (mean, stdev, variance, etc.), but it does not include the median. How can I calculate the exact median? Thanks. Answer 1: You need to sort the RDD and take the element in the middle, or the average of the two middle elements. Here is an example with RDD[Int]: import org.apache.spark.SparkContext._ val rdd: RDD[Int] = ??? val sorted = rdd.sortBy(identity).zipWithIndex().map { case (v, idx) => (idx, v) } val count = sorted.count() val median: Double = if (count % 2 == 0) { val l = count
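The same sort-and-index approach carries over to PySpark. The sketch below is my adaptation of the Scala snippet above, not code from the answer itself, and the sample RDD is a placeholder.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
rdd = sc.parallelize([7, 1, 5, 3, 9, 2])   # stand-in for the real RDD of numbers

# Sort, then attach a position to every element so the middle can be looked up.
sorted_rdd = rdd.sortBy(lambda x: x).zipWithIndex().map(lambda vi: (vi[1], vi[0]))
count = sorted_rdd.count()

if count % 2 == 0:
    left, right = count // 2 - 1, count // 2
    middle = sorted_rdd.filter(lambda kv: kv[0] in (left, right)).values().collect()
    median = sum(middle) / 2.0
else:
    median = float(sorted_rdd.lookup(count // 2)[0])

print(median)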

How do I output the results of a HiveQL query to CSV?

好久不见. submitted on 2019-11-26 21:25:21
We would like to put the results of a Hive query into a CSV file. I thought the command should look like this: insert overwrite directory '/home/output.csv' select books from table; When I run it, it says it completed successfully, but I can never find the file. How do I find this file, or should I be extracting the data in a different way? Thanks! Although it is possible to use INSERT OVERWRITE to get data out of Hive, it might not be the best method for your particular case. First let me explain what INSERT OVERWRITE does, then I'll describe the method I use to get TSV files from Hive tables.
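The excerpt stops before the answer's own method, so the following is only a hedged sketch of one common workaround: run the query through the Hive CLI, whose default output is tab-separated, and rewrite it as CSV. The query, output path, and the use of hive -e are assumptions; beeline --outputformat=csv2 is another option if you use Beeline.

import csv
import subprocess

# Hypothetical query and output path; `hive -e` prints tab-separated rows to stdout.
result = subprocess.run(
    ['hive', '-e', 'SELECT books FROM table;'],
    capture_output=True, text=True, check=True,
)

with open('/home/output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for line in result.stdout.splitlines():
        writer.writerow(line.split('\t'))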

python - Using pandas structures with large csv(iterate and chunksize)

五迷三道 submitted on 2019-11-26 20:21:51
I have a large CSV file, about 600 MB with 11 million rows, and I want to create statistics such as pivots, histograms, graphs, etc. Obviously, just trying to read it normally: df = pd.read_csv('Check400_900.csv', sep='\t') doesn't work, so I found iterator and chunksize in a similar post and used df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000) All good; I can, for example, print df.get_chunk(5) and search the whole file with just for chunk in df: print chunk My problem is that I don't know how to use things like the ones below for the whole df and not just for one chunk plt
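One way to get whole-file statistics out of the chunked reader is to aggregate per chunk and combine the partial results instead of holding the full DataFrame. A minimal sketch, assuming a hypothetical column name some_column and that matplotlib is available for the final plot:

import pandas as pd

chunks = pd.read_csv('Check1_900.csv', sep='\t', chunksize=1000)

counts = pd.Series(dtype='float64')   # running value counts across all chunks
total_rows = 0

for chunk in chunks:
    counts = counts.add(chunk['some_column'].value_counts(), fill_value=0)
    total_rows += len(chunk)

print(total_rows)
counts.sort_values(ascending=False).plot(kind='bar')   # histogram over the whole file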

Would Spark unpersist the RDD itself when it realizes it won't be used anymore?

牧云@^-^@ submitted on 2019-11-26 20:21:27
Question: We can persist an RDD in memory and/or on disk when we want to use it more than once. However, do we have to unpersist it ourselves later on, or does Spark do some kind of garbage collection and unpersist the RDD when it is no longer needed? I notice that if I call the unpersist function myself, I get slower performance. Answer 1: Yes, Apache Spark will unpersist the RDD when it is garbage collected. In RDD.persist you can see: sc.cleaner.foreach(_.registerRDDForCleanup(this)) This puts a
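A small PySpark illustration of the trade-off being described, with made-up data: an explicit unpersist() frees the cache eagerly, while leaving it alone defers cleanup to Spark's ContextCleaner once the RDD is garbage collected on the driver.

from pyspark import SparkContext, StorageLevel

sc = SparkContext.getOrCreate()

rdd = sc.parallelize(range(1000)).map(lambda x: x * x)
rdd.persist(StorageLevel.MEMORY_ONLY)

print(rdd.count())   # first action computes and caches the partitions
print(rdd.sum())     # second action reuses the cached partitions

# Eager, explicit release; without this call the ContextCleaner unpersists
# the RDD asynchronously after it is garbage collected on the driver.
rdd.unpersist()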

Recommended package for very large dataset processing and machine learning in R

一个人想着一个人 submitted on 2019-11-26 18:55:48
Question: It seems that R is really designed to handle datasets it can pull entirely into memory. What R packages are recommended for signal processing and machine learning on very large datasets that cannot be pulled into memory? If R is simply the wrong tool for this, I am open to other robust free suggestions (e.g. scipy, if there is some nice way to handle very large datasets). Answer 1: Have a look at the "Large memory and out-of-memory data" subsection of the High-Performance Computing task view
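Since the asker is open to a scipy/numpy route, here is one hedged sketch of out-of-memory processing on that side: a NumPy memmap lets you stream over a binary file block by block without loading it all. The file name, dtype, and block size are assumptions, and this is not a substitute for the R packages the task view recommends.

import numpy as np

# Hypothetical file of float64 values written earlier with arr.tofile('big.dat').
data = np.memmap('big.dat', dtype='float64', mode='r')

block = 1_000_000
total, count = 0.0, 0
for start in range(0, data.shape[0], block):
    chunk = np.asarray(data[start:start + block])   # materialise one block in RAM
    total += chunk.sum()
    count += chunk.size

print(total / count)   # streaming mean over the whole file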

PySpark DataFrames - way to enumerate without converting to Pandas?

心不动则不痛 submitted on 2019-11-26 17:55:43
I have a very big pyspark.sql.dataframe.DataFrame named df. I need some way of enumerating records, and thus of accessing a record with a certain index (or selecting a group of records in an index range). In pandas, I could just do indexes=[2,3,6,7] df[indexes] Here I want something similar (and without converting the DataFrame to pandas). The closest I can get is: enumerating all the objects in the original DataFrame with indexes=np.arange(df.count()) df_indexed=df.withColumn('index', indexes) and then searching for the values I need using the where() function. QUESTIONS: Why doesn't it work, and how do I make it
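The attempt above fails because withColumn expects a Column expression, not a local NumPy array. A hedged sketch of the usual workaround: pair each row with a contiguous index via zipWithIndex on the underlying RDD, filter on that index, and rebuild the DataFrame (the sample DataFrame and index set below are placeholders).

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 100).toDF("value")     # stand-in for the real DataFrame

wanted = {2, 3, 6, 7}

indexed = df.rdd.zipWithIndex()            # RDD of (Row, contiguous index)
picked = indexed.filter(lambda pair: pair[1] in wanted).map(lambda pair: pair[0])

result = spark.createDataFrame(picked, schema=df.schema)
result.show()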

Hadoop MapReduce secondary sorting

烈酒焚心 submitted on 2019-11-26 17:34:05
Can anyone explain to me how secondary sorting works in Hadoop? Why must one use a GroupingComparator, and how does it work in Hadoop? I was going through the link below and have doubts about how the grouping comparator works. Can anyone explain how the grouping comparator works? http://www.bigdataspeak.com/2013/02/hadoop-how-to-do-secondary-sort-on_25.html Deepika C P: Grouping Comparator. Once the data reaches a reducer, all data is grouped by key. Since we have a composite key, we need to make sure records are grouped solely by the natural key. This is accomplished by writing a custom GroupPartitioner.
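The moving parts are easier to see outside Hadoop. Below is a plain-Python analogy (not Hadoop code, and with made-up records): sorting by the full composite key plays the role of the sort comparator, and grouping by the natural key alone plays the role of the grouping comparator, so each "reduce" group arrives with its values already ordered by the secondary key.

from itertools import groupby

# Hypothetical records: (natural key, secondary key, payload).
records = [
    ("user1", 3, "c"), ("user2", 1, "x"),
    ("user1", 1, "a"), ("user1", 2, "b"),
]

# Sort comparator: order by the full composite key (natural key, then secondary key).
shuffled = sorted(records, key=lambda r: (r[0], r[1]))

# Grouping comparator: group reducer input by the natural key only, so every group
# sees its payloads already sorted by the secondary key.
for natural_key, group in groupby(shuffled, key=lambda r: r[0]):
    print(natural_key, [payload for _, _, payload in group])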

Strategies for reading in CSV files in pieces?

天大地大妈咪最大 submitted on 2019-11-26 12:43:52
Question: I have a moderately sized file (a 4 GB CSV) on a computer that doesn't have sufficient RAM to read it in (8 GB on 64-bit Windows). In the past I would just have loaded it onto a cluster node and read it in, but my new cluster seems to arbitrarily limit processes to 4 GB of RAM (despite the hardware having 16 GB per machine), so I need a short-term fix. Is there a way to read part of a CSV file into R to fit within available memory limitations? That way I could read in a third of the file at a time,