hadoop-partitioning

hadoop - how total mappers are determined

蓝咒 submitted on 2019-12-24 06:36:09
Question: I am new to Hadoop and just installed Oracle's VirtualBox and Hortonworks' Sandbox. I then downloaded the latest version of Hadoop and imported the JAR files into my Java program. I copied a sample WordCount program and created a new JAR file. I ran this JAR file as a job using the sandbox. The word count works perfectly fine, as expected. However, on my job status page, I see that the number of mappers for my input file is reported as 28. My input file contains the following line: Ramesh is
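
Editor's note on the idea behind this question: Hadoop's FileInputFormat launches roughly one map task per input split, where the split size is derived from the block size and the min/max split settings. The snippet below is only a sketch of that arithmetic in Python; the 128 MB default block size and the helper name are assumptions, not part of the original job.

import math

def estimated_mappers(file_size_bytes,
                      block_size=128 * 1024 * 1024,   # assumed HDFS default
                      min_split=1,
                      max_split=float("inf")):
    """Approximate FileInputFormat's split count for a single input file."""
    # splitSize = max(minSize, min(maxSize, blockSize)); one mapper per split.
    split_size = max(min_split, min(max_split, block_size))
    return max(1, math.ceil(file_size_bytes / split_size))

# A one-line input file fits in a single split, so one mapper is expected for it;
# a much larger mapper count usually points at additional files under the input
# path, since splits never span files.
print(estimated_mappers(20))  # -> 1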

Spark: can you include partition columns in output files?

人盡茶涼 submitted on 2019-12-20 02:13:48
Question: I am using Spark to write out data into partitions. Given a dataset with two columns (foo, bar), if I do df.write.mode("overwrite").format("csv").partitionBy("foo").save("/tmp/output"), I get an output of /tmp/output/foo=1/X.csv, /tmp/output/foo=2/Y.csv, ... However, the output CSV files only contain the value of bar, not foo. I know the value of foo is already captured in the directory name foo=N, but is it possible to also include the value of foo in the CSV file? Answer 1: Only if you make
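
The answer above is cut off by the excerpt; one common workaround (which the truncated answer appears to point at) is to duplicate the partition column so one copy drives the directory layout while the original stays in the CSV. A minimal PySpark sketch; the duplicated column name foo_part and the sample rows are hypothetical, only /tmp/output and the foo/bar columns come from the question.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["foo", "bar"])  # sample data, not from the question

# Keep "foo" in the CSV by partitioning on a copy of it: the copy becomes the
# foo_part=N directory and the original column is written into the files.
(df.withColumn("foo_part", F.col("foo"))
   .write.mode("overwrite")
   .format("csv")
   .partitionBy("foo_part")
   .save("/tmp/output"))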

Efficient way of joining multiple tables in Spark - No space left on device

半腔热情 submitted on 2019-12-19 04:00:50
Question: A similar question has been asked here, but it does not address my question properly. I have nearly 100 DataFrames, each with at least 200,000 rows, and I need to join them with a full join on the column ID, thereby creating a DataFrame with columns ID, Col1, Col2, Col3, Col4, Col5, ..., Col102. Just for illustration, the structure of my DataFrames: df1 = df2 = df3 = ..... df100 = +----+------+------+------+ +----+------+ +----+------+ +----+------+ | ID| Col1| Col2|
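
For readers skimming this entry, the join pattern it describes is usually written as a fold over the list of frames, and repartitioning each input by the join key first is one common way to keep the shuffle (and the temp space it spills to disk) under control. The sketch below uses three tiny stand-in frames and an arbitrary partition count of 200; none of those values come from the original post.

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for the ~100 DataFrames described above.
dfs = [
    spark.createDataFrame([(1, 10), (2, 20)], ["ID", "Col1"]),
    spark.createDataFrame([(1, 30), (3, 40)], ["ID", "Col2"]),
    spark.createDataFrame([(2, 50), (3, 60)], ["ID", "Col3"]),
]

# Hash-partition every frame on the join key so the chained full joins
# shuffle less data; 200 is an arbitrary example value.
dfs = [df.repartition(200, "ID") for df in dfs]

joined = reduce(lambda left, right: left.join(right, on="ID", how="full"), dfs)
joined.show()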

Specify minimum number of generated files from Hive insert

故事扮演 submitted on 2019-12-17 21:08:30
Question: I am using Hive on AWS EMR to insert the results of a query into a Hive table partitioned by date. Although the total output size each day is similar, the number of generated files varies, usually between 6 and 8, but some days it creates just a single big file. I reran the query a couple of times, in case the number of files happened to be influenced by the availability of nodes in the cluster, but it seems consistent. So my questions are (a) what determines how many files are
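
Editor's aside: the file count of a partitioned insert generally tracks the number of tasks that end up writing (reducers, or mappers for map-only plans), which is why it drifts with the plan rather than with the data size. In Hive itself this is usually steered with DISTRIBUTE BY plus the reducer/merge settings; the sketch below shows the same idea through Spark's Hive support purely as an illustration, with placeholder table, column, and partition names that are not from the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Force a fixed-width shuffle before the insert so each date partition is
# written by a predictable number of tasks (and therefore files).
spark.conf.set("spark.sql.shuffle.partitions", "8")
spark.sql("""
    INSERT OVERWRITE TABLE target_table PARTITION (dt = '2019-12-17')
    SELECT id, value
    FROM source_table
    WHERE dt = '2019-12-17'
    DISTRIBUTE BY id
""")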

In Apache Spark, why does RDD.union not preserve the partitioner?

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-17 10:41:10
Question: As everyone knows, partitioners in Spark have a huge performance impact on any "wide" operation, so they are usually customized in such operations. I was experimenting with the following code: val rdd1 = sc.parallelize(1 to 50).keyBy(_ % 10) .partitionBy(new HashPartitioner(10)) val rdd2 = sc.parallelize(200 to 230).keyBy(_ % 13) val cogrouped = rdd1.cogroup(rdd2) println("cogrouped: " + cogrouped.partitioner) val unioned = rdd1.union(rdd2) println("union: " + unioned.partitioner) I see that by
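
The behaviour being probed here, in short: union can only keep a partitioner when both inputs already share one (Spark can then use a partitioner-aware union); with different or missing partitioners it falls back to a plain concatenation with no partitioner. Below is a rough PySpark rendering of the same experiment (the original is Scala), with an extra same-partitioner case added; exact results may vary by Spark version, so the comments are stated as expectations rather than guarantees.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd1 = sc.parallelize(range(1, 51)).keyBy(lambda x: x % 10).partitionBy(10)
rdd2 = sc.parallelize(range(200, 231)).keyBy(lambda x: x % 10).partitionBy(10)
rdd3 = sc.parallelize(range(200, 231)).keyBy(lambda x: x % 13)  # no partitioner

# Expected: a partitioner survives only when both sides share one.
print("same partitioner:     ", rdd1.union(rdd2).partitioner)
print("different partitioner:", rdd1.union(rdd3).partitioner)  # None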

Hive drops all the partitions if the partition column name is not correct

落爺英雄遲暮 submitted on 2019-12-12 03:38:46
Question: I am facing a strange issue with Hive. I have a table partitioned on the basis of dept_key (it's an integer, e.g. 3212). The table is created as follows: create external table dept_details (dept_key INT, dept_name STRING, dept_location STRING) PARTITIONED BY (dept_key_partition INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '~' LOCATION '/dept_details/dept/'; Now I have some partitions already added, e.g. 1204, 1203, 1204. When I tried dropping a partition, I by mistake typed only dept_key and not "dept_key_partition"; this
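
For anyone landing on this entry: the safe pattern is to name the declared partition column (dept_key_partition) in the drop spec, since a spec naming a non-partition column is what preceded the surprising mass drop described above. The statements below are illustrative only and are issued through Spark's Hive support rather than the Hive CLI the question uses.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Drop exactly one partition by naming the declared partition column.
spark.sql("ALTER TABLE dept_details DROP IF EXISTS PARTITION (dept_key_partition = 1204)")

# Check what remains before and after any destructive statement.
spark.sql("SHOW PARTITIONS dept_details").show()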

Aggregate queries fail in hive if partition directory doesn't exist

故事扮演 submitted on 2019-12-12 02:18:20
Question: I am using Hive v1.2.1 with Tez. I have an external partitioned table. The partitions are hourly and of the form p=yyyy_mm_dd_hh. The situation is that these partition directories in HDFS are likely to be deleted at some point. After they are deleted, Hive still contains the metadata for that partition, and a 'show partitions' command would still list the partition whose directory was deleted from HDFS. Normally this is not likely to cause any problem, and a select query for the partition (whose
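
A common way to keep the metastore honest after HDFS directories are removed out of band is to drop the matching partition metadata before running aggregates. The statement below is only a sketch, issued here through Spark's Hive support; my_table and the hour value are placeholders, not names from the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Placeholder table and partition value; the real table is the external
# hourly-partitioned table from the question.
spark.sql("ALTER TABLE my_table DROP IF EXISTS PARTITION (p = '2019_12_11_02')")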

PySpark: Partitioning and hashing multiple dataframes, then joining

空扰寡人 submitted on 2019-12-11 10:08:48
Question: Background: I am working with clinical data spread across a lot of different .csv/.txt files. All of these files are patientID based, but they have different fields. I am importing these files into DataFrames, which I will join at a later stage after first processing each DataFrame individually. I have shown examples of two DataFrames below (df_A and df_B). Similarly, I have multiple DataFrames - df_A, df_B, df_C, ..., df_J - and I will join all of them at a later stage. df_A = spark.read.schema
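
Since the excerpt stops at the read step, here is a hedged sketch of the usual approach for this shape of problem: hash-partition every frame on the shared key with the same partition count before the joins, so the later joins do not each re-shuffle their inputs. The sample rows, field names, and the partition count of 64 are invented for illustration; only patientID and the df_A/df_B names come from the question.

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tiny stand-ins for df_A .. df_J; the real ones come from spark.read.schema(...).csv(...).
df_A = spark.createDataFrame([(1, "x"), (2, "y")], ["patientID", "field_a"])
df_B = spark.createDataFrame([(1, "u"), (3, "v")], ["patientID", "field_b"])

# Repartition every frame identically on the join key before processing and joining.
frames = [df.repartition(64, "patientID") for df in (df_A, df_B)]

joined = reduce(lambda left, right: left.join(right, on="patientID", how="outer"), frames)
joined.show()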