hadoop-partitioning

hadoop - how total mappers are determined

蓝咒 submitted on 2019-12-24 06:36:09
Question: I am new to Hadoop and just installed Oracle's VirtualBox and Hortonworks' Sandbox. I then downloaded the latest version of Hadoop and imported the JAR files into my Java program. I copied a sample WordCount program and created a new JAR file. I ran this JAR file as a job using the sandbox. The word count works perfectly fine, as expected. However, on my job status page, I see that the number of mappers for my input file is reported as 28. My input file contains the following line: Ramesh is
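
Editor's note on the idea behind this question: Hadoop's FileInputFormat launches roughly one map task per input split, where the split size is derived from the block size and the min/max split settings. The snippet below is only a sketch of that arithmetic in Python; the 128 MB default block size and the helper name are assumptions, not part of the original job.

import math

def estimated_mappers(file_size_bytes,
                      block_size=128 * 1024 * 1024,   # assumed HDFS default
                      min_split=1,
                      max_split=float("inf")):
    """Approximate FileInputFormat's split count for a single input file."""
    # splitSize = max(minSize, min(maxSize, blockSize)); one mapper per split.
    split_size = max(min_split, min(max_split, block_size))
    return max(1, math.ceil(file_size_bytes / split_size))

# A one-line input file fits in a single split, so one mapper is expected for it;
# a much larger mapper count usually points at additional files under the input
# path, since splits never span files.
print(estimated_mappers(20))  # -> 1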

Spark: can you include partition columns in output files?

人盡茶涼 submitted on 2019-12-20 02:13:48
Question: I am using Spark to write out data into partitions. Given a dataset with two columns (foo, bar), if I do df.write.mode("overwrite").format("csv").partitionBy("foo").save("/tmp/output"), I get an output of /tmp/output/foo=1/X.csv, /tmp/output/foo=2/Y.csv, ... However, the output CSV files only contain the value of bar, not foo. I know the value of foo is already captured in the directory name foo=N, but is it possible to also include the value of foo in the CSV file? Answer 1: Only if you make
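
The answer above is cut off by the excerpt; one common workaround (which the truncated answer appears to point at) is to duplicate the partition column so one copy drives the directory layout while the original stays in the CSV. A minimal PySpark sketch; the duplicated column name foo_part and the sample rows are hypothetical, only /tmp/output and the foo/bar columns come from the question.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["foo", "bar"])  # sample data, not from the question

# Keep "foo" in the CSV by partitioning on a copy of it: the copy becomes the
# foo_part=N directory and the original column is written into the files.
(df.withColumn("foo_part", F.col("foo"))
   .write.mode("overwrite")
   .format("csv")
   .partitionBy("foo_part")
   .save("/tmp/output"))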

Efficient way of joining multiple tables in Spark - No space left on device

半腔热情 submitted on 2019-12-19 04:00:50
Question: A similar question has been asked here, but it does not address my question properly. I have nearly 100 DataFrames, each with at least 200,000 rows, and I need to join them with a full join on the column ID, thereby creating a DataFrame with columns ID, Col1, Col2, Col3, Col4, Col5, ..., Col102. Just for illustration, the structure of my DataFrames: df1 = df2 = df3 = ..... df100 = +----+------+------+------+ +----+------+ +----+------+ +----+------+ | ID| Col1| Col2|
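
For readers skimming this entry, the join pattern it describes is usually written as a fold over the list of frames, and repartitioning each input by the join key first is one common way to keep the shuffle (and the temp space it spills to disk) under control. The sketch below uses three tiny stand-in frames and an arbitrary partition count of 200; none of those values come from the original post.

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for the ~100 DataFrames described above.
dfs = [
    spark.createDataFrame([(1, 10), (2, 20)], ["ID", "Col1"]),
    spark.createDataFrame([(1, 30), (3, 40)], ["ID", "Col2"]),
    spark.createDataFrame([(2, 50), (3, 60)], ["ID", "Col3"]),
]

# Hash-partition every frame on the join key so the chained full joins
# shuffle less data; 200 is an arbitrary example value.
dfs = [df.repartition(200, "ID") for df in dfs]

joined = reduce(lambda left, right: left.join(right, on="ID", how="full"), dfs)
joined.show()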

Specify minimum number of generated files from Hive insert

故事扮演 submitted on 2019-12-17 21:08:30
Question: I am using Hive on AWS EMR to insert the results of a query into a Hive table partitioned by date. Although the total output size each day is similar, the number of generated files varies, usually between 6 and 8, but some days it creates just a single big file. I reran the query a couple of times, in case the number of files happened to be influenced by the availability of nodes in the cluster, but it seems consistent. So my questions are (a) what determines how many files are
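
Editor's aside: the file count of a partitioned insert generally tracks the number of tasks that end up writing (reducers, or mappers for map-only plans), which is why it drifts with the plan rather than with the data size. In Hive itself this is usually steered with DISTRIBUTE BY plus the reducer/merge settings; the sketch below shows the same idea through Spark's Hive support purely as an illustration, with placeholder table, column, and partition names that are not from the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Force a fixed-width shuffle before the insert so each date partition is
# written by a predictable number of tasks (and therefore files).
spark.conf.set("spark.sql.shuffle.partitions", "8")
spark.sql("""
    INSERT OVERWRITE TABLE target_table PARTITION (dt = '2019-12-17')
    SELECT id, value
    FROM source_table
    WHERE dt = '2019-12-17'
    DISTRIBUTE BY id
""")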

In Apache Spark, why does RDD.union not preserve the partitioner?

生来就可爱ヽ(ⅴ<●) submitted on 2019-12-17 10:41:10
Question: As everyone knows, partitioners in Spark have a huge performance impact on any "wide" operation, so they are usually customized in such operations. I was experimenting with the following code: val rdd1 = sc.parallelize(1 to 50).keyBy(_ % 10) .partitionBy(new HashPartitioner(10)) val rdd2 = sc.parallelize(200 to 230).keyBy(_ % 13) val cogrouped = rdd1.cogroup(rdd2) println("cogrouped: " + cogrouped.partitioner) val unioned = rdd1.union(rdd2) println("union: " + unioned.partitioner) I see that by
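
The behaviour being probed here, in short: union can only keep a partitioner when both inputs already share one (Spark can then use a partitioner-aware union); with different or missing partitioners it falls back to a plain concatenation with no partitioner. Below is a rough PySpark rendering of the same experiment (the original is Scala), with an extra same-partitioner case added; exact results may vary by Spark version, so the comments are stated as expectations rather than guarantees.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

rdd1 = sc.parallelize(range(1, 51)).keyBy(lambda x: x % 10).partitionBy(10)
rdd2 = sc.parallelize(range(200, 231)).keyBy(lambda x: x % 10).partitionBy(10)
rdd3 = sc.parallelize(range(200, 231)).keyBy(lambda x: x % 13)  # no partitioner

# Expected: a partitioner survives only when both sides share one.
print("same partitioner:     ", rdd1.union(rdd2).partitioner)
print("different partitioner:", rdd1.union(rdd3).partitioner)  # None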

Hive drops all the partitions if the partition column name is not correct

落爺英雄遲暮 submitted on 2019-12-12 03:38:46
Question: I am facing a strange issue with Hive. I have a table partitioned on the basis of dept_key (it's an integer, e.g. 3212). The table is created as follows: create external table dept_details (dept_key INT, dept_name STRING, dept_location STRING) PARTITIONED BY (dept_key_partition INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '~' LOCATION '/dept_details/dept/'; Now I have some partitions already added, e.g. 1204, 1203, 1204. When I tried dropping a partition, I by mistake typed only dept_key and not "dept_key_partition"; this
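
For anyone landing on this entry: the safe pattern is to name the declared partition column (dept_key_partition) in the drop spec, since a spec naming a non-partition column is what preceded the surprising mass drop described above. The statements below are illustrative only and are issued through Spark's Hive support rather than the Hive CLI the question uses.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Drop exactly one partition by naming the declared partition column.
spark.sql("ALTER TABLE dept_details DROP IF EXISTS PARTITION (dept_key_partition = 1204)")

# Check what remains before and after any destructive statement.
spark.sql("SHOW PARTITIONS dept_details").show()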

Aggregate queries fail in hive if partition directory doesn't exist

故事扮演 submitted on 2019-12-12 02:18:20
Question: I am using Hive v1.2.1 with Tez. I have an external partitioned table. The partitions are hourly and of the form p=yyyy_mm_dd_hh. The situation is that these partition directories in HDFS are likely to be deleted at some point. After they are deleted, Hive still contains the metadata for that partition, and a 'show partitions' command would still list the partition whose directory was deleted from HDFS. Normally this is not likely to cause any problem, and a select query for the partition (whose
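
A common way to keep the metastore honest after HDFS directories are removed out of band is to drop the matching partition metadata before running aggregates. The statement below is only a sketch, issued here through Spark's Hive support; my_table and the hour value are placeholders, not names from the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Placeholder table and partition value; the real table is the external
# hourly-partitioned table from the question.
spark.sql("ALTER TABLE my_table DROP IF EXISTS PARTITION (p = '2019_12_11_02')")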

PySpark: Partitioning and hashing multiple dataframes, then joining

空扰寡人 submitted on 2019-12-11 10:08:48
Question: Background: I am working with clinical data spread across a lot of different .csv/.txt files. All of these files are patientID based, but they have different fields. I am importing these files into DataFrames, which I will join at a later stage after first processing each DataFrame individually. I have shown examples of two DataFrames below (df_A and df_B). Similarly, I have multiple DataFrames - df_A, df_B, df_C, ..., df_J - and I will join all of them at a later stage. df_A = spark.read.schema
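
Since the excerpt stops at the read step, here is a hedged sketch of the usual approach for this shape of problem: hash-partition every frame on the shared key with the same partition count before the joins, so the later joins do not each re-shuffle their inputs. The sample rows, field names, and the partition count of 64 are invented for illustration; only patientID and the df_A/df_B names come from the question.

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tiny stand-ins for df_A .. df_J; the real ones come from spark.read.schema(...).csv(...).
df_A = spark.createDataFrame([(1, "x"), (2, "y")], ["patientID", "field_a"])
df_B = spark.createDataFrame([(1, "u"), (3, "v")], ["patientID", "field_b"])

# Repartition every frame identically on the join key before processing and joining.
frames = [df.repartition(64, "patientID") for df in (df_A, df_B)]

joined = reduce(lambda left, right: left.join(right, on="patientID", how="outer"), frames)
joined.show()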