hadoop-partitioning | 易学教程

Unable to set partitioner on the JobConf object

Question: I wrote a custom partitioner but cannot set it on the JobConf object in the main class.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstCharTextPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numReduceTasks) {
            return (key.toString().charAt(0)) % numReduceTasks;
        }
    }

But when I try to set this on the JobConf object, I get the following error: The method setPartitionerClass(Class)
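The partitioning rule in the snippet above can be checked without a cluster. The following is a minimal plain-Python sketch of that arithmetic only; it does not use or reproduce the Hadoop API, and the function name and sample keys are hypothetical. It assumes keys whose first character is in the Basic Multilingual Plane, where Python's `ord` matches Java's `charAt` value.

```python
def first_char_partition(key: str, num_reduce_tasks: int) -> int:
    """Model of the question's getPartition: first character code modulo reducer count."""
    if not key:
        # Java's charAt(0) would throw on an empty string; fail loudly here as well.
        raise ValueError("key must be non-empty")
    return ord(key[0]) % num_reduce_tasks


# Hypothetical sample keys, three reducers.
for sample_key in ["apple", "avocado", "banana", "cherry"]:
    print(sample_key, "->", first_char_partition(sample_key, 3))
```

Keys that share a first character always land in the same partition, so the balance across reducers depends entirely on how first characters are distributed in the data.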

Alternative to the default HashPartitioner provided with Hadoop

Question: I have a Hadoop MapReduce program that distributes keys unevenly. Some reducers end up with two keys, some with one key, and some with none. How do I force Hadoop to send the partition for each key to a separate reducer? I have nine unique keys of the form: 0,0 0,1 0,2 1,0 1,1 1,2 2,0 2,1 2,2 and I set job.setNumReduceTasks(9); but the HashPartitioner seems to hash two keys to the same hash code, causing overlapping keys to be sent to the same reducer and leaving some reducers
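The collision described above can be reproduced in a small model. Hadoop's HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; the sketch below assumes the keys hash like Java `String` values (a `Text` key uses a different hash function, so exact partition numbers may differ on a real job). The `grid_partition` function is a hypothetical alternative that maps each "row,col" key directly to its own partition.

```python
def java_string_hash(s: str) -> int:
    """Emulate Java's String.hashCode() with 32-bit wraparound."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def hash_partition(key: str, num_reduce_tasks: int) -> int:
    """Model of HashPartitioner: clear the sign bit, then take the modulo."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks


def grid_partition(key: str, cols: int = 3) -> int:
    """Hypothetical custom rule: key "r,c" goes to partition r * cols + c."""
    row, col = (int(part) for part in key.split(","))
    return row * cols + col


KEYS = [f"{r},{c}" for r in range(3) for c in range(3)]
hash_parts = {k: hash_partition(k, 9) for k in KEYS}
grid_parts = {k: grid_partition(k) for k in KEYS}

print("hash-based:", hash_parts)
print("grid-based:", grid_parts)
```

Under this model the hash-based rule uses only seven of the nine partitions, while the grid-based rule assigns each key a distinct one. A custom Partitioner implementing the same arithmetic would be one way to get one key per reducer, assuming the key set is fixed and known in advance.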

Hive EXPLAIN plan not showing partition

Question: I have a table that contains 251M records and is 2.5 GB in size. I partitioned it on the two columns that I filter on in the predicate, but the EXPLAIN plan does not show that it is reading a partition. I am selecting on the partition columns and inserting the result into another table. Is there a particular order in which I have to list the conditions in the predicate? How can I improve performance? explain SELECT '123' AS run_session_id , tbl1.transaction_id , tbl1.src_transaction

Hadoop webuser: No such user

Question: While running a multi-node Hadoop cluster, I got the error messages below in my master logs. Can someone advise what to do? Do I need to create a new user, or can I give my existing machine user name here?

    2013-07-25 19:41:11,765 WARN org.apache.hadoop.security.UserGroupInformation: No groups available for user webuser
    2013-07-25 19:41:11,778 WARN org.apache.hadoop.security.ShellBasedUnixGroupsMapping: got exception trying to get groups for user webuser org.apache.hadoop.util.Shell

Spark Clustered By/Bucket by dataset not using memory

Question: I recently came across Spark bucketBy/clusteredBy here. I tried to mimic this for a 1.1 TB source file from S3 (already in Parquet). The plan is to avoid shuffles completely, as most of the datasets are always joined on the "id" column. Here is what I am doing:

    myDf.repartition(20)
      .write.partitionBy("day")
      .option("mode", "DROPMALFORMED")
      .option("compression", "snappy")
      .option("path","s3://my-bucket/folder/1year_data_bucketed/").mode("overwrite")
      .format("parquet").bucketBy(20,"id").sortBy("id")

Passing multiple dates as parameters to a Hive query

Question: I am trying to pass a list of dates as a parameter to my Hive query.

    #!/bin/bash
    echo "Executing the hive query - Get distinct dates"
    var=`hive -S -e "select distinct substr(Transaction_date,0,10) from test_dev_db.TransactionUpdateTable;"`
    echo $var
    echo "Executing the hive query - Get the parition data"
    hive -hiveconf paritionvalue=$var -e 'SELECT Product FROM test_dev_db.TransactionMainHistoryTable where tran_date in("${hiveconf:paritionvalue}");'
    echo "Hive query - ends"

Output as: Executing
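One plausible reading of the script above (not confirmed by the truncated excerpt) is that the first query returns several dates on separate lines, while the IN clause needs each date individually quoted and comma-separated. The sketch below shows only that string transformation. The function name and sample dates are hypothetical, and it assumes every value has the form yyyy-mm-dd.

```python
import re

# Assumption: substr(Transaction_date, 0, 10) yields values shaped like 2013-07-25.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def to_in_list(hive_output: str) -> str:
    """Turn newline-separated dates into a quoted, comma-separated list."""
    dates = [line.strip() for line in hive_output.splitlines() if line.strip()]
    for d in dates:
        # Reject anything that is not a plain date, so nothing unexpected reaches the query text.
        if not DATE_RE.fullmatch(d):
            raise ValueError(f"unexpected value in date list: {d!r}")
    return ",".join(f"'{d}'" for d in dates)


# Hypothetical output of the first query.
sample = "2013-07-24\n2013-07-25\n"
query = (
    "SELECT Product FROM test_dev_db.TransactionMainHistoryTable "
    "WHERE tran_date IN (" + to_in_list(sample) + ");"
)
print(query)
```

With the list already quoted, the IN clause would no longer wrap the substituted value in a single pair of double quotes, which otherwise makes it one string literal rather than a list of dates.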

Who will get a chance to execute first, Combiner or Partitioner?

Question: I'm confused after reading the passage below from Hadoop: The Definitive Guide, 4th edition (page 204): "Before it writes to disk, the thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key, and if there is a combiner function, it is run on the output of the sort. Running the combiner function makes for a more compact map output, so there is less data to write to local disk and to transfer to the reducer." Here is my doubt: 1) Which will execute first, the combiner or
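The ordering in the quoted passage (partition first, then sort within each partition, then run the combiner on the sorted output) can be made concrete with a toy model. This is a plain-Python sketch of those three steps only, not Hadoop's actual spill implementation; the function names, the sample records, and the partitioning rule are all hypothetical.

```python
from collections import defaultdict


def spill(records, num_partitions, partitioner, combiner=None):
    """Toy model of one map-side spill: partition, sort per partition, then combine."""
    # Step 1: assign every record to the partition of its target reducer.
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partitioner(key, num_partitions)].append((key, value))

    output = {}
    for p in range(num_partitions):
        # Step 2: sort by key within the partition.
        sorted_records = sorted(partitions[p], key=lambda kv: kv[0])
        # Step 3: if a combiner is configured, run it on the sorted output.
        if combiner is not None:
            grouped = defaultdict(list)
            for key, value in sorted_records:
                grouped[key].append(value)
            sorted_records = [(k, combiner(k, vs)) for k, vs in grouped.items()]
        output[p] = sorted_records
    return output


# Hypothetical word-count style records, two reducers, summing combiner.
records = [("b", 1), ("a", 1), ("b", 1), ("c", 1)]
result = spill(
    records,
    num_partitions=2,
    partitioner=lambda key, n: ord(key[0]) % n,
    combiner=lambda key, values: sum(values),
)
print(result)
```

In this model the partitioner has already run by the time the combiner sees any data, and the combiner only ever processes records from a single partition.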

New user SSH hadoop

Question: When installing Hadoop on a single-node cluster, why do we need to set up the following? Why do we need SSH access for a new user? Why should it be able to connect to its own user account? Why should I specify passwordless access for a new user? When all the nodes are on the same machine, why do they communicate explicitly? http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ Answer (Tariq): Why do we need SSH access for a new user? Because you want to communicate with the user who is running the Hadoop daemons. Notice that ssh is actually from a user (on