Hadoop

Select if table exists in Apache Hive

Posted by 坚强是说给别人听的谎言 on 2021-02-07 14:49:24
Question: I have a Hive query of the form

    select . . . from table1 left join (select . . . from table2) on (some_condition)

table2 might not be present, depending on the environment, so I would like to perform the join only when table2 exists and otherwise ignore the subquery. The query below returns the table name if the table exists:

    show tables in {DB_NAME} like '{table_name}'

but I don't know how to integrate this into my query so that it selects from table2 only if it exists. Is there a way in a Hive query to check whether a table exists?
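Hive itself cannot branch on table existence inside a single query, so one workaround is to make the decision outside HiveQL. Below is a minimal sketch of that idea in Spark/Scala, assuming a SparkSession with Hive support; the database name, join column, and subquery are placeholders, and spark.catalog.tableExists answers the same question as the SHOW TABLES probe above. The same check-then-choose pattern can also be done in a wrapper shell script around hive/beeline.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("conditional-join-sketch")
      .enableHiveSupport()
      .getOrCreate()

    val base = spark.table("db_name.table1")                 // hypothetical database name

    // Join against table2 only when it actually exists in the metastore;
    // otherwise fall back to table1 alone, which mirrors "ignore the subquery".
    val result =
      if (spark.catalog.tableExists("db_name", "table2")) {
        val sub = spark.sql("select /* ... */ * from db_name.table2")  // stand-in for the subquery
        base.join(sub, base("join_col") === sub("join_col"), "left")   // hypothetical join condition
      } else {
        base
      }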

hadoop hdfs points to file:/// not hdfs://

Posted by 二次信任 on 2021-02-07 13:44:02
Question: I installed Hadoop via Cloudera Manager (CDH3u5) on CentOS 5. When I run

    hadoop fs -ls /

I expected to see the contents of hdfs://localhost.localdomain:8020/, but it returned the contents of file:/// instead. It goes without saying that I can reach HDFS explicitly with

    hadoop fs -ls hdfs://localhost.localdomain:8020/

but when installing other applications such as Accumulo, Accumulo automatically detects the Hadoop filesystem as file:///. Has anyone run into this before?
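This symptom usually means the client-side configuration never sets a default filesystem, so Hadoop falls back to the local one. A minimal core-site.xml sketch is shown below, assuming the NameNode address from the question; on CDH3 the property is fs.default.name (newer Hadoop releases call it fs.defaultFS), and the file has to be on the classpath (HADOOP_CONF_DIR) of whatever tool is doing the detecting, Accumulo included.

    <!-- core-site.xml: sets the default filesystem so "hadoop fs -ls /" resolves to HDFS -->
    <configuration>
      <property>
        <name>fs.default.name</name>   <!-- fs.defaultFS on Hadoop 2 and later -->
        <value>hdfs://localhost.localdomain:8020</value>
      </property>
    </configuration>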

Running shell script from oozie through Hue

Posted by 时光怂恿深爱的人放手 on 2021-02-07 13:18:23
Question: I am invoking a bash shell script using the Oozie editor in Hue. I used the shell action in the workflow and tried the following options for the shell command:

- uploaded the shell script using 'choose a file'
- gave the local directory path where the shell script is present
- gave the HDFS path where the shell script is present

All of these options gave the following error:

    Cannot run program "sec_test_oozie.sh" (in directory "/data/hadoop/yarn/local/usercache/user/appcache/application_1399542362142_0086/container…
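This error normally means the script was never shipped into the YARN container's working directory. A sketch of the usual fix is below: a <file> element in the shell action pointing at the script's HDFS location (the /user/... path here is a placeholder), which Oozie then localizes next to the <exec> command; the Hue editor exposes the same thing as the action's "Files" field.

    <action name="shell-node">
      <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>sec_test_oozie.sh</exec>
        <!-- Copies the script from HDFS into the container's working directory,
             so "sec_test_oozie.sh" is found at run time. The path is a placeholder. -->
        <file>/user/your_user/scripts/sec_test_oozie.sh#sec_test_oozie.sh</file>
      </shell>
      <ok to="end"/>
      <error to="fail"/>
    </action>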

Simple User/Password authentication for HiveServer2 (without Kerberos/LDAP)

Posted by 蹲街弑〆低调 on 2021-02-07 12:51:41
Question: How can I provide simple property-file or database user/password authentication for HiveServer2? I already found a presentation about this, but it is not in English. The Cloudera reference manual talks about the hive.server2.authentication property, which supports a CUSTOM mode where you plug in your own implementation via hive.server2.custom.authentication.class. How do I implement that?

Answer 1: In essence, you have to provide a Java class that performs your authentication, for example against a MySQL database or a property file.
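A minimal sketch of such a class in Scala (the interface itself is Java and ships with hive-service): it implements org.apache.hive.service.auth.PasswdAuthenticationProvider and checks credentials against a user=password property file. The file path and format are assumptions for illustration; HiveServer2 is then pointed at the class with hive.server2.authentication=CUSTOM and hive.server2.custom.authentication.class, with the jar placed on HiveServer2's classpath.

    import javax.security.sasl.AuthenticationException
    import org.apache.hive.service.auth.PasswdAuthenticationProvider
    import scala.io.Source

    class PropertyFileAuthenticator extends PasswdAuthenticationProvider {

      // Hypothetical credential store: one "user=password" entry per line.
      private val credentials: Map[String, String] = {
        val src = Source.fromFile("/etc/hive/conf/hive-users.properties")
        try src.getLines()
             .map(_.trim)
             .filter(line => line.nonEmpty && line.contains("="))
             .map { line => val Array(u, p) = line.split("=", 2); u -> p }
             .toMap
        finally src.close()
      }

      // HiveServer2 calls this for every new connection; throwing rejects the login.
      override def Authenticate(user: String, password: String): Unit = {
        if (!credentials.get(user).contains(password)) {
          throw new AuthenticationException(s"Invalid user name or password for '$user'")
        }
      }
    }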

Spark + Hive : Number of partitions scanned exceeds limit (=4000)

Posted by 有些话、适合烂在心里 on 2021-02-07 11:03:50
Question: We upgraded our Hadoop platform (Spark 2.3.0, Hive 3.1), and I'm now facing this exception when reading some Hive tables in Spark: "Number of partitions scanned on table 'my_table' exceeds limit (=4000)". The tables we are working on:

- table1: external table with a total of ~12,300 partitions, partitioned by (col1: String, date1: String), stored as ORC compressed with ZLIB
- table2: external table with a total of 4,585 partitions, partitioned by (col21: String, date2: Date, col22: String), stored as ORC uncompressed
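The cap itself is enforced on the metastore side (typically the hive.metastore.limit.partition.request setting), so the usual ways out are to raise that limit or, better, to query with predicates on the partition columns so that Spark's metastore partition pruning only requests the partitions it needs. A sketch of the second approach, using the table and column names from the question and a made-up predicate:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder()
      .appName("partition-pruning-sketch")
      // Push partition predicates down to the Hive metastore.
      .config("spark.sql.hive.metastorePartitionPruning", "true")
      .enableHiveSupport()
      .getOrCreate()

    // Filtering on the partition columns (col1, date1) means the metastore returns only
    // the matching partitions instead of all ~12300, staying under the 4000 cap.
    val df = spark.table("table1")
      .where(col("date1") >= "2020-01-01" && col("col1") === "some_value")  // hypothetical values

    df.show()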

How to read and write Parquet files efficiently?

Posted by 心不动则不痛 on 2021-02-07 10:50:32
Question: I am working on a utility that reads multiple Parquet files at a time and writes them into one single output file. The implementation is very straightforward: it reads the Parquet files from a directory, reads the Group records from all the files and puts them into a list, and then uses ParquetWriter to write all these Groups into a single file. After reading about 600 MB it throws an out-of-memory error for Java heap space, and it also takes 15-20 minutes to read and write 500 MB of data. Is there a way to make this more efficient?
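Buffering every Group in a list is what exhausts the heap; streaming each record straight from a reader into the writer keeps memory flat. Below is a sketch in Scala on top of parquet-hadoop's example Group API, assuming all input files share the same schema; the paths are placeholders. If the schemas really are identical, merging at the row-group level (ParquetFileWriter's append support, as used by parquet-tools merge) can avoid decoding records at all, but the streaming version is the smallest change to the approach described.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.parquet.example.data.Group
    import org.apache.parquet.format.converter.ParquetMetadataConverter
    import org.apache.parquet.hadoop.example.{ExampleParquetWriter, GroupReadSupport}
    import org.apache.parquet.hadoop.{ParquetFileReader, ParquetReader}

    val conf = new Configuration()
    val inputDir = new Path("/data/parquet/input")           // placeholder
    val output = new Path("/data/parquet/merged.parquet")    // placeholder

    val files = FileSystem.get(conf).listStatus(inputDir)
      .map(_.getPath).filter(_.getName.endsWith(".parquet"))

    // Take the schema from the first file's footer; no row data is loaded for this.
    val schema = ParquetFileReader
      .readFooter(conf, files.head, ParquetMetadataConverter.NO_FILTER)
      .getFileMetaData.getSchema

    val writer = ExampleParquetWriter.builder(output)
      .withConf(conf)
      .withType(schema)
      .build()

    // Stream one Group at a time from each reader into the writer: nothing is
    // accumulated in memory, so heap usage no longer grows with input size.
    for (file <- files) {
      val reader = ParquetReader.builder(new GroupReadSupport(), file).withConf(conf).build()
      try {
        var group: Group = reader.read()
        while (group != null) {
          writer.write(group)
          group = reader.read()
        }
      } finally {
        reader.close()
      }
    }
    writer.close()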

How to convert an Iterable to an RDD

Posted by 戏子无情 on 2021-02-07 10:45:26
Question: To be more specific, how can I convert a scala.Iterable to an org.apache.spark.rdd.RDD? I have an RDD of (String, Iterable[(String, Integer)]) and I want this to be converted into an RDD of (String, RDD[(String, Integer)]), so that I can apply a reduceByKey function to the internal RDD. For example, I have an RDD where the key is the 2-letter prefix of a person's name and the value is a list of pairs of the person's name and the hours they spent in an event. My RDD is: ("To", List(("Tom",50),("Tod","30"),("Tom",70 …
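RDDs cannot be nested inside other RDDs, so the (String, RDD[(String, Int)]) shape is not achievable; the usual alternative is to flatten to a composite key and run a single reduceByKey over the whole data set. A sketch, using made-up sample data shaped like the question's:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("iterable-to-rdd-sketch").setMaster("local[*]"))

    // Sample data shaped like the question's RDD[(String, Iterable[(String, Int)])].
    val byPrefix = sc.parallelize(Seq(
      ("To", Seq(("Tom", 50), ("Tod", 30), ("Tom", 70), ("Tod", 25))),
      ("Ho", Seq(("Hom", 90), ("Hop", 10)))
    ))

    // Flatten to a composite (prefix, name) key, then reduce once over the whole RDD;
    // this yields the same per-name totals a nested reduceByKey would have produced.
    val totals = byPrefix
      .flatMap { case (prefix, people) => people.map { case (name, hours) => ((prefix, name), hours) } }
      .reduceByKey(_ + _)

    totals.collect().foreach(println)   // e.g. ((To,Tom),120), ((To,Tod),55), ...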

What determines the number of mappers/reducers to use given a specified set of data [closed]

Posted by 情到浓时终转凉″ on 2021-02-07 10:35:30
Question: (Closed 8 years ago as not a good fit for the Q&A format.) What are the factors that decide the number of mappers and reducers to use for a given set of data in order to achieve optimal performance?
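As a rough rule, the number of map tasks equals the number of input splits, which FileInputFormat derives from the HDFS block size bounded by the configured minimum and maximum split sizes, while the number of reducers is whatever the job sets explicitly (mapreduce.job.reduces). A small sketch of that arithmetic in Scala, with made-up sizes:

    // Split size as computed by FileInputFormat: max(minSplit, min(maxSplit, blockSize)).
    def splitSize(blockSize: Long, minSplit: Long, maxSplit: Long): Long =
      math.max(minSplit, math.min(maxSplit, blockSize))

    // Example: a 1 GiB input file, 128 MiB blocks, default min/max split sizes
    // -> 128 MiB splits -> 8 map tasks.
    val fileSize   = 1024L * 1024 * 1024
    val split      = splitSize(blockSize = 128L * 1024 * 1024, minSplit = 1L, maxSplit = Long.MaxValue)
    val numMappers = math.ceil(fileSize.toDouble / split).toInt   // 8

    // Reducers are not derived from the data: they are set on the job, e.g.
    // job.setNumReduceTasks(8) in the Java API or -D mapreduce.job.reduces=8,
    // and are typically tuned to the cluster's reduce container capacity.
    println(s"splits/mappers = $numMappers")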