emr

Need strategy advice for migrating large tables from RDS to DynamoDB

让人想犯罪 __ submitted on 2019-12-01 06:44:41
Question: We have a couple of MySQL tables in RDS that are huge (over 700 GB) and that we'd like to migrate to a DynamoDB table. Can you suggest a strategy, or a direction, to do this in a clean, parallelized way? Perhaps using EMR or AWS Data Pipeline.
Answer 1: You can use AWS Data Pipeline. There are two basic templates: one for moving RDS tables to S3, and a second for importing data from S3 into DynamoDB. You can create your own pipeline that combines both templates. Regards
Answer 2: one thing to consider with such
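
Once the export has landed in S3, the S3-to-DynamoDB leg can also be run as a parallel job on EMR using Hive's DynamoDB storage handler. A minimal sketch; the table names, columns, and S3 path are placeholders, not from the question:

-- Hive table over the RDS export that was written to S3 (placeholder path and schema)
CREATE EXTERNAL TABLE rds_export (id BIGINT, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/rds-export/';

-- External table backed by the target DynamoDB table via EMR's storage handler
CREATE EXTERNAL TABLE ddb_target (id BIGINT, payload STRING)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "MyDynamoTable",
  "dynamodb.column.mapping" = "id:id,payload:payload"
);

-- The INSERT runs as a parallel MapReduce job; the write rate is throttled by
-- dynamodb.throughput.write.percent (a fraction of the table's provisioned capacity).
SET dynamodb.throughput.write.percent = 1.0;
INSERT OVERWRITE TABLE ddb_target SELECT id, payload FROM rds_export;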

Apache Hive: How to convert string to timestamp?

有些话、适合烂在心里 submitted on 2019-11-30 21:45:25
Question: I'm trying to convert the string in the REC_TIME column to a timestamp format in Hive. Ex: Sun Jul 31 09:28:20 UTC 2016 => 2016-07-31 09:28:20
SELECT xxx, UNIX_TIMESTAMP(REC_TIME, "E M dd HH:mm:ss z yyyy") FROM wlogs LIMIT 10;
When I execute the above SQL it returns a NULL value.
Answer: Try this:
select from_unixtime(unix_timestamp("Sun Jul 31 09:28:20 UTC 2016","EEE MMM dd HH:mm:ss zzz yyyy"));
This works fine if your Hive cluster is in the UTC timezone. But suppose your server is in CST; then you need to do as below to get to UTC:
select to_utc_timestamp(from_unixtime(unix_timestamp("Sun Jul 31 09:28:20 UTC
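
Applied to the original query, the fix is the format pattern; a sketch assuming the same wlogs table and columns from the question:

-- Each field needs its full pattern: EEE (day name), MMM (month name), zzz (time zone)
SELECT xxx,
       from_unixtime(unix_timestamp(REC_TIME, 'EEE MMM dd HH:mm:ss zzz yyyy')) AS rec_ts
FROM wlogs
LIMIT 10;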

Optimizing GC on EMR cluster

*爱你&永不变心* submitted on 2019-11-30 18:45:40
Question: I am running a Spark job written in Scala on EMR, and the stdout of each executor is filled with GC allocation failures:
2016-12-07T23:42:20.614+0000: [GC (Allocation Failure) 2016-12-07T23:42:20.614+0000: [ParNew: 909549K->432K(1022400K), 0.0089234 secs] 2279433K->1370373K(3294336K), 0.0090530 secs] [Times: user=0.11 sys=0.00, real=0.00 secs]
2016-12-07T23:42:21.572+0000: [GC (Allocation Failure) 2016-12-07T23:42:21.572+0000: [ParNew: 909296K->435K(1022400K), 0.0089298 secs] 2279237K->1370376K(3294336K), 0.0091147 secs] [Times: user=0.11 sys=0.01, real=0.00 secs]
2016-12-07T23:42:22.525+0000:
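
If the goal is to change the collector or get more GC detail per executor, the usual lever on EMR is the executor JVM options passed through spark-submit. A sketch only; the memory size, class name, and jar are placeholders to be tuned to the instance type and job:

# Placeholder sizes and names; spark.executor.extraJavaOptions is applied to every executor JVM
spark-submit \
  --deploy-mode cluster \
  --executor-memory 8g \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps" \
  --class com.example.MyJob \
  my-job.jar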

Get a yarn configuration from commandline

天涯浪子 submitted on 2019-11-30 17:53:05
Question: In EMR, is there a way to get a specific value of the configuration, given the configuration key, using the yarn command? For example, I would like to do something like this:
yarn get-config yarn.scheduler.maximum-allocation-mb
Answer: It's a bit non-intuitive, but it turns out the hdfs getconf command is capable of checking configuration properties for YARN and MapReduce, not only HDFS.
> hdfs getconf -confKey fs.defaultFS
hdfs://localhost:19000
> hdfs getconf -confKey dfs.namenode.name.dir
file:///Users/chris/hadoop-deploy-trunk/data/dfs/name
> hdfs getconf -confKey yarn.resourcemanager.address
0.0.0.0
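
For the key the question asks about, the same trick applies; the fallback of reading yarn-site.xml directly is an assumption about the standard EMR config path, not something stated in the answer:

# Same approach, with the key from the question
hdfs getconf -confKey yarn.scheduler.maximum-allocation-mb

# Fallback: read the property straight from yarn-site.xml (usual EMR location; adjust if yours differs)
grep -A1 'yarn.scheduler.maximum-allocation-mb' /etc/hadoop/conf/yarn-site.xml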

How to install a GUI on Amazon AWS EC2 or EMR with the Amazon AMI

白昼怎懂夜的黑 submitted on 2019-11-30 16:13:30
Question: I need to run an application that requires a GUI to start and configure, and I need to be able to run it on Amazon's EC2 and EMR services. The EMR requirement means it has to run on Amazon's Linux AMI. After extensive searching I've been unable to find any ready-made solutions, in particular none that meet the requirement to run on Amazon's AMI. The closest match, and the most often referenced solution, is here. Unfortunately it was developed on a RHEL6 instance, which differs enough from Amazon's AMI that the solution does not work. I'm posting my solution below. Hopefully
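
The lightest-weight alternative to installing a full desktop, when popping up single GUI windows is enough, is X11 forwarding over SSH. This is a sketch, not the poster's solution, and it assumes the usual Amazon Linux / RHEL package names are available on the AMI in use:

# Install an X authority helper and a simple test client (package names may differ by AMI version)
sudo yum install -y xorg-x11-xauth xterm

# From a workstation running an X server, connect with X11 forwarding enabled:
#   ssh -X -i mykey.pem hadoop@<emr-master-public-dns>
# then launch the GUI application from that shell; xterm makes a quick test:
xterm &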

Exporting Hive Table to a S3 bucket

孤街浪徒 submitted on 2019-11-30 04:46:32
Question: I've created a Hive table through an Elastic MapReduce interactive session and populated it from a CSV file like this:
CREATE TABLE csvimport(id BIGINT, time STRING, log STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/home/hadoop/file.csv' OVERWRITE INTO TABLE csvimport;
I now want to store the Hive table in an S3 bucket so the table is preserved once I terminate the MapReduce instance. Does anyone know how to do this?
Answer (user495732 Why Me): Yes, you have to export and import your data at the start and end of your Hive session. To do this you need to create a table
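
A sketch of the usual pattern: an EXTERNAL table whose LOCATION is an S3 path, populated from the local table. The bucket and prefix below are placeholders:

-- External table whose data lives in S3 and therefore survives cluster termination
CREATE EXTERNAL TABLE csvexport (id BIGINT, time STRING, log STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://my-bucket/csvexport/';

-- Write the local table's rows out to the S3-backed table
INSERT OVERWRITE TABLE csvexport SELECT * FROM csvimport;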

How do you make a HIVE table out of JSON data?

♀尐吖头ヾ submitted on 2019-11-29 19:44:39
Question: I want to create a Hive table out of some JSON data (nested) and run queries on it. Is this even possible? I've gotten as far as uploading the JSON file to S3 and launching an EMR instance, but I don't know what to type in the Hive console to get the JSON file to become a Hive table. Does anyone have an example command to get me started? I can't find anything useful with Google ...
Answer: You'll need to use a JSON SerDe in order for Hive to map your JSON to the columns in your table. A really good example showing you how is here: http://aws.amazon.com/articles/2855 Unfortunately the JSON SerDe supplied
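
A sketch using the open-source Hive-JSON-Serde (not necessarily the SerDe from the linked article); the jar path, schema, and S3 location are placeholders:

-- Register the SerDe jar (download it to the master node first; path is a placeholder)
ADD JAR /home/hadoop/json-serde-with-dependencies.jar;

-- Nested JSON maps onto STRUCT/ARRAY/MAP columns
CREATE EXTERNAL TABLE json_events (
  id STRING,
  payload STRUCT<kind:STRING, value:BIGINT>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/json-input/';

SELECT id, payload.kind FROM json_events LIMIT 10;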
