amazon-redshift

Pandas to_sql returning 'relation already exists' error when using if_exists='append'

Submitted by 烈酒焚心 on 2019-12-25 02:54:02
Question: I am trying to insert a data frame into a Redshift table on a daily basis. The to_sql command works when creating the table, but it returns a 'relation already exists' error when I try to append to the existing table, even with the if_exists='append' argument. Versions: pandas 0.23.4, sqlalchemy 1.2.15, psycopg2 2.7.6.1, Python 3.6.7. I am also using the monkey patch to speed up inserts outlined here: https://github.com/pandas-dev/pandas/issues/8953; without this patch the insert takes prohibitively long (several hours).
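A minimal sketch of the daily-append pattern described above (connection string, table, and schema names are hypothetical); passing an explicit schema is one way to make sure the append targets the same table that was created, though it is not guaranteed to resolve the error:

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical Redshift connection details
    engine = create_engine(
        "postgresql+psycopg2://user:password@examplecluster.redshift.amazonaws.com:5439/mydb")

    df = pd.DataFrame({"day": ["2019-01-01"], "value": [42]})

    # if_exists='append' should leave an existing table in place and only
    # create it when it is missing from the target schema.
    df.to_sql("daily_metrics", engine, schema="public",
              index=False, if_exists="append")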

How to handle quoted values in AWS Redshift unload command?

Submitted by 微笑、不失礼 on 2019-12-25 02:17:55
Question: Following the AWS docs, suppose I want to use an unload command like unload ( 'SELECT * FROM table_name WHERE day = '2019-01-01' ') to 's3://bucket_name/path' iam_role 'arn:aws:iam::<aws acct num>:role/<redshift role>' ADDQUOTES ESCAPE DELIMITER ',' GZIP ALLOWOVERWRITE; The problem is that the full query must be passed as a quoted string, so writing a string literal inside that query terminates the outer quotes before the full query (as valid SQL) is finished. How do I escape quotes inside an AWS Redshift unload command?
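One commonly used workaround, sketched below with the question's placeholders for bucket and role and a hypothetical connection, is to double the single quotes inside the inner query so the literal survives inside the outer quotes of UNLOAD:

    import psycopg2

    # Doubling '' inside the quoted query is how Redshift expects an embedded
    # single quote to be escaped.
    unload_sql = """
    UNLOAD ('SELECT * FROM table_name WHERE day = ''2019-01-01''')
    TO 's3://bucket_name/path'
    IAM_ROLE 'arn:aws:iam::<aws acct num>:role/<redshift role>'
    ADDQUOTES ESCAPE DELIMITER ',' GZIP ALLOWOVERWRITE;
    """

    conn = psycopg2.connect(dbname="mydb", host="examplecluster.redshift.amazonaws.com",
                            port=5439, user="myuser", password="mypass")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute(unload_sql)  # UNLOAD writes directly to S3; nothing is fetched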

How to SET DATEFIRST equal to Sunday in Amazon Redshift

Submitted by 风流意气都作罢 on 2019-12-25 01:29:47
Question: 2016-01-01 is week 1 of 2016, but it is also week 53 of 2015. When I run SELECT DATE_PART(w, '2016-01-01') it returns 53, but when I run SELECT DATE_PART(w, '2016-01-04') it returns 1. This most likely happens because Redshift treats Monday as day 1 of the week rather than Sunday, which is what I need. I am not sure this will solve the problem, but what I need is for 2016-01-01 to fall in week 1, 2016-01-03 through 2016-01-09 in week 2, and so on... Answer 1: Create a custom function with this
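The linked answer's function is not reproduced here; purely as an illustration (my own expression, not the answer's code), one Sunday-based week-of-year calculation that matches the numbering the question asks for can be built from the day of year and the weekday of January 1:

    import psycopg2

    # Week number with Sunday as day 1 and the week containing Jan 1 as week 1.
    # DATE_PART(dow, ...) returns 0 for Sunday through 6 for Saturday.
    sql = """
    SELECT d,
           FLOOR((DATE_PART(doy, d) + DATE_PART(dow, DATE_TRUNC('year', d)) - 1) / 7) + 1
               AS sunday_week
    FROM (SELECT DATE '2016-01-01' AS d
          UNION ALL SELECT DATE '2016-01-03'
          UNION ALL SELECT DATE '2016-01-09') AS t;
    """

    conn = psycopg2.connect(dbname="mydb", host="examplecluster.redshift.amazonaws.com",
                            port=5439, user="myuser", password="mypass")
    with conn.cursor() as cur:
        cur.execute(sql)
        print(cur.fetchall())  # expected: week 1, week 2, week 2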

How to deal with Linebreaks in redshift load?

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-25 01:29:44
Question: I have a csv which has line breaks in one of the columns, and I get the error Delimiter not found. If I replace the text so it is continuous, without line breaks, then it works. But how do I deal with the line breaks? My COPY command: COPY cat_crt_test_scores from 's3://rds-cat-crt-test-score-table/checkcsv.csv' iam_role 'arn:aws:iam::423639311527:role/RedshiftS3Access' explicit_ids delimiter '|' TIMEFORMAT 'auto' ESCAPE; The load fails with: Delimiter not found after reading till Dear Conduira, Source: https://stackoverflow.com
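One frequently suggested approach, sketched below under the assumption that the file can be produced as a properly quoted CSV, is to load it with the CSV option so that line breaks inside quoted fields are kept as part of the field (the ESCAPE option is dropped here, since it is normally not combined with CSV):

    import psycopg2

    # Quoted CSV fields may contain the delimiter and embedded line breaks.
    copy_sql = """
    COPY cat_crt_test_scores
    FROM 's3://rds-cat-crt-test-score-table/checkcsv.csv'
    IAM_ROLE 'arn:aws:iam::423639311527:role/RedshiftS3Access'
    EXPLICIT_IDS
    CSV
    DELIMITER '|'
    TIMEFORMAT 'auto';
    """

    conn = psycopg2.connect(dbname="mydb", host="examplecluster.redshift.amazonaws.com",
                            port=5439, user="myuser", password="mypass")
    with conn:
        with conn.cursor() as cur:
            cur.execute(copy_sql)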

Comparison of Explain Statement Output on Amazon Redshift

Submitted by 风格不统一 on 2019-12-24 18:13:31
Question: I have written a very complicated query in Amazon Redshift which uses 3-4 temporary tables along with sub-queries. Since the query is slow to execute, I tried replacing it with another query that uses derived tables instead of temporary tables. I just want to ask: is there any way to compare the "Explain" output for both queries, so that we can conclude which one performs better (in both space and time)? Also, how helpful is replacing temporary tables with derived tables?
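As a sketch (connection details, table, and the two query texts below are hypothetical), both plans can at least be captured side by side by running EXPLAIN on each version; keep in mind that the costs Redshift reports are relative estimates, so measuring actual runtimes is usually the more reliable comparison:

    import psycopg2

    conn = psycopg2.connect(dbname="mydb", host="examplecluster.redshift.amazonaws.com",
                            port=5439, user="myuser", password="mypass")

    queries = {
        "temp_table_version": "SELECT count(*) FROM my_schema.my_table",
        "derived_table_version": "SELECT count(*) FROM (SELECT * FROM my_schema.my_table) t",
    }

    with conn.cursor() as cur:
        for name, query in queries.items():
            cur.execute("EXPLAIN " + query)
            plan = "\n".join(row[0] for row in cur.fetchall())
            print("---- {} ----\n{}".format(name, plan))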

AWS Redshift driver in Zeppelin

Submitted by 耗尽温柔 on 2019-12-24 17:17:03
Question: I want to explore my Redshift data from a Zeppelin notebook. A small EMR cluster with Spark is running behind it. I am loading Databricks' spark-redshift library with %dep z.reset() z.load("com.databricks:spark-redshift_2.10:0.6.0") and then import org.apache.spark.sql.DataFrame val query = "..." val url = "..." val port=5439 val table = "..." val database = "..." val user = "..." val password = "..." val df: DataFrame = sqlContext.read .format("com.databricks.spark.redshift") .option("url", s"jdbc
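For reference, a rough PySpark equivalent of the truncated Scala snippet above (all connection values and the S3 tempdir are placeholders; this assumes the spark-redshift package and the Redshift JDBC driver are already on the cluster classpath and that S3 credentials are available to the cluster):

    # Run inside a %pyspark Zeppelin paragraph
    jdbc_url = "jdbc:redshift://examplecluster.redshift.amazonaws.com:5439/mydb?user=myuser&password=mypass"

    df = (sqlContext.read
          .format("com.databricks.spark.redshift")
          .option("url", jdbc_url)
          .option("query", "SELECT * FROM my_table LIMIT 10")
          .option("tempdir", "s3n://my-bucket/tmp/")
          .load())

    df.show()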

Concurrency issue with psycopg2, Redshift, and unittest

Submitted by ▼魔方 西西 on 2019-12-24 16:35:15
Question: I am on Python 2.7, using psycopg2 to connect to an Amazon Redshift database. I have unit tests, and in the setUp and tearDown methods for this test class I drop the tables that were created for the purpose of the test. So the scheme is: def setUp(self): drop_specific_tables() create_specific_tables() def tearDown(self): drop_specific_tables() The reason for dropping in setUp as well as tearDown is that, in case a test exits unsafely and skips tearDown, we can still know that whenever it runs
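A minimal sketch of that fixture pattern with psycopg2 (connection details and table name are hypothetical); turning on autocommit is one common way to keep DDL statements from holding locks across tests on Redshift:

    import unittest
    import psycopg2

    class RedshiftTableTest(unittest.TestCase):
        def setUp(self):
            # Hypothetical connection details
            self.conn = psycopg2.connect(dbname="mydb",
                                         host="examplecluster.redshift.amazonaws.com",
                                         port=5439, user="myuser", password="mypass")
            self.conn.autocommit = True  # avoid holding locks across statements
            with self.conn.cursor() as cur:
                cur.execute("DROP TABLE IF EXISTS test_table;")
                cur.execute("CREATE TABLE test_table (id INT, value VARCHAR(32));")

        def tearDown(self):
            with self.conn.cursor() as cur:
                cur.execute("DROP TABLE IF EXISTS test_table;")
            self.conn.close()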

Want to Connect redshift to R

Submitted by 佐手、 on 2019-12-24 13:59:04
Question: I tried to use the code from this link but I got an error: driver <- JDBC("com.amazon.redshift.jdbc41.Driver", "RedshiftJDBC41-1.1.9.1009.jar", identifier.quote="`") JavaVM: requested Java version ((null)) not available. Using Java at "" instead. JavaVM: Failed to load JVM: /bundle/Libraries/libserver.dylib JavaVM FATAL: Failed to load the jvm library. Error in .jinit(classPath) : JNI_GetCreatedJavaVMs returned -1 The error occurs after loading the driver and trying to connect. I don't know how to connect

Python psycopg2 insert NULL in some rows in postgresql table

Submitted by 帅比萌擦擦* on 2019-12-24 11:54:03
Question: I have a Python dataframe with NULL values in some rows. While inserting into PostgreSQL, some nulls in a date-type column turn into the string 'NaT' or 'NaN'; I would like them to be a real NULL, i.e. nothing in that cell. Sample dataframe before insert: import psycopg2 import pandas as pd import numpy as np conn=psycopg2.connect(dbname= 'myDB', host='amazonaws.com', port= '2222', user= 'mysuser', password= 'mypass') cur = conn.cursor() df= pd.DataFrame({ 'zipcode':[1,np.nan,22,88],'city':['A','h','B',np
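One widely used workaround, sketched here with a hypothetical target table my_table, is to convert NaN/NaT to Python None before the insert, since psycopg2 maps None to SQL NULL:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'zipcode': [1, np.nan, 22, 88],
                       'city': ['A', 'h', 'B', None],
                       'created': pd.to_datetime(['2019-01-01', None, '2019-01-03', None])})

    # Cast to object first so NaN/NaT can be replaced by Python None,
    # which psycopg2 sends to the database as a real NULL.
    clean = df.astype(object).where(pd.notnull(df), None)

    rows = [tuple(r) for r in clean.itertuples(index=False, name=None)]
    # cur.executemany("INSERT INTO my_table (zipcode, city, created) VALUES (%s, %s, %s)", rows)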

AWS Glue Crawlers and large tables stored in S3

Submitted by 北战南征 on 2019-12-24 10:18:40
Question: I have a general question about AWS Glue and its crawlers. Some data streams into S3 buckets, and I use AWS Athena to access it as external tables in Redshift. The tables are partitioned by hour, and Glue crawlers update the partitions and the table structure every hour. The problem is that the crawlers take longer and longer, and someday they will not finish in less than an hour. Is there some setting to speed up this process, or some proper alternative to the crawlers in
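One alternative that is often suggested for hourly partitions (sketched here with a hypothetical database, table, and bucket layout) is to register each new partition directly instead of re-crawling the whole table, for example via an Athena DDL statement issued from boto3:

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    # Register the new hourly partition directly instead of re-crawling the table.
    ddl = """
    ALTER TABLE my_database.my_events
    ADD IF NOT EXISTS PARTITION (day='2019-12-24', hour='10')
    LOCATION 's3://my-bucket/events/day=2019-12-24/hour=10/'
    """

    athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )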