bigdata

How to restart a failed task on Airflow

Submitted by 别说谁变了你拦得住时间么 on 2019-12-18 10:50:11
Question: I am using a LocalExecutor and my DAG has three tasks, where task C depends on task A. Task A and task B can run in parallel, roughly: A --> C, with B independent. Task A failed but task B ran fine. Task C has not run yet because task A failed. My question is: how do I re-run task A alone, so that task C runs once task A completes and the Airflow UI marks them as successful?

Answer 1: In the UI: go to the DAG, then to the DAG run you want to change. Click on Graph View. Click on task A. Click "Clear".
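The reason this works is that "Clear" (with its default Downstream option) resets not just the failed task but everything downstream of it, so the scheduler re-runs A and then C while leaving B's successful state alone. A small pure-Python sketch of that selection logic (the `downstream` dict is a made-up stand-in for the DAG's dependency graph, not an Airflow API):

```python
def tasks_to_rerun(failed_task, downstream):
    """Collect a failed task plus everything downstream of it.

    Mirrors what Airflow's "Clear" button does with the Downstream
    option: clearing A also clears C, so both get re-scheduled.
    `downstream` maps each task id to the set of task ids that
    depend on it (hypothetical structure for illustration).
    """
    to_clear, stack = set(), [failed_task]
    while stack:
        task = stack.pop()
        if task not in to_clear:
            to_clear.add(task)
            stack.extend(downstream.get(task, ()))
    return to_clear

# The DAG from the question: A --> C, with B independent.
deps = {"A": {"C"}, "B": set(), "C": set()}
```

Here `tasks_to_rerun("A", deps)` yields `{"A", "C"}`, while clearing B alone would touch only B.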

Python generator to read large CSV file

Submitted by 时光怂恿深爱的人放手 on 2019-12-18 08:47:18
Question: I need to write a Python generator that yields tuples (X, Y) coming from two different CSV files. It should receive a batch size on init, read line after line from the two CSVs, and yield a tuple (X, Y) for each line, where X and Y are arrays (the columns of the CSV files). I've looked at examples of lazy reading but I'm finding it difficult to adapt them for CSVs: "Lazy Method for Reading Big File in Python?" and "Read large text files in Python, line by line without loading it in to memory". Also,
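A minimal sketch of such a generator, assuming the two CSVs have the same number of lines (file paths and the batching behaviour are my assumptions, since the question is truncated). It never holds more than one batch in memory:

```python
import csv

def csv_pair_batches(x_path, y_path, batch_size):
    """Lazily yield lists of (X, Y) tuples from two parallel CSV files.

    X and Y are the column lists of one line from each file; only the
    current batch is kept in memory, so arbitrarily large files work.
    """
    with open(x_path, newline="") as fx, open(y_path, newline="") as fy:
        batch = []
        # zip pulls one row at a time from each reader (both are lazy)
        for x_row, y_row in zip(csv.reader(fx), csv.reader(fy)):
            batch.append((x_row, y_row))
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:  # final partial batch
            yield batch
```

Usage would look like `for batch in csv_pair_batches("x.csv", "y.csv", 32): ...`, with each `batch` a list of up to 32 `(X, Y)` pairs.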

Convert PL/SQL to Hive QL

Submitted by 六月ゝ 毕业季﹏ on 2019-12-18 08:42:51
Question: I want a tool through which I can get the corresponding Hive query by giving it a PL/SQL query. There are lots of tools available that convert SQL to HQL, e.g. Toad for Cloud Databases, but it does not show me the corresponding Hive query. Is there any such tool that converts a given SQL query to HQL? Please help me. Thanks and regards, Ratan

Answer 1: Please take a look at the open-source project PL/HQL at http://www.hplsql.org/, which is now a part of Hive 2.x and higher versions. It allows you to run

Is it a good idea to generate per day collections in mongodb

Submitted by 给你一囗甜甜゛ on 2019-12-18 06:46:54
Question: Is it a good idea to create per-day collections for data on a given day (we could start with per day and then move to per hour if there is too much data)? Is there a limit on the number of collections we can create in MongoDB, or does it result in performance loss (is it an overhead for MongoDB to maintain so many collections)? Does a large number of collections have any adverse effect on performance? To give you more context, the data will be more like Facebook feeds, and only the latest
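The usual alternative to per-day collections is a single collection with an indexed date field, queried with a per-day range filter. A sketch of building that filter (the `feeds` collection and `created` field names are hypothetical, chosen to match the feeds scenario in the question):

```python
from datetime import datetime, timedelta

def day_filter(day):
    """Range filter selecting one day's documents from a single
    collection, instead of creating a feeds_YYYY_MM_DD collection
    per day. With an index on 'created' this stays efficient and
    avoids MongoDB's per-collection bookkeeping overhead.
    """
    start = datetime(day.year, day.month, day.day)
    return {"created": {"$gte": start, "$lt": start + timedelta(days=1)}}
```

With pymongo this would be used roughly as `db.feeds.find(day_filter(date(2019, 12, 18)))`, keeping the collection count constant no matter how many days of data accumulate.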

How to convert a Date String from UTC to Specific TimeZone in HIVE?

Submitted by 怎甘沉沦 on 2019-12-18 06:06:28
Question: My Hive table has a date column with UTC date strings. I want to get all rows for a specific EST date. I am trying to do something like the below:

Select * from TableName T where TO_DATE(ConvertToESTTimeZone(T.date)) = "2014-01-12"

I want to know if there is a function for ConvertToESTTimeZone, or how I can achieve that. I tried the following but it doesn't work (my default time zone is CST):

TO_DATE(from_utc_timestamp(T.Date) = "2014-01-12"
TO_DATE( from_utc_timestamp(to_utc_timestamp (unix
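One likely cause of the failed attempts is that Hive's from_utc_timestamp takes two arguments: the timestamp and a target zone, e.g. from_utc_timestamp(T.date, 'America/New_York'). A Python sketch of the same conversion, assuming date strings formatted like '2014-01-12 03:00:00' (the format is a guess, since the question doesn't show sample data):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+

def utc_to_est_date(utc_string):
    """Parse a UTC timestamp string and return its date in US Eastern
    time. Using 'America/New_York' rather than a fixed EST offset
    handles daylight saving automatically."""
    dt = datetime.strptime(utc_string, "%Y-%m-%d %H:%M:%S")
    eastern = dt.replace(tzinfo=ZoneInfo("UTC")).astimezone(
        ZoneInfo("America/New_York"))
    return eastern.date().isoformat()
```

For example, 2014-01-12 03:00 UTC is 2014-01-11 22:00 Eastern (UTC-5 in January), so it belongs to the previous Eastern date.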

How to load large table into tableau for data visualization?

Submitted by 你说的曾经没有我的故事 on 2019-12-18 05:04:24
Question: I am able to connect Tableau to my database, but the table is really large. Every time I try to load the table into Tableau, it crashes, and I cannot find a workaround. The table size varies from 10 million to 400 million rows. How should I approach this issue? Any suggestions?

Answer 1: I found a simple solution for optimising Tableau to work with very large datasets (1 billion+ rows): Google BigQuery, which is essentially a managed data warehouse. Upload data to BigQuery
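Short of moving to a warehouse, a common workaround is to pre-aggregate before pointing Tableau at the data, so Tableau loads thousands of summary rows instead of hundreds of millions of raw ones. A stdlib sketch that streams the source once (column names are hypothetical):

```python
import csv
from collections import defaultdict

def aggregate_csv(in_path, out_path, key_col, value_col):
    """Stream a huge CSV row by row, summing value_col per key_col,
    and write the small summary file that Tableau actually connects
    to. Only the per-key totals are held in memory, never the rows."""
    totals = defaultdict(float)
    with open(in_path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row[key_col]] += float(row[value_col])
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([key_col, value_col])
        for key, total in sorted(totals.items()):
            writer.writerow([key, total])
```

The same idea applies inside the database: a materialized view or summary table keyed on the dimensions the dashboard actually filters by.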

Spark Fixed Width File Import Large number of columns causing high Execution time

Submitted by 冷暖自知 on 2019-12-17 21:36:33
Question: I am getting a fixed-width .txt source file from which I need to extract 20K columns. For lack of libraries to process fixed-width files using Spark, I have developed code that extracts the fields from fixed-width text files. The code reads the text file as an RDD with sparkContext.textFile("abc.txt"), then reads a JSON schema and gets the column names and the width of each column. In the function I read the fixed-length string, and using the start and end positions we use the substring function to
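The per-record extraction described above can be sketched as a plain function applied to each RDD line; the field names and offsets below stand in for whatever the JSON schema provides. With 20K columns, the key to execution time is computing the (start, width) offsets once, outside the function passed to `rdd.map(...)`, rather than re-deriving them per line:

```python
def parse_fixed_width(line, schema):
    """Slice one fixed-width record into named fields.

    `schema` is a precomputed list of (name, start, width) tuples,
    as might be derived once from the JSON schema. Each field is a
    simple substring, so a 20K-column line is one pass of slicing.
    """
    return {name: line[start:start + width].strip()
            for name, start, width in schema}

# Hypothetical 3-column schema for illustration.
schema = [("id", 0, 3), ("name", 3, 5), ("qty", 8, 2)]
```

In Spark this would be used as `rdd.map(lambda line: parse_fixed_width(line, schema))`, with `schema` captured once in the closure.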

“Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used” on an EMR cluster with 75GB of memory

Submitted by 十年热恋 on 2019-12-17 21:25:13
Question: I'm running a 5-node Spark cluster on AWS EMR, each node sized m3.xlarge (1 master, 4 slaves). I successfully ran through a 146 MB bzip2-compressed CSV file and ended up with a perfectly aggregated result. Now I'm trying to process a ~5 GB bzip2 CSV file on this cluster, but I'm receiving this error:

16/11/23 17:29:53 WARN TaskSetManager: Lost task 49.2 in stage 6.0 (TID xxx, xxx.xxx.xxx.compute.internal): ExecutorLostFailure (executor 16 exited caused by one of the running tasks) Reason: Container
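The usual first response to "Container killed by YARN for exceeding memory limits" is to raise spark.yarn.executor.memoryOverhead, the off-heap allowance YARN adds to the JVM heap when sizing the container. Spark's documented default is max(384 MB, 10% of executor memory); a sketch of that sizing rule, which shows how a ~9 GB heap ends up in a ~10 GB container:

```python
def yarn_container_mb(executor_memory_mb, memory_overhead_mb=None):
    """Approximate total memory YARN charges a Spark executor container.

    The default overhead follows Spark's rule of thumb:
    max(384 MB, 10% of spark.executor.memory). When JVM heap plus
    off-heap use (netty buffers, Python workers, etc.) exceeds this
    total, YARN kills the container with "exceeding memory limits".
    """
    if memory_overhead_mb is None:
        memory_overhead_mb = max(384, executor_memory_mb // 10)
    return executor_memory_mb + memory_overhead_mb
```

So an executor with 9216 MB of heap is charged roughly 10137 MB by YARN; raising the overhead explicitly (at the cost of heap) gives off-heap use more headroom inside the same container limit.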