luigi

Using Luigi, how do I read PostgreSQL data and then pass that data to the next task in the workflow?

Submitted by 无人久伴 on 2019-12-10 15:42:16
Question: Using Luigi, I want to define a workflow with two "stages": the first reads data from PostgreSQL, and the second does something with that data. I started by subclassing luigi.contrib.postgres.PostgresQuery and overriding host, database, user, etc., as described in the documentation. After that, how do I pass the query result to the next task in the workflow? The next task already specifies, in its requires method, that the class above must be instantiated and returned. My code: class MyData(luigi.contrib

Running Hadoop jar using Luigi python

Submitted by 谁都会走 on 2019-12-10 15:21:09
Question: I need to run a Hadoop jar job using Luigi from Python. I searched and found examples of writing a mapper and reducer in Luigi, but nothing that directly runs a Hadoop jar. I need to run a compiled Hadoop jar directly. How can I do it? Answer 1: You need to use the luigi.contrib.hadoop_jar package (code). In particular, you need to extend HadoopJarJobTask. For example, like this: from luigi.contrib.hadoop_jar import HadoopJarJobTask from luigi.contrib.hdfs.target import HdfsTarget class

How to enable dynamic requirements in Luigi?

Submitted by 折月煮酒 on 2019-12-10 13:35:25
Question: I have built a pipeline of tasks in Luigi. Because this pipeline is going to be used in different contexts, it might need to include more tasks at the beginning or end, or even entirely different dependencies between the tasks. That's when I thought: "Hey, why not declare the dependencies between the tasks in my config file?", so I added something like this to my config.py: PIPELINE_DEPENDENCIES = { "TaskA": [], "TaskB": ["TaskA"], "TaskC": ["TaskA"],

Luigi - Unfulfilled %s at run time

Submitted by 做~自己de王妃 on 2019-12-09 16:32:15
Question: I am trying to learn, in a very simple way, how Luigi works. As a newbie I came up with this code:

import luigi

class class1(luigi.Task):
    def requires(self):
        return class2()
    def output(self):
        return luigi.LocalTarget('class1.txt')
    def run(self):
        print 'IN class A'

class class2(luigi.Task):
    def requires(self):
        return []
    def output(self):
        return luigi.LocalTarget('class2.txt')

if __name__ == '__main__':
    luigi.run()

Running this at the command prompt gives an error: raise RuntimeError(

How to run a luigi task with spark-submit and pyspark

Submitted by 懵懂的女人 on 2019-12-07 10:29:12
Question: I have a Luigi Python task which includes some pyspark libs. Now I would like to submit this task on Mesos with spark-submit. What should I do to run it? Below is my code skeleton:

from pyspark.sql import functions as F
from pyspark import SparkContext

class myClass(SparkSubmitTask):
    # date = luigi.DateParameter()
    def __init__(self, date):
        self.date = date  # date is datetime.date.today().isoformat()
    def output(self):
    def input(self):
    def run(self):
        # Some functions are using pyspark libs
        if _

MySQL Targets in Luigi workflow

Submitted by 流过昼夜 on 2019-12-07 08:21:53
Question: My TaskB requires TaskA; on completion, TaskA writes to a MySQL table, and TaskB is to take this output table as its input. I cannot seem to figure out how to do this in Luigi. Can someone point me to an example, or give me a quick example here? Answer 1: The existing MySqlTarget in Luigi uses a separate marker table to indicate when the task is complete. Here's the rough approach I would take... but your question is very abstract, so it is likely to be more complicated in reality.

S3 file to local using luigi raises UnicodeDecodeError

Submitted by 泪湿孤枕 on 2019-12-07 06:09:17
I am copying a PDF file to local, using the following piece of code:

with self.input_target().open('r') as r:
    with self.output_target().open('w') as w:
        for line in r:
            w.write(line)

which is (loosely) based on this question. But when I execute that code I get the following:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 11: invalid continuation byte

I tried this other approach, without good results:

with self.input_target().open('r') as r, self.output_target().open('w') as w:
    w.write(r.read())

What is the correct way of doing it?

Alexey Grigorev: It seems that you're dealing

Suggestions for scheduling tools for building Hadoop-based data pipelines

Submitted by 允我心安 on 2019-12-06 12:44:05
Between Apache Oozie, Spotify/Luigi, and airbnb/airflow, what are the pros and cons of each? I have used Oozie and Airflow in the past to build data ingestion pipelines using Pig and Hive. Currently, I am in the process of building a pipeline that looks at logs, extracts useful events, and puts them on Redshift. I found that Airflow was much easier to use, test, and set up. It has a much cooler UI and lets users perform actions from the UI itself, which is not the case with Oozie. Any information about Luigi, or other insights regarding stability and issues, is welcome. Azkaban:

Can I use Luigi with Python Celery?

Submitted by 夙愿已清 on 2019-12-06 11:56:14
Question: I am using Celery for my web application. Celery executes parent tasks, which then execute further pipelines of tasks. The issues with Celery: I can't get the dependency graph and visualizer I get with Luigi to see the status of my parent task, and Celery does not provide a mechanism to restart a failed pipeline from where it failed. These two things I can easily get from Luigi. So I was thinking that once Celery runs the parent task, inside that task I would execute the Luigi pipeline. Is
