luigi

How to reset luigi task status?

Submitted by 谁说我不能喝 on 2019-12-22 04:26:09
Question: Currently, I have a bunch of Luigi tasks queued together, with a simple dependency chain (a -> b -> c -> d). d gets executed first, and a at the end. a is the task that gets triggered. All the tasks except a return a luigi.LocalTarget() object and have a single generic luigi.Parameter(), which is a string containing a date and a time. This runs against a Luigi central scheduler (which has history enabled). The problem is that, when I rerun the said task a, Luigi checks the history and sees if that…
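For reference, a stripped-down sketch of such a chain (only two of the four tasks, with made-up names and paths). By default Luigi decides whether a task still needs to run by checking whether its output() target exists, not by consulting the history, so "resetting" a finished task usually comes down to deleting its target file or passing a new parameter value:

    import luigi


    class TaskB(luigi.Task):
        run_ts = luigi.Parameter()  # e.g. "2019-12-22 04:26"

        def output(self):
            # The task counts as complete as soon as this file exists.
            return luigi.LocalTarget(f"data/b_{self.run_ts}.txt")

        def run(self):
            with self.output().open("w") as f:
                f.write("b done\n")


    class TaskA(luigi.Task):
        # Top of the chain: no output(), so it is considered incomplete and
        # re-runs every time it is triggered.
        run_ts = luigi.Parameter()

        def requires(self):
            return TaskB(run_ts=self.run_ts)

        def run(self):
            print("a ran for", self.run_ts)

With this layout, triggering a with a new run_ts value re-runs the whole chain, because each parameter value maps to its own set of target files.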

How to Dynamically create a Luigi Task

Submitted by 百般思念 on 2019-12-21 11:06:18
Question: I am building a wrapper for Luigi tasks and I ran into a snag with the Register class, which is actually an ABC metaclass and is not picklable when I create a dynamic type. The following code, more or less, is what I'm using to develop the dynamic class:

    class TaskWrapper(object):
        '''Luigi Spark factory from the provided JobClass

        Args:
            JobClass (ScrubbedClass): The job to wrap
            options: Options as passed into the JobClass
        '''
        def __new__(self, JobClass, **options):
            # Validate we have a good job…
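For comparison, a minimal sketch (not the asker's code) of building a Luigi Task class at runtime with type(). The detail that usually matters for pickling is that the generated class is registered under a module-level name, so the worker can look it up again when it unpickles the task:

    import sys

    import luigi


    def make_task(name, path):
        """Create a Task subclass called `name` that writes to `path` (both hypothetical)."""
        def run(self):
            with self.output().open("w") as f:
                f.write("done\n")

        cls = type(name, (luigi.Task,), {
            "output": lambda self: luigi.LocalTarget(path),
            "run": run,
        })
        # Make the class importable as <this module>.<name> so pickle's
        # attribute lookup succeeds on the worker side.
        cls.__module__ = __name__
        setattr(sys.modules[__name__], name, cls)
        return cls


    DynamicA = make_task("DynamicA", "data/dynamic_a.txt")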

Scheduling spark jobs on a timely basis

Submitted by 爷，独闯天下 on 2019-12-14 03:43:05
Question: Which is the recommended tool for scheduling Spark jobs on a daily/weekly basis?
1) Oozie
2) Luigi
3) Azkaban
4) Chronos
5) Airflow
Thanks in advance.

Answer 1: Updating my previous answer from here: Suggestion for scheduling tool(s) for building hadoop based data pipelines. Airflow: try this first. Decent UI, Python-ish job definition, semi-accessible for non-programmers, dependency declaration syntax is weird. Airflow has built-in support for the fact that scheduled jobs often need to be…
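To illustrate the "Python-ish job definition" the answer refers to, here is a minimal sketch of a daily Airflow DAG that launches a Spark job via spark-submit (Airflow 2.x-style imports; the DAG id, schedule, and script path are placeholders, not anything from the question):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="daily_spark_job",            # placeholder name
        start_date=datetime(2019, 12, 1),
        schedule_interval="@daily",          # run once per day
        catchup=False,
    ) as dag:
        submit = BashOperator(
            task_id="spark_submit",
            bash_command="spark-submit --master yarn /path/to/job.py",  # placeholder job
        )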

How to use Parameters in Python Luigi

Submitted by ﹥>﹥吖頭↗ on 2019-12-12 09:39:39
Question: How do I pass parameters to Luigi? I have a Python file called FileFinder.py with a class named getFiles:

    class getFiles(luigi.Task):

and I want to pass a directory to this class, such as C://Documents//fileName, and then use this parameter in my run method:

    def run(self):

How do I run this on the command line and add the parameter for use in my code? I am accustomed to running this file on the command line like this:

    python FileFinder.py getFiles --local-scheduler

What do I add to my code to…
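A minimal sketch, reusing the names from the question: declaring a luigi.Parameter on the class automatically exposes it as a command-line option of the same name.

    import os

    import luigi


    class getFiles(luigi.Task):
        dir = luigi.Parameter()  # becomes --dir on the command line

        def run(self):
            # Use the parameter like any other attribute.
            for name in os.listdir(self.dir):
                print(name)

It would then be invoked as: python FileFinder.py getFiles --dir "C://Documents//fileName" --local-scheduler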

Make failure of a dynamic Luigi task non critical

Submitted by 橙三吉。 on 2019-12-11 15:25:49
Question: I have a Luigi workflow that downloads a bunch of large files via FTP and deposits them on S3. I have one task that reads a list of files to download and then creates a bunch of tasks that actually do the downloads. The idea is that the result of this workflow is a single file containing a list of the downloads that have succeeded, with any failed downloads being reattempted on the next run the following day. The problem is that if any of the download tasks fails, then the successful download list is…
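One possible shape for this (a sketch with hypothetical names, not the asker's code): let each download task catch its own errors and record success or failure in its output, so the aggregating task can always complete and list only the successes. In the real workflow these tasks would also carry a date parameter, so that files that failed one day are attempted again on the next day's run.

    import json

    import luigi


    class Download(luigi.Task):
        url = luigi.Parameter()

        def output(self):
            return luigi.LocalTarget("status/{}.json".format(self.url.replace("/", "_")))

        def run(self):
            try:
                fetch_to_s3(self.url)        # hypothetical FTP-to-S3 helper
                status = {"url": self.url, "ok": True}
            except Exception as exc:         # record the failure instead of raising
                status = {"url": self.url, "ok": False, "error": str(exc)}
            with self.output().open("w") as f:
                json.dump(status, f)


    class CollectSuccesses(luigi.Task):
        urls = luigi.ListParameter()

        def requires(self):
            return [Download(url=u) for u in self.urls]

        def output(self):
            return luigi.LocalTarget("succeeded.txt")

        def run(self):
            with self.output().open("w") as out:
                for target in self.input():
                    with target.open("r") as f:
                        status = json.load(f)
                    if status["ok"]:
                        out.write(status["url"] + "\n")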

Luigi flexible pipeline and passing parameters all the way through

Submitted by 烂漫一生 on 2019-12-11 10:57:50
Question: I've recently implemented a Luigi pipeline to handle the processing for one of our bioinformatics pipelines. However, there's something fundamental about how to set up these tasks that I'm not grasping. Let's say I've got a chain of three tasks that I'd like to be able to run with multiple workers. For example, the dependency graph for three workers might look like:

    / taskC -> taskB -> taskA
    - taskC -> taskB -> taskA
    \ taskC -> taskB -> taskA

and I might write:

    class entry(luigi.Task):
        in_dir =…
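One common pattern for this shape (a sketch with made-up task and parameter names): a top-level WrapperTask fans out one chain per input, and the shared parameters are passed down explicitly through requires(), so running with --workers 3 processes the three chains in parallel.

    import luigi


    class TaskA(luigi.Task):
        sample = luigi.Parameter()
        out_dir = luigi.Parameter()

        def output(self):
            return luigi.LocalTarget(f"{self.out_dir}/{self.sample}.a")

        def run(self):
            with self.output().open("w") as f:
                f.write("a\n")


    class TaskB(luigi.Task):
        sample = luigi.Parameter()
        out_dir = luigi.Parameter()

        def requires(self):
            return TaskA(sample=self.sample, out_dir=self.out_dir)

        def output(self):
            return luigi.LocalTarget(f"{self.out_dir}/{self.sample}.b")

        def run(self):
            with self.output().open("w") as f:
                f.write("b\n")


    class entry(luigi.WrapperTask):
        in_dir = luigi.Parameter()
        out_dir = luigi.Parameter()

        def requires(self):
            # In practice the sample list would be derived from in_dir.
            for sample in ("s1", "s2", "s3"):
                yield TaskB(sample=sample, out_dir=self.out_dir)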

Recurrent machine learning ETL using Luigi

Submitted by 半腔热情 on 2019-12-11 06:58:34
Question: Today, running the machine learning job I've written is done by hand. I download the needed input files, learn and predict things, and output a .csv file, which I then copy into a database. However, since this is going into production, I need to automate this whole process. The needed input files will arrive every month (and eventually more frequently) in an S3 bucket from the provider. Now I'm planning to use Luigi to solve this problem. Here is the ideal process: every week (or day, or hour,…
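A rough sketch of what that can look like in Luigi (the bucket name, key layout, and task names are assumptions): an ExternalTask models the provider's drop on S3, and the learning/prediction task only runs once that input exists. This uses luigi.contrib.s3, which needs boto3 configured with credentials for the bucket.

    import luigi
    from luigi.contrib.s3 import S3Target


    class InputDrop(luigi.ExternalTask):
        """The provider's monthly file; Luigi never runs this, it only checks for it."""
        month = luigi.MonthParameter()

        def output(self):
            return S3Target(f"s3://my-bucket/input/{self.month:%Y-%m}.csv")  # placeholder bucket


    class TrainAndPredict(luigi.Task):
        month = luigi.MonthParameter()

        def requires(self):
            return InputDrop(month=self.month)

        def output(self):
            return luigi.LocalTarget(f"predictions/{self.month:%Y-%m}.csv")

        def run(self):
            with self.input().open("r") as src, self.output().open("w") as dst:
                dst.write(src.read())  # stand-in for the real learn/predict/export step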

python luigi died unexpectedly with exit code -11

Submitted by 你离开我真会死。 on 2019-12-10 20:45:38
Question: I have a data pipeline with Luigi that works perfectly fine if I use one worker. However, if I use more than one worker, it dies (unexpectedly, with exit code -11) in a stage with two dependencies. The code is rather complex, so a minimal example would be difficult to give. The gist of the matter is that I am doing the following things with gensim:

1. Building a dictionary from some texts.
2. Building a corpus from said texts and the dictionary (requires (1)).
3. Training an LDA model from the…
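Exit code -11 means the worker process died with a segmentation fault (SIGSEGV). If running the gensim stages concurrently is the trigger, one thing worth trying (a sketch with hypothetical task names, not a confirmed fix) is Luigi's resource limits, which keep those stages from ever running in parallel even when the rest of the pipeline uses several workers:

    import luigi


    class TrainLda(luigi.Task):
        # Claim one unit of a "gensim" resource; with the limit below, at most
        # one such task runs at a time across all workers.
        resources = {"gensim": 1}

        def run(self):
            ...  # build the dictionary/corpus and train the LDA model here

    # luigi.cfg:
    # [resources]
    # gensim=1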

Can't pickle <class 'abc.class_name'>: attribute lookup class_name on abc failed

Submitted by 强颜欢笑 on 2019-12-10 18:53:39
Question: I'm getting the above error as I try to create dependencies (subtasks) based on the dependency relationships defined in a dictionary ("cmdList"). For instance, "BDX010" is a dependency of "BDX020". I'm using Python 3.7. Please see the stack trace at the bottom for the exact error message.

    import luigi
    from helpers import SQLTask
    import helpers
    import logging
    import time

    acctDate = '201904'
    ssisDate = '201905'
    runDesc0xx = 'prod period 4 test2'
    runDesc9xx = 'test2'
    YY = acctDate[:4]
    MM = acctDate[4…
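This pickling error typically comes from task classes that are created dynamically and therefore have no importable module-level name. One way to sidestep it entirely (a sketch with a made-up cmdList, not the asker's code) is to keep a single, statically defined Task class and derive its dependencies from the dictionary inside requires():

    import luigi

    # Hypothetical dependency map: task code -> codes it depends on.
    cmdList = {
        "BDX010": [],
        "BDX020": ["BDX010"],
    }


    class RunStep(luigi.Task):
        code = luigi.Parameter()

        def requires(self):
            return [RunStep(code=dep) for dep in cmdList[self.code]]

        def output(self):
            return luigi.LocalTarget(f"done/{self.code}.marker")

        def run(self):
            # Execute the real work for self.code here (e.g. the SQLTask logic),
            # then write the marker file.
            with self.output().open("w") as f:
                f.write("ok\n")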

Can we limit the throughput of a luigi Task?

Submitted by 为君一笑 on 2019-12-10 18:44:22
Question: We have a Luigi Task that requests a piece of information from a third-party service. We are limited in the number of requests we can make per minute to that API. Is there a way to specify, on a per-Task basis, how many tasks of this kind the scheduler should run per unit of time?

Answer 1: We implemented our own rate limiting in the task. Our API limit was low enough that we could saturate it with a single thread. When we received a rate-limit response, we just backed off and retried. One thing…
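A sketch of that back-off-and-retry idea (call_third_party() and the HTTP details are assumptions, not the answerer's code); the resources entry additionally tells Luigi's scheduler never to run more than one of these tasks at once:

    import time

    import luigi


    class FetchInfo(luigi.Task):
        item_id = luigi.Parameter()
        resources = {"third_party_api": 1}  # plus [resources] third_party_api=1 in luigi.cfg

        def output(self):
            return luigi.LocalTarget(f"api/{self.item_id}.json")

        def run(self):
            delay = 1
            while True:
                response = call_third_party(self.item_id)  # hypothetical client call
                if response.status_code != 429:            # 429 = rate limited
                    break
                time.sleep(delay)                          # back off, then retry
                delay = min(delay * 2, 60)
            with self.output().open("w") as f:
                f.write(response.text)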