Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

Overview
[Luigi logo and build, coverage, and PyPI badges]

Luigi is a Python (3.6, 3.7 tested) package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.

Getting Started

Run pip install luigi to install the latest stable version from PyPI. Documentation for the latest release is hosted on readthedocs.

Run pip install luigi[toml] to install Luigi with TOML-based configs support.

For the bleeding edge code, pip install git+https://github.com/spotify/luigi.git. Bleeding edge documentation is also available.
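
To give a feel for what a task looks like, here is a minimal sketch; the HelloWorld task, the recipient parameter, and the output filename are made up for illustration and are not an official example from the docs:

    import luigi

    class HelloWorld(luigi.Task):
        """Write a greeting to a local file (task name and path are illustrative)."""

        recipient = luigi.Parameter(default="world")

        def output(self):
            # The task counts as complete once this target exists.
            return luigi.LocalTarget("hello_%s.txt" % self.recipient)

        def run(self):
            # open("w") writes through a temporary file, so the output appears atomically.
            with self.output().open("w") as f:
                f.write("Hello, %s!\n" % self.recipient)

    if __name__ == "__main__":
        luigi.run()

Saved as hello.py, it could then be run with something like python hello.py HelloWorld --recipient luigi --local-scheduler.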

Background

The purpose of Luigi is to address all the plumbing typically associated with long-running batch processes. You want to chain many tasks, automate them, and failures will happen. These tasks can be anything, but are typically long running things like Hadoop jobs, dumping data to/from databases, running machine learning algorithms, or anything else.

There are other software packages that focus on lower level aspects of data processing, like Hive, Pig, or Cascading. Luigi is not a framework to replace these. Instead it helps you stitch many tasks together, where each task can be a Hive query, a Hadoop job in Java, a Spark job in Scala or Python, a Python snippet, dumping a table from a database, or anything else. It's easy to build up long-running pipelines that comprise thousands of tasks and take days or weeks to complete. Luigi takes care of a lot of the workflow management so that you can focus on the tasks themselves and their dependencies.

You can build pretty much any task you want, but Luigi also comes with a toolbox of several common task templates that you can use. It includes support for running Python mapreduce jobs in Hadoop, as well as Hive and Pig jobs. It also comes with file system abstractions for HDFS and local files that ensure all file system operations are atomic. This is important because it means your data pipeline will not crash in a state containing partial data.
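
To illustrate the atomicity point, here is a rough sketch using a local target's temporary_path(); the AtomicDump task name, the output filename, and the dummy rows are invented for the example:

    import luigi

    class AtomicDump(luigi.Task):
        """Illustrative task: dump.tsv only appears once the write finishes cleanly."""

        def output(self):
            return luigi.LocalTarget("dump.tsv")

        def run(self):
            # temporary_path() yields a scratch path; the file is moved to
            # self.output().path only if this block exits without an exception,
            # so a crash mid-write never leaves a partial dump.tsv behind.
            with self.output().temporary_path() as tmp_path:
                with open(tmp_path, "w") as f:
                    for name, value in [("a", 1), ("b", 2)]:
                        f.write("%s\t%d\n" % (name, value))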

Visualiser page

The Luigi server comes with a web interface too, so you can search and filter among all your tasks.

Visualiser page

Dependency graph example

Just to give you an idea of what Luigi does, this is a screenshot from something we are running in production. Using Luigi's visualiser, we get a nice visual overview of the dependency graph of the workflow. Each node represents a task which has to be run. Green tasks are already completed whereas yellow tasks are yet to be run. Most of these tasks are Hadoop jobs, but there are also some things that run locally and build up data files.

Dependency graph

Philosophy

Conceptually, Luigi is similar to GNU Make where you have certain tasks and these tasks in turn may have dependencies on other tasks. There are also some similarities to Oozie and Azkaban. One major difference is that Luigi is not just built specifically for Hadoop, and it's easy to extend it with other kinds of tasks.

Everything in Luigi is in Python. Instead of XML configuration or similar external data files, the dependency graph is specified within Python. This makes it easy to build up complex dependency graphs of tasks, where the dependencies can involve date algebra or recursive references to other versions of the same task. However, the workflow can trigger things not in Python, such as running Pig scripts or scp'ing files.
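
As a rough illustration of date algebra and a recursive reference to another version of the same task, a sketch along these lines might look as follows; DailyReport, its parameters, and the report paths are invented for the example:

    import datetime
    import luigi

    class DailyReport(luigi.Task):
        """Each day's report depends on the previous day's report (illustrative)."""

        date = luigi.DateParameter()
        start = luigi.DateParameter(default=datetime.date(2023, 1, 1))

        def requires(self):
            # Date algebra plus a recursive reference to an earlier version of this task.
            if self.date > self.start:
                return DailyReport(date=self.date - datetime.timedelta(days=1), start=self.start)
            return []

        def output(self):
            return luigi.LocalTarget(self.date.strftime("reports/%Y-%m-%d.txt"))

        def run(self):
            with self.output().open("w") as f:
                f.write("report for %s\n" % self.date)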

Who uses Luigi?

We use Luigi internally at Spotify to run thousands of tasks every day, organized in complex dependency graphs. Most of these tasks are Hadoop jobs. Luigi provides an infrastructure that powers all kinds of stuff including recommendations, toplists, A/B test analysis, external reports, internal dashboards, etc.

Since Luigi is open source and without any registration walls, the exact number of Luigi users is unknown. But based on the number of unique contributors, we expect hundreds of enterprises to use it. Some users have written blog posts or held presentations about Luigi:

Some more companies are using Luigi but haven't had a chance yet to write about it:

We're more than happy to have your company added here. Just send a PR on GitHub.

External links

Authors

Luigi was built at Spotify, mainly by Erik Bernhardsson and Elias Freider. Many other people have contributed since open sourcing in late 2012. Arash Rouhani is currently the chief maintainer of Luigi.

Comments
  • New theme for luigi visualiser

    This is a new theme for the visualiser accessible through .../index.lte.html. The theme takes the d3 theme and adds jQuery DataTables for displaying a filterable table of tasks and AdminLTE for styling.

    Key features are:

    • Table paging
    • Search filter by any text in the table
    • Info boxes displaying count of task status (done, running, etc.) are selectable to filter the table
    • Sidebar displaying task names with counts are selectable to filter the table
    • Exception reports are accessible via expandable rows

    I can imagine improving the layout of the "Parameters" column and making more use of the expandable rows but we are already finding this very useful in production.

    With 3 separate versions of index.html and several new dependencies I think it would be worth refactoring the /static structure and finding a neater way of selecting themes. If you agree I'll take a look at it.

    opened by stephenpascoe 89
  • Add Datadog contrib for monitoring purpose

    Description

    Datadog is a tool that lets you send metrics, build dashboards from them, and add alerting on specific behaviors.

    Adding this contrib will allow users of this tool to log their pipeline information to Datadog.

    Motivation and Context

    Based on the status change of a task, we log that information to Datadog with the parameters that were used to run that specific task.

    This allows us to easily create dashboards to visualize the health. For example, we can be notified via Datadog if a task has failed, or we can graph the execution time of a specific task over a period of time.

    The implementation idea was strongly based on the stale PR https://github.com/spotify/luigi/pull/2044.

    Have you tested this? If so, how?

    We've been using this contrib for multiple months now (maybe a year?), at Glossier. This is the main point of reference to see the health of our pipeline.

    opened by thisiscab 48
  • Dynamic requirements

    This is a prototype for dynamic requirements, based on erikbern's dynamic-requires branch.

    It differs in that multiple requirements can be returned at once and that it supports multiple threads. There is no concept of suspending tasks. If the requirements are not all available when yielded, the task will be put back as pending.

    I struggled with getting access to the dynamic requirements from other processes. How it works now is that you have to add a dynamic_requires function that returns the Tasks that may be yielded later on. I'm very open to suggestions on how to improve this.

    Again, just a prototype... Comments please!
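
    For context, in later Luigi releases dynamic dependencies ended up working by yielding tasks from run(); a rough sketch with made-up FetchConfig/Process task names (not this prototype's API) looks like this:

    import luigi

    class FetchConfig(luigi.Task):
        """Illustrative upstream task producing a tiny config file."""

        def output(self):
            return luigi.LocalTarget("config.txt")

        def run(self):
            with self.output().open("w") as f:
                f.write("3\n")

    class Process(luigi.Task):
        """How many partitions to process is only known after reading the config."""

        def output(self):
            return luigi.LocalTarget("processed.txt")

        def run(self):
            fetch = FetchConfig()
            # Yielding a task from run() registers it as a dynamic dependency;
            # this task is suspended until the yielded requirement is complete.
            yield fetch
            with fetch.output().open("r") as f:
                n = int(f.read().strip())
            with self.output().open("w") as out:
                out.write("processed %d partitions\n" % n)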

    opened by tommyengstrom 43
  • Visible parameter for luigi.Parameters

    Description

    Adds a visible flag to the luigi.Parameter class, which is needed to hide some parameters (such as database passwords) from the web page.

    Motivation and Context

    Well, in my case this change is crucial because some of my tasks connect to a database and contain passwords for it, which I don't want to expose on the web page to everyone.

    In code it will look like this:

    class MyTask(luigi.Task):
        secret = luigi.Parameter(visible=False)    # default value for visible - True
        public = luigi.Parameter()
    

    Have you tested this? If so, how?

    I tested it on my tasks and it works fine. It just removes these parameters from the node representing the current task, so it shouldn't break any compatibility.

    (Before/after screenshots omitted; they were just examples.)

    Maybe it will be helpful for others in a similar case too.

    opened by nryanov 39
  • S3Client to use Boto3

    Description

    This work is to move away from boto to start using Boto3 for the S3Client.

    Motivation and Context

    Since boto is no longer supported, Luigi should move away from it and use the current Boto3. This would solve a number of issues, for example:

    • More stable downloads for big files
    • Built-in support for multipart uploads
    • Encryption with KMS
    • ... One of the main motivations on my part is the lack of support for AWS task roles in boto. As more people are using this AWS functionality, it would make sense to move the S3Client to use Boto3. More reasons can be found here: https://github.com/spotify/luigi/issues/1344

    Have you tested this? If so, how?

    This is a work in progress. I'm trying my best to stick to the original tests, but sometimes change is inevitable (happy to chat for more details).

    Note

    This is my very first contribution so feedback and suggestions are more than welcome.

    opened by ouanixi 39
  • Task id hashing

    Implementing semi-opaque, hashed task_ids as discussed at #1312.

    This is the version I will be testing on our infrastructure in the next few weeks. This PR is intended to give my proposal better visibility whilst I test.

    The PR replaces task_ids with a value which is reliably unique but you can't extract the full parameter values from it. The actual algorithm is family_pval1_pval2_pval3_hash where:

    • family: The task_family (a.k.a task_name)
    • pval*: A truncated serialisation of the first 3 parameter values, sorted by parameter name
    • hash: An md5 hash of the canonical JSON serialisation of the family and parameters, truncated to 10 characters.

    str(task) will return the traditional serialisation.

    The db_task_history database schema is changed to include task_id in the tasks table. No further use is made of this column at this stage. A migration script is supplied to add the column to an existing database. This has been tested on MySQL only.

    All tests are updated to pass by replacing hard-coded task_ids with reference to Task.task_id.
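
    A rough sketch of the id scheme described above (purely illustrative: the parameter truncation length is assumed, and this is not the PR's actual implementation):

    import hashlib
    import json

    TRUNCATE_PARAM = 16  # assumed truncation length for serialised parameter values
    TRUNCATE_HASH = 10   # hash truncated to 10 characters, per the description

    def make_task_id(family, params):
        """Build family_pval1_pval2_pval3_hash from a family name and a params dict."""
        # First three parameter values, sorted by parameter name, truncated.
        pvals = "_".join(
            str(value)[:TRUNCATE_PARAM]
            for _, value in sorted(params.items())[:3]
        )
        # md5 of a canonical JSON serialisation of family + parameters, truncated.
        canonical = json.dumps([family, params], sort_keys=True, separators=(",", ":"))
        digest = hashlib.md5(canonical.encode("utf-8")).hexdigest()[:TRUNCATE_HASH]
        return "%s_%s_%s" % (family, pvals, digest)

    # e.g. make_task_id("MyTask", {"date": "2016-01-01", "kind": "full"})
    # gives something like "MyTask_2016-01-01_full_<10-char hash>"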

    opened by stephenpascoe 39
  • Task history, from https://github.com/spotify/luigi/issues/51

    All the tables are added as per the RFC, though only Task, TaskEvent, and TaskParameters are being used right now.

    Added a simple HTML front-end using Flask. I'd be open to using tornado to be consistent with server.py, I just used flask because I've used it before.

    Some screenshots of the pages:

    • Recent runs: http://cl.ly/image/1h01133g1Q07
    • Task details: http://cl.ly/image/1P1i3T0C1y42
    • All "A" runs: http://cl.ly/image/2X3h0X3k3P2h
    opened by JoeEnnever 34
  • Issue when upgrading luigi version

    When upgrading to a version of luigi at or after 6e549ef, I receive the following error when running python setup.py install within an existing venv.

    (venv) [email protected]:~/luigi$ python setup.py install
    running install
    running bdist_egg
    running egg_info
    writing requirements to luigi.egg-info/requires.txt
    writing luigi.egg-info/PKG-INFO
    writing top-level names to luigi.egg-info/top_level.txt
    writing dependency_links to luigi.egg-info/dependency_links.txt
    writing entry points to luigi.egg-info/entry_points.txt
    reading manifest file 'luigi.egg-info/SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    writing manifest file 'luigi.egg-info/SOURCES.txt'
    installing library code to build/bdist.linux-x86_64/egg
    running install_lib
    running build_py
    error: can't copy 'luigi/static/visualiser/lib/URI.js': doesn't exist or not a regular file
    

    This commit added a file luigi/static/visualiser/lib/URI.js/1.18.2/URI.js. I suspect there's a problem here.

    opened by dlstadther 31
  • Py3k

    First step toward Python 3 support. The majority of tests still fail, but Luigi can be imported in Python 3. The code still runs just fine in Python 2; the only limitation is an extra dependency on six.

    opened by gpoulin 31
  • [fixed] Task status messages

    This PR is a fixed version of PR #1621 which has a faulty commit history.

    This PR adds status messages to tasks which are also visible on the scheduler GUI.

    Examples:

    task list

    worker list

    Status messages are meant to change during the run method in an opportunistic way. Especially for long-running non-Hadoop tasks, the ability to read those messages directly in the scheduler GUI is quite helpful (at least for us). Internally, message changes are propagated to the scheduler via a _status_message_callback which is - in contrast to tracking_url_callback - not passed to the run method, but set by the TaskProcess.

    Usage example:

    class MyTask(luigi.Task):
        ...
        def run(self):
            for i in range(100):
                # do some hard work here
                if i % 10 == 0:
                    self.set_status_message("Progress: %s / 100" % i)
    

    I know that you don't like PR's that affect core code (which is reasonable =) ), but imho this feature is both lightweight and really helpful.

    opened by riga 30
  • Enables running of multiple tasks in batches

    Sometimes it's more efficient to run a group of tasks all at once rather than one at a time. With Luigi, it's difficult to take advantage of this because your batch size will also be the minimum granularity you're able to compute. So if you have jobs that run hourly, you can't combine their computation when many of them get backlogged. When you have a task that runs daily, you can't get hourly runs.

    In order to gain efficiency when many jobs are queued up, this change allows workers to provide details of how jobs can be batched to the scheduler. If you have several hourly jobs of the same type in the scheduler, it can combine them into a single job for the worker. We allow parameters to be combined in three ways: we can combine all the arguments in a csv, take the min and max to form a range, or just provide the min or max. The csv gives the most specificity, but range and min/max are available for when that's all you need. In particular, the max function provides an implementation of #570, allowing for jobs that overwrite each other to be grouped by just running the largest one.

    In order to implement this, the scheduler will create a new task based on the information sent by the worker. It's possible (as in the max/min case) that the new task already exists, but if it doesn't it will be cleaned up at the end of the run. While this new task is running, any other tasks will be marked as BATCH_RUNNING. When the head task becomes DONE or FAILED, the BATCH_RUNNING tasks will also be updated accordingly. They'll also have their tracking urls updated to match the batch task.

    This is a fairly big change to how the scheduler works, so there are a few issues with it in the initial implementation:

    • newly created batch tasks don't show up in dependency graphs
    • the run summary doesn't know what happened to the batched tasks
    • batching takes quadratic time for simplicity of implementation
    • I'm not sure what would happen if there was a yield in a batch run function

    For the user, batching is accomplished by setting batch_method in the parameters that you wish to batch. You can limit the number of tasks allowed to run simultaneously in a single batch by setting the class variable batch_size. If you want to exempt a specific task from being part of a batch, simply set its is_batchable property to False.
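
    A sketch of what that usage might look like, following this PR's description (the HourlyLoad task and its parameter are made up, and batch_size/is_batchable are named as described here rather than as in any particular released API):

    import luigi

    class HourlyLoad(luigi.Task):
        """Illustrative hourly task whose backlogged hours can be combined."""

        # batch_method tells the scheduler how queued values of this parameter
        # may be combined (csv, min, max, ...); taking the max fits the
        # overwrite-each-other case mentioned above.
        hour = luigi.DateHourParameter(batch_method=max)

        # Per this PR's description: cap how many queued tasks go into one batch.
        batch_size = 4

        # Per this PR's description: opt a task out of batching entirely.
        # is_batchable = False

        def run(self):
            pass  # process everything up to and including self.hour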

    opened by daveFNbuck 30
  • fix: using default_scheduler_url as a mounting point for a non-root path. Fixes #3213

    Signed-off-by: ROUDET Franck INNOV/IT-S [email protected]

    Description

    One file modified to change how the URL is built. It keeps the full base path from default_scheduler_url. This is the second attempt at this, with a cleaner approach.

    Motivation and Context

    Fixes:

    • https://github.com/spotify/luigi/issues/3213
    • and https://groups.google.com/g/luigi-user/c/kJ7aI2GNfcE

    Have you tested this? If so, how?

    I'm able to test:

    • With or without scheduler-url set
    • With scheduler-url both at and not at the root
    opened by franckOL 1
  • using default_scheduler_url as a mounting point (not root `http://address/mount`)  behind proxy not working

    I need to mount luigi behind a nginx, for example luigi at http://address/mount. For that I configure:

    [core]
    default_scheduler_url=http://address/mount
    ....
    

    The GUI works fine, but the CLI does not, due to URL resolution. It happens here: https://github.com/spotify/luigi/blob/c13566418c92de3e4d8d33ead4e7c936511afae1/luigi/rpc.py#L54

    To understand what happened:

    parsed=urlparse('http://address/mount')
    url='/api/add_task'
    urljoin(parsed.geturl(), url)
    # ==> give 'http://address/api/add_task'
    # expected http://address/mount/api/add_task
    

    What I must do to make it work (a slash at the end of the mount point, and no leading slash on the url):

    parsed=urlparse('http://address/mount/')
    url='api/add_task'
    urljoin(parsed.geturl(), url)
    # ==> http://address/mount/api/add_task
    
    opened by franckOL 0
  • Add support/option to pick `TaskProcess` worker process context

    Currently TaskProcess uses the default multiprocessing context (it inherits from multiprocessing.Process), which ends up being spawn on macOS/Windows and fork on Linux. There's no way to change the context on Linux to spawn or forkserver if the default fork is causing issues in the worker process (see https://github.com/python/cpython/issues/84559).
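
    For reference, this is what selecting an explicit start method looks like with the standard library (plain multiprocessing, not TaskProcess itself):

    import multiprocessing

    def work():
        print("running in a freshly started interpreter")

    if __name__ == "__main__":
        # "spawn" and "forkserver" start a clean child process instead of
        # inheriting the parent's state the way the default "fork" does on Linux.
        ctx = multiprocessing.get_context("spawn")
        p = ctx.Process(target=work)
        p.start()
        p.join()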

    opened by ravwojdyla 4
  • Fails to parse list parameter from cli argument: JSONDecodeError

    foo.py

    import luigi
    
    class Foo(luigi.Task):
        bar = luigi.ListParameter(default="['foo']")
        def run(self):
            print(f"Running task {self}")
        def output(self):
            return luigi.LocalTarget("./foo.txt")
    

    Running luigi --module foo Foo --bar 'bar' results in json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

    Stacktrace:

    ERROR: Uncaught exception in luigi
    Traceback (most recent call last):
      File "/home/lib/python3.8/site-packages/luigi/retcodes.py", line 75, in run_with_retcodes
        worker = luigi.interface._run(argv).worker
      File "/home/lib/python3.8/site-packages/luigi/interface.py", line 213, in _run
        return _schedule_and_run([cp.get_task_obj()], worker_scheduler_factory)
      File "/home/lib/python3.8/site-packages/luigi/cmdline_parser.py", line 114, in get_task_obj
        return self._get_task_cls()(**self._get_task_kwargs())
      File "/home/lib/python3.8/site-packages/luigi/cmdline_parser.py", line 131, in _get_task_kwargs
        res.update(((param_name, param_obj.parse(attr)),))
      File "/home/lib/python3.8/site-packages/luigi/parameter.py", line 1140, in parse
        i = json.loads(x, object_pairs_hook=FrozenOrderedDict)
      File "/usr/lib/python3.8/json/__init__.py", line 370, in loads
        return cls(**kw).decode(s)
      File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
        raise JSONDecodeError("Expecting value", s, err.value) from None
    json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
    

    Expected Behavior: The command should complete successfully

    OS: Ubuntu, Python: 3.8.10, Luigi: 3.1.1
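
    For reference, ListParameter values are parsed as JSON, so both the default and the command-line value need to be JSON-compatible; a sketch of a variant that should parse (not an official fix from the maintainers) is:

    import luigi

    class Foo(luigi.Task):
        # A real list as the default; on the command line, pass a JSON array.
        bar = luigi.ListParameter(default=["foo"])

        def run(self):
            print(f"Running task {self} with bar={self.bar!r}")

        def output(self):
            return luigi.LocalTarget("./foo.txt")

    # e.g. luigi --module foo Foo --bar '["bar"]' --local-scheduler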

    opened by ymentha14 1
  • Failed to return in running luigi.build in luigi tasks

    I have a workflow that triggers another luigi.build inside a task's run method (like below). Both the parent and the child luigi.build calls use multiple workers.

    import luigi
    
    class Task2(luigi.Task):
        def run(self):
            print("*"*10, "done task2")
    
    class Task1(luigi.Task):
        def run(self):
            luigi.build([Task2()], local_scheduler=True, workers=2)
            print("*"*10, "done task1")
    
    luigi.build([Task1()], local_scheduler=True, workers=2)
    

    The scheduler in the child process (Task1.run) gets stuck and fails to return. I killed the process, and it seems to me the worker queue failed to pull the task.

    2022-11-12 01:06:00,411 INFO Informed scheduler that task   Task2__99914b932b   has status   PENDING
    2022-11-12 01:06:00,411 INFO Done scheduling tasks
    2022-11-12 01:06:00,411 INFO Running Worker with 2 processes
    ^C2022-11-12 01:06:40,606 INFO Worker Worker(salt=4670658361, workers=2, host=slurm-node6.development-singapore.chorus, username=gchan, pid=27743) was stopped. Shutting down Keep-Alive thread
    Traceback (most recent call last):
      File "test_luigi.py", line 37, in <module>
        luigi.build([Task1()], local_scheduler=True, workers=2)
      File "/home/gchan/intranet/environments/20220905_211953_12c7ddd0-0e8b-4bcd-b13b-b7e44cd1f344/lib/python3.8/site-packages/luigi/interface.py", line 239, in build
        luigi_run_result = _schedule_and_run(tasks, worker_scheduler_factory, override_defaults=env_params)
      File "/home/gchan/intranet/environments/20220905_211953_12c7ddd0-0e8b-4bcd-b13b-b7e44cd1f344/lib/python3.8/site-packages/luigi/interface.py", line 173, in _schedule_and_run
        success &= worker.run()
      File "/home/gchan/intranet/environments/20220905_211953_12c7ddd0-0e8b-4bcd-b13b-b7e44cd1f344/lib/python3.8/site-packages/luigi/worker.py", line 1242, in run
        self._handle_next_task()
      File "/home/gchan/intranet/environments/20220905_211953_12c7ddd0-0e8b-4bcd-b13b-b7e44cd1f344/lib/python3.8/site-packages/luigi/worker.py", line 1101, in _handle_next_task
        self._task_result_queue.get(
      File "/opt/chorus-python38/lib/python3.8/multiprocessing/queues.py", line 107, in get
        if not self._poll(timeout):
      File "/opt/chorus-python38/lib/python3.8/multiprocessing/connection.py", line 257, in poll
        return self._poll(timeout)
      File "/opt/chorus-python38/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
        r = wait([self], timeout)
      File "/opt/chorus-python38/lib/python3.8/multiprocessing/connection.py", line 931, in wait
        ready = selector.select(timeout)
      File "/opt/chorus-python38/lib/python3.8/selectors.py", line 415, in select
        fd_event_list = self._selector.poll(timeout)
    KeyboardInterrupt
    

    Currently I am using the latest luigi version 3.1.1 and Python 3.8.12.

    opened by gavincyi 4
Releases (3.1.1)
  • 3.1.1(Aug 18, 2022)

    3.1.1

    Added

    luigi

    • Add worker config to cache task completion results. (#3178)

    Fixed

    luigi

    • Close requests.Socket in RemoteScheduler before exiting (#3173) (#3175)
    • Use int-parameters for random.randrange() (#3177)
    • Alternate (probably more correct) fix for #3182 (supersedes #3183) #3185
    Source code(tar.gz)
    Source code(zip)
  • 3.1.0(Jun 20, 2022)

    3.1.0

    Added

    luigi

    • Documentation guidance around release version increments #3074
    • Add support for naming tasks in @requires #3077
    • Add traceback_max_length parameter for error email notifications #3086
    • Document cause of Unfulfilled dependency error #3105
    • Add additional OptionalParameter datatype options #3079
    • UI: Add rerun command snippet on "show error" modal #3117
    • Add support for updating default config parser after loading luigi #3135
    • Allow batch_email.email_interval to be set in config as email-internal-minutes #3125
    • Allow TimeDeltaParameter to accept input as "seconds" #3125
    • Add EnumListParameter to top-level attribute imports #3144
    • Enable metrics_custom_import for MetricCollectors #3146
    • Improve warning when a parameter is not consumed by a task #3170

    luigi.contrib

    • Add configure_job BigQuery property #3098
    • Add parquet support to BigQuery #3099
    • Add network retry logic to BigQuery #3088
    • Add run_task_kwargs property to ECS #3083
    • Add pickle_protocol attribute and configuration option to Spark #3001
    • Add pg8000 driver support to Postgres #3142

    Fixed

    luigi

    • Fix default value for task.disable_window #3081
    • Fix deconstructor of LocalTarget when is_tmp attribute dne #3085
    • Improved documentation reference to luigi.format.Nop import #3047
    • Fix Python 3.10 deprecation warnings #3150
    • Remove unnecessary extra call to cls.get_task_namespace() #3129
    • Fix documentation typo in notifications_test.py #3151
    • Fix docs ci #3158
    • Fix task history rendering #3153
    • Move max_shown_tasks and max_graph_nodes documentation to correct section #3156
    • Fix ability to subclass Task's Register metaclass #3154
    • Replace legacy TravisCI readme badge with GithubActions #3159

    luigi.contrib

    • Fix apache ci
      • #3091
      • #3113
    • Fix documentation typo in sqla #3110
    • Fix spark cluster mode error for missing path #3111
    • Fix connection object passed to rdbms.CopyToTable's _add_metadata_columns() #3011

    Removed

    luigi

    • Remove @Tarrasch from codeowners #3127

    Changed

    luigi

    • Update license copyright year #3108
    • Improve error message on parsing default parameter value #3115
    • Group tasks by task class in svg graph #3122
    • Upgrade tenacity version #3147
    • Enable task search to be case insensitive #3157

    luigi.contrib

    • Update ExternalPythonProgramTask parameters to OptionalParameter #3130
    • Allow newer versions of prometheus-client package #3163
    Source code(tar.gz)
    Source code(zip)
  • 3.0.3(Apr 15, 2021)

    Added:

    luigi
    • Adding retry on _obj_exists #3022
    • don't use absolute path in redirect for visualizer #2785
    luigi.contrib
    • kubernetes: labels are applied to pod #3007

    Fixed:

    luigi
    • Flush stream_for_searching_tracking_url #3000
    luigi.contrib
    • Fix azureblob tests: use json instead of numpy #3032

    Changed:

    luigi
    • replace naive retry with tenacity #3026
    Source code(tar.gz)
    Source code(zip)
  • 3.0.2(Sep 23, 2020)

    3.0.2

    Fixed:

    luigi
    • Garbage collect task result queue when worker context exits #2973
    • Fixing problem with ListParameter and Dynamic Dependencies #2970

    Changed:

    luigi
    • Drop Python 3.3 and 3.4 support #2978
    luigi.contrib
    • Use updated uri for gcs batch reqs #2998
    Source code(tar.gz)
    Source code(zip)
  • 3.0.1(Jul 23, 2020)

    Added:

    luigi
    • Worker_timeout can be 0. #2968
    • Return bq job id from bigquery.run_job() #2957
    • Documentation for check_complete_on_run config #2961
    Source code(tar.gz)
    Source code(zip)
  • 3.0.0(Jun 2, 2020)

    3.0.0

    This is a major release without many feature changes compared to 2.8.13. The reason we decided to give it a major bump is the drop of Python 2 support. From this version on, Luigi stops supporting Python 2, for the obvious reason. The 3.0.0 release includes a series of PRs deprecating Python 2, plus a few other changes listed below. Special thanks go to @drowoseque for doing all the great work!

    Added:

    luigi
    • Show task progress in visualizer workers tab. #2932

    Fixed:

    luigi
    • Fix TravisCI build #2948
    • Use is_alive in favour of isAlive for Python 3.9 compatibility. #2940
    Source code(tar.gz)
    Source code(zip)
  • 2.8.13(Apr 29, 2020)

    Added:

    luigi.contrib
    • Presto support in Luigi (#2885)

    Fixed:

    luigi
    • removed wrong type of Target.init path arg from doc master (#2927)
    • remove StreamingBodyAdaptor that didn't allow choosing the chunk size (#2929)
    • Fix docs explaining write modes for Luigi Targets. Closes #2783 (#2931)
    • All configuration parameters in docs now use underscore in their names for consistency. (#2890)

    Changed:

    luigi
    • Allowed wider popovers in graph. (#2093)
    • update documentation to prefer pykube-ng (#2924)
    Source code(tar.gz)
    Source code(zip)
  • 2.8.12(Feb 19, 2020)

    Added:

    luigi
    • EnumListParameter #2801

    Fixed:

    luigi
    • Import ABC from collections.abc instead of collections for Python 3.9 compatibility #2895

    Changed:

    luigi.contrib
    • [luigi.contrib.hive] HiveTableTarget inherits HivePartitionTarget #2872
    • [luigi.contrib.pyspark_runner] SparkSession support in PySparkTask #2862
    Source code(tar.gz)
    Source code(zip)
  • 3.0.0b2(Feb 19, 2020)

    This is the second 3.0.0 beta release, including a series of PRs deprecating Python 2, plus the following:

    Special thanks go to @drowoseque for doing all the great work!

    Added:

    luigi
    • Add internal version info #2760
    • EnumListParameter #2801 (new since 3.0.0b1)
    luigi.contrib
    • [luigi.contrib.spark] pyspark python options added #2818

    Fixed:

    luigi
    • Fix params hashing #2540
    • Check for autoload_range instead of autoload-range
    • autoload_range doc fix #2878 (new since 3.0.0b1)

    Removed:

    luigi
    • [luigi.file] removed #2832
    • [luigi.mock.MockFile] removed #2839

    Changed:

    luigi
    • Allow python-daemon >= 2.2.0 if not on windows #2796
    • Make URLLibFetcher aware of basic auth info in scheduler URL. #2791
    luigi.contrib
    • [luigi.contrib.external_program.ExternalProgramTask] logs_output_pattern_to_url provided #2822
    • [luigi.contrib.hive] HiveTableTarget inherits HivePartitionTarget #2872 (new since 3.0.0b1)
    • [luigi.contrib.pyspark_runner] SparkSession support in PySparkTask #2862 (new since 3.0.0b1)
    Source code(tar.gz)
    Source code(zip)
  • 2.8.11(Jan 2, 2020)

    Added:

    luigi
    • Add internal version info #2760
    luigi.contrib
    • [luigi.contrib.spark] pyspark python options added #2818

    Fixed:

    luigi
    • Fix params hashing #2540
    • Check for autoload_range instead of autoload-range
    • autoload_range doc fix #2878

    Removed:

    luigi
    • [luigi.file] removed #2832
    • [luigi.mock.MockFile] removed #2839

    Changed:

    luigi
    • Allow python-daemon >= 2.2.0 if not on windows #2796
    • Make URLLibFetcher aware of basic auth info in scheduler URL. #2791
    luigi.contrib
    • [luigi.contrib.external_program.ExternalProgramTask] logs_output_pattern_to_url provided #2822
    Source code(tar.gz)
    Source code(zip)
  • 2.8.10(Nov 22, 2019)

    Added:

    luigi
    • Add HEAD endpoint to scheduler server for status/health checks #2789
    luigi.contrib
    • [luigi.contrib.hive] WarehouseHiveClient #2826

    Fixed:

    luigi.contrib
    • Add Python version-agnostic get_writer_schema. #2827
    • PySparkTask: handle special characters in name (#2778) #2779

    Changed:

    luigi.contrib
    • [luigi.contrib.spark] tracking_url_pattern as a property #2820
    • Add pod_creation_wait_interal #2813
    • Added optional argument 'aws_session_token' to S3Client #2798
    Source code(tar.gz)
    Source code(zip)
  • 2.8.9(Aug 27, 2019)

    Added:

    luigi
    • Adds "Force Commit" button in UI to set tasks to DONE #2751
    • Show task history link in visualizer when recording. #2759

    Fixed:

    luigi
    • Replace documentation reference to outdated test environment py27-nonhdfs #2762
    • Issue 2644: Tasks can be run several times under certain conditions #2645
    luigi & luigi.contrib
    • Ensure ignored tests are picked up by tox #2758

    Changed:

    luigi
    • Update tornado requirement for new enough python versions #2761
    luigi.contrib
    • contrib/ftp: Clean up temporary files #2755
    Source code(tar.gz)
    Source code(zip)
  • 2.8.8(Aug 12, 2019)

    Added:

    luigi
    • Expandable Namespace Folders for the Visualiser Sidebar #2716
    • Added new companies to the luigi users list: #2730 #2747
    luigi.contrib
    • Enable Overriding Poll Interval for Kubernetes Jobs #2724

    Fixed:

    luigi
    • Code example correction #2754

    Changed:

    luigi
    • Update release process #2727
    • Change GET request to POST requests in luigi/rpc #2732
    • Fix SendGrid email API documentation. #2745
    Source code(tar.gz)
    Source code(zip)
  • 2.8.7(Jun 14, 2019)

    Added:

    luigi
    • Add check_complete_on_run to optionally mark tasks as failed if complete() is false when run finishes (#2710)
    • Add section "Running Luigi on Windows" to docs (#2720)
    • Add Giphy to list of companies using Luigi to docs (#2713)
    luigi.contrib
    • Add DropboxTarget for luigi (#2696)

    Fixed:

    luigi
    • UI: Fix time graph - y axis to account for timezones. (#2711)

    Changed:

    luigi
    • Bump dependencies used by SendGrid integration. (#2715)
    • Update copyright year in LICENSE (#2723)
    luigi.contrib
    • Make RedisTarget compatible with redis-py >= 3 (#2722)
    Source code(tar.gz)
    Source code(zip)
  • 2.8.6(May 22, 2019)

  • 2.8.5(May 9, 2019)

    Fixed:

    luigi
    • (Doc) Fix example of "summary_length" #2700
    • Fix __init__ error when using TOML config #2702
    • add callback to metric collector #2704
    luigi.contrib
    • Fix BigQueryTarget parsing in beam_dataflow module #2705

    Changed:

    luigi.contrib
    • aws batch : job queue as parameter #2689
    Source code(tar.gz)
    Source code(zip)
  • 2.8.4(May 6, 2019)

    This release is broken due to #2628.

    Added:

    luigi
    • Added support for a detailed LuigiRunResult instead of a plain Boolean (#2630)
    • Add worker option 'max_keep_alive_idle_duration (#2654)
    • Added worker-id commandline parameter (#2655)
    luigi.contrib
    • Add support for specifying kubernetes namespace (#2629)
    • Add a Task wrapper for Microsoft OpenPAI (#2531)
    • Provide automatic URL tracking for Spark applications (#2661) (#2669)
    • add beam_dataflow_task to luigi/contrib (#2675)
    Both
    • Add Prometheus contrib for monitoring purpose (#2628)

    Fixed:

    luigi
    • setup.py: Support older setuptools (<=20.1.1) (#2623)
    • Fix broken aws tests (#2658)
    • Accept pathlib based path as argument for LocalTarget (#2548)
    • Fix durations in D3 graph (fixes #2620) (#2624)
    • Import collections ABCs from collections.abc, not collections (#2683)
    • Configuration documentation: remove deprecated/wrong [core]max_reschedules entry (#2692)
    luigi.contrib
    • fixes #2223 HdfsTarget is not working with snakebite (#2572)
    • Add port field for PostgresQuery (fixes #2625) (#2627)
    Both
    • Fix flake errors after moving to python 3 (#2695)

    Changed:

    luigi
    • Simplify implementation of temporary_path() (#2652)
    • Prevent range-tasks from autoloading (#2656)
    • Replace Python MapReduce example with Spark example. (#2668)
    • Minor improvements to single-worker-timeout support (#2667)
    • Require at least python-dateutil version 2.7.5 instead of only 2.7.5 (fixes #2662) (#2679)
    • RangeMonthly should deal with whole months (#2666)
    • Reconcile underscore/dash config style handling (#2688) #2691
    luigi.contrib
    • Add autodetect parameter to BigQueryLoadTask (#2363) (#2575)
    Source code(tar.gz)
    Source code(zip)
  • 2.8.3(Jan 16, 2019)

    Added:

    luigi
    • Add BaseTIS to the company list #2607
    • Add Hopper to the company list #2614
    • give a few default values to opts when setting up logging #2612
    • Add range functionality for monthly cadence. #2601
    luigi.contrib
    • Added port to PostgresTarget #2615
    • Support for Azure Blob Storage Target #2585
    • Add Datadog contrib for monitoring purpose #2434

    Fixed:

    luigi
    • Docs: Fixed a mistake with @inherits syntax in luigi/util.py #2613
    • Check type of column before migrating schemas for task db history for postgres dialect (fixes #2563) #2564
    luigi.contrib
    • S3: Fix call to message from TypeError not working with Python 3.6 #2617
    • Use proper API call in bigtable.py's make_dataset #2618
    • Sqla: Fix the table name when reflect is True in sqla.CopyToTable (fixes #2604) #2605

    Changed:

    luigi.contrib
    • Changed to buffered reads when using GCSTarget #2588
    Source code(tar.gz)
    Source code(zip)
  • 2.8.2(Dec 12, 2018)

  • 2.8.1(Dec 11, 2018)

    Note: Broken due to a runtime error in LuigiConfigParser. See https://github.com/spotify/luigi/issues/2592.

    Added:

    luigi
    • Add some docs to interface.run #2582
    • Configure logging via TOML config #2483
    luigi.contrib
    • Added port property to CopyToTable #2561
    • Make it so we can do from luigi.contrib.hdfs import HdfsFlagTarget #2594
    • contrib: Add ExternalDailySnapshot #2591

    Fixed:

    luigi
    • (docstring) Update task.py #2589
    • Docs: Fixed "Github" to fit to the rest of the doc #2596
    • Fix inspect.getargspec() DeprecationWarning in PY3 #2579
    luigi.contrib
    • Fix Travis Moto Test Failures #2586
    • Reduce TravisCI Test Runtime #2541

    Changed:

    luigi
    • Make Worker parameter task_process_context an OptionalParameter #2574
    luigi.contrib
    • S3Client improvements #2569
    Source code(tar.gz)
    Source code(zip)
  • 2.8.0(Nov 2, 2018)

    This is a minor version bump, due to:

    • Dropping Python 3.4 and 3.5 from CI, which means no automated tests to ensure compatibility for those versions
    • [Security Patch] CORS being disabled by default. A new section of configuration [cors] is introduced to enable custom settings. For details, refer to user group topic: https://groups.google.com/forum/#!topic/luigi-user/ZgfRTpBsVUY

    Added

    luigi:
    • Add Python 3.7 compatibility (#2466) This also drops 3.4 and 3.5 from CI.
    • Interpolate environment variables in .cfg config files (#2527)
    luigi.contrib:
    • Add CopyToTable task for MySQL (#2553)
    • Add HdfsFlagTarget (#2559)

    Fixed

    luigi:
    • Fix ReadTheDocs build (#2546)
    • Make capture_output non-positional in ExternalProgramTask (#2547)
    luigi.contrib:
    • Fix S3Client's _path_to_bucket_and_key to support keys with question marks (#2534)
    • Fix S3Client.remove - add max batch size (#2529)
    • Small fix to logging in contrib/ecs.py (#2556)
    • FIX HdfsAtomicWriteDirPipe.close() when using snakebite and the file does not exist. (#2549)

    Changed:

    luigi:
    • [ImgBot] optimizes images (#2555)
    luigi.contrib:
    • Remove s3 bucket validation prior to file upload (#2528)
    • Refactor s3 copy into sub-methods (#2508)
    Source code(tar.gz)
    Source code(zip)
  • 2.7.9(Sep 28, 2018)

    Added

    luigi.contrib:
    • Added optional choice for hdfs clients (#2487)
    • s3client check for deprecated host keyword and raise error with the details (#2493)
    • Add a "capture_output" parameter to ExternalProgramTask (#2430)

    Fixed

    luigi:
    • Fix exception when toml lib is not installed (#2506)
    • Replace direct attribute accessing by using built-in function getattr (#2509)
    • set upper bound of python-daemon (#2536)
    luigi.contrib:
    • Fix S3Client.copy return value consistency (#2488)
    • Update MockTarget mode to accept r* or w* (#2519)
    Source code(tar.gz)
    Source code(zip)
  • 2.7.8(Aug 24, 2018)

  • 2.7.7(Aug 24, 2018)

    Added

    luigi:
    • Add Data Revenue to the blogged list (#2472)
    • Add default reviewers in CODEOWNERS (#2465)
    • Optional TOML configs support (#2457)
    • Add support for multiple requires and inherits arguments (#2475)
    • Add a visiblity level for luigi.Parameters (#2278)
    • Make logging of RPC retries configurable #2486
    • Added a new event 'progress' (#2498)
    luigi.contrib:
    • Additions to provide support for the Load Sharing Facility (LSF) job scheduler (#2373)
    • Added default port behaviour for Redshift (#2474)
    • Add metadata columns to the RDBMS contrib (#2440)
    • Use passed password when create a redis connection (#2489)

    Changed

    luigi:
    • Update supplementary github files to improve repo organization and maintenance (#2463)
    • Use task_id in Task.eq comparison (#2462)
    • Replace luigi.Task by RunOnceTask in scheduler_visualisation_test (#2476)
    • (Breaking change) Bump tornado milestone version (#2490). This change requires Python 2.7.9+ and 3.4+
    luigi.contrib:
    • S3 client refactor (#2482)
    • Update moto to 1.x milestone version (#2471)

    Fixed

    luigi:
    • Fix Scheduler.add_task to overwrite accepts_messages attribute. (#2469)
    • Fix race condition (#2477)
    • Fix attribute forwarding for tasks with dynamic dependencies (#2478)
    luigi.contrib:
    • Fix transfer config import (#2458)

    Removed

    luigi:
    • Remove long-deprecated scheduler config variable alternatives (#2491)
    Source code(tar.gz)
    Source code(zip)
  • 2.7.6(Jul 11, 2018)

    Added

    luigi:
    • Add a configuration parameter to force multiprocessing (#2401)
    • Add a configuration parameter to enable/disable the pause button (#2399)
    • Send messages from scheduler to tasks (via "Send message" UI button) (#2426)
    • Allow to inject a context manager around TaskProcess.run (via task_process_context configuration parameter) (#2449)
    luigi.contrib:
    • S3: use Boto3 for the S3Client (#2423, #2149)
    • GCS: add method to push files using multiprocessing (#2376)
    • HDFS: add get_merge to snakebite client (#2410)
    • Redshift: add schema to DB if it doesn't exist (#2439)
    • Redshift: add table constraints support (#2435)

    Fixed

    luigi:
    • Allow long parameters in task history DB SQL result store (#2404)
    • Fix MissingParameterException when generating execution summary (#2415)
    • Fix luigid crash due to configuration file parsing (#2394)
    • Allow explicit parsing of BoolParameters (via luigi.BoolParameter.parsing variable) (#2427)
    • Make ChoiceParameter check if option is valid within .normalize (#2454)
    • ...and a good deal of documentation fixes and similar.
    luigi.contrib:
    • BigQuery: fix bulk_complete failing when argument is a generator (#2441)
    • Kubernetes: prevent KeyError in KubernetesJobTask (#2433)
    • Kubernetes: don't set activeDeadlineSeconds by default (#2452)
    Source code(tar.gz)
    Source code(zip)
  • 2.7.5(Apr 12, 2018)

  • 2.7.4(Apr 11, 2018)

    Added

    luigi:
    • Add google-auth-httplib2 as dependency (#2384)
    • Add new company user (#2388)
    • Allow release of resources during task running. (#2346)
    luigi.contrib:
    • Add parameterized backoff limit (#2375)

    Fixed

    luigi:
    • Typo in running luigi documentation (#2387)
    • Skip running coverage on unreasonable files
    • Pass kwargs to discovery.build() when instantiating GSCClient. (#2291)
    • Removed message 'No Instance(s) Available.' in Windows when starting … #2294
    Source code(tar.gz)
    Source code(zip)
  • 2.7.3(Mar 13, 2018)

    Added

    luigi:
    • Added generated data files to .gitignore (#2367)
    luigi.contrib:
    • Add possibility to specify Redshift column compression (#2343)

    Changed

    luigi:
    • Show status message button in worker tab when only progress is set (#2344)
    • Verbose worker error logging (#2353)
    luigi.contrib:
    • Replace oauth2client by google-auth (#2361)

    Fixed

    luigi:
    • Fix unicode formatting (#2339)
    luigi.contrib:
    • Fix contrib.docker_runner exit code check (#2356)
    Source code(tar.gz)
    Source code(zip)
  • 2.7.2(Jan 24, 2018)

    Added

    luigi:
    • A button to show errors for disabled tasks in the visualizer (#2266)
    • OptionalParameter parameter class (#2295)
    luigi.contrib:
    • Support for the Docker Python SDK (#2158)

    Changed

    luigi:
    • Change the logging status of prune messages from info to debug (#2254) - reduce repetitiveness of logging
    • Speed up the visualizer so that a refresh doesn’t take minutes when number of tasks gets into the millions (#2239)
    • Hide "re-enable" and "forgive failures" buttons on success (#2281)
    • Convert dates to datetimes for DateHourParameter (#2285)
    • Handle multiprocessing with request sessions (#2290) - Fixes bug with "RPCError: Received null response from remote scheduler"
    • Replaced param_args with dynamic property with deprecation message
    • Check task equality using param_kwargs instead of param_args

    Removed

    luigi.contrib:
    • Copying of Avro field documentation by BigQueryLoadAvro (#2269) - said copying is done by BigQuery natively now

    Fixed

    luigi:
    • Remove invalid entrypoint luigi.tools.migrate (#2257)
    • Preserve filter on server on reload (#2273)
    • Fix MRO on tasks using util.requires (#2307)
    luigi.contrib:
    • Fix Python 3 TypeError in contrib.hive.HiveTableTarget.exists (#2323)
    Source code(tar.gz)
    Source code(zip)
  • 2.7.1(Oct 5, 2017)

    Luigi 2.7.1 is mostly bug fixes and small feature additions.

    • Standardize Redshift credential usage across Redshift tasks: https://github.com/spotify/luigi/pull/2068
    • BigQueryLoadAvro now handles complex Avro types: https://github.com/spotify/luigi/pull/2224
    • Support for user-specified number of paralleled scheduled processes: https://github.com/spotify/luigi/pull/2205
    • ECS support for non-default cluster: https://github.com/spotify/luigi/pull/2045
    • BigQuery ExtractTask support: https://github.com/spotify/luigi/pull/2134
    • Support for autocommitting queries within Redshift and Postgres, allowing VACUUM statement execution: https://github.com/spotify/luigi/pull/2242
    • High Availability (HA) support with WebHdfsClient using multiple namenodes: https://github.com/spotify/luigi/pull/2230
    • Column mapping support for Redshift S3CopyToTable using the column list: https://github.com/spotify/luigi/pull/2245

    There have also been some other feature additions, bugfixes, and docfixes. See all commits here.

    Source code(tar.gz)
    Source code(zip)