Dask (software) explained

Dask
Author:Matthew Rocklin
Developer:Dask
Released:
Latest Release Version:2024.2.1
Operating System:Linux, Microsoft Windows, macOS
Programming Language:Python
Genre:Data analytics
Language:Python

Dask is an open-source Python library for parallel computing. Dask[1] scales Python code from multi-core local machines to large distributed clusters in the cloud. Dask provides a familiar user interface by mirroring the APIs of other libraries in the PyData ecosystem including: Pandas, scikit-learn and NumPy. It also exposes low-level APIs that help programmers run custom algorithms in parallel.

Dask was created by Matthew Rocklin[2] in December 2014[3] and has over 9.8k stars and 500 contributors on GitHub.[4]

Dask is used by retail, financial and governmental organizations, as well as by life science and geophysical institutes. Walmart,[5] Wayfair,[6] JDA,[7] GrubHub,[8] General Motors,[9] Nvidia,[10] Harvard Medical School, Capital One[11] and NASA[12] are among the organizations that use Dask.

Overview

Dask has two parts:[13]

Dask's high-level parallel collections – DataFrames,[14] Bags,[15] and Arrays[16] – operate in parallel on datasets that may not fit into memory.

Dask’s task scheduler executes task graphs in parallel. It can scale to thousand-node clusters. This powers the high-level collections as well as custom, user-defined workloads using low-level collections.

Dask collections

Dask supports several user interfaces[17] called high-level and low-level collections:

High-level: Dask Array, Dask DataFrame, Dask Bag and Dask-ML

Low-level: Delayed and Futures

Under the hood, each of these user interfaces adopts the same parallel computing machinery.

High-level collections

Dask's high-level collections are the natural entry point for users who want to scale up their Pandas, NumPy or scikit-learn workloads. Dask DataFrame, Dask Array and Dask-ML are alternatives to the Pandas DataFrame, the NumPy array and scikit-learn respectively, with slight variations from the original interfaces.

Dask Array

Dask Array is a high-level collection that parallelizes array-based workloads and maintains the familiar NumPy API, including slicing, arithmetic, reductions and mathematical functions, making it easy for NumPy users to scale up array operations.

A Dask array comprises many smaller n-dimensional NumPy arrays and uses a blocked algorithm to enable computation on larger-than-memory arrays. During an operation, Dask translates the array operation into a task graph, breaks up large NumPy arrays into multiple smaller chunks, and executes the work on each chunk in parallel. Results from each chunk are combined to produce the final output.
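
The chunked, lazy behaviour described above can be illustrated with a minimal sketch. The array shape and chunk sizes are arbitrary choices made for this example, not recommendations.

```python
import numpy as np
import dask.array as da

# A 1000 x 1000 array of ones, split into 250 x 250 chunks (a 4 x 4 grid of blocks).
x = da.ones((1000, 1000), chunks=(250, 250))

# This line only builds a task graph; nothing is computed yet.
y = (x + x.T).sum(axis=0)

# compute() runs the graph chunk by chunk and returns an ordinary NumPy array.
result = y.compute()
print(type(result), result.shape)
```

Each block is a regular NumPy array, so the per-chunk work is done by NumPy itself and Dask only coordinates the blocks.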

Dask DataFrame

Dask DataFrame is a high-level collection that parallelizes DataFrame-based workloads. A Dask DataFrame comprises many smaller Pandas DataFrames partitioned along the index. It maintains the familiar Pandas API, making it easy for Pandas users to scale up DataFrame workloads. During a DataFrame operation, Dask creates a task graph and triggers operations on the constituent DataFrames in a manner that reduces memory footprint and increases parallelism by sharing and deleting intermediate results.

Dask Bag

Dask Bag is an unordered collection of repeated objects, a hybrid between a set and a list. Dask Bag is used to parallelize computation of semi-structured or unstructured data, such as JSON records, text data, log files or user-defined Python objects using operations such as filter, fold, map and groupby. Dask Bags can be created from an existing Python iterable or can load data directly from text files and binary files in the Avro format.
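
A minimal sketch of the Bag operations mentioned above, using a few hand-written JSON lines in place of a real log or text file. The threaded scheduler is requested explicitly here only to keep the example lightweight.

```python
import json
import dask.bag as db

# Illustrative JSON records; a real workload might use db.read_text instead.
lines = [
    '{"name": "Alice", "age": 34}',
    '{"name": "Bob", "age": 17}',
    '{"name": "Carol", "age": 52}',
]

bag = db.from_sequence(lines, npartitions=2)
records = bag.map(json.loads)

# filter and pluck are lazy, like the other collection operations.
adults = records.filter(lambda r: r["age"] >= 18).pluck("name")
names = adults.compute(scheduler="threads")

# fold reduces each partition, then combines the partial results.
total_age = records.pluck("age").fold(lambda a, b: a + b).compute(scheduler="threads")
print(sorted(names), total_age)
```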

Low-level collections

The Dask low-level interface allows for more customization. It is suitable for data that does not fall within the scope of a Dask DataFrame, Bag or Array. Dask has the following low-level collections:

Delayed

Dask delayed is an interface used to parallelize generic Python code that does not fit into high-level collections like Dask Array or Dask DataFrame. Python functions decorated with Dask delayed adopt a lazy evaluation strategy by deferring execution and generating a task graph from the function and its arguments. The Python function only executes when .compute() is invoked. Dask delayed can be used as a function, dask.delayed, or as a decorator, @dask.delayed.
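
Both forms can be sketched briefly. The functions inc and add are hypothetical stand-ins for arbitrary user code.

```python
import dask


@dask.delayed
def inc(x):
    return x + 1


@dask.delayed
def add(a, b):
    return a + b


# Decorator form: calling the functions only records tasks in a graph.
total = add(inc(1), inc(2))
result = total.compute()  # the functions actually run here

# Function form: wrap an existing callable without decorating it.
lazy_sum = dask.delayed(sum)([inc(i) for i in range(3)])
summed = lazy_sum.compute()
print(result, summed)
```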

Futures

Dask Futures, an immediate (non-lazy) alternative to Dask delayed, provides a real-time task framework that extends Python’s concurrent.futures interface, which provides a high-level interface for asynchronous execution of callables.
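
The contrast with Dask delayed can be seen in a short sketch. It assumes the dask.distributed package is installed and starts an in-process client purely for demonstration; square is a hypothetical user function.

```python
from dask.distributed import Client


def square(x):
    return x ** 2


# In-process workers and no dashboard, to keep the example self-contained.
client = Client(processes=False, dashboard_address=None)

# submit() schedules the call immediately and returns a Future.
future = client.submit(square, 4)

# map() submits one task per input; futures can be passed to further tasks.
futures = client.map(square, range(5))
total = client.submit(sum, futures)

# result() blocks until the value is available.
first_value = future.result()
total_value = total.result()
print(first_value, total_value)

client.close()
```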

It is common to combine high-level and low-level interfaces. For example, users can use Dask array/bag/dataframe to load and pre-process data, switch to Dask delayed for a custom algorithm that is specific to their domain, then switch back to Dask array/dataframe to clean up and store results.
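
One way such a round trip might look is sketched below. The function custom_step is a hypothetical placeholder for a domain-specific algorithm, and the example assumes that it preserves the shape and dtype of each block.

```python
import numpy as np
import dask
import dask.array as da

# High-level: a chunked array with three blocks of four elements.
x = da.arange(12, chunks=4)


@dask.delayed
def custom_step(block):
    # Placeholder for a custom algorithm: reverse each block.
    return block[::-1].copy()


# Low-level: turn each block into a Delayed object and apply the custom step.
blocks = x.to_delayed().ravel()
processed = [
    da.from_delayed(custom_step(b), shape=(4,), dtype=x.dtype) for b in blocks
]

# Back to high-level: reassemble a Dask array and compute it.
y = da.concatenate(processed)
result = y.compute()
print(result)
```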

Scheduling

Dask’s high and low-level collections create a directed acyclic graph of tasks,[22] which represents the relationship between computation tasks. A node in a task graph represents a Python function that performs a unit of computation and an edge represents the data dependency between the upstream and downstream task. After a task graph is generated, the task scheduler manages the workflow based on the given task graph by assigning tasks to workers in a manner that improves parallelism and respects the data dependencies.
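
A task graph can also be written by hand, which makes the node and edge structure explicit. The sketch below uses the dictionary-based graph format from the Dask specification, with arbitrary key names, and runs it on the threaded scheduler.

```python
from operator import add, mul
from dask.threaded import get

# Keys name the nodes; a tuple starting with a callable is a task.
# String arguments that match other keys are the data dependencies (edges).
dsk = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),        # depends on x and y
    "product": (mul, "sum", 10),   # depends on sum
}

# The scheduler walks the graph and runs each task once its inputs are ready.
result = get(dsk, "product")
print(result)
```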

Dask provides two families of schedulers: single-machine scheduler and distributed scheduler.

Single-machine scheduler

A single-machine scheduler is the default. It provides basic features on a local process pool or thread pool and is meant to be used on a single machine. It is simple and cheap to use but does not scale beyond one machine.

  • Local threads: A threaded scheduler leverages Python’s concurrent.futures.ThreadPoolExecutor to execute computations. It has a low memory footprint and does not require any setup. As all the computations occur in the same process, threaded schedulers incur minimal task overhead and no cost from transferring data between tasks. Due to Python’s Global Interpreter Lock, local threads provide parallelism only when the computation is primarily non-Python code, which is the case for Pandas DataFrames, NumPy arrays and other Python/C/C++-based projects.
  • Local process: A multiprocessing scheduler leverages Python’s concurrent.futures.ProcessPoolExecutor to execute computations. A task and its dependencies are transferred from the main process to a local process, executed, and the results are transferred back to the main process. This bypasses Python’s Global Interpreter Lock and provides parallelism for computation tasks made up primarily of Python code. However, transferring data between the main and local processes degrades performance, especially when the size of the data transferred is large.
  • Single thread: A single-threaded scheduler executes computations with no parallelism. It is used for debugging purposes.
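
The single-machine schedulers listed above can be selected by name, either per call or through Dask's configuration. The following is a minimal sketch; the array is only there to give the schedulers something to run.

```python
import dask
import dask.array as da

x = da.arange(1_000_000, chunks=100_000)
total = x.sum()

# Threaded scheduler, chosen for a single call.
threaded = total.compute(scheduler="threads")

# Single-threaded scheduler, convenient when stepping through with a debugger.
single = total.compute(scheduler="synchronous")

# The choice can also be set for a block of code.
with dask.config.set(scheduler="threads"):
    configured = total.compute()

# scheduler="processes" selects the multiprocessing scheduler in the same way;
# in a script it is safest to call it under an `if __name__ == "__main__":` guard.
print(threaded, single, configured)
```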

Distributed scheduler

    Dask’s distributed scheduler[23] can be set up on a local machine or scaled out on a cluster. Dask can work with resource managers such as Hadoop YARN and Kubernetes, and with job schedulers such as PBS, Slurm, SGE and LSF on High Performance Computing (HPC) clusters.
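
On a single machine, the distributed scheduler can be tried out with a local cluster. The sketch below assumes the dask.distributed package is installed; the worker counts are arbitrary, and in-process workers are used only to keep the example lightweight.

```python
import dask.array as da
from dask.distributed import Client, LocalCluster

# Two in-process workers with one thread each, and no dashboard.
cluster = LocalCluster(
    n_workers=2,
    threads_per_worker=1,
    processes=False,
    dashboard_address=None,
)
client = Client(cluster)

# Once a Client exists, compute() calls are sent to the distributed scheduler.
x = da.random.random((2000, 2000), chunks=(500, 500))
mean = x.mean().compute()
print(mean)

client.close()
cluster.close()
```

Pointing the Client at the address of a remote scheduler instead of a LocalCluster is the usual way to move the same code onto a real cluster.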

    Dask-ML

    Dask-ML is compatible with scikit-learn’s estimator API of fit, transform and predict and is well integrated with machine learning and deep learning frameworks such as XGBoost, LightGBM, PyTorch, Keras, and TensorFlow through scikit-learn compatible wrappers.

    Integrations

    scikit-learn integration

    Selected scikit-learn estimators and utilities can be parallelized[24] by executing jobs across multiple CPU cores using the Joblib library. The number of processes is determined by the n_jobs parameter. By default, the Joblib library uses loky as its multi-processing backend. Dask offers an alternative Joblib backend, which is useful for scaling Joblib-backed scikit-learn algorithms out to a cluster of machines for compute-constrained workloads.
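
A minimal sketch of the Joblib backend in use, assuming scikit-learn, joblib and dask.distributed are installed. The dataset and estimator settings are arbitrary, and the in-process client stands in for a real cluster.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A client must exist before the Dask backend is used.
client = Client(processes=False, dashboard_address=None)

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=20, n_jobs=-1, random_state=0)

# Inside this block, Joblib hands its jobs to the Dask workers.
with joblib.parallel_backend("dask"):
    clf.fit(X, y)

accuracy = clf.score(X, y)
print(accuracy)

client.close()
```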

    For memory constrained workloads, Dask offers alternatives, such as Parallel Meta-estimators[25] for parallelizing and scaling out tasks that are not parallelized within scikit-learn and Incremental Hyperparameter Optimization[26] for scaling hyper-parameter search and parallelized estimators.[27]

    XGBoost & LightGBM integrations

    XGBoost[28] and LightGBM[29] are popular algorithms based on gradient boosting, and both are integrated with Dask for distributed learning. Dask does not power XGBoost or LightGBM. Rather, it sets up the cluster, scheduler and workers that are required, then hands the data off to the machine learning framework to perform distributed training.

    A Dask cluster is composed of a central scheduler and multiple distributed workers. To train an XGBoost model with Dask,[30] an XGBoost scheduler is started in the same process as the Dask central scheduler, and an XGBoost worker is started in the same process as each Dask worker. The Dask workers then hand their Pandas DataFrames over to the local XGBoost worker for distributed training.

    PyTorch integration

    Skorch[31] is a scikit-learn compatible wrapper for PyTorch, which enables Dask-ML to be used together with PyTorch.

    Keras and TensorFlow integrations

    SciKeras[32] is a scikit-learn compatible wrapper for Keras models, which enables Dask-ML to be used with Keras.

    Applications

    Retail

    Examples of retail use include:

    Life sciences

    Dask is used for high resolution, 4-dimensional, cellular imagery by Harvard Medical School, Howard Hughes Medical Institute, Chan Zuckerberg Initiative, and the UC Berkeley Advanced Bioimaging Center. They record the evolution and movements of a 3-dimensional cell over time, in maximum detail. This generates large amounts of data that is difficult to analyze with traditional methods. Dask helps them scale their data analysis workflows with its API that resembles NumPy, Pandas, and scikit-learn code.

    Dask is also used at the Novartis Institute for Biomedical Research to scale machine learning prototypes.

    Finance industry

    Geophysical sciences

    Dask is used in climate science, energy, hydrology, meteorology and satellite imaging by organizations such as NASA, LANL, the Pangeo[37] Earth science project and the UK Met Office.[38]

    Oceanographers produce massive simulated datasets of the Earth’s oceans, while other researchers analyze large seismology datasets from sensors around the world, collect large numbers of observations from satellites and weather stations, and run large simulations.

    Software libraries

    Dask is integrated into many libraries, such as Pangeo[39] and xarray; time series software like Prophet[40] and tsfresh;[41] ETL/ML software like scikit-learn,[42] RAPIDS,[43] and XGBoost; workflow management tools like Apache Airflow[44] and Prefect.[45]

    History

    2014–2015

    Dask was originally developed at Continuum Analytics, a for-profit Python consulting company that eventually became Anaconda, Inc.,[46] the creator of many open-source packages and the Anaconda Python distribution. Dask grew out of the Blaze[47] project, a DARPA[48] funded project to accelerate computation in open source.

    Blaze was an ambitious project that tried to redefine computation, storage, compression, and data science APIs for Python, led originally by Travis Oliphant and Peter Wang, the co-founders of Anaconda. However, Blaze’s approach of being an ecosystem-in-a-package made it hard for new users to adopt.

    Instead of rewriting a software ecosystem, Dask’s team intended to augment the existing one with the right component. With this idea in mind, on December 21, 2014 Matthew Rocklin created Dask. The purpose[49] of Dask was originally to parallelize NumPy so that it could use one full workstation computer, which was common in finance shops at the time.

    2015–2017

    The first projects to really adopt Dask were Xarray[50] (commonly used in geo sciences) and Scikit-Image[51] (commonly used in image processing). Dask was integrated into Xarray within a few months of being created. It provided Dask with its first user community, which remains to this day.

    Understanding that there is demand for a lightweight parallelism solution for Pandas DataFrames[52] and machine learning tools, such as scikit-learn, Dask quickly evolved to support other projects as well.

    2018

    Since 2018, other teams and institutions in academia, tech companies and large organizations, such as NASA, the UK Met Office, Blue Yonder and Nvidia, have become interested in Dask and begun integrating it into their systems.

    Dask received support from a diverse set of sources:[53] the US Government (DARPA grant), the Gordon and Betty Moore Foundation, Anaconda, NSF, NASA (US research grants with collaborations like Pangeo) and Nvidia.

    2020–present

    In 2020, Matthew Rocklin founded Coiled Computing, Inc.[54] to provide further support for Dask development and enable companies to deploy Dask clusters in the cloud. In May 2021, the company raised $21 million in series A funding led by Bessemer Venture Partners.[55]

    References

    1. Web site: Dask . 2022-05-12 . dask.org.
    2. Web site: Matthew Rocklin - Bio . 2022-05-12 . matthewrocklin.com .
    3. Web site: GitHub, Dask, 2014 . 2022-05-12 . github.com . 2022-05-12 . https://web.archive.org/web/20220512075452/https://github.com/dask/dask/commit/05488db498c1561d266c7b676b8a89021c03a9e7 . live .
    4. Web site: GitHub, Dask, 2022 . 2022-05-12 . github.com . 2022-08-31 . https://web.archive.org/web/20220831130744/https://github.com/dask/dask . live .
    5. News: Caulfield . Brian . Walmart, NVIDIA Discuss How They're Working Together to Transform Retail . blogs.nvidia.com . 2022-05-12 . 2022-05-21 . https://web.archive.org/web/20220521090540/https://blogs.nvidia.com/blog/2019/07/11/walmart-nvidia/ . live .
    6. News: Sharma Meenakshi . Gonsalves Nick . Transforming Model Training Workflows at Wayfair . aboutwayfair.com . 2022-05-12.
    7. News: Eswaramoorthy . Pavithra . Who Uses Dask? . coiled.io . 2022-05-12 . 2022-05-12 . https://web.archive.org/web/20220512075454/https://coiled.io/blog/who-uses-dask/ . live .
    8. News: Bowne-Anderson . Hugo . Dask and TensorFlow in Production at Grubhub . coiled.io . 2022-05-12.
    9. Web site: Companies Currently Using Dask . 2022-05-12 . discovery.hgdata.com . 2023-03-24 . https://web.archive.org/web/20230324002122/https://discovery.hgdata.com/product/dask . live .
    10. Web site: DASK . 2022-05-12 . nvidia.com.
    11. News: Eswaramoorthy . Pavithra . Distributed Machine Learning at Capital One . coiled.io . 2022-05-12 . 2022-05-12 . https://web.archive.org/web/20220512075453/https://coiled.io/blog/distributed-machine-learning-at-capital-one/#:~:text=Capital%20One%20uses%20SSH%2C%20YARN,increasingly%20popular%2C%20like%20Coiled%20Cloud! . live .
    12. Web site: Using Dask at NAS . 2022-05-12 . nas.nasa.gov . 2022-10-21 . https://web.archive.org/web/20221021000707/https://www.nas.nasa.gov/hecc/support/kb/using-dask-at-nas_648.html . live .
    13. Web site: Scalable computing with Dask . 2022-05-12 . ULHPC Tutorials . 2022-08-29 . https://web.archive.org/web/20220829171140/https://ulhpc-tutorials.readthedocs.io/en/latest/python/advanced/dask-ml/ . live .
    14. Web site: DataFrame - Dask documentation . 2022-05-12 . docs.dask.org . 2022-05-12 . https://web.archive.org/web/20220512093208/https://docs.dask.org/en/stable/dataframe.html . live .
    15. Web site: Bag - Dask documentation . 2022-05-12 . docs.dask.org . 2022-05-12 . https://web.archive.org/web/20220512093209/https://docs.dask.org/en/stable/bag.html . live .
    16. Web site: Array - Dask documentation . 2022-05-12 . docs.dask.org . 2022-05-12 . https://web.archive.org/web/20220512093207/https://docs.dask.org/en/stable/array.html . live .
    17. News: Eswaramoorthy . Pavithra . What is Dask? . coiled.io . 2022-05-12 . 2022-05-12 . https://web.archive.org/web/20220512093207/https://coiled.io/blog/what-is-dask/ . live .
    18. Web site: Dask-ML . 2022-05-12 . ml.dask.org . 2022-05-17 . https://web.archive.org/web/20220517105722/https://ml.dask.org/ . live .
    19. Web site: Parallel computing with Dask . 2022-05-12 . docs.xarray.dev . 2022-05-16 . https://web.archive.org/web/20220516011816/https://docs.xarray.dev/en/latest/user-guide/dask.html . live .
    20. Web site: Delayed - Dask documentation . 2022-05-12 . docs.dask.org . 2022-05-12 . https://web.archive.org/web/20220512093208/https://docs.dask.org/en/latest/delayed.html . live .
    21. Web site: Futures - Dask documentation . 2022-05-12 . docs.dask.org . 2022-05-12 . https://web.archive.org/web/20220512093223/https://docs.dask.org/en/latest/futures.html . live .
    22. Web site: Specification - Dask documentation . 2022-05-12 . docs.dask.org . 2022-05-12 . https://web.archive.org/web/20220512093217/https://docs.dask.org/en/latest/spec.html . live .
    23. Web site: Dask scheduler - Dask documentation . 2022-05-12 . docs.dask.org . 2022-05-12 . https://web.archive.org/web/20220512093250/https://docs.dask.org/en/stable/deploying.html . live .
    24. Web site: Computing with scikit-learn . 2022-05-12 . scikit-learn.org.
    25. Web site: Parallel Prediction and Transformation - Dask documentation . 2022-05-12 . ml.dask.org . 2022-05-12 . https://web.archive.org/web/20220512112826/https://ml.dask.org/meta-estimators.html#parallel-meta-estimators . live .
    26. Web site: Incremental Hyperparameter Optimization - Dask documentation . 2022-05-12 . ml.dask.org . 2022-06-19 . https://web.archive.org/web/20220619121158/https://ml.dask.org/hyper-parameter-search.html#hyperparameter-incremental . live .
    27. Web site: API Reference - Dask documentation . 2022-05-12 . ml.dask.org . 2022-05-12 . https://web.archive.org/web/20220512112825/https://ml.dask.org/modules/api.html#api . live .
    28. Web site: Distributed XGBoost with Dask . 2022-05-12 . XGBoost Tutorials . 2022-04-25 . https://web.archive.org/web/20220425022918/https://xgboost.readthedocs.io/en/stable/tutorials/dask.html . live .
    29. Web site: How Distributed LightGBM Works. Dask . 2022-05-12 . LightGBM. Distributed Learning Guide . 2022-05-21 . https://web.archive.org/web/20220521082437/https://lightgbm.readthedocs.io/en/latest/Parallel-Learning-Guide.html#dask . live .
    30. Web site: Rocklin . Matthew . Dask and Pandas and XGBoost . 2022-05-12 . matthewrocklin.com.
    31. Web site: Skorch documentation . 2022-05-12 . Skorch . 2022-06-16 . https://web.archive.org/web/20220616014208/https://skorch.readthedocs.io/en/stable/ . live .
    32. Web site: SciKeras documentation . 2022-05-12 . adriangb.com . 2022-10-06 . https://web.archive.org/web/20221006082421/https://www.adriangb.com/scikeras/stable/ . live .
    33. Web site: Dask Usage at Blue Yonder . 2022-05-12 . tech.blueyonder.com . 19 June 2020 . 2021-04-23 . https://web.archive.org/web/20210423071218/https://tech.blueyonder.com/dask-usage-at-blue-yonder/ . live .
    34. News: Bowne-Anderson . Hugo . Search at Grubhub and User Intent . coiled.io . 2022-05-12 . 2022-05-12 . https://web.archive.org/web/20220512131130/https://coiled.io/blog/grubhub-science-thursday/ . live .
    35. News: McEntee Ryan . McCarty Mike . Dask & RAPIDS: The Next Big Thing for Data Science & ML . capitalone.com . 2022-05-12 . 2022-05-16 . https://web.archive.org/web/20220516164051/https://www.capitalone.com/tech/machine-learning/dask-and-rapids-data-science-and-machine-learning-at-capital-one/ . live .
    36. News: Patel . Harshil . Which library should I use? Apache Spark, Dask, and Pandas Performance Compared (With Benchmarks) . censius.ai . 2022-05-12.
    37. Web site: Adapting Dask to Data Intensive Geoscience Research . 2022-05-12 . coiled.wistia.com.
    38. Web site: Met Office . 2022-05-12 . metoffice.gov.uk . 2022-01-28 . https://web.archive.org/web/20220128234951/https://www.metoffice.gov.uk/ . live .
    39. Web site: Pangeo . 2022-05-12 . pangeo.io . 2022-05-23 . https://web.archive.org/web/20220523022047/https://pangeo.io/ . live .
    40. Web site: Forecasting with HEAVY.AI and Prophet . 2022-05-12 . docs.heavy.ai . 2022-05-23 . https://web.archive.org/web/20220523001711/https://docs.heavy.ai/data-science/additional-examples/forecasting-with-omnisci-and-prophet . live .
    41. Web site: Dask - the simple way. Tsfresh documentation . 2022-05-12 . tsfresh.readthedocs.io . 2022-06-19 . https://web.archive.org/web/20220619170927/https://tsfresh.readthedocs.io/en/latest/text/large_data.html . live .
    42. Web site: scikit-learn . 2022-05-12 . scikit-learn . 2022-05-12 . https://web.archive.org/web/20220512144656/https://scikit-learn.org/stable/ . live .
    43. Web site: Scale Python with Dask on GPUs . 2022-05-12 . rapids.ai.
    44. Web site: Dask Executor - Apache Airflow Documentation . 2022-05-12 . airflow.apache.org . 2022-05-11 . https://web.archive.org/web/20220511061021/https://airflow.apache.org/docs/apache-airflow/stable/executor/dask.html . live .
    45. Web site: Deployment: Dask. Prefect Docs. . 2022-05-12 . docs.prefect.io . 2022-06-06 . https://web.archive.org/web/20220606173413/https://docs.prefect.io/core/advanced_tutorials/dask-cluster.html . live .
    46. Web site: Anaconda . 2022-05-12 . anaconda.com . 2022-05-12 . https://web.archive.org/web/20220512015735/https://www.anaconda.com/ . live .
    47. Web site: The Blaze Ecosystem . 2022-05-12 . blaze.pydata.org . 2022-04-09 . https://web.archive.org/web/20220409032654/https://blaze.pydata.org/ . live .
    48. Web site: DARPA . 2022-05-12 . darpa . 2020-01-15 . https://web.archive.org/web/20200115130051/https://www.darpa.mil/ . live .
    49. Web site: Dask History. . 2022-05-12 . Coiled. YouTube . 2022-05-12 . https://web.archive.org/web/20220512144656/https://www.youtube.com/watch?v=5nJVg8j11h0&t=20s . live .
    50. Web site: Xarray . 2022-05-12 . xarray.pydata.org . 2022-05-17 . https://web.archive.org/web/20220517033814/https://xarray.pydata.org/en/stable/ . live .
    51. Web site: Image processing in Python . 2022-05-12 . scikit-image . 2022-05-11 . https://web.archive.org/web/20220511063919/https://scikit-image.org/ . live .
    52. Web site: pandas . 2022-05-12 . pandas . 2012-02-13 . https://web.archive.org/web/20120213062503/https://pandas.pydata.org/ . live .
    53. Web site: Rocklin . Matthew . Funding Dask, a brief history . matthewrocklin.com . 2022-05-12 . 2022-05-12 . https://web.archive.org/web/20220512144656/https://matthewrocklin.com/blog/work/2020/01/08/founding-1 . live .
    54. Web site: Coiled: Python for Data Science on the Cloud with Dask . 2022-05-12 . coiled.io.
    55. News: Wiggers . Kyle . Data and AI operations startup Coiled nabs $21M . VentureBeat . 2022-05-12 . 2022-05-12 . https://web.archive.org/web/20220512144656/https://venturebeat.com/2021/05/18/data-and-ai-operations-startup-coiled-nabs-21m/ . live .