Probably the best curated list of data science software in Python.
Machine Learning
General Purpose Machine Learning
- scikit-learn – Machine learning in Python.
- Shogun – Machine learning toolbox.
- xLearn – High Performance, Easy-to-use, and Scalable Machine Learning Package.
- cuML – RAPIDS Machine Learning Library.
- modAL – Modular active learning framework for Python3.
- Sparkit-learn – PySpark + scikit-learn = Sparkit-learn.
- mlpack – A scalable C++ machine learning library (Python bindings).
- dlib – Toolkit for making real world machine learning and data analysis applications in C++ (Python bindings).
- MLxtend – Extension and helper modules for Python’s data analysis and machine learning libraries.
- hyperlearn – 50%+ Faster, 50%+ less RAM usage, GPU support re-written Sklearn, Statsmodels.
- Reproducible Experiment Platform (REP) – Machine Learning toolbox for Humans.
- scikit-multilearn – Multi-label classification for python.
- seqlearn – Sequence classification toolkit for Python.
- pystruct – Simple structured learning framework for Python.
- sklearn-expertsys – Highly interpretable classifiers for scikit learn, producing easily understood decision rules instead of black box models.
- RuleFit – Implementation of the rulefit.
- metric-learn – Metric learning algorithms in Python.
- pyGAM – Generalized Additive Models in Python.
- Karate Club – An unsupervised machine learning library for graph structured data.
- Little Ball of Fur – A library for sampling graph structured data.
- causalml – Uplift modeling and causal inference with machine learning algorithms.
Time Series
- sktime – A unified framework for machine learning with time series.
- tslearn – Machine learning toolkit dedicated to time-series data.
- tick – Module for statistical learning, with a particular emphasis on time-dependent modelling.
- Prophet – Automatic Forecasting Procedure.
- PyFlux – Open source time series library for Python.
- bayesloop – Probabilistic programming framework that facilitates objective model selection for time-varying parameter models.
- luminol – Anomaly Detection and Correlation library.
- dateutil – Powerful extensions to the standard datetime module.
- maya – Makes it easy to parse strings and change timezones.
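As a rough illustration of the parsing and timezone conversion that the dateutil and maya entries describe, here is a minimal sketch using only the standard library. It does not use either package's API; the helper name `to_timezone` is hypothetical, and it assumes the input is an ISO 8601 string that carries a UTC offset.

```python
from datetime import datetime, timedelta, timezone


def to_timezone(text, offset_hours):
    """Parse an ISO 8601 timestamp with a UTC offset and convert it
    to a fixed-offset timezone (hypothetical helper, stdlib only)."""
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Without an offset the conversion would silently assume local time.
        raise ValueError("expected a timestamp with a UTC offset")
    target = timezone(timedelta(hours=offset_hours))
    return parsed.astimezone(target)


# 12:00 at UTC+2 is 10:00 in UTC.
converted = to_timezone("2021-03-01T12:00:00+02:00", 0)
print(converted.isoformat())
```

Fixed offsets keep the sketch dependency-free; named zones with daylight-saving rules are where the listed libraries are likely to be more convenient.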
Automated Machine Learning
- TPOT – Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
- auto-sklearn – An automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator.
- MLBox – A powerful Automated Machine Learning python library.
Ensemble Methods
- ML-Ensemble – High performance ensemble learning.
- Stacking – Simple and useful stacking library, written in Python.
- stacked_generalization – Library for machine learning stacking generalization.
- vecstack – Python package for stacking (machine learning technique).
Imbalanced Datasets
- imbalanced-learn – Module to perform under sampling and over sampling with various techniques.
- imbalanced-algorithms – Python-based implementations of algorithms for learning on imbalanced data.
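To make the idea of over sampling concrete, the sketch below duplicates randomly chosen minority-class samples until every class is as large as the biggest one. It is a plain-Python illustration of the technique, not the API of imbalanced-learn or imbalanced-algorithms; the function name `random_oversample` and the toy data are assumptions made for this example.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Randomly duplicate samples of smaller classes until all classes
    match the size of the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        # Pool of original samples belonging to this class.
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: four samples of class 0 and one sample of class 1.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

Resampling should normally be applied to the training split only, so that duplicated samples do not leak into evaluation data.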
Random Forests
- rpforest – A forest of random projection trees.
- sklearn-random-bits-forest – Wrapper of the Random Bits Forest program written by (Wang et al., 2016).
- rgf_python – Python Wrapper of Regularized Greedy Forest.
Extreme Learning Machine
- Python-ELM – Extreme Learning Machine implementation in Python.
- Python Extreme Learning Machine (ELM) – A machine learning technique used for classification/regression tasks.
- hpelm – High performance implementation of Extreme Learning Machines (fast randomized neural networks).
Kernel Methods
- pyFM – Factorization machines in python.
- fastFM – A library for Factorization Machines.
- tffm – TensorFlow implementation of an arbitrary order Factorization Machine.
- liquidSVM – An implementation of SVMs.
- scikit-rvm – Relevance Vector Machine implementation using the scikit-learn API.
- ThunderSVM – A fast SVM Library on GPUs and CPUs.
Gradient Boosting
- XGBoost – Scalable, Portable and Distributed Gradient Boosting.
- LightGBM – A fast, distributed, high performance gradient boosting.
- CatBoost – An open-source gradient boosting on decision trees library.
- ThunderGBM – Fast GBDTs and Random Forests on GPUs.
Deep Learning
PyTorch
- PyTorch – Tensors and Dynamic neural networks in Python with strong GPU acceleration.
- torchvision – Datasets, Transforms and Models specific to Computer Vision.
- torchtext – Data loaders and abstractions for text and NLP.
- torchaudio – An audio library for PyTorch.
- ignite – High-level library to help with training neural networks in PyTorch.
- PyToune – A Keras-like framework and utilities for PyTorch.
- skorch – A scikit-learn compatible neural network library that wraps pytorch.
- PyTorchNet – An abstraction to train neural networks.
- pytorch_geometric – Geometric Deep Learning Extension Library for PyTorch.
- Catalyst – High-level utils for PyTorch DL & RL research.
- pytorch_geometric_temporal – Temporal Extension Library for PyTorch Geometric.
TensorFlow
- TensorFlow – Computation using data flow graphs for scalable machine learning by Google.
- TensorLayer – Deep Learning and Reinforcement Learning Library for Researcher and Engineer.
- TFLearn – Deep learning library featuring a higher-level API for TensorFlow.
- Sonnet – TensorFlow-based neural network library.
- tensorpack – A Neural Net Training Interface on TensorFlow.
- Polyaxon – A platform that helps you build, manage and monitor deep learning models.
- NeuPy – NeuPy is a Python library for Artificial Neural Networks and Deep Learning (previously: ).
- tfdeploy – Deploy tensorflow graphs for fast evaluation and export to tensorflow-less environments running numpy.
- tensorflow-upstream – TensorFlow ROCm port.
- TensorFlow Fold – Deep learning with dynamic computation graphs in TensorFlow.
- tensorlm – Wrapper library for text generation / language models at char and word level with RNN.
- TensorLight – A high-level framework for TensorFlow.
- Mesh TensorFlow – Model Parallelism Made Easier.
- Ludwig – A toolbox that lets you train and test deep learning models without writing code.
- Keras – A high-level neural networks API running on top of TensorFlow.
- keras-contrib – Keras community contributions.
- Hyperas – Keras + Hyperopt: A very simple wrapper for convenient hyperparameter optimization.
- Elephas – Distributed Deep learning with Keras & Spark.
- Hera – Train/evaluate a Keras model, get metrics streamed to a dashboard in your browser.
- Spektral – Deep learning on graphs.
- qkeras – A quantization deep learning library.
MXNet
- MXNet – Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler.
- Gluon – A clear, concise, simple yet powerful and efficient API for deep learning (now included in MXNet).
- MXbox – Simple, efficient and flexible vision toolbox for mxnet framework.
- gluon-cv – Provides implementations of the state-of-the-art deep learning models in computer vision.
- gluon-nlp – NLP made easy.
- Xfer – Transfer Learning library for Deep Neural Networks.
- MXNet – HIP Port of MXNet.
Others
- Tangent – Source-to-Source Debuggable Derivatives in Pure Python.
- autograd – Efficiently computes derivatives of numpy code.
- Myia – Deep Learning framework (pre-alpha).
- nnabla – Neural Network Libraries by Sony.
- Caffe – A fast open framework for deep learning.
- hipCaffe – The HIP port of Caffe.
Web Scraping
- BeautifulSoup – The easiest library for beginners to scrape static websites.
- Scrapy – Fast and extensible scraping library. You can write rules and create a customized scraper without touching the core.
- Selenium – Use the Selenium Python API to access all functionality of Selenium WebDriver intuitively, like a real user.
- Pattern – High-level scraping for well-established websites such as Google, Twitter, and Wikipedia. Also has NLP, machine learning algorithms, and visualization.
- twitterscraper – Efficient library to scrape Twitter.
Data Manipulation
Data Containers
- pandas – Powerful Python data analysis toolkit.
- pandas_profiling – Create HTML profiling reports from pandas DataFrame objects.
- cuDF – GPU DataFrame Library.
- blaze – NumPy and pandas interface to Big Data.
- pandasql – Allows you to query pandas DataFrames using SQL syntax.
- pandas-gbq – pandas Google Big Query.
- xpandas – Universal 1d/2d data containers with Transformers functionality for data analysis by The Alan Turing Institute.
- pysparkling – A pure Python implementation of Apache Spark’s RDD and DStream interfaces.
- Arctic – High performance datastore for time series and tick data.
- datatable – Data.table for Python.
- koalas – pandas API on Apache Spark.
- modin – Speed up your pandas workflows by changing a single line of code.
- swifter – A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner.
- pandas_flavor – A package that lets you write your own flavor of pandas easily.
- pandas-log – A package that provides feedback about basic pandas operations and helps find both business logic and performance issues.
- vaex – Out-of-Core DataFrames for Python, ML, visualize and explore big tabular data at a billion rows per second.
Pipelines
- pdpipe – Easy pipelines for pandas DataFrames.
- SSPipe – Python pipe (|) operator with support for DataFrames and Numpy and Pytorch.
- pandas-ply – Functional data manipulation for pandas.
- Dplython – Dplyr for Python.
- sklearn-pandas – pandas integration with sklearn.
- Dataset – Helps you conveniently work with random or sequential batches of your data and define data processing.
- pyjanitor – Clean APIs for data cleaning.
- meza – A Python toolkit for processing tabular data.
- Prodmodel – Build system for data science pipelines.
- dopanda – Hints and tips for using pandas in an analysis environment.
- CircleCI – Automates your software builds, tests, and deployments.
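The pipe (|) style mentioned in the SSPipe entry above can be sketched in a few lines of plain Python by overloading the reflected "or" operator. This is an illustration of the general idea only, not SSPipe's actual API; the class name `Pipe` is hypothetical.

```python
class Pipe:
    """Wrap a one-argument function so that `value | Pipe(func)`
    evaluates to func(value) (hypothetical helper)."""

    def __init__(self, func):
        self.func = func

    def __ror__(self, value):
        # Called for `value | self` when the left operand does not
        # know how to combine itself with a Pipe.
        return self.func(value)


# Read left to right: sort, scale each element, then sum.
result = (
    [3, 1, 2]
    | Pipe(sorted)
    | Pipe(lambda xs: [x * 10 for x in xs])
    | Pipe(sum)
)
print(result)  # 60
```

The appeal of this style is that a chain of transformations reads in execution order instead of as nested function calls.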
Feature Engineering
General
- Featuretools – Automated feature engineering.
- skl-groups – A scikit-learn addon to operate on set/“group”-based features.
- Feature Forge – A set of tools for creating and testing machine learning features.
- few – A feature engineering wrapper for sklearn.
- scikit-mdr – A sklearn-compatible Python implementation of Multifactor Dimensionality Reduction (MDR) for feature construction.
- tsfresh – Automatic extraction of relevant features from time series.
Feature Selection
- scikit-feature – Feature selection repository in python.
- boruta_py – Implementations of the Boruta all-relevant feature selection method.
- BoostARoota – A fast xgboost feature selection algorithm.
- scikit-rebate – A scikit-learn-compatible Python implementation of ReBATE, a suite of Relief-based feature selection algorithms for Machine Learning.
Visualization
General Purposes
- Matplotlib – Plotting with Python.
- seaborn – Statistical data visualization using matplotlib.
- prettyplotlib – Painlessly create beautiful matplotlib plots.
- python-ternary – Ternary plotting library for python with matplotlib.
- missingno – Missing data visualization module for Python.
- chartify – Python library that makes it easy for data scientists to create charts.
- physt – Improved histograms.
Interactive plots
- animatplot – A Python package for animating plots, built on matplotlib.
- plotly – A Python library that makes interactive and publication-quality graphs.
- Bokeh – Interactive Web Plotting for Python.
- Altair – Declarative statistical visualization library for Python. Many data transformations can be done directly in the plotting code.
- bqplot – Plotting library for IPython/Jupyter notebooks.
- pyecharts – Python interface to Echarts, a charting and visualization library, for interactive plots.
Map
- folium – Makes it easy to visualize data on an interactive OpenStreetMap map.
- geemap – Python package for interactive mapping with Google Earth Engine (GEE).
Automatic Plotting
- HoloViews – Stop plotting your data – annotate your data and let it visualize itself.
- AutoViz – Visualize data automatically with one line of code (ideal for machine learning).
- SweetViz – Visualize and compare datasets, target values and associations, with one line of code.
NLP
- pyLDAvis – Interactive topic model visualization.
Deployment
- datapane – A collection of APIs to turn scripts and notebooks into interactive reports.
- binder – Enables sharing and executing Jupyter Notebooks.
- fastapi – Modern, fast (high-performance) web framework for building APIs with Python.
- streamlit – Makes it easy to deploy machine learning models.
Model Explanation
- Shapley – A data-driven framework to quantify the value of classifiers in a machine learning ensemble.
- Alibi – Algorithms for monitoring and explaining machine learning models.
- anchor – Code for “High-Precision Model-Agnostic Explanations” paper.
- aequitas – Bias and Fairness Audit Toolkit.
- Contrastive Explanation – Contrastive Explanation (Foil Trees).
- yellowbrick – Visual analysis and diagnostic tools to facilitate machine learning model selection.
- scikit-plot – An intuitive library to add plotting functionality to scikit-learn objects.
- shap – A unified approach to explain the output of any machine learning model.
- ELI5 – A library for debugging/inspecting machine learning classifiers and explaining their predictions.
- Lime – Explaining the predictions of any machine learning classifier.
- FairML – A Python toolbox for auditing machine learning models for bias.
- L2X – Code for replicating the experiments in the paper Learning to Explain: An Information-Theoretic Perspective on Model Interpretation.
- PDPbox – Partial dependence plot toolbox.
- pyBreakDown – Python implementation of R package breakDown.
- PyCEbox – Python Individual Conditional Expectation Plot Toolbox.
- Skater – Python Library for Model Interpretation.
- model-analysis – Model analysis tools for TensorFlow.
- themis-ml – A library that implements fairness-aware machine learning algorithms.
- treeinterpreter – Interpreting scikit-learn’s decision tree and random forest predictions.
- AI Explainability 360 – Interpretability and explainability of data and machine learning models.
- Auralisation – Auralisation of learned features in CNN (for audio).
- CapsNet-Visualization – A visualization of the CapsNet layers to better understand how it works.
- lucid – A collection of infrastructure and tools for research in neural network interpretability.
- Netron – Visualizer for deep learning and machine learning models (no Python code, but visualizes models from most Python Deep Learning frameworks).
- FlashLight – Visualization Tool for your NeuralNetwork.
- tensorboard-pytorch – Tensorboard for pytorch (and chainer, mxnet, numpy, …).
- mxboard – Logging MXNet data for visualization in TensorBoard.
Reinforcement Learning
- OpenAI Gym – A toolkit for developing and comparing reinforcement learning algorithms.
- Coach – Easy experimentation with state of the art Reinforcement Learning algorithms.
- garage – A toolkit for reproducible reinforcement learning research.
- OpenAI Baselines – High-quality implementations of reinforcement learning algorithms.
- Stable Baselines – A set of improved implementations of reinforcement learning algorithms based on OpenAI Baselines.
- RLlib – Scalable Reinforcement Learning.
- Horizon – A platform for Applied Reinforcement Learning.
- TF-Agents – A library for Reinforcement Learning in TensorFlow.
- TensorForce – A TensorFlow library for applied reinforcement learning.
- TRFL – TensorFlow Reinforcement Learning.
- Dopamine – A research framework for fast prototyping of reinforcement learning algorithms.
- keras-rl – Deep Reinforcement Learning for Keras.
- ChainerRL – A deep reinforcement learning library built on top of Chainer.
Probabilistic Methods
- pomegranate – Probabilistic and graphical models for Python.
- pyro – A flexible, scalable deep probabilistic programming library built on PyTorch.
- ZhuSuan – Bayesian Deep Learning.
- PyMC – Bayesian Stochastic Modelling in Python.
- PyMC3 – Python package for Bayesian statistical modeling and Probabilistic Machine Learning.
- sampled – Decorator for reusable models in PyMC3.
- Edward – A library for probabilistic modeling, inference, and criticism.
- InferPy – Deep Probabilistic Modelling Made Easy.
- GPflow – Gaussian processes in TensorFlow.
- PyStan – Bayesian inference using the No-U-Turn sampler (Python interface).
- gelato – Bayesian dessert for Lasagne.
- sklearn-bayes – Python package for Bayesian Machine Learning with scikit-learn API.
- skggm – Estimation of general graphical models.
- pgmpy – A python library for working with Probabilistic Graphical Models.
- skpro – Supervised domain-agnostic prediction framework for probabilistic modelling by The Alan Turing Institute.
- Aboleth – A bare-bones TensorFlow framework for Bayesian deep learning and Gaussian process approximation.
- PtStat – Probabilistic Programming and Statistical Inference in PyTorch.
- PyVarInf – Bayesian Deep Learning methods with Variational Inference for PyTorch.
- emcee – The Python ensemble sampling toolkit for affine-invariant MCMC.
- hsmmlearn – A library for hidden semi-Markov models with explicit durations.
- pyhsmm – Bayesian inference in HSMMs and HMMs.
- GPyTorch – A highly efficient and modular implementation of Gaussian Processes in PyTorch.
- MXFusion – Modular Probabilistic Programming on MXNet.
- sklearn-crfsuite – A scikit-learn inspired API for CRFsuite.
Genetic Programming
- gplearn – Genetic Programming in Python.
- DEAP – Distributed Evolutionary Algorithms in Python.
- karoo_gp – A Genetic Programming platform for Python with GPU support.
- monkeys – A strongly-typed genetic programming framework for Python.
- sklearn-genetic – Genetic feature selection module for scikit-learn.
Optimization
- Spearmint – Bayesian optimization.
- BoTorch – Bayesian optimization in PyTorch.
- scikit-opt – Heuristic Algorithms for optimization.
- SMAC3 – Sequential Model-based Algorithm Configuration.
- Optunity – A library containing various optimizers for hyperparameter tuning.
- hyperopt – Distributed Asynchronous Hyperparameter Optimization in Python.
- hyperopt-sklearn – Hyper-parameter optimization for sklearn.
- sklearn-deap – Use evolutionary algorithms instead of gridsearch in scikit-learn.
- sigopt_sklearn – SigOpt wrappers for scikit-learn methods.
- Bayesian Optimization – A Python implementation of global optimization with gaussian processes.
- SafeOpt – Safe Bayesian Optimization.
- scikit-optimize – Sequential model-based optimization with a scipy.optimize interface.
- Solid – A comprehensive gradient-free optimization framework written in Python.
- PySwarms – A research toolkit for particle swarm optimization in Python.
- Platypus – A Free and Open Source Python Library for Multiobjective Optimization.
- GPflowOpt – Bayesian Optimization using GPflow.
- POT – Python Optimal Transport library.
- Talos – Hyperparameter Optimization for Keras Models.
- nlopt – Library for nonlinear optimization (global and local, constrained or unconstrained).
Natural Language Processing
- NLTK – Modules, data sets, and tutorials supporting research and development in Natural Language Processing.
- CLTK – The Classical Language Toolkit.
- gensim – Topic Modelling for Humans.
- PSI-Toolkit – A natural language processing toolkit.
- pyMorfologik – Python binding for Morfologik.
- skift – Scikit-learn wrappers for Python fastText.
- Phonemizer – Simple text to phonemes converter for multiple languages.
- flair – Very simple framework for state-of-the-art NLP.
- spaCy – Industrial-Strength Natural Language Processing.
Computer Audition
- librosa – Python library for audio and music analysis.
- Yaafe – Audio features extraction.
- aubio – A library for audio and music analysis.
- Essentia – Library for audio and music analysis, description and synthesis.
- LibXtract – A simple, portable, lightweight library of audio feature extraction functions.
- Marsyas – Music Analysis, Retrieval and Synthesis for Audio Signals.
- muda – A library for augmenting annotated audio data.
- madmom – Python audio and music signal processing library.
Computer Vision
- OpenCV – Open Source Computer Vision Library.
- scikit-image – Image Processing SciKit (Toolbox for SciPy).
- imgaug – Image augmentation for machine learning experiments.
- imgaug_extension – Additional augmentations for imgaug.
- Augmentor – Image augmentation library in Python for machine learning.
- albumentations – Fast image augmentation library and easy to use wrapper around other libraries.
Statistics
- pandas_summary – Extension to pandas dataframes describe function.
- Pandas Profiling – Create HTML profiling reports from pandas DataFrame objects.
- statsmodels – Statistical modeling and econometrics in Python.
- stockstats – Supply a wrapper StockDataFrame based on the pandas.DataFrame with inline stock statistics/indicators support.
- weightedcalcs – A pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more.
- scikit-posthocs – Pairwise Multiple Comparisons Post-hoc Tests.
- Alphalens – Performance analysis of predictive (alpha) stock factors.
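For readers unfamiliar with the weighted statistics named in the weightedcalcs entry, here is a minimal plain-Python sketch of a weighted mean and a (lower) weighted median. It does not use weightedcalcs or pandas; the function names and the toy numbers are assumptions made for illustration.

```python
def weighted_mean(values, weights):
    """Sum of value * weight divided by the total weight."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half of
    the total weight (the lower weighted median)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    cumulative = 0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= total / 2:
            return value


values = [1, 2, 3, 4]
weights = [1, 1, 1, 5]  # the last observation counts five times
print(weighted_mean(values, weights))    # 3.25
print(weighted_median(values, weights))  # 4
```

With equal weights both functions reduce to the ordinary mean and the lower median.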
Distributed Computing
- Horovod – Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
- PySpark – Exposes the Spark programming model to Python.
- Veles – Distributed machine learning platform.
- Jubatus – Framework and Library for Distributed Online Machine Learning.
- DMTK – Microsoft Distributed Machine Learning Toolkit.
- PaddlePaddle – PArallel Distributed Deep LEarning.
- dask-ml – Distributed and parallel machine learning.
- Distributed – Distributed computation in Python.
Experimentation
- Sacred – A tool to help you configure, organize, log and reproduce experiments.
- Xcessiv – A web-based application for quick, scalable, and automated hyperparameter tuning and stacked ensembling.
- Persimmon – A visual dataflow programming language for sklearn.
- Ax – Adaptive Experimentation Platform.
- Neptune – A lightweight ML experiment tracking, results visualization and management tool.
Evaluation
- recmetrics – Library of useful metrics and plots for evaluating recommender systems.
- Metrics – Machine learning evaluation metric.
- sklearn-evaluation – Model evaluation made easy: plots, tables and markdown reports.
- AI Fairness 360 – Fairness metrics for datasets and ML models, explanations and algorithms to mitigate bias in datasets and models.
Computations
- numpy – The fundamental package needed for scientific computing with Python.
- Dask – Parallel computing with task scheduling.
- bottleneck – Fast NumPy array functions written in C.
- CuPy – NumPy-like API accelerated with CUDA.
- scikit-tensor – Python library for multilinear algebra and tensor factorizations.
- numdifftools – Solve automatic numerical differentiation problems in one or more variables.
- quaternion – Add built-in support for quaternions to numpy.
- adaptive – Tools for adaptive and parallel sampling of mathematical functions.
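Numerical differentiation, the problem the numdifftools entry addresses, can be illustrated with a central difference. The sketch below is plain Python and is not the numdifftools API; the function name `central_difference` and the default step size are assumptions chosen for this example.

```python
import math


def central_difference(f, x, h=1e-5):
    """Approximate f'(x) as (f(x + h) - f(x - h)) / (2 * h).
    The truncation error shrinks proportionally to h squared."""
    return (f(x + h) - f(x - h)) / (2 * h)


# The derivative of sin at 0 is cos(0) = 1.
slope = central_difference(math.sin, 0.0)
print(abs(slope - 1.0) < 1e-8)  # True
```

A fixed step is a compromise between truncation error and floating-point round-off; dedicated libraries typically choose the step more carefully.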
Spatial Analysis
Quantum Computing
- PennyLane – Quantum machine learning, automatic differentiation, and optimization of hybrid quantum-classical computations.
- QML – A Python Toolkit for Quantum Machine Learning.
Conversion
- sklearn-porter – Transpile trained scikit-learn estimators to C, Java, JavaScript and others.
- ONNX – Open Neural Network Exchange.
- MMdnn – A set of tools to help users inter-operate among different deep learning frameworks.
Contributing
Contributions are welcome!
Read the contribution guideline.
License
This work is licensed under the Creative Commons Attribution 4.0 International License – CC BY 4.0