Awesome Python Data Science


Probably the best curated list of data science software in Python.



Machine Learning


General Purpose Machine Learning

  • scikit-learn – Machine learning in Python. 
  • Shogun – Machine learning toolbox.
  • xLearn – High Performance, Easy-to-use, and Scalable Machine Learning Package.
  • cuML – RAPIDS Machine Learning Library.  
  • modAL – Modular active learning framework for Python3. 
  • Sparkit-learn – PySpark + scikit-learn = Sparkit-learn.  
  • mlpack – A scalable C++ machine learning library (Python bindings).
  • dlib – Toolkit for making real world machine learning and data analysis applications in C++ (Python bindings).
  • MLxtend – Extension and helper modules for Python’s data analysis and machine learning libraries. 
  • hyperlearn – Sklearn and Statsmodels rewritten to be 50%+ faster, with 50%+ less RAM usage and GPU support.
  • Reproducible Experiment Platform (REP) – Machine Learning toolbox for Humans. 
  • scikit-multilearn – Multi-label classification for Python.
  • seqlearn – Sequence classification toolkit for Python. 
  • pystruct – Simple structured learning framework for Python. 
  • sklearn-expertsys – Highly interpretable classifiers for scikit learn, producing easily understood decision rules instead of black box models. 
  • RuleFit – Implementation of the RuleFit algorithm.
  • metric-learn – Metric learning algorithms in Python. 
  • pyGAM – Generalized Additive Models in Python.
  • Karate Club – An unsupervised machine learning library for graph structured data.
  • Little Ball of Fur – A library for sampling graph structured data.
  • causalml – Uplift modeling and causal inference with machine learning algorithms. 

Time Series

  • sktime – A unified framework for machine learning with time series. 
  • tslearn – Machine learning toolkit dedicated to time-series data. 
  • tick – Module for statistical learning, with a particular emphasis on time-dependent modelling. 
  • Prophet – Automatic Forecasting Procedure.
  • PyFlux – Open source time series library for Python.
  • bayesloop – Probabilistic programming framework that facilitates objective model selection for time-varying parameter models.
  • luminol – Anomaly Detection and Correlation library.
  • dateutil – Powerful extensions to the standard datetime module.
  • maya – Makes it very easy to parse a string and change timezones.

Automated Machine Learning

  • TPOT – Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming. 
  • auto-sklearn – An automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator. 
  • MLBox – A powerful Automated Machine Learning python library.

Ensemble Methods

  • ML-Ensemble – High performance ensemble learning. 
  • Stacking – Simple and useful stacking library, written in Python. 
  • stacked_generalization – Library for machine learning stacked generalization.
  • vecstack – Python package for stacking (machine learning technique). 

Imbalanced Datasets

  • imbalanced-learn – Module to perform undersampling and oversampling with various techniques.
  • imbalanced-algorithms – Python-based implementations of algorithms for learning on imbalanced data.  

Random Forests

Extreme Learning Machine

  • Python-ELM – Extreme Learning Machine implementation in Python. 
  • Python Extreme Learning Machine (ELM) – A machine learning technique used for classification/regression tasks.
  • hpelm – High performance implementation of Extreme Learning Machines (fast randomized neural networks). 

Kernel Methods

  • pyFM – Factorization machines in python. 
  • fastFM – A library for Factorization Machines. 
  • tffm – TensorFlow implementation of an arbitrary order Factorization Machine.  
  • liquidSVM – An implementation of SVMs.
  • scikit-rvm – Relevance Vector Machine implementation using the scikit-learn API.
  • ThunderSVM – A fast SVM Library on GPUs and CPUs.

Gradient Boosting

  • XGBoost – Scalable, Portable and Distributed Gradient Boosting.
  • LightGBM – A fast, distributed, high performance gradient boosting.
  • CatBoost – An open-source gradient boosting on decision trees library. 
  • ThunderGBM – Fast GBDTs and Random Forests on GPUs. 

Deep Learning


PyTorch

  • PyTorch – Tensors and Dynamic neural networks in Python with strong GPU acceleration.
  • torchvision – Datasets, Transforms and Models specific to Computer Vision.
  • torchtext – Data loaders and abstractions for text and NLP.
  • torchaudio – An audio library for PyTorch.
  • ignite – High-level library to help with training neural networks in PyTorch. 
  • PyToune – A Keras-like framework and utilities for PyTorch.
  • skorch – A scikit-learn compatible neural network library that wraps pytorch.  
  • PyTorchNet – An abstraction to train neural networks. 
  • pytorch_geometric – Geometric Deep Learning Extension Library for PyTorch. 
  • Catalyst – High-level utils for PyTorch DL & RL research. 
  • pytorch_geometric_temporal – Temporal Extension Library for PyTorch Geometric. 

TensorFlow

  • TensorFlow – Computation using data flow graphs for scalable machine learning by Google. 
  • TensorLayer – Deep Learning and Reinforcement Learning Library for Researcher and Engineer. 
  • TFLearn – Deep learning library featuring a higher-level API for TensorFlow. 
  • Sonnet – TensorFlow-based neural network library. 
  • tensorpack – A Neural Net Training Interface on TensorFlow. 
  • Polyaxon – A platform that helps you build, manage and monitor deep learning models. 
  • NeuPy – A Python library for Artificial Neural Networks and Deep Learning.
  • tfdeploy – Deploy tensorflow graphs for fast evaluation and export to tensorflow-less environments running numpy. 
  • tensorflow-upstream – TensorFlow ROCm port.  
  • TensorFlow Fold – Deep learning with dynamic computation graphs in TensorFlow. 
  • tensorlm – Wrapper library for text generation / language models at char and word level with RNN. 
  • TensorLight – A high-level framework for TensorFlow. 
  • Mesh TensorFlow – Model Parallelism Made Easier. 
  • Ludwig – A toolbox that lets you train and test deep learning models without writing code.
  • Keras – A high-level neural networks API running on top of TensorFlow. 
  • keras-contrib – Keras community contributions. 
  • Hyperas – Keras + Hyperopt: A very simple wrapper for convenient hyperparameter optimization.
  • Elephas – Distributed Deep learning with Keras & Spark. 
  • Hera – Train/evaluate a Keras model, get metrics streamed to a dashboard in your browser. 
  • Spektral – Deep learning on graphs. 
  • qkeras – A quantization deep learning library. 

MXNet

  • MXNet – Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler. 
  • Gluon – A clear, concise, simple yet powerful and efficient API for deep learning (now included in MXNet). 
  • MXbox – Simple, efficient and flexible vision toolbox for mxnet framework. 
  • gluon-cv – Provides implementations of the state-of-the-art deep learning models in computer vision. 
  • gluon-nlp – NLP made easy. 
  • Xfer – Transfer Learning library for Deep Neural Networks. 
  • MXNet – HIP Port of MXNet.  

Others

  • Tangent – Source-to-Source Debuggable Derivatives in Pure Python.
  • autograd – Efficiently computes derivatives of numpy code.
  • Myia – Deep Learning framework (pre-alpha).
  • nnabla – Neural Network Libraries by Sony.
  • Caffe – A fast open framework for deep learning.
  • hipCaffe – The HIP port of Caffe. 

DISCONTINUED PROJECTS

Web Scraping


  • BeautifulSoup – The easiest library for beginners to scrape static websites.
  • Scrapy – Fast and extensible scraping library. Lets you write rules and create customized scrapers without touching the core.
  • Selenium – Use the Selenium Python API to access all functionalities of Selenium WebDriver in an intuitive way, like a real user.
  • Pattern – High-level scraping for well-established websites such as Google, Twitter, and Wikipedia. Also has NLP, machine learning algorithms, and visualization.
  • twitterscraper – Efficient library to scrape Twitter.

Data Manipulation


Data Containers

  • pandas – Powerful Python data analysis toolkit.
  • pandas_profiling – Create HTML profiling reports from pandas DataFrame objects.
  • cuDF – GPU DataFrame Library.  
  • blaze – NumPy and pandas interface to Big Data. 
  • pandasql – Allows you to query pandas DataFrames using SQL syntax. 
  • pandas-gbq – pandas Google Big Query. 
  • xpandas – Universal 1d/2d data containers with Transformers functionality for data analysis by The Alan Turing Institute.
  • pysparkling – A pure Python implementation of Apache Spark’s RDD and DStream interfaces. 
  • Arctic – High performance datastore for time series and tick data.
  • datatable – Data.table for Python. 
  • koalas – pandas API on Apache Spark. 
  • modin – Speed up your pandas workflows by changing a single line of code. 
  • swifter – A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner.
  • pandas_flavor – A package that lets you write your own flavor of pandas easily.
  • pandas-log – A package that provides feedback about basic pandas operations and helps find both business logic and performance issues.
  • vaex – Out-of-Core DataFrames for Python, ML, visualize and explore big tabular data at a billion rows per second.

Pipelines

  • pdpipe – Easy pipelines for pandas DataFrames.
  • SSPipe – Python pipe (|) operator with support for DataFrames and Numpy and Pytorch.
  • pandas-ply – Functional data manipulation for pandas. 
  • Dplython – Dplyr for Python. 
  • sklearn-pandas – pandas integration with sklearn.  
  • Dataset – Helps you conveniently work with random or sequential batches of your data and define data processing.
  • pyjanitor – Clean APIs for data cleaning. 
  • meza – A Python toolkit for processing tabular data.
  • Prodmodel – Build system for data science pipelines.
  • dopanda – Hints and tips for using pandas in an analysis environment. 
  • CircleCI – Automates your software builds, tests, and deployments.

Feature Engineering


General

  • Featuretools – Automated feature engineering.
  • skl-groups – A scikit-learn addon to operate on set/“group”-based features.
  • Feature Forge – A set of tools for creating and testing machine learning features.
  • few – A feature engineering wrapper for sklearn. 
  • scikit-mdr – A sklearn-compatible Python implementation of Multifactor Dimensionality Reduction (MDR) for feature construction. 
  • tsfresh – Automatic extraction of relevant features from time series. 

Feature Selection

  • scikit-feature – Feature selection repository in python.
  • boruta_py – Implementations of the Boruta all-relevant feature selection method. 
  • BoostARoota – A fast xgboost feature selection algorithm. 
  • scikit-rebate – A scikit-learn-compatible Python implementation of ReBATE, a suite of Relief-based feature selection algorithms for Machine Learning. 

Visualization


General Purpose

  • Matplotlib – Plotting with Python.
  • seaborn – Statistical data visualization using matplotlib.
  • prettyplotlib – Painlessly create beautiful matplotlib plots.
  • python-ternary – Ternary plotting library for python with matplotlib.
  • missingno – Missing data visualization module for Python.
  • chartify – Python library that makes it easy for data scientists to create charts.
  • physt – Improved histograms.

Interactive plots

  • animatplot – A Python package for animating plots, built on matplotlib.
  • plotly – A Python library that makes interactive and publication-quality graphs.
  • Bokeh – Interactive Web Plotting for Python.
  • Altair – Declarative statistical visualization library for Python. Many data transformations can be done within the code that creates the graph.
  • bqplot – Plotting library for IPython/Jupyter notebooks.
  • pyecharts – A Python interactive charting library based on Echarts, a charting and visualization library.

Map

  • folium – Makes it easy to visualize data on an interactive OpenStreetMap map.
  • geemap – Python package for interactive mapping with Google Earth Engine (GEE).

Automatic Plotting

  • HoloViews – Stop plotting your data – annotate your data and let it visualize itself.
  • AutoViz – Visualize data automatically with one line of code (ideal for machine learning).
  • SweetViz – Visualize and compare datasets, target values and associations, with one line of code.

NLP

  • pyLDAvis – Interactive topic model visualization.

Deployment


  • datapane – A collection of APIs to turn scripts and notebooks into interactive reports.
  • binder – Enables sharing and executing Jupyter Notebooks.
  • fastapi – Modern, fast (high-performance) web framework for building APIs with Python.
  • streamlit – Makes it easy to deploy machine learning models.

Model Explanation


  • Shapley – A data-driven framework to quantify the value of classifiers in a machine learning ensemble.
  • Alibi – Algorithms for monitoring and explaining machine learning models.
  • anchor – Code for “High-Precision Model-Agnostic Explanations” paper.
  • aequitas – Bias and Fairness Audit Toolkit.
  • Contrastive Explanation – Contrastive Explanation (Foil Trees). 
  • yellowbrick – Visual analysis and diagnostic tools to facilitate machine learning model selection. 
  • scikit-plot – An intuitive library to add plotting functionality to scikit-learn objects. 
  • shap – A unified approach to explain the output of any machine learning model. 
  • ELI5 – A library for debugging/inspecting machine learning classifiers and explaining their predictions.
  • Lime – Explaining the predictions of any machine learning classifier. 
  • FairML – A Python toolbox for auditing machine learning models for bias.
  • L2X – Code for replicating the experiments in the paper Learning to Explain: An Information-Theoretic Perspective on Model Interpretation.
  • PDPbox – Partial dependence plot toolbox.
  • pyBreakDown – Python implementation of R package breakDown. 
  • PyCEbox – Python Individual Conditional Expectation Plot Toolbox.
  • Skater – Python Library for Model Interpretation.
  • model-analysis – Model analysis tools for TensorFlow. 
  • themis-ml – A library that implements fairness-aware machine learning algorithms. 
  • treeinterpreter – Interpreting scikit-learn’s decision tree and random forest predictions. 
  • AI Explainability 360 – Interpretability and explainability of data and machine learning models.
  • Auralisation – Auralisation of learned features in CNN (for audio).
  • CapsNet-Visualization – A visualization of the CapsNet layers to better understand how it works.
  • lucid – A collection of infrastructure and tools for research in neural network interpretability.
  • Netron – Visualizer for deep learning and machine learning models (no Python code, but visualizes models from most Python Deep Learning frameworks).
  • FlashLight – Visualization Tool for your NeuralNetwork.
  • tensorboard-pytorch – Tensorboard for pytorch (and chainer, mxnet, numpy, …).
  • mxboard – Logging MXNet data for visualization in TensorBoard. 

Reinforcement Learning


  • OpenAI Gym – A toolkit for developing and comparing reinforcement learning algorithms.
  • Coach – Easy experimentation with state of the art Reinforcement Learning algorithms.
  • garage – A toolkit for reproducible reinforcement learning research.
  • OpenAI Baselines – High-quality implementations of reinforcement learning algorithms.
  • Stable Baselines – A set of improved implementations of reinforcement learning algorithms based on OpenAI Baselines.
  • RLlib – Scalable Reinforcement Learning.
  • Horizon – A platform for Applied Reinforcement Learning.
  • TF-Agents – A library for Reinforcement Learning in TensorFlow. 
  • TensorForce – A TensorFlow library for applied reinforcement learning. 
  • TRFL – TensorFlow Reinforcement Learning. 
  • Dopamine – A research framework for fast prototyping of reinforcement learning algorithms.
  • keras-rl – Deep Reinforcement Learning for Keras. 
  • ChainerRL – A deep reinforcement learning library built on top of Chainer.

Probabilistic Methods


  • pomegranate – Probabilistic and graphical models for Python. 
  • pyro – A flexible, scalable deep probabilistic programming library built on PyTorch. 
  • ZhuSuan – Bayesian Deep Learning. 
  • PyMC – Bayesian Stochastic Modelling in Python.
  • PyMC3 – Python package for Bayesian statistical modeling and Probabilistic Machine Learning. 
  • sampled – Decorator for reusable models in PyMC3.
  • Edward – A library for probabilistic modeling, inference, and criticism. 
  • InferPy – Deep Probabilistic Modelling Made Easy. 
  • GPflow – Gaussian processes in TensorFlow. 
  • PyStan – Bayesian inference using the No-U-Turn sampler (Python interface).
  • gelato – Bayesian dessert for Lasagne. 
  • sklearn-bayes – Python package for Bayesian Machine Learning with scikit-learn API. 
  • skggm – Estimation of general graphical models. 
  • pgmpy – A python library for working with Probabilistic Graphical Models.
  • skpro – Supervised domain-agnostic prediction framework for probabilistic modelling by The Alan Turing Institute.
  • Aboleth – A bare-bones TensorFlow framework for Bayesian deep learning and Gaussian process approximation. 
  • PtStat – Probabilistic Programming and Statistical Inference in PyTorch. 
  • PyVarInf – Bayesian Deep Learning methods with Variational Inference for PyTorch. 
  • emcee – The Python ensemble sampling toolkit for affine-invariant MCMC.
  • hsmmlearn – A library for hidden semi-Markov models with explicit durations.
  • pyhsmm – Bayesian inference in HSMMs and HMMs.
  • GPyTorch – A highly efficient and modular implementation of Gaussian Processes in PyTorch. 
  • MXFusion – Modular Probabilistic Programming on MXNet. 
  • sklearn-crfsuite – A scikit-learn inspired API for CRFsuite. 

Genetic Programming


  • gplearn – Genetic Programming in Python. 
  • DEAP – Distributed Evolutionary Algorithms in Python.
  • karoo_gp – A Genetic Programming platform for Python with GPU support. 
  • monkeys – A strongly-typed genetic programming framework for Python.
  • sklearn-genetic – Genetic feature selection module for scikit-learn. 

Optimization


  • Spearmint – Bayesian optimization.
  • BoTorch – Bayesian optimization in PyTorch. 
  • scikit-opt – Heuristic Algorithms for optimization.
  • SMAC3 – Sequential Model-based Algorithm Configuration.
  • Optunity – A library containing various optimizers for hyperparameter tuning.
  • hyperopt – Distributed Asynchronous Hyperparameter Optimization in Python.
  • hyperopt-sklearn – Hyper-parameter optimization for sklearn. 
  • sklearn-deap – Use evolutionary algorithms instead of gridsearch in scikit-learn. 
  • sigopt_sklearn – SigOpt wrappers for scikit-learn methods. 
  • Bayesian Optimization – A Python implementation of global optimization with gaussian processes.
  • SafeOpt – Safe Bayesian Optimization.
  • scikit-optimize – Sequential model-based optimization with a scipy.optimize interface.
  • Solid – A comprehensive gradient-free optimization framework written in Python.
  • PySwarms – A research toolkit for particle swarm optimization in Python.
  • Platypus – A Free and Open Source Python Library for Multiobjective Optimization.
  • GPflowOpt – Bayesian Optimization using GPflow. 
  • POT – Python Optimal Transport library.
  • Talos – Hyperparameter Optimization for Keras Models.
  • nlopt – Library for nonlinear optimization (global and local, constrained or unconstrained).

Natural Language Processing


  • NLTK – Modules, data sets, and tutorials supporting research and development in Natural Language Processing.
  • CLTK – The Classical Language Toolkit.
  • gensim – Topic Modelling for Humans.
  • PSI-Toolkit – A natural language processing toolkit.
  • pyMorfologik – Python binding for Morfologik.
  • skift – Scikit-learn wrappers for Python fastText. 
  • Phonemizer – Simple text to phonemes converter for multiple languages.
  • flair – Very simple framework for state-of-the-art NLP.
  • spaCy – Industrial-Strength Natural Language Processing.

Computer Audition


  • librosa – Python library for audio and music analysis.
  • Yaafe – Audio features extraction.
  • aubio – A library for audio and music analysis.
  • Essentia – Library for audio and music analysis, description and synthesis.
  • LibXtract – A simple, portable, lightweight library of audio feature extraction functions.
  • Marsyas – Music Analysis, Retrieval and Synthesis for Audio Signals.
  • muda – A library for augmenting annotated audio data.
  • madmom – Python audio and music signal processing library.

Computer Vision


  • OpenCV – Open Source Computer Vision Library.
  • scikit-image – Image Processing SciKit (Toolbox for SciPy).
  • imgaug – Image augmentation for machine learning experiments.
  • imgaug_extension – Additional augmentations for imgaug.
  • Augmentor – Image augmentation library in Python for machine learning.
  • albumentations – Fast image augmentation library and easy to use wrapper around other libraries.

Statistics


  • pandas_summary – Extension to pandas dataframes describe function. 
  • Pandas Profiling – Create HTML profiling reports from pandas DataFrame objects. 
  • statsmodels – Statistical modeling and econometrics in Python.
  • stockstats – Supply a wrapper StockDataFrame based on the pandas.DataFrame with inline stock statistics/indicators support.
  • weightedcalcs – A pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more.
  • scikit-posthocs – Pairwise Multiple Comparisons Post-hoc Tests.
  • Alphalens – Performance analysis of predictive (alpha) stock factors.

Distributed Computing


  • Horovod – Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. 
  • PySpark – Exposes the Spark programming model to Python. 
  • Veles – Distributed machine learning platform.
  • Jubatus – Framework and Library for Distributed Online Machine Learning.
  • DMTK – Microsoft Distributed Machine Learning Toolkit.
  • PaddlePaddle – PArallel Distributed Deep LEarning.
  • dask-ml – Distributed and parallel machine learning. 
  • Distributed – Distributed computation in Python.

Experimentation


  • Sacred – A tool to help you configure, organize, log and reproduce experiments.
  • Xcessiv – A web-based application for quick, scalable, and automated hyperparameter tuning and stacked ensembling.
  • Persimmon – A visual dataflow programming language for sklearn.
  • Ax – Adaptive Experimentation Platform. 
  • Neptune – A lightweight ML experiment tracking, results visualization and management tool.

Evaluation


  • recmetrics – Library of useful metrics and plots for evaluating recommender systems.
  • Metrics – Machine learning evaluation metric.
  • sklearn-evaluation – Model evaluation made easy: plots, tables and markdown reports. 
  • AI Fairness 360 – Fairness metrics for datasets and ML models, explanations and algorithms to mitigate bias in datasets and models.

Computations


  • numpy – The fundamental package needed for scientific computing with Python.
  • Dask – Parallel computing with task scheduling. 
  • bottleneck – Fast NumPy array functions written in C.
  • CuPy – NumPy-like API accelerated with CUDA.
  • scikit-tensor – Python library for multilinear algebra and tensor factorizations.
  • numdifftools – Solve automatic numerical differentiation problems in one or more variables.
  • quaternion – Add built-in support for quaternions to numpy.
  • adaptive – Tools for adaptive and parallel sampling of mathematical functions.

Spatial Analysis


  • GeoPandas – Python tools for geographic data. 
  • PySal – Python Spatial Analysis Library.

Quantum Computing


  • PennyLane – Quantum machine learning, automatic differentiation, and optimization of hybrid quantum-classical computations.
  • QML – A Python Toolkit for Quantum Machine Learning.

Conversion


  • sklearn-porter – Transpile trained scikit-learn estimators to C, Java, JavaScript and others.
  • ONNX – Open Neural Network Exchange.
  • MMdnn – A set of tools to help users inter-operate among different deep learning frameworks.

Contributing

Contributions are welcome! 
Read the contribution guideline.

License

This work is licensed under the Creative Commons Attribution 4.0 International License – CC BY 4.0