Awesome AI Training-List 🚀

date: Apr 29, 2023
slug: all-ai-training-resources-in-one-place
status: Published
tags: Research
summary: All AI Training resources in one place
type: Post

Distributed:

DeepSpeed:

DeepSpeed is Microsoft's deep learning optimization library that makes distributed training and inference easy, efficient, and effective, with features such as the ZeRO family of optimizer/gradient/parameter sharding, pipeline parallelism, and optimized kernels.
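
A minimal training-step sketch with DeepSpeed. The toy model, batch size, and config values below are placeholders; real runs are usually launched with the `deepspeed` launcher and keep the config in a JSON file, and the accepted config keys can vary by version.

```python
import torch
import deepspeed

model = torch.nn.Linear(128, 2)

# Placeholder config: fp16 training with ZeRO stage 2 (shards optimizer states
# and gradients across data-parallel workers).
ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-3}},
}

# Typically run via the launcher: `deepspeed train.py`
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

x = torch.randn(32, 128).to(model_engine.device)
y = torch.randint(0, 2, (32,)).to(model_engine.device)

loss = torch.nn.functional.cross_entropy(model_engine(x), y)
model_engine.backward(loss)  # handles loss scaling and gradient partitioning
model_engine.step()
```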

EleutherAI/DeeperSpeed:

DeeperSpeed is EleutherAI's fork of DeepSpeed, maintained for training their large language models (e.g. GPT-NeoX).

HuggingFace Accelerate:

🤗 Accelerate is a library that enables the same PyTorch code to be run across any distributed configuration by adding just four lines of code! In short, training and inference at scale made simple, efficient and adaptable.
Fully Sharded Data Parallel
To train huge models with larger batch sizes, we can use fully sharded data parallelism. This data-parallel paradigm fits more data and larger models by sharding the optimizer states, gradients, and parameters across workers. To read more about it and its benefits, check out the Fully Sharded Data Parallel blog. Accelerate integrates PyTorch's Fully Sharded Data Parallel (FSDP) training feature; all you need to do is enable it through the Accelerate config.
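
A minimal sketch of the "four lines" added to a plain PyTorch loop: create an Accelerator, prepare the objects, and replace loss.backward() with accelerator.backward(loss). The toy model and data below are placeholders.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # 1. create the accelerator (reads the launch/config env)

model = torch.nn.Linear(128, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(256, 128), torch.randint(0, 2, (256,))),
    batch_size=32,
)

# 2. wrap model, optimizer, and dataloader for the current distributed setup
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch, labels in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(batch), labels)
    accelerator.backward(loss)  # 3. replaces loss.backward()
    optimizer.step()
```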

AutoTrain:

🤗 AutoTrain is a no-code tool for training state-of-the-art models for Natural Language Processing (NLP), Computer Vision (CV), Speech, and even Tabular tasks. It is built on top of the awesome tools developed by the Hugging Face team and is designed to be easy to use.

Onnx Runtime:

ONNX Runtime inference can enable faster customer experiences and lower costs, supporting models from deep learning frameworks such as PyTorch and TensorFlow/Keras as well as classical machine learning libraries such as scikit-learn, LightGBM, XGBoost, etc. ONNX Runtime is compatible with different hardware, drivers, and operating systems, and provides optimal performance by leveraging hardware accelerators where applicable alongside graph optimizations and transforms.
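
A minimal inference sketch with ONNX Runtime's Python API; the model file name and input shape are placeholders for whatever model you have exported.

```python
import numpy as np
import onnxruntime as ort

# "model.onnx" is a placeholder for an exported model file.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

input_name = session.get_inputs()[0].name
x = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Run the graph; None means "return all outputs".
outputs = session.run(None, {input_name: x})
print(outputs[0].shape)
```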

NVIDIA APEX:

A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch.
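
A minimal sketch of APEX's classic amp API for mixed precision; note that recent PyTorch ships native torch.cuda.amp, which NVIDIA now recommends over apex.amp. The toy model, data, and opt_level below are placeholders.

```python
import torch
from apex import amp

model = torch.nn.Linear(128, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# Patch model and optimizer for mixed precision ("O1" casts most ops to fp16).
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

x = torch.randn(32, 128).cuda()
y = torch.randint(0, 2, (32,)).cuda()

loss = torch.nn.functional.cross_entropy(model(x), y)
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()  # scaled backward pass avoids fp16 gradient underflow
optimizer.step()
```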

Nvidia DALI:

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
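
A minimal DALI pipeline sketch: read and decode JPEGs with GPU assistance, then resize. The data directory, image size, and batch parameters are placeholders, and operator names may differ slightly across DALI versions.

```python
from nvidia.dali import pipeline_def, fn

@pipeline_def
def image_pipeline(data_dir):
    # Read files from disk, then decode JPEGs ("mixed" = CPU parse + GPU decode).
    jpegs, labels = fn.readers.file(file_root=data_dir, random_shuffle=True)
    images = fn.decoders.image(jpegs, device="mixed")
    images = fn.resize(images, resize_x=224, resize_y=224)
    return images, labels

pipe = image_pipeline("/path/to/images", batch_size=32, num_threads=4, device_id=0)
pipe.build()
images, labels = pipe.run()
```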

ColossalAI:

Colossal-AI is a unified deep learning system for large-scale model training, providing data, pipeline, and tensor parallelism, ZeRO-style memory optimizations, and heterogeneous CPU/GPU memory management behind a relatively simple API.

Reinforcement:

carperai/trlx:

trlX is a distributed training framework designed from the ground up to focus on fine-tuning large language models with reinforcement learning using either a provided reward function or a reward-labeled dataset.
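
A minimal sketch of fine-tuning with trlX using a provided reward function; the base model, prompts, and toy reward below are placeholders, and the exact trlx.train signature may differ between versions.

```python
import trlx

# Toy reward: prefer longer continuations. Real use cases plug in a trained
# reward model or preference data here.
def reward_fn(samples, **kwargs):
    return [float(len(s)) for s in samples]

trainer = trlx.train(
    "gpt2",                      # placeholder base model to fine-tune
    reward_fn=reward_fn,
    prompts=["Once upon a time", "The meaning of life is"],
)
```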

TRL - Transformer Reinforcement Learning:

TRL is a library for training transformer language models with reinforcement learning, covering the steps from supervised fine-tuning and reward modeling through Proximal Policy Optimization (PPO), built on top of 🤗 Transformers.

Efficiency:

LoRA: Low-Rank Adaptation of Large Language Models: https://arxiv.org/abs/2106.09685

LoRA freezes the pretrained weights and injects trainable low-rank decomposition matrices into the model's layers, drastically reducing the number of trainable parameters (and optimizer memory) needed for downstream fine-tuning.
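
A minimal, self-contained PyTorch sketch of the idea: keep the pretrained linear layer frozen and learn only the low-rank update B·A. The rank, scaling, and layer sizes below are arbitrary placeholders (libraries such as PEFT provide production implementations).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze pretrained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # B starts at zero
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))   # only lora_A and lora_B receive gradients
```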

Language:

Triton by OpenAI:

Triton is a language and compiler for writing highly efficient custom deep learning primitives: you write tiled GPU kernels in Python and the compiler handles much of the low-level work needed to approach hand-tuned CUDA performance.
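
A minimal vector-add kernel in the style of the Triton tutorials; the block size and tensor sizes are placeholders.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                       # which block this instance handles
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                       # guard the tail of the tensor
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(1024, device="cuda")
y = torch.randn(1024, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 256),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=256)
```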

JAX:

Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more.
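
A small sketch of those transformations composing: grad for differentiation, jit for compilation, and vmap for vectorization over a batch axis. The toy loss and shapes are placeholders.

```python
import jax
import jax.numpy as jnp

def loss(w, x, y):
    pred = jnp.dot(x, w)
    return jnp.mean((pred - y) ** 2)

grad_fn = jax.jit(jax.grad(loss))               # differentiate w.r.t. w, then JIT-compile
batched = jax.vmap(loss, in_axes=(None, 0, 0))  # map over a leading batch axis of x and y

w = jnp.ones(3)
x = jnp.ones((8, 3))
y = jnp.zeros(8)
print(grad_fn(w, x, y))   # gradient of the mean-squared error
print(batched(w, x, y))   # per-example losses
```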

Compilers:

Hidet:

Hidet is an open-source deep learning compiler written in Python. It supports end-to-end compilation of DNN models from PyTorch and ONNX to efficient CUDA kernels, applying a series of graph-level and operator-level optimizations to improve performance.
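
With PyTorch 2.x, Hidet can be used as a torch.compile backend, as in the sketch below; the toy model is a placeholder and the backend registration assumes a recent hidet release.

```python
import torch
import hidet  # importing hidet makes the "hidet" backend available to torch.compile

model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).cuda().eval()

# Compile the captured graph with Hidet's kernel generator and optimizations.
compiled = torch.compile(model, backend="hidet")

x = torch.randn(8, 256, device="cuda")
with torch.no_grad():
    y = compiled(x)
```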

Quantization:

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale: https://arxiv.org/abs/2208.07339

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers: https://arxiv.org/abs/2210.17323

GPTQ-for-LLaMa:

GPTQ is a state-of-the-art (SOTA) one-shot weight quantization method; this repository applies it to 4-bit quantization of LLaMA models.

bitsandbytes:

bitsandbytes is a lightweight wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.
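
A minimal sketch of the 8-bit optimizer as a drop-in replacement for torch.optim.Adam (bitsandbytes also powers load_in_8bit=True in 🤗 Transformers); the toy model and learning rate are placeholders.

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()

# 8-bit Adam stores optimizer states in 8 bits, cutting optimizer memory
# roughly 4x versus fp32 Adam while training as usual.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)

x = torch.randn(16, 1024, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```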

AutoGPTQ:

An easy-to-use model quantization package with user-friendly APIs, based on the GPTQ algorithm.
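
A quantization sketch in the spirit of the AutoGPTQ README: load a model, quantize it against a small calibration set, and save the result. The model name, bit width, group size, and calibration text are placeholders, and the exact API may differ between releases.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained = "facebook/opt-125m"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(pretrained)

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)

model = AutoGPTQForCausalLM.from_pretrained(pretrained, quantize_config)

# GPTQ needs a small calibration set to measure and minimize quantization error.
examples = [tokenizer("auto-gptq is an easy-to-use quantization package based on GPTQ.")]
model.quantize(examples)
model.save_quantized("opt-125m-4bit")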

Frameworks:

Ray:

Ray is a unified framework for scaling Python and AI workloads: a core distributed runtime plus libraries for distributed training (Ray Train), hyperparameter tuning (Ray Tune), reinforcement learning (RLlib), and serving (Ray Serve).
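
A minimal sketch of Ray's core task API; the function is a placeholder, and the same pattern scales from a laptop to a cluster.

```python
import ray

ray.init()  # start a local Ray runtime; connects to a cluster when one is configured

@ray.remote
def square(x):
    return x * x

# Tasks run in parallel across the available CPUs/GPUs.
futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))
```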

Lightning:

PyTorch Lightning organizes PyTorch code to separate research logic from engineering boilerplate, so the same LightningModule can run on CPUs, GPUs, or TPUs and scale to multi-node and mixed-precision training through Trainer flags.
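
A minimal LightningModule/Trainer sketch; the toy model, data, and Trainer flags are placeholders.

```python
import torch
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(128, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.net(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-3)

dataset = torch.utils.data.TensorDataset(torch.randn(256, 128), torch.randint(0, 2, (256,)))
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

# The Trainer handles devices, precision, and distributed strategies via flags.
trainer = pl.Trainer(max_epochs=1, accelerator="auto")
trainer.fit(LitClassifier(), loader)
```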


© APAC AI 2022 - 2024