Parallelformers: An Efficient Model Parallelization Toolkit for Deployment

Overview


  • Parallelformers, which is based on Megatron LM, is designed to make model parallelization easier.
  • You can parallelize various models in HuggingFace Transformers on multiple GPUs with a single line of code.
  • Currently, Parallelformers only supports inference. Training features are NOT included.

Why Parallelformers?

With Parallelformers, you can load a model that is too large for a single GPU. For example, you can load a 12GB model onto two 8GB GPUs. You can also save money, since multiple smaller GPUs are usually cheaper than a single larger GPU.

Installation

Parallelformers can be easily installed with the pip package manager. All dependencies, such as torch, transformers, and dacite, are installed automatically by the following command. Note that the package name is plural.

pip install parallelformers

Getting Started

1. Create a HuggingFace transformers model.

You don't need to call .half() or .cuda(); those functions are invoked automatically. It is more memory-efficient to start parallelization from the CPU.

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")

2. Put the model in the parallelize() function.

from parallelformers import parallelize

parallelize(model, num_gpus=2, fp16=True, verbose='detail')

Since nvidia-smi shows the reserved cache area, it is difficult to check the exact amount of allocated memory. To inspect the allocated memory state, set the verbose option to 'detail' or 'simple' (the default is None).

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    2721 MB |    2967 MB |    2967 MB |  251905 KB |
|       from large pool |    2720 MB |    2966 MB |    2966 MB |  251904 KB |
|       from small pool |       1 MB |       1 MB |       1 MB |       1 KB |
|---------------------------------------------------------------------------|

GPU:0 => 2.72GB
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 1                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    2721 MB |    2967 MB |    2967 MB |  251905 KB |
|       from large pool |    2720 MB |    2966 MB |    2966 MB |  251904 KB |
|       from small pool |       1 MB |       1 MB |       1 MB |       1 KB |
|---------------------------------------------------------------------------|

GPU:1 => 2.72GB

3. Do Inference as usual.

You don't have to call .cuda() when creating the input tokens. Note that you should pass both the input tokens and the attention mask to the model (**inputs is the recommended way to do this).

inputs = tokenizer("Parallelformers is", return_tensors="pt")

outputs = model.generate(
    **inputs,
    num_beams=5,
    no_repeat_ngram_size=4,
    max_length=15,
)

print(f"Output: {tokenizer.batch_decode(outputs)[0]}")
Output: Parallelformers is an open-source library for parallel programming ...
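
Batched inputs can be handled the same way, since the attention mask is passed along with the input ids. Below is a minimal sketch, assuming the tokenizer's pad token is set to the EOS token (GPT-Neo does not define one by default) and left padding is used for decoder-only generation:

# Assumption for this sketch: reuse EOS as the pad token and pad on the left.
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

batch = tokenizer(
    ["Parallelformers is", "Model parallelism is"],
    return_tensors="pt",
    padding=True,
)

batch_outputs = model.generate(
    **batch,
    num_beams=5,
    no_repeat_ngram_size=4,
    max_length=15,
)

print(tokenizer.batch_decode(batch_outputs, skip_special_tokens=True))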

4. Deploy the model to the server as usual.

The parallelization process does not affect the web server, because the parallelized model and the server are synchronized automatically.

") def generate_text(text): inputs = tokenizer(text, return_tensors="pt") outputs = model.generate( **inputs, num_beams=5, no_repeat_ngram_size=4, max_length=15, ) outputs = tokenizer.batch_decode( outputs, skip_special_tokens=True, ) return { "inputs": text, "outputs": outputs[0], } app.run(host="0.0.0.0", port=5000) ">
from flask import Flask

app = Flask(__name__)


@app.route("/generate_text/<text>")
def generate_text(text):
    inputs = tokenizer(text, return_tensors="pt")
    
    outputs = model.generate(
        **inputs,
        num_beams=5,
        no_repeat_ngram_size=4,
        max_length=15,
    )
    
    outputs = tokenizer.batch_decode(
        outputs,
        skip_special_tokens=True,
    )
    
    return {
        "inputs": text,
        "outputs": outputs[0],
    }


app.run(host="0.0.0.0", port=5000)

You can send a request to the web server as follows:

$ curl -X GET "YOUR_IP:5000/generate_text/Messi"

And the following result should be returned.

{"inputs": "Messi", "outputs": "Messi is the best player in the world right now. He is the"}

5. Check the current GPU states.

You can check GPU states using .memory_allocated(), .memory_reserved() and .memory_chached() to make sure the parallelization is successful.

model.memory_allocated()
model.memory_reserved()
model.memory_chached()
{'cuda:0':XXXXXX, 'cuda:1':XXXXXX}
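
For a quick per-device overview, the returned dictionary can be iterated over. A minimal sketch, assuming the returned values are byte counts:

# Assumption: values are byte counts keyed by device name, e.g. 'cuda:0'.
for device, num_bytes in model.memory_allocated().items():
    print(f"{device} => {num_bytes / 1024 ** 3:.2f}GB")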

6. Manage the model parallelization states.

You can manage the model parallelization state using .cuda(), .cpu(), and .to(). Calling any of these functions ends the parallelization process.

model.cuda()

print(torch.cuda.memory_summary(0))
print(torch.cuda.memory_summary(1))

Checking the allocated memory status with torch.cuda.memory_summary() now shows:

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |    5121 MB |    5121 MB |    5121 MB |    1024 B  |
|       from large pool |    5120 MB |    5120 MB |    5120 MB |       0 B  |
|       from small pool |       1 MB |       1 MB |       1 MB |    1024 B  |
|---------------------------------------------------------------------------|

GPU0 => 5.12GB
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 1                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |    1024 B  |    1024 B  |    1024 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |    1024 B  |    1024 B  |    1024 B  |
|---------------------------------------------------------------------------|

GPU1 => 0.00GB

If you switch the model to CPU mode, it works like this.

model.cpu()

print(torch.cuda.memory_summary(0))
print(torch.cuda.memory_summary(1))
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |    5121 MB |    5121 MB |    5121 MB |
|       from large pool |       0 B  |    5120 MB |    5120 MB |    5120 MB |
|       from small pool |       0 B  |       1 MB |       1 MB |       1 MB |
|---------------------------------------------------------------------------|

GPU0 => 0.00GB
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 1                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |    1024 B  |    1024 B  |    1024 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |    1024 B  |    1024 B  |    1024 B  |
|---------------------------------------------------------------------------|

GPU1 => 0.00GB

Supported Models

Currently, most models in HuggingFace Transformers are supported. All layers of the models listed below can be parallelized. They include vision models such as ViT and CLIP and speech models such as Wav2Vec2, as well as language models.
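
For example, a vision model can be parallelized in the same way as a language model. A minimal sketch, using the google/vit-base-patch16-224 checkpoint as an illustrative choice:

from transformers import ViTForImageClassification
from parallelformers import parallelize

# Any supported vision checkpoint works; this one is only an example.
vit = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
parallelize(vit, num_gpus=2, fp16=True)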

Fully Supported Models
  • ALBERT
  • BART
  • BARThez (=BERT)
  • BERT
  • BERTweet (=BERT)
  • BertJapanese (=BERT)
  • BertGeneration
  • Blenderbot
  • Blenderbot Small
  • BORT (=BERT)
  • CamemBERT (=RoBERTa)
  • CLIP
  • CPM
  • CTRL
  • DeBERTa
  • DeBERTa-v2
  • DeiT
  • DETR
  • DialoGPT (=GPT2)
  • DistilBERT
  • DPR (=BERT)
  • ELECTRA
  • FlauBERT (=XLM)
  • FSMT
  • Funnel Transformer
  • herBERT (=RoBERTa)
  • I-BERT
  • LayoutLM
  • LED
  • Longformer
  • LUKE
  • LXMERT
  • MarianMT
  • M2M100
  • MBart
  • Mobile BERT
  • MPNet
  • MT5 (=T5)
  • Megatron BERT (=BERT)
  • Megatron GPT2 (=GPT2)
  • OpenAI GPT
  • OpenAI GPT2
  • GPTNeo
  • Hubert
  • Pegasus
  • PhoBERT (=RoBERTa)
  • Reformer
  • RetriBERT
  • RoBERTa
  • RoFormer
  • Speech2Text
  • T5
  • ByT5 (=T5)
  • TAPAS
  • TransformerXL
  • ViT
  • VisualBERT
  • Wav2Vec2
  • XLM
  • XLM-RoBERTa (=RoBERTa)
  • XLNet
  • XLSR-Wav2Vec2

At present the following models are partly supported or not supported.

Partly Supported Models
  • BigBird
  • BigBirdPegasus
  • ConvBERT
  • ProphetNet
  • XLM-ProphetNet
Unsupported Models
  • SqueezeBERT
  • RAG

Advanced Usage

Refer to POLICY.md
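
For example, a user-defined policy class written according to POLICY.md can be passed to parallelize through the custom_policies argument. A minimal sketch, where MyPolicy stands in for such a class:

from parallelformers import parallelize

# MyPolicy is a hypothetical Policy subclass defined as described in POLICY.md.
parallelize(model, num_gpus=2, fp16=True, custom_policies=[MyPolicy])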

FAQ

Refer to FAQ.md.

Contributing

Refer to CONTRIBUTING.md

Documentation

For more detailed information, see the full documentation.

Citation

If you find this library useful, please consider citing:

@misc{parallelformers,
  author       = {Ko, Hyunwoong},
  title        = {Parallelformers: An Efficient Model Parallelization Toolkit for Deployment},
  howpublished = {\url{https://github.com/tunib-ai/parallelformers}},
  year         = {2021},
}

LICENSE

Parallelformers is licensed under the terms of the Apache License 2.0.

Copyright 2021 TUNiB inc. https://www.tunib.ai. All Rights Reserved.

Comments
  • AssertionError: Model should be on CPU before parallelization. It is more memory-efficient.

    Hello, first of all congratulations for this amazing project. It's simple, efficient and versatile. Very useful.

    In some cases, it happens that one has several GPUs, but not enough RAM to parallelize the model. When loading the model on GPU, and then parallelizing, I'm getting the below error: AssertionError: Model should be on CPU before parallelization. It is more memory-efficient.

    It doesn't stop the script, but it seems that the parallelization fails.

    My question is: is it possible to load the initial model on GPU instead of CPU (even if it's not memory-efficient) or not at all?

    Thanks!

    opened by juliensalinas 29
  • Quality issue when integrating with KoGPT3

    Hello, I was testing inference with KoGPT3 using the parallelformers library to check whether the GPTJForCausalLM model is supported.

    The code I ran is as follows.

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM 
    from parallelformers import parallelize
    
    tokenizer = AutoTokenizer.from_pretrained(
      'kakaobrain/kogpt', revision='KoGPT6B-ryan1.5b-float16',  # or float32 version: revision=KoGPT6B-ryan1.5b
      bos_token='[BOS]', eos_token='[EOS]', unk_token='[UNK]', pad_token='[PAD]', mask_token='[MASK]'
    )
    model = AutoModelForCausalLM.from_pretrained(
      'kakaobrain/kogpt', revision='KoGPT6B-ryan1.5b-float16',  # or float32 version: revision=KoGPT6B-ryan1.5b
      pad_token_id=tokenizer.eos_token_id,
      torch_dtype='auto'
    )
    
    parallelize(model, num_gpus=2, fp16=True, verbose='detail')
    
    prompt = '''[공부, 학생, 힘들] => 힘들더라도 학생의 본분은 공부입니다
    [시작, 떨림, 긴장] => 새로운 시작은 항상 떨리고 긴장되죠 파이팅!!
    [방어, 제철, 겨울] => 겨울에는 방어가 제철이죠 방어회 어떠세요?
    [겸손, 인생, 변화] => 인생은 어떻게 변할지 몰라요 항상 겸손한 태도를 갖춰야해요
    [학교, 선생님, 은혜] => 학창시절 선생님의 은혜를 잊지 못해요 감사합니다.
    [입사, 회사, 신입] =>'''
    
    temperature = 0.8
    max_length = 140
    batch_size = 5
    
    inputs = tokenizer([prompt]*batch_size, return_tensors="pt")
    ## when using **inputs
    gen_tokens = model.generate(**inputs, do_sample=True, temperature=temperature, max_length=max_length)
    ## when passing input_ids and attention_mask
    ## gen_tokens = model.generate(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, do_sample=True, temperature=temperature, max_length=max_length)
    generated = tokenizer.batch_decode(gen_tokens)
    

    The outputs are as follows.

    • Without parallelformers: (image)

    • With parallelformers (using **inputs): (image)

    • With parallelformers (passing only input_ids and attention_mask): (image)

    As shown above, generation quality degrades when the model is wrapped with parallelformers (the output even becomes grammatically broken). I'm opening this issue to ask whether I'm using it incorrectly or whether GPT-3-style models are simply not supported :)..

    opened by BangDaeng 13
  • RuntimeError: Cannot re-initialize CUDA in forked subprocess

    How to reproduce

    I'm getting the following error while trying to run the example in the getting started document

    Process ParallelProcess-1:
    Traceback (most recent call last):
      File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
        self.run()
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 251, in run
        engine = ParallelEngine(
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 53, in __init__
        self.mp_group = self.create_process_group(backend)
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 106, in create_process_group
        torch.cuda.set_device(int(os.getenv("LOCAL_RANK", "0")))
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/cuda/__init__.py", line 314, in set_device
        torch._C._cuda_setDevice(device)
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/cuda/__init__.py", line 207, in _lazy_init
        raise RuntimeError(
    RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
    Process ParallelProcess-2:
    Traceback (most recent call last):
      File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
        self.run()
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 251, in run
        engine = ParallelEngine(
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 53, in __init__
        self.mp_group = self.create_process_group(backend)
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/parallelformers/parallel/engine.py", line 104, in create_process_group
        dist.init_process_group(backend=backend)
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
        _store_based_barrier(rank, store, timeout)
      File "/home/ubuntu/.venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 242, in _store_based_barrier
        worker_count = store.add(store_key, 0)
    RuntimeError: Connection reset by peer
    

    This is my code. I'm running it on an AWS g5.12xlarge instance with 4 GPUs.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
    
    from parallelformers import parallelize
    
    parallelize(model, num_gpus=2, fp16=True, verbose='detail')
    
    inputs = tokenizer("Parallelformers is", return_tensors="pt")
    
    outputs = model.generate(
        **inputs,
        num_beams=5,
        no_repeat_ngram_size=4,
        max_length=15,
    )
    
    print(f"Output: {tokenizer.batch_decode(outputs)[0]}")
    
    
    
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A10G         On   | 00000000:00:1B.0 Off |                    0 |
    |  0%   29C    P8    19W / 300W |      2MiB / 23028MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA A10G         On   | 00000000:00:1C.0 Off |                    0 |
    |  0%   29C    P8    16W / 300W |      2MiB / 23028MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  NVIDIA A10G         On   | 00000000:00:1D.0 Off |                    0 |
    |  0%   29C    P8    16W / 300W |      2MiB / 23028MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  NVIDIA A10G         On   | 00000000:00:1E.0 Off |                    0 |
    |  0%   30C    P8    15W / 300W |      2MiB / 23028MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    

    I pip installed multiprocess (https://pypi.org/project/multiprocess/) because initially I kept getting an error that multiprocess could not be found when importing it as mp. Then I noticed there was a PR that removed torch.multiprocessing, done by @Oaklight. Maybe I'm not using the right multiprocessing library? Reverting it back to torch.multiprocessing caused the same error noticed by @Oaklight.

    Environment

    • OS : Ubuntu
    • Python version : 3.9
    • Transformers version : 4.21.0
    • Whether to use Docker: Nope
    • Misc.: Cuda 11.6
    bug 
    opened by cabal-daniel 12
  • Support for GPT-J

    Thanks for the great repo! I have tried it out; it's really amazing to load such a large model across multiple GPUs.

    Describe a requested feature

    Currently, GPT-J is supported only in HF 4.7.0 and by installing

    pip install git+https://github.com/finetuneanon/[email protected]
    

    Your requirements specify HF 4.8.0, which is needed to load several new models. Soon GPT-J will be fully integrated into HF: https://github.com/huggingface/transformers/pull/12243

    I am wondering if is there an easy way to have back compatibility, or include GPT-J soon.

    Thanks again for your great repo 👍🏻

    -- Andrea

    enhancement 
    opened by andreamad8 11
  • AttributeError: Can't get attribute 'MegatronPolicy' on <module '__main__' (built-in)>

    I'm trying to use parallelformers with the megatron-11b pip package. The MegatronPolicy class is as provided on the megatron-11b PyPI page.

    How to reproduce

    from megatron_11b import MegatronForCausalLM, MegatronTokenizer
    
    tokenizer = MegatronTokenizer.from_pretrained("./megatron-11B")
    model = MegatronForCausalLM.from_pretrained("./megatron-11B")
    
    # https://tunib-ai.github.io/parallelformers/intro/POLICY.html
    
    from parallelformers.policies.base import Policy, Layer
    from parallelformers.utils.dist_utils import AllReduceLinear
    from megatron_11b.modeling_megatron import MegatronDecoderLayer
    
    
    class MegatronPolicy(Policy):
    
        @staticmethod
        def replace_arguments(config, world_size):
            return {
                # 1. reduce hidden size
                "self_attn.embed_dim": config.d_model // world_size,
    
                # 2. reduce number of heads
                "self_attn.num_heads": config.encoder_attention_heads // world_size,
            }
    
        @staticmethod
        def attn_qkv():
            return [
                Layer(
                    weight="self_attn.q_proj.weight",
                    bias="self_attn.q_proj.bias",
                ),
                Layer(
                    weight="self_attn.k_proj.weight",
                    bias="self_attn.k_proj.bias",
                ),
                Layer(
                    weight="self_attn.v_proj.weight",
                    bias="self_attn.v_proj.bias",
                ),
            ]
    
        @staticmethod
        def attn_out():
            return [
                Layer(
                    weight="self_attn.out_proj.weight",
                    bias="self_attn.out_proj.bias",
                    replace=AllReduceLinear,
                ),
            ]
    
        @staticmethod
        def mlp_in():
            return [
                Layer(
                    weight="fc1.weight",
                    bias="fc1.bias",
                ),
            ]
    
        @staticmethod
        def mlp_out():
            return [
                Layer(
                    weight="fc2.weight",
                    bias="fc2.bias",
                    replace=AllReduceLinear,
                ),
            ]
    
        @staticmethod
        def original_layer_class():
            return MegatronDecoderLayer
    
    from parallelformers import parallelize
    
    parallelize(model, num_gpus=8, fp16=True, verbose='detail', custom_policies=[MegatronPolicy])
    

    Environment

    • OS : Ubuntu LTS 20.04
    • Python version : 3.8
    • Transformers version : 4.4.2
    • Whether to use Docker: no
    • Misc.: it's executed in a jupyter notebook, which might be the source of the problem: https://stackoverflow.com/a/65001152
    bug 
    opened by Oaklight 6
  • How to load multiple models

    How to reproduce

    • First of all, thank you for creating such a great project.
    • I'm currently testing loading multiple Korean-language models with Flask on a server that has eight GTX 1080 GPUs.
    • Loading a single model across multiple GPUs works fine, but the error below occurs when I load several models at the same time.
    • Is there anything extra I need to do when loading multiple models simultaneously?
    • I select the target GPU for each model by setting the CUDA_VISIBLE_DEVICES environment variable before calling parallelize.
    • ex > os.environ["CUDA_VISIBLE_DEVICES"]="0" , parallelize(model_1, ... )
    •  > os.environ["CUDA_VISIBLE_DEVICES"]="1" ,  parallelize(model_2, ... )

    .... (error occurs when loading the second model)
    ===========================================================       
    model name :  ./model/ko-gpt-trinity-1.2B-v0.5
    CUDA_VISIBLE_DEVICES :  1
    request_gpu :  1                            
    used_gpu    :  2
    ===========================================================          
    Process ParallelProcess-2:                                         
    Traceback (most recent call last):                                        
      File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
        self.run()                  
      File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
        return func(*args, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/parallelformers/parallel/process.py", line 254, in run
        custom_policies=self.custom_policies,
      File "/opt/conda/lib/python3.7/site-packages/parallelformers/parallel/engine.py", line 53, in __init__
        self.mp_group = self.create_process_group(backend)
      File "/opt/conda/lib/python3.7/site-packages/parallelformers/parallel/engine.py", line 104, in create_process_group
        dist.init_process_group(backend=backend)
      File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
        store, rank, world_size = next(rendezvous_iterator)
      File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 229, in _env_rendezvous_handler
        store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
      File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 158, in _create_c10d_store
        hostname, port, world_size, start_daemon, timeout, multi_tenant=True
    RuntimeError: Address already in use
    
    
    • It seems the error occurs when dist.init_process_group is called in parallelformers/parallel/engine.py.
    • How should I change the parallelize call so that multiple models can be loaded at the same time?
        def create_process_group(self, backend: str):
            """
            Create Pytorch distributed process group
            Args:
                backend (str): distributed backend
            Returns:
                ProcessGroupNCCL: process group for parallization
            """
            if not dist.is_initialized():
                dist.init_process_group(backend=backend)
    
            torch.cuda.set_device(int(os.getenv("LOCAL_RANK", "0")))
            new_group = dist.new_group([i for i in range(self.num_gpus)])
    
            return new_group
    
    

    Environment

    • OS : Ubuntu 18.04
    • Python version :3.7.11
    • Transformers version : 4.15.0
    • Whether to use Docker: FROM pytorch/pytorch:1.9.1-cuda11.1-cudnn8-devel
    • Misc.: loading multiple models with parallelformers inside Flask
    question 
    opened by Don9wanKim 6
  • Error using google/UL2 model

    The model: google/ul2

    The Hardware: 2x RTX Titan, AMD Ryzen 9 5900X 12-Core Processor, 64GB RAM

    The Environment: Python 3.9.13, PyTorch 1.12.0+cu102, NVIDIA-SMI 495.29.05, Driver Version 495.29.05, CUDA Version 11.5

    Code used:

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    from parallelformers import parallelize
    import torch
    
    tokenizer = AutoTokenizer.from_pretrained("google/ul2")
    model = AutoModelForSeq2SeqLM.from_pretrained("google/ul2")
    
    parallelize(model, num_gpus=2, fp16=True, verbose='detail')
    
    input_string = "[S2S] Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, solid man with a bald head. Mrs. Dursley was thin and blonde and more than the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbours. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere <extra_id_0>"
    
    inputs = tokenizer(input_string, return_tensors="pt")
    
    outputs = model.generate(**inputs, max_length=200)
    
    print(tokenizer.decode(outputs[0]))
    

    Error Message:

    $ python test.py 
    /home/******/miniconda3/envs/ul2/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 16 leaked semaphore objects to clean up at shutdown
      warnings.warn('resource_tracker: There appear to be %d '
    Bus error (core dumped)
    

    Is this something I can fix? I would love to use this large model, as it's near SOTA on everything :)

    opened by dnhkng 4
  • Support for GPT2-XL

    Thank you for the great project!

    How to reproduce

    https://github.com/snoop2head/Language_Model_Memorization/blob/2c5db6f9bdd0206cba87d13b158d8c27ce0e55a7/parallel_inference.py#L39-L82

    • Tested and works for gpt2, gpt2-medium, gpt2-large
    • If AutoModelForCausalLM is changed into gpt2-xl, it yields the following error message
    File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 193, in inference
        outputs = function_(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1294, in generate
        return self.greedy_search(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1689, in greedy_search
        outputs = self(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1058, in forward
        transformer_outputs = self.transformer(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 901, in forward
        outputs = block(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 401, in forward
        attn_outputs = self.attn(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 325, in forward
        query = self._split_heads(query, self.num_heads, self.head_dim)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 290, in _split_heads
        tensor = tensor.view(new_shape)
    RuntimeError: shape '[116, 5, 12, 64]' is invalid for input of size 464000
    Traceback (most recent call last):
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/parallelformers/parallel/process.py", line 193, in inference
        outputs = function_(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1294, in generate
        return self.greedy_search(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/generation_utils.py", line 1689, in greedy_search
        outputs = self(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1058, in forward
        transformer_outputs = self.transformer(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 901, in forward
        outputs = block(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 401, in forward
        attn_outputs = self.attn(
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 325, in forward
        query = self._split_heads(query, self.num_heads, self.head_dim)
      File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 290, in _split_heads
        tensor = tensor.view(new_shape)
    RuntimeError: shape '[116, 5, 12, 64]' is invalid for input of size 464000
    
    bug 
    opened by snoop2head 3
  • docker support

    We will continue to log problems with Docker containers in this thread, and we aim to solve them. Ultimately, the goal is to deploy the model in a Kubernetes environment. If anyone has any problems with the Docker environment, please feel free to leave issues. We will actively review and resolve them.

    enhancement 
    opened by hyunwoongko 2
  • Support for OPT

    Hi,

    Would it be possible to support new OPT models (a suite of GPT-like models)?

    Here's the official doc: https://huggingface.co/docs/transformers/model_doc/opt

    Thanks for your great work!

    enhancement 
    opened by mrzjy 1
  • GPT2 parallelism does not work on the Tesla K80

    How to reproduce

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from parallelformers import parallelize
    
    model = AutoModelForCausalLM.from_pretrained("distilgpt2")
    tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
    
    parallelize(model, num_gpus=2, fp16=True, verbose='detail')
    
    inputs = tokenizer("Parallelformers is", return_tensors="pt")
    
    outputs = model.generate(
        **inputs,
        max_length=15,
    )
    
    print(f"Output: {tokenizer.batch_decode(outputs)[0]}")
    

    Problem

    The system distributes the model across the GPUs, but during generation the second GPU is loaded at 100% and never leaves this state. Generation fails. (image)

    Environment

    PyTorch version: 1.10.1+cu113
    Is debug build: False
    CUDA used to build PyTorch: 11.3
    ROCM used to build PyTorch: N/A
    
    OS: Ubuntu 18.04.6 LTS (x86_64)
    GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
    Clang version: Could not collect
    CMake version: Could not collect
    Libc version: glibc-2.17
    
    Python version: 3.7.13 (default, Mar 29 2022, 02:18:16)  [GCC 7.5.0] (64-bit runtime)
    Python platform: Linux-4.15.0-187-generic-x86_64-with-debian-buster-sid
    Is CUDA available: True
    CUDA runtime version: Could not collect
    GPU models and configuration: 
    GPU 0: NVIDIA Tesla K80
    GPU 1: NVIDIA Tesla K80
    
    Nvidia driver version: Could not collect
    cuDNN version: Could not collect
    HIP runtime version: N/A
    MIOpen runtime version: N/A
    Is XNNPACK available: True
    
    Versions of relevant libraries:
    [pip3] numpy==1.21.6
    [pip3] torch==1.10.1+cu113
    [conda] numpy                     1.21.6                   pypi_0    pypi
    [conda] torch                     1.10.1+cu113             pypi_0    pypi
    
    bug 
    opened by 0x7o 1
  • Do not check if an object is pickable

    Title

    • Speed up results serialization

    Description

    • This implements the easiest solution to #46 by simply removing the check. It wasn't tested with python 3.6 (which uses standalone dataclasses lib). It may break on dynamically created dataclasses.

    Linked Issues

    • resolved #46
    opened by mkardas 1
  • Speed up results serialization

    Describe a requested feature

    I was running some performance tests and noticed that checking whether an object is pickable (https://github.com/tunib-ai/parallelformers/blob/ccaea515ee2e4d7540f2a275f6cdb0c33a7780f0/parallelformers/parallel/process.py#L209) takes a lot of time when the output is big (e.g., when a model returns a large logits tensor), because the whole object is serialized into memory and then deserialized. I wonder in which cases check_pickable helps, as dataclasses and ModelOutput should be as pickable as their dictionary representations.

    If the check is still needed, I guess the code could be still sped up by modifying an object only on pickle failure. That would require some workarounds (perhaps overriding https://github.com/python/cpython/blob/9dc787ea96916552695e79397588fdfa68f22024/Lib/multiprocessing/queues.py#L275) so I want to make sure the check is still necessary, before giving it a shot. Another option is to always check for https://github.com/tunib-ai/parallelformers/blob/ccaea515ee2e4d7540f2a275f6cdb0c33a7780f0/parallelformers/parallel/process.py#L236-L239 and modify the object even if it's pickable, but that would remove custom fields added outside a definition of a given class.

    enhancement 
    opened by mkardas 0
  • RuntimeError: CUDA error: peer access is not supported between these two devices

    I tried running the example from the readme but received the above error. Does that mean that my hardware is not supported?

    Environment

    • OS : Ubuntu
    • Python version : 3.7.11
    • Transformers version : 4.23.1
    • Whether to use Docker: no
    • GPUs: NVIDIA GeForce RTX 2080 Ti
    bug 
    opened by Dorcoh4 0
  • Bug with T511b inference

    How to reproduce

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer,AutoModelForCausalLM
    from parallelformers import parallelize
    model = AutoModelForCausalLM.from_pretrained('EleutherAI/gpt-neo-2.7B')
    parallelize(model, num_gpus=4, fp16 = False)
    

    Environment

    • OS : 18.04.4 LTS (Bionic Beaver) Ubuntu
    • Python version : 3.7.3
    • Transformers version : 4.22.1
    • Whether to use Docker: No
    • Misc.: N/A
    bug 
    opened by ZeyiLiao 0
  • OSError: [Errno 9] Bad file descriptor

    How to reproduce

    Using a p4d.24xlarge:

    from parallelformers import parallelize
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM
    
    model_name = "facebook/opt-66b"
    batch_size = [1]
    batch = [["out story begins on"] * bs for bs in batch_size]
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
    inputs = [tokenizer(seq, return_tensors="pt").input_ids for seq in batch]
    parallelize(model, num_gpus=8, fp16=True)
    for _ in range(100):
        model.generate(
            torch.cat(inputs, dim=0),
            do_sample=True,
            max_length=2048,
            num_return_sequences=1,
        )
    

    It loads okay and begins performing inference. I can see all 8 GPUs at 90+% utilization in nvidia-smi for a while. Then eventually one GPU drops to 0% and the others jump to 100%. The terminal shows:

    Traceback (most recent call last):                                                                         
      File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
        obj = _ForkingPickler.dumps(obj)                                                                       
      File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps                                                                                                         
        cls(buf, protocol).dump(obj)                                                                           
      File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 367, in reduce_storage                                                                          
        df = multiprocessing.reduction.DupFd(fd)                                                               
      File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/reduction.py", line 198, in DupFd                                                                                                        
        return resource_sharer.DupFd(fd)                                                                       
      File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/resource_sharer.py", line 48, in __init__                                                                                                
        new_fd = os.dup(fd)                                                                                    
    OSError: [Errno 9] Bad file descriptor 
    

    It then seems to hang forever from there.

    I do realize this stack trace doesn't give enough information to trace the problem back to parallelformers, which is frustrating. Maybe it's actually a bug in PyTorch or multiprocessing?

    Environment

    • OS : Ubuntu 20.04.4 LTS
    • Python version : 3.8.13
    • Transformers version : 4.24.0
    • Whether to use Docker : No
    • Misc. : N/A
    bug 
    opened by aws-stdun 0