Model training notes

Clearing GPU memory

sudo fuser -v /dev/nvidia* |awk '{for(i=1;i<=NF;i++)print "kill -9 " $i;}' | sudo sh

Or just list the processes that hold the devices:

fuser -v /dev/nvidia*

https://zhuanlan.zhihu.com/p/637164912
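
The one-liner above kills everything that holds a /dev/nvidia* node, so a dry run first can be safer. Below is a minimal Python sketch of that idea. It relies on fuser writing only the PIDs to stdout (file names and access flags go to stderr). The helper names parse_fuser_pids and gpu_pids are made up for this note, and the script only prints the kill commands instead of running them.

```python
import glob
import subprocess


def parse_fuser_pids(stdout_text):
    """Return the unique PIDs found in fuser's stdout, in first-seen order."""
    pids = []
    for token in stdout_text.split():
        # The same PID shows up once per device node it holds, so de-duplicate.
        if token.isdigit() and int(token) not in pids:
            pids.append(int(token))
    return pids


def gpu_pids():
    """Run fuser on every /dev/nvidia* node and return the PIDs holding them."""
    devices = glob.glob("/dev/nvidia*")
    if not devices:
        return []
    result = subprocess.run(
        ["fuser", *devices],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # file names and access flags are not needed
        text=True,
        check=False,  # fuser exits non-zero when nothing holds the devices
    )
    return parse_fuser_pids(result.stdout)


if __name__ == "__main__":
    for pid in gpu_pids():
        # Dry run: print the command instead of executing it.
        print(f"kill -9 {pid}")
```

Without sudo, fuser may only see your own processes, so run the script with the same privileges you would use for the one-liner.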

nvitop

nvidia-ml-py conflicts with nvidia-ml-py3 and pynvml. All three packages install the same module, pynvml.py. Uninstall the other two and reinstall nvidia-ml-py:

pip3 uninstall nvidia-ml-py3 pynvml
pip3 install --force-reinstall nvidia-ml-py==11.450.51

Alternatively, create a new virtual environment and install nvitop there.

https://github.com/XuehaiPan/nvitop/issues/4#issuecomment-894659795
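
Before uninstalling anything, it can help to see which of the three packages are actually present in the current environment. A small sketch using the standard library's importlib.metadata follows. The helper names installed_providers and has_conflict are invented for this note, and "conflict" here simply means that something other than nvidia-ml-py is installed.

```python
from importlib import metadata

# The three distributions named above; per the note, each installs pynvml.py.
PYNVML_PROVIDERS = ("nvidia-ml-py", "nvidia-ml-py3", "pynvml")


def installed_providers(names=PYNVML_PROVIDERS):
    """Map each installed distribution in `names` to its version string."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # not installed in this environment
    return found


def has_conflict(found):
    """True when any distribution other than nvidia-ml-py is present."""
    return any(name != "nvidia-ml-py" for name in found)


if __name__ == "__main__":
    found = installed_providers()
    for name, version in found.items():
        print(f"{name}=={version}")
    print("conflict" if has_conflict(found) else "ok")
```

Run it with the same interpreter that runs nvitop, otherwise it inspects a different environment.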

vLLM

How to release a vllm.LLM and its GPU memory within a single script, then start another vllm.LLM:

        # llm is an existing vllm.LLM object
        import gc
        import os

        import ray
        import torch
        from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

        # Avoid a huggingface/tokenizers deadlock
        os.environ["TOKENIZERS_PARALLELISM"] = "false"
        # Tear down the model-parallel process groups
        destroy_model_parallel()
        # Drop the executor (a vllm.executor.ray_gpu_executor.RayGPUExecutor here), then the LLM itself
        del llm.llm_engine.model_executor
        del llm
        # Collect the dropped objects, then release PyTorch's cached CUDA memory
        gc.collect()
        torch.cuda.empty_cache()
        # Stop the Ray workers that vLLM started
        ray.shutdown()

https://github.com/vllm-project/vllm/issues/1908#issuecomment-2074543512
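
The import path of destroy_model_parallel in the snippet above depends on the vLLM version. A hedged sketch of a fallback lookup follows. The first candidate is the path from the snippet. The second, vllm.distributed.parallel_state, is my assumption about where newer releases keep it, so check it against your installed version. The helper resolve_attr is made up for this note and uses only importlib.

```python
import importlib

# Candidate modules for destroy_model_parallel. The first is the path used in
# the snippet above; the second is an ASSUMPTION about newer vLLM releases.
CANDIDATES = (
    "vllm.model_executor.parallel_utils.parallel_state",
    "vllm.distributed.parallel_state",
)


def resolve_attr(module_names, attr_name):
    """Return attr_name from the first importable module that has it, else None."""
    for module_name in module_names:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module missing in this version, try the next candidate
        if hasattr(module, attr_name):
            return getattr(module, attr_name)
    return None


if __name__ == "__main__":
    destroy_model_parallel = resolve_attr(CANDIDATES, "destroy_model_parallel")
    if destroy_model_parallel is None:
        print("destroy_model_parallel not found in any candidate module")
    else:
        print("found:", destroy_model_parallel.__module__)
```

If the lookup returns None, the rest of the cleanup (del, gc.collect(), torch.cuda.empty_cache(), ray.shutdown()) can still run, but whether it frees all the memory without destroy_model_parallel is something to verify on your setup.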
