An In-Depth Study of AI Inference Performance: vLLM Multi-Node, Multi-GPU Deployment Architecture and Optimization Practices

张雯 2025-05-16 19:15

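About #普元 #AI探索者: In the vast landscape of AI technology, foundational research is the bedrock for exploring the unknown and opening new fields. This article is part of the "普元 AI Basic Research Series" and walks through our results and insights in AI foundational research.

普元 AI Basic Research Series (Part 4)

As the capabilities of large language models (LLMs) advance rapidly, they are being applied across more and more industries. However, moving these powerful models from the research stage to production efficiently and reliably, especially in complex scenarios that rely on multi-node, multi-GPU resources, poses serious challenges. Running a series of deployment commands is rarely enough; it takes a combined, in-depth look at the underlying system architecture, fine-grained resource management, performance tuning, and long-term operations strategy.

This article examines the deployment architecture and optimization practices for vLLM in multi-node, multi-GPU environments. It shares the problems encountered, the concrete solutions, and the core methodology and lessons drawn from them, as a reference for engineers facing similar challenges.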

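Existing Problems and Their Impact

Large language models (LLMs) offer transformative capabilities, changing how we interact with information and automate tasks. However, moving these models from a research environment to a robust, scalable production inference service, especially in multi-node, multi-GPU configurations, is full of significant challenges. Ignoring these complexities is more than an inconvenience; it can cripple an AI initiative before it gains traction.

Poor performance and dissatisfied users

Problem: Naive or misconfigured deployments often lead to high inference latency (slow responses) and low throughput (few requests processed per second). In distributed setups, communication overhead, resource contention, or an inefficient model-parallel strategy makes this worse.

Negative impact: Users face frustratingly slow applications, real-time interaction becomes impossible, and batch processing is inefficient. This translates directly into a poor user experience, high abandonment rates, and missed service level objectives (SLOs), ultimately damaging the product's reputation and viability.

Soaring operating costs and wasted resources

Problem: GPUs are expensive assets. Inefficient deployments leave them badly underutilized, with GPUs sitting idle or running far below capacity. Incorrect memory management (for example, of the KV cache) can cause frequent out-of-memory (OOM) errors or force over-provisioning.

Negative impact: This translates directly into inflated cloud bills or wasted capital expenditure on on-premises hardware. The cost per inference becomes prohibitive, making the AI service economically unsustainable and hindering wide adoption or profitability.

Scalability bottlenecks and stalled growth

Problem: A poorly designed distributed inference system struggles to scale as user demand grows or as larger, more capable (and more resource-intensive) models are deployed. If the underlying architecture has inherent bottlenecks, adding more nodes or GPUs may not yield proportional performance gains.

Negative impact: The service hits a performance ceiling and cannot serve a growing user base or take advantage of newer, more powerful LLMs. This limits business growth, blocks the introduction of advanced features, and cedes competitive advantage to organizations with stronger MLOps capabilities.

Operational nightmares and reliability issues

Problem: Managing dependencies across multiple nodes (drivers, CUDA, libraries), keeping environments consistent, configuring inter-node communication (for example, for Ray and NCCL), and debugging distributed systems are all far more complex than in a single-node setup.

Negative impact: Deployment takes longer, the service is unstable and prone to frequent crashes or hard-to-diagnose errors, and engineering and operations teams are overloaded. Valuable resources shift from innovation to constant firefighting, which adds stress and lowers team productivity.

Delayed innovation and missed opportunities

Problem: If deploying and scaling LLM inference is a drawn-out struggle, the entire MLOps lifecycle slows down. Experimenting with new models or inference optimization techniques becomes a high-friction activity.

Negative impact: The organization's ability to innovate and respond to market changes is hampered. New AI-driven features reach users slowly, and the potential of cutting-edge AI advances goes unrealized, leading to missed market windows and falling behind competitors.

A well-architected, methodical approach to deploying LLMs in a distributed fashion with tools such as vLLM is therefore not merely beneficial; it is essential for unlocking their real potential while keeping operations sound and costs viable.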

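Methodology, Practical Lessons, and Deeper Reflections

To address the problems above, this section distills the core methods, key lessons, and some forward-looking thoughts from practice, across planning and design, environment setup, parameter tuning, and monitoring and operations.

1. Planning & Design First

Define Needs & Scenarios

Business goals: What is the LLM inference deployment for? Internal experiments, a small-scale application, or large-scale production? The answer directly shapes the requirements for throughput, latency, concurrency, and availability.

Model selection: Qwen2.5-1.5B is a relatively small model. It may be chosen for quick validation, because of resource limits, or for a specific task. For larger models, the demand for GPU memory and compute rises sharply, and the case for multi-node deployment becomes stronger.

Budget and resources: Assess the available hardware (GPU model, count, GPU memory, CPU, RAM, network bandwidth) and the budget. A mixed deployment such as the RTX 3090 and H800 in this example is feasible, but nodes with different performance can introduce bottlenecks.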

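Architectural Choices

Single-node vs. multi-node: When the GPUs of a single node cannot meet the model size (requiring tensor parallelism) or the throughput needs (requiring pipeline parallelism or more replicas), multi-node deployment is the necessary choice.

vLLM's distributed strategy: vLLM relies mainly on Ray for distributed execution. Ray's head node and worker node concepts are central: the head node coordinates, and the worker nodes execute the computation.

Network topology: Multi-node deployment is highly sensitive to inter-node latency and bandwidth. Communication libraries such as NCCL and Gloo depend on an efficient network. Make sure the nodes, and the GPUs in particular, are connected with low latency and high bandwidth (InfiniBand, or at least high-speed Ethernet). In this example, setting NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME ensures that communication goes over the correct, and possibly faster, network interface.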
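As a rough illustration of how these variables reach the container, the Python sketch below lists the host's network interfaces and builds the `-e KEY=VALUE` flags that would be appended to `docker run`. The helper names `list_interfaces` and `build_env_flags` and the interface name `eth0` are hypothetical; `VLLM_HOST_IP` is the variable used in the appendix commands. Pick the interface that actually carries inter-node traffic on your hosts.

```python
import socket


def list_interfaces() -> list[str]:
    """Return the names of the network interfaces visible on this host."""
    return [name for _index, name in socket.if_nameindex()]


def build_env_flags(host_ip: str, nccl_ifname: str, gloo_ifname: str) -> list[str]:
    """Build the `-e KEY=VALUE` flags for docker run (hypothetical helper)."""
    env = {
        "VLLM_HOST_IP": host_ip,
        "NCCL_SOCKET_IFNAME": nccl_ifname,
        "GLOO_SOCKET_IFNAME": gloo_ifname,
    }
    flags: list[str] = []
    for key, value in env.items():
        flags += ["-e", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    # Inspect the candidates first, then choose one deliberately.
    print("interfaces:", list_interfaces())
    print(" ".join(build_env_flags("192.168.1.1", "eth0", "eth0")))
```

Listing the interfaces before choosing one helps avoid pointing NCCL at a management or loopback interface by accident.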

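2. Standardization & Reproducibility

Containerization as Foundation

Consistency: A container environment keeps every version consistent, from the CUDA libraries and Python packages to vLLM itself, which greatly reduces "it works on my machine" problems.

Isolation: It avoids potential conflicts with other applications on the host.

Portability: For example, docker save/load makes it possible to deploy quickly on different nodes, even in offline environments.

Version Control

Version control covers not only code (such as run_cluster.sh) but also the container image version (vllm/vllm-openai:v0.7.2), the model version, the CUDA version, the driver version, and so on. Clear version records are essential when something goes wrong or when a rollback or upgrade is needed.

Infrastructure as Code (IaC): a further thought

For more complex environments, tools such as Ansible and Terraform can automate node preparation (installing and configuring the driver, Docker, and nvidia-container-toolkit), further improving standardization and efficiency.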

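3. Parameter Tuning & Performance Understanding

Core parameters

--tensor-parallel-size

Determines how the model is sharded across GPUs with tensor parallelism. One vLLM instance uses tensor-parallel-size × pipeline-parallel-size GPUs, so when pipeline-parallel-size is 1 the total GPU count should be an integer multiple of tensor-parallel-size. In this example there are 2 nodes with 1 GPU each, 2 GPUs in total, so a value of 2 is reasonable: the model is sharded across 2 GPUs.

--max-model-len

The maximum sequence length the model can handle. It has a large effect on GPU memory usage (the KV cache size is proportional to sequence length and batch size). Too small a value may not fit long texts; too large a value wastes GPU memory or causes OOM.

--gpu_memory_utilization

The fraction of GPU memory that vLLM may use, typically set to 0.9-0.95. This example uses 0.5, perhaps to be conservative or to share the GPU with other applications, but this also limits the KV cache space available to vLLM and therefore its concurrency.

--shm-size (in docker run)

The shared memory size. Ray and some other frameworks may need a large amount of shared memory for inter-process communication. The default may be too small and should be adjusted to the actual workload.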
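To make the relationship between these parameters and GPU memory more concrete, here is a minimal Python sketch of two back-of-the-envelope calculations: the number of GPUs one instance needs, and the per-GPU KV cache bytes per token. The function names are hypothetical, and the model figures (28 layers, 2 KV heads, head dimension 128) are assumptions about Qwen2.5-1.5B-Instruct that should be checked against the model's config.json; the 2-byte element size corresponds to bfloat16. The sketch also assumes the KV heads divide evenly across the tensor-parallel ranks.

```python
def gpus_required(tensor_parallel_size: int, pipeline_parallel_size: int = 1) -> int:
    """GPUs used by one model instance."""
    return tensor_parallel_size * pipeline_parallel_size


def kv_cache_bytes_per_token(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    dtype_bytes: int,
    tensor_parallel_size: int = 1,
) -> int:
    """Per-GPU KV cache bytes for one token (keys and values, all layers)."""
    if num_kv_heads % tensor_parallel_size != 0:
        raise ValueError("this sketch assumes KV heads divide evenly across ranks")
    heads_per_gpu = num_kv_heads // tensor_parallel_size
    # Factor 2: one key vector and one value vector per head and layer.
    return 2 * num_layers * heads_per_gpu * head_dim * dtype_bytes


if __name__ == "__main__":
    # Assumed model shape; verify against the model's config.json.
    per_token = kv_cache_bytes_per_token(28, 2, 128, 2, tensor_parallel_size=2)
    print("GPUs needed:", gpus_required(2))
    print("KV bytes per token per GPU:", per_token)
    print("KV MiB per 128-token sequence per GPU:", per_token * 128 / 2**20)
```

Under these assumptions a full 128-token sequence occupies only about 1.75 MiB of KV cache per GPU, which is why a small --max-model-len leaves room for very high concurrency.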

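Understanding bottlenecks

GPU memory: The most common bottleneck. Model weights, the KV cache, and activations all consume GPU memory. ray status and nvidia-smi are the key tools for troubleshooting.

GPU compute: For compute-intensive operations, the GPU's compute capacity is the bottleneck.

Network: With multi-node tensor or pipeline parallelism, inter-node communication can become the bottleneck.

CPU/IO: Stages such as model loading and data preprocessing can be limited by the CPU or by disk I/O.

Iterative Tuning

Start conservative, adjust, and observe: Begin with a conservative configuration, then gradually increase the load (for example, concurrent requests) or adjust parameters (such as gpu_memory_utilization and max_num_seqs) while watching performance metrics (throughput, latency, GPU memory usage) and system stability.

Log analysis: The vLLM and Ray logs (such as the CUDA graph capturing and memory profiling messages shown in this example) are a valuable source for understanding internal state and finding problems.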
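One way to keep each tuning round comparable is to reduce every run to the same few numbers. The sketch below is a minimal, assumption-laden example: `percentile` and `summarize` are hypothetical helpers, the percentile uses the simple nearest-rank definition, and the sample latencies are made up purely to show the call.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_s: list[float], wall_time_s: float) -> dict[str, float]:
    """Throughput and latency summary for one load-test round."""
    return {
        "requests_per_s": len(latencies_s) / wall_time_s,
        "mean_latency_s": sum(latencies_s) / len(latencies_s),
        "p95_latency_s": percentile(latencies_s, 95),
    }


if __name__ == "__main__":
    # Hypothetical per-request latencies (seconds) from one round.
    round_latencies = [0.2, 0.3, 0.25, 0.4, 1.0]
    print(summarize(round_latencies, wall_time_s=2.0))
```

Comparing these summaries before and after each parameter change makes regressions visible without relying on impressions from the logs alone.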

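4. Monitoring, Operations & Fault Tolerance

Key monitoring metrics

GPU utilization, GPU memory usage, and temperature.

Network throughput and latency (especially between nodes).

Ray cluster status (ray status) and actor status.

QPS, latency, and error rate of the vLLM service itself.

System-level CPU, memory, and disk usage.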
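For the service-level checks, a simple probe of the /health route (which appears in the route list of the startup log in the appendix) is often a reasonable starting point. The following Python sketch uses only the standard library; the function name `is_healthy`, the timeout, and the base URL are assumptions for illustration.

```python
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout_s: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx response.
        return False


if __name__ == "__main__":
    # Port 8000 is the default shown in the startup log; adjust as needed.
    print("healthy:", is_healthy("http://127.0.0.1:8000"))
```

A probe like this can be called periodically from whatever monitoring system is already in place and combined with the GPU and Ray metrics listed above.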

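Log management

Make sure the logs (container logs, Ray logs, vLLM application logs) can be collected, aggregated, and queried, for example with the ELK Stack or Grafana Loki.

Health Checks & Auto-Recovery: a further thought

Ray itself provides some fault tolerance. For production, consider combining it with an orchestrator such as Kubernetes for more complete health checks, automatic restarts, and failover.

Security Considerations

Network isolation, API authentication and authorization, container security scanning, and so on.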

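5. Observations and Recommendations Specific to This Case

Mixed GPU environment: This case combines an RTX 3090 (24G) and an H800 (80G). vLLM/Ray will try to balance the load, but the bottleneck usually appears on the weaker node (here, the GPU memory and compute of the RTX 3090). Take this into account when planning tensor-parallel-size and setting overall performance expectations.

Model mounting: Mounting the model files directly at /root/.cache/huggingface is a common and straightforward approach. For production, or where models are updated frequently, consider a more specialized model storage solution (such as object storage plus a caching layer).

The run_cluster.sh script: This script is a good starting point. It can be extended as needed, for example with more environment variable settings or more flexible Ray startup parameters.

CUDA Graph Capturing: The log contains "Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static." CUDA Graphs can improve performance for specific computation patterns by reducing CPU overhead, but if input shapes (such as sequence length) vary widely and fall outside the captured range, there may be a performance regression or extra overhead. It is important to understand where they apply.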
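The startup log in the appendix lists the captured batch sizes under cudagraph_capture_sizes: 1, 2, 4, and then every multiple of 8 up to 256, 35 sizes in total. The Python sketch below reproduces that list and illustrates the general bucketing idea, in which a batch is served by the smallest captured size that can hold it. The helper names are hypothetical, and the exact padding behavior inside vLLM is an assumption here, not something the log confirms.

```python
import bisect
from typing import Optional


def capture_sizes(max_capture_size: int = 256) -> list[int]:
    """Batch sizes 1, 2, 4, then every multiple of 8 up to max_capture_size."""
    return [1, 2, 4] + list(range(8, max_capture_size + 1, 8))


def padded_batch_size(batch_size: int, sizes: list[int]) -> Optional[int]:
    """Smallest captured size >= batch_size, or None if no captured size fits."""
    i = bisect.bisect_left(sizes, batch_size)
    return sizes[i] if i < len(sizes) else None


if __name__ == "__main__":
    sizes = capture_sizes()
    print("number of captured sizes:", len(sizes))
    for n in (1, 5, 9, 256, 300):
        print(n, "->", padded_batch_size(n, sizes))
```

Batches larger than the largest captured size have no matching graph in this sketch, which corresponds to the "outside the captured range" case described above.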

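Conclusion

Multi-node, multi-GPU deployment of vLLM is a systems engineering effort. Success depends not only on running the right commands but also on understanding the underlying technology (GPUs, networking, Ray, and how vLLM works) and on careful attention at every stage: planning, implementation, tuning, and operations. Starting small, scaling step by step, monitoring continuously, and optimizing iteratively is the path to a stable and efficient inference service. This record offers the community a practical case study.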

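Appendix: Deployment Steps in Practice

This example uses two machines. Node 1, the head, has one RTX 3090 (24G) GPU; node 2, the worker, has an H800 (80G) GPU. It records the process of deploying the Qwen2.5-1.5B-Instruct model with vLLM 0.7.2 and the problems encountered, as a reference for multi-node, multi-GPU inference with vLLM in similar environments.

This article follows the official deployment method:

https://docs.vllm.ai/en/stable/serving/distributed_serving.html

1. Deployment checklist

Deploy the NVIDIA GPU driver
Deploy CUDA 12.4
Deploy nvidia-container-toolkit
Deploy a container environment
Prepare the Qwen2.5-1.5B-Instruct model
Deploy the vLLM image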

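2. Deploy the NVIDIA GPU driver

# Uninstall the existing driver (skip this step on a fresh environment)
bash ./NVIDIA-Linux-x86_64-XXXXX.run --uninstall
# Install the driver
bash ./NVIDIA-Linux-x86_64-XXXXX.run
# Verify the driver
nvidia-smi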

3. Deploy nvidia-container-toolkit

If a package repository is configured, install it directly with dnf install nvidia-container-toolkit or apt install nvidia-container-toolkit.

Docker, which is widely used in the industry, serves as the container environment in this example.

apt-get install nvidia-container-toolkit
# Verify the installation
nvidia-ctk -h
# Configure Docker to support NVIDIA GPUs
nvidia-ctk runtime configure --runtime=docker
# Restart Docker
systemctl restart docker
# Verify that Docker supports the NVIDIA runtime
docker info | grep Runtimes

4. Deploy CUDA 12.4

(Omitted.)

5. Deploy the container environment

(Details omitted.) The steps below use Docker as the example.

tar -xzf docker-27.4.0.tgz
cp docker/* /usr/local/bin/
docker -v

Save the contents of https://github.com/containerd/containerd/blob/main/containerd.service to

/usr/lib/systemd/system/containerd.service

[Unit]
Description=containerd container runtime
Documentation=https://containerd.io
After=network.target local-fs.target dbus.service
[Service]
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/containerd
Type=notify
Delegate=yes
KillMode=process
Restart=always
RestartSec=5
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNPROC=infinity
LimitCORE=infinity
# Comment TasksMax if your systemd version does not supports it.
# Only systemd 226 and above support this version.
TasksMax=infinity
OOMScoreAdjust=-999
[Install]
WantedBy=multi-user.target

systemctl enable --now containerd
systemctl status containerd

Save the following content to

/usr/lib/systemd/system/docker.service

[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network-online.target firewalld.service
Wants=network-online.target
[Service]
Type=notify
ExecStart=/usr/local/bin/dockerd
ExecReload=/bin/kill -s HUP $MAINPID
TimeoutStartSec=0
RestartSec=2
Restart=always
StartLimitBurst=3
StartLimitInterval=60s
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
Delegate=yes
KillMode=process
OOMScoreAdjust=-500
[Install]
WantedBy=multi-user.target

systemctl enable --now docker
systemctl status docker

6. Prepare the Qwen2.5-1.5B-Instruct model

git lfs install
git clone https://www.modelscope.cn/Qwen/Qwen2.5-1.5B-Instruct.git

7. Deploy vLLM

Docker is again used as the example. Node 1 is the head node and node 2 is the worker node.

7.1 Download the vLLM image

Pull the vLLM image on node 1:

docker pull vllm/vllm-openai:v0.7.2

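Export the image and copy it to node 2:

docker save vllm/vllm-openai:v0.7.2 | gzip > images.tar.gz

After copying images.tar.gz to node 2, load it there with docker load -i images.tar.gz. Once the image is ready on both machines, save

https://github.com/vllm-project/vllm/blob/main/examples/online_serving/run_cluster.sh

to /data/model/ :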

#!/bin/bash

# Check for minimum number of required arguments
if [ $# -lt 4 ]; then
    echo "Usage: $0 docker_image head_node_address --head|--worker path_to_hf_home [additional_args...]"
    exit 1
fi

# Assign the first four arguments and shift them away
DOCKER_IMAGE="$1"
HEAD_NODE_ADDRESS="$2"
NODE_TYPE="$3"  # Should be --head or --worker
PATH_TO_HF_HOME="$4"
shift 4

# Additional arguments are passed directly to the Docker command
ADDITIONAL_ARGS=("$@")

# Validate node type
if [ "${NODE_TYPE}" != "--head" ] && [ "${NODE_TYPE}" != "--worker" ]; then
    echo "Error: Node type must be --head or --worker"
    exit 1
fi

# Define a function to cleanup on EXIT signal
cleanup() {
    docker stop node
    docker rm node
}
trap cleanup EXIT

# Command setup for head or worker node
RAY_START_CMD="ray start --block"
if [ "${NODE_TYPE}" == "--head" ]; then
    RAY_START_CMD+=" --head --port=6379"
else
    RAY_START_CMD+=" --address=${HEAD_NODE_ADDRESS}:6379"
fi

# Run the docker command with the user specified parameters and additional arguments
docker run \
    --entrypoint /bin/bash \
    --network host \
    --name node \
    --shm-size 10.24g \
    --gpus all \
    -v "${PATH_TO_HF_HOME}:/root/.cache/huggingface" \
    "${ADDITIONAL_ARGS[@]}" \
    "${DOCKER_IMAGE}" -c "${RAY_START_CMD}"

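7.2 Configure the vLLM cluster

Node 1:

bash run_cluster.sh vllm/vllm-openai:v0.7.2 192.168.1.1 --head /data/model -e VLLM_HOST_IP=192.168.1.1 -e NCCL_SOCKET_IFNAME=eth0 -e GLOO_SOCKET_IFNAME=eth1

Node 2:

bash run_cluster.sh vllm/vllm-openai:v0.7.2 192.168.1.1 --worker /data/model -e VLLM_HOST_IP=<NODE2_IP> -e NCCL_SOCKET_IFNAME=eth0 -e GLOO_SOCKET_IFNAME=eth1

Note: The second argument to the script is the head node's IP on both nodes. VLLM_HOST_IP, by contrast, is meant to be the IP of the node the command runs on, so on node 2 replace <NODE2_IP> with node 2's own address. Output goes to the screen by default; to run in the background, combine the command with nohup.

Enter the container on either node with:

docker exec -ti node bash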

7.3 Check the cluster status

ray status

root@user:~/.cache/huggingface/Qwen# ray status
======== Autoscaler status: 2025-02-23 22:38:04.501687 ========
Node status
-----------------------------------
Active:
1 node_e28141d1456fec8fd4a62dee4c7d9b260425e259b4d1aae177334e9a0
1 node_af0088dd67581f97aae78a7da5cfa8d4c532374616e4e0f54ba28089
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
-----------------------------------
Usage:
0.0/168.0 CPU
0.0/5.0 GPU
0B/1.02TiB memory
0B/1.46GiB object_store_memory
Demands:
(no resource demands)
root@user:~/.cache/huggingface/Qwen#
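Before starting the service it can help to confirm that the cluster actually exposes enough free GPUs for the planned tensor-parallel-size. The Python sketch below parses the "Usage" section of text in the format shown above; the helper names are hypothetical, and the parsing assumes the `used/total GPU` layout of this particular output.

```python
import re
from typing import Optional


def parse_gpu_usage(status_text: str) -> Optional[tuple[float, float]]:
    """Extract (used, total) GPUs from `ray status` text, or None if absent."""
    match = re.search(r"([\d.]+)/([\d.]+)\s+GPU", status_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def has_enough_gpus(status_text: str, needed: int) -> bool:
    """True if the number of unused GPUs is at least `needed`."""
    usage = parse_gpu_usage(status_text)
    if usage is None:
        return False
    used, total = usage
    return total - used >= needed


if __name__ == "__main__":
    sample = "Usage:\n0.0/168.0 CPU\n0.0/5.0 GPU\n0B/1.02TiB memory\n"
    print(parse_gpu_usage(sample))
    print(has_enough_gpus(sample, needed=2))
```

In practice the text would come from running ray status inside the container, for example via subprocess, and the check would run before vllm serve.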

7.4 Start the vLLM service

Start the vLLM service on node 1:

# Set the total tensor-parallel-size for 2 nodes with 1 GPU each
cd /root/.cache/huggingface/Qwen  # confirm the directory where the model is mounted
vllm serve "Qwen2.5-1.5B-Instruct" --tensor-parallel-size 2 --max-model-len 128 --gpu_memory_utilization=0.5

root@user:~/.cache/huggingface/Qwen# vllm serve "Qwen2.5-1.5B-Instruct" --tensor-parallel-size 2 --max-model-len 128 --gpu_memory_utilization 0.5
INFO 02-23 21:55:00 __init__.py:190] Automatically detected platform cuda.
INFO 02-23 21:55:00 api_server.py:840] vLLM API server version 0.7.2
INFO 02-23 21:55:00 api_server.py:841] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_log_stats=False, disable_log_requests=False, enable_frontend_multiprocessing=False, enable_request_id_header=False, enable_auto_tool_choice=False, enable_reasoning_parser=None, tool_call_parser=None, reasoning_parser=None, tool_parser_plugin=None, model='Qwen2.5-1.5B-Instruct', tokenizer=None, skip_tokenizer_init=False, tokenizer_mode='auto', trust_remote_code=False, revision=None, code_revision=None, tokenizer_revision=None, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', tensor_parallel_size=2, pipeline_parallel_size=1, seed=0, swap_space=4, gpu_memory_utilization=0.5, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_probs=False, enforce_eager=False, max_cpu_loras=None, fully_sharded_loras=False, enable_lora=False, lora_modules=None, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, enable_chunked_prefill=None, speculative_model=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, speculative_num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_disable_mqa_scorer=False, speculative_model_quantization=None, speculative_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_alpha=None, typical_acceptance_sampler_posterior_threshold=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, override_pooler_config=None, compilation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_fastapi_docs=False, enable_prompt_tokens_details=False, dispatch_function=<function serve at 0x7f73ab678b80>)
INFO 02-23 21:55:01 api_server.py:206] Started engine process with PID 2815
INFO 02-23 21:55:05 __init__.py:190] Automatically detected platform cuda.
INFO 02-23 21:55:07 config.py:542] This model supports multiple tasks: {'generate', 'reward', 'classify', 'embed', 'score'}. Defaulting to 'generate'.
INFO 02-23 21:55:07 config.py:1401] Defaulting to use mp for distributed inference
INFO 02-23 21:55:12 config.py:542] This model supports multiple tasks: {'embed', 'classify', 'reward', 'score', 'generate'}. Defaulting to 'generate'.
INFO 02-23 21:55:12 config.py:1401] Defaulting to use mp for distributed inference
INFO 02-23 21:55:12 llm_engine.py:234] Initializing a vLLM engine (v0.7.2) with config: model='Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode='auto', revision=None, trust_remote_code=False, tokenizer_revision=None, dtype=torch.bfloat16, max_seq_len=128, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name='Qwen2.5-1.5B-Instruct', num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
WARNING 02-23 21:55:12 multiproc_worker_utils.py:300] Reducing Torch parallelism from 20 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 02-23 21:55:12 custom_cache_manager.py:19] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 02-23 21:55:13 cuda.py:230] Using FlashAttention backend.
INFO 02-23 21:55:16 __init__.py:190] Automatically detected platform cuda.
(VllmWorkerProcess pid=2994) INFO 02-23 21:55:17 multiproc_worker_utils.py:229] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2994) INFO 02-23 21:55:18 cuda.py:230] Using FlashAttention backend.
(VllmWorkerProcess pid=2994) INFO 02-23 21:55:19 utils.py:950] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2994) INFO 02-23 21:55:19 pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorkerProcess pid=2994) INFO 02-23 21:55:19 custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorkerProcess pid=2994) INFO 02-23 21:55:19 shm_broadcast.py:258] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_4f6d3ba1'), local_subscribe_port=48537, remote_subscribe_port=None)
INFO 02-23 21:55:19 model_runner.py:1110] Starting to load model Qwen2.5-1.5B-Instruct...
(VllmWorkerProcess pid=2994) INFO 02-23 21:55:19 model_runner.py:1110] Starting to load model Qwen2.5-1.5B-Instruct...
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.03it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.03it/s]

INFO 02-23 21:55:21 model_runner.py:1115] Loading model weights took 1.4495 GB
(VllmWorkerProcess pid=2994) INFO 02-23 21:55:21 model_runner.py:1115] Loading model weights took 1.4495 GB
INFO 02-23 21:55:24 worker.py:267] Memory profiling takes 3.59 seconds
INFO 02-23 21:55:24 worker.py:267] The current vLLM instance can use total_gpu_memory (23.69GiB) x gpu_memory_utilization (0.50) = 11.85GiB
INFO 02-23 21:55:24 worker.py:267] model weights take 1.45GiB; non_torch_memory takes 0.40GiB; PyTorch activation peak memory takes 1.39GiB; the rest of the memory reserved for KV Cache is 8.60GiB.
(VllmWorkerProcess pid=2994) INFO 02-23 21:55:25 worker.py:267] Memory profiling takes 3.65 seconds
(VllmWorkerProcess pid=2994) INFO 02-23 21:55:25 worker.py:267] The current vLLM instance can use total_gpu_memory (23.69GiB) x gpu_memory_utilization (0.50) = 11.85GiB
(VllmWorkerProcess pid=2994) INFO 02-23 21:55:25 worker.py:267] model weights take 1.45GiB; non_torch_memory takes 0.37GiB; PyTorch activation peak memory takes 0.09GiB; the rest of the memory reserved for KV Cache is 9.93GiB.
INFO 02-23 21:55:25 executor_base.py:110] # CUDA blocks: 40269, # CPU blocks: 18724
INFO 02-23 21:55:25 executor_base.py:115] Maximum concurrency for 128 tokens per request: 5033.62x
(VllmWorkerProcess pid=2994) INFO 02-23 21:55:29 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes:  97%|█████████▋| 34/35 [00:15<00:00, 2.24it/s]
(VllmWorkerProcess pid=2994) INFO 02-23 21:55:45 custom_all_reduce.py:226] Registering 1995 cuda graph addresses
Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:16<00:00, 2.13it/s]
INFO 02-23 21:55:45 custom_all_reduce.py:226] Registering 1995 cuda graph addresses
(VllmWorkerProcess pid=2994) INFO 02-23 21:55:45 model_runner.py:1562] Graph capturing finished in 17 secs, took 0.85GiB
INFO 02-23 21:55:45 model_runner.py:1562] Graph capturing finished in 16 secs, took 0.85GiB
INFO 02-23 21:55:45 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 24.62 seconds
INFO 02-23 21:55:46 api_server.py:756] Using supplied chat template:
INFO 02-23 21:55:46 api_server.py:756] None
INFO 02-23 21:55:46 launcher.py:21] Available routes are:
INFO 02-23 21:55:46 launcher.py:29] Route: /openapi.json, Methods: HEAD, GET
INFO 02-23 21:55:46 launcher.py:29] Route: /docs, Methods: HEAD, GET
INFO 02-23 21:55:46 launcher.py:29] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 02-23 21:55:46 launcher.py:29] Route: /redoc, Methods: HEAD, GET
INFO 02-23 21:55:46 launcher.py:29] Route: /health, Methods: GET
INFO 02-23 21:55:46 launcher.py:29] Route: /ping, Methods: GET
INFO 02-23 21:55:46 launcher.py:29] Route: /tokenize, Methods: POST
INFO 02-23 21:55:46 launcher.py:29] Route: /detokenize, Methods: POST
INFO 02-23 21:55:46 launcher.py:29] Route: /v1/models, Methods: GET
INFO 02-23 21:55:46 launcher.py:29] Route: /version, Methods: GET
INFO 02-23 21:55:46 launcher.py:29] Route: /v1/chat/completions, Methods: POST
INFO 02-23 21:55:46 launcher.py:29] Route: /v1/completions, Methods: POST
INFO 02-23 21:55:46 launcher.py:29] Route: /v1/embeddings, Methods: POST
INFO 02-23 21:55:46 launcher.py:29] Route: /pooling, Methods: POST
INFO 02-23 21:55:46 launcher.py:29] Route: /score, Methods: POST
INFO 02-23 21:55:46 launcher.py:29] Route: /v1/score, Methods: POST
INFO 02-23 21:55:46 launcher.py:29] Route: /rerank, Methods: POST
INFO 02-23 21:55:46 launcher.py:29] Route: /v1/rerank, Methods: POST
INFO 02-23 21:55:46 launcher.py:29] Route: /v2/rerank, Methods: POST
INFO 02-23 21:55:46 launcher.py:29] Route: /invocations, Methods: POST
INFO:     Started server process [2768]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
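The memory figures in this log can be cross-checked with simple arithmetic. The Python sketch below recomputes the KV cache budget from the logged values and the reported maximum concurrency from the number of CUDA blocks. The function names are hypothetical, and the block size of 16 tokens is an assumption (it is not printed in the log), chosen because it reproduces the logged 5033.62x.

```python
def kv_cache_budget_gib(
    total_gib: float,
    utilization: float,
    weights_gib: float,
    non_torch_gib: float,
    activation_peak_gib: float,
) -> float:
    """Memory left for the KV cache after weights, overhead, and activations."""
    return total_gib * utilization - weights_gib - non_torch_gib - activation_peak_gib


def max_concurrency(num_gpu_blocks: int, block_size: int, max_model_len: int) -> float:
    """How many full-length requests fit in the KV cache at once."""
    return num_gpu_blocks * block_size / max_model_len


if __name__ == "__main__":
    # Values copied from the log; results differ slightly due to log rounding.
    print("main process KV budget (GiB):", kv_cache_budget_gib(23.69, 0.50, 1.45, 0.40, 1.39))
    print("worker process KV budget (GiB):", kv_cache_budget_gib(23.69, 0.50, 1.45, 0.37, 0.09))
    # Block size 16 is an assumption, not taken from the log.
    print("max concurrency:", max_concurrency(40269, 16, 128))
```

Both processes report the same 23.69 GiB total, and the difference between their KV cache budgets (8.60 GiB versus 9.93 GiB) comes almost entirely from the different activation peaks (1.39 GiB versus 0.09 GiB).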

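Author: 陆吾 (alias)

Focused on building highly available, high-performance IT infrastructure, with deep expertise in cloud computing for efficient resource management and elastic scaling that keeps systems running reliably. Has a thorough understanding of operating systems, network protocols, and security policy, and draws on professional skill and extensive experience to provide comprehensive, fine-grained operations support for complex IT environments.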
