Open
Labels: bug (Something isn't working)
Description
Your current environment
The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.7.1+cpu
Is debug build: False
OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: Could not collect
CMake version: version 3.31.2
Libc version: glibc-2.31
Python version: 3.9.19 (main, Aug 29 2025, 23:48:01) [GCC 10.2.1 20210110] (64-bit runtime)
Python platform: Linux-5.10.135.bsk.6-amd64-x86_64-with-glibc2.31
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 57 bits virtual
CPU(s): 228
On-line CPU(s) list: 0-227
Thread(s) per core: 2
Core(s) per socket: 57
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 207
Model name: INTEL(R) XEON(R) PLATINUM 8582C
Stepping: 2
CPU MHz: 2600.000
BogoMIPS: 5200.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 5.3 MiB
L1i cache: 3.6 MiB
L2 cache: 228 MiB
L3 cache: 600 MiB
NUMA node0 CPU(s): 0-113
NUMA node1 CPU(s): 114-227
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; TSX disabled
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b fsrm md_clear serialize tsxldtrk arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 arch_capabilities
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==27.0.2
[pip3] torch==2.7.1+cpu
[pip3] torch_npu==2.7.1+git217cd40
[pip3] torchvision==0.22.1+cpu
[pip3] transformers==4.53.3
[conda] Could not collect
vLLM Version: 0.10.0
vLLM Ascend Version: 0.10.0rc2.dev0+g4604882a3.d20250901 (git sha: 4604882a3, date: 20250901)
ENV Variables:
ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
ASCEND_LOG_LEVEL=DEBUG
ATB_OPSRUNNER_SETUP_CACHE_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_GLOBAL=0
ATB_DEVICE_TILING_BUFFER_BLOCK_NUM=32
ASCEND_VISIBLE_DEVICES=4,6
ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
VLLM_TORCH_PROFILER_DIR=/opt/tiger/torchrec_npu/prof
ASCEND_RUNTIME_OPTIONS=
ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1
TORCH_DEVICE_BACKEND_AUTOLOAD=0
ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ASCEND_PROCESS_LOG_PATH=/var/log/tiger/ascend_diag_logs/run_0/process_log
ATB_COMPARE_TILING_EVERY_KERNEL=0
ASCEND_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp
VLLM_TARGET_DEVICE=npu
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/x86_64:/opt/tiger/native_libhdfs/lib/native:/opt/tiger/jdk/jdk8u265-b01/jre/lib/amd64/server:/opt/tiger/yarn_deploy/hadoop_current/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lib/native/ufs:/opt/tiger/yarn_deploy/hadoop/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lzo/lib:/opt/tiger/workspace/Python-3.9.19/output/lib:/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/$(arch)::/usr/local/Ascend/driver/lib64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/toolbox/latest/Ascend-DMI/lib64
ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_MATMUL_SHUFFLE_K_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
ATB_HOST_TILING_BUFFER_BLOCK_NUM=128
ATB_SHARE_MEMORY_NAME_SUFFIX=
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
NPU:
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc2.8               Version: 24.1.rc2.8                                           |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 4     910B2C              | OK            | 96.5        43                0    / 0             |
| 0                         | 0000:6B:02.0  | 0           0    / 0          3413 / 65536         |
+===========================+===============+====================================================+
| 6     910B2C              | OK            | 89.4        43                0    / 0             |
| 0                         | 0000:73:02.0  | 0           0    / 0          3416 / 65536         |
+===========================+===============+====================================================+
CANN:
package_name=Ascend-cann-toolkit
version=8.2.RC1
innerversion=V100R001C22SPC001B231
compatible_version=[V100R001C15],[V100R001C18],[V100R001C19],[V100R001C20],[V100R001C21],[V100R001C23]
arch=x86_64
os=linux
path=/usr/local/Ascend/ascend-toolkit/8.2.RC1/x86_64-linux
🐛 Describe the bug
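In a 32-card multi-node environment, worker startup attempts a write to /usr/bin. Write permission on /usr/bin cannot be granted in this environment. Why is anything being written under /usr/bin in the first place?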
```
[Set][Options]OpCompileProcessor init failed![FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619] Error executing method 'init_worker'. This might cause deadlock in distributed execution
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619] Traceback (most recent call last):
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]   File "/usr/local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 611, in execute_method
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]     return run_method(self, method, args, kwargs)
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]   File "/usr/local/lib/python3.11/site-packages/vllm/utils/__init__.py", line 2985, in run_method
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]     return func(*args, **kwargs)
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]            ^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]   File "/home/.local/lib/python3.11/site-packages/ray/util/tracing/tracing_helper.py", line 463, in _resume_span
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]     return method(self, *_args, **_kwargs)
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]   File "/usr/local/lib/python3.11/site-packages/vllm/worker/worker_base.py", line 592, in init_worker
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]     self.worker = worker_class(**kwargs)
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]                   ^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]   File "/usr/local/lib/python3.11/site-packages/vllm_ascend/worker/worker_v1.py", line 77, in __init__
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]     init_ascend_soc_version()
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]   File "/usr/local/lib/python3.11/site-packages/vllm_ascend/utils.py", line 494, in init_ascend_soc_version
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]     soc_version = torch_npu.npu.get_soc_version()
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]   File "/usr/local/lib/python3.11/site-packages/torch_npu/npu/_backends.py", line 97, in get_soc_version
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]     torch_npu.npu._lazy_init()
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]   File "/usr/local/lib/python3.11/site-packages/torch_npu/npu/__init__.py", line 242, in _lazy_init
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]     torch_npu._C._npu_init()
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619] RuntimeError: SetPrecisionMode:torch_npu/csrc/framework/LazyInitAclops.cpp:155 NPU function error: at_npu::native::AclSetCompileopt(aclCompileOpt::ACL_PRECISION_MODE, precision_mode), error code is 500001
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619] [ERROR] 2025-09-01-20:16:24 (PID:169127, Device:0, RankID:-1) ERR00100 PTA call acl api failed
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619] [Error]: The internal ACL of the system is incorrect.
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]         Rectify the fault based on the error information in the ascend log
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619] E40023: [PID: 169127] 2025-09-01-20:16:24.365.068 Path [/usr/bin] for [ge.debugDir] is invalid.Result: access real path failed. Reason: Permission denied.
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]         Possible Cause: The path does not exist.
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]         Solution: Change the path to the effective value.
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]         TraceBack (most recent call last):
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]         Failed to initialize TeConfigInfo.
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]         [GraphOpt][InitializeInner][InitTbeFunc] Failed to init tbe.[FUNC:InitializeTeFusion][FILE:tbe_op_store_adapter.cc][LINE:1921]
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]         [GraphOpt][InitializeInner][InitTeFusion]: Failed to initialize TeFusion.[FUNC:InitializeInner][FILE:tbe_op_store_adapter.cc][LINE:1888]
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]         [SubGraphOpt][PreCompileOp][InitAdapter] InitializeAdapter adapter [tbe_op_adapter] failed! Ret [4294967295][FUNC:InitializeAdapter][FILE:op_store_adapter_manager.cc][LINE:79]
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]         [SubGraphOpt][PreCompileOp][Init] Initialize op store adapter failed, OpsStoreName[tbe-custom].[FUNC:Initialize][FILE:op_store_adapter_manager.cc][LINE:120]
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]         [FusionMngr][Init] Op store adapter manager init failed.[FUNC:Initialize][FILE:fusion_manager.cc][LINE:115]
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]         PluginManager InvokeAll failed.[FUNC:Initialize][FILE:ops_kernel_manager.cc][LINE:83]
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]         OpsManager initialize failed.[FUNC:InnerInitialize][FILE:gelib.cc][LINE:239]
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]         GELib::InnerInitialize failed.[FUNC:Initialize][FILE:gelib.cc][LINE:164]
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]         GEInitialize failed.[FUNC:GEInitialize][FILE:ge_api.cc][LINE:384]
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]         [Initialize][Ge]GEInitialize failed. ge result = 4294967295[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:161]
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]         [Init][Compiler]Init compiler failed[FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
(RayWorkerWrapper pid=169127, ip=[2605:340:cd51:4900:d515:c204:16f5:ed52]) ERROR 09-01 20:16:24 [worker_base.py:619]         [Set][Options]OpCompileProcessor init failed![FUNC:ReportInnerError][FILE:log_inner.cpp][LINE:145]
```
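The E40023 line reports that the path `/usr/bin` was chosen for `ge.debugDir` and failed a permission check. One possible explanation, which this report does not confirm, is that the debug directory falls back to the worker process's current working directory. Under that assumption, the sketch below checks whether the current directory is writable and, if not, switches to a writable one before the NPU backend is initialized. The helper names `is_writable_dir` and `ensure_writable_cwd` and the `npu_worker_` prefix are hypothetical and not part of vLLM, vllm-ascend, or torch_npu; only the Python standard library is used.

```python
import os
import tempfile
from typing import Optional


def is_writable_dir(path: str) -> bool:
    """Return True if `path` is an existing directory this process can enter and write to."""
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)


def ensure_writable_cwd(fallback: Optional[str] = None) -> str:
    """Hypothetical helper: keep the current directory if it is writable,
    otherwise change to `fallback` (or a fresh temporary directory).

    Assumption: the failing component derives its debug directory from the
    process working directory. The report above does not confirm this.
    """
    cwd = os.getcwd()
    if is_writable_dir(cwd):
        return cwd
    target = fallback or tempfile.mkdtemp(prefix="npu_worker_")
    os.makedirs(target, exist_ok=True)
    os.chdir(target)
    return target


if __name__ == "__main__":
    print("working directory before:", os.getcwd())
    print("working directory after: ", ensure_writable_cwd())
```

If the assumption holds, such a check would need to run inside each worker process before `torch_npu` is first initialized, since a worker's working directory can differ from that of the launching shell. Printing `os.getcwd()` from a worker is also a cheap way to test whether `/usr/bin` really is its working directory.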