Deployment framework
What framework did you use to deploy the model? I tried vLLM on 8x H100 but got the following error:
```
2025-01-22T13:22:49.476492425Z (VllmWorkerProcess pid=362) INFO 01-22 05:22:49 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052249.pkl...
2025-01-22T13:22:49.477126901Z (VllmWorkerProcess pid=363) INFO 01-22 05:22:49 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052249.pkl...
2025-01-22T13:22:49.477129206Z (VllmWorkerProcess pid=361) INFO 01-22 05:22:49 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052249.pkl...
```
Can you provide the full log and your startup command?
I kept the logs from my 4x H200 experiment, but I got the same error on 8x H100.
vLLM parameters: `--host 0.0.0.0 --port 8000 --model cognitivecomputations/DeepSeek-R1-AWQ --gpu-memory-utilization 0.95 --tensor-parallel-size=4 --trust_remote_code`
Logs:
2025-01-22T13:07:50.421133598Z INFO 01-22 05:07:50 api_server.py:712] vLLM API server version 0.6.6.post1
2025-01-22T13:07:50.421303357Z INFO 01-22 05:07:50 api_server.py:713] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='', lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='cognitivecomputations/DeepSeek-R1-AWQ', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=30000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
2025-01-22T13:07:50.430906961Z INFO 01-22 05:07:50 api_server.py:199] Started engine process with PID 89
2025-01-22T13:07:50.643475046Z INFO 01-22 05:07:50 config.py:131] Replacing legacy 'type' key with 'rope_type'
2025-01-22T13:07:53.969666636Z INFO 01-22 05:07:53 config.py:131] Replacing legacy 'type' key with 'rope_type'
2025-01-22T13:07:55.208259634Z INFO 01-22 05:07:55 config.py:510] This model supports multiple tasks: {'score', 'generate', 'reward', 'classify', 'embed'}. Defaulting to 'generate'.
2025-01-22T13:07:55.844302051Z INFO 01-22 05:07:55 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
2025-01-22T13:07:55.888077160Z INFO 01-22 05:07:55 config.py:1310] Defaulting to use mp for distributed inference
2025-01-22T13:07:55.888171171Z WARNING 01-22 05:07:55 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
2025-01-22T13:07:55.888191894Z WARNING 01-22 05:07:55 config.py:642] Async output processing is not supported on the current platform type cuda.
2025-01-22T13:07:58.487429442Z INFO 01-22 05:07:58 config.py:510] This model supports multiple tasks: {'embed', 'reward', 'classify', 'generate', 'score'}. Defaulting to 'generate'.
2025-01-22T13:07:59.106749422Z INFO 01-22 05:07:59 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
2025-01-22T13:07:59.150778826Z INFO 01-22 05:07:59 config.py:1310] Defaulting to use mp for distributed inference
2025-01-22T13:07:59.150878529Z WARNING 01-22 05:07:59 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
2025-01-22T13:07:59.150900534Z WARNING 01-22 05:07:59 config.py:642] Async output processing is not supported on the current platform type cuda.
2025-01-22T13:07:59.173686852Z INFO 01-22 05:07:59 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='cognitivecomputations/DeepSeek-R1-AWQ', speculative_config=None, tokenizer='cognitivecomputations/DeepSeek-R1-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=30000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=cognitivecomputations/DeepSeek-R1-AWQ, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,
2025-01-22T13:07:59.578249195Z WARNING 01-22 05:07:59 multiproc_worker_utils.py:312] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
2025-01-22T13:07:59.583556350Z INFO 01-22 05:07:59 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
2025-01-22T13:07:59.644588714Z INFO 01-22 05:07:59 selector.py:120] Using Flash Attention backend.
2025-01-22T13:07:59.700087505Z (VllmWorkerProcess pid=361) INFO 01-22 05:07:59 selector.py:120] Using Flash Attention backend.
2025-01-22T13:07:59.700196810Z (VllmWorkerProcess pid=361) INFO 01-22 05:07:59 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
2025-01-22T13:07:59.719623814Z (VllmWorkerProcess pid=363) INFO 01-22 05:07:59 selector.py:120] Using Flash Attention backend.
2025-01-22T13:07:59.719626424Z (VllmWorkerProcess pid=362) INFO 01-22 05:07:59 selector.py:120] Using Flash Attention backend.
2025-01-22T13:07:59.719717955Z (VllmWorkerProcess pid=362) INFO 01-22 05:07:59 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
2025-01-22T13:07:59.719719661Z (VllmWorkerProcess pid=363) INFO 01-22 05:07:59 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
2025-01-22T13:08:03.041911052Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:03 utils.py:918] Found nccl from library libnccl.so.2
2025-01-22T13:08:03.041943685Z INFO 01-22 05:08:03 utils.py:918] Found nccl from library libnccl.so.2
2025-01-22T13:08:03.042058901Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:03 pynccl.py:69] vLLM is using nccl==2.21.5
2025-01-22T13:08:03.042067625Z INFO 01-22 05:08:03 pynccl.py:69] vLLM is using nccl==2.21.5
2025-01-22T13:08:03.042084177Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:03 utils.py:918] Found nccl from library libnccl.so.2
2025-01-22T13:08:03.042089699Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:03 utils.py:918] Found nccl from library libnccl.so.2
2025-01-22T13:08:03.042269576Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:03 pynccl.py:69] vLLM is using nccl==2.21.5
2025-01-22T13:08:03.042297664Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:03 pynccl.py:69] vLLM is using nccl==2.21.5
2025-01-22T13:08:04.762844790Z INFO 01-22 05:08:04 custom_all_reduce_utils.py:204] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.714251803Z INFO 01-22 05:08:19 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.714368438Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:19 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.714371653Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:19 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.714609456Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:19 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.747454967Z INFO 01-22 05:08:19 shm_broadcast.py:255] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_ed9c7126'), local_subscribe_port=53933, remote_subscribe_port=None)
2025-01-22T13:08:19.783713362Z INFO 01-22 05:08:19 model_runner.py:1094] Starting to load model cognitivecomputations/DeepSeek-R1-AWQ...
2025-01-22T13:08:19.783863117Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:19 model_runner.py:1094] Starting to load model cognitivecomputations/DeepSeek-R1-AWQ...
2025-01-22T13:08:19.784442981Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:19 model_runner.py:1094] Starting to load model cognitivecomputations/DeepSeek-R1-AWQ...
2025-01-22T13:08:19.784445640Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:19 model_runner.py:1094] Starting to load model cognitivecomputations/DeepSeek-R1-AWQ...
2025-01-22T13:08:20.194644565Z Cache shape torch.Size([163840, 64])
2025-01-22T13:08:20.194662173Z INFO 01-22 05:08:20 weight_utils.py:251] Using model weights format ['*.safetensors']
2025-01-22T13:08:20.234273784Z (VllmWorkerProcess pid=361) Cache shape torch.Size([163840, 64])
2025-01-22T13:08:20.234294554Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:20 weight_utils.py:251] Using model weights format ['*.safetensors']
2025-01-22T13:08:20.234579652Z (VllmWorkerProcess pid=362) Cache shape torch.Size([163840, 64])
2025-01-22T13:08:20.234583739Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:20 weight_utils.py:251] Using model weights format ['*.safetensors']
2025-01-22T13:08:20.243174528Z (VllmWorkerProcess pid=363) Cache shape torch.Size([163840, 64])
2025-01-22T13:08:20.243179760Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:20 weight_utils.py:251] Using model weights format ['*.safetensors']
2025-01-22T13:20:31.182095071Z Loading safetensors checkpoint shards: 0% Completed | 0/74 [00:00<?, ?it/s]
(intermediate shard-loading progress lines omitted)
2025-01-22T13:21:05.153431782Z Loading safetensors checkpoint shards: 100% Completed | 74/74 [00:33<00:00, 2.18it/s]
2025-01-22T13:21:22.200528061Z (VllmWorkerProcess pid=361) INFO 01-22 05:21:22 model_runner.py:1099] Loading model weights took 85.5053 GB
2025-01-22T13:21:23.235285583Z (VllmWorkerProcess pid=362) INFO 01-22 05:21:23 model_runner.py:1099] Loading model weights took 85.5053 GB
2025-01-22T13:21:23.662143488Z (VllmWorkerProcess pid=363) INFO 01-22 05:21:23 model_runner.py:1099] Loading model weights took 85.5053 GB
2025-01-22T13:21:24.130898012Z INFO 01-22 05:21:24 model_runner.py:1099] Loading model weights took 85.5053 GB
2025-01-22T13:21:25.970200306Z (VllmWorkerProcess pid=362) INFO 01-22 05:21:25 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl...
2025-01-22T13:21:25.970911363Z INFO 01-22 05:21:25 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl...
2025-01-22T13:21:25.973483812Z (VllmWorkerProcess pid=363) INFO 01-22 05:21:25 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl...
2025-01-22T13:21:25.973539062Z (VllmWorkerProcess pid=361) INFO 01-22 05:21:25 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl...
2025-01-22T13:21:25.978851389Z (VllmWorkerProcess pid=362) INFO 01-22 05:21:25 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl.
2025-01-22T13:21:25.979807043Z INFO 01-22 05:21:25 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl.
2025-01-22T13:21:25.981052641Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1691, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 527, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 483, in forward
    hidden_states, residual = layer(positions, hidden_states,
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 412, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 158, in forward
    final_hidden_states = self.experts(
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 522, in forward
    final_hidden_states = self.quant_method.apply(
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 463, in apply
    return torch.ops.vllm.fused_marlin_moe(
  File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1116, in __call__
    return self._op(*args, **(kwargs or {}))
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 202, in fused_marlin_moe
    assert hidden_states.dtype == torch.float16
AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_worker_utils.py", line 230, in _run_worker_process
    output = executor(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 202, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1331, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
    raise type(err)(
Add `--dtype float16`, or use the new moe_wna16 kernel, which currently needs to be built from source. The AWQ Marlin MoE kernel asserts `hidden_states.dtype == torch.float16`, while your config defaults to bfloat16, and that is exactly the AssertionError in your traceback.
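For example, applied to the parameters you posted above, the fix would look roughly like this (a sketch using the OpenAI-compatible server entrypoint; substitute whatever entrypoint you actually launch with):

```bash
python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port 8000 \
  --model cognitivecomputations/DeepSeek-R1-AWQ \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 4 \
  --trust_remote_code \
  --dtype float16
```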
This did work on 8x A100 with the setup below, but I did not check output quality, and it only reached around 1.2 tok/s. I can run inference at a batch size of 10 with a max length of 10,000. It might be faster with `VLLM_ATTENTION_BACKEND=FLASHINFER` (for the 8-bit KV cache; see the snippet after the compose file below), but I did not check. It is probably better bang for the buck to use H100s.
```yaml
vllm:
  image: vllm/vllm-openai:latest
  ipc: host
  ports:
    - 8000:8000
  volumes:
    - ../hf_home:/root/.cache/huggingface/
  environment:
    - HF_HOME=/root/.cache/huggingface/
    - TRANSFORMERS_OFFLINE=0
    - HF_DATASET_OFFLINE=1
    - NCCL_P2P_LEVEL=NVL
    - NCCL_SHM_DISABLE=1
    - NCCL_SOCKET_IFNAME=eth0
    - CUDA_LAUNCH_BLOCKING=1
    - TORCH_USE_CUDA_DSA=1
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [ "gpu" ]
  command: >
    --model cognitivecomputations/DeepSeek-R1-AWQ
    --trust-remote-code
    --host 0.0.0.0
    --port 8000
    --max-model-len 10000
    --tensor-parallel-size 8
    --gpu_memory_utilization 0.99
    --swap-space 32
    --kv-cache-dtype fp8
    --enforce-eager
    --dtype float16
```
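If you want to try the FlashInfer backend mentioned above, the only change should be one more entry in the compose `environment:` section (untested on my side, as noted):

```yaml
  environment:
    # untested: FlashInfer attention backend, intended for the 8-bit (fp8) KV cache
    - VLLM_ATTENTION_BACKEND=FLASHINFER
```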
```bash
vllm serve cognitivecomputations/DeepSeek-V3-AWQ \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8080 \
  --max-model-len 10000 \
  --tensor-parallel-size 8 \
  --gpu_memory_utilization 0.99 \
  --swap-space 32 \
  --kv-cache-dtype fp8 \
  --enforce-eager \
  --dtype float16
```
This works and gives 5.2 tok/s.
```bash
vllm serve cognitivecomputations/DeepSeek-R1-AWQ \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8080 \
  --max-model-len 10000 \
  --tensor-parallel-size 8 \
  --gpu_memory_utilization 0.99 \
  --swap-space 32 \
  --kv-cache-dtype fp8 \
  --enforce-eager \
  --dtype float16
```
This works as well, also 5.2 tok/s on 8x A100.
Is that for a batch size of 1? Combined over 10 concurrent requests I had an average throughput of 12.6 tok/s, and vLLM's average-throughput log line is for all concurrent requests combined. If your 5.2 tok/s is for a batch size of 1, any ideas what could cause this huge performance difference on the same hardware?
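To illustrate what I mean by "combined": if you fire several requests at the server at once (rough sketch with curl; the prompt and token counts are just placeholders), the throughput line in the vLLM log counts the tokens of all of them together, not of a single stream:

```bash
# fire 10 concurrent completion requests at the OpenAI-compatible endpoint
for i in $(seq 1 10); do
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "cognitivecomputations/DeepSeek-R1-AWQ",
         "prompt": "Explain tensor parallelism in one paragraph.",
         "max_tokens": 256}' > /dev/null &
done
wait
# the "Avg generation throughput: ... tokens/s" line in the server log now
# reflects all 10 requests combined, not the speed of one request
```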
Use the new kernel for a big performance boost and CUDA graph support.
Does "new kernel" mean a vLLM kernel or a Tensor Core kernel?
How is the model quality?
I used:

```bash
vllm serve cognitivecomputations/DeepSeek-R1-AWQ \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8080 \
  --max-model-len 10000 \
  --tensor-parallel-size 8 \
  --gpu_memory_utilization 0.99 \
  --swap-space 32 \
  --kv-cache-dtype fp8 \
  --enforce-eager \
  --dtype float16
```
The model outputs random things. I removed `--kv-cache-dtype fp8` and now it is working.
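So, for reference, the command that works for me is simply the one above without the KV-cache flag:

```bash
vllm serve cognitivecomputations/DeepSeek-R1-AWQ \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8080 \
  --max-model-len 10000 \
  --tensor-parallel-size 8 \
  --gpu_memory_utilization 0.99 \
  --swap-space 32 \
  --enforce-eager \
  --dtype float16
```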
The release version of vLLM (0.7.2) still has a bug when I run with `--kv-cache-dtype fp8_e5m2 --calculate-kv-scales`. I will try to build vLLM from source and then run again.
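For the source build I am planning roughly the standard editable install (a sketch; the exact steps depend on your CUDA toolchain):

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
# compiles the CUDA kernels locally; this can take quite a while
pip install -e .
```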
A100 80G x8
Hi, how did you build the environment? vLLM 0.7.1 and 0.7.2 both have this problem.
```bash
# Ubuntu 22.04
conda create -n vllm_deepseek-r1 python=3.12
pip install vllm
```
root@compute02:~# nvidia-smi
Sat Feb 8 18:55:56 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04 Driver Version: 535.171.04 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000000:4B:00.0 Off | 0 |
| N/A 44C P0 53W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe Off | 00000000:4C:00.0 Off | 0 |
| N/A 45C P0 56W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100 80GB PCIe Off | 00000000:4E:00.0 Off | 0 |
| N/A 39C P0 52W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100 80GB PCIe Off | 00000000:4F:00.0 Off | 0 |
| N/A 44C P0 59W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A100 80GB PCIe Off | 00000000:CB:00.0 Off | 0 |
| N/A 45C P0 57W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A100 80GB PCIe Off | 00000000:CC:00.0 Off | 0 |
| N/A 45C P0 58W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A100 80GB PCIe Off | 00000000:CD:00.0 Off | 0 |
| N/A 40C P0 56W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A100 80GB PCIe Off | 00000000:CE:00.0 Off | 0 |
| N/A 43C P0 58W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
root@compute02:~# /share/menkeyi/.conda/envs/vllm_deepseek-r1/bin/python -m vllm.entrypoints.openai.api_server --trust-remote-code --tensor-parallel-size 8 --max-model-len 32768 --gpu-memory-utilization 0.97 --dtype float16 --enforce-eager --model ./DeepSeek-R1-AWQ/
INFO 02-08 18:45:23 __init__.py:183] Automatically detected platform cuda.
INFO 02-08 18:45:24 api_server.py:838] vLLM API server version 0.7.1
INFO 02-08 18:45:24 api_server.py:839] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='./DeepSeek-R1-AWQ/', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='float16', kv_cache_dtype='auto', max_model_len=32768, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.97, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 02-08 18:45:24 api_server.py:204] Started engine process with PID 2570820
INFO 02-08 18:45:24 config.py:135] Replacing legacy 'type' key with 'rope_type'
WARNING 02-08 18:45:24 config.py:2368] Casting torch.bfloat16 to torch.float16.
INFO 02-08 18:45:30 __init__.py:183] Automatically detected platform cuda.
INFO 02-08 18:45:32 config.py:135] Replacing legacy 'type' key with 'rope_type'
WARNING 02-08 18:45:32 config.py:2368] Casting torch.bfloat16 to torch.float16.
INFO 02-08 18:45:33 config.py:526] This model supports multiple tasks: {'classify', 'generate', 'score', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 02-08 18:45:34 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 02-08 18:45:34 config.py:1383] Defaulting to use mp for distributed inference
WARNING 02-08 18:45:34 cuda.py:100] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 02-08 18:45:34 config.py:662] Async output processing is not supported on the current platform type cuda.
WARNING 02-08 18:45:34 config.py:975] MLA is not supported with awq_marlin quantization. Disabling MLA.
INFO 02-08 18:45:40 config.py:526] This model supports multiple tasks: {'reward', 'generate', 'classify', 'embed', 'score'}. Defaulting to 'generate'.
INFO 02-08 18:45:41 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 02-08 18:45:41 config.py:1383] Defaulting to use mp for distributed inference
WARNING 02-08 18:45:41 cuda.py:100] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 02-08 18:45:41 config.py:662] Async output processing is not supported on the current platform type cuda.
WARNING 02-08 18:45:41 config.py:975] MLA is not supported with awq_marlin quantization. Disabling MLA.
INFO 02-08 18:45:41 llm_engine.py:232] Initializing a V0 LLM engine (v0.7.1) with config: model='./DeepSeek-R1-AWQ/', speculative_config=None, tokenizer='./DeepSeek-R1-AWQ/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=./DeepSeek-R1-AWQ/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, u
(VllmWorkerProcess pid=2571092) INFO 02-08 18:46:33 model_runner.py:1116] Loading model weights took 43.0922 GB
(VllmWorkerProcess pid=2571096) INFO 02-08 18:46:33 model_runner.py:1116] Loading model weights took 43.0922 GB
(VllmWorkerProcess pid=2571095) INFO 02-08 18:46:36 model_runner.py:1116] Loading model weights took 43.0922 GB
INFO 02-08 18:46:36 model_runner.py:1116] Loading model weights took 43.0922 GB
(VllmWorkerProcess pid=2571093) INFO 02-08 18:46:38 model_runner.py:1116] Loading model weights took 43.0922 GB
(VllmWorkerProcess pid=2571094) INFO 02-08 18:46:39 model_runner.py:1116] Loading model weights took 43.0922 GB
(VllmWorkerProcess pid=2571098) INFO 02-08 18:46:43 model_runner.py:1116] Loading model weights took 43.0922 GB
(VllmWorkerProcess pid=2571097) INFO 02-08 18:46:43 model_runner.py:1116] Loading model weights took 43.0922 GB
(VllmWorkerProcess pid=2571097) WARNING 02-08 18:46:49 fused_moe.py:647] Using default MoE config. Performance might be sub-optimal! Config file not found at /share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=14336,device_name=NVIDIA_A100_80GB_PCIe.json
(VllmWorkerProcess pid=2571094) WARNING 02-08 18:46:49 fused_moe.py:647] Using default MoE config. Performance might be sub-optimal! Config file not found at /share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=14336,device_name=NVIDIA_A100_80GB_PCIe.json
(VllmWorkerProcess pid=2571096) WARNING 02-08 18:46:49 fused_moe.py:647] Using default MoE config. Performance might be sub-optimal! Config file not found at /share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=14336,device_name=NVIDIA_A100_80GB_PCIe.json
(VllmWorkerProcess pid=2571095) WARNING 02-08 18:46:49 fused_moe.py:647] Using default MoE config. Performance might be sub-optimal! Config file not found at /share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=14336,device_name=NVIDIA_A100_80GB_PCIe.json
(VllmWorkerProcess pid=2571098) WARNING 02-08 18:46:49 fused_moe.py:647] Using default MoE config. Performance might be sub-optimal! Config file not found at /share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=14336,device_name=NVIDIA_A100_80GB_PCIe.json
(VllmWorkerProcess pid=2571092) WARNING 02-08 18:46:49 fused_moe.py:647] Using default MoE config. Performance might be sub-optimal! Config file not found at /share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=14336,device_name=NVIDIA_A100_80GB_PCIe.json
WARNING 02-08 18:46:49 fused_moe.py:647] Using default MoE config. Performance might be sub-optimal! Config file not found at /share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=14336,device_name=NVIDIA_A100_80GB_PCIe.json
(VllmWorkerProcess pid=2571093) WARNING 02-08 18:46:49 fused_moe.py:647] Using default MoE config. Performance might be sub-optimal! Config file not found at /share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=14336,device_name=NVIDIA_A100_80GB_PCIe.json
(VllmWorkerProcess pid=2571092) INFO 02-08 18:47:22 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl...
(VllmWorkerProcess pid=2571098) INFO 02-08 18:47:22 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl...
(VllmWorkerProcess pid=2571096) INFO 02-08 18:47:22 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl...
(VllmWorkerProcess pid=2571095) INFO 02-08 18:47:22 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl...
(VllmWorkerProcess pid=2571093) INFO 02-08 18:47:22 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl...
(VllmWorkerProcess pid=2571097) INFO 02-08 18:47:22 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl...
INFO 02-08 18:47:22 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl...
(VllmWorkerProcess pid=2571092) INFO 02-08 18:47:22 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl.
(VllmWorkerProcess pid=2571096) INFO 02-08 18:47:22 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl.
(VllmWorkerProcess pid=2571093) INFO 02-08 18:47:22 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl.
(VllmWorkerProcess pid=2571095) INFO 02-08 18:47:22 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl.
(VllmWorkerProcess pid=2571098) INFO 02-08 18:47:22 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl.
(VllmWorkerProcess pid=2571094) INFO 02-08 18:47:22 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl...
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] Traceback (most recent call last):
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] return func(*args, **kwargs)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1767, in execute_model
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] logits = self.model.compute_logits(hidden_or_intermediate_states,
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v3.py", line 692, in compute_logits
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] logits = self.logits_processor(self.lm_head, hidden_states,
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/layers/logits_processor.py", line 68, in forward
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] logits = self._get_logits(hidden_states, lm_head, embedding_bias)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/layers/logits_processor.py", line 104, in _get_logits
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] logits = tensor_model_parallel_gather(logits)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/distributed/communication_op.py", line 24, in tensor_model_parallel_gather
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] return get_tp_group().gather(input_, dst, dim)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 441, in gather
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] torch.distributed.gather(input_,
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] return func(*args, **kwargs)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 3620, in gather
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] work = group.gather(output_tensors, input_tensors, opts)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] RuntimeError: Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before doing inplace update.See https://github.com/pytorch/rfcs/pull/17 for more details.
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] The above exception was the direct cause of the following exception:
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] Traceback (most recent call last):
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/executor/multiproc_worker_utils.py", line 234, in _run_worker_process
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/utils.py", line 2208, in run_method
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] return func(*args, **kwargs)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] return func(*args, **kwargs)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/worker/worker.py", line 228, in determine_num_available_blocks
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] self.model_runner.profile_run()
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] return func(*args, **kwargs)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1236, in profile_run
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] self._dummy_run(max_num_batched_tokens, max_num_seqs)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1347, in _dummy_run
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] return func(*args, **kwargs)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] raise type(err)(
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250208-184722.pkl): Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before doing inplace update.See https://github.com/pytorch/rfcs/pull/17 for more details.
(VllmWorkerProcess pid=2571098) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=2571098) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] Traceback (most recent call last):
(VllmWorkerProcess pid=2571098) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
(VllmWorkerProcess pid=2571098) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] return func(*args, **kwargs)
(VllmWorkerProcess pid=2571098) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571098) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1767, in execute_model
How can I achieve the 25 tps described in the model card using 8xA100? I found that if I don't add --enforce-eager, a CUDA OOM error occurs. After adding it, the speed is approximately 9 tps. If I also remove --quantization moe_wna16, the speed drops to around 5 tps.
My startup command is as follows:
python -m vllm.entrypoints.openai.api_server \
    --served-model-name deepseek-r1 \
    --model /data/liyongzhi/hf_models/DeepSeek-R1-AWQ \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8080 \
    --max-model-len 10000 \
    --tensor-parallel-size 8 \
    --gpu_memory_utilization 0.999 \
    --swap-space 32 \
    --kv-cache-dtype fp8_e5m2 \
    --calculate-kv-scales \
    --enforce-eager \
    --quantization moe_wna16 \
    --dtype float16
Hello, what is your vllm environment? Mine is not running properly.
I just installed vLLM from source with the commands below:
git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install --editable .
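To confirm which version that actually installs, you can check it afterwards with:
python -c "import vllm; print(vllm.__version__)"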
I can achieve more than 45 tps on 8xA100 when I send 5 requests at the same time; I'm not sure whether the 27 tps in the model card is measured with a single request or with batch inference?
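For reference, a minimal way to send 5 concurrent requests to the OpenAI-compatible endpoint looks something like the following (assuming the port 8080 and served model name deepseek-r1 from the command above; the prompt and max_tokens are just placeholders):
for i in $(seq 1 5); do
  curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 256}' &
done
wait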
@menkeyi I see that you are not using the latest vLLM (v0.7.3); please update it and try again.
@muziyongshixin Remove --enforce-eager and reduce --gpu_memory_utilization to 0.97. The main performance boost comes from CUDA graph capture, which takes about 2 GB of VRAM; you set the memory utilization too high, so it goes out of memory during graph capture.
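Putting those two suggestions together, the adjusted startup command would look roughly like this (a sketch based on the command posted above, with only --enforce-eager removed and --gpu_memory_utilization lowered; everything else unchanged):
python -m vllm.entrypoints.openai.api_server \
    --served-model-name deepseek-r1 \
    --model /data/liyongzhi/hf_models/DeepSeek-R1-AWQ \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8080 \
    --max-model-len 10000 \
    --tensor-parallel-size 8 \
    --gpu_memory_utilization 0.97 \
    --swap-space 32 \
    --kv-cache-dtype fp8_e5m2 \
    --calculate-kv-scales \
    --quantization moe_wna16 \
    --dtype float16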
Is there a v0.7.3 release of vLLM? The following command fails:
pip install --no-cache-dir vllm==0.7.3
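If the PyPI install keeps failing (often an outdated pip or an unsupported Python version, though that is just a guess here), upgrading pip first or building the tagged release from source, as in the commands shared earlier, may work:
pip install --upgrade pip
pip install --no-cache-dir vllm==0.7.3
or
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.7.3
VLLM_USE_PRECOMPILED=1 pip install --editable .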
Hi, are there any detailed command instructions for vLLM's --quantization moe_wna16 on 8xA100? What command do I need to run?
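Not an official answer, but based on the commands earlier in this thread, a starting point on 8xA100 would look something like the following (the model path is a placeholder; adjust --max-model-len and --gpu_memory_utilization for your setup):
python -m vllm.entrypoints.openai.api_server \
    --served-model-name deepseek-r1 \
    --model /path/to/DeepSeek-R1-AWQ \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --max-model-len 10000 \
    --gpu_memory_utilization 0.97 \
    --quantization moe_wna16 \
    --dtype float16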