Deployment framework

#2
by xro7 - opened

What framework did you use to deploy the model? I tried vLLM with 8xH100 but got the following error.

2025-01-22T13:22:49.476492425Z (VllmWorkerProcess pid=362) INFO 01-22 05:22:49 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052249.pkl...
2025-01-22T13:22:49.477126901Z (VllmWorkerProcess pid=363) INFO 01-22 05:22:49 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052249.pkl...
2025-01-22T13:22:49.477129206Z (VllmWorkerProcess pid=361) INFO 01-22 05:22:49 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052249.pkl...
Cognitive Computations org

Can you provide the full log and your startup command?

I kept the logs from my 4xH200 experiment, but I got the same error on 8xH100.

vllm parameters:
--host 0.0.0.0 --port 8000 --model cognitivecomputations/DeepSeek-R1-AWQ --gpu-memory-utilization 0.95 --tensor-parallel-size=4 --trust_remote_code

Logs:

2025-01-22T13:07:50.421133598Z INFO 01-22 05:07:50 api_server.py:712] vLLM API server version 0.6.6.post1
2025-01-22T13:07:50.421303357Z INFO 01-22 05:07:50 api_server.py:713] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key='', lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='cognitivecomputations/DeepSeek-R1-AWQ', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=30000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
2025-01-22T13:07:50.430906961Z INFO 01-22 05:07:50 api_server.py:199] Started engine process with PID 89
2025-01-22T13:07:50.643475046Z INFO 01-22 05:07:50 config.py:131] Replacing legacy 'type' key with 'rope_type'
2025-01-22T13:07:53.969666636Z INFO 01-22 05:07:53 config.py:131] Replacing legacy 'type' key with 'rope_type'
2025-01-22T13:07:55.208259634Z INFO 01-22 05:07:55 config.py:510] This model supports multiple tasks: {'score', 'generate', 'reward', 'classify', 'embed'}. Defaulting to 'generate'.
2025-01-22T13:07:55.844302051Z INFO 01-22 05:07:55 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
2025-01-22T13:07:55.888077160Z INFO 01-22 05:07:55 config.py:1310] Defaulting to use mp for distributed inference
2025-01-22T13:07:55.888171171Z WARNING 01-22 05:07:55 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
2025-01-22T13:07:55.888191894Z WARNING 01-22 05:07:55 config.py:642] Async output processing is not supported on the current platform type cuda.
2025-01-22T13:07:58.487429442Z INFO 01-22 05:07:58 config.py:510] This model supports multiple tasks: {'embed', 'reward', 'classify', 'generate', 'score'}. Defaulting to 'generate'.
2025-01-22T13:07:59.106749422Z INFO 01-22 05:07:59 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
2025-01-22T13:07:59.150778826Z INFO 01-22 05:07:59 config.py:1310] Defaulting to use mp for distributed inference
2025-01-22T13:07:59.150878529Z WARNING 01-22 05:07:59 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
2025-01-22T13:07:59.150900534Z WARNING 01-22 05:07:59 config.py:642] Async output processing is not supported on the current platform type cuda.
2025-01-22T13:07:59.173686852Z INFO 01-22 05:07:59 llm_engine.py:234] Initializing an LLM engine (v0.6.6.post1) with config: model='cognitivecomputations/DeepSeek-R1-AWQ', speculative_config=None, tokenizer='cognitivecomputations/DeepSeek-R1-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=30000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=cognitivecomputations/DeepSeek-R1-AWQ, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[],"max_capture_size":0}, use_cached_outputs=True,
2025-01-22T13:07:59.578249195Z WARNING 01-22 05:07:59 multiproc_worker_utils.py:312] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
2025-01-22T13:07:59.583556350Z INFO 01-22 05:07:59 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
2025-01-22T13:07:59.644588714Z INFO 01-22 05:07:59 selector.py:120] Using Flash Attention backend.
2025-01-22T13:07:59.700087505Z (VllmWorkerProcess pid=361) INFO 01-22 05:07:59 selector.py:120] Using Flash Attention backend.
2025-01-22T13:07:59.700196810Z (VllmWorkerProcess pid=361) INFO 01-22 05:07:59 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
2025-01-22T13:07:59.719623814Z (VllmWorkerProcess pid=363) INFO 01-22 05:07:59 selector.py:120] Using Flash Attention backend.
2025-01-22T13:07:59.719626424Z (VllmWorkerProcess pid=362) INFO 01-22 05:07:59 selector.py:120] Using Flash Attention backend.
2025-01-22T13:07:59.719717955Z (VllmWorkerProcess pid=362) INFO 01-22 05:07:59 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
2025-01-22T13:07:59.719719661Z (VllmWorkerProcess pid=363) INFO 01-22 05:07:59 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
2025-01-22T13:08:03.041911052Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:03 utils.py:918] Found nccl from library libnccl.so.2
2025-01-22T13:08:03.041943685Z INFO 01-22 05:08:03 utils.py:918] Found nccl from library libnccl.so.2
2025-01-22T13:08:03.042058901Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:03 pynccl.py:69] vLLM is using nccl==2.21.5
2025-01-22T13:08:03.042067625Z INFO 01-22 05:08:03 pynccl.py:69] vLLM is using nccl==2.21.5
2025-01-22T13:08:03.042084177Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:03 utils.py:918] Found nccl from library libnccl.so.2
2025-01-22T13:08:03.042089699Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:03 utils.py:918] Found nccl from library libnccl.so.2
2025-01-22T13:08:03.042269576Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:03 pynccl.py:69] vLLM is using nccl==2.21.5
2025-01-22T13:08:03.042297664Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:03 pynccl.py:69] vLLM is using nccl==2.21.5
2025-01-22T13:08:04.762844790Z INFO 01-22 05:08:04 custom_all_reduce_utils.py:204] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.714251803Z INFO 01-22 05:08:19 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.714368438Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:19 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.714371653Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:19 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.714609456Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:19 custom_all_reduce_utils.py:242] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1,2,3.json
2025-01-22T13:08:19.747454967Z INFO 01-22 05:08:19 shm_broadcast.py:255] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_ed9c7126'), local_subscribe_port=53933, remote_subscribe_port=None)
2025-01-22T13:08:19.783713362Z INFO 01-22 05:08:19 model_runner.py:1094] Starting to load model cognitivecomputations/DeepSeek-R1-AWQ...
2025-01-22T13:08:19.783863117Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:19 model_runner.py:1094] Starting to load model cognitivecomputations/DeepSeek-R1-AWQ...
2025-01-22T13:08:19.784442981Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:19 model_runner.py:1094] Starting to load model cognitivecomputations/DeepSeek-R1-AWQ...
2025-01-22T13:08:19.784445640Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:19 model_runner.py:1094] Starting to load model cognitivecomputations/DeepSeek-R1-AWQ...
2025-01-22T13:08:20.194644565Z Cache shape torch.Size([163840, 64])
2025-01-22T13:08:20.194662173Z INFO 01-22 05:08:20 weight_utils.py:251] Using model weights format ['*.safetensors']
2025-01-22T13:08:20.234273784Z (VllmWorkerProcess pid=361) Cache shape torch.Size([163840, 64])
2025-01-22T13:08:20.234294554Z (VllmWorkerProcess pid=361) INFO 01-22 05:08:20 weight_utils.py:251] Using model weights format ['*.safetensors']
2025-01-22T13:08:20.234579652Z (VllmWorkerProcess pid=362) Cache shape torch.Size([163840, 64])
2025-01-22T13:08:20.234583739Z (VllmWorkerProcess pid=362) INFO 01-22 05:08:20 weight_utils.py:251] Using model weights format ['*.safetensors']
2025-01-22T13:08:20.243174528Z (VllmWorkerProcess pid=363) Cache shape torch.Size([163840, 64])
2025-01-22T13:08:20.243179760Z (VllmWorkerProcess pid=363) INFO 01-22 05:08:20 weight_utils.py:251] Using model weights format ['*.safetensors']
2025-01-22T13:20:31.182095071Z 
Loading safetensors checkpoint shards:   0% Completed | 0/74 [00:00<?, ?it/s]
2025-01-22T13:20:31.489554762Z 
Loading safetensors checkpoint shards:   1% Completed | 1/74 [00:00<00:22,  3.25it/s]
2025-01-22T13:20:31.974300115Z 
Loading safetensors checkpoint shards:   3% Completed | 2/74 [00:00<00:29,  2.43it/s]
2025-01-22T13:20:32.455546706Z 
Loading safetensors checkpoint shards:   4% Completed | 3/74 [00:01<00:31,  2.25it/s]
2025-01-22T13:20:32.926504974Z 
Loading safetensors checkpoint shards:   5% Completed | 4/74 [00:01<00:31,  2.20it/s]
2025-01-22T13:20:33.397254243Z 
Loading safetensors checkpoint shards:   7% Completed | 5/74 [00:02<00:31,  2.17it/s]
2025-01-22T13:20:33.875270023Z 
Loading safetensors checkpoint shards:   8% Completed | 6/74 [00:02<00:31,  2.14it/s]
2025-01-22T13:20:34.344715583Z 
Loading safetensors checkpoint shards:   9% Completed | 7/74 [00:03<00:31,  2.14it/s]
2025-01-22T13:20:34.821748448Z 
Loading safetensors checkpoint shards:  11% Completed | 8/74 [00:03<00:31,  2.13it/s]
2025-01-22T13:20:35.290056371Z 
Loading safetensors checkpoint shards:  12% Completed | 9/74 [00:04<00:30,  2.13it/s]
2025-01-22T13:20:35.755523220Z 
Loading safetensors checkpoint shards:  14% Completed | 10/74 [00:04<00:29,  2.13it/s]
2025-01-22T13:20:36.228502702Z 
Loading safetensors checkpoint shards:  15% Completed | 11/74 [00:05<00:29,  2.13it/s]
2025-01-22T13:20:36.700871980Z 
Loading safetensors checkpoint shards:  16% Completed | 12/74 [00:05<00:29,  2.12it/s]
2025-01-22T13:20:37.183470090Z 
Loading safetensors checkpoint shards:  18% Completed | 13/74 [00:06<00:28,  2.11it/s]
2025-01-22T13:20:37.657741308Z 
Loading safetensors checkpoint shards:  19% Completed | 14/74 [00:06<00:28,  2.11it/s]
2025-01-22T13:20:38.121128353Z 
Loading safetensors checkpoint shards:  20% Completed | 15/74 [00:06<00:27,  2.12it/s]
2025-01-22T13:20:38.589453375Z 
Loading safetensors checkpoint shards:  22% Completed | 16/74 [00:07<00:27,  2.13it/s]
2025-01-22T13:20:39.047142026Z 
Loading safetensors checkpoint shards:  23% Completed | 17/74 [00:07<00:26,  2.14it/s]
2025-01-22T13:20:39.491344292Z 
Loading safetensors checkpoint shards:  24% Completed | 18/74 [00:08<00:25,  2.18it/s]
2025-01-22T13:20:39.929711441Z 
Loading safetensors checkpoint shards:  26% Completed | 19/74 [00:08<00:24,  2.21it/s]
2025-01-22T13:20:40.374986470Z 
Loading safetensors checkpoint shards:  27% Completed | 20/74 [00:09<00:24,  2.22it/s]
2025-01-22T13:20:40.818969728Z 
Loading safetensors checkpoint shards:  28% Completed | 21/74 [00:09<00:23,  2.23it/s]
2025-01-22T13:20:41.273748530Z 
Loading safetensors checkpoint shards:  30% Completed | 22/74 [00:10<00:23,  2.22it/s]
2025-01-22T13:20:41.739147123Z 
Loading safetensors checkpoint shards:  31% Completed | 23/74 [00:10<00:23,  2.20it/s]
2025-01-22T13:20:42.188972601Z 
Loading safetensors checkpoint shards:  32% Completed | 24/74 [00:11<00:22,  2.21it/s]
2025-01-22T13:20:42.641780672Z 
Loading safetensors checkpoint shards:  34% Completed | 25/74 [00:11<00:22,  2.21it/s]
2025-01-22T13:20:43.096641696Z 
Loading safetensors checkpoint shards:  35% Completed | 26/74 [00:11<00:21,  2.20it/s]
2025-01-22T13:20:43.567797093Z 
Loading safetensors checkpoint shards:  36% Completed | 27/74 [00:12<00:21,  2.18it/s]
2025-01-22T13:20:44.046209789Z 
Loading safetensors checkpoint shards:  38% Completed | 28/74 [00:12<00:21,  2.15it/s]
2025-01-22T13:20:44.525739823Z 
Loading safetensors checkpoint shards:  39% Completed | 29/74 [00:13<00:21,  2.13it/s]
2025-01-22T13:20:45.062838963Z 
Loading safetensors checkpoint shards:  41% Completed | 30/74 [00:13<00:21,  2.04it/s]
2025-01-22T13:20:45.538771429Z 
Loading safetensors checkpoint shards:  42% Completed | 31/74 [00:14<00:20,  2.06it/s]
2025-01-22T13:20:46.003535599Z 
Loading safetensors checkpoint shards:  43% Completed | 32/74 [00:14<00:20,  2.09it/s]
2025-01-22T13:20:46.479112534Z 
Loading safetensors checkpoint shards:  45% Completed | 33/74 [00:15<00:19,  2.09it/s]
2025-01-22T13:20:46.945277181Z 
Loading safetensors checkpoint shards:  46% Completed | 34/74 [00:15<00:18,  2.11it/s]
2025-01-22T13:20:47.399506630Z 
Loading safetensors checkpoint shards:  47% Completed | 35/74 [00:16<00:18,  2.13it/s]
2025-01-22T13:20:47.862872167Z 
Loading safetensors checkpoint shards:  49% Completed | 36/74 [00:16<00:17,  2.14it/s]
2025-01-22T13:20:48.339609077Z 
Loading safetensors checkpoint shards:  50% Completed | 37/74 [00:17<00:17,  2.13it/s]
2025-01-22T13:20:48.810059207Z 
Loading safetensors checkpoint shards:  51% Completed | 38/74 [00:17<00:16,  2.13it/s]
2025-01-22T13:20:49.280713034Z 
Loading safetensors checkpoint shards:  53% Completed | 39/74 [00:18<00:16,  2.13it/s]
2025-01-22T13:20:49.748002366Z 
Loading safetensors checkpoint shards:  54% Completed | 40/74 [00:18<00:15,  2.13it/s]
2025-01-22T13:20:50.200210526Z 
Loading safetensors checkpoint shards:  55% Completed | 41/74 [00:19<00:15,  2.15it/s]
2025-01-22T13:20:50.657614498Z 
Loading safetensors checkpoint shards:  57% Completed | 42/74 [00:19<00:14,  2.16it/s]
2025-01-22T13:20:51.128247380Z 
Loading safetensors checkpoint shards:  58% Completed | 43/74 [00:19<00:14,  2.15it/s]
2025-01-22T13:20:51.599344184Z 
Loading safetensors checkpoint shards:  59% Completed | 44/74 [00:20<00:13,  2.14it/s]
2025-01-22T13:20:52.074519018Z 
Loading safetensors checkpoint shards:  61% Completed | 45/74 [00:20<00:13,  2.13it/s]
2025-01-22T13:20:52.549870992Z 
Loading safetensors checkpoint shards:  62% Completed | 46/74 [00:21<00:13,  2.12it/s]
2025-01-22T13:20:53.041993357Z 
Loading safetensors checkpoint shards:  64% Completed | 47/74 [00:21<00:12,  2.09it/s]
2025-01-22T13:20:53.515416397Z 
Loading safetensors checkpoint shards:  65% Completed | 48/74 [00:22<00:12,  2.10it/s]
2025-01-22T13:20:53.985782219Z 
Loading safetensors checkpoint shards:  66% Completed | 49/74 [00:22<00:11,  2.11it/s]
2025-01-22T13:20:54.445680829Z 
Loading safetensors checkpoint shards:  68% Completed | 50/74 [00:23<00:11,  2.13it/s]
2025-01-22T13:20:54.916269219Z 
Loading safetensors checkpoint shards:  69% Completed | 51/74 [00:23<00:10,  2.13it/s]
2025-01-22T13:20:55.389394303Z 
Loading safetensors checkpoint shards:  70% Completed | 52/74 [00:24<00:10,  2.12it/s]
2025-01-22T13:20:55.866349991Z 
Loading safetensors checkpoint shards:  72% Completed | 53/74 [00:24<00:09,  2.11it/s]
2025-01-22T13:20:56.347850931Z 
Loading safetensors checkpoint shards:  73% Completed | 54/74 [00:25<00:09,  2.10it/s]
2025-01-22T13:20:56.794412370Z 
Loading safetensors checkpoint shards:  74% Completed | 55/74 [00:25<00:08,  2.14it/s]
2025-01-22T13:20:57.262317289Z 
Loading safetensors checkpoint shards:  76% Completed | 56/74 [00:26<00:08,  2.14it/s]
2025-01-22T13:20:57.732185124Z 
Loading safetensors checkpoint shards:  77% Completed | 57/74 [00:26<00:07,  2.14it/s]
2025-01-22T13:20:58.194820443Z 
Loading safetensors checkpoint shards:  78% Completed | 58/74 [00:27<00:07,  2.14it/s]
2025-01-22T13:20:58.670495387Z 
Loading safetensors checkpoint shards:  80% Completed | 59/74 [00:27<00:07,  2.13it/s]
2025-01-22T13:20:59.140341139Z 
Loading safetensors checkpoint shards:  81% Completed | 60/74 [00:27<00:06,  2.13it/s]
2025-01-22T13:20:59.613002800Z 
Loading safetensors checkpoint shards:  82% Completed | 61/74 [00:28<00:06,  2.13it/s]
2025-01-22T13:21:00.086442184Z 
Loading safetensors checkpoint shards:  84% Completed | 62/74 [00:28<00:05,  2.12it/s]
2025-01-22T13:21:00.560259399Z 
Loading safetensors checkpoint shards:  85% Completed | 63/74 [00:29<00:05,  2.12it/s]
2025-01-22T13:21:01.037240553Z 
Loading safetensors checkpoint shards:  86% Completed | 64/74 [00:29<00:04,  2.11it/s]
2025-01-22T13:21:01.498437763Z 
Loading safetensors checkpoint shards:  88% Completed | 65/74 [00:30<00:04,  2.13it/s]
2025-01-22T13:21:01.969160301Z 
Loading safetensors checkpoint shards:  89% Completed | 66/74 [00:30<00:03,  2.13it/s]
2025-01-22T13:21:02.440027377Z 
Loading safetensors checkpoint shards:  91% Completed | 67/74 [00:31<00:03,  2.13it/s]
2025-01-22T13:21:02.908381363Z 
Loading safetensors checkpoint shards:  92% Completed | 68/74 [00:31<00:02,  2.13it/s]
2025-01-22T13:21:03.381695121Z 
Loading safetensors checkpoint shards:  93% Completed | 69/74 [00:32<00:02,  2.12it/s]
2025-01-22T13:21:03.845546580Z 
Loading safetensors checkpoint shards:  95% Completed | 70/74 [00:32<00:01,  2.13it/s]
2025-01-22T13:21:04.311999508Z 
Loading safetensors checkpoint shards:  96% Completed | 71/74 [00:33<00:01,  2.14it/s]
2025-01-22T13:21:04.789659443Z 
Loading safetensors checkpoint shards:  97% Completed | 72/74 [00:33<00:00,  2.12it/s]
2025-01-22T13:21:05.098397817Z 
Loading safetensors checkpoint shards:  99% Completed | 73/74 [00:33<00:00,  2.37it/s]
2025-01-22T13:21:05.153431782Z 
Loading safetensors checkpoint shards: 100% Completed | 74/74 [00:33<00:00,  2.18it/s]
2025-01-22T13:21:22.200528061Z (VllmWorkerProcess pid=361) INFO 01-22 05:21:22 model_runner.py:1099] Loading model weights took 85.5053 GB
2025-01-22T13:21:23.235285583Z (VllmWorkerProcess pid=362) INFO 01-22 05:21:23 model_runner.py:1099] Loading model weights took 85.5053 GB
2025-01-22T13:21:23.662143488Z (VllmWorkerProcess pid=363) INFO 01-22 05:21:23 model_runner.py:1099] Loading model weights took 85.5053 GB
2025-01-22T13:21:24.130898012Z INFO 01-22 05:21:24 model_runner.py:1099] Loading model weights took 85.5053 GB
2025-01-22T13:21:25.970200306Z (VllmWorkerProcess pid=362) INFO 01-22 05:21:25 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl...
2025-01-22T13:21:25.970911363Z INFO 01-22 05:21:25 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl...
2025-01-22T13:21:25.973483812Z (VllmWorkerProcess pid=363) INFO 01-22 05:21:25 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl...
2025-01-22T13:21:25.973539062Z (VllmWorkerProcess pid=361) INFO 01-22 05:21:25 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl...
2025-01-22T13:21:25.978851389Z (VllmWorkerProcess pid=362) INFO 01-22 05:21:25 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl.
2025-01-22T13:21:25.979807043Z INFO 01-22 05:21:25 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250122-052125.pkl.
2025-01-22T13:21:25.981052641Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
2025-01-22T13:21:25.981054883Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] Traceback (most recent call last):
2025-01-22T13:21:25.981057016Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
2025-01-22T13:21:25.981058711Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return func(*args, **kwargs)
2025-01-22T13:21:25.981060648Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981061614Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1691, in execute_model
2025-01-22T13:21:25.981062571Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     hidden_or_intermediate_states = model_executable(
2025-01-22T13:21:25.981064343Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]                                     ^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981067242Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
2025-01-22T13:21:25.981068486Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return self._call_impl(*args, **kwargs)
2025-01-22T13:21:25.981069736Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981070984Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
2025-01-22T13:21:25.981072436Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return forward_call(*args, **kwargs)
2025-01-22T13:21:25.981080706Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981082054Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 527, in forward
2025-01-22T13:21:25.981083090Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     hidden_states = self.model(input_ids, positions, kv_caches,
2025-01-22T13:21:25.981084195Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981085485Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
2025-01-22T13:21:25.981086427Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return self._call_impl(*args, **kwargs)
2025-01-22T13:21:25.981087552Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981088523Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
2025-01-22T13:21:25.981089556Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return forward_call(*args, **kwargs)
2025-01-22T13:21:25.981090564Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981091589Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 483, in forward
2025-01-22T13:21:25.981092501Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     hidden_states, residual = layer(positions, hidden_states,
2025-01-22T13:21:25.981093914Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981094910Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
2025-01-22T13:21:25.981096149Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return self._call_impl(*args, **kwargs)
2025-01-22T13:21:25.981097296Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981098229Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
2025-01-22T13:21:25.981099158Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return forward_call(*args, **kwargs)
2025-01-22T13:21:25.981100096Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981101044Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 412, in forward
2025-01-22T13:21:25.981101969Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     hidden_states = self.mlp(hidden_states)
2025-01-22T13:21:25.981104922Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]                     ^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981105903Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
2025-01-22T13:21:25.981106883Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return self._call_impl(*args, **kwargs)
2025-01-22T13:21:25.981107795Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981108916Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
2025-01-22T13:21:25.981109861Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return forward_call(*args, **kwargs)
2025-01-22T13:21:25.981110821Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981111777Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v3.py", line 158, in forward
2025-01-22T13:21:25.981112700Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     final_hidden_states = self.experts(
2025-01-22T13:21:25.981113637Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]                           ^^^^^^^^^^^^^
2025-01-22T13:21:25.981114804Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
2025-01-22T13:21:25.981115921Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return self._call_impl(*args, **kwargs)
2025-01-22T13:21:25.981117012Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981118081Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
2025-01-22T13:21:25.981119209Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return forward_call(*args, **kwargs)
2025-01-22T13:21:25.981120307Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981121683Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 522, in forward
2025-01-22T13:21:25.981123129Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     final_hidden_states = self.quant_method.apply(
2025-01-22T13:21:25.981124109Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]                           ^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981125040Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 463, in apply
2025-01-22T13:21:25.981126118Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return torch.ops.vllm.fused_marlin_moe(
2025-01-22T13:21:25.981127065Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981129560Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1116, in __call__
2025-01-22T13:21:25.981130829Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return self._op(*args, **(kwargs or {}))
2025-01-22T13:21:25.981132369Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981133322Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 202, in fused_marlin_moe
2025-01-22T13:21:25.981134990Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     assert hidden_states.dtype == torch.float16
2025-01-22T13:21:25.981135915Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981136986Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] AssertionError
2025-01-22T13:21:25.981138483Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]
2025-01-22T13:21:25.981140056Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] The above exception was the direct cause of the following exception:
2025-01-22T13:21:25.981141343Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]
2025-01-22T13:21:25.981142564Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236] Traceback (most recent call last):
2025-01-22T13:21:25.981143687Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_worker_utils.py", line 230, in _run_worker_process
2025-01-22T13:21:25.981144781Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     output = executor(*args, **kwargs)
2025-01-22T13:21:25.981145931Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]              ^^^^^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981147028Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-01-22T13:21:25.981148165Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return func(*args, **kwargs)
2025-01-22T13:21:25.981149122Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981150069Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 202, in determine_num_available_blocks
2025-01-22T13:21:25.981150994Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     self.model_runner.profile_run()
2025-01-22T13:21:25.981151936Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-01-22T13:21:25.981152905Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return func(*args, **kwargs)
2025-01-22T13:21:25.981154018Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981155223Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1331, in profile_run
2025-01-22T13:21:25.981158094Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     self.execute_model(model_input, kv_caches, intermediate_tensors)
2025-01-22T13:21:25.981159193Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
2025-01-22T13:21:25.981160159Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     return func(*args, **kwargs)
2025-01-22T13:21:25.981161245Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^^^^^
2025-01-22T13:21:25.981162372Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
2025-01-22T13:21:25.981179430Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 multiproc_worker_utils.py:236]     raise type(err)(
2025-01-22T13:21:25.981180548Z (VllmWorkerProcess pid=362) ERROR 01-22 05:21:25 
Cognitive Computations org

Add --dtype float16, or use the new moe_wna16 kernel, which needs to be built from source.
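
As a concrete illustration (a sketch, not a tested configuration), the failing parameter list from the 4xH200 run above only needs the dtype override appended:

--host 0.0.0.0 --port 8000 --model cognitivecomputations/DeepSeek-R1-AWQ --gpu-memory-utilization 0.95 --tensor-parallel-size=4 --trust_remote_code --dtype float16

The float16 override matters because the awq_marlin fused-MoE path asserts hidden_states.dtype == torch.float16, which is exactly the assertion failing in the traceback above.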

It did work on 8x A100 with the setup below, but I did not check the output quality; throughput was only around 1.2 T/s. It can run inference at a batch size of 10 with a max length of 10,000.

It might be faster with VLLM_ATTENTION_BACKEND=FLASHINFER (for an 8-bit KV cache), but I did not check; H100s are probably better bang for the buck.

vllm:
  image: vllm/vllm-openai:latest
  ipc: host
  ports:
    - 8000:8000
  volumes:
    - ../hf_home:/root/.cache/huggingface/
  environment:
    - HF_HOME=/root/.cache/huggingface/
    - TRANSFORMERS_OFFLINE=0
    - HF_DATASET_OFFLINE=1
    - NCCL_P2P_LEVEL=NVL
    - NCCL_SHM_DISABLE=1
    - NCCL_SOCKET_IFNAME=eth0
    - CUDA_LAUNCH_BLOCKING=1
    - TORCH_USE_CUDA_DSA=1
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [ "gpu" ]
  command: >
    --model cognitivecomputations/DeepSeek-R1-AWQ
    --trust-remote-code
    --host 0.0.0.0
    --port 8000
    --max-model-len 10000
    --tensor-parallel-size 8
    --gpu_memory_utilization 0.99
    --swap-space 32
    --kv-cache-dtype fp8
    --enforce-eager
    --dtype float16
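
Assuming this service is saved in a docker-compose.yml next to the hf_home cache directory (filename assumed), bringing it up and following the logs looks roughly like the commands below; the FLASHINFER attention backend mentioned above would just be one more line under environment (- VLLM_ATTENTION_BACKEND=FLASHINFER):

# start the service defined above and follow its logs while the model loads
docker compose up -d vllm
docker compose logs -f vllm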

vllm serve cognitivecomputations/DeepSeek-V3-AWQ \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8080 \
    --max-model-len 10000 \
    --tensor-parallel-size 8 \
    --gpu_memory_utilization 0.99 \
    --swap-space 32 \
    --kv-cache-dtype fp8 \
    --enforce-eager \
    --dtype float16

This works and gives 5.2 T/s.

vllm serve cognitivecomputations/DeepSeek-R1-AWQ \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8080 \
    --max-model-len 10000 \
    --tensor-parallel-size 8 \
    --gpu_memory_utilization 0.99 \
    --swap-space 32 \
    --kv-cache-dtype fp8 \
    --enforce-eager \
    --dtype float16

This works as well, also at 5.2 T/s on 8x A100.

Is that for a batch size of 1?

Combined over 10 batches I had 12.6 T/s average throughput.
vLLM's average throughput log is for all concurrent requests combined.
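
To make the combined-throughput point concrete, here is one hypothetical way to fire 10 concurrent requests at the OpenAI-compatible endpoint (host, port, prompt, and token budget are assumptions); the average throughput vLLM logs during this window is the sum over all 10 streams:

# send 10 completion requests in parallel against the local server
seq 10 | xargs -P 10 -I{} curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "cognitivecomputations/DeepSeek-R1-AWQ", "prompt": "Hello", "max_tokens": 256}' \
  -o /dev/null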

If it is for a batch size of 1, any ideas what could cause this huge performance difference on the same hardware?

Cognitive Computations org

Use the new kernel for a lot of performance boost and CUDA graph support.
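
If this refers to the moe_wna16 kernel mentioned earlier in the thread, selecting it would presumably look something like the sketch below once your vLLM build includes it; the --quantization moe_wna16 value is an assumption based on the kernel's name, not a flag confirmed in this thread:

# sketch only: assumes a vLLM build that ships the moe_wna16 MoE kernel
vllm serve cognitivecomputations/DeepSeek-R1-AWQ \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --quantization moe_wna16 \
    --dtype float16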

Use the new kernel for a lot of performance boost and CUDA graph support.

Does "new kernel" mean a vLLM kernel or a Tensor Core kernel?

How is the model quality?
I used
vllm serve cognitivecomputations/DeepSeek-R1-AWQ \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8080 \
    --max-model-len 10000 \
    --tensor-parallel-size 8 \
    --gpu_memory_utilization 0.99 \
    --swap-space 32 \
    --kv-cache-dtype fp8 \
    --enforce-eager \
    --dtype float16

The model outputs random things.

Cognitive Computations org

@yh-yao Try updating to the latest vLLM, then run the model with the command I provided in the model card. Also, please post the specs of your setup, such as your GPUs.
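
For reference, updating to the latest released vLLM is a plain pip upgrade (minimal sketch; pin a specific version instead if you need reproducibility):

pip install --upgrade vllm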


I removed --kv-cache-dtype fp8, and it is working now.

The release version of vLLM (0.7.2) still has a bug when I run with
--kv-cache-dtype fp8_e5m2 --calculate-kv-scales.

I will try building vLLM from source and then running it again.

A100 80GB x8
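
Building vLLM from source, as planned above, is roughly the standard editable install; this is a sketch, and the compilation step can take a long time:

# clone the repository and do an editable install that compiles the kernels
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .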

Hi, how did you build the environment? vLLM 0.7.1 and 0.7.2 both have this problem.
Ubuntu 22.04


conda create -n vllm_deepseek-r1 python=3.12
pip install vllm



root@compute02:~# nvidia-smi 
Sat Feb  8 18:55:56 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:4B:00.0 Off |                    0 |
| N/A   44C    P0              53W / 300W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe          Off | 00000000:4C:00.0 Off |                    0 |
| N/A   45C    P0              56W / 300W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80GB PCIe          Off | 00000000:4E:00.0 Off |                    0 |
| N/A   39C    P0              52W / 300W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80GB PCIe          Off | 00000000:4F:00.0 Off |                    0 |
| N/A   44C    P0              59W / 300W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100 80GB PCIe          Off | 00000000:CB:00.0 Off |                    0 |
| N/A   45C    P0              57W / 300W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100 80GB PCIe          Off | 00000000:CC:00.0 Off |                    0 |
| N/A   45C    P0              58W / 300W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100 80GB PCIe          Off | 00000000:CD:00.0 Off |                    0 |
| N/A   40C    P0              56W / 300W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100 80GB PCIe          Off | 00000000:CE:00.0 Off |                    0 |
| N/A   43C    P0              58W / 300W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+



root@compute02:~# /share/menkeyi/.conda/envs/vllm_deepseek-r1/bin/python -m vllm.entrypoints.openai.api_server --trust-remote-code --tensor-parallel-size 8  --max-model-len 32768 --gpu-memory-utilization 0.97 --dtype float16 --enforce-eager  --model  ./DeepSeek-R1-AWQ/
INFO 02-08 18:45:23 __init__.py:183] Automatically detected platform cuda.
INFO 02-08 18:45:24 api_server.py:838] vLLM API server version 0.7.1
INFO 02-08 18:45:24 api_server.py:839] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='./DeepSeek-R1-AWQ/', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='float16', kv_cache_dtype='auto', max_model_len=32768, guided_decoding_backend='xgrammar', logits_processor_pattern=None, distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=8, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.97, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=True, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)
INFO 02-08 18:45:24 api_server.py:204] Started engine process with PID 2570820
INFO 02-08 18:45:24 config.py:135] Replacing legacy 'type' key with 'rope_type'
WARNING 02-08 18:45:24 config.py:2368] Casting torch.bfloat16 to torch.float16.
INFO 02-08 18:45:30 __init__.py:183] Automatically detected platform cuda.
INFO 02-08 18:45:32 config.py:135] Replacing legacy 'type' key with 'rope_type'
WARNING 02-08 18:45:32 config.py:2368] Casting torch.bfloat16 to torch.float16.
INFO 02-08 18:45:33 config.py:526] This model supports multiple tasks: {'classify', 'generate', 'score', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 02-08 18:45:34 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 02-08 18:45:34 config.py:1383] Defaulting to use mp for distributed inference
WARNING 02-08 18:45:34 cuda.py:100] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 02-08 18:45:34 config.py:662] Async output processing is not supported on the current platform type cuda.
WARNING 02-08 18:45:34 config.py:975] MLA is not supported with awq_marlin quantization. Disabling MLA.
INFO 02-08 18:45:40 config.py:526] This model supports multiple tasks: {'reward', 'generate', 'classify', 'embed', 'score'}. Defaulting to 'generate'.
INFO 02-08 18:45:41 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 02-08 18:45:41 config.py:1383] Defaulting to use mp for distributed inference
WARNING 02-08 18:45:41 cuda.py:100] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 02-08 18:45:41 config.py:662] Async output processing is not supported on the current platform type cuda.
WARNING 02-08 18:45:41 config.py:975] MLA is not supported with awq_marlin quantization. Disabling MLA.
INFO 02-08 18:45:41 llm_engine.py:232] Initializing a V0 LLM engine (v0.7.1) with config: model='./DeepSeek-R1-AWQ/', speculative_config=None, tokenizer='./DeepSeek-R1-AWQ/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=./DeepSeek-R1-AWQ/, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, u
(VllmWorkerProcess pid=2571092) INFO 02-08 18:46:33 model_runner.py:1116] Loading model weights took 43.0922 GB
(VllmWorkerProcess pid=2571096) INFO 02-08 18:46:33 model_runner.py:1116] Loading model weights took 43.0922 GB
(VllmWorkerProcess pid=2571095) INFO 02-08 18:46:36 model_runner.py:1116] Loading model weights took 43.0922 GB
INFO 02-08 18:46:36 model_runner.py:1116] Loading model weights took 43.0922 GB
(VllmWorkerProcess pid=2571093) INFO 02-08 18:46:38 model_runner.py:1116] Loading model weights took 43.0922 GB
(VllmWorkerProcess pid=2571094) INFO 02-08 18:46:39 model_runner.py:1116] Loading model weights took 43.0922 GB
(VllmWorkerProcess pid=2571098) INFO 02-08 18:46:43 model_runner.py:1116] Loading model weights took 43.0922 GB
(VllmWorkerProcess pid=2571097) INFO 02-08 18:46:43 model_runner.py:1116] Loading model weights took 43.0922 GB
(VllmWorkerProcess pid=2571097) WARNING 02-08 18:46:49 fused_moe.py:647] Using default MoE config. Performance might be sub-optimal! Config file not found at /share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=14336,device_name=NVIDIA_A100_80GB_PCIe.json
(VllmWorkerProcess pid=2571094) WARNING 02-08 18:46:49 fused_moe.py:647] Using default MoE config. Performance might be sub-optimal! Config file not found at /share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=14336,device_name=NVIDIA_A100_80GB_PCIe.json
(VllmWorkerProcess pid=2571096) WARNING 02-08 18:46:49 fused_moe.py:647] Using default MoE config. Performance might be sub-optimal! Config file not found at /share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=14336,device_name=NVIDIA_A100_80GB_PCIe.json
(VllmWorkerProcess pid=2571095) WARNING 02-08 18:46:49 fused_moe.py:647] Using default MoE config. Performance might be sub-optimal! Config file not found at /share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=14336,device_name=NVIDIA_A100_80GB_PCIe.json
(VllmWorkerProcess pid=2571098) WARNING 02-08 18:46:49 fused_moe.py:647] Using default MoE config. Performance might be sub-optimal! Config file not found at /share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=14336,device_name=NVIDIA_A100_80GB_PCIe.json
(VllmWorkerProcess pid=2571092) WARNING 02-08 18:46:49 fused_moe.py:647] Using default MoE config. Performance might be sub-optimal! Config file not found at /share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=14336,device_name=NVIDIA_A100_80GB_PCIe.json
WARNING 02-08 18:46:49 fused_moe.py:647] Using default MoE config. Performance might be sub-optimal! Config file not found at /share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=14336,device_name=NVIDIA_A100_80GB_PCIe.json
(VllmWorkerProcess pid=2571093) WARNING 02-08 18:46:49 fused_moe.py:647] Using default MoE config. Performance might be sub-optimal! Config file not found at /share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=14336,device_name=NVIDIA_A100_80GB_PCIe.json
(VllmWorkerProcess pid=2571092) INFO 02-08 18:47:22 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl...
(VllmWorkerProcess pid=2571098) INFO 02-08 18:47:22 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl...
(VllmWorkerProcess pid=2571096) INFO 02-08 18:47:22 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl...
(VllmWorkerProcess pid=2571095) INFO 02-08 18:47:22 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl...
(VllmWorkerProcess pid=2571093) INFO 02-08 18:47:22 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl...
(VllmWorkerProcess pid=2571097) INFO 02-08 18:47:22 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl...
INFO 02-08 18:47:22 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl...
(VllmWorkerProcess pid=2571092) INFO 02-08 18:47:22 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl.
(VllmWorkerProcess pid=2571096) INFO 02-08 18:47:22 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl.
(VllmWorkerProcess pid=2571093) INFO 02-08 18:47:22 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl.
(VllmWorkerProcess pid=2571095) INFO 02-08 18:47:22 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl.
(VllmWorkerProcess pid=2571098) INFO 02-08 18:47:22 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl.
(VllmWorkerProcess pid=2571094) INFO 02-08 18:47:22 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20250208-184722.pkl...
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] Traceback (most recent call last):
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]     return func(*args, **kwargs)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1767, in execute_model
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]     logits = self.model.compute_logits(hidden_or_intermediate_states,
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/models/deepseek_v3.py", line 692, in compute_logits
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]     logits = self.logits_processor(self.lm_head, hidden_states,
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/layers/logits_processor.py", line 68, in forward
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]     logits = self._get_logits(hidden_states, lm_head, embedding_bias)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/model_executor/layers/logits_processor.py", line 104, in _get_logits
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]     logits = tensor_model_parallel_gather(logits)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/distributed/communication_op.py", line 24, in tensor_model_parallel_gather
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]     return get_tp_group().gather(input_, dst, dim)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/distributed/parallel_state.py", line 441, in gather
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]     torch.distributed.gather(input_,
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]     return func(*args, **kwargs)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 3620, in gather
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]     work = group.gather(output_tensors, input_tensors, opts)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] RuntimeError: Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before doing inplace update.See https://github.com/pytorch/rfcs/pull/17 for more details.
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] 
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] The above exception was the direct cause of the following exception:
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] 
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] Traceback (most recent call last):
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/executor/multiproc_worker_utils.py", line 234, in _run_worker_process
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/utils.py", line 2208, in run_method
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]     return func(*args, **kwargs)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]     return func(*args, **kwargs)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/worker/worker.py", line 228, in determine_num_available_blocks
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]     self.model_runner.profile_run()
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]     return func(*args, **kwargs)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1236, in profile_run
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]     self._dummy_run(max_num_batched_tokens, max_num_seqs)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1347, in _dummy_run
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]     self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]     return func(*args, **kwargs)
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]     raise type(err)(
(VllmWorkerProcess pid=2571096) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20250208-184722.pkl): Inplace update to inference tensor outside InferenceMode is not allowed.You can make a clone to get a normal tensor before doing inplace update.See https://github.com/pytorch/rfcs/pull/17 for more details.
(VllmWorkerProcess pid=2571098) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=2571098) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240] Traceback (most recent call last):
(VllmWorkerProcess pid=2571098) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
(VllmWorkerProcess pid=2571098) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]     return func(*args, **kwargs)
(VllmWorkerProcess pid=2571098) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2571098) ERROR 02-08 18:47:22 multiproc_worker_utils.py:240]   File "/share/menkeyi/.conda/envs/vllm_deepseek-r1/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1767, in execute_model

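For what it's worth, the RuntimeError in the traceback above is PyTorch's inference-mode guard: a tensor created under torch.inference_mode() is later updated in place outside of it. A minimal standalone sketch of that restriction, and of the clone workaround the error message points to (not the vLLM code path itself), would look like this:

```python
import torch

# Tensors created inside inference_mode() are marked as inference tensors.
with torch.inference_mode():
    logits = torch.zeros(4)

try:
    # Mutating an inference tensor outside inference mode raises the same
    # "Inplace update to inference tensor outside InferenceMode" RuntimeError.
    logits.add_(1.0)
except RuntimeError as err:
    print(err)

# Workaround suggested by the error message: clone to get a normal tensor first.
logits = logits.clone()
logits.add_(1.0)
print(logits)
```
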
How can I achieve the 25 tps described in the model card on 8xA100? I found that if I don't add --enforce-eager, a CUDA OOM error occurs. After adding it, the speed is approximately 9 tps. If I remove --quantization moe_wna16, the speed drops to around 5 tps.

My startup command is as follows:
python -m vllm.entrypoints.openai.api_server \
    --served-model-name deepseek-r1 \
    --model /data/liyongzhi/hf_models/DeepSeek-R1-AWQ \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8080 \
    --max-model-len 10000 \
    --tensor-parallel-size 8 \
    --gpu_memory_utilization 0.999 \
    --swap-space 32 \
    --kv-cache-dtype fp8_e5m2 \
    --calculate-kv-scales \
    --enforce-eager \
    --quantization moe_wna16 \
    --dtype float16
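For completeness, once the server from the command above is up, it can be exercised through the OpenAI-compatible endpoint roughly like this; the model name and port just mirror the flags above, and the prompt is only an example:

```bash
# Minimal smoke test against the OpenAI-compatible server started above.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-r1",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 128
      }'
```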

Hello, what is your vLLM environment? Mine is not running properly.

I just installed vLLM from source with the commands below:

git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install --editable .
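For reference, you can confirm which build you actually ended up with after the editable install:

```bash
# Print the vLLM version importable in the current environment.
python -c "import vllm; print(vllm.__version__)"
```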

I can achieve more than 45 tps on 8xA100 when sending 5 requests at the same time; I'm not sure whether the 27 tps in the model card was measured with a single request or with batch inference.
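In case it helps to compare numbers, here is a rough sketch of how such a multi-request throughput figure can be measured against the OpenAI-compatible endpoint; the URL, model name, prompt, and token budget are placeholders for whatever your server actually uses:

```python
# Rough throughput probe: fire N identical chat requests concurrently and
# report aggregate completion tokens per second. Endpoint and model are assumptions.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8080/v1/chat/completions"  # match your --host/--port
MODEL = "deepseek-r1"                               # match your --served-model-name
N_REQUESTS = 5

def one_request(_):
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": "Explain CUDA graphs in one paragraph."}],
        "max_tokens": 512,
    }
    resp = requests.post(URL, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

start = time.time()
with ThreadPoolExecutor(max_workers=N_REQUESTS) as pool:
    completion_tokens = list(pool.map(one_request, range(N_REQUESTS)))
elapsed = time.time() - start
print(f"{sum(completion_tokens)} tokens in {elapsed:.1f}s "
      f"-> {sum(completion_tokens) / elapsed:.1f} tok/s aggregate")
```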

Cognitive Computations org

@menkeyi I see that you are not using the latest vLLM (v0.7.3); please update it and try again.

@muziyongshixin Remove --enforce-eager and reduce --gpu_memory_utilization to 0.97. The main performance boost comes from CUDA graph capture, which takes about 2 GB of VRAM; you set the memory utilization too high, so it runs out of memory during graph capture.
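Applied to the command posted above, that advice would look roughly like the following; everything else is carried over unchanged from the original invocation:

```bash
# Sketch of the adjusted launch per the advice above: no --enforce-eager
# (so CUDA graphs get captured) and some VRAM headroom left for the capture.
python -m vllm.entrypoints.openai.api_server \
    --served-model-name deepseek-r1 \
    --model /data/liyongzhi/hf_models/DeepSeek-R1-AWQ \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8080 \
    --max-model-len 10000 \
    --tensor-parallel-size 8 \
    --gpu_memory_utilization 0.97 \
    --swap-space 32 \
    --kv-cache-dtype fp8_e5m2 \
    --calculate-kv-scales \
    --quantization moe_wna16 \
    --dtype float16
```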

Is there a v0.7.3 release of vLLM? The following command fails:
pip install --no-cache-dir vllm==0.7.3

Cognitive Computations org

@baohao Sorry, I meant 0.7.2; 0.7.3 is the dev version, which you need to build from source. v0.7.2 should work.
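For example, the tagged release installs straight from PyPI; only the version pin differs from the failing command above:

```bash
# Install the latest tagged release instead of the unreleased dev version.
pip install --no-cache-dir vllm==0.7.2
```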

Cognitive Computations org

Thank you @v2ray, your knowledge and contributions are much appreciated!

Hi, are there any detailed command instructions for vLLM's --quantization moe_wna16 on 8xA100?

What command should I run?

Cognitive Computations org

@traphix Use the command which I put in the README.md.
