Run 1:
Auto-configed device: cuda
WARNING:sglang.srt.server_args:Detected SM100 and MXFP4 quantization format for GPT-OSS model, enabling FlashInfer MXFP4 MOE kernel.
WARNING:sglang.srt.server_args:TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64.
[2025-09-06 08:26:09] server_args=ServerArgs(model_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8400, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.93, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=True, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=950273309, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/home/yiliu7/models/openai/gpt-oss-120b', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='flashinfer_mxfp4', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, 
hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=200, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False)
All deep_gemm operations loaded successfully!
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:26:09] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:26:09] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:26:10] Using default HuggingFace chat template with detected content format: string
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:26:17 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:26:17 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:26:17 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:26:17 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:26:17 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:26:17 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:26:17 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:26:17 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:26:17 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:26:17 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:26:17 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:26:17 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:26:17 TP0] Init torch distributed begin.
[2025-09-06 08:26:17 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:26:17 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:26:17 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:26:17 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:26:19 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:26:21 TP0] Init torch distributed ends. mem usage=1.46 GB
[2025-09-06 08:26:22 TP0] Load weight begin. avail mem=176.28 GB
All deep_gemm operations loaded successfully!
Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 1690.57it/s]
All deep_gemm operations loaded successfully!
[2025-09-06 08:26:35 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.0.mlp.experts), it might take a while...
All deep_gemm operations loaded successfully!
All deep_gemm operations loaded successfully!
[2025-09-06 08:26:38 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.1.mlp.experts), it might take a while...
[2025-09-06 08:26:41 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.2.mlp.experts), it might take a while...
[2025-09-06 08:26:44 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.3.mlp.experts), it might take a while...
[2025-09-06 08:26:47 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.4.mlp.experts), it might take a while...
[2025-09-06 08:26:50 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.5.mlp.experts), it might take a while...
[2025-09-06 08:26:53 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.6.mlp.experts), it might take a while...
[2025-09-06 08:26:56 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.7.mlp.experts), it might take a while...
[2025-09-06 08:26:59 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.8.mlp.experts), it might take a while...
[2025-09-06 08:27:03 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.9.mlp.experts), it might take a while...
[2025-09-06 08:27:06 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.10.mlp.experts), it might take a while...
[2025-09-06 08:27:09 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.11.mlp.experts), it might take a while...
[2025-09-06 08:27:12 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.12.mlp.experts), it might take a while...
[2025-09-06 08:27:15 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.13.mlp.experts), it might take a while...
[2025-09-06 08:27:18 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.14.mlp.experts), it might take a while...
[2025-09-06 08:27:21 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.15.mlp.experts), it might take a while...
[2025-09-06 08:27:24 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.16.mlp.experts), it might take a while...
[2025-09-06 08:27:27 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.17.mlp.experts), it might take a while...
[2025-09-06 08:27:30 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.18.mlp.experts), it might take a while...
[2025-09-06 08:27:33 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.19.mlp.experts), it might take a while...
[2025-09-06 08:27:36 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.20.mlp.experts), it might take a while...
[2025-09-06 08:27:39 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.21.mlp.experts), it might take a while...
[2025-09-06 08:27:42 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.22.mlp.experts), it might take a while...
[2025-09-06 08:27:46 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.23.mlp.experts), it might take a while...
[2025-09-06 08:27:49 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.24.mlp.experts), it might take a while...
[2025-09-06 08:27:52 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.25.mlp.experts), it might take a while...
[2025-09-06 08:27:55 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.26.mlp.experts), it might take a while...
[2025-09-06 08:27:58 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.27.mlp.experts), it might take a while...
[2025-09-06 08:28:01 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.28.mlp.experts), it might take a while...
[2025-09-06 08:28:04 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.29.mlp.experts), it might take a while...
[2025-09-06 08:28:07 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.30.mlp.experts), it might take a while...
[2025-09-06 08:28:10 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.31.mlp.experts), it might take a while...
[2025-09-06 08:28:13 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.32.mlp.experts), it might take a while...
[2025-09-06 08:28:16 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.33.mlp.experts), it might take a while...
[2025-09-06 08:28:19 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.34.mlp.experts), it might take a while...
[2025-09-06 08:28:22 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.35.mlp.experts), it might take a while...
[2025-09-06 08:28:26 TP0] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=158.06 GB, mem usage=18.22 GB.
[2025-09-06 08:28:26 TP3] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:28:26 TP0] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:28:26 TP2] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:28:26 TP0] Memory pool end. avail mem=10.23 GB
[2025-09-06 08:28:26 TP1] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:28:26 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=9.54 GB
[2025-09-06 08:28:26 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200]
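
Note: the batch-size list above looks like cuda_graph_max_bs=200 expanded as powers of two up to 8, then multiples of 8. A minimal sketch of that expansion, inferred from the printed list rather than from SGLang's source:

cuda_graph_max_bs = 200
capture_bs = [1, 2, 4, 8] + list(range(16, cuda_graph_max_bs + 1, 8))
assert len(capture_bs) == 28 and capture_bs[-1] == 200  # matches the 0/28 progress bar below
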
Capturing batches (bs=200 avail_mem=9.39 GB):   0%|          | 0/28 [00:00<?, ?it/s]
rank 1 allocated ipc_handles: [['0x777854000000', '0x779dea000000', '0x7777f0000000', '0x7777ec000000'], ['0x7777ef000000', '0x7777eee00000', '0x7777ef200000', '0x7777ef400000'], ['0x7777d8000000', '0x7777e2000000', '0x7777ce000000', '0x7777c4000000']]
[2025-09-06 08:28:28.689] [info] lamportInitialize start: buffer: 0x7777e2000000, size: 71303168
rank 0 allocated ipc_handles: [['0x76738e000000', '0x764df2000000', '0x764d9c000000', '0x764d98000000'], ['0x764d9ae00000', '0x764d9b000000', '0x764d9b200000', '0x764d9b400000'], ['0x764d8e000000', '0x764d84000000', '0x764d7a000000', '0x764d70000000']]
[2025-09-06 08:28:28.738] [info] lamportInitialize start: buffer: 0x764d8e000000, size: 71303168
rank 3 allocated ipc_handles: [['0x7b1fe4000000', '0x7b1f88000000', '0x7b1f84000000', '0x7b4580000000'], ['0x7b1f87000000', '0x7b1f87200000', '0x7b1f87400000', '0x7b1f86e00000'], ['0x7b1f70000000', '0x7b1f66000000', '0x7b1f5c000000', '0x7b1f7a000000']]
[2025-09-06 08:28:28.787] [info] lamportInitialize start: buffer: 0x7b1f7a000000, size: 71303168
rank 2 allocated ipc_handles: [['0x7fae78000000', '0x7fae14000000', '0x7fd40e000000', '0x7fae10000000'], ['0x7fae13000000', '0x7fae13200000', '0x7fae12e00000', '0x7fae13400000'], ['0x7fadfc000000', '0x7fadf2000000', '0x7fae06000000', '0x7fade8000000']]
[2025-09-06 08:28:28.838] [info] lamportInitialize start: buffer: 0x7fae06000000, size: 71303168
[2025-09-06 08:28:28 TP0] FlashInfer workspace initialized for rank 0, world_size 4
[2025-09-06 08:28:28 TP1] FlashInfer workspace initialized for rank 1, world_size 4
[2025-09-06 08:28:28 TP2] FlashInfer workspace initialized for rank 2, world_size 4
[2025-09-06 08:28:28 TP3] FlashInfer workspace initialized for rank 3, world_size 4
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 0 workspace[0] 0x76738e000000
Rank 0 workspace[1] 0x764df2000000
Rank 0 workspace[2] 0x764d9c000000
Rank 0 workspace[3] 0x764d98000000
Rank 0 workspace[4] 0x764d9ae00000
Rank 0 workspace[5] 0x764d9b000000
Rank 0 workspace[6] 0x764d9b200000
Rank 0 workspace[7] 0x764d9b400000
Rank 0 workspace[8] 0x764d8e000000
Rank 0 workspace[9] 0x764d84000000
Rank 0 workspace[10] 0x764d7a000000
Rank 0 workspace[11] 0x764d70000000
Rank 0 workspace[12] 0x767987264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 2 workspace[0] 0x7fae78000000
Rank 2 workspace[1] 0x7fae14000000
Rank 2 workspace[2] 0x7fd40e000000
Rank 2 workspace[3] 0x7fae10000000
Rank 2 workspace[4] 0x7fae13000000
Rank 2 workspace[5] 0x7fae13200000
Rank 2 workspace[6] 0x7fae12e00000
Rank 2 workspace[7] 0x7fae13400000
Rank 2 workspace[8] 0x7fadfc000000
Rank 2 workspace[9] 0x7fadf2000000
Rank 2 workspace[10] 0x7fae06000000
Rank 2 workspace[11] 0x7fade8000000
Rank 2 workspace[12] 0x7fda1b264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 3 workspace[0] 0x7b1fe4000000
Rank 3 workspace[1] 0x7b1f88000000
Rank 3 workspace[2] 0x7b1f84000000
Rank 3 workspace[3] 0x7b4580000000
Rank 3 workspace[4] 0x7b1f87000000
Rank 3 workspace[5] 0x7b1f87200000
Rank 3 workspace[6] 0x7b1f87400000
Rank 3 workspace[7] 0x7b1f86e00000
Rank 3 workspace[8] 0x7b1f70000000
Rank 3 workspace[9] 0x7b1f66000000
Rank 3 workspace[10] 0x7b1f5c000000
Rank 3 workspace[11] 0x7b1f7a000000
Rank 3 workspace[12] 0x7b4b7b264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 1 workspace[0] 0x777854000000
Rank 1 workspace[1] 0x779dea000000
Rank 1 workspace[2] 0x7777f0000000
Rank 1 workspace[3] 0x7777ec000000
Rank 1 workspace[4] 0x7777ef000000
Rank 1 workspace[5] 0x7777eee00000
Rank 1 workspace[6] 0x7777ef200000
Rank 1 workspace[7] 0x7777ef400000
Rank 1 workspace[8] 0x7777d8000000
Rank 1 workspace[9] 0x7777e2000000
Rank 1 workspace[10] 0x7777ce000000
Rank 1 workspace[11] 0x7777c4000000
Rank 1 workspace[12] 0x77a3f3264400
Capturing batches (bs=1 avail_mem=7.82 GB): 100%|██████████| 28/28 [00:04<00:00, 6.00it/s]
[2025-09-06 08:28:31 TP0] Registering 56 cuda graph addresses
[2025-09-06 08:28:31 TP2] Registering 56 cuda graph addresses
[2025-09-06 08:28:31 TP3] Registering 56 cuda graph addresses
[2025-09-06 08:28:31 TP1] Registering 56 cuda graph addresses
[2025-09-06 08:28:31 TP0] Capture cuda graph end. Time elapsed: 5.17 s. mem usage=1.73 GB. avail mem=7.81 GB.
[2025-09-06 08:28:32 TP0] max_total_num_tokens=8487040, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=131072, available_gpu_mem=7.81 GB
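
Note: a back-of-the-envelope check on the KV-cache figures above, assuming GPT-OSS-120B has 36 layers and 8 KV heads of head_dim 64, bf16 KV, sharded over tp=4 (these model constants are assumptions, not read from the log):

layers, kv_heads, head_dim, dtype_bytes, tp = 36, 8, 64, 2, 4   # assumed model config
k_bytes_per_token = layers * (kv_heads // tp) * head_dim * dtype_bytes  # 9216 B per rank
print(8487040 * k_bytes_per_token / 2**30)  # ~72.84 GiB, matching "K size: 72.85 GB" above
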
[2025-09-06 08:28:33] INFO: Started server process [34489]
[2025-09-06 08:28:33] INFO: Waiting for application startup.
[2025-09-06 08:28:33] INFO: Application startup complete.
[2025-09-06 08:28:33] INFO: Uvicorn running on http://127.0.0.1:8400 (Press CTRL+C to quit)
[2025-09-06 08:28:34] INFO: 127.0.0.1:46012 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-09-06 08:28:34] INFO: 127.0.0.1:46014 - "GET /health_generate HTTP/1.1" 503 Service Unavailable
[2025-09-06 08:28:34 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:28:35] INFO: 127.0.0.1:46026 - "POST /generate HTTP/1.1" 200 OK
[2025-09-06 08:28:35] The server is fired up and ready to roll!
[2025-09-06 08:28:44 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:28:45] INFO: 127.0.0.1:56076 - "GET /health_generate HTTP/1.1" 200 OK
command=python3 -m sglang.launch_server --model-path /home/yiliu7/models/openai/gpt-oss-120b --tp 4 --cuda-graph-max-bs 200 --mem-fraction-static 0.93 --device cuda --host 127.0.0.1 --port 8400
Evaluation start: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6
ChatCompletionSampler initialized with self.system_message=None self.temperature=0.1 self.max_tokens=4096 self.reasoning_effort='low'
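
Note: a minimal client sketch mirroring the sampler settings above against the server's OpenAI-compatible endpoint; passing reasoning_effort via extra_body is an assumption about the request shape, not something shown in the log:

import openai

client = openai.OpenAI(base_url="http://127.0.0.1:8400/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="/home/yiliu7/models/openai/gpt-oss-120b",  # served_model_name from server_args
    messages=[{"role": "user", "content": "Say hi"}],
    temperature=0.1,
    max_tokens=4096,
    extra_body={"reasoning_effort": "low"},  # assumed field name, per the sampler config
)
print(resp.choices[0].message.content)
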
0%|          | 0/198 [00:00<?, ?it/s]
[2025-09-06 08:28:46 TP0] Prefill batch. #new-seq: 1, #new-token: 256, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:28:46 TP0] Prefill batch. #new-seq: 9, #new-token: 2304, #cached-token: 576, token usage: 0.00, #running-req: 1, #queue-req: 0,
[2025-09-06 08:28:46 TP0] Prefill batch. #new-seq: 3, #new-token: 768, #cached-token: 192, token usage: 0.00, #running-req: 10, #queue-req: 0,
[2025-09-06 08:28:46 TP0] Prefill batch. #new-seq: 56, #new-token: 16064, #cached-token: 3584, token usage: 0.00, #running-req: 13, #queue-req: 39,
[2025-09-06 08:28:46 TP0] Prefill batch. #new-seq: 54, #new-token: 16320, #cached-token: 3456, token usage: 0.00, #running-req: 69, #queue-req: 6,
[2025-09-06 08:28:46 TP0] Prefill batch. #new-seq: 56, #new-token: 16320, #cached-token: 3648, token usage: 0.00, #running-req: 123, #queue-req: 4,
[2025-09-06 08:28:46 TP0] Prefill batch. #new-seq: 19, #new-token: 4672, #cached-token: 1216, token usage: 0.01, #running-req: 179, #queue-req: 0,
[2025-09-06 08:28:47 TP0] Decode batch. #running-req: 198, #token: 61888, token usage: 0.01, cuda graph: True, gen throughput (token/s): 347.01, #queue-req: 0,
[2025-09-06 08:28:47] INFO: 127.0.0.1:56212 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:47 TP0] Decode batch. #running-req: 197, #token: 68928, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17331.29, #queue-req: 0,
[2025-09-06 08:28:47] INFO: 127.0.0.1:56736 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:47] INFO: 127.0.0.1:56152 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:47] INFO: 127.0.0.1:56114 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:56536 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57098 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57084 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57222 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48 TP0] Decode batch. #running-req: 190, #token: 72064, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17223.27, #queue-req: 0,
[2025-09-06 08:28:48] INFO: 127.0.0.1:56472 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:56622 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57690 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57758 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57572 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:56564 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57002 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:56574 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:56594 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57218 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:56978 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:56338 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57338 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57158 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:56762 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48 TP0] Decode batch. #running-req: 175, #token: 72832, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16416.05, #queue-req: 0,
[2025-09-06 08:28:48] INFO: 127.0.0.1:57452 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57804 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57294 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:56488 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:56324 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:57374 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:57022 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49 TP0] Decode batch. #running-req: 168, #token: 76544, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15784.57, #queue-req: 0,
[2025-09-06 08:28:49] INFO: 127.0.0.1:56164 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:56438 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:57172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:56116 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:57332 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:56950 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:57792 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49 TP0] Decode batch. #running-req: 161, #token: 79744, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14688.78, #queue-req: 0,
[2025-09-06 08:28:49] INFO: 127.0.0.1:56584 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:57510 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:56658 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:57862 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:56220 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:57704 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:57424 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:56344 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:56844 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49 TP0] Decode batch. #running-req: 153, #token: 81536, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13916.02, #queue-req: 0,
[2025-09-06 08:28:49] INFO: 127.0.0.1:57474 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50] INFO: 127.0.0.1:56940 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50] INFO: 127.0.0.1:57602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50] INFO: 127.0.0.1:57816 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50] INFO: 127.0.0.1:57870 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50] INFO: 127.0.0.1:56822 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50] INFO: 127.0.0.1:57614 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50] INFO: 127.0.0.1:57496 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50] INFO: 127.0.0.1:57388 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50 TP0] Decode batch. #running-req: 143, #token: 82432, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13423.86, #queue-req: 0,
[2025-09-06 08:28:50] INFO: 127.0.0.1:56836 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50] INFO: 127.0.0.1:57050 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50] INFO: 127.0.0.1:56208 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50] INFO: 127.0.0.1:57780 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50] INFO: 127.0.0.1:56380 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50] INFO: 127.0.0.1:57340 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50] INFO: 127.0.0.1:57208 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50] INFO: 127.0.0.1:57412 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50] INFO: 127.0.0.1:56794 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50 TP0] Decode batch. #running-req: 134, #token: 83136, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12796.65, #queue-req: 0,
[2025-09-06 08:28:50] INFO: 127.0.0.1:57112 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50] INFO: 127.0.0.1:57202 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50] INFO: 127.0.0.1:56870 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50] INFO: 127.0.0.1:56974 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:50] INFO: 127.0.0.1:56138 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:56724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:56378 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:56968 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:57880 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:56086 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:56348 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:57624 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:57640 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:56248 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:57394 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51 TP0] Decode batch. #running-req: 119, #token: 78976, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13816.37, #queue-req: 0,
[2025-09-06 08:28:51] INFO: 127.0.0.1:57250 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:57670 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:56452 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:57464 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:56880 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:57420 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51 TP0] Decode batch. #running-req: 113, #token: 78784, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15745.19, #queue-req: 0,
[2025-09-06 08:28:51] INFO: 127.0.0.1:56614 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:56522 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:56422 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:56280 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51 TP0] Decode batch. #running-req: 109, #token: 81024, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15465.52, #queue-req: 0,
[2025-09-06 08:28:51] INFO: 127.0.0.1:57000 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:56602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:56730 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:56264 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:51] INFO: 127.0.0.1:56608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:56204 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:57540 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:57744 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52 TP0] Decode batch. #running-req: 102, #token: 78848, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14793.92, #queue-req: 0,
[2025-09-06 08:28:52] INFO: 127.0.0.1:57674 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:56308 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:57542 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:57134 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:56776 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:57860 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52 TP0] Decode batch. #running-req: 97, #token: 78016, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13885.71, #queue-req: 0,
[2025-09-06 08:28:52] INFO: 127.0.0.1:56126 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:57660 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:57798 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:56682 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:57586 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:56318 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:57622 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:57280 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:57560 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52 TP0] Decode batch. #running-req: 86, #token: 73984, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13284.89, #queue-req: 0,
[2025-09-06 08:28:52] INFO: 127.0.0.1:56556 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:56314 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:56240 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:57186 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:56710 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:57328 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:56814 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:57718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52 TP0] Decode batch. #running-req: 78, #token: 70208, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12384.46, #queue-req: 0,
[2025-09-06 08:28:52] INFO: 127.0.0.1:56912 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:57522 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:56394 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:56632 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:57018 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:57470 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:56364 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:52] INFO: 127.0.0.1:56458 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:53] INFO: 127.0.0.1:56752 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:53] INFO: 127.0.0.1:56782 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:53] INFO: 127.0.0.1:57354 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:53] INFO: 127.0.0.1:56360 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:53 TP0] Decode batch. #running-req: 66, #token: 62592, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11198.65, #queue-req: 0,
[2025-09-06 08:28:53] INFO: 127.0.0.1:57040 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:53] INFO: 127.0.0.1:57170 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:53] INFO: 127.0.0.1:56296 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:53] INFO: 127.0.0.1:57766 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:53 TP0] Decode batch. #running-req: 62, #token: 61312, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10345.65, #queue-req: 0,
[2025-09-06 08:28:53] INFO: 127.0.0.1:57146 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:53] INFO: 127.0.0.1:57382 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:53] INFO: 127.0.0.1:57574 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:53] INFO: 127.0.0.1:57300 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:53] INFO: 127.0.0.1:56896 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:53] INFO: 127.0.0.1:57546 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:53 TP0] Decode batch. #running-req: 56, #token: 57344, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10066.95, #queue-req: 0,
[2025-09-06 08:28:53] INFO: 127.0.0.1:56504 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:53 TP0] Decode batch. #running-req: 55, #token: 58432, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9385.78, #queue-req: 0,
[2025-09-06 08:28:53] INFO: 127.0.0.1:57148 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:53] INFO: 127.0.0.1:56962 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:53] INFO: 127.0.0.1:56904 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:57698 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:56656 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:57836 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54 TP0] Decode batch. #running-req: 49, #token: 52608, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8793.59, #queue-req: 0,
[2025-09-06 08:28:54] INFO: 127.0.0.1:56334 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:57042 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:57312 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:57276 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:57822 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:56928 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:57528 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:57124 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:57028 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:57138 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54 TP0] Decode batch. #running-req: 39, #token: 44288, token usage: 0.01, cuda graph: True, gen throughput (token/s): 7588.36, #queue-req: 0,
[2025-09-06 08:28:54] INFO: 127.0.0.1:56346 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:56304 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:56552 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:57440 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:57370 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:57490 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:56098 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54 TP0] Decode batch. #running-req: 33, #token: 37568, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6607.97, #queue-req: 0,
[2025-09-06 08:28:54] INFO: 127.0.0.1:56230 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:56448 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:57848 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:56190 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:57872 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54 TP0] Decode batch. #running-req: 27, #token: 33088, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5465.59, #queue-req: 0,
[2025-09-06 08:28:54] INFO: 127.0.0.1:57142 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:57080 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54] INFO: 127.0.0.1:57234 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:54 TP0] Decode batch. #running-req: 24, #token: 30464, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4766.63, #queue-req: 0,
[2025-09-06 08:28:55] INFO: 127.0.0.1:57700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:55] INFO: 127.0.0.1:56994 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:55] INFO: 127.0.0.1:57064 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:55 TP0] Decode batch. #running-req: 21, #token: 27392, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4507.33, #queue-req: 0,
[2025-09-06 08:28:55] INFO: 127.0.0.1:57258 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:55] INFO: 127.0.0.1:56174 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:55] INFO: 127.0.0.1:56516 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:55 TP0] Decode batch. #running-req: 18, #token: 24256, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3896.98, #queue-req: 0,
[2025-09-06 08:28:55] INFO: 127.0.0.1:56860 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:55] INFO: 127.0.0.1:57788 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:55] INFO: 127.0.0.1:56800 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:55] INFO: 127.0.0.1:56118 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:55 TP0] Decode batch. #running-req: 14, #token: 19456, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3175.32, #queue-req: 0,
[2025-09-06 08:28:55] INFO: 127.0.0.1:56454 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:55] INFO: 127.0.0.1:56428 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:55] INFO: 127.0.0.1:57728 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:55 TP0] Decode batch. #running-req: 11, #token: 15808, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2722.85, #queue-req: 0,
[2025-09-06 08:28:55] INFO: 127.0.0.1:57644 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:55 TP0] Decode batch. #running-req: 10, #token: 14720, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2218.75, #queue-req: 0,
[2025-09-06 08:28:56 TP0] Decode batch. #running-req: 10, #token: 14848, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2126.31, #queue-req: 0,
[2025-09-06 08:28:56] INFO: 127.0.0.1:56696 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:56] INFO: 127.0.0.1:56354 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:56 TP0] Decode batch. #running-req: 8, #token: 12352, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2016.74, #queue-req: 0,
[2025-09-06 08:28:56] INFO: 127.0.0.1:57772 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:56 TP0] Decode batch. #running-req: 7, #token: 11008, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1879.52, #queue-req: 0,
[2025-09-06 08:28:56 TP0] Decode batch. #running-req: 7, #token: 11328, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1672.65, #queue-req: 0,
[2025-09-06 08:28:56] INFO: 127.0.0.1:57398 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:56] INFO: 127.0.0.1:57484 - "POST /v1/chat/completions HTTP/1.1" 200 OK
1%|          | 1/198 [00:10<35:11, 10.72s/it]
[2025-09-06 08:28:56 TP0] Decode batch. #running-req: 5, #token: 8320, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1373.21, #queue-req: 0,
[2025-09-06 08:28:57 TP0] Decode batch. #running-req: 5, #token: 8384, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1212.19, #queue-req: 0,
[2025-09-06 08:28:57 TP0] Decode batch. #running-req: 5, #token: 8704, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1206.92, #queue-req: 0,
[2025-09-06 08:28:57] INFO: 127.0.0.1:57268 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:57 TP0] Decode batch. #running-req: 5, #token: 7232, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1188.35, #queue-req: 0,
[2025-09-06 08:28:57 TP0] Decode batch. #running-req: 4, #token: 7232, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1062.48, #queue-req: 0,
[2025-09-06 08:28:57 TP0] Decode batch. #running-req: 4, #token: 7488, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1055.36, #queue-req: 0,
[2025-09-06 08:28:57 TP0] Decode batch. #running-req: 4, #token: 7616, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1066.69, #queue-req: 0,
[2025-09-06 08:28:57 TP0] Decode batch. #running-req: 4, #token: 7744, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1058.33, #queue-req: 0,
[2025-09-06 08:28:58] INFO: 127.0.0.1:57696 - "POST /v1/chat/completions HTTP/1.1" 200 OK
12%|█▏        | 24/198 [00:11<01:04, 2.70it/s]
[2025-09-06 08:28:58] INFO: 127.0.0.1:56406 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:58 TP0] Decode batch. #running-req: 2, #token: 4032, token usage: 0.00, cuda graph: True, gen throughput (token/s): 774.95, #queue-req: 0,
[2025-09-06 08:28:58 TP0] Decode batch. #running-req: 2, #token: 4032, token usage: 0.00, cuda graph: True, gen throughput (token/s): 580.43, #queue-req: 0,
[2025-09-06 08:28:58] INFO: 127.0.0.1:56640 - "POST /v1/chat/completions HTTP/1.1" 200 OK
56%|█████▌    | 110/198 [00:12<00:05, 15.97it/s]
[2025-09-06 08:28:58 TP0] Decode batch. #running-req: 1, #token: 2112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 449.97, #queue-req: 0,
[2025-09-06 08:28:58] INFO: 127.0.0.1:56666 - "POST /v1/chat/completions HTTP/1.1" 200 OK
100%|██████████| 198/198 [00:12<00:00, 16.00it/s]
/usr/lib/python3.12/subprocess.py:1127: ResourceWarning: subprocess 34489 is still running
_warn("subprocess %s is still running" % self.pid,
ResourceWarning: Enable tracemalloc to get the object allocation traceback
.
----------------------------------------------------------------------
Ran 1 test in 176.620s
OK
Writing report to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.html
{'chars': 1754.6666666666667, 'chars:std': 1020.4785765769535, 'score:std': 0.48631931786709987, 'score': 0.6161616161616161}
Writing results to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.json
Total latency: 12.432 s
Score: 0.616
Evaluation end: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 metrics={'chars': 1754.6666666666667, 'chars:std': 1020.4785765769535, 'score:std': 0.48631931786709987, 'score': 0.6161616161616161}
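
Note: the reported score:std is consistent with binary 0/1 per-question scores over 198 GPQA items, since 0.6161... = 122/198; a quick check under that assumption:

p = 122 / 198               # 0.6161616..., the reported score
std = (p * (1 - p)) ** 0.5  # population std of 0/1 scores with mean p
print(p, std)               # 0.6161616161616161 0.4863193..., matching score:std above
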
================================================================================
Run 2:
Auto-configed device: cuda
WARNING:sglang.srt.server_args:Detected SM100 and MXFP4 quantization format for GPT-OSS model, enabling FlashInfer MXFP4 MOE kernel.
WARNING:sglang.srt.server_args:TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64.
[2025-09-06 08:29:13] server_args=ServerArgs(model_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8400, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.93, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=True, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=613488540, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/home/yiliu7/models/openai/gpt-oss-120b', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='flashinfer_mxfp4', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, 
hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=200, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False)
All deep_gemm operations loaded successfully!
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:29:13] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:29:13] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:29:13] Using default HuggingFace chat template with detected content format: string
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:29:20 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:29:20 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:29:20 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:29:20 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:29:20 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:29:20 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:29:20 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:29:20 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:29:20 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:29:20 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:29:20 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:29:20 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:29:20 TP0] Init torch distributed begin.
[2025-09-06 08:29:21 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:29:21 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:29:21 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:29:21 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:29:22 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0   (x8)
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:29:25 TP0] Init torch distributed ends. mem usage=1.46 GB
[2025-09-06 08:29:25 TP0] Load weight begin. avail mem=176.28 GB
All deep_gemm operations loaded successfully!
Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 1585.51it/s]
All deep_gemm operations loaded successfully!
All deep_gemm operations loaded successfully!
All deep_gemm operations loaded successfully!
[2025-09-06 08:29:36 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.0.mlp.experts), it might take a while...
[2025-09-06 08:29:39 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.1.mlp.experts), it might take a while...
[2025-09-06 08:29:42 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.2.mlp.experts), it might take a while...
[2025-09-06 08:29:45 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.3.mlp.experts), it might take a while...
[2025-09-06 08:29:48 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.4.mlp.experts), it might take a while...
[2025-09-06 08:29:51 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.5.mlp.experts), it might take a while...
[2025-09-06 08:29:54 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.6.mlp.experts), it might take a while...
[2025-09-06 08:29:58 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.7.mlp.experts), it might take a while...
[2025-09-06 08:30:01 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.8.mlp.experts), it might take a while...
[2025-09-06 08:30:04 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.9.mlp.experts), it might take a while...
[2025-09-06 08:30:07 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.10.mlp.experts), it might take a while...
[2025-09-06 08:30:10 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.11.mlp.experts), it might take a while...
[2025-09-06 08:30:13 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.12.mlp.experts), it might take a while...
[2025-09-06 08:30:16 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.13.mlp.experts), it might take a while...
[2025-09-06 08:30:19 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.14.mlp.experts), it might take a while...
[2025-09-06 08:30:22 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.15.mlp.experts), it might take a while...
[2025-09-06 08:30:25 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.16.mlp.experts), it might take a while...
[2025-09-06 08:30:28 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.17.mlp.experts), it might take a while...
[2025-09-06 08:30:31 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.18.mlp.experts), it might take a while...
[2025-09-06 08:30:34 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.19.mlp.experts), it might take a while...
[2025-09-06 08:30:38 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.20.mlp.experts), it might take a while...
[2025-09-06 08:30:41 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.21.mlp.experts), it might take a while...
[2025-09-06 08:30:44 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.22.mlp.experts), it might take a while...
[2025-09-06 08:30:47 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.23.mlp.experts), it might take a while...
[2025-09-06 08:30:50 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.24.mlp.experts), it might take a while...
[2025-09-06 08:30:53 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.25.mlp.experts), it might take a while...
[2025-09-06 08:30:56 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.26.mlp.experts), it might take a while...
[2025-09-06 08:30:59 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.27.mlp.experts), it might take a while...
[2025-09-06 08:31:02 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.28.mlp.experts), it might take a while...
[2025-09-06 08:31:05 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.29.mlp.experts), it might take a while...
[2025-09-06 08:31:08 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.30.mlp.experts), it might take a while...
[2025-09-06 08:31:11 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.31.mlp.experts), it might take a while...
[2025-09-06 08:31:15 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.32.mlp.experts), it might take a while...
[2025-09-06 08:31:18 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.33.mlp.experts), it might take a while...
[2025-09-06 08:31:21 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.34.mlp.experts), it might take a while...
[2025-09-06 08:31:24 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.35.mlp.experts), it might take a while...
[2025-09-06 08:31:27 TP0] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=158.06 GB, mem usage=18.22 GB.
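Note: the per-layer MXFP4 shuffle dominates the load phase here. Consecutive "Shuffling" lines are roughly 3 s apart across 36 layers, which accounts for most of the ~122 s between "Load weight begin" (08:29:25) and "Load weight end" (08:31:27). A trivial check:

    # Back-of-the-envelope: time spent in the one-time MXFP4 weight shuffle.
    layers = 36
    seconds_per_layer = 3.1   # spacing between consecutive "Shuffling" log lines
    print(layers * seconds_per_layer)   # ~112 s of the ~122 s "Load weight" phase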
[2025-09-06 08:31:28 TP0] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:31:28 TP3] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:31:28 TP0] Memory pool end. avail mem=10.23 GB
[2025-09-06 08:31:28 TP2] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:31:28 TP1] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
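Note: the logged KV-cache size is self-consistent. A back-of-the-envelope check, assuming gpt-oss-120b's geometry (36 layers, 8 KV heads of head_dim 64, sharded over tp_size=4) and bf16 KV entries; the geometry is an assumption, not read from the log:

    # Sanity-check "K size: 72.85 GB" for #tokens: 8487040.
    tokens   = 8_487_040
    layers   = 36             # assumed gpt-oss-120b depth
    kv_heads = 8              # assumed KV heads, split across the TP ranks
    head_dim = 64
    tp_size  = 4
    bytes_el = 2              # bf16 element size
    k_bytes = tokens * layers * (kv_heads // tp_size) * head_dim * bytes_el
    print(f"{k_bytes / 2**30:.2f} GB")   # 72.85 GB per rank, matching the log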
[2025-09-06 08:31:28 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=9.54 GB
[2025-09-06 08:31:28 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200]
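Note: the captured batch sizes follow a simple ladder up to cuda_graph_max_bs=200: powers of two through 8, then multiples of 8 from 16. A minimal sketch that reproduces the list above (the rule is inferred from this log, not from sglang source):

    # Reconstruct the CUDA-graph capture batch sizes for cuda_graph_max_bs=200.
    max_bs = 200
    capture_bs = [1, 2, 4, 8] + list(range(16, max_bs + 1, 8))
    print(capture_bs)        # [1, 2, 4, 8, 16, 24, ..., 192, 200]
    print(len(capture_bs))   # 28 graphs, matching the 0/28 progress bar below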
Capturing batches (bs=200 avail_mem=9.39 GB): 0%| | 0/28 [00:00<?, ?it/s]
rank 1 allocated ipc_handles: [['0x77f2a8000000', '0x7818a4000000', '0x77f2a4000000', '0x77f2a0000000'], ['0x77f2a3000000', '0x77f2a2e00000', '0x77f2a3200000', '0x77f2a3400000'], ['0x77f28c000000', '0x77f296000000', '0x77f282000000', '0x77f278000000']]
[2025-09-06 08:31:30.684] [info] lamportInitialize start: buffer: 0x77f296000000, size: 71303168
rank 3 allocated ipc_handles: [['0x7e64a4000000', '0x7e6448000000', '0x7e6444000000', '0x7e8a40000000'], ['0x7e6447000000', '0x7e6447200000', '0x7e6447400000', '0x7e6446e00000'], ['0x7e6430000000', '0x7e6426000000', '0x7e641c000000', '0x7e643a000000']]
[2025-09-06 08:31:30.734] [info] lamportInitialize start: buffer: 0x7e643a000000, size: 71303168
rank 0 allocated ipc_handles: [['0x77a1e2000000', '0x777bf6000000', '0x777bf2000000', '0x777bee000000'], ['0x777bf0e00000', '0x777bf1000000', '0x777bf1200000', '0x777bf1400000'], ['0x777be4000000', '0x777bda000000', '0x777bd0000000', '0x777bc6000000']]
[2025-09-06 08:31:30.783] [info] lamportInitialize start: buffer: 0x777be4000000, size: 71303168
rank 2 allocated ipc_handles: [['0x719f74000000', '0x719f10000000', '0x71c50a000000', '0x719f0c000000'], ['0x719f0f000000', '0x719f0f200000', '0x719f0ee00000', '0x719f0f400000'], ['0x719ef8000000', '0x719eee000000', '0x719f02000000', '0x719ee4000000']]
[2025-09-06 08:31:30.833] [info] lamportInitialize start: buffer: 0x719f02000000, size: 71303168
[2025-09-06 08:31:30 TP0] FlashInfer workspace initialized for rank 0, world_size 4
[2025-09-06 08:31:30 TP1] FlashInfer workspace initialized for rank 1, world_size 4
[2025-09-06 08:31:30 TP3] FlashInfer workspace initialized for rank 3, world_size 4
[2025-09-06 08:31:30 TP2] FlashInfer workspace initialized for rank 2, world_size 4
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 0 workspace[0] 0x77a1e2000000
Rank 0 workspace[1] 0x777bf6000000
Rank 0 workspace[2] 0x777bf2000000
Rank 0 workspace[3] 0x777bee000000
Rank 0 workspace[4] 0x777bf0e00000
Rank 0 workspace[5] 0x777bf1000000
Rank 0 workspace[6] 0x777bf1200000
Rank 0 workspace[7] 0x777bf1400000
Rank 0 workspace[8] 0x777be4000000
Rank 0 workspace[9] 0x777bda000000
Rank 0 workspace[10] 0x777bd0000000
Rank 0 workspace[11] 0x777bc6000000
Rank 0 workspace[12] 0x77a7db264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 3 workspace[0] 0x7e64a4000000
Rank 3 workspace[1] 0x7e6448000000
Rank 3 workspace[2] 0x7e6444000000
Rank 3 workspace[3] 0x7e8a40000000
Rank 3 workspace[4] 0x7e6447000000
Rank 3 workspace[5] 0x7e6447200000
Rank 3 workspace[6] 0x7e6447400000
Rank 3 workspace[7] 0x7e6446e00000
Rank 3 workspace[8] 0x7e6430000000
Rank 3 workspace[9] 0x7e6426000000
Rank 3 workspace[10] 0x7e641c000000
Rank 3 workspace[11] 0x7e643a000000
Rank 3 workspace[12] 0x7e903b264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 2 workspace[0] 0x719f74000000
Rank 2 workspace[1] 0x719f10000000
Rank 2 workspace[2] 0x71c50a000000
Rank 2 workspace[3] 0x719f0c000000
Rank 2 workspace[4] 0x719f0f000000
Rank 2 workspace[5] 0x719f0f200000
Rank 2 workspace[6] 0x719f0ee00000
Rank 2 workspace[7] 0x719f0f400000
Rank 2 workspace[8] 0x719ef8000000
Rank 2 workspace[9] 0x719eee000000
Rank 2 workspace[10] 0x719f02000000
Rank 2 workspace[11] 0x719ee4000000
Rank 2 workspace[12] 0x71cb13264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 1 workspace[0] 0x77f2a8000000
Rank 1 workspace[1] 0x7818a4000000
Rank 1 workspace[2] 0x77f2a4000000
Rank 1 workspace[3] 0x77f2a0000000
Rank 1 workspace[4] 0x77f2a3000000
Rank 1 workspace[5] 0x77f2a2e00000
Rank 1 workspace[6] 0x77f2a3200000
Rank 1 workspace[7] 0x77f2a3400000
Rank 1 workspace[8] 0x77f28c000000
Rank 1 workspace[9] 0x77f296000000
Rank 1 workspace[10] 0x77f282000000
Rank 1 workspace[11] 0x77f278000000
Rank 1 workspace[12] 0x781ead264400
Capturing batches (bs=200 avail_mem=9.39 GB): 4%|▎ | 1/28 [00:02<00:56, 2.10s/it]
Capturing batches ... (bs stepping down 192 -> 2, avail_mem 8.06 GB -> 7.83 GB)
Capturing batches (bs=1 avail_mem=7.82 GB): 100%|██████████| 28/28 [00:04<00:00, 6.31it/s]
[2025-09-06 08:31:33 TP0] Registering 56 cuda graph addresses
[2025-09-06 08:31:33 TP3] Registering 56 cuda graph addresses
[2025-09-06 08:31:33 TP2] Registering 56 cuda graph addresses
[2025-09-06 08:31:33 TP1] Registering 56 cuda graph addresses
[2025-09-06 08:31:33 TP0] Capture cuda graph end. Time elapsed: 4.95 s. mem usage=1.73 GB. avail mem=7.81 GB.
[2025-09-06 08:31:34 TP0] max_total_num_tokens=8487040, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=131072, available_gpu_mem=7.81 GB
[2025-09-06 08:31:34] INFO: Started server process [36995]
[2025-09-06 08:31:34] INFO: Waiting for application startup.
[2025-09-06 08:31:35] INFO: Application startup complete.
[2025-09-06 08:31:35] INFO: Uvicorn running on http://127.0.0.1:8400 (Press CTRL+C to quit)
[2025-09-06 08:31:36] INFO: 127.0.0.1:51504 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-09-06 08:31:36 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:31:37] INFO: 127.0.0.1:51520 - "POST /generate HTTP/1.1" 200 OK
[2025-09-06 08:31:37] The server is fired up and ready to roll!
[2025-09-06 08:31:37 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:31:38] INFO: 127.0.0.1:51528 - "GET /health_generate HTTP/1.1" 200 OK
command=python3 -m sglang.launch_server --model-path /home/yiliu7/models/openai/gpt-oss-120b --tp 4 --cuda-graph-max-bs 200 --mem-fraction-static 0.93 --device cuda --host 127.0.0.1 --port 8400
Evaluation start: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6
ChatCompletionSampler initialized with self.system_message=None self.temperature=0.1 self.max_tokens=4096 self.reasoning_effort='low'
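Note: the sampler settings above map onto a plain OpenAI-compatible request against the /v1/chat/completions endpoint this server exposes. A minimal sketch using the openai client; passing reasoning_effort via extra_body is an assumption about how the harness forwards it:

    from openai import OpenAI

    # Point the stock OpenAI client at the sglang server launched above.
    client = OpenAI(base_url="http://127.0.0.1:8400/v1", api_key="EMPTY")

    resp = client.chat.completions.create(
        model="/home/yiliu7/models/openai/gpt-oss-120b",  # served_model_name
        messages=[{"role": "user", "content": "Say hi."}],
        temperature=0.1,
        max_tokens=4096,
        extra_body={"reasoning_effort": "low"},  # assumed pass-through field
    )
    print(resp.choices[0].message.content)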
0%| | 0/198 [00:00<?, ?it/s]
[2025-09-06 08:31:39 TP0] Prefill batch. #new-seq: 1, #new-token: 512, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:31:39 TP0] Prefill batch. #new-seq: 3, #new-token: 896, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0,
[2025-09-06 08:31:39 TP0] Prefill batch. #new-seq: 14, #new-token: 4160, #cached-token: 896, token usage: 0.00, #running-req: 4, #queue-req: 0,
[2025-09-06 08:31:39 TP0] Prefill batch. #new-seq: 17, #new-token: 3840, #cached-token: 1088, token usage: 0.00, #running-req: 18, #queue-req: 0,
[2025-09-06 08:31:39 TP0] Prefill batch. #new-seq: 55, #new-token: 16128, #cached-token: 3520, token usage: 0.00, #running-req: 35, #queue-req: 23,
[2025-09-06 08:31:39 TP0] Prefill batch. #new-seq: 46, #new-token: 14208, #cached-token: 2944, token usage: 0.00, #running-req: 90, #queue-req: 0,
[2025-09-06 08:31:40 TP0] Prefill batch. #new-seq: 57, #new-token: 16000, #cached-token: 3712, token usage: 0.00, #running-req: 136, #queue-req: 0,
[2025-09-06 08:31:40 TP0] Prefill batch. #new-seq: 5, #new-token: 1152, #cached-token: 320, token usage: 0.01, #running-req: 193, #queue-req: 0,
[2025-09-06 08:31:40 TP0] Decode batch. #running-req: 198, #token: 62976, token usage: 0.01, cuda graph: True, gen throughput (token/s): 922.40, #queue-req: 0,
[2025-09-06 08:31:40] INFO: 127.0.0.1:41312 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:40] INFO: 127.0.0.1:40926 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:41 TP0] Decode batch. #running-req: 196, #token: 70272, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17188.20, #queue-req: 0,
[2025-09-06 08:31:41] INFO: 127.0.0.1:40114 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:41] INFO: 127.0.0.1:40706 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:41] INFO: 127.0.0.1:40156 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:41] INFO: 127.0.0.1:41216 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:41] INFO: 127.0.0.1:41038 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:41] INFO: 127.0.0.1:40550 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:41] INFO: 127.0.0.1:39952 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:41] INFO: 127.0.0.1:40468 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:41 TP0] Decode batch. #running-req: 188, #token: 72704, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17082.06, #queue-req: 0,
[2025-09-06 08:31:41] INFO: 127.0.0.1:41490 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:41] INFO: 127.0.0.1:40572 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:41] INFO: 127.0.0.1:41026 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:41] INFO: 127.0.0.1:40216 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:41] INFO: 127.0.0.1:41222 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:41] INFO: 127.0.0.1:40902 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:41] INFO: 127.0.0.1:41350 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:41] INFO: 127.0.0.1:40590 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:41] INFO: 127.0.0.1:41608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:41] INFO: 127.0.0.1:40338 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:41] INFO: 127.0.0.1:40722 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:42] INFO: 127.0.0.1:40620 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:42] INFO: 127.0.0.1:41118 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:42 TP0] Decode batch. #running-req: 175, #token: 73280, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16450.85, #queue-req: 0,
[2025-09-06 08:31:42] INFO: 127.0.0.1:40004 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:42] INFO: 127.0.0.1:40324 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:42] INFO: 127.0.0.1:41550 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:42] INFO: 127.0.0.1:40594 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:42] INFO: 127.0.0.1:41424 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:42] INFO: 127.0.0.1:41528 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:42 TP0] Decode batch. #running-req: 169, #token: 76992, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15874.65, #queue-req: 0,
[2025-09-06 08:31:42] INFO: 127.0.0.1:40272 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:42] INFO: 127.0.0.1:40146 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:42] INFO: 127.0.0.1:40950 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:42] INFO: 127.0.0.1:40396 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:42] INFO: 127.0.0.1:40014 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:42] INFO: 127.0.0.1:39990 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43 TP0] Decode batch. #running-req: 163, #token: 81216, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14920.13, #queue-req: 0,
[2025-09-06 08:31:43] INFO: 127.0.0.1:40582 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43] INFO: 127.0.0.1:40868 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43] INFO: 127.0.0.1:40492 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43] INFO: 127.0.0.1:40742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43] INFO: 127.0.0.1:39878 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43] INFO: 127.0.0.1:40910 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43] INFO: 127.0.0.1:40870 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43] INFO: 127.0.0.1:41634 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43] INFO: 127.0.0.1:40340 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43] INFO: 127.0.0.1:41368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43] INFO: 127.0.0.1:40652 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43] INFO: 127.0.0.1:41138 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43] INFO: 127.0.0.1:40406 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43 TP0] Decode batch. #running-req: 151, #token: 80704, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13889.09, #queue-req: 0,
[2025-09-06 08:31:43] INFO: 127.0.0.1:40782 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43] INFO: 127.0.0.1:41124 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43] INFO: 127.0.0.1:40698 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43] INFO: 127.0.0.1:41040 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43] INFO: 127.0.0.1:41560 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43] INFO: 127.0.0.1:41012 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43] INFO: 127.0.0.1:39824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43] INFO: 127.0.0.1:40062 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43] INFO: 127.0.0.1:41210 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:43 TP0] Decode batch. #running-req: 141, #token: 81792, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13339.47, #queue-req: 0,
[2025-09-06 08:31:43] INFO: 127.0.0.1:41456 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:41196 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40154 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40350 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40064 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40768 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:39862 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:41380 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40798 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40130 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:41252 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40838 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:41476 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44 TP0] Decode batch. #running-req: 128, #token: 79296, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12714.51, #queue-req: 0,
[2025-09-06 08:31:44] INFO: 127.0.0.1:40200 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40306 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:41624 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40882 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40102 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40432 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40762 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40460 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40420 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44 TP0] Decode batch. #running-req: 119, #token: 78528, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16715.06, #queue-req: 0,
[2025-09-06 08:31:44] INFO: 127.0.0.1:40112 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40848 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40610 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40226 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:41536 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40250 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40098 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:41596 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44 TP0] Decode batch. #running-req: 112, #token: 77760, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15766.71, #queue-req: 0,
[2025-09-06 08:31:44] INFO: 127.0.0.1:40384 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40968 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40448 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45 TP0] Decode batch. #running-req: 108, #token: 79872, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15274.40, #queue-req: 0,
[2025-09-06 08:31:45] INFO: 127.0.0.1:40302 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:41298 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40184 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:41278 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:41304 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:41352 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40532 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:39892 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40618 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40670 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40812 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45 TP0] Decode batch. #running-req: 96, #token: 73856, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14447.63, #queue-req: 0,
[2025-09-06 08:31:45] INFO: 127.0.0.1:41588 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:41268 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:39842 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40650 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40408 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40888 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40630 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:41066 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40296 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:41180 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45 TP0] Decode batch. #running-req: 86, #token: 69632, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13476.54, #queue-req: 0,
[2025-09-06 08:31:45] INFO: 127.0.0.1:39908 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40560 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40040 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40732 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:39950 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:39976 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:41294 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40278 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:41378 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46 TP0] Decode batch. #running-req: 78, #token: 66048, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12489.34, #queue-req: 0,
[2025-09-06 08:31:46] INFO: 127.0.0.1:41094 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:40258 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:40520 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:40828 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:41406 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:40484 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:41166 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46 TP0] Decode batch. #running-req: 70, #token: 63360, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11531.30, #queue-req: 0,
[2025-09-06 08:31:46] INFO: 127.0.0.1:41632 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:40936 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:40714 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:40294 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:39814 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46 TP0] Decode batch. #running-req: 65, #token: 61248, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10813.85, #queue-req: 0,
[2025-09-06 08:31:46] INFO: 127.0.0.1:41098 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:41464 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:41232 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46 TP0] Decode batch. #running-req: 63, #token: 60800, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10310.24, #queue-req: 0,
[2025-09-06 08:31:46] INFO: 127.0.0.1:39940 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47 TP0] Decode batch. #running-req: 61, #token: 62464, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10214.19, #queue-req: 0,
[2025-09-06 08:31:47] INFO: 127.0.0.1:40370 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40466 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40658 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40426 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40846 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47 TP0] Decode batch. #running-req: 56, #token: 57984, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9650.02, #queue-req: 0,
[2025-09-06 08:31:47] INFO: 127.0.0.1:41440 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:39826 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40028 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40676 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:39958 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:41290 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40178 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:41602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:41078 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:41338 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47 TP0] Decode batch. #running-req: 46, #token: 49408, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8536.76, #queue-req: 0,
[2025-09-06 08:31:47] INFO: 127.0.0.1:41412 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40414 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:41392 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40736 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:41504 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40240 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:41056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40958 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40542 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40086 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47 TP0] Decode batch. #running-req: 35, #token: 39616, token usage: 0.00, cuda graph: True, gen throughput (token/s): 7112.32, #queue-req: 0,
[2025-09-06 08:31:47] INFO: 127.0.0.1:41154 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:41122 - "POST /v1/chat/completions HTTP/1.1" 200 OK
1%| | 1/198 [00:08<27:51, 8.49s/it]
[2025-09-06 08:31:47] INFO: 127.0.0.1:39856 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47 TP0] Decode batch. #running-req: 32, #token: 37376, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6197.00, #queue-req: 0,
[2025-09-06 08:31:48] INFO: 127.0.0.1:41014 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:41086 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48 TP0] Decode batch. #running-req: 30, #token: 36480, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5783.69, #queue-req: 0,
[2025-09-06 08:31:48] INFO: 127.0.0.1:41114 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:39964 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:40880 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:40312 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48 TP0] Decode batch. #running-req: 26, #token: 32512, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5250.98, #queue-req: 0,
[2025-09-06 08:31:48] INFO: 127.0.0.1:40800 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:40140 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:40980 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:41330 - "POST /v1/chat/completions HTTP/1.1" 200 OK
6%|▌ | 12/198 [00:09<01:44, 1.78it/s]
[2025-09-06 08:31:48 TP0] Decode batch. #running-req: 22, #token: 28672, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4662.53, #queue-req: 0,
[2025-09-06 08:31:48] INFO: 127.0.0.1:39946 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:39924 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:41244 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48 TP0] Decode batch. #running-req: 19, #token: 25472, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4022.45, #queue-req: 0,
[2025-09-06 08:31:48] INFO: 127.0.0.1:40996 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:40360 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:40906 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:39900 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:40458 - "POST /v1/chat/completions HTTP/1.1" 200 OK
14%|█▎ | 27/198 [00:09<00:36, 4.63it/s]
[2025-09-06 08:31:48] INFO: 127.0.0.1:41514 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48 TP0] Decode batch. #running-req: 13, #token: 17856, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3308.91, #queue-req: 0,
[2025-09-06 08:31:48] INFO: 127.0.0.1:40864 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:49] INFO: 127.0.0.1:40332 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:49 TP0] Decode batch. #running-req: 11, #token: 15360, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2396.18, #queue-req: 0,
[2025-09-06 08:31:49] INFO: 127.0.0.1:40172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:49 TP0] Decode batch. #running-req: 10, #token: 14464, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2326.20, #queue-req: 0,
[2025-09-06 08:31:49] INFO: 127.0.0.1:40054 - "POST /v1/chat/completions HTTP/1.1" 200 OK
21%|██ | 42/198 [00:09<00:19, 7.89it/s]
[2025-09-06 08:31:49 TP0] Decode batch. #running-req: 9, #token: 13248, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1980.44, #queue-req: 0,
[2025-09-06 08:31:49] INFO: 127.0.0.1:41576 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:49] INFO: 127.0.0.1:40750 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:49] INFO: 127.0.0.1:41360 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:49 TP0] Decode batch. #running-req: 6, #token: 9088, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1691.00, #queue-req: 0,
[2025-09-06 08:31:49 TP0] Decode batch. #running-req: 6, #token: 9280, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1448.62, #queue-req: 0,
[2025-09-06 08:31:49] INFO: 127.0.0.1:41322 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:49] INFO: 127.0.0.1:40076 - "POST /v1/chat/completions HTTP/1.1" 200 OK
23%|██▎ | 45/198 [00:10<00:20, 7.40it/s]
[2025-09-06 08:31:50 TP0] Decode batch. #running-req: 4, #token: 6336, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1187.08, #queue-req: 0,
[2025-09-06 08:31:50] INFO: 127.0.0.1:40504 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:50 TP0] Decode batch. #running-req: 3, #token: 4864, token usage: 0.00, cuda graph: True, gen throughput (token/s): 873.48, #queue-req: 0,
[2025-09-06 08:31:50 TP0] Decode batch. #running-req: 3, #token: 4928, token usage: 0.00, cuda graph: True, gen throughput (token/s): 794.66, #queue-req: 0,
[2025-09-06 08:31:50 TP0] Decode batch. #running-req: 3, #token: 5056, token usage: 0.00, cuda graph: True, gen throughput (token/s): 793.97, #queue-req: 0,
[2025-09-06 08:31:50] INFO: 127.0.0.1:40686 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:50 TP0] Decode batch. #running-req: 2, #token: 3584, token usage: 0.00, cuda graph: True, gen throughput (token/s): 654.65, #queue-req: 0,
[2025-09-06 08:31:50 TP0] Decode batch. #running-req: 2, #token: 3584, token usage: 0.00, cuda graph: True, gen throughput (token/s): 579.22, #queue-req: 0,
[2025-09-06 08:31:50] INFO: 127.0.0.1:40152 - "POST /v1/chat/completions HTTP/1.1" 200 OK
27%|██▋ | 54/198 [00:11<00:17, 8.07it/s]
[2025-09-06 08:31:50 TP0] Decode batch. #running-req: 1, #token: 1920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 553.68, #queue-req: 0,
[2025-09-06 08:31:51 TP0] Decode batch. #running-req: 1, #token: 1920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.43, #queue-req: 0,
[2025-09-06 08:31:51 TP0] Decode batch. #running-req: 1, #token: 1984, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.17, #queue-req: 0,
[2025-09-06 08:31:51 TP0] Decode batch. #running-req: 1, #token: 2048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.30, #queue-req: 0,
[2025-09-06 08:31:51 TP0] Decode batch. #running-req: 1, #token: 2048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.20, #queue-req: 0,
[2025-09-06 08:31:51 TP0] Decode batch. #running-req: 1, #token: 2112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.45, #queue-req: 0,
[2025-09-06 08:31:51 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.09, #queue-req: 0,
[2025-09-06 08:31:51 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.97, #queue-req: 0,
[2025-09-06 08:31:51 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.19, #queue-req: 0,
[2025-09-06 08:31:52 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.97, #queue-req: 0,
[2025-09-06 08:31:52] INFO: 127.0.0.1:40646 - "POST /v1/chat/completions HTTP/1.1" 200 OK
56%|█████▌ | 110/198 [00:12<00:04, 21.58it/s]
100%|██████████| 198/198 [00:12<00:00, 15.65it/s]
/usr/lib/python3.12/subprocess.py:1127: ResourceWarning: subprocess 36995 is still running
_warn("subprocess %s is still running" % self.pid,
ResourceWarning: Enable tracemalloc to get the object allocation traceback
.
----------------------------------------------------------------------
Ran 1 test in 166.818s
OK
Writing report to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.html
{'chars': 1722.030303030303, 'chars:std': 987.6499088827325, 'score:std': 0.4824488175389596, 'score': 0.6313131313131313}
Writing results to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.json
Total latency: 12.707 s
Score: 0.631
Evaluation end: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 metrics={'chars': 1722.030303030303, 'chars:std': 987.6499088827325, 'score:std': 0.4824488175389596, 'score': 0.6313131313131313}
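Note: the reported metrics are internally consistent: score 0.6313... is exactly 125/198, and for 0/1-scored items score:std is the Bernoulli standard deviation sqrt(p*(1-p)). A quick check:

    import math

    n, correct = 198, 125
    p = correct / n
    print(p)                       # 0.6313131313131313 -> 'score'
    print(math.sqrt(p * (1 - p)))  # 0.4824488175...    -> 'score:std'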
================================================================================
Run 3:
Auto-configed device: cuda
WARNING:sglang.srt.server_args:Detected SM100 and MXFP4 quantization format for GPT-OSS model, enabling FlashInfer MXFP4 MOE kernel.
WARNING:sglang.srt.server_args:TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64.
[2025-09-06 08:32:06] server_args=ServerArgs(model_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8400, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.93, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=True, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=713630635, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/home/yiliu7/models/openai/gpt-oss-120b', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='flashinfer_mxfp4', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, 
hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=200, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False)
All deep_gemm operations loaded successfully!
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:32:06] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:32:06] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:32:06] Using default HuggingFace chat template with detected content format: string
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:32:13 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:32:13 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:32:13 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:32:13 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:32:13 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:32:13 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:32:13 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:32:13 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:32:13 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:32:13 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:32:13 TP0] Init torch distributed begin.
[2025-09-06 08:32:14 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:32:14 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:32:14 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:32:14 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:32:14 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:32:14 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:32:15 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:32:18 TP0] Init torch distributed ends. mem usage=1.46 GB
[2025-09-06 08:32:18 TP0] Load weight begin. avail mem=176.28 GB
All deep_gemm operations loaded successfully!
Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 1662.52it/s]
[2025-09-06 08:32:29 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.0.mlp.experts), it might take a while...
All deep_gemm operations loaded successfully!
All deep_gemm operations loaded successfully!
All deep_gemm operations loaded successfully!
[2025-09-06 08:32:33 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.1.mlp.experts), it might take a while...
[2025-09-06 08:32:36 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.2.mlp.experts), it might take a while...
[2025-09-06 08:32:39 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.3.mlp.experts), it might take a while...
[2025-09-06 08:32:42 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.4.mlp.experts), it might take a while...
[2025-09-06 08:32:45 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.5.mlp.experts), it might take a while...
[2025-09-06 08:32:48 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.6.mlp.experts), it might take a while...
[2025-09-06 08:32:51 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.7.mlp.experts), it might take a while...
[2025-09-06 08:32:54 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.8.mlp.experts), it might take a while...
[2025-09-06 08:32:57 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.9.mlp.experts), it might take a while...
[2025-09-06 08:33:00 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.10.mlp.experts), it might take a while...
[2025-09-06 08:33:03 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.11.mlp.experts), it might take a while...
[2025-09-06 08:33:06 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.12.mlp.experts), it might take a while...
[2025-09-06 08:33:09 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.13.mlp.experts), it might take a while...
[2025-09-06 08:33:12 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.14.mlp.experts), it might take a while...
[2025-09-06 08:33:15 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.15.mlp.experts), it might take a while...
[2025-09-06 08:33:18 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.16.mlp.experts), it might take a while...
[2025-09-06 08:33:21 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.17.mlp.experts), it might take a while...
[2025-09-06 08:33:24 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.18.mlp.experts), it might take a while...
[2025-09-06 08:33:27 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.19.mlp.experts), it might take a while...
[2025-09-06 08:33:30 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.20.mlp.experts), it might take a while...
[2025-09-06 08:33:33 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.21.mlp.experts), it might take a while...
[2025-09-06 08:33:36 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.22.mlp.experts), it might take a while...
[2025-09-06 08:33:39 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.23.mlp.experts), it might take a while...
[2025-09-06 08:33:42 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.24.mlp.experts), it might take a while...
[2025-09-06 08:33:45 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.25.mlp.experts), it might take a while...
[2025-09-06 08:33:49 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.26.mlp.experts), it might take a while...
[2025-09-06 08:33:52 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.27.mlp.experts), it might take a while...
[2025-09-06 08:33:55 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.28.mlp.experts), it might take a while...
[2025-09-06 08:33:58 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.29.mlp.experts), it might take a while...
[2025-09-06 08:34:01 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.30.mlp.experts), it might take a while...
[2025-09-06 08:34:04 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.31.mlp.experts), it might take a while...
[2025-09-06 08:34:07 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.32.mlp.experts), it might take a while...
[2025-09-06 08:34:10 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.33.mlp.experts), it might take a while...
[2025-09-06 08:34:13 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.34.mlp.experts), it might take a while...
[2025-09-06 08:34:16 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.35.mlp.experts), it might take a while...
[2025-09-06 08:34:19 TP0] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=158.06 GB, mem usage=18.22 GB.
[2025-09-06 08:34:23 TP2] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:34:23 TP3] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:34:23 TP0] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:34:23 TP0] Memory pool end. avail mem=10.23 GB
[2025-09-06 08:34:23 TP1] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
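Note: the logged per-rank K size is self-consistent with the model shape. A back-of-the-envelope sketch (the 36 layers are confirmed by the per-layer shuffle messages above; 8 KV heads and head_dim 64 are assumptions taken from the published gpt-oss-120b config):

    # Rough check of the per-rank K-cache size, using numbers from the lines above.
    num_layers = 36              # confirmed above: layers 0..35 are shuffled
    kv_heads, head_dim = 8, 64   # assumption: published gpt-oss-120b GQA shape
    tp_size = 4                  # --tp 4
    dtype_bytes = 2              # bf16 KV cache (kv_cache_dtype='auto' -> model dtype)
    tokens = 8_487_040           # from the "KV Cache is allocated" lines

    k_bytes_per_token = num_layers * (kv_heads // tp_size) * head_dim * dtype_bytes
    print(k_bytes_per_token)                   # 9216 bytes per token per rank
    print(tokens * k_bytes_per_token / 2**30)  # ~72.85 GB, matching the K size above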
[2025-09-06 08:34:23 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=9.54 GB
[2025-09-06 08:34:24 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200]
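The captured batch sizes follow a simple schedule: 1, 2, 4, 8, then multiples of 8 from 16 up to --cuda-graph-max-bs (200 here). A sketch that reproduces the 28-entry list above (assumption: the rule is inferred from this log, not from the sglang source):

    max_bs = 200  # --cuda-graph-max-bs
    bs_list = [1, 2, 4, 8] + list(range(16, max_bs + 1, 8))
    print(len(bs_list), bs_list)  # 28 batch sizes: [1, 2, 4, 8, 16, 24, ..., 200]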
Capturing batches (bs=200 avail_mem=9.39 GB):   0%|          | 0/28 [00:00<?, ?it/s]
rank 1 allocated ipc_handles: [['0x7eb6fc000000', '0x7edcbc000000', '0x7eb6bc000000', '0x7eb6b8000000'], ['0x7eb6bb000000', '0x7eb6bae00000', '0x7eb6bb200000', '0x7eb6bb400000'], ['0x7eb6a4000000', '0x7eb6ae000000', '0x7eb69a000000', '0x7eb690000000']]
[2025-09-06 08:34:26.044] [info] lamportInitialize start: buffer: 0x7eb6ae000000, size: 71303168
rank 0 allocated ipc_handles: [['0x774d82000000', '0x772796000000', '0x772792000000', '0x77278e000000'], ['0x772790e00000', '0x772791000000', '0x772791200000', '0x772791400000'], ['0x772784000000', '0x77277a000000', '0x772770000000', '0x772766000000']]
[2025-09-06 08:34:26.092] [info] lamportInitialize start: buffer: 0x772784000000, size: 71303168
rank 3 allocated ipc_handles: [['0x76aaf0000000', '0x76aa8c000000', '0x76aa88000000', '0x76d08c000000'], ['0x76aa8b000000', '0x76aa8b200000', '0x76aa8b400000', '0x76aa8ae00000'], ['0x76aa74000000', '0x76aa6a000000', '0x76aa60000000', '0x76aa7e000000']]
[2025-09-06 08:34:26.142] [info] lamportInitialize start: buffer: 0x76aa7e000000, size: 71303168
rank 2 allocated ipc_handles: [['0x7db43c000000', '0x7db3f4000000', '0x7dd9f2000000', '0x7db3f0000000'], ['0x7db3f3000000', '0x7db3f3200000', '0x7db3f2e00000', '0x7db3f3400000'], ['0x7db3dc000000', '0x7db3d2000000', '0x7db3e6000000', '0x7db3c8000000']]
[2025-09-06 08:34:26.192] [info] lamportInitialize start: buffer: 0x7db3e6000000, size: 71303168
[2025-09-06 08:34:26 TP0] FlashInfer workspace initialized for rank 0, world_size 4
[2025-09-06 08:34:26 TP1] FlashInfer workspace initialized for rank 1, world_size 4
[2025-09-06 08:34:26 TP2] FlashInfer workspace initialized for rank 2, world_size 4
[2025-09-06 08:34:26 TP3] FlashInfer workspace initialized for rank 3, world_size 4
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 0 workspace[0] 0x774d82000000
Rank 0 workspace[1] 0x772796000000
Rank 0 workspace[2] 0x772792000000
Rank 0 workspace[3] 0x77278e000000
Rank 0 workspace[4] 0x772790e00000
Rank 0 workspace[5] 0x772791000000
Rank 0 workspace[6] 0x772791200000
Rank 0 workspace[7] 0x772791400000
Rank 0 workspace[8] 0x772784000000
Rank 0 workspace[9] 0x77277a000000
Rank 0 workspace[10] 0x772770000000
Rank 0 workspace[11] 0x772766000000
Rank 0 workspace[12] 0x77537b264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 3 workspace[0] 0x76aaf0000000
Rank 3 workspace[1] 0x76aa8c000000
Rank 3 workspace[2] 0x76aa88000000
Rank 3 workspace[3] 0x76d08c000000
Rank 3 workspace[4] 0x76aa8b000000
Rank 3 workspace[5] 0x76aa8b200000
Rank 3 workspace[6] 0x76aa8b400000
Rank 3 workspace[7] 0x76aa8ae00000
Rank 3 workspace[8] 0x76aa74000000
Rank 3 workspace[9] 0x76aa6a000000
Rank 3 workspace[10] 0x76aa60000000
Rank 3 workspace[11] 0x76aa7e000000
Rank 3 workspace[12] 0x76d685264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 2 workspace[0] 0x7db43c000000
Rank 2 workspace[1] 0x7db3f4000000
Rank 2 workspace[2] 0x7dd9f2000000
Rank 2 workspace[3] 0x7db3f0000000
Rank 2 workspace[4] 0x7db3f3000000
Rank 2 workspace[5] 0x7db3f3200000
Rank 2 workspace[6] 0x7db3f2e00000
Rank 2 workspace[7] 0x7db3f3400000
Rank 2 workspace[8] 0x7db3dc000000
Rank 2 workspace[9] 0x7db3d2000000
Rank 2 workspace[10] 0x7db3e6000000
Rank 2 workspace[11] 0x7db3c8000000
Rank 2 workspace[12] 0x7ddfff264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 1 workspace[0] 0x7eb6fc000000
Rank 1 workspace[1] 0x7edcbc000000
Rank 1 workspace[2] 0x7eb6bc000000
Rank 1 workspace[3] 0x7eb6b8000000
Rank 1 workspace[4] 0x7eb6bb000000
Rank 1 workspace[5] 0x7eb6bae00000
Rank 1 workspace[6] 0x7eb6bb200000
Rank 1 workspace[7] 0x7eb6bb400000
Rank 1 workspace[8] 0x7eb6a4000000
Rank 1 workspace[9] 0x7eb6ae000000
Rank 1 workspace[10] 0x7eb69a000000
Rank 1 workspace[11] 0x7eb690000000
Rank 1 workspace[12] 0x7ee2c9264400
Capturing batches (bs=200 avail_mem=9.39 GB):   4%|▎         | 1/28 [00:02<00:56,  2.10s/it]
Capturing batches (bs=1 avail_mem=7.82 GB): 100%|██████████| 28/28 [00:04<00:00,  6.29it/s]
[2025-09-06 08:34:28 TP0] Registering 56 cuda graph addresses
[2025-09-06 08:34:28 TP2] Registering 56 cuda graph addresses
[2025-09-06 08:34:28 TP3] Registering 56 cuda graph addresses
[2025-09-06 08:34:28 TP1] Registering 56 cuda graph addresses
[2025-09-06 08:34:28 TP0] Capture cuda graph end. Time elapsed: 4.95 s. mem usage=1.73 GB. avail mem=7.81 GB.
[2025-09-06 08:34:29 TP0] max_total_num_tokens=8487040, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=131072, available_gpu_mem=7.81 GB
[2025-09-06 08:34:30] INFO: Started server process [39541]
[2025-09-06 08:34:30] INFO: Waiting for application startup.
[2025-09-06 08:34:30] INFO: Application startup complete.
[2025-09-06 08:34:30] INFO: Uvicorn running on http://127.0.0.1:8400 (Press CTRL+C to quit)
[2025-09-06 08:34:30] INFO: 127.0.0.1:51204 - "GET /health_generate HTTP/1.1" 503 Service Unavailable
[2025-09-06 08:34:31] INFO: 127.0.0.1:51212 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-09-06 08:34:31 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:34:32] INFO: 127.0.0.1:51228 - "POST /generate HTTP/1.1" 200 OK
[2025-09-06 08:34:32] The server is fired up and ready to roll!
[2025-09-06 08:34:40 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:34:41] INFO: 127.0.0.1:42932 - "GET /health_generate HTTP/1.1" 200 OK
command=python3 -m sglang.launch_server --model-path /home/yiliu7/models/openai/gpt-oss-120b --tp 4 --cuda-graph-max-bs 200 --mem-fraction-static 0.93 --device cuda --host 127.0.0.1 --port 8400
Evaluation start: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6
ChatCompletionSampler initialized with self.system_message=None self.temperature=0.1 self.max_tokens=4096 self.reasoning_effort='low'
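Given the sampler settings above, each evaluation request is presumably a plain OpenAI-compatible chat completion against the server launched by the command above. A minimal sketch (the reasoning_effort extra-body field is an assumption about how the sampler forwards that setting):

    import requests

    resp = requests.post(
        "http://127.0.0.1:8400/v1/chat/completions",
        json={
            "model": "/home/yiliu7/models/openai/gpt-oss-120b",
            "messages": [{"role": "user", "content": "What is 2 + 2?"}],
            "temperature": 0.1,
            "max_tokens": 4096,
            "reasoning_effort": "low",  # assumption: passed through as an extra body field
        },
        timeout=300,
    )
    print(resp.json()["choices"][0]["message"]["content"])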
  0%|          | 0/198 [00:00<?, ?it/s]
[2025-09-06 08:34:42 TP0] Prefill batch. #new-seq: 1, #new-token: 448, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:34:42 TP0] Prefill batch. #new-seq: 1, #new-token: 192, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0,
[2025-09-06 08:34:42 TP0] Prefill batch. #new-seq: 14, #new-token: 3648, #cached-token: 896, token usage: 0.00, #running-req: 2, #queue-req: 0,
[2025-09-06 08:34:42 TP0] Prefill batch. #new-seq: 10, #new-token: 3648, #cached-token: 640, token usage: 0.00, #running-req: 16, #queue-req: 0,
[2025-09-06 08:34:42 TP0] Prefill batch. #new-seq: 51, #new-token: 16192, #cached-token: 3264, token usage: 0.00, #running-req: 26, #queue-req: 47,
[2025-09-06 08:34:42 TP0] Prefill batch. #new-seq: 60, #new-token: 16256, #cached-token: 3840, token usage: 0.00, #running-req: 77, #queue-req: 5,
[2025-09-06 08:34:43 TP0] Prefill batch. #new-seq: 61, #new-token: 16320, #cached-token: 4032, token usage: 0.00, #running-req: 137, #queue-req: 0,
[2025-09-06 08:34:43 TP0] Decode batch. #running-req: 198, #token: 62848, token usage: 0.01, cuda graph: True, gen throughput (token/s): 424.67, #queue-req: 0,
[2025-09-06 08:34:43] INFO: 127.0.0.1:44440 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:43] INFO: 127.0.0.1:43052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:43] INFO: 127.0.0.1:43842 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:43] INFO: 127.0.0.1:42990 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:43] INFO: 127.0.0.1:43898 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:43 TP0] Decode batch. #running-req: 193, #token: 68544, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17176.49, #queue-req: 0,
[2025-09-06 08:34:44] INFO: 127.0.0.1:42958 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:44] INFO: 127.0.0.1:44156 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:44] INFO: 127.0.0.1:44030 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:44] INFO: 127.0.0.1:43372 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:44] INFO: 127.0.0.1:44584 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:44 TP0] Decode batch. #running-req: 188, #token: 73984, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16913.88, #queue-req: 0,
[2025-09-06 08:34:44] INFO: 127.0.0.1:43384 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:44] INFO: 127.0.0.1:44636 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:44] INFO: 127.0.0.1:43320 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:44] INFO: 127.0.0.1:44038 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:44] INFO: 127.0.0.1:43412 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:44] INFO: 127.0.0.1:43790 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:44] INFO: 127.0.0.1:44138 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:44] INFO: 127.0.0.1:43464 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:44] INFO: 127.0.0.1:43970 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:44 TP0] Decode batch. #running-req: 179, #token: 77184, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16437.03, #queue-req: 0,
[2025-09-06 08:34:44] INFO: 127.0.0.1:43600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:44] INFO: 127.0.0.1:43582 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:44] INFO: 127.0.0.1:44292 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:45] INFO: 127.0.0.1:43146 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:45] INFO: 127.0.0.1:44200 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:45] INFO: 127.0.0.1:44112 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:45] INFO: 127.0.0.1:43828 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:45] INFO: 127.0.0.1:43226 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:45] INFO: 127.0.0.1:43888 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:45 TP0] Decode batch. #running-req: 170, #token: 77504, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15888.40, #queue-req: 0,
[2025-09-06 08:34:45] INFO: 127.0.0.1:44662 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:45] INFO: 127.0.0.1:43190 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:45] INFO: 127.0.0.1:43364 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:45] INFO: 127.0.0.1:44004 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:45] INFO: 127.0.0.1:44652 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:45] INFO: 127.0.0.1:43162 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:45] INFO: 127.0.0.1:44628 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:45 TP0] Decode batch. #running-req: 163, #token: 80768, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15192.08, #queue-req: 0,
[2025-09-06 08:34:45] INFO: 127.0.0.1:43072 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:45] INFO: 127.0.0.1:44694 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:45] INFO: 127.0.0.1:43744 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:45] INFO: 127.0.0.1:42976 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:45] INFO: 127.0.0.1:43754 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46] INFO: 127.0.0.1:43168 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46] INFO: 127.0.0.1:44364 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46] INFO: 127.0.0.1:42942 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46] INFO: 127.0.0.1:43432 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46] INFO: 127.0.0.1:43900 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46 TP0] Decode batch. #running-req: 153, #token: 81536, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13955.66, #queue-req: 0,
[2025-09-06 08:34:46] INFO: 127.0.0.1:44136 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46] INFO: 127.0.0.1:44436 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46] INFO: 127.0.0.1:44178 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46] INFO: 127.0.0.1:43676 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46] INFO: 127.0.0.1:43880 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46] INFO: 127.0.0.1:43696 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46] INFO: 127.0.0.1:44724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46] INFO: 127.0.0.1:44320 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46] INFO: 127.0.0.1:43032 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46] INFO: 127.0.0.1:42972 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46 TP0] Decode batch. #running-req: 143, #token: 82176, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13467.78, #queue-req: 0,
[2025-09-06 08:34:46] INFO: 127.0.0.1:44460 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46] INFO: 127.0.0.1:44508 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46] INFO: 127.0.0.1:43292 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46] INFO: 127.0.0.1:43422 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46] INFO: 127.0.0.1:43522 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46] INFO: 127.0.0.1:43228 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46] INFO: 127.0.0.1:43044 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:46] INFO: 127.0.0.1:44254 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:42966 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47 TP0] Decode batch. #running-req: 135, #token: 83392, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12737.00, #queue-req: 0,
[2025-09-06 08:34:47] INFO: 127.0.0.1:43606 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:43660 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:43622 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:44504 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:44022 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:44732 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:43124 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:43996 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:43088 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:43566 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:44756 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:44150 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:43806 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47 TP0] Decode batch. #running-req: 121, #token: 80000, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13645.15, #queue-req: 0,
[2025-09-06 08:34:47] INFO: 127.0.0.1:44216 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:43646 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:43968 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:43224 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:44354 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:43780 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:44526 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:44406 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47 TP0] Decode batch. #running-req: 113, #token: 79360, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16010.52, #queue-req: 0,
[2025-09-06 08:34:47] INFO: 127.0.0.1:43916 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:43698 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:43074 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:43450 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:43236 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:47] INFO: 127.0.0.1:44344 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:43346 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:43314 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48 TP0] Decode batch. #running-req: 105, #token: 77248, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15222.64, #queue-req: 0,
[2025-09-06 08:34:48] INFO: 127.0.0.1:44680 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:43134 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:43776 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:44304 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:44562 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:43574 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:43636 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:43700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48 TP0] Decode batch. #running-req: 97, #token: 76032, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14367.14, #queue-req: 0,
[2025-09-06 08:34:48] INFO: 127.0.0.1:43106 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:43016 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:43382 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:43264 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:44578 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:44642 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:44382 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:44392 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:44116 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:43800 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:43330 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48 TP0] Decode batch. #running-req: 86, #token: 70464, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13311.84, #queue-req: 0,
[2025-09-06 08:34:48] INFO: 127.0.0.1:44368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:44774 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:43458 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:44492 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:43396 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:44510 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:44054 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:44230 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:43714 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:44634 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48 TP0] Decode batch. #running-req: 76, #token: 64576, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12469.15, #queue-req: 0,
[2025-09-06 08:34:48] INFO: 127.0.0.1:43418 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:43280 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:44446 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:43618 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:48] INFO: 127.0.0.1:43956 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49] INFO: 127.0.0.1:44020 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49] INFO: 127.0.0.1:43244 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49] INFO: 127.0.0.1:43208 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49] INFO: 127.0.0.1:43584 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49 TP0] Decode batch. #running-req: 67, #token: 59840, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11205.85, #queue-req: 0,
[2025-09-06 08:34:49] INFO: 127.0.0.1:43536 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49] INFO: 127.0.0.1:44132 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49] INFO: 127.0.0.1:44612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49] INFO: 127.0.0.1:43814 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49] INFO: 127.0.0.1:44334 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49] INFO: 127.0.0.1:44420 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49] INFO: 127.0.0.1:44528 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49] INFO: 127.0.0.1:43144 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49] INFO: 127.0.0.1:43854 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49] INFO: 127.0.0.1:44308 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49 TP0] Decode batch. #running-req: 57, #token: 53248, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9851.14, #queue-req: 0,
[2025-09-06 08:34:49] INFO: 127.0.0.1:44268 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49] INFO: 127.0.0.1:43728 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49] INFO: 127.0.0.1:43438 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49] INFO: 127.0.0.1:43098 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49 TP0] Decode batch. #running-req: 54, #token: 51520, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9352.81, #queue-req: 0,
[2025-09-06 08:34:49] INFO: 127.0.0.1:43478 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49] INFO: 127.0.0.1:42994 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49] INFO: 127.0.0.1:43118 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49] INFO: 127.0.0.1:43180 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49 TP0] Decode batch. #running-req: 49, #token: 48832, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8635.47, #queue-req: 0,
[2025-09-06 08:34:49] INFO: 127.0.0.1:42962 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:49] INFO: 127.0.0.1:44170 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:50] INFO: 127.0.0.1:44078 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:50 TP0] Decode batch. #running-req: 46, #token: 48704, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8221.97, #queue-req: 0,
[2025-09-06 08:34:50] INFO: 127.0.0.1:44194 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:50] INFO: 127.0.0.1:43928 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:50] INFO: 127.0.0.1:44046 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:50] INFO: 127.0.0.1:43912 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:50] INFO: 127.0.0.1:43506 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:50] INFO: 127.0.0.1:43720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:50] INFO: 127.0.0.1:43528 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:50 TP0] Decode batch. #running-req: 39, #token: 43072, token usage: 0.01, cuda graph: True, gen throughput (token/s): 7532.11, #queue-req: 0,
[2025-09-06 08:34:50] INFO: 127.0.0.1:44128 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:50] INFO: 127.0.0.1:43158 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:50] INFO: 127.0.0.1:43006 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:50 TP0] Decode batch. #running-req: 36, #token: 41024, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6518.27, #queue-req: 0,
[2025-09-06 08:34:50] INFO: 127.0.0.1:44558 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:50] INFO: 127.0.0.1:44278 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:50] INFO: 127.0.0.1:44668 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:50] INFO: 127.0.0.1:42974 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:50] INFO: 127.0.0.1:43302 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:50 TP0] Decode batch. #running-req: 33, #token: 36672, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5974.62, #queue-req: 0,
[2025-09-06 08:34:50] INFO: 127.0.0.1:44246 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:50] INFO: 127.0.0.1:44538 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:50] INFO: 127.0.0.1:44064 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:50] INFO: 127.0.0.1:44542 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:50 TP0] Decode batch. #running-req: 27, #token: 33472, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5340.73, #queue-req: 0,
[2025-09-06 08:34:50] INFO: 127.0.0.1:44758 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:51] INFO: 127.0.0.1:43680 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:51] INFO: 127.0.0.1:44708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:51 TP0] Decode batch. #running-req: 24, #token: 30784, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4791.66, #queue-req: 0,
[2025-09-06 08:34:51] INFO: 127.0.0.1:43350 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:51] INFO: 127.0.0.1:43762 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:51] INFO: 127.0.0.1:43876 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:51] INFO: 127.0.0.1:43852 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:51] INFO: 127.0.0.1:43884 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:51 TP0] Decode batch. #running-req: 19, #token: 25152, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4001.31, #queue-req: 0,
[2025-09-06 08:34:51] INFO: 127.0.0.1:44080 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:51] INFO: 127.0.0.1:44412 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:51 TP0] Decode batch. #running-req: 17, #token: 23104, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3588.53, #queue-req: 0,
[2025-09-06 08:34:51] INFO: 127.0.0.1:43866 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:51] INFO: 127.0.0.1:44096 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:51 TP0] Decode batch. #running-req: 15, #token: 20928, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3195.11, #queue-req: 0,
[2025-09-06 08:34:51 TP0] Decode batch. #running-req: 15, #token: 21568, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3141.41, #queue-req: 0,
[2025-09-06 08:34:52] INFO: 127.0.0.1:44362 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:52] INFO: 127.0.0.1:43062 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:52 TP0] Decode batch. #running-req: 13, #token: 19264, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2949.04, #queue-req: 0,
[2025-09-06 08:34:52 TP0] Decode batch. #running-req: 13, #token: 19648, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2726.24, #queue-req: 0,
[2025-09-06 08:34:52] INFO: 127.0.0.1:44616 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:52] INFO: 127.0.0.1:43554 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:52] INFO: 127.0.0.1:43980 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:52 TP0] Decode batch. #running-req: 10, #token: 15744, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2499.47, #queue-req: 0,
[2025-09-06 08:34:52] INFO: 127.0.0.1:43206 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:52 TP0] Decode batch. #running-req: 9, #token: 14528, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2028.36, #queue-req: 0,
[2025-09-06 08:34:52] INFO: 127.0.0.1:43944 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:52 TP0] Decode batch. #running-req: 8, #token: 13376, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1883.41, #queue-req: 0,
[2025-09-06 08:34:52] INFO: 127.0.0.1:43628 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:53] INFO: 127.0.0.1:43250 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:53 TP0] Decode batch. #running-req: 6, #token: 10240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1682.40, #queue-req: 0,
[2025-09-06 08:34:53 TP0] Decode batch. #running-req: 6, #token: 10432, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1441.35, #queue-req: 0,
[2025-09-06 08:34:53 TP0] Decode batch. #running-req: 6, #token: 10688, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1445.57, #queue-req: 0,
[2025-09-06 08:34:53 TP0] Decode batch. #running-req: 6, #token: 10944, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1439.31, #queue-req: 0,
[2025-09-06 08:34:53 TP0] Decode batch. #running-req: 6, #token: 11072, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1438.83, #queue-req: 0,
[2025-09-06 08:34:53] INFO: 127.0.0.1:43494 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:53] INFO: 127.0.0.1:44322 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  1%|          | 1/198 [00:11<37:26, 11.40s/it]
[2025-09-06 08:34:53 TP0] Decode batch. #running-req: 4, #token: 7552, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1331.77, #queue-req: 0,
[2025-09-06 08:34:54 TP0] Decode batch. #running-req: 4, #token: 7744, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1056.35, #queue-req: 0,
[2025-09-06 08:34:54 TP0] Decode batch. #running-req: 4, #token: 7872, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1059.88, #queue-req: 0,
[2025-09-06 08:34:54] INFO: 127.0.0.1:44476 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  9%|▊         | 17/198 [00:11<01:30,  2.01it/s]
[2025-09-06 08:34:54] INFO: 127.0.0.1:43542 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:34:54 TP0] Decode batch. #running-req: 2, #token: 3968, token usage: 0.00, cuda graph: True, gen throughput (token/s): 621.73, #queue-req: 0,
[2025-09-06 08:34:54 TP0] Decode batch. #running-req: 2, #token: 4032, token usage: 0.00, cuda graph: True, gen throughput (token/s): 580.81, #queue-req: 0,
[2025-09-06 08:34:54 TP0] Decode batch. #running-req: 2, #token: 4160, token usage: 0.00, cuda graph: True, gen throughput (token/s): 579.50, #queue-req: 0,
[2025-09-06 08:34:54 TP0] Decode batch. #running-req: 2, #token: 4160, token usage: 0.00, cuda graph: True, gen throughput (token/s): 579.01, #queue-req: 0,
[2025-09-06 08:34:54] INFO: 127.0.0.1:44598 - "POST /v1/chat/completions HTTP/1.1" 200 OK
 15%|█▌        | 30/198 [00:12<00:44,  3.81it/s]
[2025-09-06 08:34:54 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 535.38, #queue-req: 0,
[2025-09-06 08:34:55 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.50, #queue-req: 0,
[2025-09-06 08:34:55 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.40, #queue-req: 0,
[2025-09-06 08:34:55] INFO: 127.0.0.1:44742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
 23%|██▎       | 45/198 [00:12<00:23,  6.61it/s]
100%|██████████| 198/198 [00:12<00:00, 15.54it/s]
/usr/lib/python3.12/subprocess.py:1127: ResourceWarning: subprocess 39541 is still running
_warn("subprocess %s is still running" % self.pid,
ResourceWarning: Enable tracemalloc to get the object allocation traceback
.
----------------------------------------------------------------------
Ran 1 test in 177.044s
OK
Writing report to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.html
{'chars': 1735.459595959596, 'chars:std': 1063.0222380884343, 'score:std': 0.4824488175389596, 'score': 0.6313131313131313}
Writing results to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.json
Total latency: 12.798 s
Score: 0.631
Evaluation end: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 metrics={'chars': 1735.459595959596, 'chars:std': 1063.0222380884343, 'score:std': 0.4824488175389596, 'score': 0.6313131313131313}
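The reported score:std appears to be the per-question Bernoulli standard deviation sqrt(p * (1 - p)): with 198 GPQA questions and score p = 125/198 it reproduces the logged value exactly, and dividing by sqrt(n) gives the standard error of the mean:

    import math

    n = 198
    p = 125 / 198                 # 0.6313131313131313, the logged score
    std = math.sqrt(p * (1 - p))
    print(std)                    # 0.4824488175389596, the logged score:std
    print(std / math.sqrt(n))     # ~0.0343 standard error of the mean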
================================================================================
Run 4:
Auto-configed device: cuda
WARNING:sglang.srt.server_args:Detected SM100 and MXFP4 quantization format for GPT-OSS model, enabling FlashInfer MXFP4 MOE kernel.
WARNING:sglang.srt.server_args:TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64.
[2025-09-06 08:35:09] server_args=ServerArgs(model_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8400, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.93, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=True, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=615780304, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/home/yiliu7/models/openai/gpt-oss-120b', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='flashinfer_mxfp4', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, 
hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=200, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False)
All deep_gemm operations loaded successfully!
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:35:09] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:35:09] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:35:10] Using default HuggingFace chat template with detected content format: string
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:35:16 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:35:16 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:35:17 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:35:17 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:35:17 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:35:17 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:35:17 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:35:17 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:35:17 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:35:17 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:35:17 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:35:17 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:35:17 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:35:17 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:35:17 TP0] Init torch distributed begin.
[2025-09-06 08:35:17 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:35:17 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:35:19 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:35:21 TP0] Init torch distributed ends. mem usage=1.46 GB
[2025-09-06 08:35:22 TP0] Load weight begin. avail mem=176.28 GB
All deep_gemm operations loaded successfully!
Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 1615.93it/s]
All deep_gemm operations loaded successfully!
All deep_gemm operations loaded successfully!
[2025-09-06 08:35:37 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.0.mlp.experts), it might take a while...
All deep_gemm operations loaded successfully!
[2025-09-06 08:35:40 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.1.mlp.experts), it might take a while...
[2025-09-06 08:35:43 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.2.mlp.experts), it might take a while...
[2025-09-06 08:35:46 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.3.mlp.experts), it might take a while...
[2025-09-06 08:35:50 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.4.mlp.experts), it might take a while...
[2025-09-06 08:35:53 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.5.mlp.experts), it might take a while...
[2025-09-06 08:35:56 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.6.mlp.experts), it might take a while...
[2025-09-06 08:35:59 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.7.mlp.experts), it might take a while...
[2025-09-06 08:36:02 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.8.mlp.experts), it might take a while...
[2025-09-06 08:36:05 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.9.mlp.experts), it might take a while...
[2025-09-06 08:36:08 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.10.mlp.experts), it might take a while...
[2025-09-06 08:36:11 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.11.mlp.experts), it might take a while...
[2025-09-06 08:36:15 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.12.mlp.experts), it might take a while...
[2025-09-06 08:36:18 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.13.mlp.experts), it might take a while...
[2025-09-06 08:36:21 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.14.mlp.experts), it might take a while...
[2025-09-06 08:36:24 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.15.mlp.experts), it might take a while...
[2025-09-06 08:36:27 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.16.mlp.experts), it might take a while...
[2025-09-06 08:36:33 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.17.mlp.experts), it might take a while...
[2025-09-06 08:36:37 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.18.mlp.experts), it might take a while...
[2025-09-06 08:36:41 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.19.mlp.experts), it might take a while...
[2025-09-06 08:36:44 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.20.mlp.experts), it might take a while...
[2025-09-06 08:36:48 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.21.mlp.experts), it might take a while...
[2025-09-06 08:36:51 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.22.mlp.experts), it might take a while...
[2025-09-06 08:36:54 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.23.mlp.experts), it might take a while...
[2025-09-06 08:36:57 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.24.mlp.experts), it might take a while...
[2025-09-06 08:37:00 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.25.mlp.experts), it might take a while...
[2025-09-06 08:37:03 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.26.mlp.experts), it might take a while...
[2025-09-06 08:37:07 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.27.mlp.experts), it might take a while...
[2025-09-06 08:37:10 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.28.mlp.experts), it might take a while...
[2025-09-06 08:37:13 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.29.mlp.experts), it might take a while...
[2025-09-06 08:37:16 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.30.mlp.experts), it might take a while...
[2025-09-06 08:37:19 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.31.mlp.experts), it might take a while...
[2025-09-06 08:37:22 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.32.mlp.experts), it might take a while...
[2025-09-06 08:37:25 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.33.mlp.experts), it might take a while...
[2025-09-06 08:37:28 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.34.mlp.experts), it might take a while...
[2025-09-06 08:37:32 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.35.mlp.experts), it might take a while...
[2025-09-06 08:37:35 TP0] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=158.06 GB, mem usage=18.22 GB.
[2025-09-06 08:37:35 TP0] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:37:35 TP0] Memory pool end. avail mem=10.23 GB
[2025-09-06 08:37:35 TP2] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:37:35 TP1] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:37:35 TP3] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
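Quick sanity check on the reported K size (a back-of-the-envelope sketch, assuming gpt-oss-120b's 36 decoder layers, 8 KV heads of head_dim 64 — 36 layers matches the shuffled layers 0..35 above — with a bf16 cache and tp=4, so each rank holds 8/4 = 2 KV heads):

    # Per-rank K-cache bytes per token: layers * kv_heads_per_rank * head_dim * 2 (bf16)
    layers, kv_heads_per_rank, head_dim, bytes_per_elem = 36, 2, 64, 2
    tokens = 8_487_040  # #tokens from the log
    k_bytes = tokens * layers * kv_heads_per_rank * head_dim * bytes_per_elem
    print(f"{k_bytes / 2**30:.2f} GB")  # -> 72.85 GB, matching the log (V is identical)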
[2025-09-06 08:37:35 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=9.54 GB
[2025-09-06 08:37:35 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200]
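The capture list follows a simple pattern inferred from this log (not necessarily sglang's exact rule): batch sizes 1, 2, 4, 8, then steps of 8 up to cuda_graph_max_bs=200, for 28 graphs total:

    cuda_graph_max_bs = 200
    bs_list = [1, 2, 4, 8] + list(range(16, cuda_graph_max_bs + 1, 8))
    assert len(bs_list) == 28 and bs_list[-1] == 200  # matches the 0/28 progress bar below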
Capturing batches (bs=200 avail_mem=9.39 GB):   0%|          | 0/28 [00:00<?, ?it/s]
rank 2 allocated ipc_handles: [['0x739ed0000000', '0x739e68000000', '0x73c466000000', '0x739e64000000'], ['0x739e67000000', '0x739e67200000', '0x739e66e00000', '0x739e67400000'], ['0x739e50000000', '0x739e46000000', '0x739e5a000000', '0x739e3c000000']]
[2025-09-06 08:37:37.715] [info] lamportInitialize start: buffer: 0x739e5a000000, size: 71303168
rank 1 allocated ipc_handles: [['0x7a06d4000000', '0x7a2c70000000', '0x7a0670000000', '0x7a066c000000'], ['0x7a066f000000', '0x7a066ee00000', '0x7a066f200000', '0x7a066f400000'], ['0x7a0658000000', '0x7a0662000000', '0x7a064e000000', '0x7a0644000000']]
[2025-09-06 08:37:37.765] [info] lamportInitialize start: buffer: 0x7a0662000000, size: 71303168
rank 3 allocated ipc_handles: [['0x771bbc000000', '0x771b82000000', '0x771b7e000000', '0x774172000000'], ['0x771b81000000', '0x771b81200000', '0x771b81400000', '0x771b80e00000'], ['0x771b6a000000', '0x771b60000000', '0x771b56000000', '0x771b74000000']]
[2025-09-06 08:37:37.817] [info] lamportInitialize start: buffer: 0x771b74000000, size: 71303168
rank 0 allocated ipc_handles: [['0x734ac8000000', '0x7324d6000000', '0x7324d2000000', '0x7324ce000000'], ['0x7324d0e00000', '0x7324d1000000', '0x7324d1200000', '0x7324d1400000'], ['0x7324c4000000', '0x7324ba000000', '0x7324b0000000', '0x7324a6000000']]
[2025-09-06 08:37:37.865] [info] lamportInitialize start: buffer: 0x7324c4000000, size: 71303168
[2025-09-06 08:37:37 TP0] FlashInfer workspace initialized for rank 0, world_size 4
[2025-09-06 08:37:37 TP1] FlashInfer workspace initialized for rank 1, world_size 4
[2025-09-06 08:37:37 TP3] FlashInfer workspace initialized for rank 3, world_size 4
[2025-09-06 08:37:37 TP2] FlashInfer workspace initialized for rank 2, world_size 4
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 0 workspace[0] 0x734ac8000000
Rank 0 workspace[1] 0x7324d6000000
Rank 0 workspace[2] 0x7324d2000000
Rank 0 workspace[3] 0x7324ce000000
Rank 0 workspace[4] 0x7324d0e00000
Rank 0 workspace[5] 0x7324d1000000
Rank 0 workspace[6] 0x7324d1200000
Rank 0 workspace[7] 0x7324d1400000
Rank 0 workspace[8] 0x7324c4000000
Rank 0 workspace[9] 0x7324ba000000
Rank 0 workspace[10] 0x7324b0000000
Rank 0 workspace[11] 0x7324a6000000
Rank 0 workspace[12] 0x7350c3264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 1 workspace[0] 0x7a06d4000000
Rank 1 workspace[1] 0x7a2c70000000
Rank 1 workspace[2] 0x7a0670000000
Rank 1 workspace[3] 0x7a066c000000
Rank 1 workspace[4] 0x7a066f000000
Rank 1 workspace[5] 0x7a066ee00000
Rank 1 workspace[6] 0x7a066f200000
Rank 1 workspace[7] 0x7a066f400000
Rank 1 workspace[8] 0x7a0658000000
Rank 1 workspace[9] 0x7a0662000000
Rank 1 workspace[10] 0x7a064e000000
Rank 1 workspace[11] 0x7a0644000000
Rank 1 workspace[12] 0x7a327b264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 3 workspace[0] 0x771bbc000000
Rank 3 workspace[1] 0x771b82000000
Rank 3 workspace[2] 0x771b7e000000
Rank 3 workspace[3] 0x774172000000
Rank 3 workspace[4] 0x771b81000000
Rank 3 workspace[5] 0x771b81200000
Rank 3 workspace[6] 0x771b81400000
Rank 3 workspace[7] 0x771b80e00000
Rank 3 workspace[8] 0x771b6a000000
Rank 3 workspace[9] 0x771b60000000
Rank 3 workspace[10] 0x771b56000000
Rank 3 workspace[11] 0x771b74000000
Rank 3 workspace[12] 0x77476d264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 2 workspace[0] 0x739ed0000000
Rank 2 workspace[1] 0x739e68000000
Rank 2 workspace[2] 0x73c466000000
Rank 2 workspace[3] 0x739e64000000
Rank 2 workspace[4] 0x739e67000000
Rank 2 workspace[5] 0x739e67200000
Rank 2 workspace[6] 0x739e66e00000
Rank 2 workspace[7] 0x739e67400000
Rank 2 workspace[8] 0x739e50000000
Rank 2 workspace[9] 0x739e46000000
Rank 2 workspace[10] 0x739e5a000000
Rank 2 workspace[11] 0x739e3c000000
Rank 2 workspace[12] 0x73ca6f264400
Capturing batches (bs=200 avail_mem=9.39 GB):   4%|▎         | 1/28 [00:02<01:00,  2.26s/it]
Capturing batches (bs=1 avail_mem=7.82 GB):  93%|█████████▎| 26/28 [00:04<00:00, 11.94it/s]
[2025-09-06 08:37:40 TP1] Registering 56 cuda graph addresses
Capturing batches (bs=1 avail_mem=7.82 GB): 100%|██████████| 28/28 [00:04<00:00,  6.05it/s]
[2025-09-06 08:37:40 TP0] Registering 56 cuda graph addresses
[2025-09-06 08:37:40 TP3] Registering 56 cuda graph addresses
[2025-09-06 08:37:40 TP2] Registering 56 cuda graph addresses
[2025-09-06 08:37:40 TP0] Capture cuda graph end. Time elapsed: 5.07 s. mem usage=1.73 GB. avail mem=7.81 GB.
[2025-09-06 08:37:41 TP0] max_total_num_tokens=8487040, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=131072, available_gpu_mem=7.81 GB
[2025-09-06 08:37:41] INFO: Started server process [42006]
[2025-09-06 08:37:41] INFO: Waiting for application startup.
[2025-09-06 08:37:42] INFO: Application startup complete.
[2025-09-06 08:37:42] INFO: Uvicorn running on http://127.0.0.1:8400 (Press CTRL+C to quit)
[2025-09-06 08:37:43] INFO: 127.0.0.1:53712 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-09-06 08:37:43 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:37:44] INFO: 127.0.0.1:53722 - "POST /generate HTTP/1.1" 200 OK
[2025-09-06 08:37:44] The server is fired up and ready to roll!
[2025-09-06 08:37:44 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:37:45] INFO: 127.0.0.1:53738 - "GET /health_generate HTTP/1.1" 200 OK
command=python3 -m sglang.launch_server --model-path /home/yiliu7/models/openai/gpt-oss-120b --tp 4 --cuda-graph-max-bs 200 --mem-fraction-static 0.93 --device cuda --host 127.0.0.1 --port 8400
Evaluation start: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6
ChatCompletionSampler initialized with self.system_message=None self.temperature=0.1 self.max_tokens=4096 self.reasoning_effort='low'
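A minimal client request mirroring these sampler settings against the server launched above (hypothetical sketch; `reasoning_effort` is assumed to be accepted as a chat-completions field by sglang's OpenAI-compatible endpoint for gpt-oss):

    import requests

    resp = requests.post(
        "http://127.0.0.1:8400/v1/chat/completions",
        json={
            "model": "/home/yiliu7/models/openai/gpt-oss-120b",
            "messages": [{"role": "user", "content": "ping"}],
            "temperature": 0.1,
            "max_tokens": 4096,
            "reasoning_effort": "low",  # assumption: forwarded to the model's chat template
        },
    )
    print(resp.json()["choices"][0]["message"]["content"])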
  0%|          | 0/198 [00:00<?, ?it/s]
[2025-09-06 08:37:46 TP0] Prefill batch. #new-seq: 1, #new-token: 256, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:37:46 TP0] Prefill batch. #new-seq: 1, #new-token: 640, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0,
[2025-09-06 08:37:46 TP0] Prefill batch. #new-seq: 15, #new-token: 4800, #cached-token: 960, token usage: 0.00, #running-req: 2, #queue-req: 0,
[2025-09-06 08:37:46 TP0] Prefill batch. #new-seq: 55, #new-token: 16320, #cached-token: 3520, token usage: 0.00, #running-req: 17, #queue-req: 47,
[2025-09-06 08:37:46 TP0] Prefill batch. #new-seq: 57, #new-token: 15296, #cached-token: 3648, token usage: 0.00, #running-req: 72, #queue-req: 0,
[2025-09-06 08:37:46 TP0] Prefill batch. #new-seq: 58, #new-token: 16320, #cached-token: 3776, token usage: 0.00, #running-req: 129, #queue-req: 11,
[2025-09-06 08:37:46 TP0] Prefill batch. #new-seq: 11, #new-token: 3136, #cached-token: 704, token usage: 0.01, #running-req: 187, #queue-req: 0,
[2025-09-06 08:37:47 TP0] Decode batch. #running-req: 198, #token: 62976, token usage: 0.01, cuda graph: True, gen throughput (token/s): 986.46, #queue-req: 0,
[2025-09-06 08:37:47] INFO: 127.0.0.1:55372 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:47] INFO: 127.0.0.1:54924 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:47] INFO: 127.0.0.1:55340 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:47] INFO: 127.0.0.1:54334 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:47 TP0] Decode batch. #running-req: 194, #token: 68928, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17158.29, #queue-req: 0,
[2025-09-06 08:37:47] INFO: 127.0.0.1:54134 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:47] INFO: 127.0.0.1:55242 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:47] INFO: 127.0.0.1:54550 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:48] INFO: 127.0.0.1:55312 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:48] INFO: 127.0.0.1:54414 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:48] INFO: 127.0.0.1:53950 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:48] INFO: 127.0.0.1:54428 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:48] INFO: 127.0.0.1:55094 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:48] INFO: 127.0.0.1:53964 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:48] INFO: 127.0.0.1:54790 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:48 TP0] Decode batch. #running-req: 186, #token: 70464, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17032.28, #queue-req: 0,
[2025-09-06 08:37:48] INFO: 127.0.0.1:53892 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:48] INFO: 127.0.0.1:54046 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:48] INFO: 127.0.0.1:54498 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:48] INFO: 127.0.0.1:54714 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:48] INFO: 127.0.0.1:53976 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:48] INFO: 127.0.0.1:54318 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:48] INFO: 127.0.0.1:54538 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:48] INFO: 127.0.0.1:53818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:48] INFO: 127.0.0.1:54622 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:48] INFO: 127.0.0.1:54342 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:48] INFO: 127.0.0.1:53744 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:48] INFO: 127.0.0.1:54142 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:48 TP0] Decode batch. #running-req: 173, #token: 71424, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16273.08, #queue-req: 0,
[2025-09-06 08:37:48] INFO: 127.0.0.1:55112 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:48] INFO: 127.0.0.1:53802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:54982 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:53908 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:55380 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49 TP0] Decode batch. #running-req: 168, #token: 75840, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15646.05, #queue-req: 0,
[2025-09-06 08:37:49] INFO: 127.0.0.1:54644 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:54720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:53980 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:54704 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:55180 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:54150 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:53746 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:55290 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:54904 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:54278 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:55140 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49 TP0] Decode batch. #running-req: 156, #token: 77824, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14502.97, #queue-req: 0,
[2025-09-06 08:37:49] INFO: 127.0.0.1:54816 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:53756 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:54526 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:55120 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:54338 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:55062 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:55178 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:55396 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:54910 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:49] INFO: 127.0.0.1:53844 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:50 TP0] Decode batch. #running-req: 146, #token: 78272, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13669.30, #queue-req: 0,
[2025-09-06 08:37:50] INFO: 127.0.0.1:54436 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:50] INFO: 127.0.0.1:53766 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:50] INFO: 127.0.0.1:54386 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:50] INFO: 127.0.0.1:55334 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:50 TP0] Decode batch. #running-req: 142, #token: 82112, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13344.68, #queue-req: 0,
[2025-09-06 08:37:50] INFO: 127.0.0.1:54844 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:50] INFO: 127.0.0.1:54860 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:50] INFO: 127.0.0.1:54504 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:50] INFO: 127.0.0.1:54086 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:50] INFO: 127.0.0.1:53928 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:50] INFO: 127.0.0.1:54232 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:50] INFO: 127.0.0.1:55348 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:50] INFO: 127.0.0.1:54114 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:50] INFO: 127.0.0.1:55412 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:50] INFO: 127.0.0.1:54272 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:50] INFO: 127.0.0.1:55216 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:50] INFO: 127.0.0.1:55214 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:50 TP0] Decode batch. #running-req: 131, #token: 81152, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12770.37, #queue-req: 0,
[2025-09-06 08:37:50] INFO: 127.0.0.1:54196 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:50] INFO: 127.0.0.1:54018 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:50] INFO: 127.0.0.1:55052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:54886 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:54486 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:54698 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:54180 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:54152 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:54658 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:54536 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51 TP0] Decode batch. #running-req: 120, #token: 79616, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15921.90, #queue-req: 0,
[2025-09-06 08:37:51] INFO: 127.0.0.1:53876 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:55226 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:54302 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:54394 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:55014 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:54246 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:54986 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:54742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:53934 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:55000 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:54838 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51 TP0] Decode batch. #running-req: 109, #token: 76416, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15876.24, #queue-req: 0,
[2025-09-06 08:37:51] INFO: 127.0.0.1:54744 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:55404 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:54168 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:55082 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:53800 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:54006 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:54028 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:53862 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:54234 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51 TP0] Decode batch. #running-req: 100, #token: 74048, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14915.39, #queue-req: 0,
[2025-09-06 08:37:51] INFO: 127.0.0.1:53994 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:54294 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:53874 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:53852 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:51] INFO: 127.0.0.1:54870 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:54900 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:54854 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:54124 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:54936 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:54952 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:54684 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52 TP0] Decode batch. #running-req: 89, #token: 69824, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13651.23, #queue-req: 0,
[2025-09-06 08:37:52] INFO: 127.0.0.1:53886 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:54516 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:54798 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:54962 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:55384 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:54874 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52 TP0] Decode batch. #running-req: 84, #token: 67392, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12980.65, #queue-req: 0,
[2025-09-06 08:37:52] INFO: 127.0.0.1:54378 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:53770 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:54834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:55342 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:54970 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:54452 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:53954 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:54216 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:54408 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:54098 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:54746 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:55128 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:54912 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:53742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:54670 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52 TP0] Decode batch. #running-req: 68, #token: 58624, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11580.63, #queue-req: 0,
[2025-09-06 08:37:52] INFO: 127.0.0.1:53916 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52] INFO: 127.0.0.1:55352 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:52 TP0] Decode batch. #running-req: 66, #token: 59904, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10703.52, #queue-req: 0,
[2025-09-06 08:37:53] INFO: 127.0.0.1:54258 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:53] INFO: 127.0.0.1:54572 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:53] INFO: 127.0.0.1:54938 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:53 TP0] Decode batch. #running-req: 63, #token: 59328, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10438.09, #queue-req: 0,
[2025-09-06 08:37:53] INFO: 127.0.0.1:54208 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:53] INFO: 127.0.0.1:54990 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:53] INFO: 127.0.0.1:54350 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:53] INFO: 127.0.0.1:54160 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:53] INFO: 127.0.0.1:54806 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:53] INFO: 127.0.0.1:54566 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:53] INFO: 127.0.0.1:54496 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:53 TP0] Decode batch. #running-req: 56, #token: 54592, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9934.25, #queue-req: 0,
[2025-09-06 08:37:53] INFO: 127.0.0.1:55118 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:53] INFO: 127.0.0.1:53784 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:53 TP0] Decode batch. #running-req: 54, #token: 54080, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9371.48, #queue-req: 0,
[2025-09-06 08:37:53] INFO: 127.0.0.1:55360 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:53] INFO: 127.0.0.1:55320 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:53] INFO: 127.0.0.1:54128 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:53] INFO: 127.0.0.1:54774 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:53] INFO: 127.0.0.1:54634 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:53] INFO: 127.0.0.1:55172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:53 TP0] Decode batch. #running-req: 48, #token: 50880, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8626.22, #queue-req: 0,
[2025-09-06 08:37:53] INFO: 127.0.0.1:55156 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:53] INFO: 127.0.0.1:54324 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:53] INFO: 127.0.0.1:54256 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54] INFO: 127.0.0.1:55198 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54 TP0] Decode batch. #running-req: 44, #token: 48256, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8115.64, #queue-req: 0,
[2025-09-06 08:37:54] INFO: 127.0.0.1:54034 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54] INFO: 127.0.0.1:55068 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54] INFO: 127.0.0.1:55100 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54 TP0] Decode batch. #running-req: 41, #token: 46528, token usage: 0.01, cuda graph: True, gen throughput (token/s): 7577.10, #queue-req: 0,
[2025-09-06 08:37:54] INFO: 127.0.0.1:54458 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54] INFO: 127.0.0.1:55296 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54] INFO: 127.0.0.1:55406 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54] INFO: 127.0.0.1:54596 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54] INFO: 127.0.0.1:54482 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54] INFO: 127.0.0.1:54628 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54] INFO: 127.0.0.1:55274 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54 TP0] Decode batch. #running-req: 34, #token: 39104, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6738.34, #queue-req: 0,
[2025-09-06 08:37:54] INFO: 127.0.0.1:54074 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54] INFO: 127.0.0.1:54444 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54] INFO: 127.0.0.1:54092 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54] INFO: 127.0.0.1:53872 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54] INFO: 127.0.0.1:54818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54] INFO: 127.0.0.1:54768 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54 TP0] Decode batch. #running-req: 28, #token: 34880, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5616.97, #queue-req: 0,
[2025-09-06 08:37:54] INFO: 127.0.0.1:53918 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54] INFO: 127.0.0.1:54362 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54] INFO: 127.0.0.1:54728 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54] INFO: 127.0.0.1:53754 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54] INFO: 127.0.0.1:55280 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:54 TP0] Decode batch. #running-req: 23, #token: 29184, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4789.55, #queue-req: 0,
[2025-09-06 08:37:54] INFO: 127.0.0.1:54756 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:55] INFO: 127.0.0.1:54268 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:55] INFO: 127.0.0.1:55182 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:55] INFO: 127.0.0.1:54356 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:55 TP0] Decode batch. #running-req: 19, #token: 24832, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4037.65, #queue-req: 0,
[2025-09-06 08:37:55] INFO: 127.0.0.1:55030 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:55 TP0] Decode batch. #running-req: 18, #token: 21568, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3674.82, #queue-req: 0,
[2025-09-06 08:37:55] INFO: 127.0.0.1:53846 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:55] INFO: 127.0.0.1:55024 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:55] INFO: 127.0.0.1:53834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:55] INFO: 127.0.0.1:54110 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:55] INFO: 127.0.0.1:53778 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:55] INFO: 127.0.0.1:54568 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:55 TP0] Decode batch. #running-req: 12, #token: 16576, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2813.33, #queue-req: 0,
[2025-09-06 08:37:55] INFO: 127.0.0.1:54280 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:55 TP0] Decode batch. #running-req: 11, #token: 15744, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2427.20, #queue-req: 0,
[2025-09-06 08:37:55] INFO: 127.0.0.1:55040 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:55] INFO: 127.0.0.1:54148 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:55 TP0] Decode batch. #running-req: 9, #token: 13248, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2133.90, #queue-req: 0,
[2025-09-06 08:37:56 TP0] Decode batch. #running-req: 9, #token: 13440, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1900.71, #queue-req: 0,
[2025-09-06 08:37:56] INFO: 127.0.0.1:54470 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:56] INFO: 127.0.0.1:54588 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:56] INFO: 127.0.0.1:54060 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:56] INFO: 127.0.0.1:54112 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:56] INFO: 127.0.0.1:55142 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:56 TP0] Decode batch. #running-req: 4, #token: 4736, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1559.00, #queue-req: 0,
[2025-09-06 08:37:56] INFO: 127.0.0.1:54610 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:56 TP0] Decode batch. #running-req: 3, #token: 4864, token usage: 0.00, cuda graph: True, gen throughput (token/s): 803.60, #queue-req: 0,
[2025-09-06 08:37:56] INFO: 127.0.0.1:55258 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:37:56 TP0] Decode batch. #running-req: 2, #token: 3392, token usage: 0.00, cuda graph: True, gen throughput (token/s): 610.98, #queue-req: 0,
[2025-09-06 08:37:56] INFO: 127.0.0.1:54852 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  1%|          | 1/198 [00:10<34:54, 10.63s/it]
[2025-09-06 08:37:56 TP0] Decode batch. #running-req: 1, #token: 1664, token usage: 0.00, cuda graph: True, gen throughput (token/s): 512.37, #queue-req: 0,
[2025-09-06 08:37:56 TP0] Decode batch. #running-req: 1, #token: 1728, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.50, #queue-req: 0,
[2025-09-06 08:37:56 TP0] Decode batch. #running-req: 1, #token: 1792, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.95, #queue-req: 0,
[2025-09-06 08:37:57 TP0] Decode batch. #running-req: 1, #token: 1792, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.45, #queue-req: 0,
[2025-09-06 08:37:57 TP0] Decode batch. #running-req: 1, #token: 1856, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.02, #queue-req: 0,
[2025-09-06 08:37:57 TP0] Decode batch. #running-req: 1, #token: 1920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.02, #queue-req: 0,
[2025-09-06 08:37:57 TP0] Decode batch. #running-req: 1, #token: 1920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.09, #queue-req: 0,
[2025-09-06 08:37:57 TP0] Decode batch. #running-req: 1, #token: 1984, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.74, #queue-req: 0,
[2025-09-06 08:37:57 TP0] Decode batch. #running-req: 1, #token: 1984, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.23, #queue-req: 0,
[2025-09-06 08:37:57 TP0] Decode batch. #running-req: 1, #token: 2048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.15, #queue-req: 0,
[2025-09-06 08:37:57 TP0] Decode batch. #running-req: 1, #token: 2112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.99, #queue-req: 0,
[2025-09-06 08:37:58 TP0] Decode batch. #running-req: 1, #token: 2112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.87, #queue-req: 0,
[2025-09-06 08:37:58] INFO: 127.0.0.1:55070 - "POST /v1/chat/completions HTTP/1.1" 200 OK
 15%|█▌        | 30/198 [00:12<00:50,  3.36it/s]
100%|██████████| 198/198 [00:12<00:00, 16.45it/s]
/usr/lib/python3.12/subprocess.py:1127: ResourceWarning: subprocess 42006 is still running
_warn("subprocess %s is still running" % self.pid,
ResourceWarning: Enable tracemalloc to get the object allocation traceback
.
----------------------------------------------------------------------
Ran 1 test in 176.322s
OK
Writing report to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.html
{'chars': 1670.3434343434344, 'chars:std': 964.280672515106, 'score:std': 0.46958834412435685, 'score': 0.6717171717171717}
Writing results to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.json
Total latency: 12.093 s
Score: 0.672
Evaluation end: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 metrics={'chars': 1670.3434343434344, 'chars:std': 964.280672515106, 'score:std': 0.46958834412435685, 'score': 0.6717171717171717}
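For reference, the reported metrics are plain per-example aggregates over the 198 graded samples; the std values are population standard deviations (for a 0/1 score with mean 0.6717, sqrt(p*(1-p)) = 0.4696, exactly the score:std above). A sketch with a hypothetical results list:

    import statistics

    results = [(1.0, "some response text"), (0.0, "...")]  # 198 (score, text) pairs in the real run
    scores = [s for s, _ in results]
    chars = [len(t) for _, t in results]
    metrics = {
        "score": statistics.mean(scores),
        "score:std": statistics.pstdev(scores),  # population std, matching the log
        "chars": statistics.mean(chars),
        "chars:std": statistics.pstdev(chars),
    }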
================================================================================
Run 5:
Auto-configed device: cuda
WARNING:sglang.srt.server_args:Detected SM100 and MXFP4 quantization format for GPT-OSS model, enabling FlashInfer MXFP4 MOE kernel.
WARNING:sglang.srt.server_args:TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64.
[2025-09-06 08:38:12] server_args=ServerArgs(model_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8400, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.93, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=True, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=861196426, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/home/yiliu7/models/openai/gpt-oss-120b', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='flashinfer_mxfp4', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, 
hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=200, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False)
All deep_gemm operations loaded successfully!
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:38:12] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:38:12] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:38:13] Using default HuggingFace chat template with detected content format: string
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:38:19 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:38:19 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:38:19 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:38:19 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:38:19 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:38:19 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:38:19 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:38:19 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:38:20 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:38:20 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:38:20 TP0] Init torch distributed begin.
[2025-09-06 08:38:20 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:38:20 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:38:20 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:38:20 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:38:20 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:38:20 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:38:21 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:38:24 TP0] Init torch distributed ends. mem usage=1.46 GB
[2025-09-06 08:38:24 TP0] Load weight begin. avail mem=176.28 GB
All deep_gemm operations loaded successfully!
Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 1591.28it/s]
[2025-09-06 08:38:33 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.0.mlp.experts), it might take a while...
All deep_gemm operations loaded successfully!
[2025-09-06 08:38:36 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.1.mlp.experts), it might take a while...
All deep_gemm operations loaded successfully!
All deep_gemm operations loaded successfully!
[2025-09-06 08:38:39 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.2.mlp.experts), it might take a while...
[2025-09-06 08:38:42 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.3.mlp.experts), it might take a while...
[2025-09-06 08:38:46 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.4.mlp.experts), it might take a while...
[2025-09-06 08:38:49 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.5.mlp.experts), it might take a while...
[2025-09-06 08:38:52 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.6.mlp.experts), it might take a while...
[2025-09-06 08:38:55 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.7.mlp.experts), it might take a while...
[2025-09-06 08:38:58 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.8.mlp.experts), it might take a while...
[2025-09-06 08:39:01 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.9.mlp.experts), it might take a while...
[2025-09-06 08:39:04 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.10.mlp.experts), it might take a while...
[2025-09-06 08:39:07 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.11.mlp.experts), it might take a while...
[2025-09-06 08:39:10 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.12.mlp.experts), it might take a while...
[2025-09-06 08:39:13 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.13.mlp.experts), it might take a while...
[2025-09-06 08:39:16 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.14.mlp.experts), it might take a while...
[2025-09-06 08:39:19 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.15.mlp.experts), it might take a while...
[2025-09-06 08:39:22 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.16.mlp.experts), it might take a while...
[2025-09-06 08:39:25 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.17.mlp.experts), it might take a while...
[2025-09-06 08:39:28 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.18.mlp.experts), it might take a while...
[2025-09-06 08:39:32 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.19.mlp.experts), it might take a while...
[2025-09-06 08:39:35 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.20.mlp.experts), it might take a while...
[2025-09-06 08:39:38 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.21.mlp.experts), it might take a while...
[2025-09-06 08:39:41 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.22.mlp.experts), it might take a while...
[2025-09-06 08:39:44 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.23.mlp.experts), it might take a while...
[2025-09-06 08:39:47 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.24.mlp.experts), it might take a while...
[2025-09-06 08:39:50 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.25.mlp.experts), it might take a while...
[2025-09-06 08:39:53 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.26.mlp.experts), it might take a while...
[2025-09-06 08:39:56 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.27.mlp.experts), it might take a while...
[2025-09-06 08:39:59 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.28.mlp.experts), it might take a while...
[2025-09-06 08:40:02 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.29.mlp.experts), it might take a while...
[2025-09-06 08:40:05 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.30.mlp.experts), it might take a while...
[2025-09-06 08:40:08 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.31.mlp.experts), it might take a while...
[2025-09-06 08:40:11 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.32.mlp.experts), it might take a while...
[2025-09-06 08:40:14 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.33.mlp.experts), it might take a while...
[2025-09-06 08:40:18 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.34.mlp.experts), it might take a while...
[2025-09-06 08:40:21 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.35.mlp.experts), it might take a while...
[2025-09-06 08:40:24 TP0] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=158.06 GB, mem usage=18.22 GB.
[2025-09-06 08:40:31 TP2] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:40:31 TP1] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:40:31 TP0] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:40:31 TP0] Memory pool end. avail mem=10.23 GB
[2025-09-06 08:40:31 TP3] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
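The per-rank KV-cache size logged above can be sanity-checked from the model shape. A minimal sketch, assuming gpt-oss-120b's published config (36 layers, 8 KV heads, head_dim 64) sharded across tp_size=4 in bfloat16; the config numbers are assumptions, only #tokens and the logged sizes come from the log:

```python
# Hedged sanity check of the logged per-rank K size (72.85 GB).
# Assumed gpt-oss-120b shape: 36 layers, 8 KV heads, head_dim 64; tp_size=4.
num_tokens = 8_487_040            # "#tokens" from the log above
layers, kv_heads, head_dim = 36, 8, 64
tp_size = 4
bytes_per_elem = 2                # bfloat16

k_bytes = num_tokens * layers * (kv_heads // tp_size) * head_dim * bytes_per_elem
print(f"K size per rank: {k_bytes / 2**30:.2f} GB")
# -> 72.84 GB, matching the logged 72.85 GB to within rounding; V is the same size.
```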
[2025-09-06 08:40:32 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=9.54 GB
[2025-09-06 08:40:32 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200]
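The captured batch-size list follows a simple schedule: 1, 2, 4, then steps of 8 up to cuda_graph_max_bs=200, for 28 graphs total. A minimal sketch reproducing it; the schedule is inferred from the logged list, not taken from sglang internals:

```python
# Reproduce the logged CUDA-graph capture batch sizes (inferred pattern).
cuda_graph_max_bs = 200
capture_bs = [1, 2, 4] + list(range(8, cuda_graph_max_bs + 1, 8))
assert len(capture_bs) == 28 and capture_bs[-1] == 200
print(capture_bs)  # [1, 2, 4, 8, 16, 24, ..., 192, 200]
```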
Capturing batches (bs=200 avail_mem=9.39 GB): 0%| | 0/28 [00:00<?, ?it/s]
rank 0 allocated ipc_handles: [['0x73dcf8000000', '0x73b734000000', '0x73b700000000', '0x73b6fc000000'], ['0x73b6fee00000', '0x73b6ff000000', '0x73b6ff200000', '0x73b6ff400000'], ['0x73b6f2000000', '0x73b6e8000000', '0x73b6de000000', '0x73b6d4000000']]
[2025-09-06 08:40:34.384] [info] lamportInitialize start: buffer: 0x73b6f2000000, size: 71303168
rank 1 allocated ipc_handles: [['0x75bc64000000', '0x75e1fa000000', '0x75bc00000000', '0x75bbfc000000'], ['0x75bbff000000', '0x75bbfee00000', '0x75bbff200000', '0x75bbff400000'], ['0x75bbe8000000', '0x75bbf2000000', '0x75bbde000000', '0x75bbd4000000']]
[2025-09-06 08:40:34.433] [info] lamportInitialize start: buffer: 0x75bbf2000000, size: 71303168
rank 2 allocated ipc_handles: [['0x74be10000000', '0x74bdac000000', '0x74e3a6000000', '0x74bda8000000'], ['0x74bdab000000', '0x74bdab200000', '0x74bdaae00000', '0x74bdab400000'], ['0x74bd94000000', '0x74bd8a000000', '0x74bd9e000000', '0x74bd80000000']]
[2025-09-06 08:40:34.483] [info] lamportInitialize start: buffer: 0x74bd9e000000, size: 71303168
rank 3 allocated ipc_handles: [['0x795024000000', '0x794fc8000000', '0x794fc4000000', '0x7975c0000000'], ['0x794fc7000000', '0x794fc7200000', '0x794fc7400000', '0x794fc6e00000'], ['0x794fb0000000', '0x794fa6000000', '0x794f9c000000', '0x794fba000000']]
[2025-09-06 08:40:34.533] [info] lamportInitialize start: buffer: 0x794fba000000, size: 71303168
[2025-09-06 08:40:34 TP0] FlashInfer workspace initialized for rank 0, world_size 4
[2025-09-06 08:40:34 TP1] FlashInfer workspace initialized for rank 1, world_size 4
[2025-09-06 08:40:34 TP3] FlashInfer workspace initialized for rank 3, world_size 4
[2025-09-06 08:40:34 TP2] FlashInfer workspace initialized for rank 2, world_size 4
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 0 workspace[0] 0x73dcf8000000
Rank 0 workspace[1] 0x73b734000000
Rank 0 workspace[2] 0x73b700000000
Rank 0 workspace[3] 0x73b6fc000000
Rank 0 workspace[4] 0x73b6fee00000
Rank 0 workspace[5] 0x73b6ff000000
Rank 0 workspace[6] 0x73b6ff200000
Rank 0 workspace[7] 0x73b6ff400000
Rank 0 workspace[8] 0x73b6f2000000
Rank 0 workspace[9] 0x73b6e8000000
Rank 0 workspace[10] 0x73b6de000000
Rank 0 workspace[11] 0x73b6d4000000
Rank 0 workspace[12] 0x73e2eb264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 3 workspace[0] 0x795024000000
Rank 3 workspace[1] 0x794fc8000000
Rank 3 workspace[2] 0x794fc4000000
Rank 3 workspace[3] 0x7975c0000000
Rank 3 workspace[4] 0x794fc7000000
Rank 3 workspace[5] 0x794fc7200000
Rank 3 workspace[6] 0x794fc7400000
Rank 3 workspace[7] 0x794fc6e00000
Rank 3 workspace[8] 0x794fb0000000
Rank 3 workspace[9] 0x794fa6000000
Rank 3 workspace[10] 0x794f9c000000
Rank 3 workspace[11] 0x794fba000000
Rank 3 workspace[12] 0x797bbd264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 1 workspace[0] 0x75bc64000000
Rank 1 workspace[1] 0x75e1fa000000
Rank 1 workspace[2] 0x75bc00000000
Rank 1 workspace[3] 0x75bbfc000000
Rank 1 workspace[4] 0x75bbff000000
Rank 1 workspace[5] 0x75bbfee00000
Rank 1 workspace[6] 0x75bbff200000
Rank 1 workspace[7] 0x75bbff400000
Rank 1 workspace[8] 0x75bbe8000000
Rank 1 workspace[9] 0x75bbf2000000
Rank 1 workspace[10] 0x75bbde000000
Rank 1 workspace[11] 0x75bbd4000000
Rank 1 workspace[12] 0x75e801264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 2 workspace[0] 0x74be10000000
Rank 2 workspace[1] 0x74bdac000000
Rank 2 workspace[2] 0x74e3a6000000
Rank 2 workspace[3] 0x74bda8000000
Rank 2 workspace[4] 0x74bdab000000
Rank 2 workspace[5] 0x74bdab200000
Rank 2 workspace[6] 0x74bdaae00000
Rank 2 workspace[7] 0x74bdab400000
Rank 2 workspace[8] 0x74bd94000000
Rank 2 workspace[9] 0x74bd8a000000
Rank 2 workspace[10] 0x74bd9e000000
Rank 2 workspace[11] 0x74bd80000000
Rank 2 workspace[12] 0x74e9b1264400
Capturing batches (bs=1 avail_mem=7.82 GB): 100%|██████████| 28/28 [00:04<00:00, 5.63it/s]
[2025-09-06 08:40:37 TP0] Registering 56 cuda graph addresses
[2025-09-06 08:40:37 TP1] Registering 56 cuda graph addresses
[2025-09-06 08:40:37 TP2] Registering 56 cuda graph addresses
[2025-09-06 08:40:37 TP3] Registering 56 cuda graph addresses
[2025-09-06 08:40:37 TP0] Capture cuda graph end. Time elapsed: 5.51 s. mem usage=1.73 GB. avail mem=7.81 GB.
[2025-09-06 08:40:38 TP0] max_total_num_tokens=8487040, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=131072, available_gpu_mem=7.81 GB
[2025-09-06 08:40:39] INFO: Started server process [44552]
[2025-09-06 08:40:39] INFO: Waiting for application startup.
[2025-09-06 08:40:39] INFO: Application startup complete.
[2025-09-06 08:40:39] INFO: Uvicorn running on http://127.0.0.1:8400 (Press CTRL+C to quit)
[2025-09-06 08:40:40] INFO: 127.0.0.1:40126 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-09-06 08:40:40 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:40:41] INFO: 127.0.0.1:40140 - "POST /generate HTTP/1.1" 200 OK
[2025-09-06 08:40:41] The server is fired up and ready to roll!
[2025-09-06 08:40:46 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:40:47] INFO: 127.0.0.1:40142 - "GET /health_generate HTTP/1.1" 200 OK
command=python3 -m sglang.launch_server --model-path /home/yiliu7/models/openai/gpt-oss-120b --tp 4 --cuda-graph-max-bs 200 --mem-fraction-static 0.93 --device cuda --host 127.0.0.1 --port 8400
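With the server launched by the command above, the warmup traffic in the log (/get_model_info, /generate, /health_generate) can be reproduced over plain HTTP. A minimal sketch with requests; the endpoint paths and host/port come from the log, while the /generate payload values are illustrative assumptions:

```python
import requests

base = "http://127.0.0.1:8400"  # host/port from the launch command above

# Endpoints observed in the log; the prompt and sampling params are assumed.
print(requests.get(f"{base}/get_model_info").json())
resp = requests.post(
    f"{base}/generate",
    json={"text": "The capital of France is",
          "sampling_params": {"max_new_tokens": 16}},
)
print(resp.json())
print(requests.get(f"{base}/health_generate").status_code)  # 200 when healthy
```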
Evaluation start: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6
ChatCompletionSampler initialized with self.system_message=None self.temperature=0.1 self.max_tokens=4096 self.reasoning_effort='low'
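The sampler settings logged above map directly onto an OpenAI-compatible chat-completions call against the server's /v1 routes. A minimal sketch with the openai client; the client wiring and the reasoning_effort pass-through are assumptions, only the sampling parameters and model path come from the log:

```python
from openai import OpenAI

# The server exposes OpenAI-compatible routes; api_key is unused locally.
client = OpenAI(base_url="http://127.0.0.1:8400/v1", api_key="EMPTY")

# Parameters taken from the ChatCompletionSampler line above.
resp = client.chat.completions.create(
    model="/home/yiliu7/models/openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "2+2=?"}],
    temperature=0.1,
    max_tokens=4096,
    extra_body={"reasoning_effort": "low"},  # assumed server-side field
)
print(resp.choices[0].message.content)
```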
0%| | 0/198 [00:00<?, ?it/s]
[2025-09-06 08:40:48 TP0] Prefill batch. #new-seq: 1, #new-token: 256, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:40:48 TP0] Prefill batch. #new-seq: 2, #new-token: 576, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0,
[2025-09-06 08:40:48 TP0] Prefill batch. #new-seq: 14, #new-token: 4096, #cached-token: 896, token usage: 0.00, #running-req: 3, #queue-req: 0,
[2025-09-06 08:40:48 TP0] Prefill batch. #new-seq: 15, #new-token: 4736, #cached-token: 960, token usage: 0.00, #running-req: 17, #queue-req: 0,
[2025-09-06 08:40:49 TP0] Prefill batch. #new-seq: 26, #new-token: 7168, #cached-token: 1664, token usage: 0.00, #running-req: 32, #queue-req: 0,
[2025-09-06 08:40:49 TP0] Prefill batch. #new-seq: 10, #new-token: 2560, #cached-token: 640, token usage: 0.00, #running-req: 58, #queue-req: 0,
[2025-09-06 08:40:49 TP0] Prefill batch. #new-seq: 32, #new-token: 10752, #cached-token: 2048, token usage: 0.00, #running-req: 68, #queue-req: 0,
[2025-09-06 08:40:49 TP0] Prefill batch. #new-seq: 58, #new-token: 15744, #cached-token: 3776, token usage: 0.00, #running-req: 100, #queue-req: 0,
[2025-09-06 08:40:49 TP0] Prefill batch. #new-seq: 40, #new-token: 10944, #cached-token: 2560, token usage: 0.01, #running-req: 158, #queue-req: 0,
[2025-09-06 08:40:49 TP0] Decode batch. #running-req: 198, #token: 62976, token usage: 0.01, cuda graph: True, gen throughput (token/s): 528.97, #queue-req: 0,
[2025-09-06 08:40:49] INFO: 127.0.0.1:40254 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:49] INFO: 127.0.0.1:41524 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:49] INFO: 127.0.0.1:40968 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:50 TP0] Decode batch. #running-req: 195, #token: 69824, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17214.69, #queue-req: 0,
[2025-09-06 08:40:50] INFO: 127.0.0.1:40724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:50] INFO: 127.0.0.1:41188 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:50] INFO: 127.0.0.1:40482 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:50] INFO: 127.0.0.1:41328 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:50] INFO: 127.0.0.1:40108 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:50] INFO: 127.0.0.1:40540 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:50] INFO: 127.0.0.1:41066 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:50] INFO: 127.0.0.1:41182 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:50] INFO: 127.0.0.1:41704 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:50] INFO: 127.0.0.1:40556 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:50 TP0] Decode batch. #running-req: 185, #token: 71744, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17078.90, #queue-req: 0,
[2025-09-06 08:40:50] INFO: 127.0.0.1:40956 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:50] INFO: 127.0.0.1:41430 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:50] INFO: 127.0.0.1:41060 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:50] INFO: 127.0.0.1:40940 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:50] INFO: 127.0.0.1:41668 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:51] INFO: 127.0.0.1:40338 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:51] INFO: 127.0.0.1:40398 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:51] INFO: 127.0.0.1:41196 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:51] INFO: 127.0.0.1:40726 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:51] INFO: 127.0.0.1:40640 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:51] INFO: 127.0.0.1:40672 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:51] INFO: 127.0.0.1:40566 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:51 TP0] Decode batch. #running-req: 173, #token: 73216, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16408.26, #queue-req: 0,
[2025-09-06 08:40:51] INFO: 127.0.0.1:40208 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:51] INFO: 127.0.0.1:41490 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:51] INFO: 127.0.0.1:41310 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:51] INFO: 127.0.0.1:41750 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:51 TP0] Decode batch. #running-req: 169, #token: 77568, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15698.13, #queue-req: 0,
[2025-09-06 08:40:51] INFO: 127.0.0.1:40138 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:51] INFO: 127.0.0.1:40592 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:51] INFO: 127.0.0.1:40986 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:51] INFO: 127.0.0.1:40498 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52 TP0] Decode batch. #running-req: 165, #token: 82432, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15224.52, #queue-req: 0,
[2025-09-06 08:40:52] INFO: 127.0.0.1:41262 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:41790 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:40896 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:41772 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:40348 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:41142 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:40448 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:41292 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52 TP0] Decode batch. #running-req: 157, #token: 84416, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14434.46, #queue-req: 0,
[2025-09-06 08:40:52] INFO: 127.0.0.1:41166 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:40706 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:41036 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:41092 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:40560 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:40888 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:41556 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:40296 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:40222 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:40794 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:41482 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:41374 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:41074 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:41438 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52 TP0] Decode batch. #running-req: 145, #token: 82560, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13510.77, #queue-req: 0,
[2025-09-06 08:40:52] INFO: 127.0.0.1:40164 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:40230 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:52] INFO: 127.0.0.1:40730 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:41682 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:40386 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:40666 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:41362 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:40080 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53 TP0] Decode batch. #running-req: 135, #token: 83392, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12758.89, #queue-req: 0,
[2025-09-06 08:40:53] INFO: 127.0.0.1:40804 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:40756 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:40272 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:41734 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:40360 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:41616 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:41574 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:40824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:41346 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:41644 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:41434 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:41402 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:40328 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53 TP0] Decode batch. #running-req: 122, #token: 80704, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14211.91, #queue-req: 0,
[2025-09-06 08:40:53] INFO: 127.0.0.1:41604 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:41830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:40104 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:41124 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:41330 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:40268 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:40408 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:40438 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:41352 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:53] INFO: 127.0.0.1:40148 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54 TP0] Decode batch. #running-req: 112, #token: 78400, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15988.90, #queue-req: 0,
[2025-09-06 08:40:54] INFO: 127.0.0.1:41502 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:40480 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:40248 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:41136 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54 TP0] Decode batch. #running-req: 108, #token: 79616, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15405.37, #queue-req: 0,
[2025-09-06 08:40:54] INFO: 127.0.0.1:40916 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:40624 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:40292 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:40648 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:40280 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:41560 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:40608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:41510 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:40830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:40524 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:40720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:41508 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:41652 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:40580 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54 TP0] Decode batch. #running-req: 94, #token: 73088, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14289.78, #queue-req: 0,
[2025-09-06 08:40:54] INFO: 127.0.0.1:40442 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:41518 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:40520 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:40928 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:41278 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:40270 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:40374 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:40778 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:41824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54 TP0] Decode batch. #running-req: 85, #token: 69760, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13354.62, #queue-req: 0,
[2025-09-06 08:40:54] INFO: 127.0.0.1:41598 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:54] INFO: 127.0.0.1:41760 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:41466 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:41584 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55 TP0] Decode batch. #running-req: 81, #token: 68416, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12492.43, #queue-req: 0,
[2025-09-06 08:40:55] INFO: 127.0.0.1:41446 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:41544 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:41158 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:40980 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:40678 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:40772 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:40950 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:41210 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:41308 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55 TP0] Decode batch. #running-req: 72, #token: 63872, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12045.72, #queue-req: 0,
[2025-09-06 08:40:55] INFO: 127.0.0.1:40844 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:41020 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:40656 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:41324 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:41630 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:41700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:40124 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:40904 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55 TP0] Decode batch. #running-req: 64, #token: 60288, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10692.53, #queue-req: 0,
[2025-09-06 08:40:55] INFO: 127.0.0.1:40746 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:41462 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:41572 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:40858 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:41636 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:40378 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55 TP0] Decode batch. #running-req: 58, #token: 56704, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10220.62, #queue-req: 0,
[2025-09-06 08:40:55] INFO: 127.0.0.1:41732 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:40816 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:41690 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:55] INFO: 127.0.0.1:40096 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56] INFO: 127.0.0.1:41414 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56 TP0] Decode batch. #running-req: 53, #token: 54080, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9137.63, #queue-req: 0,
[2025-09-06 08:40:56] INFO: 127.0.0.1:40880 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56] INFO: 127.0.0.1:41740 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56] INFO: 127.0.0.1:40300 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56] INFO: 127.0.0.1:41696 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56 TP0] Decode batch. #running-req: 49, #token: 51968, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8730.68, #queue-req: 0,
[2025-09-06 08:40:56] INFO: 127.0.0.1:40820 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56] INFO: 127.0.0.1:41778 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56] INFO: 127.0.0.1:40462 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56] INFO: 127.0.0.1:41004 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56] INFO: 127.0.0.1:41098 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56] INFO: 127.0.0.1:40290 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56 TP0] Decode batch. #running-req: 43, #token: 47680, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8114.26, #queue-req: 0,
[2025-09-06 08:40:56] INFO: 127.0.0.1:41264 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56] INFO: 127.0.0.1:41002 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56] INFO: 127.0.0.1:41122 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56] INFO: 127.0.0.1:40316 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56] INFO: 127.0.0.1:41422 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56] INFO: 127.0.0.1:41386 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56 TP0] Decode batch. #running-req: 37, #token: 41472, token usage: 0.00, cuda graph: True, gen throughput (token/s): 7218.83, #queue-req: 0,
[2025-09-06 08:40:56] INFO: 127.0.0.1:41202 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56] INFO: 127.0.0.1:41108 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56] INFO: 127.0.0.1:40532 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56] INFO: 127.0.0.1:40988 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56] INFO: 127.0.0.1:41350 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56] INFO: 127.0.0.1:41260 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56] INFO: 127.0.0.1:40204 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:56] INFO: 127.0.0.1:41088 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:57] INFO: 127.0.0.1:41132 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:57 TP0] Decode batch. #running-req: 28, #token: 33536, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5858.74, #queue-req: 0,
[2025-09-06 08:40:57] INFO: 127.0.0.1:40872 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:57] INFO: 127.0.0.1:40242 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:57] INFO: 127.0.0.1:41528 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:57 TP0] Decode batch. #running-req: 25, #token: 31168, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4823.55, #queue-req: 0,
[2025-09-06 08:40:57] INFO: 127.0.0.1:41244 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:57] INFO: 127.0.0.1:40194 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:57] INFO: 127.0.0.1:41670 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:57] INFO: 127.0.0.1:40506 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:57] INFO: 127.0.0.1:40476 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:57 TP0] Decode batch. #running-req: 20, #token: 25152, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4278.43, #queue-req: 0,
[2025-09-06 08:40:57 TP0] Decode batch. #running-req: 20, #token: 25920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3898.73, #queue-req: 0,
[2025-09-06 08:40:57] INFO: 127.0.0.1:41712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:57] INFO: 127.0.0.1:40596 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:57 TP0] Decode batch. #running-req: 18, #token: 24192, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3718.50, #queue-req: 0,
[2025-09-06 08:40:58 TP0] Decode batch. #running-req: 18, #token: 24640, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3511.95, #queue-req: 0,
[2025-09-06 08:40:58] INFO: 127.0.0.1:40424 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:58] INFO: 127.0.0.1:41720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:58] INFO: 127.0.0.1:41236 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:58 TP0] Decode batch. #running-req: 15, #token: 21184, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3166.87, #queue-req: 0,
[2025-09-06 08:40:58] INFO: 127.0.0.1:40788 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:58] INFO: 127.0.0.1:40350 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:58] INFO: 127.0.0.1:41804 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:58] INFO: 127.0.0.1:41222 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:58] INFO: 127.0.0.1:41460 - "POST /v1/chat/completions HTTP/1.1" 200 OK
1%| | 1/198 [00:09<31:52, 9.71s/it]
[2025-09-06 08:40:58] INFO: 127.0.0.1:41050 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:58 TP0] Decode batch. #running-req: 9, #token: 13056, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2633.08, #queue-req: 0,
[2025-09-06 08:40:58] INFO: 127.0.0.1:40512 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:58 TP0] Decode batch. #running-req: 8, #token: 11776, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1903.68, #queue-req: 0,
[2025-09-06 08:40:58] INFO: 127.0.0.1:41822 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:58] INFO: 127.0.0.1:40694 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:58 TP0] Decode batch. #running-req: 6, #token: 9152, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1819.86, #queue-req: 0,
[2025-09-06 08:40:58] INFO: 127.0.0.1:40156 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:58] INFO: 127.0.0.1:40370 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:58 TP0] Decode batch. #running-req: 4, #token: 6144, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1302.25, #queue-req: 0,
[2025-09-06 08:40:59 TP0] Decode batch. #running-req: 4, #token: 6336, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1056.97, #queue-req: 0,
[2025-09-06 08:40:59] INFO: 127.0.0.1:41820 - "POST /v1/chat/completions HTTP/1.1" 200 OK
23%|██▎ | 45/198 [00:10<00:26, 5.87it/s]
[2025-09-06 08:40:59 TP0] Decode batch. #running-req: 3, #token: 4928, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1005.99, #queue-req: 0,
[2025-09-06 08:40:59 TP0] Decode batch. #running-req: 3, #token: 4992, token usage: 0.00, cuda graph: True, gen throughput (token/s): 799.03, #queue-req: 0,
[2025-09-06 08:40:59] INFO: 127.0.0.1:40680 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:59 TP0] Decode batch. #running-req: 2, #token: 3456, token usage: 0.00, cuda graph: True, gen throughput (token/s): 716.67, #queue-req: 0,
[2025-09-06 08:40:59] INFO: 127.0.0.1:40642 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:40:59 TP0] Decode batch. #running-req: 1, #token: 1792, token usage: 0.00, cuda graph: True, gen throughput (token/s): 546.76, #queue-req: 0,
[2025-09-06 08:40:59 TP0] Decode batch. #running-req: 1, #token: 1792, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.99, #queue-req: 0,
[2025-09-06 08:40:59 TP0] Decode batch. #running-req: 1, #token: 1856, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.77, #queue-req: 0,
[2025-09-06 08:41:00 TP0] Decode batch. #running-req: 1, #token: 1856, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.65, #queue-req: 0,
[2025-09-06 08:41:00 TP0] Decode batch. #running-req: 1, #token: 1920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.68, #queue-req: 0,
[2025-09-06 08:41:00 TP0] Decode batch. #running-req: 1, #token: 1984, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.19, #queue-req: 0,
[2025-09-06 08:41:00 TP0] Decode batch. #running-req: 1, #token: 1984, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.51, #queue-req: 0,
[2025-09-06 08:41:00 TP0] Decode batch. #running-req: 1, #token: 2048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.70, #queue-req: 0,
[2025-09-06 08:41:00 TP0] Decode batch. #running-req: 1, #token: 2112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.81, #queue-req: 0,
[2025-09-06 08:41:00 TP0] Decode batch. #running-req: 1, #token: 2112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.76, #queue-req: 0,
[2025-09-06 08:41:00 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.45, #queue-req: 0,
[2025-09-06 08:41:01 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.66, #queue-req: 0,
[2025-09-06 08:41:01 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.81, #queue-req: 0,
[2025-09-06 08:41:01 TP0] Decode batch. #running-req: 1, #token: 2304, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.68, #queue-req: 0,
[2025-09-06 08:41:01 TP0] Decode batch. #running-req: 1, #token: 2304, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.63, #queue-req: 0,
[2025-09-06 08:41:01 TP0] Decode batch. #running-req: 1, #token: 2368, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.50, #queue-req: 0,
[2025-09-06 08:41:01 TP0] Decode batch. #running-req: 1, #token: 2432, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.94, #queue-req: 0,
[2025-09-06 08:41:01 TP0] Decode batch. #running-req: 1, #token: 2432, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.84, #queue-req: 0,
[2025-09-06 08:41:01 TP0] Decode batch. #running-req: 1, #token: 2496, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.78, #queue-req: 0,
[2025-09-06 08:41:02 TP0] Decode batch. #running-req: 1, #token: 2496, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.12, #queue-req: 0,
[2025-09-06 08:41:02 TP0] Decode batch. #running-req: 1, #token: 2560, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.85, #queue-req: 0,
[2025-09-06 08:41:02] INFO: 127.0.0.1:40180 - "POST /v1/chat/completions HTTP/1.1" 200 OK
100%|██████████| 198/198 [00:13<00:00, 14.66it/s]
/usr/lib/python3.12/subprocess.py:1127: ResourceWarning: subprocess 44552 is still running
_warn("subprocess %s is still running" % self.pid,
ResourceWarning: Enable tracemalloc to get the object allocation traceback
E
======================================================================
ERROR: test_mxfp4_120b (__main__.TestGptOss4Gpu.test_mxfp4_120b)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/yiliu7/sglang/python/sglang/srt/utils.py", line 2187, in retry
return fn()
^^^^
File "/home/yiliu7/sglang/python/sglang/test/test_utils.py", line 1396, in <lambda>
lambda: super(CustomTestCase, self)._callTestMethod(method),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: 0.5707070707070707 not greater than or equal to 0.6
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/yiliu7/sglang/python/sglang/test/test_utils.py", line 1395, in _callTestMethod
retry(
File "/home/yiliu7/sglang/python/sglang/srt/utils.py", line 2190, in retry
raise Exception(f"retry() exceed maximum number of retries.")
Exception: retry() exceed maximum number of retries.
----------------------------------------------------------------------
Ran 1 test in 178.092s
FAILED (errors=1)
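The traceback above shows a retry wrapper around the test method: the accuracy assertion (score >= 0.6) fails on every attempt, and once the retry budget is exhausted the wrapper raises its own exception. A minimal sketch of that pattern, not the actual sglang.srt.utils.retry implementation:

```python
import time

def retry(fn, max_retry: int = 2, delay_s: float = 1.0):
    """Re-run fn until it succeeds or the retry budget runs out.

    Sketch of the pattern seen in the traceback; not sglang's code.
    """
    for attempt in range(max_retry + 1):
        try:
            return fn()
        except AssertionError as e:
            print(f"attempt {attempt} failed: {e}")
            time.sleep(delay_s)
    raise Exception("retry() exceed maximum number of retries.")

def check():
    score = 0.5707070707070707  # measured GPQA score from this run
    assert score >= 0.6, f"{score} not greater than or equal to 0.6"

# retry(check)  # raises after exhausting retries, as in the log above
```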
Writing report to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.html
{'chars': 1742.560606060606, 'chars:std': 1075.4861263368614, 'score:std': 0.4949752621616814, 'score': 0.5707070707070707}
Writing results to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.json
Total latency: 13.566 s
Score: 0.571
Evaluation end: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 metrics={'chars': 1742.560606060606, 'chars:std': 1075.4861263368614, 'score:std': 0.4949752621616814, 'score': 0.5707070707070707}
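The reported score and score:std are self-consistent for 198 binary-scored samples (0.570707... x 198 = 113 correct): for a Bernoulli mean p, the population standard deviation is sqrt(p*(1-p)). A quick check:

```python
import math

p = 0.5707070707070707            # 'score' from the metrics above (113/198)
std = math.sqrt(p * (1 - p))
print(f"{std:.16f}")              # 0.4949752621..., matching 'score:std'
```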
================================================================================
Run 6:
Auto-configed device: cuda
WARNING:sglang.srt.server_args:Detected SM100 and MXFP4 quantization format for GPT-OSS model, enabling FlashInfer MXFP4 MOE kernel.
WARNING:sglang.srt.server_args:TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64.
[2025-09-06 08:41:17] server_args=ServerArgs(model_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8400, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.93, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=True, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=88006138, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/home/yiliu7/models/openai/gpt-oss-120b', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='flashinfer_mxfp4', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, 
hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=200, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False)
All deep_gemm operations loaded successfully!
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:41:17] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:41:17] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:41:18] Using default HuggingFace chat template with detected content format: string
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:41:24 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:41:24 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:41:24 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:41:24 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:41:24 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:41:24 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:41:24 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:41:24 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:41:25 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:41:25 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:41:25 TP0] Init torch distributed begin.
[2025-09-06 08:41:25 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:41:25 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:41:25 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:41:25 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:41:25 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:41:25 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:41:26 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:41:29 TP0] Init torch distributed ends. mem usage=1.46 GB
[2025-09-06 08:41:29 TP0] Load weight begin. avail mem=176.28 GB
All deep_gemm operations loaded successfully!
Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 1477.42it/s]
All deep_gemm operations loaded successfully!
[2025-09-06 08:41:39 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.0.mlp.experts), it might take a while...
All deep_gemm operations loaded successfully!
All deep_gemm operations loaded successfully!
[2025-09-06 08:41:42 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.1.mlp.experts), it might take a while...
[2025-09-06 08:41:45 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.2.mlp.experts), it might take a while...
[2025-09-06 08:41:48 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.3.mlp.experts), it might take a while...
[2025-09-06 08:41:51 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.4.mlp.experts), it might take a while...
[2025-09-06 08:41:55 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.5.mlp.experts), it might take a while...
[2025-09-06 08:41:58 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.6.mlp.experts), it might take a while...
[2025-09-06 08:42:01 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.7.mlp.experts), it might take a while...
[2025-09-06 08:42:04 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.8.mlp.experts), it might take a while...
[2025-09-06 08:42:07 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.9.mlp.experts), it might take a while...
[2025-09-06 08:42:10 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.10.mlp.experts), it might take a while...
[2025-09-06 08:42:13 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.11.mlp.experts), it might take a while...
[2025-09-06 08:42:16 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.12.mlp.experts), it might take a while...
[2025-09-06 08:42:19 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.13.mlp.experts), it might take a while...
[2025-09-06 08:42:22 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.14.mlp.experts), it might take a while...
[2025-09-06 08:42:25 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.15.mlp.experts), it might take a while...
[2025-09-06 08:42:28 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.16.mlp.experts), it might take a while...
[2025-09-06 08:42:31 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.17.mlp.experts), it might take a while...
[2025-09-06 08:42:34 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.18.mlp.experts), it might take a while...
[2025-09-06 08:42:38 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.19.mlp.experts), it might take a while...
[2025-09-06 08:42:41 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.20.mlp.experts), it might take a while...
[2025-09-06 08:42:44 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.21.mlp.experts), it might take a while...
[2025-09-06 08:42:47 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.22.mlp.experts), it might take a while...
[2025-09-06 08:42:50 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.23.mlp.experts), it might take a while...
[2025-09-06 08:42:53 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.24.mlp.experts), it might take a while...
[2025-09-06 08:42:56 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.25.mlp.experts), it might take a while...
[2025-09-06 08:42:59 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.26.mlp.experts), it might take a while...
[2025-09-06 08:43:02 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.27.mlp.experts), it might take a while...
[2025-09-06 08:43:05 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.28.mlp.experts), it might take a while...
[2025-09-06 08:43:08 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.29.mlp.experts), it might take a while...
[2025-09-06 08:43:11 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.30.mlp.experts), it might take a while...
[2025-09-06 08:43:14 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.31.mlp.experts), it might take a while...
[2025-09-06 08:43:18 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.32.mlp.experts), it might take a while...
[2025-09-06 08:43:21 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.33.mlp.experts), it might take a while...
[2025-09-06 08:43:24 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.34.mlp.experts), it might take a while...
[2025-09-06 08:43:27 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.35.mlp.experts), it might take a while...
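Note: the per-layer MoE shuffle dominates the load window; the safetensors read itself finishes in well under a second (1477 shards/s above), while the timestamps show roughly 3 s per layer across all 36 layers. A quick check of that accounting:

    # Rough accounting of the load window (a sketch from the timestamps above).
    layers = 36           # model.layers.0 .. model.layers.35 in the shuffle logs
    secs_per_layer = 3.1  # approximate cadence between consecutive shuffle lines
    print(layers * secs_per_layer)  # ~112 s, close to the 08:41:39 -> 08:43:30 span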
[2025-09-06 08:43:30 TP0] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=158.06 GB, mem usage=18.22 GB.
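Note: the 18.22 GB per-rank weight footprint is consistent with an MXFP4-dominated checkpoint split four ways. A rough sanity check, where the parameter count (~117B total) and the MoE-dominated split are assumptions taken from the public gpt-oss-120b model card, not from this log:

    # Back-of-the-envelope per-rank weight memory (a sketch, not sglang code).
    TOTAL_PARAMS = 117e9   # assumed total parameters for gpt-oss-120b
    MOE_FRACTION = 0.97    # assumption: almost all parameters sit in the experts
    MXFP4_BITS = 4.25      # 4-bit values plus per-block scales
    BF16_BITS = 16         # dense/attention weights kept in bf16
    TP = 4                 # --tp 4 from the launch command

    moe_bytes = TOTAL_PARAMS * MOE_FRACTION * MXFP4_BITS / 8
    dense_bytes = TOTAL_PARAMS * (1 - MOE_FRACTION) * BF16_BITS / 8
    print((moe_bytes + dense_bytes) / TP / 2**30)  # ~15.7 GB; workspace buffers and
    # non-expert tensors plausibly account for the gap to the logged 18.22 GB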
[2025-09-06 08:43:36 TP2] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:43:36 TP0] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:43:36 TP0] Memory pool end. avail mem=10.23 GB
[2025-09-06 08:43:36 TP3] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:43:36 TP1] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
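Note: the "K size: 72.85 GB" figure can be reproduced exactly from the model shape. The layer count (36) is visible in the shuffle logs above; the 8 KV heads and head_dim of 64 are assumptions from the public gpt-oss-120b config:

    # Reproduce the per-rank K-cache size reported above (bf16 KV cache).
    tokens = 8_487_040                           # #tokens from the log
    layers, kv_heads, head_dim, tp = 36, 8, 64, 4
    bytes_per_elem = 2                           # kv_cache_dtype='auto' -> bf16 here
    k_bytes = tokens * layers * (kv_heads // tp) * head_dim * bytes_per_elem
    print(f"{k_bytes / 2**30:.2f} GB")           # 72.85 GB; the V cache is identical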
[2025-09-06 08:43:36 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=9.54 GB
[2025-09-06 08:43:36 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200]
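Note: the 28 capture sizes follow a simple pattern, small powers of two and then a stride of 8 up to --cuda-graph-max-bs 200. One plausible way to regenerate the list (a sketch, not sglang's actual schedule code):

    max_bs = 200  # --cuda-graph-max-bs from the launch command
    bs_list = [1, 2, 4, 8] + list(range(16, max_bs + 1, 8))
    assert len(bs_list) == 28   # matches the 0/28 progress bar below
    print(bs_list)              # [1, 2, 4, 8, 16, 24, ..., 192, 200]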
Capturing batches (bs=200 avail_mem=9.39 GB):   0%|          | 0/28 [00:00<?, ?it/s]
rank 1 allocated ipc_handles: [['0x7ecb5c000000', '0x7ef116000000', '0x7ecb18000000', '0x7ecb14000000'], ['0x7ecb17000000', '0x7ecb16e00000', '0x7ecb17200000', '0x7ecb17400000'], ['0x7ecb00000000', '0x7ecb0a000000', '0x7ecaf6000000', '0x7ecaec000000']]
[2025-09-06 08:43:38.924] [info] lamportInitialize start: buffer: 0x7ecb0a000000, size: 71303168
rank 3 allocated ipc_handles: [['0x7cbe7c000000', '0x7cbe42000000', '0x7cbe3e000000', '0x7ce43c000000'], ['0x7cbe41000000', '0x7cbe41200000', '0x7cbe41400000', '0x7cbe40e00000'], ['0x7cbe2a000000', '0x7cbe20000000', '0x7cbe16000000', '0x7cbe34000000']]
[2025-09-06 08:43:38.972] [info] lamportInitialize start: buffer: 0x7cbe34000000, size: 71303168
rank 0 allocated ipc_handles: [['0x7bf8d2000000', '0x7bd316000000', '0x7bd2e4000000', '0x7bd2e0000000'], ['0x7bd2e2e00000', '0x7bd2e3000000', '0x7bd2e3200000', '0x7bd2e3400000'], ['0x7bd2d6000000', '0x7bd2cc000000', '0x7bd2c2000000', '0x7bd2b8000000']]
[2025-09-06 08:43:39.021] [info] lamportInitialize start: buffer: 0x7bd2d6000000, size: 71303168
rank 2 allocated ipc_handles: [['0x72c838000000', '0x72c834000000', '0x72ee34000000', '0x72c830000000'], ['0x72c833000000', '0x72c833200000', '0x72c832e00000', '0x72c833400000'], ['0x72c81c000000', '0x72c812000000', '0x72c826000000', '0x72c808000000']]
[2025-09-06 08:43:39.071] [info] lamportInitialize start: buffer: 0x72c826000000, size: 71303168
[2025-09-06 08:43:39 TP0] FlashInfer workspace initialized for rank 0, world_size 4
[2025-09-06 08:43:39 TP1] FlashInfer workspace initialized for rank 1, world_size 4
[2025-09-06 08:43:39 TP2] FlashInfer workspace initialized for rank 2, world_size 4
[2025-09-06 08:43:39 TP3] FlashInfer workspace initialized for rank 3, world_size 4
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 0 workspace[0] 0x7bf8d2000000
Rank 0 workspace[1] 0x7bd316000000
Rank 0 workspace[2] 0x7bd2e4000000
Rank 0 workspace[3] 0x7bd2e0000000
Rank 0 workspace[4] 0x7bd2e2e00000
Rank 0 workspace[5] 0x7bd2e3000000
Rank 0 workspace[6] 0x7bd2e3200000
Rank 0 workspace[7] 0x7bd2e3400000
Rank 0 workspace[8] 0x7bd2d6000000
Rank 0 workspace[9] 0x7bd2cc000000
Rank 0 workspace[10] 0x7bd2c2000000
Rank 0 workspace[11] 0x7bd2b8000000
Rank 0 workspace[12] 0x7bfec7264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 3 workspace[0] 0x7cbe7c000000
Rank 3 workspace[1] 0x7cbe42000000
Rank 3 workspace[2] 0x7cbe3e000000
Rank 3 workspace[3] 0x7ce43c000000
Rank 3 workspace[4] 0x7cbe41000000
Rank 3 workspace[5] 0x7cbe41200000
Rank 3 workspace[6] 0x7cbe41400000
Rank 3 workspace[7] 0x7cbe40e00000
Rank 3 workspace[8] 0x7cbe2a000000
Rank 3 workspace[9] 0x7cbe20000000
Rank 3 workspace[10] 0x7cbe16000000
Rank 3 workspace[11] 0x7cbe34000000
Rank 3 workspace[12] 0x7cea3b264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 2 workspace[0] 0x72c838000000
Rank 2 workspace[1] 0x72c834000000
Rank 2 workspace[2] 0x72ee34000000
Rank 2 workspace[3] 0x72c830000000
Rank 2 workspace[4] 0x72c833000000
Rank 2 workspace[5] 0x72c833200000
Rank 2 workspace[6] 0x72c832e00000
Rank 2 workspace[7] 0x72c833400000
Rank 2 workspace[8] 0x72c81c000000
Rank 2 workspace[9] 0x72c812000000
Rank 2 workspace[10] 0x72c826000000
Rank 2 workspace[11] 0x72c808000000
Rank 2 workspace[12] 0x72f441264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 1 workspace[0] 0x7ecb5c000000
Rank 1 workspace[1] 0x7ef116000000
Rank 1 workspace[2] 0x7ecb18000000
Rank 1 workspace[3] 0x7ecb14000000
Rank 1 workspace[4] 0x7ecb17000000
Rank 1 workspace[5] 0x7ecb16e00000
Rank 1 workspace[6] 0x7ecb17200000
Rank 1 workspace[7] 0x7ecb17400000
Rank 1 workspace[8] 0x7ecb00000000
Rank 1 workspace[9] 0x7ecb0a000000
Rank 1 workspace[10] 0x7ecaf6000000
Rank 1 workspace[11] 0x7ecaec000000
Rank 1 workspace[12] 0x7ef71f264400
Capturing batches (bs=200 avail_mem=9.39 GB):   4%|▎         | 1/28 [00:02<00:57,  2.12s/it]
Capturing batches (bs=1 avail_mem=7.82 GB): 100%|██████████| 28/28 [00:04<00:00,  6.13it/s]
[2025-09-06 08:43:41 TP3] Registering 56 cuda graph addresses
[2025-09-06 08:43:41 TP0] Registering 56 cuda graph addresses
[2025-09-06 08:43:41 TP2] Registering 56 cuda graph addresses
[2025-09-06 08:43:41 TP1] Registering 56 cuda graph addresses
[2025-09-06 08:43:41 TP0] Capture cuda graph end. Time elapsed: 5.10 s. mem usage=1.73 GB. avail mem=7.81 GB.
[2025-09-06 08:43:42 TP0] max_total_num_tokens=8487040, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=131072, available_gpu_mem=7.81 GB
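Note: the token budget is page-aligned; the startup warning forced page_size to 64 for the TRT-LLM MHA backend, and max_total_num_tokens divides evenly into pages:

    max_total_tokens, page_size = 8_487_040, 64
    assert max_total_tokens % page_size == 0
    print(max_total_tokens // page_size)  # 132610 KV pages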
[2025-09-06 08:43:43] INFO: Started server process [47053]
[2025-09-06 08:43:43] INFO: Waiting for application startup.
[2025-09-06 08:43:43] INFO: Application startup complete.
[2025-09-06 08:43:43] INFO: Uvicorn running on http://127.0.0.1:8400 (Press CTRL+C to quit)
[2025-09-06 08:43:44] INFO: 127.0.0.1:58318 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-09-06 08:43:44 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:43:45] INFO: 127.0.0.1:58322 - "POST /generate HTTP/1.1" 200 OK
[2025-09-06 08:43:45] The server is fired up and ready to roll!
[2025-09-06 08:43:51 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:43:52] INFO: 127.0.0.1:52732 - "GET /health_generate HTTP/1.1" 200 OK
command=python3 -m sglang.launch_server --model-path /home/yiliu7/models/openai/gpt-oss-120b --tp 4 --cuda-graph-max-bs 200 --mem-fraction-static 0.93 --device cuda --host 127.0.0.1 --port 8400
Evaluation start: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6
ChatCompletionSampler initialized with self.system_message=None self.temperature=0.1 self.max_tokens=4096 self.reasoning_effort='low'
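Note: the ChatCompletionSampler printed above belongs to the eval harness, whose source is not part of this log. A minimal stand-in that matches the settings it reports, targeting the same OpenAI-compatible endpoint the access log shows, might look like the sketch below (the reasoning_effort pass-through is an assumption about how the harness forwards it):

    import requests

    URL = "http://127.0.0.1:8400/v1/chat/completions"  # host/port from the launch command

    def sample(prompt: str) -> str:
        payload = {
            "model": "/home/yiliu7/models/openai/gpt-oss-120b",
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.1,          # self.temperature=0.1 above
            "max_tokens": 4096,          # self.max_tokens=4096 above
            "reasoning_effort": "low",   # assumed pass-through for gpt-oss
        }
        resp = requests.post(URL, json=payload, timeout=600)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]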
  0%|          | 0/198 [00:00<?, ?it/s]
[2025-09-06 08:43:53 TP0] Prefill batch. #new-seq: 1, #new-token: 320, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:43:53 TP0] Prefill batch. #new-seq: 3, #new-token: 768, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0,
[2025-09-06 08:43:53 TP0] Prefill batch. #new-seq: 4, #new-token: 1152, #cached-token: 256, token usage: 0.00, #running-req: 4, #queue-req: 0,
[2025-09-06 08:43:53 TP0] Prefill batch. #new-seq: 18, #new-token: 5568, #cached-token: 1152, token usage: 0.00, #running-req: 8, #queue-req: 0,
[2025-09-06 08:43:53 TP0] Prefill batch. #new-seq: 18, #new-token: 5120, #cached-token: 1152, token usage: 0.00, #running-req: 26, #queue-req: 0,
[2025-09-06 08:43:53 TP0] Prefill batch. #new-seq: 30, #new-token: 7872, #cached-token: 1920, token usage: 0.00, #running-req: 44, #queue-req: 0,
[2025-09-06 08:43:53 TP0] Prefill batch. #new-seq: 24, #new-token: 8320, #cached-token: 1536, token usage: 0.00, #running-req: 74, #queue-req: 0,
[2025-09-06 08:43:53 TP0] Prefill batch. #new-seq: 24, #new-token: 6592, #cached-token: 1536, token usage: 0.00, #running-req: 98, #queue-req: 0,
[2025-09-06 08:43:54 TP0] Prefill batch. #new-seq: 33, #new-token: 8896, #cached-token: 2176, token usage: 0.00, #running-req: 122, #queue-req: 0,
[2025-09-06 08:43:54 TP0] Prefill batch. #new-seq: 43, #new-token: 12224, #cached-token: 2816, token usage: 0.01, #running-req: 155, #queue-req: 0,
[2025-09-06 08:43:54 TP0] Decode batch. #running-req: 198, #token: 62848, token usage: 0.01, cuda graph: True, gen throughput (token/s): 487.94, #queue-req: 0,
[2025-09-06 08:43:54] INFO: 127.0.0.1:52948 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:54] INFO: 127.0.0.1:52900 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:54] INFO: 127.0.0.1:53670 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:55] INFO: 127.0.0.1:53800 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:55] INFO: 127.0.0.1:54298 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:55 TP0] Decode batch. #running-req: 194, #token: 68352, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17172.79, #queue-req: 0,
[2025-09-06 08:43:55] INFO: 127.0.0.1:52808 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:55] INFO: 127.0.0.1:53238 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:55] INFO: 127.0.0.1:52874 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:55] INFO: 127.0.0.1:53274 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:55] INFO: 127.0.0.1:54182 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:55] INFO: 127.0.0.1:53978 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:55] INFO: 127.0.0.1:53768 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:55 TP0] Decode batch. #running-req: 186, #token: 73792, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16945.77, #queue-req: 0,
[2025-09-06 08:43:55] INFO: 127.0.0.1:54494 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:55] INFO: 127.0.0.1:53900 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:55] INFO: 127.0.0.1:53894 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:55] INFO: 127.0.0.1:53452 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:55] INFO: 127.0.0.1:53336 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:55] INFO: 127.0.0.1:53294 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:55] INFO: 127.0.0.1:53440 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:55] INFO: 127.0.0.1:53628 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:55] INFO: 127.0.0.1:53856 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:55] INFO: 127.0.0.1:53178 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:56 TP0] Decode batch. #running-req: 176, #token: 74048, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16353.88, #queue-req: 0,
[2025-09-06 08:43:56] INFO: 127.0.0.1:54048 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:56] INFO: 127.0.0.1:53040 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:56] INFO: 127.0.0.1:54504 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:56] INFO: 127.0.0.1:54042 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:56] INFO: 127.0.0.1:53108 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:56] INFO: 127.0.0.1:52794 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:56] INFO: 127.0.0.1:53198 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:56 TP0] Decode batch. #running-req: 169, #token: 77760, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15971.45, #queue-req: 0,
[2025-09-06 08:43:56] INFO: 127.0.0.1:52958 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:56] INFO: 127.0.0.1:53600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:56] INFO: 127.0.0.1:53476 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:56] INFO: 127.0.0.1:53986 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:56] INFO: 127.0.0.1:54524 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:56] INFO: 127.0.0.1:53698 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:56] INFO: 127.0.0.1:53606 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:56 TP0] Decode batch. #running-req: 162, #token: 81472, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14919.82, #queue-req: 0,
[2025-09-06 08:43:57] INFO: 127.0.0.1:53120 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:57] INFO: 127.0.0.1:54530 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:57] INFO: 127.0.0.1:54378 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:57] INFO: 127.0.0.1:53050 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:57] INFO: 127.0.0.1:53356 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:57] INFO: 127.0.0.1:53162 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:57] INFO: 127.0.0.1:53864 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:57] INFO: 127.0.0.1:52768 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:57] INFO: 127.0.0.1:52924 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:57 TP0] Decode batch. #running-req: 154, #token: 82560, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14149.30, #queue-req: 0,
[2025-09-06 08:43:57] INFO: 127.0.0.1:53884 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:57] INFO: 127.0.0.1:52908 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:57] INFO: 127.0.0.1:54242 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:57] INFO: 127.0.0.1:54326 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:57] INFO: 127.0.0.1:53520 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:57] INFO: 127.0.0.1:53540 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:57] INFO: 127.0.0.1:53378 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:57] INFO: 127.0.0.1:53278 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:57] INFO: 127.0.0.1:54328 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:57] INFO: 127.0.0.1:52810 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:57] INFO: 127.0.0.1:53756 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:57 TP0] Decode batch. #running-req: 142, #token: 83264, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13411.74, #queue-req: 0,
[2025-09-06 08:43:58] INFO: 127.0.0.1:52864 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:53172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:53780 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:54050 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:54212 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58 TP0] Decode batch. #running-req: 137, #token: 85312, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12996.31, #queue-req: 0,
[2025-09-06 08:43:58] INFO: 127.0.0.1:54372 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:54190 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:53426 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:53958 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:53516 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:53652 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:52766 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58 TP0] Decode batch. #running-req: 130, #token: 86720, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12626.29, #queue-req: 0,
[2025-09-06 08:43:58] INFO: 127.0.0.1:54126 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:53928 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:53792 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:53636 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:53126 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:53066 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:53862 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:54260 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:54404 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:53966 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:53096 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:53506 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:58] INFO: 127.0.0.1:54096 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59 TP0] Decode batch. #running-req: 118, #token: 83264, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15464.60, #queue-req: 0,
[2025-09-06 08:43:59] INFO: 127.0.0.1:53626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:52788 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:52896 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:54354 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:53026 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:54226 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:53660 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:54290 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:54488 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:53322 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:53454 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:52840 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:54280 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:54432 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59 TP0] Decode batch. #running-req: 103, #token: 76160, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15313.38, #queue-req: 0,
[2025-09-06 08:43:59] INFO: 127.0.0.1:52744 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:54454 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:53218 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:54172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:54124 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:54274 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:53152 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:53312 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:53212 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:54156 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:54142 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:53342 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59 TP0] Decode batch. #running-req: 91, #token: 71104, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13987.12, #queue-req: 0,
[2025-09-06 08:43:59] INFO: 127.0.0.1:53504 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:53082 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:54314 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:53380 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:53024 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:53432 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:54262 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59 TP0] Decode batch. #running-req: 84, #token: 68288, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13118.97, #queue-req: 0,
[2025-09-06 08:43:59] INFO: 127.0.0.1:54442 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:53556 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:54344 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:54414 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:43:59] INFO: 127.0.0.1:52982 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:52970 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:53122 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00 TP0] Decode batch. #running-req: 77, #token: 65984, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12342.54, #queue-req: 0,
[2025-09-06 08:44:00] INFO: 127.0.0.1:53142 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:54370 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:53258 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:53062 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:54386 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:54250 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00 TP0] Decode batch. #running-req: 71, #token: 63680, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11621.65, #queue-req: 0,
[2025-09-06 08:44:00] INFO: 127.0.0.1:53462 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:53836 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:53720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:54206 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:53014 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:54400 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:54512 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:53186 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:52964 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:53832 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:53306 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:53682 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00 TP0] Decode batch. #running-req: 59, #token: 55040, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10315.34, #queue-req: 0,
[2025-09-06 08:44:00] INFO: 127.0.0.1:53580 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:53442 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:54478 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:54526 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:54066 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:53250 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00 TP0] Decode batch. #running-req: 54, #token: 51520, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9487.70, #queue-req: 0,
[2025-09-06 08:44:00] INFO: 127.0.0.1:53528 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:53562 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:53762 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:00] INFO: 127.0.0.1:54080 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:01] INFO: 127.0.0.1:53816 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:01 TP0] Decode batch. #running-req: 48, #token: 48768, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8627.06, #queue-req: 0,
[2025-09-06 08:44:01] INFO: 127.0.0.1:53000 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:01] INFO: 127.0.0.1:52882 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:01] INFO: 127.0.0.1:52828 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:01] INFO: 127.0.0.1:53734 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:01 TP0] Decode batch. #running-req: 45, #token: 45376, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8071.15, #queue-req: 0,
[2025-09-06 08:44:01] INFO: 127.0.0.1:53620 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:01] INFO: 127.0.0.1:52754 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:01] INFO: 127.0.0.1:53368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:01] INFO: 127.0.0.1:54412 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:01] INFO: 127.0.0.1:53844 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:01] INFO: 127.0.0.1:53872 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:01 TP0] Decode batch. #running-req: 38, #token: 41984, token usage: 0.00, cuda graph: True, gen throughput (token/s): 7190.87, #queue-req: 0,
[2025-09-06 08:44:01] INFO: 127.0.0.1:53576 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:01] INFO: 127.0.0.1:54002 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:01] INFO: 127.0.0.1:54146 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:01 TP0] Decode batch. #running-req: 35, #token: 40256, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6559.52, #queue-req: 0,
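Note: dividing the aggregate decode throughput by #running-req gives a rough per-request decode speed; sampling a few of the decode-batch lines above shows each stream speeding up as the batch drains:

    # Per-request decode speed at three points sampled from this run's log.
    for reqs, tput in [(194, 17172.79), (103, 15313.38), (48, 8627.06)]:
        print(reqs, f"{tput / reqs:.1f} tok/s per request")  # 88.5, 148.7, 179.7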
[2025-09-06 08:44:01] INFO: 127.0.0.1:54216 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  1%|          | 1/198 [00:08<27:03,  8.24s/it]
[2025-09-06 08:44:01 TP0] Decode batch. #running-req: 34, #token: 40192, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6195.61, #queue-req: 0,
[2025-09-06 08:44:01] INFO: 127.0.0.1:53030 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:02] INFO: 127.0.0.1:53170 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:02] INFO: 127.0.0.1:53790 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:02] INFO: 127.0.0.1:53708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:02] INFO: 127.0.0.1:52756 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:02] INFO: 127.0.0.1:53228 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:02 TP0] Decode batch. #running-req: 29, #token: 34112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5702.41, #queue-req: 0,
[2025-09-06 08:44:02] INFO: 127.0.0.1:53416 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:02] INFO: 127.0.0.1:52850 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:02] INFO: 127.0.0.1:53392 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:02] INFO: 127.0.0.1:54462 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:02] INFO: 127.0.0.1:53914 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:02 TP0] Decode batch. #running-req: 23, #token: 29184, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4869.99, #queue-req: 0,
[2025-09-06 08:44:02] INFO: 127.0.0.1:52812 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:02] INFO: 127.0.0.1:52890 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:02] INFO: 127.0.0.1:52936 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:02] INFO: 127.0.0.1:54422 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:02 TP0] Decode batch. #running-req: 19, #token: 24896, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3986.23, #queue-req: 0,
[2025-09-06 08:44:02] INFO: 127.0.0.1:53492 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:02] INFO: 127.0.0.1:54218 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  1%|          | 2/198 [00:09<12:50,  3.93s/it]
[2025-09-06 08:44:02] INFO: 127.0.0.1:53826 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:02 TP0] Decode batch. #running-req: 16, #token: 21888, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3458.66, #queue-req: 0,
[2025-09-06 08:44:02] INFO: 127.0.0.1:54016 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:02] INFO: 127.0.0.1:54138 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:02] INFO: 127.0.0.1:54108 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:02 TP0] Decode batch. #running-req: 13, #token: 18048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3017.86, #queue-req: 0,
[2025-09-06 08:44:03] INFO: 127.0.0.1:53590 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:03] INFO: 127.0.0.1:52782 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:03 TP0] Decode batch. #running-req: 11, #token: 15680, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2492.21, #queue-req: 0,
[2025-09-06 08:44:03] INFO: 127.0.0.1:54300 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  6%|▌         | 12/198 [00:09<01:26,  2.15it/s]
[2025-09-06 08:44:03 TP0] Decode batch. #running-req: 10, #token: 14784, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2218.82, #queue-req: 0,
[2025-09-06 08:44:03 TP0] Decode batch. #running-req: 10, #token: 14976, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2132.96, #queue-req: 0,
[2025-09-06 08:44:03] INFO: 127.0.0.1:53124 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:03] INFO: 127.0.0.1:54028 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:03] INFO: 127.0.0.1:53400 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:03 TP0] Decode batch. #running-req: 7, #token: 10880, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1887.84, #queue-req: 0,
[2025-09-06 08:44:03] INFO: 127.0.0.1:53076 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:03 TP0] Decode batch. #running-req: 6, #token: 9536, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1548.43, #queue-req: 0,
[2025-09-06 08:44:03] INFO: 127.0.0.1:53944 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:04 TP0] Decode batch. #running-req: 5, #token: 8064, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1236.11, #queue-req: 0,
[2025-09-06 08:44:04] INFO: 127.0.0.1:53746 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:04 TP0] Decode batch. #running-req: 4, #token: 4928, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1162.51, #queue-req: 0,
[2025-09-06 08:44:04] INFO: 127.0.0.1:54486 - "POST /v1/chat/completions HTTP/1.1" 200 OK
 16%|█▌        | 32/198 [00:10<00:27,  6.01it/s]
[2025-09-06 08:44:04 TP0] Decode batch. #running-req: 3, #token: 4992, token usage: 0.00, cuda graph: True, gen throughput (token/s): 798.81, #queue-req: 0,
[2025-09-06 08:44:04 TP0] Decode batch. #running-req: 3, #token: 5120, token usage: 0.00, cuda graph: True, gen throughput (token/s): 794.70, #queue-req: 0,
[2025-09-06 08:44:04 TP0] Decode batch. #running-req: 3, #token: 5312, token usage: 0.00, cuda graph: True, gen throughput (token/s): 795.25, #queue-req: 0,
[2025-09-06 08:44:04] INFO: 127.0.0.1:52852 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:44:04 TP0] Decode batch. #running-req: 3, #token: 3648, token usage: 0.00, cuda graph: True, gen throughput (token/s): 781.23, #queue-req: 0,
 27%|██▋       | 54/198 [00:11<00:13, 10.80it/s]
[2025-09-06 08:44:04] INFO: 127.0.0.1:52992 - "POST /v1/chat/completions HTTP/1.1" 200 OK
 35%|███▍      | 69/198 [00:11<00:08, 15.68it/s]
[2025-09-06 08:44:04 TP0] Decode batch. #running-req: 1, #token: 1920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 530.75, #queue-req: 0,
[2025-09-06 08:44:05 TP0] Decode batch. #running-req: 1, #token: 1920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.02, #queue-req: 0,
[2025-09-06 08:44:05 TP0] Decode batch. #running-req: 1, #token: 1984, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.12, #queue-req: 0,
[2025-09-06 08:44:05 TP0] Decode batch. #running-req: 1, #token: 2048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.90, #queue-req: 0,
[2025-09-06 08:44:05 TP0] Decode batch. #running-req: 1, #token: 2048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.23, #queue-req: 0,
[2025-09-06 08:44:05 TP0] Decode batch. #running-req: 1, #token: 2112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.86, #queue-req: 0,
[2025-09-06 08:44:05 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 324.77, #queue-req: 0,
[2025-09-06 08:44:05 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.95, #queue-req: 0,
[2025-09-06 08:44:05 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.11, #queue-req: 0,
[2025-09-06 08:44:06] INFO: 127.0.0.1:53362 - "POST /v1/chat/completions HTTP/1.1" 200 OK
 56%|█████▌    | 110/198 [00:12<00:03, 23.46it/s]
100%|██████████| 198/198 [00:12<00:00, 15.86it/s]
/usr/lib/python3.12/subprocess.py:1127: ResourceWarning: subprocess 47053 is still running
_warn("subprocess %s is still running" % self.pid,
ResourceWarning: Enable tracemalloc to get the object allocation traceback
.
----------------------------------------------------------------------
Ran 1 test in 177.127s
OK
Writing report to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.html
{'chars': 1697.6010101010102, 'chars:std': 924.1141307574765, 'score:std': 0.48379515211426455, 'score': 0.6262626262626263}
Writing results to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.json
Total latency: 12.546 s
Score: 0.626
Evaluation end: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 metrics={'chars': 1697.6010101010102, 'chars:std': 924.1141307574765, 'score:std': 0.48379515211426455, 'score': 0.6262626262626263}
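Note: the metrics are internally consistent; with 198 GPQA questions, the mean score implies exactly 124 correct answers, and score:std is the population standard deviation of the 0/1 outcomes:

    import math

    n, correct = 198, 124
    p = correct / n
    print(p)                       # 0.6262626262626263, matching 'score'
    print(math.sqrt(p * (1 - p)))  # 0.4837951521..., matching 'score:std'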
================================================================================
Run 7:
Auto-configed device: cuda
WARNING:sglang.srt.server_args:Detected SM100 and MXFP4 quantization format for GPT-OSS model, enabling FlashInfer MXFP4 MOE kernel.
WARNING:sglang.srt.server_args:TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64.
[2025-09-06 08:44:20] server_args=ServerArgs(model_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8400, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.93, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=True, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=917768611, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/home/yiliu7/models/openai/gpt-oss-120b', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='flashinfer_mxfp4', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, 
hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=200, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False)
All deep_gemm operations loaded successfully!
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:44:20] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:44:20] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:44:21] Using default HuggingFace chat template with detected content format: string
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:44:27 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:44:27 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:44:27 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:44:27 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:44:27 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:44:27 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:44:28 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:44:28 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:44:28 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:44:28 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:44:28 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:44:28 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:44:28 TP0] Init torch distributed begin.
[2025-09-06 08:44:28 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:44:28 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:44:28 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:44:28 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:44:30 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:44:32 TP0] Init torch distributed ends. mem usage=1.46 GB
[2025-09-06 08:44:33 TP0] Load weight begin. avail mem=176.28 GB
All deep_gemm operations loaded successfully!
Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 1574.99it/s]
[2025-09-06 08:44:42 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.0.mlp.experts), it might take a while...
All deep_gemm operations loaded successfully!
All deep_gemm operations loaded successfully!
All deep_gemm operations loaded successfully!
[2025-09-06 08:44:46 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.1.mlp.experts), it might take a while...
[2025-09-06 08:44:49 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.2.mlp.experts), it might take a while...
[2025-09-06 08:44:52 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.3.mlp.experts), it might take a while...
[2025-09-06 08:44:55 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.4.mlp.experts), it might take a while...
[2025-09-06 08:44:58 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.5.mlp.experts), it might take a while...
[2025-09-06 08:45:01 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.6.mlp.experts), it might take a while...
[2025-09-06 08:45:04 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.7.mlp.experts), it might take a while...
[2025-09-06 08:45:07 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.8.mlp.experts), it might take a while...
[2025-09-06 08:45:10 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.9.mlp.experts), it might take a while...
[2025-09-06 08:45:13 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.10.mlp.experts), it might take a while...
[2025-09-06 08:45:16 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.11.mlp.experts), it might take a while...
[2025-09-06 08:45:19 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.12.mlp.experts), it might take a while...
[2025-09-06 08:45:22 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.13.mlp.experts), it might take a while...
[2025-09-06 08:45:25 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.14.mlp.experts), it might take a while...
[2025-09-06 08:45:28 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.15.mlp.experts), it might take a while...
[2025-09-06 08:45:31 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.16.mlp.experts), it might take a while...
[2025-09-06 08:45:34 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.17.mlp.experts), it might take a while...
[2025-09-06 08:45:37 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.18.mlp.experts), it might take a while...
[2025-09-06 08:45:40 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.19.mlp.experts), it might take a while...
[2025-09-06 08:45:43 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.20.mlp.experts), it might take a while...
[2025-09-06 08:45:46 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.21.mlp.experts), it might take a while...
[2025-09-06 08:45:49 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.22.mlp.experts), it might take a while...
[2025-09-06 08:45:52 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.23.mlp.experts), it might take a while...
[2025-09-06 08:45:55 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.24.mlp.experts), it might take a while...
[2025-09-06 08:45:58 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.25.mlp.experts), it might take a while...
[2025-09-06 08:46:01 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.26.mlp.experts), it might take a while...
[2025-09-06 08:46:05 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.27.mlp.experts), it might take a while...
[2025-09-06 08:46:08 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.28.mlp.experts), it might take a while...
[2025-09-06 08:46:11 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.29.mlp.experts), it might take a while...
[2025-09-06 08:46:14 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.30.mlp.experts), it might take a while...
[2025-09-06 08:46:17 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.31.mlp.experts), it might take a while...
[2025-09-06 08:46:20 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.32.mlp.experts), it might take a while...
[2025-09-06 08:46:23 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.33.mlp.experts), it might take a while...
[2025-09-06 08:46:26 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.34.mlp.experts), it might take a while...
[2025-09-06 08:46:29 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.35.mlp.experts), it might take a while...
[2025-09-06 08:46:32 TP0] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=158.06 GB, mem usage=18.22 GB.
[2025-09-06 08:46:37 TP0] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:46:37 TP2] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:46:37 TP0] Memory pool end. avail mem=10.23 GB
[2025-09-06 08:46:37 TP3] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:46:37 TP1] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:46:38 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=9.54 GB
[2025-09-06 08:46:38 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200]
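Annotation: the 28-entry capture schedule above appears to be [1, 2, 4] followed by every multiple of 8 up to --cuda-graph-max-bs=200. A minimal sketch that reproduces the list under that assumed padding rule (function name is illustrative, not SGLang API):

```python
# Hypothetical reconstruction of the cuda-graph batch-size schedule logged above:
# small sizes 1, 2, 4, then multiples of 8 up to --cuda-graph-max-bs.
def capture_bs_schedule(max_bs: int = 200) -> list[int]:
    return [1, 2, 4] + list(range(8, max_bs + 1, 8))

assert len(capture_bs_schedule(200)) == 28  # matches the 0/28 progress bar below
```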
Capturing batches (bs=200 avail_mem=9.39 GB): 0%| | 0/28 [00:00<?, ?it/s]
rank 2 allocated ipc_handles: [['0x778cbc000000', '0x778c78000000', '0x77b276000000', '0x778c74000000'], ['0x778c77000000', '0x778c77200000', '0x778c76e00000', '0x778c77400000'], ['0x778c60000000', '0x778c56000000', '0x778c6a000000', '0x778c4c000000']]
[2025-09-06 08:46:40.032] [info] lamportInitialize start: buffer: 0x778c6a000000, size: 71303168
rank 0 allocated ipc_handles: [['0x75212a000000', '0x74fb8e000000', '0x74fb38000000', '0x74fb34000000'], ['0x74fb36e00000', '0x74fb37000000', '0x74fb37200000', '0x74fb37400000'], ['0x74fb2a000000', '0x74fb20000000', '0x74fb16000000', '0x74fb0c000000']]
[2025-09-06 08:46:40.082] [info] lamportInitialize start: buffer: 0x74fb2a000000, size: 71303168
rank 3 allocated ipc_handles: [['0x78650c000000', '0x7864ae000000', '0x7864aa000000', '0x788aa8000000'], ['0x7864ad000000', '0x7864ad200000', '0x7864ad400000', '0x7864ace00000'], ['0x786496000000', '0x78648c000000', '0x786482000000', '0x7864a0000000']]
[2025-09-06 08:46:40.131] [info] lamportInitialize start: buffer: 0x7864a0000000, size: 71303168
rank 1 allocated ipc_handles: [['0x7f7614000000', '0x7f9baa000000', '0x7f75b0000000', '0x7f75ac000000'], ['0x7f75af000000', '0x7f75aee00000', '0x7f75af200000', '0x7f75af400000'], ['0x7f7598000000', '0x7f75a2000000', '0x7f758e000000', '0x7f7584000000']]
[2025-09-06 08:46:40.181] [info] lamportInitialize start: buffer: 0x7f75a2000000, size: 71303168
[2025-09-06 08:46:40 TP0] FlashInfer workspace initialized for rank 0, world_size 4
[2025-09-06 08:46:40 TP1] FlashInfer workspace initialized for rank 1, world_size 4
[2025-09-06 08:46:40 TP3] FlashInfer workspace initialized for rank 3, world_size 4
[2025-09-06 08:46:40 TP2] FlashInfer workspace initialized for rank 2, world_size 4
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 3 workspace[0] 0x78650c000000
Rank 3 workspace[1] 0x7864ae000000
Rank 3 workspace[2] 0x7864aa000000
Rank 3 workspace[3] 0x788aa8000000
Rank 3 workspace[4] 0x7864ad000000
Rank 3 workspace[5] 0x7864ad200000
Rank 3 workspace[6] 0x7864ad400000
Rank 3 workspace[7] 0x7864ace00000
Rank 3 workspace[8] 0x786496000000
Rank 3 workspace[9] 0x78648c000000
Rank 3 workspace[10] 0x786482000000
Rank 3 workspace[11] 0x7864a0000000
Rank 3 workspace[12] 0x7890a3264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 0 workspace[0] 0x75212a000000
Rank 0 workspace[1] 0x74fb8e000000
Rank 0 workspace[2] 0x74fb38000000
Rank 0 workspace[3] 0x74fb34000000
Rank 0 workspace[4] 0x74fb36e00000
Rank 0 workspace[5] 0x74fb37000000
Rank 0 workspace[6] 0x74fb37200000
Rank 0 workspace[7] 0x74fb37400000
Rank 0 workspace[8] 0x74fb2a000000
Rank 0 workspace[9] 0x74fb20000000
Rank 0 workspace[10] 0x74fb16000000
Rank 0 workspace[11] 0x74fb0c000000
Rank 0 workspace[12] 0x75271f264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 1 workspace[0] 0x7f7614000000
Rank 1 workspace[1] 0x7f9baa000000
Rank 1 workspace[2] 0x7f75b0000000
Rank 1 workspace[3] 0x7f75ac000000
Rank 1 workspace[4] 0x7f75af000000
Rank 1 workspace[5] 0x7f75aee00000
Rank 1 workspace[6] 0x7f75af200000
Rank 1 workspace[7] 0x7f75af400000
Rank 1 workspace[8] 0x7f7598000000
Rank 1 workspace[9] 0x7f75a2000000
Rank 1 workspace[10] 0x7f758e000000
Rank 1 workspace[11] 0x7f7584000000
Rank 1 workspace[12] 0x7fa1b3264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 2 workspace[0] 0x778cbc000000
Rank 2 workspace[1] 0x778c78000000
Rank 2 workspace[2] 0x77b276000000
Rank 2 workspace[3] 0x778c74000000
Rank 2 workspace[4] 0x778c77000000
Rank 2 workspace[5] 0x778c77200000
Rank 2 workspace[6] 0x778c76e00000
Rank 2 workspace[7] 0x778c77400000
Rank 2 workspace[8] 0x778c60000000
Rank 2 workspace[9] 0x778c56000000
Rank 2 workspace[10] 0x778c6a000000
Rank 2 workspace[11] 0x778c4c000000
Rank 2 workspace[12] 0x77b881264400
Capturing batches (bs=200 → 1, avail_mem 9.39 → 7.82 GB): 100%|██████████| 28/28 [00:04<00:00, 6.74it/s]
[2025-09-06 08:46:42 TP0] Registering 56 cuda graph addresses
[2025-09-06 08:46:42 TP3] Registering 56 cuda graph addresses
[2025-09-06 08:46:42 TP1] Registering 56 cuda graph addresses
[2025-09-06 08:46:42 TP2] Registering 56 cuda graph addresses
[2025-09-06 08:46:42 TP0] Capture cuda graph end. Time elapsed: 4.66 s. mem usage=1.73 GB. avail mem=7.81 GB.
[2025-09-06 08:46:43 TP0] max_total_num_tokens=8487040, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=131072, available_gpu_mem=7.81 GB
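Annotation: the KV-cache figures above are internally consistent. The model has 36 layers (layers 0-35 shuffled above); assuming gpt-oss-120b's 8 KV heads of head_dim 64 split across tp=4, each rank stores 36 × 2 × 64 bf16 values per token per side. A back-of-envelope check, with the head counts treated as assumptions:

```python
# Rough check of the reported K size (72.85 GB for 8,487,040 tokens per rank).
# Assumed model shape: 36 layers, 8 KV heads, head_dim 64, tp=4, bf16 (2 bytes).
tokens = 8_487_040
layers, kv_heads, head_dim, tp, dtype_bytes = 36, 8, 64, 4, 2
bytes_per_token_per_rank = layers * (kv_heads // tp) * head_dim * dtype_bytes  # 9216
k_size_gib = tokens * bytes_per_token_per_rank / 2**30
print(f"{k_size_gib:.2f} GiB")  # ~72.85, matching the log line above
```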
[2025-09-06 08:46:44] INFO: Started server process [49519]
[2025-09-06 08:46:44] INFO: Waiting for application startup.
[2025-09-06 08:46:44] INFO: Application startup complete.
[2025-09-06 08:46:44] INFO: Uvicorn running on http://127.0.0.1:8400 (Press CTRL+C to quit)
[2025-09-06 08:46:45] INFO: 127.0.0.1:42690 - "GET /health_generate HTTP/1.1" 503 Service Unavailable
[2025-09-06 08:46:45] INFO: 127.0.0.1:42702 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-09-06 08:46:45 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:46:46] INFO: 127.0.0.1:42716 - "POST /generate HTTP/1.1" 200 OK
[2025-09-06 08:46:46] The server is fired up and ready to roll!
[2025-09-06 08:46:55 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:46:56] INFO: 127.0.0.1:47496 - "GET /health_generate HTTP/1.1" 200 OK
command=python3 -m sglang.launch_server --model-path /home/yiliu7/models/openai/gpt-oss-120b --tp 4 --cuda-graph-max-bs 200 --mem-fraction-static 0.93 --device cuda --host 127.0.0.1 --port 8400
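Annotation: the server launched by the command above exposes an OpenAI-compatible API on 127.0.0.1:8400, which is what the /v1/chat/completions entries in the access log are hitting. A minimal client sketch against that endpoint (the prompt and sampling values are illustrative; the model name matches served_model_name):

```python
# Minimal sketch of a request to the server launched above; the endpoint path
# matches the access log, the prompt itself is illustrative.
import requests

resp = requests.post(
    "http://127.0.0.1:8400/v1/chat/completions",
    json={
        "model": "/home/yiliu7/models/openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 64,
        "temperature": 0.1,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```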
Evaluation start: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6
ChatCompletionSampler initialized with self.system_message=None self.temperature=0.1 self.max_tokens=4096 self.reasoning_effort='low'
0%| | 0/198 [00:00<?, ?it/s]
[2025-09-06 08:46:56 TP0] Prefill batch. #new-seq: 1, #new-token: 320, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:46:56 TP0] Prefill batch. #new-seq: 3, #new-token: 960, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0,
[2025-09-06 08:46:57 TP0] Prefill batch. #new-seq: 9, #new-token: 2304, #cached-token: 576, token usage: 0.00, #running-req: 4, #queue-req: 0,
[2025-09-06 08:46:57 TP0] Prefill batch. #new-seq: 44, #new-token: 13184, #cached-token: 2816, token usage: 0.00, #running-req: 13, #queue-req: 0,
[2025-09-06 08:46:57 TP0] Prefill batch. #new-seq: 23, #new-token: 8448, #cached-token: 1472, token usage: 0.00, #running-req: 57, #queue-req: 0,
[2025-09-06 08:46:57 TP0] Prefill batch. #new-seq: 45, #new-token: 11200, #cached-token: 2880, token usage: 0.00, #running-req: 80, #queue-req: 0,
[2025-09-06 08:46:57 TP0] Prefill batch. #new-seq: 24, #new-token: 6848, #cached-token: 1600, token usage: 0.00, #running-req: 125, #queue-req: 0,
[2025-09-06 08:46:57 TP0] Prefill batch. #new-seq: 49, #new-token: 13568, #cached-token: 3200, token usage: 0.01, #running-req: 149, #queue-req: 0,
[2025-09-06 08:46:57 TP0] Decode batch. #running-req: 198, #token: 62976, token usage: 0.01, cuda graph: True, gen throughput (token/s): 422.71, #queue-req: 0,
[2025-09-06 08:46:57] INFO: 127.0.0.1:48442 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:58] INFO: 127.0.0.1:48364 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:58] INFO: 127.0.0.1:47634 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:58 TP0] Decode batch. #running-req: 195, #token: 69760, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17213.36, #queue-req: 0,
[2025-09-06 08:46:58] INFO: 127.0.0.1:47522 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:58] INFO: 127.0.0.1:47972 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:58] INFO: 127.0.0.1:47900 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:58] INFO: 127.0.0.1:48432 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:58] INFO: 127.0.0.1:47994 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:58 TP0] Decode batch. #running-req: 190, #token: 75968, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17152.43, #queue-req: 0,
[2025-09-06 08:46:58] INFO: 127.0.0.1:47612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:58] INFO: 127.0.0.1:48514 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:58] INFO: 127.0.0.1:48768 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:58] INFO: 127.0.0.1:47846 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:59] INFO: 127.0.0.1:49120 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:59] INFO: 127.0.0.1:48562 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:59] INFO: 127.0.0.1:48578 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:59] INFO: 127.0.0.1:48894 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:59] INFO: 127.0.0.1:48012 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:59] INFO: 127.0.0.1:48018 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:59] INFO: 127.0.0.1:48142 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:59] INFO: 127.0.0.1:48648 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:59 TP0] Decode batch. #running-req: 178, #token: 76928, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16526.97, #queue-req: 0,
[2025-09-06 08:46:59] INFO: 127.0.0.1:49034 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:59] INFO: 127.0.0.1:47916 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:59] INFO: 127.0.0.1:48330 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:59] INFO: 127.0.0.1:48710 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:59] INFO: 127.0.0.1:48130 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:59] INFO: 127.0.0.1:48858 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:59 TP0] Decode batch. #running-req: 172, #token: 79296, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16031.50, #queue-req: 0,
[2025-09-06 08:46:59] INFO: 127.0.0.1:47990 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:59] INFO: 127.0.0.1:48380 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:59] INFO: 127.0.0.1:48622 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:59] INFO: 127.0.0.1:48654 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:46:59] INFO: 127.0.0.1:49236 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:00] INFO: 127.0.0.1:48286 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:00] INFO: 127.0.0.1:47750 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:00] INFO: 127.0.0.1:48898 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:00 TP0] Decode batch. #running-req: 164, #token: 82368, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15193.77, #queue-req: 0,
[2025-09-06 08:47:00] INFO: 127.0.0.1:47760 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:00] INFO: 127.0.0.1:48718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:00] INFO: 127.0.0.1:49222 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:00] INFO: 127.0.0.1:48996 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:00] INFO: 127.0.0.1:49008 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:00] INFO: 127.0.0.1:47802 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:00] INFO: 127.0.0.1:47734 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:00 TP0] Decode batch. #running-req: 157, #token: 84032, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14316.44, #queue-req: 0,
[2025-09-06 08:47:00] INFO: 127.0.0.1:47976 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:00] INFO: 127.0.0.1:47564 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:00] INFO: 127.0.0.1:47608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:00] INFO: 127.0.0.1:48228 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:00] INFO: 127.0.0.1:48054 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:00] INFO: 127.0.0.1:49080 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:00] INFO: 127.0.0.1:48176 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:00] INFO: 127.0.0.1:48834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:00] INFO: 127.0.0.1:48942 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:00] INFO: 127.0.0.1:48284 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:00] INFO: 127.0.0.1:48528 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:01] INFO: 127.0.0.1:48444 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:01 TP0] Decode batch. #running-req: 145, #token: 83904, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13590.98, #queue-req: 0,
[2025-09-06 08:47:01] INFO: 127.0.0.1:48408 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:01] INFO: 127.0.0.1:49022 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:01] INFO: 127.0.0.1:49142 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:01] INFO: 127.0.0.1:48096 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:01] INFO: 127.0.0.1:47710 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:01] INFO: 127.0.0.1:47768 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:01] INFO: 127.0.0.1:47622 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:01] INFO: 127.0.0.1:48666 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:01 TP0] Decode batch. #running-req: 137, #token: 85312, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13062.56, #queue-req: 0,
[2025-09-06 08:47:01] INFO: 127.0.0.1:47814 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:01] INFO: 127.0.0.1:48222 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:01] INFO: 127.0.0.1:47892 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:01] INFO: 127.0.0.1:48314 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:01] INFO: 127.0.0.1:49194 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:01] INFO: 127.0.0.1:47848 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:01 TP0] Decode batch. #running-req: 131, #token: 86528, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12544.55, #queue-req: 0,
[2025-09-06 08:47:01] INFO: 127.0.0.1:48158 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:47646 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:48302 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:48520 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:49028 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:48206 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:48552 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:47694 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:48766 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:48914 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:47766 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:48236 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:47586 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:48016 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02 TP0] Decode batch. #running-req: 117, #token: 81664, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14231.44, #queue-req: 0,
[2025-09-06 08:47:02] INFO: 127.0.0.1:47552 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:47502 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:48626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:48350 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:48732 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:48982 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:47872 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:48794 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:48846 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:48888 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:49174 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02 TP0] Decode batch. #running-req: 106, #token: 78528, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15304.65, #queue-req: 0,
[2025-09-06 08:47:02] INFO: 127.0.0.1:49088 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:48642 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:48108 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:47950 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:48076 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:48742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02 TP0] Decode batch. #running-req: 100, #token: 78528, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14633.89, #queue-req: 0,
[2025-09-06 08:47:02] INFO: 127.0.0.1:48186 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:48010 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:02] INFO: 127.0.0.1:48618 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:48478 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:49046 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03 TP0] Decode batch. #running-req: 95, #token: 78272, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13824.30, #queue-req: 0,
[2025-09-06 08:47:03] INFO: 127.0.0.1:47974 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:48750 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:48772 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:48864 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:48234 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:49184 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:47794 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:48046 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:48884 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:48818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03 TP0] Decode batch. #running-req: 85, #token: 73408, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13331.56, #queue-req: 0,
[2025-09-06 08:47:03] INFO: 127.0.0.1:47940 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:48896 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:48918 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:48592 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:48246 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:48402 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:48878 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:48780 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03 TP0] Decode batch. #running-req: 77, #token: 70080, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12444.69, #queue-req: 0,
[2025-09-06 08:47:03] INFO: 127.0.0.1:48390 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:49010 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:49234 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:49128 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:48248 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:48134 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03] INFO: 127.0.0.1:48166 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:03 TP0] Decode batch. #running-req: 70, #token: 66176, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11601.78, #queue-req: 0,
[2025-09-06 08:47:03] INFO: 127.0.0.1:47648 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:48406 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:49164 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:47656 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:49104 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04 TP0] Decode batch. #running-req: 66, #token: 63680, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11035.15, #queue-req: 0,
[2025-09-06 08:47:04] INFO: 127.0.0.1:47574 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:48502 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:48682 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:47830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:47924 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:48694 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04 TP0] Decode batch. #running-req: 60, #token: 60544, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10158.47, #queue-req: 0,
[2025-09-06 08:47:04] INFO: 127.0.0.1:48004 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:48958 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:48404 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:48600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:47514 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:48636 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:48340 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:48602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04 TP0] Decode batch. #running-req: 51, #token: 53696, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9236.80, #queue-req: 0,
[2025-09-06 08:47:04] INFO: 127.0.0.1:47628 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:48462 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:49156 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:47578 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:48262 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:47968 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:48474 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:47718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:48798 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:49210 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:48542 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04 TP0] Decode batch. #running-req: 40, #token: 44032, token usage: 0.01, cuda graph: True, gen throughput (token/s): 7980.75, #queue-req: 0,
[2025-09-06 08:47:04] INFO: 127.0.0.1:48456 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:47860 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:04] INFO: 127.0.0.1:47964 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:05] INFO: 127.0.0.1:48066 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:05 TP0] Decode batch. #running-req: 36, #token: 41280, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6785.10, #queue-req: 0,
[2025-09-06 08:47:05] INFO: 127.0.0.1:48268 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:05] INFO: 127.0.0.1:48030 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:05] INFO: 127.0.0.1:47884 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:05] INFO: 127.0.0.1:48418 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:05] INFO: 127.0.0.1:47598 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:05 TP0] Decode batch. #running-req: 31, #token: 36864, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5993.33, #queue-req: 0,
[2025-09-06 08:47:05] INFO: 127.0.0.1:48120 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:05] INFO: 127.0.0.1:48298 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:05] INFO: 127.0.0.1:48984 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:05] INFO: 127.0.0.1:48966 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:05] INFO: 127.0.0.1:48580 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:05] INFO: 127.0.0.1:47668 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:05 TP0] Decode batch. #running-req: 25, #token: 31104, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5349.57, #queue-req: 0,
[2025-09-06 08:47:05] INFO: 127.0.0.1:48382 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:05] INFO: 127.0.0.1:47678 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:05] INFO: 127.0.0.1:48810 - "POST /v1/chat/completions HTTP/1.1" 200 OK
1%| | 1/198 [00:08<28:53, 8.80s/it]
[2025-09-06 08:47:05 TP0] Decode batch. #running-req: 22, #token: 28096, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4576.68, #queue-req: 0,
[2025-09-06 08:47:05] INFO: 127.0.0.1:49074 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:05] INFO: 127.0.0.1:48230 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:05] INFO: 127.0.0.1:49090 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:05 TP0] Decode batch. #running-req: 19, #token: 25280, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4138.92, #queue-req: 0,
[2025-09-06 08:47:06] INFO: 127.0.0.1:48492 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:06] INFO: 127.0.0.1:49058 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:06] INFO: 127.0.0.1:47858 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:06 TP0] Decode batch. #running-req: 16, #token: 21824, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3493.30, #queue-req: 0,
[2025-09-06 08:47:06] INFO: 127.0.0.1:48926 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:06 TP0] Decode batch. #running-req: 15, #token: 20864, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3256.37, #queue-req: 0,
[2025-09-06 08:47:06] INFO: 127.0.0.1:48460 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:06] INFO: 127.0.0.1:48716 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:06] INFO: 127.0.0.1:48850 - "POST /v1/chat/completions HTTP/1.1" 200 OK
1%| | 2/198 [00:09<13:20, 4.08s/it]
[2025-09-06 08:47:06] INFO: 127.0.0.1:49178 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:06 TP0] Decode batch. #running-req: 11, #token: 15808, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2734.59, #queue-req: 0,
[2025-09-06 08:47:06] INFO: 127.0.0.1:48080 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:06 TP0] Decode batch. #running-req: 10, #token: 14976, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2231.28, #queue-req: 0,
[2025-09-06 08:47:06] INFO: 127.0.0.1:47778 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:06 TP0] Decode batch. #running-req: 9, #token: 13632, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2032.86, #queue-req: 0,
[2025-09-06 08:47:06] INFO: 127.0.0.1:49012 - "POST /v1/chat/completions HTTP/1.1" 200 OK
12%|█▏ | 24/198 [00:10<00:40, 4.32it/s]
[2025-09-06 08:47:07] INFO: 127.0.0.1:48744 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:07 TP0] Decode batch. #running-req: 7, #token: 11200, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1870.16, #queue-req: 0,
[2025-09-06 08:47:07] INFO: 127.0.0.1:47756 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:07 TP0] Decode batch. #running-req: 6, #token: 9600, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1521.52, #queue-req: 0,
[2025-09-06 08:47:07 TP0] Decode batch. #running-req: 6, #token: 9856, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1447.87, #queue-req: 0,
[2025-09-06 08:47:07 TP0] Decode batch. #running-req: 6, #token: 10176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1450.22, #queue-req: 0,
[2025-09-06 08:47:07] INFO: 127.0.0.1:47536 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:07] INFO: 127.0.0.1:48190 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:07 TP0] Decode batch. #running-req: 4, #token: 6848, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1383.04, #queue-req: 0,
[2025-09-06 08:47:07 TP0] Decode batch. #running-req: 4, #token: 7104, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1055.55, #queue-req: 0,
[2025-09-06 08:47:08] INFO: 127.0.0.1:47820 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:08 TP0] Decode batch. #running-req: 3, #token: 5440, token usage: 0.00, cuda graph: True, gen throughput (token/s): 969.66, #queue-req: 0,
[2025-09-06 08:47:08] INFO: 127.0.0.1:48596 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:47:08] INFO: 127.0.0.1:49106 - "POST /v1/chat/completions HTTP/1.1" 200 OK
23%|██▎ | 45/198 [00:11<00:20, 7.56it/s]
[2025-09-06 08:47:08 TP0] Decode batch. #running-req: 1, #token: 1856, token usage: 0.00, cuda graph: True, gen throughput (token/s): 487.72, #queue-req: 0,
[2025-09-06 08:47:08 TP0] Decode batch. #running-req: 1, #token: 1920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.44, #queue-req: 0,
[2025-09-06 08:47:08 TP0] Decode batch. #running-req: 1, #token: 1920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 333.22, #queue-req: 0,
[2025-09-06 08:47:08 TP0] Decode batch. #running-req: 1, #token: 1984, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.67, #queue-req: 0,
[2025-09-06 08:47:08 TP0] Decode batch. #running-req: 1, #token: 2048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.29, #queue-req: 0,
[2025-09-06 08:47:08 TP0] Decode batch. #running-req: 1, #token: 2048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.86, #queue-req: 0,
[2025-09-06 08:47:08 TP0] Decode batch. #running-req: 1, #token: 2112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.52, #queue-req: 0,
[2025-09-06 08:47:09 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.63, #queue-req: 0,
[2025-09-06 08:47:09 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.19, #queue-req: 0,
[2025-09-06 08:47:09 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.04, #queue-req: 0,
[2025-09-06 08:47:09 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.00, #queue-req: 0,
[2025-09-06 08:47:09 TP0] Decode batch. #running-req: 1, #token: 2304, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.07, #queue-req: 0,
[2025-09-06 08:47:09] INFO: 127.0.0.1:48032 - "POST /v1/chat/completions HTTP/1.1" 200 OK
100%|██████████| 198/198 [00:12<00:00, 15.65it/s]
/usr/lib/python3.12/subprocess.py:1127: ResourceWarning: subprocess 49519 is still running
_warn("subprocess %s is still running" % self.pid,
ResourceWarning: Enable tracemalloc to get the object allocation traceback
.
----------------------------------------------------------------------
Ran 1 test in 176.921s
OK
Writing report to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.html
{'chars': 1763.0353535353536, 'chars:std': 976.7007511699713, 'score:std': 0.4863193178670999, 'score': 0.6161616161616161}
Writing results to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.json
Total latency: 12.715 s
Score: 0.616
Evaluation end: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 metrics={'chars': 1763.0353535353536, 'chars:std': 976.7007511699713, 'score:std': 0.4863193178670999, 'score': 0.6161616161616161}
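Annotation: the reported score and its std are mutually consistent under the assumption of 0/1 per-question scoring over the 198 GPQA questions (0.6161... = 122/198); for a Bernoulli score p, the population std is sqrt(p·(1-p)). A quick check:

```python
# Consistency check of the reported GPQA metrics, assuming 0/1 per-question
# scoring over 198 questions (score 0.61616... == 122/198).
import math

n, correct = 198, 122
p = correct / n
std = math.sqrt(p * (1 - p))  # population std of a 0/1 score vector
print(p, std)  # 0.6161616..., 0.4863193... — matching the metrics dict above
```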
================================================================================
Run 8:
Auto-configed device: cuda
WARNING:sglang.srt.server_args:Detected SM100 and MXFP4 quantization format for GPT-OSS model, enabling FlashInfer MXFP4 MOE kernel.
WARNING:sglang.srt.server_args:TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64.
[2025-09-06 08:47:24] server_args=ServerArgs(model_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8400, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.93, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=True, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=384373251, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/home/yiliu7/models/openai/gpt-oss-120b', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='flashinfer_mxfp4', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, 
hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=200, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False)
All deep_gemm operations loaded successfully!
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:47:24] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:47:24] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:47:25] Using default HuggingFace chat template with detected content format: string
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:47:31 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:47:31 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:47:31 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:47:31 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:47:32 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:47:32 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:47:32 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:47:32 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:47:32 TP0] Init torch distributed begin.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:47:32 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:47:32 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:47:32 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:47:32 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:47:32 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:47:32 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:47:32 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:47:32 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:47:34 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:47:36 TP0] Init torch distributed ends. mem usage=1.46 GB
[2025-09-06 08:47:36 TP0] Load weight begin. avail mem=176.28 GB
All deep_gemm operations loaded successfully!
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 1621.43it/s]
All deep_gemm operations loaded successfully!
All deep_gemm operations loaded successfully!
[2025-09-06 08:47:49 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.0.mlp.experts), it might take a while...
All deep_gemm operations loaded successfully!
[2025-09-06 08:47:52 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.1.mlp.experts), it might take a while...
[2025-09-06 08:47:55 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.2.mlp.experts), it might take a while...
[2025-09-06 08:47:58 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.3.mlp.experts), it might take a while...
[2025-09-06 08:48:01 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.4.mlp.experts), it might take a while...
[2025-09-06 08:48:05 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.5.mlp.experts), it might take a while...
[2025-09-06 08:48:08 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.6.mlp.experts), it might take a while...
[2025-09-06 08:48:11 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.7.mlp.experts), it might take a while...
[2025-09-06 08:48:14 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.8.mlp.experts), it might take a while...
[2025-09-06 08:48:17 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.9.mlp.experts), it might take a while...
[2025-09-06 08:48:20 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.10.mlp.experts), it might take a while...
[2025-09-06 08:48:23 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.11.mlp.experts), it might take a while...
[2025-09-06 08:48:26 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.12.mlp.experts), it might take a while...
[2025-09-06 08:48:30 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.13.mlp.experts), it might take a while...
[2025-09-06 08:48:33 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.14.mlp.experts), it might take a while...
[2025-09-06 08:48:36 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.15.mlp.experts), it might take a while...
[2025-09-06 08:48:39 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.16.mlp.experts), it might take a while...
[2025-09-06 08:48:42 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.17.mlp.experts), it might take a while...
[2025-09-06 08:48:45 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.18.mlp.experts), it might take a while...
[2025-09-06 08:48:48 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.19.mlp.experts), it might take a while...
[2025-09-06 08:48:51 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.20.mlp.experts), it might take a while...
[2025-09-06 08:48:54 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.21.mlp.experts), it might take a while...
[2025-09-06 08:48:58 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.22.mlp.experts), it might take a while...
[2025-09-06 08:49:01 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.23.mlp.experts), it might take a while...
[2025-09-06 08:49:04 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.24.mlp.experts), it might take a while...
[2025-09-06 08:49:07 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.25.mlp.experts), it might take a while...
[2025-09-06 08:49:10 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.26.mlp.experts), it might take a while...
[2025-09-06 08:49:13 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.27.mlp.experts), it might take a while...
[2025-09-06 08:49:16 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.28.mlp.experts), it might take a while...
[2025-09-06 08:49:19 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.29.mlp.experts), it might take a while...
[2025-09-06 08:49:23 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.30.mlp.experts), it might take a while...
[2025-09-06 08:49:26 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.31.mlp.experts), it might take a while...
[2025-09-06 08:49:29 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.32.mlp.experts), it might take a while...
[2025-09-06 08:49:32 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.33.mlp.experts), it might take a while...
[2025-09-06 08:49:35 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.34.mlp.experts), it might take a while...
[2025-09-06 08:49:38 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.35.mlp.experts), it might take a while...
[2025-09-06 08:49:41 TP0] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=158.06 GB, mem usage=18.22 GB.
[2025-09-06 08:49:41 TP0] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:49:41 TP0] Memory pool end. avail mem=10.23 GB
[2025-09-06 08:49:41 TP1] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:49:41 TP2] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:49:41 TP3] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:49:42 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=9.54 GB
[2025-09-06 08:49:42 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200]
Capturing batches (bs=200 avail_mem=9.39 GB): 0%| | 0/28 [00:00<?, ?it/s]
rank 2 allocated ipc_handles: [['0x7fc778000000', '0x7fc774000000', '0x7fed74000000', '0x7fc770000000'], ['0x7fc773000000', '0x7fc773200000', '0x7fc772e00000', '0x7fc773400000'], ['0x7fc75c000000', '0x7fc752000000', '0x7fc766000000', '0x7fc748000000']]
[2025-09-06 08:49:44.171] [info] lamportInitialize start: buffer: 0x7fc766000000, size: 71303168
rank 3 allocated ipc_handles: [['0x79b95c000000', '0x79b922000000', '0x79b91e000000', '0x79df12000000'], ['0x79b921000000', '0x79b921200000', '0x79b921400000', '0x79b920e00000'], ['0x79b90a000000', '0x79b900000000', '0x79b8f6000000', '0x79b914000000']]
[2025-09-06 08:49:44.223] [info] lamportInitialize start: buffer: 0x79b914000000, size: 71303168
rank 0 allocated ipc_handles: [['0x7896b2000000', '0x7870f6000000', '0x7870c4000000', '0x7870c0000000'], ['0x7870c2e00000', '0x7870c3000000', '0x7870c3200000', '0x7870c3400000'], ['0x7870b6000000', '0x7870ac000000', '0x7870a2000000', '0x787098000000']]
[2025-09-06 08:49:44.271] [info] lamportInitialize start: buffer: 0x7870b6000000, size: 71303168
rank 1 allocated ipc_handles: [['0x7956e2000000', '0x797c78000000', '0x79567c000000', '0x795678000000'], ['0x79567b000000', '0x79567ae00000', '0x79567b200000', '0x79567b400000'], ['0x795664000000', '0x79566e000000', '0x79565a000000', '0x795650000000']]
[2025-09-06 08:49:44.320] [info] lamportInitialize start: buffer: 0x79566e000000, size: 71303168
[2025-09-06 08:49:44 TP0] FlashInfer workspace initialized for rank 0, world_size 4
[2025-09-06 08:49:44 TP1] FlashInfer workspace initialized for rank 1, world_size 4
[2025-09-06 08:49:44 TP3] FlashInfer workspace initialized for rank 3, world_size 4
[2025-09-06 08:49:44 TP2] FlashInfer workspace initialized for rank 2, world_size 4
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 0 workspace[0] 0x7896b2000000
Rank 0 workspace[1] 0x7870f6000000
Rank 0 workspace[2] 0x7870c4000000
Rank 0 workspace[3] 0x7870c0000000
Rank 0 workspace[4] 0x7870c2e00000
Rank 0 workspace[5] 0x7870c3000000
Rank 0 workspace[6] 0x7870c3200000
Rank 0 workspace[7] 0x7870c3400000
Rank 0 workspace[8] 0x7870b6000000
Rank 0 workspace[9] 0x7870ac000000
Rank 0 workspace[10] 0x7870a2000000
Rank 0 workspace[11] 0x787098000000
Rank 0 workspace[12] 0x789cab264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 1 workspace[0] 0x7956e2000000
Rank 1 workspace[1] 0x797c78000000
Rank 1 workspace[2] 0x79567c000000
Rank 1 workspace[3] 0x795678000000
Rank 1 workspace[4] 0x79567b000000
Rank 1 workspace[5] 0x79567ae00000
Rank 1 workspace[6] 0x79567b200000
Rank 1 workspace[7] 0x79567b400000
Rank 1 workspace[8] 0x795664000000
Rank 1 workspace[9] 0x79566e000000
Rank 1 workspace[10] 0x79565a000000
Rank 1 workspace[11] 0x795650000000
Rank 1 workspace[12] 0x79827f264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 2 workspace[0] 0x7fc778000000
Rank 2 workspace[1] 0x7fc774000000
Rank 2 workspace[2] 0x7fed74000000
Rank 2 workspace[3] 0x7fc770000000
Rank 2 workspace[4] 0x7fc773000000
Rank 2 workspace[5] 0x7fc773200000
Rank 2 workspace[6] 0x7fc772e00000
Rank 2 workspace[7] 0x7fc773400000
Rank 2 workspace[8] 0x7fc75c000000
Rank 2 workspace[9] 0x7fc752000000
Rank 2 workspace[10] 0x7fc766000000
Rank 2 workspace[11] 0x7fc748000000
Rank 2 workspace[12] 0x7ff383264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 3 workspace[0] 0x79b95c000000
Rank 3 workspace[1] 0x79b922000000
Rank 3 workspace[2] 0x79b91e000000
Rank 3 workspace[3] 0x79df12000000
Rank 3 workspace[4] 0x79b921000000
Rank 3 workspace[5] 0x79b921200000
Rank 3 workspace[6] 0x79b921400000
Rank 3 workspace[7] 0x79b920e00000
Rank 3 workspace[8] 0x79b90a000000
Rank 3 workspace[9] 0x79b900000000
Rank 3 workspace[10] 0x79b8f6000000
Rank 3 workspace[11] 0x79b914000000
Rank 3 workspace[12] 0x79e50f264400
Capturing batches (bs=200→1, avail_mem 9.39→7.82 GB): 100%|██████████| 28/28 [00:04<00:00, 6.08it/s]
[2025-09-06 08:49:47 TP2] Registering 56 cuda graph addresses
[2025-09-06 08:49:47 TP1] Registering 56 cuda graph addresses
[2025-09-06 08:49:47 TP0] Registering 56 cuda graph addresses
[2025-09-06 08:49:47 TP3] Registering 56 cuda graph addresses
[2025-09-06 08:49:47 TP0] Capture cuda graph end. Time elapsed: 5.11 s. mem usage=1.73 GB. avail mem=7.81 GB.
[2025-09-06 08:49:47 TP0] max_total_num_tokens=8487040, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=131072, available_gpu_mem=7.81 GB
[2025-09-06 08:49:48] INFO: Started server process [51985]
[2025-09-06 08:49:48] INFO: Waiting for application startup.
[2025-09-06 08:49:48] INFO: Application startup complete.
[2025-09-06 08:49:48] INFO: Uvicorn running on http://127.0.0.1:8400 (Press CTRL+C to quit)
[2025-09-06 08:49:49] INFO: 127.0.0.1:57250 - "GET /health_generate HTTP/1.1" 503 Service Unavailable
[2025-09-06 08:49:49] INFO: 127.0.0.1:57262 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-09-06 08:49:49 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:49:50] INFO: 127.0.0.1:57266 - "POST /generate HTTP/1.1" 200 OK
[2025-09-06 08:49:50] The server is fired up and ready to roll!
[2025-09-06 08:49:59 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:50:00] INFO: 127.0.0.1:43482 - "GET /health_generate HTTP/1.1" 200 OK
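The first /health_generate probe above returns 503 because it arrives before warmup completes; about a second later the same endpoint answers 200. A minimal readiness-polling sketch against the endpoints shown in this log (hypothetical helper; the requests library is an assumption, any HTTP client works):

```python
import time

import requests  # assumed available; any HTTP client works

def wait_until_ready(base_url="http://127.0.0.1:8400", timeout_s=300.0):
    """Poll /health_generate until the server answers 200 OK."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/health_generate", timeout=5).status_code == 200:
                return
        except requests.ConnectionError:
            pass  # process may still be binding the port
        time.sleep(1.0)
    raise TimeoutError("server did not become healthy in time")
```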
command=python3 -m sglang.launch_server --model-path /home/yiliu7/models/openai/gpt-oss-120b --tp 4 --cuda-graph-max-bs 200 --mem-fraction-static 0.93 --device cuda --host 127.0.0.1 --port 8400
Evaluation start: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6
ChatCompletionSampler initialized with self.system_message=None self.temperature=0.1 self.max_tokens=4096 self.reasoning_effort='low'
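The sampler above drives the OpenAI-compatible /v1/chat/completions endpoint seen throughout the request log below. A request with the same sampling parameters might look like this (a sketch only; the openai client and the reasoning_effort pass-through via extra_body are assumptions, not taken from this log):

```python
from openai import OpenAI  # assumed OpenAI-compatible client

client = OpenAI(base_url="http://127.0.0.1:8400/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="/home/yiliu7/models/openai/gpt-oss-120b",  # served_model_name
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    temperature=0.1,
    max_tokens=4096,
    extra_body={"reasoning_effort": "low"},  # assumed pass-through field
)
print(resp.choices[0].message.content)
```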
  0%|          | 0/198 [00:00<?, ?it/s]
[2025-09-06 08:50:00 TP0] Prefill batch. #new-seq: 1, #new-token: 320, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:50:00 TP0] Prefill batch. #new-seq: 1, #new-token: 384, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0,
[2025-09-06 08:50:01 TP0] Prefill batch. #new-seq: 8, #new-token: 1920, #cached-token: 512, token usage: 0.00, #running-req: 2, #queue-req: 0,
[2025-09-06 08:50:01 TP0] Prefill batch. #new-seq: 11, #new-token: 2688, #cached-token: 704, token usage: 0.00, #running-req: 10, #queue-req: 0,
[2025-09-06 08:50:01 TP0] Prefill batch. #new-seq: 10, #new-token: 2432, #cached-token: 640, token usage: 0.00, #running-req: 21, #queue-req: 0,
[2025-09-06 08:50:01 TP0] Prefill batch. #new-seq: 27, #new-token: 9728, #cached-token: 1728, token usage: 0.00, #running-req: 31, #queue-req: 0,
[2025-09-06 08:50:01 TP0] Prefill batch. #new-seq: 15, #new-token: 3648, #cached-token: 960, token usage: 0.00, #running-req: 58, #queue-req: 0,
[2025-09-06 08:50:01 TP0] Prefill batch. #new-seq: 36, #new-token: 9856, #cached-token: 2304, token usage: 0.00, #running-req: 73, #queue-req: 0,
[2025-09-06 08:50:01 TP0] Prefill batch. #new-seq: 21, #new-token: 6144, #cached-token: 1408, token usage: 0.00, #running-req: 109, #queue-req: 0,
[2025-09-06 08:50:01 TP0] Prefill batch. #new-seq: 47, #new-token: 12800, #cached-token: 3072, token usage: 0.00, #running-req: 130, #queue-req: 0,
[2025-09-06 08:50:01 TP0] Prefill batch. #new-seq: 21, #new-token: 6720, #cached-token: 1408, token usage: 0.01, #running-req: 177, #queue-req: 0,
[2025-09-06 08:50:01 TP0] Decode batch. #running-req: 198, #token: 62848, token usage: 0.01, cuda graph: True, gen throughput (token/s): 422.02, #queue-req: 0,
[2025-09-06 08:50:01] INFO: 127.0.0.1:43990 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:02 TP0] Decode batch. #running-req: 197, #token: 70336, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17371.23, #queue-req: 0,
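Two of the reported figures cross-check directly (rough arithmetic; "token usage" is #token divided by the max_total_num_tokens printed at startup, and decode throughput is shared across the running requests):

```python
print(62_848 / 8_487_040)   # ~0.0074 -> logged as "token usage: 0.01"
print(17_371.23 / 197)      # ~88 generated tokens/s per request at bs=197
```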
[2025-09-06 08:50:02] INFO: 127.0.0.1:44218 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:02] INFO: 127.0.0.1:45000 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:02] INFO: 127.0.0.1:43588 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:02] INFO: 127.0.0.1:44960 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:02] INFO: 127.0.0.1:44392 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:02] INFO: 127.0.0.1:44352 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:02] INFO: 127.0.0.1:44078 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:02] INFO: 127.0.0.1:44466 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:02 TP0] Decode batch. #running-req: 190, #token: 74816, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17163.76, #queue-req: 0,
[2025-09-06 08:50:02] INFO: 127.0.0.1:45142 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:02] INFO: 127.0.0.1:43690 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:02] INFO: 127.0.0.1:43774 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:02] INFO: 127.0.0.1:43610 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:02] INFO: 127.0.0.1:43786 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:02] INFO: 127.0.0.1:44138 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:02] INFO: 127.0.0.1:44304 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:02] INFO: 127.0.0.1:44778 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:02] INFO: 127.0.0.1:43626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:02] INFO: 127.0.0.1:44232 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:02] INFO: 127.0.0.1:43944 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:03] INFO: 127.0.0.1:44650 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:03] INFO: 127.0.0.1:43508 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:03] INFO: 127.0.0.1:44698 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:03 TP0] Decode batch. #running-req: 175, #token: 72512, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16427.91, #queue-req: 0,
[2025-09-06 08:50:03] INFO: 127.0.0.1:45202 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:03] INFO: 127.0.0.1:45070 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:03] INFO: 127.0.0.1:43628 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:03] INFO: 127.0.0.1:44836 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:03] INFO: 127.0.0.1:44066 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:03] INFO: 127.0.0.1:43518 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:03] INFO: 127.0.0.1:45014 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:03] INFO: 127.0.0.1:44338 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:03] INFO: 127.0.0.1:44012 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:03 TP0] Decode batch. #running-req: 167, #token: 76032, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15692.31, #queue-req: 0,
[2025-09-06 08:50:03] INFO: 127.0.0.1:45040 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:44714 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:43904 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:44410 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:45106 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04 TP0] Decode batch. #running-req: 164, #token: 79872, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14907.71, #queue-req: 0,
[2025-09-06 08:50:04] INFO: 127.0.0.1:44388 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:44790 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:45188 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:45110 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:44086 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:44762 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:43812 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:45148 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:43902 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:44206 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:45198 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:44374 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04 TP0] Decode batch. #running-req: 149, #token: 79872, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14000.20, #queue-req: 0,
[2025-09-06 08:50:04] INFO: 127.0.0.1:44190 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:44846 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:44608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:43752 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:43858 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:44536 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:43704 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:44048 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:44560 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:44624 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:44982 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04 TP0] Decode batch. #running-req: 138, #token: 78912, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13241.58, #queue-req: 0,
[2025-09-06 08:50:04] INFO: 127.0.0.1:44932 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:04] INFO: 127.0.0.1:45036 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:43620 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:43866 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:44816 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:45008 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:43568 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:44362 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:44162 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:44420 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05 TP0] Decode batch. #running-req: 128, #token: 79424, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13042.26, #queue-req: 0,
[2025-09-06 08:50:05] INFO: 127.0.0.1:45090 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:43834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:44498 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:45116 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:44482 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:44334 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:43848 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:43664 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:44704 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05 TP0] Decode batch. #running-req: 119, #token: 78336, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16564.89, #queue-req: 0,
[2025-09-06 08:50:05] INFO: 127.0.0.1:45252 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:43958 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:44914 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:45258 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:44774 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:44398 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:43554 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:44406 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:43874 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:44684 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:44948 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:43646 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:43830 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05] INFO: 127.0.0.1:44554 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:05 TP0] Decode batch. #running-req: 105, #token: 74176, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15339.23, #queue-req: 0,
[2025-09-06 08:50:06] INFO: 127.0.0.1:44254 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:43816 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:44852 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:44116 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:44506 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:44126 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:44582 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06 TP0] Decode batch. #running-req: 98, #token: 72448, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14430.93, #queue-req: 0,
[2025-09-06 08:50:06] INFO: 127.0.0.1:43886 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:44820 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:43986 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:43654 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:45210 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:43922 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:43896 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:43758 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:44728 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06 TP0] Decode batch. #running-req: 89, #token: 68352, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13443.96, #queue-req: 0,
[2025-09-06 08:50:06] INFO: 127.0.0.1:43722 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:44546 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:44864 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:44272 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:43798 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:44640 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:45192 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:45266 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:45124 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:45058 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:44566 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:44726 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06 TP0] Decode batch. #running-req: 77, #token: 63232, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12632.89, #queue-req: 0,
[2025-09-06 08:50:06] INFO: 127.0.0.1:43974 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:43636 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:44490 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:45084 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:44064 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:06] INFO: 127.0.0.1:44376 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07 TP0] Decode batch. #running-req: 71, #token: 60864, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11695.44, #queue-req: 0,
[2025-09-06 08:50:07] INFO: 127.0.0.1:43602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:43810 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:44036 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:43918 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:45052 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:45108 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:44030 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07 TP0] Decode batch. #running-req: 64, #token: 57792, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10662.01, #queue-req: 0,
[2025-09-06 08:50:07] INFO: 127.0.0.1:44664 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:43996 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:43890 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:43762 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:44200 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:44306 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:44744 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07 TP0] Decode batch. #running-req: 57, #token: 53632, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9975.86, #queue-req: 0,
[2025-09-06 08:50:07] INFO: 127.0.0.1:44980 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:45200 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:43528 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:45170 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:43510 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:44436 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:44094 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:44074 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07 TP0] Decode batch. #running-req: 50, #token: 47808, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9000.03, #queue-req: 0,
[2025-09-06 08:50:07] INFO: 127.0.0.1:45138 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:45238 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:44154 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:44880 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:07] INFO: 127.0.0.1:44860 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:08 TP0] Decode batch. #running-req: 44, #token: 45056, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8202.58, #queue-req: 0,
[2025-09-06 08:50:08] INFO: 127.0.0.1:44968 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:08] INFO: 127.0.0.1:43688 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:08] INFO: 127.0.0.1:44670 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:08] INFO: 127.0.0.1:43574 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:08] INFO: 127.0.0.1:44800 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:08 TP0] Decode batch. #running-req: 39, #token: 41408, token usage: 0.00, cuda graph: True, gen throughput (token/s): 7408.45, #queue-req: 0,
[2025-09-06 08:50:08] INFO: 127.0.0.1:44592 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:08] INFO: 127.0.0.1:43706 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:08] INFO: 127.0.0.1:44020 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:08 TP0] Decode batch. #running-req: 36, #token: 39680, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6741.61, #queue-req: 0,
[2025-09-06 08:50:08] INFO: 127.0.0.1:43934 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:08] INFO: 127.0.0.1:44666 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:08] INFO: 127.0.0.1:44794 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:08] INFO: 127.0.0.1:44318 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:08 TP0] Decode batch. #running-req: 32, #token: 36544, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6085.82, #queue-req: 0,
[2025-09-06 08:50:08] INFO: 127.0.0.1:45028 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:08] INFO: 127.0.0.1:43542 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:08] INFO: 127.0.0.1:44520 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:08] INFO: 127.0.0.1:44176 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:08 TP0] Decode batch. #running-req: 29, #token: 33024, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5870.70, #queue-req: 0,
[2025-09-06 08:50:09 TP0] Decode batch. #running-req: 28, #token: 34176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5250.06, #queue-req: 0,
[2025-09-06 08:50:09] INFO: 127.0.0.1:44088 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:09] INFO: 127.0.0.1:44248 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:09] INFO: 127.0.0.1:44916 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:09] INFO: 127.0.0.1:44750 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:09 TP0] Decode batch. #running-req: 24, #token: 30400, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4915.10, #queue-req: 0,
[2025-09-06 08:50:09] INFO: 127.0.0.1:44444 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:09] INFO: 127.0.0.1:44908 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:09] INFO: 127.0.0.1:44504 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:09] INFO: 127.0.0.1:43496 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  1%|          | 1/198 [00:08<28:17, 8.62s/it]
[2025-09-06 08:50:09 TP0] Decode batch. #running-req: 20, #token: 26112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4419.02, #queue-req: 0,
[2025-09-06 08:50:09] INFO: 127.0.0.1:44770 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:09] INFO: 127.0.0.1:44984 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:09] INFO: 127.0.0.1:44100 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:09] INFO: 127.0.0.1:45064 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:09] INFO: 127.0.0.1:44584 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  6%|▌         | 12/198 [00:08<01:38, 1.88it/s]
[2025-09-06 08:50:09] INFO: 127.0.0.1:43744 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:09 TP0] Decode batch. #running-req: 14, #token: 19328, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3433.39, #queue-req: 0,
[2025-09-06 08:50:09] INFO: 127.0.0.1:44288 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:09] INFO: 127.0.0.1:44458 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:09 TP0] Decode batch. #running-req: 12, #token: 16896, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2626.78, #queue-req: 0,
[2025-09-06 08:50:10] INFO: 127.0.0.1:44892 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:10] INFO: 127.0.0.1:44296 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:10 TP0] Decode batch. #running-req: 10, #token: 14464, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2343.56, #queue-req: 0,
[2025-09-06 08:50:10] INFO: 127.0.0.1:43728 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:10 TP0] Decode batch. #running-req: 9, #token: 13504, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2011.41, #queue-req: 0,
[2025-09-06 08:50:10] INFO: 127.0.0.1:45224 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:10] INFO: 127.0.0.1:45160 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:10 TP0] Decode batch. #running-req: 7, #token: 10688, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1811.20, #queue-req: 0,
[2025-09-06 08:50:10 TP0] Decode batch. #running-req: 7, #token: 10880, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1671.46, #queue-req: 0,
[2025-09-06 08:50:10 TP0] Decode batch. #running-req: 7, #token: 11264, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1669.39, #queue-req: 0,
[2025-09-06 08:50:10] INFO: 127.0.0.1:44730 - "POST /v1/chat/completions HTTP/1.1" 200 OK
 12%|█▏        | 24/198 [00:10<00:48, 3.59it/s]
[2025-09-06 08:50:10 TP0] Decode batch. #running-req: 6, #token: 8448, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1617.26, #queue-req: 0,
[2025-09-06 08:50:10] INFO: 127.0.0.1:44264 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:11 TP0] Decode batch. #running-req: 5, #token: 8576, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1209.93, #queue-req: 0,
[2025-09-06 08:50:11 TP0] Decode batch. #running-req: 5, #token: 8832, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1203.85, #queue-req: 0,
[2025-09-06 08:50:11 TP0] Decode batch. #running-req: 5, #token: 9024, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1194.25, #queue-req: 0,
[2025-09-06 08:50:11] INFO: 127.0.0.1:45182 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:11] INFO: 127.0.0.1:43818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:11 TP0] Decode batch. #running-req: 3, #token: 5312, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1010.18, #queue-req: 0,
[2025-09-06 08:50:11 TP0] Decode batch. #running-req: 3, #token: 5440, token usage: 0.00, cuda graph: True, gen throughput (token/s): 800.43, #queue-req: 0,
[2025-09-06 08:50:11 TP0] Decode batch. #running-req: 3, #token: 5504, token usage: 0.00, cuda graph: True, gen throughput (token/s): 801.81, #queue-req: 0,
[2025-09-06 08:50:12] INFO: 127.0.0.1:45030 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:50:12] INFO: 127.0.0.1:44822 - "POST /v1/chat/completions HTTP/1.1" 200 OK
 15%|█▌        | 30/198 [00:11<00:41, 4.01it/s]
[2025-09-06 08:50:12 TP0] Decode batch. #running-req: 1, #token: 1920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 607.06, #queue-req: 0,
[2025-09-06 08:50:12 TP0] Decode batch. #running-req: 1, #token: 1984, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.95, #queue-req: 0,
[2025-09-06 08:50:12 TP0] Decode batch. #running-req: 1, #token: 2048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.35, #queue-req: 0,
[2025-09-06 08:50:12 TP0] Decode batch. #running-req: 1, #token: 2048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.17, #queue-req: 0,
[2025-09-06 08:50:12 TP0] Decode batch. #running-req: 1, #token: 2112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.16, #queue-req: 0,
[2025-09-06 08:50:12 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.91, #queue-req: 0,
[2025-09-06 08:50:12 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.34, #queue-req: 0,
[2025-09-06 08:50:12 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.84, #queue-req: 0,
[2025-09-06 08:50:13 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.21, #queue-req: 0,
[2025-09-06 08:50:13 TP0] Decode batch. #running-req: 1, #token: 2304, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.80, #queue-req: 0,
[2025-09-06 08:50:13 TP0] Decode batch. #running-req: 1, #token: 2368, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.60, #queue-req: 0,
[2025-09-06 08:50:13 TP0] Decode batch. #running-req: 1, #token: 2368, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.75, #queue-req: 0,
[2025-09-06 08:50:13 TP0] Decode batch. #running-req: 1, #token: 2432, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.77, #queue-req: 0,
[2025-09-06 08:50:13 TP0] Decode batch. #running-req: 1, #token: 2496, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.77, #queue-req: 0,
[2025-09-06 08:50:13 TP0] Decode batch. #running-req: 1, #token: 2496, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.75, #queue-req: 0,
[2025-09-06 08:50:13 TP0] Decode batch. #running-req: 1, #token: 2560, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.93, #queue-req: 0,
[2025-09-06 08:50:14 TP0] Decode batch. #running-req: 1, #token: 2560, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.56, #queue-req: 0,
[2025-09-06 08:50:14 TP0] Decode batch. #running-req: 1, #token: 2624, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.80, #queue-req: 0,
[2025-09-06 08:50:14 TP0] Decode batch. #running-req: 1, #token: 2688, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.02, #queue-req: 0,
[2025-09-06 08:50:14 TP0] Decode batch. #running-req: 1, #token: 2688, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.23, #queue-req: 0,
[2025-09-06 08:50:14 TP0] Decode batch. #running-req: 1, #token: 2752, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.06, #queue-req: 0,
[2025-09-06 08:50:14 TP0] Decode batch. #running-req: 1, #token: 2816, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.64, #queue-req: 0,
[2025-09-06 08:50:14 TP0] Decode batch. #running-req: 1, #token: 2816, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.15, #queue-req: 0,
[2025-09-06 08:50:14 TP0] Decode batch. #running-req: 1, #token: 2880, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.99, #queue-req: 0,
[2025-09-06 08:50:14 TP0] Decode batch. #running-req: 1, #token: 2880, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.25, #queue-req: 0,
[2025-09-06 08:50:15 TP0] Decode batch. #running-req: 1, #token: 2944, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.01, #queue-req: 0,
[2025-09-06 08:50:15 TP0] Decode batch. #running-req: 1, #token: 3008, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.00, #queue-req: 0,
[2025-09-06 08:50:15 TP0] Decode batch. #running-req: 1, #token: 3008, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.25, #queue-req: 0,
[2025-09-06 08:50:15 TP0] Decode batch. #running-req: 1, #token: 3072, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.11, #queue-req: 0,
[2025-09-06 08:50:15 TP0] Decode batch. #running-req: 1, #token: 3136, token usage: 0.00, cuda graph: True, gen throughput (token/s): 326.28, #queue-req: 0,
[2025-09-06 08:50:15 TP0] Decode batch. #running-req: 1, #token: 3136, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.20, #queue-req: 0,
[2025-09-06 08:50:15 TP0] Decode batch. #running-req: 1, #token: 3200, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.24, #queue-req: 0,
[2025-09-06 08:50:15 TP0] Decode batch. #running-req: 1, #token: 3200, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.81, #queue-req: 0,
[2025-09-06 08:50:16 TP0] Decode batch. #running-req: 1, #token: 3264, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.10, #queue-req: 0,
[2025-09-06 08:50:16 TP0] Decode batch. #running-req: 1, #token: 3328, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.98, #queue-req: 0,
[2025-09-06 08:50:16 TP0] Decode batch. #running-req: 1, #token: 3328, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.30, #queue-req: 0,
[2025-09-06 08:50:16 TP0] Decode batch. #running-req: 1, #token: 3392, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.75, #queue-req: 0,
[2025-09-06 08:50:16 TP0] Decode batch. #running-req: 1, #token: 3456, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.72, #queue-req: 0,
[2025-09-06 08:50:16 TP0] Decode batch. #running-req: 1, #token: 3456, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.36, #queue-req: 0,
[2025-09-06 08:50:16 TP0] Decode batch. #running-req: 1, #token: 3520, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.55, #queue-req: 0,
[2025-09-06 08:50:16 TP0] Decode batch. #running-req: 1, #token: 3520, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.20, #queue-req: 0,
[2025-09-06 08:50:17 TP0] Decode batch. #running-req: 1, #token: 3584, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.35, #queue-req: 0,
[2025-09-06 08:50:17 TP0] Decode batch. #running-req: 1, #token: 3648, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.56, #queue-req: 0,
[2025-09-06 08:50:17 TP0] Decode batch. #running-req: 1, #token: 3648, token usage: 0.00, cuda graph: True, gen throughput (token/s): 326.78, #queue-req: 0,
[2025-09-06 08:50:17 TP0] Decode batch. #running-req: 1, #token: 3712, token usage: 0.00, cuda graph: True, gen throughput (token/s): 326.70, #queue-req: 0,
[2025-09-06 08:50:17 TP0] Decode batch. #running-req: 1, #token: 3776, token usage: 0.00, cuda graph: True, gen throughput (token/s): 326.99, #queue-req: 0,
[2025-09-06 08:50:17 TP0] Decode batch. #running-req: 1, #token: 3776, token usage: 0.00, cuda graph: True, gen throughput (token/s): 326.84, #queue-req: 0,
[2025-09-06 08:50:17 TP0] Decode batch. #running-req: 1, #token: 3840, token usage: 0.00, cuda graph: True, gen throughput (token/s): 326.78, #queue-req: 0,
[2025-09-06 08:50:17 TP0] Decode batch. #running-req: 1, #token: 3840, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.08, #queue-req: 0,
[2025-09-06 08:50:18 TP0] Decode batch. #running-req: 1, #token: 3904, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.42, #queue-req: 0,
[2025-09-06 08:50:18 TP0] Decode batch. #running-req: 1, #token: 3968, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.55, #queue-req: 0,
[2025-09-06 08:50:18 TP0] Decode batch. #running-req: 1, #token: 3968, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.20, #queue-req: 0,
[2025-09-06 08:50:18 TP0] Decode batch. #running-req: 1, #token: 4032, token usage: 0.00, cuda graph: True, gen throughput (token/s): 318.00, #queue-req: 0,
[2025-09-06 08:50:18 TP0] Decode batch. #running-req: 1, #token: 4096, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.58, #queue-req: 0,
[2025-09-06 08:50:18 TP0] Decode batch. #running-req: 1, #token: 4096, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.54, #queue-req: 0,
[2025-09-06 08:50:18 TP0] Decode batch. #running-req: 1, #token: 4160, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.81, #queue-req: 0,
[2025-09-06 08:50:18 TP0] Decode batch. #running-req: 1, #token: 4160, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.50, #queue-req: 0,
[2025-09-06 08:50:19 TP0] Decode batch. #running-req: 1, #token: 4224, token usage: 0.00, cuda graph: True, gen throughput (token/s): 326.93, #queue-req: 0,
[2025-09-06 08:50:19 TP0] Decode batch. #running-req: 1, #token: 4288, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.17, #queue-req: 0,
[2025-09-06 08:50:19 TP0] Decode batch. #running-req: 1, #token: 4288, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.33, #queue-req: 0,
[2025-09-06 08:50:19 TP0] Decode batch. #running-req: 1, #token: 4352, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.43, #queue-req: 0,
[2025-09-06 08:50:19 TP0] Decode batch. #running-req: 1, #token: 4416, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.05, #queue-req: 0,
[2025-09-06 08:50:19] INFO: 127.0.0.1:43678 - "POST /v1/chat/completions HTTP/1.1" 200 OK
 56%|█████▌    | 110/198 [00:18<00:10, 8.49it/s]
100%|██████████| 198/198 [00:18<00:00, 10.57it/s]
/usr/lib/python3.12/subprocess.py:1127: ResourceWarning: subprocess 51985 is still running
_warn("subprocess %s is still running" % self.pid,
ResourceWarning: Enable tracemalloc to get the object allocation traceback
.
----------------------------------------------------------------------
Ran 1 test in 182.997s
OK
Writing report to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.html
{'chars': 1658.3989898989898, 'chars:std': 1027.2337509987194, 'score:std': 0.478067053179767, 'score': 0.6464646464646465}
Writing results to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.json
Total latency: 18.779 s
Score: 0.646
Evaluation end: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 metrics={'chars': 1658.3989898989898, 'chars:std': 1027.2337509987194, 'score:std': 0.478067053179767, 'score': 0.6464646464646465}
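The reported score:std is consistent with the per-sample standard deviation of a Bernoulli outcome at this accuracy (a quick check, assuming 198 graded pass/fail samples as the progress bar indicates, i.e. 128 correct):

```python
p = 128 / 198                # = 0.646464..., the reported GPQA score
print((p * (1 - p)) ** 0.5)  # ~0.478067, matching the reported score:std
```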
================================================================================
Run 9:
Auto-configed device: cuda
WARNING:sglang.srt.server_args:Detected SM100 and MXFP4 quantization format for GPT-OSS model, enabling FlashInfer MXFP4 MOE kernel.
WARNING:sglang.srt.server_args:TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64.
[2025-09-06 08:50:34] server_args=ServerArgs(model_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8400, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.93, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=True, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=8902787, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/home/yiliu7/models/openai/gpt-oss-120b', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='flashinfer_mxfp4', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, 
hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=200, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False)
All deep_gemm operations loaded successfully!
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:50:34] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:50:34] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:50:34] Using default HuggingFace chat template with detected content format: string
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:50:41 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:50:41 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:50:41 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:50:41 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:50:41 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:50:41 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:50:41 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:50:41 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:50:41 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:50:41 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:50:42 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:50:42 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:50:42 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:50:42 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:50:42 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:50:42 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:50:42 TP0] Init torch distributed begin.
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:50:44 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:50:45 TP0] Init torch distributed ends. mem usage=1.46 GB
[2025-09-06 08:50:46 TP0] Load weight begin. avail mem=176.28 GB
All deep_gemm operations loaded successfully!
Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 1553.52it/s]
All deep_gemm operations loaded successfully!
[2025-09-06 08:51:01 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.0.mlp.experts), it might take a while...
All deep_gemm operations loaded successfully!
All deep_gemm operations loaded successfully!
[2025-09-06 08:51:04 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.1.mlp.experts), it might take a while...
[2025-09-06 08:51:07 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.2.mlp.experts), it might take a while...
[2025-09-06 08:51:10 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.3.mlp.experts), it might take a while...
[2025-09-06 08:51:13 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.4.mlp.experts), it might take a while...
[2025-09-06 08:51:16 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.5.mlp.experts), it might take a while...
[2025-09-06 08:51:19 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.6.mlp.experts), it might take a while...
[2025-09-06 08:51:22 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.7.mlp.experts), it might take a while...
[2025-09-06 08:51:25 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.8.mlp.experts), it might take a while...
[2025-09-06 08:51:28 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.9.mlp.experts), it might take a while...
[2025-09-06 08:51:32 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.10.mlp.experts), it might take a while...
[2025-09-06 08:51:35 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.11.mlp.experts), it might take a while...
[2025-09-06 08:51:38 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.12.mlp.experts), it might take a while...
[2025-09-06 08:51:41 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.13.mlp.experts), it might take a while...
[2025-09-06 08:51:44 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.14.mlp.experts), it might take a while...
[2025-09-06 08:51:47 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.15.mlp.experts), it might take a while...
[2025-09-06 08:51:50 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.16.mlp.experts), it might take a while...
[2025-09-06 08:51:53 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.17.mlp.experts), it might take a while...
[2025-09-06 08:51:56 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.18.mlp.experts), it might take a while...
[2025-09-06 08:51:59 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.19.mlp.experts), it might take a while...
[2025-09-06 08:52:02 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.20.mlp.experts), it might take a while...
[2025-09-06 08:52:05 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.21.mlp.experts), it might take a while...
[2025-09-06 08:52:08 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.22.mlp.experts), it might take a while...
[2025-09-06 08:52:11 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.23.mlp.experts), it might take a while...
[2025-09-06 08:52:15 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.24.mlp.experts), it might take a while...
[2025-09-06 08:52:18 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.25.mlp.experts), it might take a while...
[2025-09-06 08:52:21 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.26.mlp.experts), it might take a while...
[2025-09-06 08:52:24 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.27.mlp.experts), it might take a while...
[2025-09-06 08:52:27 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.28.mlp.experts), it might take a while...
[2025-09-06 08:52:30 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.29.mlp.experts), it might take a while...
[2025-09-06 08:52:33 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.30.mlp.experts), it might take a while...
[2025-09-06 08:52:36 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.31.mlp.experts), it might take a while...
[2025-09-06 08:52:39 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.32.mlp.experts), it might take a while...
[2025-09-06 08:52:42 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.33.mlp.experts), it might take a while...
[2025-09-06 08:52:45 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.34.mlp.experts), it might take a while...
[2025-09-06 08:52:48 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.35.mlp.experts), it might take a while...
[2025-09-06 08:52:51 TP0] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=158.06 GB, mem usage=18.22 GB.
[2025-09-06 08:52:51 TP0] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:52:51 TP0] Memory pool end. avail mem=10.23 GB
[2025-09-06 08:52:51 TP1] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:52:51 TP3] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:52:51 TP2] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
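
The K/V sizes reported above can be sanity-checked from the model shape. A minimal sketch, assuming the public gpt-oss-120b attention config (8 KV heads, head_dim 64 — these two numbers are assumptions, not in this log); the 36 layers, tp=4, bf16 dtype, and token count are all visible above:

num_layers = 36            # layers 0..35 were shuffled above
kv_heads = 8               # assumption: gpt-oss-120b GQA KV heads
head_dim = 64              # assumption: gpt-oss-120b head dimension
tp_size = 4                # from --tp 4
bytes_per_elem = 2         # bf16 cache (kv_cache_dtype='auto' follows model dtype)
tokens = 8_487_040         # #tokens reported above

per_token = num_layers * (kv_heads // tp_size) * head_dim * bytes_per_elem  # 9216 B
print(tokens * per_token / 2**30)  # ~72.84 GiB per side, in line with the 72.85 GB logged
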
[2025-09-06 08:52:52 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=9.54 GB
[2025-09-06 08:52:52 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200]
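
The 28 capture sizes above follow a simple pattern: small powers of two, then a stride of 8 up to --cuda-graph-max-bs 200. A sketch that reproduces the list (illustrative only; SGLang's actual schedule is computed inside its CUDA-graph runner and may differ):

cuda_graph_max_bs = 200
capture_bs = [1, 2, 4] + list(range(8, cuda_graph_max_bs + 1, 8))
assert len(capture_bs) == 28 and capture_bs[-1] == 200  # matches the 0/28 progress bar
print(capture_bs)  # [1, 2, 4, 8, 16, 24, ..., 192, 200]
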
Capturing batches (bs=200 avail_mem=9.39 GB):   0%|          | 0/28 [00:00<?, ?it/s]
rank 2 allocated ipc_handles: [['0x7db224000000', '0x7db1c0000000', '0x7dd7ba000000', '0x7db1bc000000'], ['0x7db1bf000000', '0x7db1bf200000', '0x7db1bee00000', '0x7db1bf400000'], ['0x7db1a8000000', '0x7db19e000000', '0x7db1b2000000', '0x7db194000000']]
[2025-09-06 08:52:54.237] [info] lamportInitialize start: buffer: 0x7db1b2000000, size: 71303168
rank 3 allocated ipc_handles: [['0x77f504000000', '0x77f4a8000000', '0x77f4a4000000', '0x781aa0000000'], ['0x77f4a7000000', '0x77f4a7200000', '0x77f4a7400000', '0x77f4a6e00000'], ['0x77f490000000', '0x77f486000000', '0x77f47c000000', '0x77f49a000000']]
[2025-09-06 08:52:54.285] [info] lamportInitialize start: buffer: 0x77f49a000000, size: 71303168
rank 0 allocated ipc_handles: [['0x771a6e000000', '0x76f4d2000000', '0x76f47c000000', '0x76f478000000'], ['0x76f47ae00000', '0x76f47b000000', '0x76f47b200000', '0x76f47b400000'], ['0x76f46e000000', '0x76f464000000', '0x76f45a000000', '0x76f450000000']]
[2025-09-06 08:52:54.334] [info] lamportInitialize start: buffer: 0x76f46e000000, size: 71303168
rank 1 allocated ipc_handles: [['0x7f92a2000000', '0x7fb838000000', '0x7f923c000000', '0x7f9238000000'], ['0x7f923b000000', '0x7f923ae00000', '0x7f923b200000', '0x7f923b400000'], ['0x7f9224000000', '0x7f922e000000', '0x7f921a000000', '0x7f9210000000']]
[2025-09-06 08:52:54.384] [info] lamportInitialize start: buffer: 0x7f922e000000, size: 71303168
[2025-09-06 08:52:54 TP0] FlashInfer workspace initialized for rank 0, world_size 4
[2025-09-06 08:52:54 TP1] FlashInfer workspace initialized for rank 1, world_size 4
[2025-09-06 08:52:54 TP3] FlashInfer workspace initialized for rank 3, world_size 4
[2025-09-06 08:52:54 TP2] FlashInfer workspace initialized for rank 2, world_size 4
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 0 workspace[0] 0x771a6e000000
Rank 0 workspace[1] 0x76f4d2000000
Rank 0 workspace[2] 0x76f47c000000
Rank 0 workspace[3] 0x76f478000000
Rank 0 workspace[4] 0x76f47ae00000
Rank 0 workspace[5] 0x76f47b000000
Rank 0 workspace[6] 0x76f47b200000
Rank 0 workspace[7] 0x76f47b400000
Rank 0 workspace[8] 0x76f46e000000
Rank 0 workspace[9] 0x76f464000000
Rank 0 workspace[10] 0x76f45a000000
Rank 0 workspace[11] 0x76f450000000
Rank 0 workspace[12] 0x772063264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 3 workspace[0] 0x77f504000000
Rank 3 workspace[1] 0x77f4a8000000
Rank 3 workspace[2] 0x77f4a4000000
Rank 3 workspace[3] 0x781aa0000000
Rank 3 workspace[4] 0x77f4a7000000
Rank 3 workspace[5] 0x77f4a7200000
Rank 3 workspace[6] 0x77f4a7400000
Rank 3 workspace[7] 0x77f4a6e00000
Rank 3 workspace[8] 0x77f490000000
Rank 3 workspace[9] 0x77f486000000
Rank 3 workspace[10] 0x77f47c000000
Rank 3 workspace[11] 0x77f49a000000
Rank 3 workspace[12] 0x78209b264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 1 workspace[0] 0x7f92a2000000
Rank 1 workspace[1] 0x7fb838000000
Rank 1 workspace[2] 0x7f923c000000
Rank 1 workspace[3] 0x7f9238000000
Rank 1 workspace[4] 0x7f923b000000
Rank 1 workspace[5] 0x7f923ae00000
Rank 1 workspace[6] 0x7f923b200000
Rank 1 workspace[7] 0x7f923b400000
Rank 1 workspace[8] 0x7f9224000000
Rank 1 workspace[9] 0x7f922e000000
Rank 1 workspace[10] 0x7f921a000000
Rank 1 workspace[11] 0x7f9210000000
Rank 1 workspace[12] 0x7fbe43264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 2 workspace[0] 0x7db224000000
Rank 2 workspace[1] 0x7db1c0000000
Rank 2 workspace[2] 0x7dd7ba000000
Rank 2 workspace[3] 0x7db1bc000000
Rank 2 workspace[4] 0x7db1bf000000
Rank 2 workspace[5] 0x7db1bf200000
Rank 2 workspace[6] 0x7db1bee00000
Rank 2 workspace[7] 0x7db1bf400000
Rank 2 workspace[8] 0x7db1a8000000
Rank 2 workspace[9] 0x7db19e000000
Rank 2 workspace[10] 0x7db1b2000000
Rank 2 workspace[11] 0x7db194000000
Rank 2 workspace[12] 0x7dddc3264400
Capturing batches (bs=1 avail_mem=7.82 GB): 100%|██████████| 28/28 [00:04<00:00, 6.37it/s]
[2025-09-06 08:52:57 TP0] Registering 56 cuda graph addresses
[2025-09-06 08:52:57 TP1] Registering 56 cuda graph addresses
[2025-09-06 08:52:57 TP3] Registering 56 cuda graph addresses
[2025-09-06 08:52:57 TP2] Registering 56 cuda graph addresses
[2025-09-06 08:52:57 TP0] Capture cuda graph end. Time elapsed: 4.91 s. mem usage=1.73 GB. avail mem=7.81 GB.
[2025-09-06 08:52:57 TP0] max_total_num_tokens=8487040, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=131072, available_gpu_mem=7.81 GB
[2025-09-06 08:52:58] INFO: Started server process [54406]
[2025-09-06 08:52:58] INFO: Waiting for application startup.
[2025-09-06 08:52:58] INFO: Application startup complete.
[2025-09-06 08:52:58] INFO: Uvicorn running on http://127.0.0.1:8400 (Press CTRL+C to quit)
[2025-09-06 08:52:58] INFO: 127.0.0.1:36566 - "GET /health_generate HTTP/1.1" 503 Service Unavailable
[2025-09-06 08:52:59] INFO: 127.0.0.1:36570 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-09-06 08:52:59 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:53:00] INFO: 127.0.0.1:36572 - "POST /generate HTTP/1.1" 200 OK
[2025-09-06 08:53:00] The server is fired up and ready to roll!
[2025-09-06 08:53:08 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:53:09] INFO: 127.0.0.1:46788 - "GET /health_generate HTTP/1.1" 200 OK
command=python3 -m sglang.launch_server --model-path /home/yiliu7/models/openai/gpt-oss-120b --tp 4 --cuda-graph-max-bs 200 --mem-fraction-static 0.93 --device cuda --host 127.0.0.1 --port 8400
Evaluation start: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6
ChatCompletionSampler initialized with self.system_message=None self.temperature=0.1 self.max_tokens=4096 self.reasoning_effort='low'
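
Each of the 198 eval requests below is an ordinary OpenAI-style chat completion against the server launched above. A minimal sketch of one such request — endpoint and model path are taken from the launch command, temperature and max_tokens mirror the sampler line above; the prompt is a placeholder and the sampler's reasoning_effort plumbing is omitted:

import requests

resp = requests.post(
    "http://127.0.0.1:8400/v1/chat/completions",
    json={
        "model": "/home/yiliu7/models/openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "placeholder GPQA question"}],
        "temperature": 0.1,
        "max_tokens": 4096,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
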
  0%|          | 0/198 [00:00<?, ?it/s]
[2025-09-06 08:53:10 TP0] Prefill batch. #new-seq: 1, #new-token: 320, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:53:10 TP0] Prefill batch. #new-seq: 10, #new-token: 3136, #cached-token: 640, token usage: 0.00, #running-req: 1, #queue-req: 0,
[2025-09-06 08:53:10 TP0] Prefill batch. #new-seq: 1, #new-token: 256, #cached-token: 64, token usage: 0.00, #running-req: 11, #queue-req: 0,
[2025-09-06 08:53:10 TP0] Prefill batch. #new-seq: 54, #new-token: 16192, #cached-token: 3456, token usage: 0.00, #running-req: 12, #queue-req: 31,
[2025-09-06 08:53:10 TP0] Prefill batch. #new-seq: 49, #new-token: 13056, #cached-token: 3136, token usage: 0.00, #running-req: 66, #queue-req: 0,
[2025-09-06 08:53:10 TP0] Prefill batch. #new-seq: 58, #new-token: 16320, #cached-token: 3776, token usage: 0.00, #running-req: 115, #queue-req: 1,
[2025-09-06 08:53:11 TP0] Prefill batch. #new-seq: 25, #new-token: 7296, #cached-token: 1728, token usage: 0.01, #running-req: 173, #queue-req: 0,
[2025-09-06 08:53:11 TP0] Decode batch. #running-req: 198, #token: 62016, token usage: 0.01, cuda graph: True, gen throughput (token/s): 390.03, #queue-req: 0,
[2025-09-06 08:53:11] INFO: 127.0.0.1:48268 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:11] INFO: 127.0.0.1:48224 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:11 TP0] Decode batch. #running-req: 196, #token: 68800, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17298.89, #queue-req: 0,
[2025-09-06 08:53:11] INFO: 127.0.0.1:47764 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:47106 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:48460 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:48182 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:46824 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:46920 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:48220 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:47478 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:47336 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:46986 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12 TP0] Decode batch. #running-req: 186, #token: 70656, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16979.59, #queue-req: 0,
[2025-09-06 08:53:12] INFO: 127.0.0.1:47982 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:47408 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:47600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:47710 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:46938 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:47628 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:47538 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:46946 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:47550 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:47498 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:48412 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:48310 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:47366 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12 TP0] Decode batch. #running-req: 173, #token: 71168, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16251.87, #queue-req: 0,
[2025-09-06 08:53:12] INFO: 127.0.0.1:47114 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:12] INFO: 127.0.0.1:48048 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:13] INFO: 127.0.0.1:47368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:13] INFO: 127.0.0.1:46972 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:13] INFO: 127.0.0.1:47662 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:13] INFO: 127.0.0.1:47724 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:13 TP0] Decode batch. #running-req: 167, #token: 75840, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15575.15, #queue-req: 0,
[2025-09-06 08:53:13] INFO: 127.0.0.1:46860 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:13] INFO: 127.0.0.1:46950 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:13] INFO: 127.0.0.1:48162 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:13] INFO: 127.0.0.1:47128 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:13] INFO: 127.0.0.1:48444 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:13] INFO: 127.0.0.1:47672 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:13] INFO: 127.0.0.1:48016 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:13] INFO: 127.0.0.1:48070 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:13 TP0] Decode batch. #running-req: 159, #token: 78272, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14796.13, #queue-req: 0,
[2025-09-06 08:53:13] INFO: 127.0.0.1:47268 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:13] INFO: 127.0.0.1:48384 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:13] INFO: 127.0.0.1:47822 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:14] INFO: 127.0.0.1:47512 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:14] INFO: 127.0.0.1:47528 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:14] INFO: 127.0.0.1:47278 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:14] INFO: 127.0.0.1:47612 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:14] INFO: 127.0.0.1:47198 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:14 TP0] Decode batch. #running-req: 152, #token: 80832, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14010.88, #queue-req: 0,
[2025-09-06 08:53:14] INFO: 127.0.0.1:47334 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:14] INFO: 127.0.0.1:47084 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:14] INFO: 127.0.0.1:48236 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:14] INFO: 127.0.0.1:47016 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:14] INFO: 127.0.0.1:48196 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:14] INFO: 127.0.0.1:47886 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:14 TP0] Decode batch. #running-req: 145, #token: 83648, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13505.45, #queue-req: 0,
[2025-09-06 08:53:14] INFO: 127.0.0.1:47402 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:14] INFO: 127.0.0.1:48102 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:14] INFO: 127.0.0.1:47756 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:14] INFO: 127.0.0.1:48128 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:14] INFO: 127.0.0.1:47502 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:14] INFO: 127.0.0.1:47172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:47892 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15 TP0] Decode batch. #running-req: 139, #token: 84736, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13143.80, #queue-req: 0,
[2025-09-06 08:53:15] INFO: 127.0.0.1:47994 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:47630 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:47174 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:46852 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:46998 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:48274 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:47870 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:47980 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:47300 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:47316 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:47976 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:48346 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:47806 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:47138 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:48352 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:46794 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:48138 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:46818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:47226 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15 TP0] Decode batch. #running-req: 119, #token: 78144, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13663.45, #queue-req: 0,
[2025-09-06 08:53:15] INFO: 127.0.0.1:48212 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:47792 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:47682 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15 TP0] Decode batch. #running-req: 116, #token: 80704, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16118.27, #queue-req: 0,
[2025-09-06 08:53:15] INFO: 127.0.0.1:47848 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:47030 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:46926 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:47166 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:46814 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:47568 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:48254 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:46996 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15] INFO: 127.0.0.1:48430 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:15 TP0] Decode batch. #running-req: 107, #token: 78400, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15427.31, #queue-req: 0,
[2025-09-06 08:53:16] INFO: 127.0.0.1:47922 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:47956 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:47054 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:47212 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:47804 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:48002 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:47422 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:47716 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:47776 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:47552 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:47092 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:48330 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:48390 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16 TP0] Decode batch. #running-req: 96, #token: 73408, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14358.06, #queue-req: 0,
[2025-09-06 08:53:16] INFO: 127.0.0.1:47702 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:48234 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:47058 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:47690 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:46958 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16 TP0] Decode batch. #running-req: 89, #token: 72000, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13386.14, #queue-req: 0,
[2025-09-06 08:53:16] INFO: 127.0.0.1:47606 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:47436 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:47964 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:48114 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:47486 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:47930 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:46868 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:46890 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:47556 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:46836 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:47832 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16 TP0] Decode batch. #running-req: 78, #token: 67264, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12540.26, #queue-req: 0,
[2025-09-06 08:53:16] INFO: 127.0.0.1:48300 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:16] INFO: 127.0.0.1:48286 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:47112 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:47742 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17 TP0] Decode batch. #running-req: 74, #token: 65792, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11994.78, #queue-req: 0,
[2025-09-06 08:53:17] INFO: 127.0.0.1:47880 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:47646 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:47492 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:48414 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:47196 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:47186 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:47938 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:48042 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:47350 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17 TP0] Decode batch. #running-req: 65, #token: 60992, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11233.39, #queue-req: 0,
[2025-09-06 08:53:17] INFO: 127.0.0.1:48226 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:47638 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:47900 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:47718 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:46908 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17 TP0] Decode batch. #running-req: 60, #token: 59008, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10051.77, #queue-req: 0,
[2025-09-06 08:53:17] INFO: 127.0.0.1:48210 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:47514 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:46892 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:48370 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:47696 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:47464 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:47570 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:47990 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:48246 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:46880 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:47234 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17 TP0] Decode batch. #running-req: 49, #token: 49856, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9153.33, #queue-req: 0,
[2025-09-06 08:53:17] INFO: 127.0.0.1:48160 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:47284 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:47070 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:48146 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:48320 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:17] INFO: 127.0.0.1:47390 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18] INFO: 127.0.0.1:47584 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18] INFO: 127.0.0.1:47432 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18 TP0] Decode batch. #running-req: 42, #token: 43328, token usage: 0.01, cuda graph: True, gen throughput (token/s): 7998.64, #queue-req: 0,
[2025-09-06 08:53:18] INFO: 127.0.0.1:48054 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18] INFO: 127.0.0.1:48040 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18 TP0] Decode batch. #running-req: 39, #token: 43008, token usage: 0.01, cuda graph: True, gen throughput (token/s): 7167.57, #queue-req: 0,
[2025-09-06 08:53:18] INFO: 127.0.0.1:47116 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18] INFO: 127.0.0.1:47244 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18] INFO: 127.0.0.1:47406 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18] INFO: 127.0.0.1:47452 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18] INFO: 127.0.0.1:47004 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18] INFO: 127.0.0.1:47828 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18 TP0] Decode batch. #running-req: 33, #token: 37888, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6712.80, #queue-req: 0,
[2025-09-06 08:53:18] INFO: 127.0.0.1:48216 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18] INFO: 127.0.0.1:47608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18] INFO: 127.0.0.1:47940 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18] INFO: 127.0.0.1:47626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18 TP0] Decode batch. #running-req: 30, #token: 34560, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5671.46, #queue-req: 0,
[2025-09-06 08:53:18] INFO: 127.0.0.1:47706 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18] INFO: 127.0.0.1:48086 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18] INFO: 127.0.0.1:47862 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18] INFO: 127.0.0.1:48380 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18] INFO: 127.0.0.1:48046 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18] INFO: 127.0.0.1:47082 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18] INFO: 127.0.0.1:46806 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18 TP0] Decode batch. #running-req: 22, #token: 27456, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4806.00, #queue-req: 0,
[2025-09-06 08:53:18] INFO: 127.0.0.1:48032 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18] INFO: 127.0.0.1:48358 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:18] INFO: 127.0.0.1:47332 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:19] INFO: 127.0.0.1:47384 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:19 TP0] Decode batch. #running-req: 18, #token: 23360, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3878.30, #queue-req: 0,
[2025-09-06 08:53:19] INFO: 127.0.0.1:47950 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:19] INFO: 127.0.0.1:47748 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:19 TP0] Decode batch. #running-req: 16, #token: 21504, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3382.74, #queue-req: 0,
[2025-09-06 08:53:19] INFO: 127.0.0.1:47388 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:19] INFO: 127.0.0.1:47668 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:19 TP0] Decode batch. #running-req: 14, #token: 19584, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3287.37, #queue-req: 0,
[2025-09-06 08:53:19] INFO: 127.0.0.1:48170 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:19 TP0] Decode batch. #running-req: 13, #token: 18624, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2878.57, #queue-req: 0,
[2025-09-06 08:53:19] INFO: 127.0.0.1:46886 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:19 TP0] Decode batch. #running-req: 12, #token: 17792, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2658.46, #queue-req: 0,
[2025-09-06 08:53:19] INFO: 127.0.0.1:47152 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:19] INFO: 127.0.0.1:47494 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:20 TP0] Decode batch. #running-req: 10, #token: 15296, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2273.44, #queue-req: 0,
[2025-09-06 08:53:20] INFO: 127.0.0.1:47578 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:20] INFO: 127.0.0.1:47694 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:20 TP0] Decode batch. #running-req: 8, #token: 12416, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1975.55, #queue-req: 0,
[2025-09-06 08:53:20 TP0] Decode batch. #running-req: 8, #token: 12864, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1922.54, #queue-req: 0,
[2025-09-06 08:53:20] INFO: 127.0.0.1:47380 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:20 TP0] Decode batch. #running-req: 7, #token: 11392, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1674.58, #queue-req: 0,
[2025-09-06 08:53:20] INFO: 127.0.0.1:48404 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:20 TP0] Decode batch. #running-req: 6, #token: 10048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1481.07, #queue-req: 0,
[2025-09-06 08:53:20] INFO: 127.0.0.1:46874 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:20 TP0] Decode batch. #running-req: 5, #token: 8640, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1395.08, #queue-req: 0,
[2025-09-06 08:53:20] INFO: 127.0.0.1:48090 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:21 TP0] Decode batch. #running-req: 4, #token: 7040, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1068.41, #queue-req: 0,
[2025-09-06 08:53:21] INFO: 127.0.0.1:47908 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:53:21] INFO: 127.0.0.1:47736 - "POST /v1/chat/completions HTTP/1.1" 200 OK
  1%|          | 1/198 [00:10<35:11, 10.72s/it]
[2025-09-06 08:53:21 TP0] Decode batch. #running-req: 2, #token: 3456, token usage: 0.00, cuda graph: True, gen throughput (token/s): 814.09, #queue-req: 0,
[2025-09-06 08:53:21 TP0] Decode batch. #running-req: 2, #token: 3584, token usage: 0.00, cuda graph: True, gen throughput (token/s): 576.20, #queue-req: 0,
[2025-09-06 08:53:21 TP0] Decode batch. #running-req: 2, #token: 3648, token usage: 0.00, cuda graph: True, gen throughput (token/s): 580.52, #queue-req: 0,
[2025-09-06 08:53:21 TP0] Decode batch. #running-req: 2, #token: 3712, token usage: 0.00, cuda graph: True, gen throughput (token/s): 579.37, #queue-req: 0,
[2025-09-06 08:53:21 TP0] Decode batch. #running-req: 2, #token: 3776, token usage: 0.00, cuda graph: True, gen throughput (token/s): 581.49, #queue-req: 0,
[2025-09-06 08:53:21 TP0] Decode batch. #running-req: 2, #token: 3840, token usage: 0.00, cuda graph: True, gen throughput (token/s): 580.49, #queue-req: 0,
[2025-09-06 08:53:22 TP0] Decode batch. #running-req: 2, #token: 3968, token usage: 0.00, cuda graph: True, gen throughput (token/s): 580.24, #queue-req: 0,
[2025-09-06 08:53:22 TP0] Decode batch. #running-req: 2, #token: 4032, token usage: 0.00, cuda graph: True, gen throughput (token/s): 581.03, #queue-req: 0,
[2025-09-06 08:53:22 TP0] Decode batch. #running-req: 2, #token: 4096, token usage: 0.00, cuda graph: True, gen throughput (token/s): 579.36, #queue-req: 0,
[2025-09-06 08:53:22] INFO: 127.0.0.1:47040 - "POST /v1/chat/completions HTTP/1.1" 200 OK
 56%|█████▌    | 110/198 [00:11<00:06, 12.59it/s]
[2025-09-06 08:53:22 TP0] Decode batch. #running-req: 1, #token: 2112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 414.27, #queue-req: 0,
[2025-09-06 08:53:22 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.78, #queue-req: 0,
[2025-09-06 08:53:22 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.47, #queue-req: 0,
[2025-09-06 08:53:22 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.03, #queue-req: 0,
[2025-09-06 08:53:22 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.20, #queue-req: 0,
[2025-09-06 08:53:23 TP0] Decode batch. #running-req: 1, #token: 2304, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.86, #queue-req: 0,
[2025-09-06 08:53:23 TP0] Decode batch. #running-req: 1, #token: 2368, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.89, #queue-req: 0,
[2025-09-06 08:53:23 TP0] Decode batch. #running-req: 1, #token: 2368, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.49, #queue-req: 0,
[2025-09-06 08:53:23 TP0] Decode batch. #running-req: 1, #token: 2432, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.50, #queue-req: 0,
[2025-09-06 08:53:23] INFO: 127.0.0.1:47256 - "POST /v1/chat/completions HTTP/1.1" 200 OK
100%|██████████| 198/198 [00:13<00:00, 15.21it/s]
/usr/lib/python3.12/subprocess.py:1127: ResourceWarning: subprocess 54406 is still running
_warn("subprocess %s is still running" % self.pid,
ResourceWarning: Enable tracemalloc to get the object allocation traceback
.
----------------------------------------------------------------------
Ran 1 test in 177.303s
OK
Writing report to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.html
{'chars': 1690.560606060606, 'chars:std': 967.3517968413342, 'score:std': 0.48749802152178456, 'score': 0.6111111111111112}
Writing results to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.json
Total latency: 13.068 s
Score: 0.611
Evaluation end: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 metrics={'chars': 1690.560606060606, 'chars:std': 967.3517968413342, 'score:std': 0.48749802152178456, 'score': 0.6111111111111112}
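
The score line above is consistent with 0/1 grading over the 198 samples. A quick check, assuming each sample is scored 0 or 1 (so the population std is sqrt(p*(1-p)); the per-sample scoring rule itself is an assumption, not shown in this log):

import math

n, p = 198, 0.6111111111111112
print(round(p * n))            # 121 correct out of 198
print(math.sqrt(p * (1 - p)))  # 0.48749802..., matching the reported score:std
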
================================================================================
Run 10:
Auto-configed device: cuda
WARNING:sglang.srt.server_args:Detected SM100 and MXFP4 quantization format for GPT-OSS model, enabling FlashInfer MXFP4 MOE kernel.
WARNING:sglang.srt.server_args:TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64.
[2025-09-06 08:53:38] server_args=ServerArgs(model_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8400, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.93, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=True, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=460438675, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/home/yiliu7/models/openai/gpt-oss-120b', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='flashinfer_mxfp4', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, 
hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=200, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False)
All deep_gemm operations loaded successfully!
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:53:38] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:53:38] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:53:39] Using default HuggingFace chat template with detected content format: string
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:53:45 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:53:45 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:53:45 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:53:45 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:53:45 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:53:45 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:53:45 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:53:45 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:53:45 TP0] Init torch distributed begin.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:53:45 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:53:45 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:53:46 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:53:46 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:53:46 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:53:46 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:53:46 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:53:46 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:53:47 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:53:50 TP0] Init torch distributed ends. mem usage=1.46 GB
[2025-09-06 08:53:50 TP0] Load weight begin. avail mem=176.28 GB
All deep_gemm operations loaded successfully!
Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 1587.43it/s]
All deep_gemm operations loaded successfully!
[2025-09-06 08:54:05 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.0.mlp.experts), it might take a while...
All deep_gemm operations loaded successfully!
All deep_gemm operations loaded successfully!
[2025-09-06 08:54:08 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.1.mlp.experts), it might take a while...
[2025-09-06 08:54:11 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.2.mlp.experts), it might take a while...
[2025-09-06 08:54:14 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.3.mlp.experts), it might take a while...
[2025-09-06 08:54:17 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.4.mlp.experts), it might take a while...
[2025-09-06 08:54:20 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.5.mlp.experts), it might take a while...
[2025-09-06 08:54:23 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.6.mlp.experts), it might take a while...
[2025-09-06 08:54:26 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.7.mlp.experts), it might take a while...
[2025-09-06 08:54:30 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.8.mlp.experts), it might take a while...
[2025-09-06 08:54:33 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.9.mlp.experts), it might take a while...
[2025-09-06 08:54:36 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.10.mlp.experts), it might take a while...
[2025-09-06 08:54:39 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.11.mlp.experts), it might take a while...
[2025-09-06 08:54:42 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.12.mlp.experts), it might take a while...
[2025-09-06 08:54:45 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.13.mlp.experts), it might take a while...
[2025-09-06 08:54:48 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.14.mlp.experts), it might take a while...
[2025-09-06 08:54:51 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.15.mlp.experts), it might take a while...
[2025-09-06 08:54:54 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.16.mlp.experts), it might take a while...
[2025-09-06 08:54:57 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.17.mlp.experts), it might take a while...
[2025-09-06 08:55:00 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.18.mlp.experts), it might take a while...
[2025-09-06 08:55:03 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.19.mlp.experts), it might take a while...
[2025-09-06 08:55:06 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.20.mlp.experts), it might take a while...
[2025-09-06 08:55:09 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.21.mlp.experts), it might take a while...
[2025-09-06 08:55:12 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.22.mlp.experts), it might take a while...
[2025-09-06 08:55:15 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.23.mlp.experts), it might take a while...
[2025-09-06 08:55:18 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.24.mlp.experts), it might take a while...
[2025-09-06 08:55:22 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.25.mlp.experts), it might take a while...
[2025-09-06 08:55:25 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.26.mlp.experts), it might take a while...
[2025-09-06 08:55:28 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.27.mlp.experts), it might take a while...
[2025-09-06 08:55:31 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.28.mlp.experts), it might take a while...
[2025-09-06 08:55:34 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.29.mlp.experts), it might take a while...
[2025-09-06 08:55:37 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.30.mlp.experts), it might take a while...
[2025-09-06 08:55:40 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.31.mlp.experts), it might take a while...
[2025-09-06 08:55:43 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.32.mlp.experts), it might take a while...
[2025-09-06 08:55:46 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.33.mlp.experts), it might take a while...
[2025-09-06 08:55:49 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.34.mlp.experts), it might take a while...
[2025-09-06 08:55:52 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.35.mlp.experts), it might take a while...
[2025-09-06 08:55:55 TP0] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=158.06 GB, mem usage=18.22 GB.
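
The 18.22 GB above is the per-rank weight footprint; across tp=4 that is roughly 72.9 GB, which is in the right range for ~4.25 bits per MXFP4 expert parameter plus bf16 attention/embedding weights. A rough sanity check (the parameter count below is an assumption, not a value from the log):

# Rough per-rank weight budget check; the expert parameter count is assumed.
per_rank_gb, tp = 18.22, 4
print(per_rank_gb * tp)  # ~72.9 GB of weights across all 4 ranks

# MXFP4 (OCP microscaling): 4-bit values + one 8-bit scale per 32 elements,
# i.e. 4.25 bits per parameter for the quantized expert weights.
assumed_moe_params = 115e9
print(assumed_moe_params * 4.25 / 8 / 2**30)  # ~57 GiB for the experts alone
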
[2025-09-06 08:55:57 TP0] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:55:57 TP2] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:55:57 TP3] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:55:57 TP0] Memory pool end. avail mem=10.23 GB
[2025-09-06 08:55:57 TP1] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
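
The K/V sizes are consistent with the token count. A minimal arithmetic sketch, assuming gpt-oss-120b has 36 layers and 8 KV heads of head_dim 64 with bf16 KV entries (2 of the 8 heads land on each rank at tp=4); these shapes are assumptions, not read from the log:

# Back-of-envelope check of the logged KV-cache numbers (shapes assumed).
layers, kv_heads, head_dim, dtype_bytes, tp = 36, 8, 64, 2, 4

k_bytes_per_token = layers * (kv_heads // tp) * head_dim * dtype_bytes
tokens = 8_487_040                              # "#tokens" from the log
print(k_bytes_per_token)                        # 9216 bytes/token for K
print(tokens * k_bytes_per_token / 2**30)       # ~72.85 -> "K size: 72.85 GB"
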
[2025-09-06 08:55:57 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=9.54 GB
[2025-09-06 08:55:57 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200]
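
The 28 sizes above follow a simple pattern: powers of two up to 8, then multiples of 8 up to --cuda-graph-max-bs 200. A minimal sketch that reproduces the list (an assumed reconstruction, not SGLang's actual scheduling code):

# Hypothetical reconstruction of the 28-entry CUDA-graph capture schedule.
cuda_graph_max_bs = 200
capture_bs = [1, 2, 4, 8] + list(range(16, cuda_graph_max_bs + 1, 8))
print(len(capture_bs))  # 28
print(capture_bs)       # [1, 2, 4, 8, 16, 24, 32, ..., 192, 200]
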
Capturing batches (bs=200 avail_mem=9.39 GB):   0%|          | 0/28 [00:00<?, ?it/s]
rank 1 allocated ipc_handles: [['0x709c34000000', '0x70c1ca000000', '0x709bd0000000', '0x709bcc000000'], ['0x709bcf000000', '0x709bcee00000', '0x709bcf200000', '0x709bcf400000'], ['0x709bb8000000', '0x709bc2000000', '0x709bae000000', '0x709ba4000000']]
[2025-09-06 08:55:59.485] [info] lamportInitialize start: buffer: 0x709bc2000000, size: 71303168
rank 0 allocated ipc_handles: [['0x707408000000', '0x704e16000000', '0x704e12000000', '0x704e0e000000'], ['0x704e10e00000', '0x704e11000000', '0x704e11200000', '0x704e11400000'], ['0x704e04000000', '0x704dfa000000', '0x704df0000000', '0x704de6000000']]
[2025-09-06 08:55:59.537] [info] lamportInitialize start: buffer: 0x704e04000000, size: 71303168
rank 2 allocated ipc_handles: [['0x73c650000000', '0x73c5ec000000', '0x73ebe6000000', '0x73c5e8000000'], ['0x73c5eb000000', '0x73c5eb200000', '0x73c5eae00000', '0x73c5eb400000'], ['0x73c5d4000000', '0x73c5ca000000', '0x73c5de000000', '0x73c5c0000000']]
[2025-09-06 08:55:59.586] [info] lamportInitialize start: buffer: 0x73c5de000000, size: 71303168
rank 3 allocated ipc_handles: [['0x795490000000', '0x795432000000', '0x79542e000000', '0x797a26000000'], ['0x795431000000', '0x795431200000', '0x795431400000', '0x795430e00000'], ['0x79541a000000', '0x795410000000', '0x795406000000', '0x795424000000']]
[2025-09-06 08:55:59.636] [info] lamportInitialize start: buffer: 0x795424000000, size: 71303168
[2025-09-06 08:55:59 TP0] FlashInfer workspace initialized for rank 0, world_size 4
[2025-09-06 08:55:59 TP1] FlashInfer workspace initialized for rank 1, world_size 4
[2025-09-06 08:55:59 TP3] FlashInfer workspace initialized for rank 3, world_size 4
[2025-09-06 08:55:59 TP2] FlashInfer workspace initialized for rank 2, world_size 4
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 1 workspace[0] 0x709c34000000
Rank 1 workspace[1] 0x70c1ca000000
Rank 1 workspace[2] 0x709bd0000000
Rank 1 workspace[3] 0x709bcc000000
Rank 1 workspace[4] 0x709bcf000000
Rank 1 workspace[5] 0x709bcee00000
Rank 1 workspace[6] 0x709bcf200000
Rank 1 workspace[7] 0x709bcf400000
Rank 1 workspace[8] 0x709bb8000000
Rank 1 workspace[9] 0x709bc2000000
Rank 1 workspace[10] 0x709bae000000
Rank 1 workspace[11] 0x709ba4000000
Rank 1 workspace[12] 0x70c7d3264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 3 workspace[0] 0x795490000000
Rank 3 workspace[1] 0x795432000000
Rank 3 workspace[2] 0x79542e000000
Rank 3 workspace[3] 0x797a26000000
Rank 3 workspace[4] 0x795431000000
Rank 3 workspace[5] 0x795431200000
Rank 3 workspace[6] 0x795431400000
Rank 3 workspace[7] 0x795430e00000
Rank 3 workspace[8] 0x79541a000000
Rank 3 workspace[9] 0x795410000000
Rank 3 workspace[10] 0x795406000000
Rank 3 workspace[11] 0x795424000000
Rank 3 workspace[12] 0x798021264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 0 workspace[0] 0x707408000000
Rank 0 workspace[1] 0x704e16000000
Rank 0 workspace[2] 0x704e12000000
Rank 0 workspace[3] 0x704e0e000000
Rank 0 workspace[4] 0x704e10e00000
Rank 0 workspace[5] 0x704e11000000
Rank 0 workspace[6] 0x704e11200000
Rank 0 workspace[7] 0x704e11400000
Rank 0 workspace[8] 0x704e04000000
Rank 0 workspace[9] 0x704dfa000000
Rank 0 workspace[10] 0x704df0000000
Rank 0 workspace[11] 0x704de6000000
Rank 0 workspace[12] 0x7079ff264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 2 workspace[0] 0x73c650000000
Rank 2 workspace[1] 0x73c5ec000000
Rank 2 workspace[2] 0x73ebe6000000
Rank 2 workspace[3] 0x73c5e8000000
Rank 2 workspace[4] 0x73c5eb000000
Rank 2 workspace[5] 0x73c5eb200000
Rank 2 workspace[6] 0x73c5eae00000
Rank 2 workspace[7] 0x73c5eb400000
Rank 2 workspace[8] 0x73c5d4000000
Rank 2 workspace[9] 0x73c5ca000000
Rank 2 workspace[10] 0x73c5de000000
Rank 2 workspace[11] 0x73c5c0000000
Rank 2 workspace[12] 0x73f1f3264400
Capturing batches (bs=200 avail_mem=9.39 GB):   4%|▎         | 1/28 [00:02<00:57,  2.14s/it]
Capturing batches (bs=128 avail_mem=7.99 GB):  36%|███▌      | 10/28 [00:03<00:02,  6.54it/s]
Capturing batches (bs=64 avail_mem=7.91 GB):   64%|██████▍   | 18/28 [00:03<00:00, 10.17it/s]
Capturing batches (bs=1 avail_mem=7.82 GB):   100%|██████████| 28/28 [00:04<00:00,  6.22it/s]
[2025-09-06 08:56:02 TP3] Registering 56 cuda graph addresses
[2025-09-06 08:56:02 TP0] Registering 56 cuda graph addresses
[2025-09-06 08:56:02 TP1] Registering 56 cuda graph addresses
[2025-09-06 08:56:02 TP2] Registering 56 cuda graph addresses
[2025-09-06 08:56:02 TP0] Capture cuda graph end. Time elapsed: 5.03 s. mem usage=1.73 GB. avail mem=7.81 GB.
[2025-09-06 08:56:02 TP0] max_total_num_tokens=8487040, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=131072, available_gpu_mem=7.81 GB
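
Those limits imply a useful capacity figure: with 8,487,040 pooled tokens and a 131,072-token context, about 64 full-context requests fit at once, and at page_size 64 the pool holds 132,610 KV pages (straightforward arithmetic on the logged values):

max_total_tokens, context_len, page_size = 8_487_040, 131_072, 64
print(max_total_tokens / context_len)  # ~64.75 concurrent full-context reqs
print(max_total_tokens // page_size)   # 132,610 KV pages in the pool
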
[2025-09-06 08:56:03] INFO: Started server process [56872]
[2025-09-06 08:56:03] INFO: Waiting for application startup.
[2025-09-06 08:56:03] INFO: Application startup complete.
[2025-09-06 08:56:03] INFO: Uvicorn running on http://127.0.0.1:8400 (Press CTRL+C to quit)
[2025-09-06 08:56:04] INFO: 127.0.0.1:33132 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-09-06 08:56:04 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:56:05] INFO: 127.0.0.1:33148 - "POST /generate HTTP/1.1" 200 OK
[2025-09-06 08:56:05] The server is fired up and ready to roll!
[2025-09-06 08:56:12 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:56:13] INFO: 127.0.0.1:34932 - "GET /health_generate HTTP/1.1" 200 OK
command=python3 -m sglang.launch_server --model-path /home/yiliu7/models/openai/gpt-oss-120b --tp 4 --cuda-graph-max-bs 200 --mem-fraction-static 0.93 --device cuda --host 127.0.0.1 --port 8400
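
With the server up on 127.0.0.1:8400, the requests seen in the log can be reproduced with any OpenAI-compatible client. A minimal sketch (the model name must match the served model path; passing reasoning_effort through extra_body is an assumption about how this server forwards it):

# Minimal OpenAI-compatible request against the local SGLang server.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8400/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="/home/yiliu7/models/openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Say hello in one word."}],
    temperature=0.1,
    max_tokens=64,
    extra_body={"reasoning_effort": "low"},  # assumption: forwarded to the model
)
print(resp.choices[0].message.content)
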
Evaluation start: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6
ChatCompletionSampler initialized with self.system_message=None self.temperature=0.1 self.max_tokens=4096 self.reasoning_effort='low'
0%|          | 0/198 [00:00<?, ?it/s]
[2025-09-06 08:56:14 TP0] Prefill batch. #new-seq: 1, #new-token: 256, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:56:14 TP0] Prefill batch. #new-seq: 2, #new-token: 640, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0,
[2025-09-06 08:56:14 TP0] Prefill batch. #new-seq: 9, #new-token: 3328, #cached-token: 576, token usage: 0.00, #running-req: 3, #queue-req: 0,
[2025-09-06 08:56:14 TP0] Prefill batch. #new-seq: 7, #new-token: 2304, #cached-token: 448, token usage: 0.00, #running-req: 12, #queue-req: 0,
[2025-09-06 08:56:14 TP0] Prefill batch. #new-seq: 52, #new-token: 16192, #cached-token: 3328, token usage: 0.00, #running-req: 19, #queue-req: 35,
[2025-09-06 08:56:15 TP0] Prefill batch. #new-seq: 60, #new-token: 16128, #cached-token: 3840, token usage: 0.00, #running-req: 71, #queue-req: 3,
[2025-09-06 08:56:15 TP0] Prefill batch. #new-seq: 60, #new-token: 16000, #cached-token: 3968, token usage: 0.00, #running-req: 131, #queue-req: 0,
[2025-09-06 08:56:15 TP0] Prefill batch. #new-seq: 7, #new-token: 1920, #cached-token: 448, token usage: 0.01, #running-req: 191, #queue-req: 0,
[2025-09-06 08:56:15 TP0] Decode batch. #running-req: 198, #token: 62848, token usage: 0.01, cuda graph: True, gen throughput (token/s): 471.38, #queue-req: 0,
[2025-09-06 08:56:15] INFO: 127.0.0.1:34996 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:15] INFO: 127.0.0.1:34970 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:15] INFO: 127.0.0.1:36314 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:16 TP0] Decode batch. #running-req: 195, #token: 69056, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17246.54, #queue-req: 0,
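
gen throughput in these decode lines is the aggregate across all running requests, so the per-request decode rate is just the ratio (simple arithmetic on the line above):

agg_tok_per_s, running_reqs = 17246.54, 195   # from the decode line above
print(agg_tok_per_s / running_reqs)           # ~88.4 tok/s per request
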
[2025-09-06 08:56:16] INFO: 127.0.0.1:35984 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:16] INFO: 127.0.0.1:35740 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:16] INFO: 127.0.0.1:35238 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:16] INFO: 127.0.0.1:34938 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:16] INFO: 127.0.0.1:36640 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:16] INFO: 127.0.0.1:35298 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:16] INFO: 127.0.0.1:35828 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:16] INFO: 127.0.0.1:36500 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:16] INFO: 127.0.0.1:35320 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:16] INFO: 127.0.0.1:35844 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:16 TP0] Decode batch. #running-req: 187, #token: 73600, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16961.41, #queue-req: 0,
[2025-09-06 08:56:16] INFO: 127.0.0.1:36112 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:16] INFO: 127.0.0.1:35928 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:16] INFO: 127.0.0.1:35336 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:16] INFO: 127.0.0.1:35972 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:16] INFO: 127.0.0.1:35698 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:16] INFO: 127.0.0.1:35370 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:16 TP0] Decode batch. #running-req: 179, #token: 76032, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16309.95, #queue-req: 0,
[2025-09-06 08:56:16] INFO: 127.0.0.1:35012 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:16] INFO: 127.0.0.1:35156 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:17] INFO: 127.0.0.1:36228 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:17] INFO: 127.0.0.1:36050 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:17] INFO: 127.0.0.1:35464 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:17] INFO: 127.0.0.1:35730 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:17] INFO: 127.0.0.1:35342 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:17] INFO: 127.0.0.1:36282 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:17] INFO: 127.0.0.1:36592 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:17] INFO: 127.0.0.1:36056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:17] INFO: 127.0.0.1:36468 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:17] INFO: 127.0.0.1:35766 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:17] INFO: 127.0.0.1:35260 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:17 TP0] Decode batch. #running-req: 166, #token: 77760, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15304.53, #queue-req: 0,
[2025-09-06 08:56:17] INFO: 127.0.0.1:36100 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:17] INFO: 127.0.0.1:35098 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:17] INFO: 127.0.0.1:35462 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:17] INFO: 127.0.0.1:35932 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:17] INFO: 127.0.0.1:35660 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:17] INFO: 127.0.0.1:35070 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:17 TP0] Decode batch. #running-req: 160, #token: 79232, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14478.16, #queue-req: 0,
[2025-09-06 08:56:17] INFO: 127.0.0.1:35326 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:17] INFO: 127.0.0.1:35500 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:17] INFO: 127.0.0.1:36370 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:36614 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:36668 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:36246 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:35104 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:35218 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:36070 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:35408 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:35726 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18 TP0] Decode batch. #running-req: 149, #token: 79552, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13743.11, #queue-req: 0,
[2025-09-06 08:56:18] INFO: 127.0.0.1:36358 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:35644 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:36182 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:36278 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:35560 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:36158 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:35374 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:35814 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18 TP0] Decode batch. #running-req: 141, #token: 81216, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13274.29, #queue-req: 0,
[2025-09-06 08:56:18] INFO: 127.0.0.1:36116 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:34964 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:36140 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:35166 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:35228 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:35936 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:36518 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:18] INFO: 127.0.0.1:36414 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:35444 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:35480 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:35552 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:34972 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:35484 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:36472 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:35124 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19 TP0] Decode batch. #running-req: 126, #token: 77248, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13223.64, #queue-req: 0,
[2025-09-06 08:56:19] INFO: 127.0.0.1:35172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:36692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:35564 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:36580 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:35856 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:36464 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:36566 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:36624 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:36654 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:36430 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:36234 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19 TP0] Decode batch. #running-req: 115, #token: 75712, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16263.58, #queue-req: 0,
[2025-09-06 08:56:19] INFO: 127.0.0.1:35538 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:35046 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:36630 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:35684 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:36404 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:35614 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:35364 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:36172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19 TP0] Decode batch. #running-req: 107, #token: 74240, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15262.25, #queue-req: 0,
[2025-09-06 08:56:19] INFO: 127.0.0.1:35368 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:35148 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:35958 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:35024 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:35088 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:35574 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:36208 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:19] INFO: 127.0.0.1:35192 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:35038 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:35008 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20 TP0] Decode batch. #running-req: 97, #token: 71808, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14424.31, #queue-req: 0,
[2025-09-06 08:56:20] INFO: 127.0.0.1:36300 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:35870 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:36258 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:35528 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:36336 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:35268 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:35416 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:36138 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20 TP0] Decode batch. #running-req: 89, #token: 69952, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13512.18, #queue-req: 0,
[2025-09-06 08:56:20] INFO: 127.0.0.1:36298 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:35020 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:35452 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:36132 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:36000 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:35474 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:36408 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20 TP0] Decode batch. #running-req: 82, #token: 67136, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12757.55, #queue-req: 0,
[2025-09-06 08:56:20] INFO: 127.0.0.1:35208 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:36504 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:36388 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:35782 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:35312 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20 TP0] Decode batch. #running-req: 77, #token: 65984, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12161.34, #queue-req: 0,
[2025-09-06 08:56:20] INFO: 127.0.0.1:35584 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:36210 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:34952 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:36528 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:35942 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:20] INFO: 127.0.0.1:36288 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:36026 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:36280 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:35252 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:36554 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21 TP0] Decode batch. #running-req: 67, #token: 59968, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11380.33, #queue-req: 0,
[2025-09-06 08:56:21] INFO: 127.0.0.1:36348 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:35598 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:36596 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:34958 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:35276 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:36236 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:35756 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:36454 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21 TP0] Decode batch. #running-req: 59, #token: 54592, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9912.63, #queue-req: 0,
[2025-09-06 08:56:21] INFO: 127.0.0.1:36446 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:35896 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:35286 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21 TP0] Decode batch. #running-req: 56, #token: 54912, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9541.72, #queue-req: 0,
[2025-09-06 08:56:21] INFO: 127.0.0.1:36010 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:35902 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:35672 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:35146 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:36086 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:34994 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21 TP0] Decode batch. #running-req: 50, #token: 50112, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9111.40, #queue-req: 0,
[2025-09-06 08:56:21] INFO: 127.0.0.1:36684 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:35112 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:36480 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:35712 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:36708 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:35282 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:21] INFO: 127.0.0.1:35392 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:22] INFO: 127.0.0.1:36036 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:22] INFO: 127.0.0.1:35884 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:22 TP0] Decode batch. #running-req: 41, #token: 42432, token usage: 0.00, cuda graph: True, gen throughput (token/s): 7750.34, #queue-req: 0,
[2025-09-06 08:56:22] INFO: 127.0.0.1:35798 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:22] INFO: 127.0.0.1:36326 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:22] INFO: 127.0.0.1:36702 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:22] INFO: 127.0.0.1:35206 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:22] INFO: 127.0.0.1:35914 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:22] INFO: 127.0.0.1:34988 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:22 TP0] Decode batch. #running-req: 35, #token: 38464, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6543.68, #queue-req: 0,
[2025-09-06 08:56:22] INFO: 127.0.0.1:35352 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:22] INFO: 127.0.0.1:36542 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:22] INFO: 127.0.0.1:36078 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:22 TP0] Decode batch. #running-req: 32, #token: 35520, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6058.73, #queue-req: 0,
[2025-09-06 08:56:22] INFO: 127.0.0.1:35860 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:22] INFO: 127.0.0.1:34976 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:22] INFO: 127.0.0.1:36490 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:22] INFO: 127.0.0.1:35628 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:22] INFO: 127.0.0.1:35784 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:22 TP0] Decode batch. #running-req: 28, #token: 31872, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5606.79, #queue-req: 0,
[2025-09-06 08:56:22] INFO: 127.0.0.1:36218 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:22] INFO: 127.0.0.1:35882 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:22] INFO: 127.0.0.1:35450 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:22 TP0] Decode batch. #running-req: 24, #token: 29888, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4809.02, #queue-req: 0,
[2025-09-06 08:56:23] INFO: 127.0.0.1:35770 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:23] INFO: 127.0.0.1:36156 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:23] INFO: 127.0.0.1:36474 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:23 TP0] Decode batch. #running-req: 21, #token: 26880, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4527.52, #queue-req: 0,
[2025-09-06 08:56:23] INFO: 127.0.0.1:35056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:23] INFO: 127.0.0.1:36622 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:23] INFO: 127.0.0.1:36276 - "POST /v1/chat/completions HTTP/1.1" 200 OK
1%|          | 1/198 [00:08<28:36,  8.71s/it]
[2025-09-06 08:56:23] INFO: 127.0.0.1:35514 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:23 TP0] Decode batch. #running-req: 17, #token: 22400, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3719.03, #queue-req: 0,
[2025-09-06 08:56:23] INFO: 127.0.0.1:35640 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:23 TP0] Decode batch. #running-req: 16, #token: 21760, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3341.24, #queue-req: 0,
[2025-09-06 08:56:23] INFO: 127.0.0.1:36546 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:23] INFO: 127.0.0.1:35820 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:23] INFO: 127.0.0.1:35186 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:23] INFO: 127.0.0.1:35226 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:23 TP0] Decode batch. #running-req: 12, #token: 16896, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2893.92, #queue-req: 0,
[2025-09-06 08:56:23] INFO: 127.0.0.1:36448 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:23 TP0] Decode batch. #running-req: 11, #token: 16192, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2433.17, #queue-req: 0,
[2025-09-06 08:56:23] INFO: 127.0.0.1:36386 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:24] INFO: 127.0.0.1:35996 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:24 TP0] Decode batch. #running-req: 9, #token: 13376, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2083.63, #queue-req: 0,
[2025-09-06 08:56:24] INFO: 127.0.0.1:36194 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:24] INFO: 127.0.0.1:35428 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:24 TP0] Decode batch. #running-req: 7, #token: 9408, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1850.52, #queue-req: 0,
[2025-09-06 08:56:24] INFO: 127.0.0.1:35130 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:24] INFO: 127.0.0.1:35488 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:24 TP0] Decode batch. #running-req: 5, #token: 8064, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1423.57, #queue-req: 0,
[2025-09-06 08:56:24 TP0] Decode batch. #running-req: 5, #token: 8192, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1208.97, #queue-req: 0,
[2025-09-06 08:56:24] INFO: 127.0.0.1:36600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:24 TP0] Decode batch. #running-req: 4, #token: 6784, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1136.89, #queue-req: 0,
[2025-09-06 08:56:24 TP0] Decode batch. #running-req: 4, #token: 6912, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1051.13, #queue-req: 0,
[2025-09-06 08:56:25 TP0] Decode batch. #running-req: 4, #token: 7040, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1060.20, #queue-req: 0,
[2025-09-06 08:56:25 TP0] Decode batch. #running-req: 4, #token: 7296, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1063.30, #queue-req: 0,
[2025-09-06 08:56:25] INFO: 127.0.0.1:35380 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:25 TP0] Decode batch. #running-req: 3, #token: 5568, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1036.20, #queue-req: 0,
[2025-09-06 08:56:25] INFO: 127.0.0.1:36696 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:25 TP0] Decode batch. #running-req: 2, #token: 3968, token usage: 0.00, cuda graph: True, gen throughput (token/s): 634.01, #queue-req: 0,
[2025-09-06 08:56:25 TP0] Decode batch. #running-req: 2, #token: 3968, token usage: 0.00, cuda graph: True, gen throughput (token/s): 570.67, #queue-req: 0,
[2025-09-06 08:56:25 TP0] Decode batch. #running-req: 2, #token: 4096, token usage: 0.00, cuda graph: True, gen throughput (token/s): 578.23, #queue-req: 0,
[2025-09-06 08:56:25 TP0] Decode batch. #running-req: 2, #token: 4224, token usage: 0.00, cuda graph: True, gen throughput (token/s): 578.82, #queue-req: 0,
[2025-09-06 08:56:25] INFO: 127.0.0.1:35086 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:56:26] INFO: 127.0.0.1:36272 - "POST /v1/chat/completions HTTP/1.1" 200 OK
1%|          | 2/198 [00:11<17:06,  5.24s/it]
100%|██████████| 198/198 [00:11<00:00, 17.18it/s]
/usr/lib/python3.12/subprocess.py:1127: ResourceWarning: subprocess 56872 is still running
_warn("subprocess %s is still running" % self.pid,
ResourceWarning: Enable tracemalloc to get the object allocation traceback
.
----------------------------------------------------------------------
Ran 1 test in 175.811s
OK
Writing report to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.html
{'chars': 1651.1565656565656, 'chars:std': 960.4908734139215, 'score:std': 0.47958527980756577, 'score': 0.6414141414141414}
Writing results to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.json
Total latency: 11.574 s
Score: 0.641
Evaluation end: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 metrics={'chars': 1651.1565656565656, 'chars:std': 960.4908734139215, 'score:std': 0.47958527980756577, 'score': 0.6414141414141414}
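
The summary numbers are internally consistent: the score 0.6414... is exactly 127/198 correct, score:std is the population standard deviation of those 0/1 grades, and 198 requests in 11.574 s matches the ~17 it/s tqdm rate. A quick check (the 127-correct decomposition is inferred from the score, not logged):

import math

n, correct = 198, 127                 # inferred from score == 127/198
p = correct / n
print(p)                              # 0.6414141414141414 -> 'score'
print(math.sqrt(p * (1 - p)))         # 0.47958527980756... -> 'score:std'
print(n / 11.574)                     # ~17.1 req/s vs. tqdm's 17.18 it/s
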
================================================================================