Created September 6, 2025 09:13
Run 1:
Auto-configed device: cuda
WARNING:sglang.srt.server_args:Detected SM100 and MXFP4 quantization format for GPT-OSS model, enabling FlashInfer MXFP4 MOE kernel.
WARNING:sglang.srt.server_args:TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64.
[2025-09-06 08:26:09] server_args=ServerArgs(model_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8400, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.93, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=True, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=950273309, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/home/yiliu7/models/openai/gpt-oss-120b', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='flashinfer_mxfp4', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=200, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False)
All deep_gemm operations loaded successfully!
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:26:09] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:26:09] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:26:10] Using default HuggingFace chat template with detected content format: string
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:26:17 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:26:17 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:26:17 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:26:17 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:26:17 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:26:17 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:26:17 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:26:17 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:26:17 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:26:17 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:26:17 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:26:17 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:26:17 TP0] Init torch distributed begin.
[2025-09-06 08:26:17 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:26:17 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:26:17 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:26:17 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:26:19 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:26:21 TP0] Init torch distributed ends. mem usage=1.46 GB
[2025-09-06 08:26:22 TP0] Load weight begin. avail mem=176.28 GB
All deep_gemm operations loaded successfully!
Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 1690.57it/s]
All deep_gemm operations loaded successfully!
[2025-09-06 08:26:35 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.0.mlp.experts), it might take a while...
All deep_gemm operations loaded successfully!
All deep_gemm operations loaded successfully!
[2025-09-06 08:26:38 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.1.mlp.experts), it might take a while...
[2025-09-06 08:26:41 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.2.mlp.experts), it might take a while...
[2025-09-06 08:26:44 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.3.mlp.experts), it might take a while...
[2025-09-06 08:26:47 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.4.mlp.experts), it might take a while...
[2025-09-06 08:26:50 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.5.mlp.experts), it might take a while...
[2025-09-06 08:26:53 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.6.mlp.experts), it might take a while...
[2025-09-06 08:26:56 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.7.mlp.experts), it might take a while...
[2025-09-06 08:26:59 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.8.mlp.experts), it might take a while...
[2025-09-06 08:27:03 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.9.mlp.experts), it might take a while...
[2025-09-06 08:27:06 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.10.mlp.experts), it might take a while...
[2025-09-06 08:27:09 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.11.mlp.experts), it might take a while...
[2025-09-06 08:27:12 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.12.mlp.experts), it might take a while...
[2025-09-06 08:27:15 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.13.mlp.experts), it might take a while...
[2025-09-06 08:27:18 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.14.mlp.experts), it might take a while...
[2025-09-06 08:27:21 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.15.mlp.experts), it might take a while...
[2025-09-06 08:27:24 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.16.mlp.experts), it might take a while...
[2025-09-06 08:27:27 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.17.mlp.experts), it might take a while...
[2025-09-06 08:27:30 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.18.mlp.experts), it might take a while...
[2025-09-06 08:27:33 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.19.mlp.experts), it might take a while...
[2025-09-06 08:27:36 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.20.mlp.experts), it might take a while...
[2025-09-06 08:27:39 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.21.mlp.experts), it might take a while...
[2025-09-06 08:27:42 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.22.mlp.experts), it might take a while...
[2025-09-06 08:27:46 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.23.mlp.experts), it might take a while...
[2025-09-06 08:27:49 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.24.mlp.experts), it might take a while...
[2025-09-06 08:27:52 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.25.mlp.experts), it might take a while...
[2025-09-06 08:27:55 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.26.mlp.experts), it might take a while...
[2025-09-06 08:27:58 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.27.mlp.experts), it might take a while...
[2025-09-06 08:28:01 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.28.mlp.experts), it might take a while...
[2025-09-06 08:28:04 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.29.mlp.experts), it might take a while...
[2025-09-06 08:28:07 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.30.mlp.experts), it might take a while...
[2025-09-06 08:28:10 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.31.mlp.experts), it might take a while...
[2025-09-06 08:28:13 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.32.mlp.experts), it might take a while...
[2025-09-06 08:28:16 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.33.mlp.experts), it might take a while...
[2025-09-06 08:28:19 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.34.mlp.experts), it might take a while...
[2025-09-06 08:28:22 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.35.mlp.experts), it might take a while...
[2025-09-06 08:28:26 TP0] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=158.06 GB, mem usage=18.22 GB.
[2025-09-06 08:28:26 TP3] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:28:26 TP0] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:28:26 TP2] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:28:26 TP0] Memory pool end. avail mem=10.23 GB
[2025-09-06 08:28:26 TP1] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
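The per-rank K size lines up with the model geometry. A quick back-of-the-envelope check, assuming gpt-oss-120b's config (36 layers, 8 KV heads, head_dim 64; these numbers are an assumption, not read from the log) with a bf16 KV cache sharded across tp=4:

```python
# Sanity-check the logged per-rank K cache size (72.85 GB) from assumed
# gpt-oss-120b geometry: 36 layers, 8 KV heads, head_dim 64, bf16 elements.
num_layers = 36          # assumption from the model config
num_kv_heads = 8         # assumption; grouped-query attention
head_dim = 64            # assumption
bytes_per_elem = 2       # bf16
tp_size = 4              # --tp 4 from the launch command
num_tokens = 8_487_040   # "#tokens" from the log

# Each rank holds num_kv_heads / tp_size = 2 KV heads per layer.
k_bytes_per_token = num_layers * (num_kv_heads // tp_size) * head_dim * bytes_per_elem
k_total_gib = k_bytes_per_token * num_tokens / 2**30
print(k_bytes_per_token, round(k_total_gib, 2))  # 9216 bytes/token, ~72.8 GiB
```

The result lands within rounding of the logged 72.85 GB, and the same figure applies to V.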
[2025-09-06 08:28:26 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=9.54 GB
[2025-09-06 08:28:26 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200]
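The 28 capture sizes above follow a simple pattern: [1, 2, 4], then multiples of 8 up to `--cuda-graph-max-bs 200`. A sketch reproducing the list (the pattern is inferred from this log, not necessarily SGLang's exact rule):

```python
# Reproduce the logged CUDA-graph capture batch sizes for --cuda-graph-max-bs 200.
# Pattern inferred from this run: [1, 2, 4], then every multiple of 8 up to the cap.
cuda_graph_max_bs = 200
capture_bs = [1, 2, 4] + list(range(8, cuda_graph_max_bs + 1, 8))
print(len(capture_bs), capture_bs[-1])  # 28 200
```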
rank 1 allocated ipc_handles: [['0x777854000000', '0x779dea000000', '0x7777f0000000', '0x7777ec000000'], ['0x7777ef000000', '0x7777eee00000', '0x7777ef200000', '0x7777ef400000'], ['0x7777d8000000', '0x7777e2000000', '0x7777ce000000', '0x7777c4000000']]
[2025-09-06 08:28:28.689] [info] lamportInitialize start: buffer: 0x7777e2000000, size: 71303168
rank 0 allocated ipc_handles: [['0x76738e000000', '0x764df2000000', '0x764d9c000000', '0x764d98000000'], ['0x764d9ae00000', '0x764d9b000000', '0x764d9b200000', '0x764d9b400000'], ['0x764d8e000000', '0x764d84000000', '0x764d7a000000', '0x764d70000000']]
[2025-09-06 08:28:28.738] [info] lamportInitialize start: buffer: 0x764d8e000000, size: 71303168
rank 3 allocated ipc_handles: [['0x7b1fe4000000', '0x7b1f88000000', '0x7b1f84000000', '0x7b4580000000'], ['0x7b1f87000000', '0x7b1f87200000', '0x7b1f87400000', '0x7b1f86e00000'], ['0x7b1f70000000', '0x7b1f66000000', '0x7b1f5c000000', '0x7b1f7a000000']]
[2025-09-06 08:28:28.787] [info] lamportInitialize start: buffer: 0x7b1f7a000000, size: 71303168
rank 2 allocated ipc_handles: [['0x7fae78000000', '0x7fae14000000', '0x7fd40e000000', '0x7fae10000000'], ['0x7fae13000000', '0x7fae13200000', '0x7fae12e00000', '0x7fae13400000'], ['0x7fadfc000000', '0x7fadf2000000', '0x7fae06000000', '0x7fade8000000']]
[2025-09-06 08:28:28.838] [info] lamportInitialize start: buffer: 0x7fae06000000, size: 71303168
[2025-09-06 08:28:28 TP0] FlashInfer workspace initialized for rank 0, world_size 4
[2025-09-06 08:28:28 TP1] FlashInfer workspace initialized for rank 1, world_size 4
[2025-09-06 08:28:28 TP2] FlashInfer workspace initialized for rank 2, world_size 4
[2025-09-06 08:28:28 TP3] FlashInfer workspace initialized for rank 3, world_size 4
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 0 workspace[0] 0x76738e000000
Rank 0 workspace[1] 0x764df2000000
Rank 0 workspace[2] 0x764d9c000000
Rank 0 workspace[3] 0x764d98000000
Rank 0 workspace[4] 0x764d9ae00000
Rank 0 workspace[5] 0x764d9b000000
Rank 0 workspace[6] 0x764d9b200000
Rank 0 workspace[7] 0x764d9b400000
Rank 0 workspace[8] 0x764d8e000000
Rank 0 workspace[9] 0x764d84000000
Rank 0 workspace[10] 0x764d7a000000
Rank 0 workspace[11] 0x764d70000000
Rank 0 workspace[12] 0x767987264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 2 workspace[0] 0x7fae78000000
Rank 2 workspace[1] 0x7fae14000000
Rank 2 workspace[2] 0x7fd40e000000
Rank 2 workspace[3] 0x7fae10000000
Rank 2 workspace[4] 0x7fae13000000
Rank 2 workspace[5] 0x7fae13200000
Rank 2 workspace[6] 0x7fae12e00000
Rank 2 workspace[7] 0x7fae13400000
Rank 2 workspace[8] 0x7fadfc000000
Rank 2 workspace[9] 0x7fadf2000000
Rank 2 workspace[10] 0x7fae06000000
Rank 2 workspace[11] 0x7fade8000000
Rank 2 workspace[12] 0x7fda1b264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 3 workspace[0] 0x7b1fe4000000
Rank 3 workspace[1] 0x7b1f88000000
Rank 3 workspace[2] 0x7b1f84000000
Rank 3 workspace[3] 0x7b4580000000
Rank 3 workspace[4] 0x7b1f87000000
Rank 3 workspace[5] 0x7b1f87200000
Rank 3 workspace[6] 0x7b1f87400000
Rank 3 workspace[7] 0x7b1f86e00000
Rank 3 workspace[8] 0x7b1f70000000
Rank 3 workspace[9] 0x7b1f66000000
Rank 3 workspace[10] 0x7b1f5c000000
Rank 3 workspace[11] 0x7b1f7a000000
Rank 3 workspace[12] 0x7b4b7b264400
set flag_ptr[3] = lamport_comm_size: 47185920
Rank 1 workspace[0] 0x777854000000
Rank 1 workspace[1] 0x779dea000000
Rank 1 workspace[2] 0x7777f0000000
Rank 1 workspace[3] 0x7777ec000000
Rank 1 workspace[4] 0x7777ef000000
Rank 1 workspace[5] 0x7777eee00000
Rank 1 workspace[6] 0x7777ef200000
Rank 1 workspace[7] 0x7777ef400000
Rank 1 workspace[8] 0x7777d8000000
Rank 1 workspace[9] 0x7777e2000000
Rank 1 workspace[10] 0x7777ce000000
Rank 1 workspace[11] 0x7777c4000000
Rank 1 workspace[12] 0x77a3f3264400
Capturing batches (bs=1 avail_mem=7.82 GB): 100%|██████████| 28/28 [00:04<00:00, 6.00it/s]
[2025-09-06 08:28:31 TP0] Registering 56 cuda graph addresses
[2025-09-06 08:28:31 TP2] Registering 56 cuda graph addresses
[2025-09-06 08:28:31 TP3] Registering 56 cuda graph addresses
[2025-09-06 08:28:31 TP1] Registering 56 cuda graph addresses
[2025-09-06 08:28:31 TP0] Capture cuda graph end. Time elapsed: 5.17 s. mem usage=1.73 GB. avail mem=7.81 GB.
[2025-09-06 08:28:32 TP0] max_total_num_tokens=8487040, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=131072, available_gpu_mem=7.81 GB
[2025-09-06 08:28:33] INFO: Started server process [34489]
[2025-09-06 08:28:33] INFO: Waiting for application startup.
[2025-09-06 08:28:33] INFO: Application startup complete.
[2025-09-06 08:28:33] INFO: Uvicorn running on http://127.0.0.1:8400 (Press CTRL+C to quit)
[2025-09-06 08:28:34] INFO: 127.0.0.1:46012 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-09-06 08:28:34] INFO: 127.0.0.1:46014 - "GET /health_generate HTTP/1.1" 503 Service Unavailable
[2025-09-06 08:28:34 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:28:35] INFO: 127.0.0.1:46026 - "POST /generate HTTP/1.1" 200 OK
[2025-09-06 08:28:35] The server is fired up and ready to roll!
[2025-09-06 08:28:44 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:28:45] INFO: 127.0.0.1:56076 - "GET /health_generate HTTP/1.1" 200 OK
command=python3 -m sglang.launch_server --model-path /home/yiliu7/models/openai/gpt-oss-120b --tp 4 --cuda-graph-max-bs 200 --mem-fraction-static 0.93 --device cuda --host 127.0.0.1 --port 8400
Evaluation start: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6
ChatCompletionSampler initialized with self.system_message=None self.temperature=0.1 self.max_tokens=4096 self.reasoning_effort='low'
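The eval requests hitting /v1/chat/completions below can be reproduced against this server with any OpenAI-compatible client. A minimal sketch, assuming the server from the command above is up on 127.0.0.1:8400; the payload mirrors the logged sampler settings, and passing reasoning_effort as a top-level field is an assumption about how the sampler forwards it:

```python
import json
import urllib.request

# Request payload matching the logged sampler settings
# (temperature=0.1, max_tokens=4096, reasoning_effort='low').
payload = {
    "model": "/home/yiliu7/models/openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "temperature": 0.1,
    "max_tokens": 4096,
    "reasoning_effort": "low",  # assumption: forwarded as a top-level field
}

def send(payload, url="http://127.0.0.1:8400/v1/chat/completions"):
    """POST the payload to the running server and return the parsed response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# resp = send(payload)  # uncomment with the server running
```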
0%| | 0/198 [00:00<?, ?it/s]
[2025-09-06 08:28:46 TP0] Prefill batch. #new-seq: 1, #new-token: 256, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-09-06 08:28:46 TP0] Prefill batch. #new-seq: 9, #new-token: 2304, #cached-token: 576, token usage: 0.00, #running-req: 1, #queue-req: 0,
[2025-09-06 08:28:46 TP0] Prefill batch. #new-seq: 3, #new-token: 768, #cached-token: 192, token usage: 0.00, #running-req: 10, #queue-req: 0,
[2025-09-06 08:28:46 TP0] Prefill batch. #new-seq: 56, #new-token: 16064, #cached-token: 3584, token usage: 0.00, #running-req: 13, #queue-req: 39,
[2025-09-06 08:28:46 TP0] Prefill batch. #new-seq: 54, #new-token: 16320, #cached-token: 3456, token usage: 0.00, #running-req: 69, #queue-req: 6,
[2025-09-06 08:28:46 TP0] Prefill batch. #new-seq: 56, #new-token: 16320, #cached-token: 3648, token usage: 0.00, #running-req: 123, #queue-req: 4,
[2025-09-06 08:28:46 TP0] Prefill batch. #new-seq: 19, #new-token: 4672, #cached-token: 1216, token usage: 0.01, #running-req: 179, #queue-req: 0,
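Note that #new-token per prefill batch stays at or under chunked_prefill_size=16384 from the server args, with the overflow sequences pushed to #queue-req. A rough sketch of that admission behavior (a deliberate simplification, not SGLang's actual scheduler):

```python
# Simplified chunked-prefill admission: pack whole sequences into a batch
# until the next one would overflow the 16384-token chunk budget.
def pack_prefill(seq_lens, budget=16384):
    batch, used = [], 0
    for n in seq_lens:
        if used + n > budget:
            break  # remaining sequences wait in the queue
        batch.append(n)
        used += n
    return batch, used

batch, used = pack_prefill([256] * 100)
print(len(batch), used)  # 64 16384
```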
[2025-09-06 08:28:47 TP0] Decode batch. #running-req: 198, #token: 61888, token usage: 0.01, cuda graph: True, gen throughput (token/s): 347.01, #queue-req: 0,
[2025-09-06 08:28:47] INFO: 127.0.0.1:56212 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:47 TP0] Decode batch. #running-req: 197, #token: 68928, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17331.29, #queue-req: 0,
[2025-09-06 08:28:47] INFO: 127.0.0.1:56736 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:47] INFO: 127.0.0.1:56152 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:47] INFO: 127.0.0.1:56114 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:56536 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57098 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57084 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57222 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48 TP0] Decode batch. #running-req: 190, #token: 72064, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17223.27, #queue-req: 0,
[2025-09-06 08:28:48] INFO: 127.0.0.1:56472 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:56622 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57690 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57758 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57572 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:56564 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57002 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:56574 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:56594 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57218 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:56978 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:56338 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57338 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57158 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:56762 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48 TP0] Decode batch. #running-req: 175, #token: 72832, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16416.05, #queue-req: 0,
[2025-09-06 08:28:48] INFO: 127.0.0.1:57452 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57804 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:57294 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:48] INFO: 127.0.0.1:56488 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:56324 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:57374 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:57022 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49 TP0] Decode batch. #running-req: 168, #token: 76544, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15784.57, #queue-req: 0,
[2025-09-06 08:28:49] INFO: 127.0.0.1:56164 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:56438 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:57172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:56116 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:57332 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:56950 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:57792 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49 TP0] Decode batch. #running-req: 161, #token: 79744, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14688.78, #queue-req: 0,
[2025-09-06 08:28:49] INFO: 127.0.0.1:56584 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:57510 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:56658 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:57862 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:56220 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:57704 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:57424 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:28:49] INFO: 127.0.0.1:56344 - "POST /v1/chat/completions HTTP/1.1" 200 OK
| [2025-09-06 08:28:49] INFO: 127.0.0.1:56844 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:49 TP0] Decode batch. #running-req: 153, #token: 81536, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13916.02, #queue-req: 0, | |
| [2025-09-06 08:28:49] INFO: 127.0.0.1:57474 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:56940 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:57602 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:57816 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:57870 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:56822 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:57614 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:57496 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:57388 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50 TP0] Decode batch. #running-req: 143, #token: 82432, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13423.86, #queue-req: 0, | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:56836 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:57050 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:56208 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:57780 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:56380 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:57340 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:57208 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:57412 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:56794 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50 TP0] Decode batch. #running-req: 134, #token: 83136, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12796.65, #queue-req: 0, | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:57112 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:57202 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:56870 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:56974 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:50] INFO: 127.0.0.1:56138 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:56724 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:56378 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:56968 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:57880 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:56086 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:56348 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:57624 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:57640 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:56248 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:57394 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51 TP0] Decode batch. #running-req: 119, #token: 78976, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13816.37, #queue-req: 0, | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:57250 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:57670 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:56452 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:57464 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:56880 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:57420 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51 TP0] Decode batch. #running-req: 113, #token: 78784, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15745.19, #queue-req: 0, | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:56614 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:56522 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:56422 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:56280 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51 TP0] Decode batch. #running-req: 109, #token: 81024, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15465.52, #queue-req: 0, | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:57000 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:56602 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:56730 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:56264 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:51] INFO: 127.0.0.1:56608 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:56204 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:57540 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:57744 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52 TP0] Decode batch. #running-req: 102, #token: 78848, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14793.92, #queue-req: 0, | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:57674 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:56308 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:57542 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:57134 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:56776 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:57860 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52 TP0] Decode batch. #running-req: 97, #token: 78016, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13885.71, #queue-req: 0, | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:56126 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:57660 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:57798 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:56682 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:57586 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:56318 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:57622 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:57280 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:57560 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52 TP0] Decode batch. #running-req: 86, #token: 73984, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13284.89, #queue-req: 0, | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:56556 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:56314 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:56240 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:57186 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:56710 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:57328 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:56814 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:57718 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52 TP0] Decode batch. #running-req: 78, #token: 70208, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12384.46, #queue-req: 0, | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:56912 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:57522 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:56394 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:56632 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:57018 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:57470 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:56364 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:52] INFO: 127.0.0.1:56458 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:53] INFO: 127.0.0.1:56752 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:53] INFO: 127.0.0.1:56782 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:53] INFO: 127.0.0.1:57354 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:53] INFO: 127.0.0.1:56360 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:53 TP0] Decode batch. #running-req: 66, #token: 62592, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11198.65, #queue-req: 0, | |
| [2025-09-06 08:28:53] INFO: 127.0.0.1:57040 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:53] INFO: 127.0.0.1:57170 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:53] INFO: 127.0.0.1:56296 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:53] INFO: 127.0.0.1:57766 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:53 TP0] Decode batch. #running-req: 62, #token: 61312, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10345.65, #queue-req: 0, | |
| [2025-09-06 08:28:53] INFO: 127.0.0.1:57146 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:53] INFO: 127.0.0.1:57382 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:53] INFO: 127.0.0.1:57574 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:53] INFO: 127.0.0.1:57300 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:53] INFO: 127.0.0.1:56896 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:53] INFO: 127.0.0.1:57546 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:53 TP0] Decode batch. #running-req: 56, #token: 57344, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10066.95, #queue-req: 0, | |
| [2025-09-06 08:28:53] INFO: 127.0.0.1:56504 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:53 TP0] Decode batch. #running-req: 55, #token: 58432, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9385.78, #queue-req: 0, | |
| [2025-09-06 08:28:53] INFO: 127.0.0.1:57148 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:53] INFO: 127.0.0.1:56962 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:53] INFO: 127.0.0.1:56904 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:57698 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:56656 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:57836 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54 TP0] Decode batch. #running-req: 49, #token: 52608, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8793.59, #queue-req: 0, | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:56334 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:57042 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:57312 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:57276 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:57822 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:56928 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:57528 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:57124 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:57028 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:57138 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54 TP0] Decode batch. #running-req: 39, #token: 44288, token usage: 0.01, cuda graph: True, gen throughput (token/s): 7588.36, #queue-req: 0, | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:56346 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:56304 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:56552 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:57440 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:57370 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:57490 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:56098 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54 TP0] Decode batch. #running-req: 33, #token: 37568, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6607.97, #queue-req: 0, | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:56230 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:56448 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:57848 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:56190 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:57872 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54 TP0] Decode batch. #running-req: 27, #token: 33088, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5465.59, #queue-req: 0, | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:57142 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:57080 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54] INFO: 127.0.0.1:57234 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:54 TP0] Decode batch. #running-req: 24, #token: 30464, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4766.63, #queue-req: 0, | |
| [2025-09-06 08:28:55] INFO: 127.0.0.1:57700 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:55] INFO: 127.0.0.1:56994 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:55] INFO: 127.0.0.1:57064 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:55 TP0] Decode batch. #running-req: 21, #token: 27392, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4507.33, #queue-req: 0, | |
| [2025-09-06 08:28:55] INFO: 127.0.0.1:57258 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:55] INFO: 127.0.0.1:56174 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:55] INFO: 127.0.0.1:56516 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:55 TP0] Decode batch. #running-req: 18, #token: 24256, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3896.98, #queue-req: 0, | |
| [2025-09-06 08:28:55] INFO: 127.0.0.1:56860 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:55] INFO: 127.0.0.1:57788 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:55] INFO: 127.0.0.1:56800 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:55] INFO: 127.0.0.1:56118 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:55 TP0] Decode batch. #running-req: 14, #token: 19456, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3175.32, #queue-req: 0, | |
| [2025-09-06 08:28:55] INFO: 127.0.0.1:56454 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:55] INFO: 127.0.0.1:56428 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:55] INFO: 127.0.0.1:57728 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:55 TP0] Decode batch. #running-req: 11, #token: 15808, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2722.85, #queue-req: 0, | |
| [2025-09-06 08:28:55] INFO: 127.0.0.1:57644 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:55 TP0] Decode batch. #running-req: 10, #token: 14720, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2218.75, #queue-req: 0, | |
| [2025-09-06 08:28:56 TP0] Decode batch. #running-req: 10, #token: 14848, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2126.31, #queue-req: 0, | |
| [2025-09-06 08:28:56] INFO: 127.0.0.1:56696 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:56] INFO: 127.0.0.1:56354 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:56 TP0] Decode batch. #running-req: 8, #token: 12352, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2016.74, #queue-req: 0, | |
| [2025-09-06 08:28:56] INFO: 127.0.0.1:57772 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:56 TP0] Decode batch. #running-req: 7, #token: 11008, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1879.52, #queue-req: 0, | |
| [2025-09-06 08:28:56 TP0] Decode batch. #running-req: 7, #token: 11328, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1672.65, #queue-req: 0, | |
| [2025-09-06 08:28:56] INFO: 127.0.0.1:57398 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:56] INFO: 127.0.0.1:57484 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 1%| | 1/198 [00:10<35:11, 10.72s/it] | |
| [2025-09-06 08:28:56 TP0] Decode batch. #running-req: 5, #token: 8320, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1373.21, #queue-req: 0, | |
| [2025-09-06 08:28:57 TP0] Decode batch. #running-req: 5, #token: 8384, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1212.19, #queue-req: 0, | |
| [2025-09-06 08:28:57 TP0] Decode batch. #running-req: 5, #token: 8704, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1206.92, #queue-req: 0, | |
| [2025-09-06 08:28:57] INFO: 127.0.0.1:57268 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:57 TP0] Decode batch. #running-req: 5, #token: 7232, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1188.35, #queue-req: 0, | |
| [2025-09-06 08:28:57 TP0] Decode batch. #running-req: 4, #token: 7232, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1062.48, #queue-req: 0, | |
| [2025-09-06 08:28:57 TP0] Decode batch. #running-req: 4, #token: 7488, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1055.36, #queue-req: 0, | |
| [2025-09-06 08:28:57 TP0] Decode batch. #running-req: 4, #token: 7616, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1066.69, #queue-req: 0, | |
| [2025-09-06 08:28:57 TP0] Decode batch. #running-req: 4, #token: 7744, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1058.33, #queue-req: 0, | |
| [2025-09-06 08:28:58] INFO: 127.0.0.1:57696 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 12%|█▏ | 24/198 [00:11<01:04, 2.70it/s] | |
| [2025-09-06 08:28:58] INFO: 127.0.0.1:56406 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:28:58 TP0] Decode batch. #running-req: 2, #token: 4032, token usage: 0.00, cuda graph: True, gen throughput (token/s): 774.95, #queue-req: 0, | |
| [2025-09-06 08:28:58 TP0] Decode batch. #running-req: 2, #token: 4032, token usage: 0.00, cuda graph: True, gen throughput (token/s): 580.43, #queue-req: 0, | |
| [2025-09-06 08:28:58] INFO: 127.0.0.1:56640 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 56%|█████▌ | 110/198 [00:12<00:05, 15.97it/s] | |
| [2025-09-06 08:28:58 TP0] Decode batch. #running-req: 1, #token: 2112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 449.97, #queue-req: 0, | |
| [2025-09-06 08:28:58] INFO: 127.0.0.1:56666 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 65%|██████▌ | 129/198 [00:12<00:03, 19.74it/s] | |
| 100%|██████████| 198/198 [00:12<00:00, 16.00it/s] | |
| /usr/lib/python3.12/subprocess.py:1127: ResourceWarning: subprocess 34489 is still running | |
| _warn("subprocess %s is still running" % self.pid, | |
| ResourceWarning: Enable tracemalloc to get the object allocation traceback | |
| . | |
| ---------------------------------------------------------------------- | |
| Ran 1 test in 176.620s | |
| OK | |
| Writing report to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.html | |
| {'chars': 1754.6666666666667, 'chars:std': 1020.4785765769535, 'score:std': 0.48631931786709987, 'score': 0.6161616161616161} | |
| Writing results to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.json | |
| Total latency: 12.432 s | |
| Score: 0.616 | |
| Evaluation end: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 metrics={'chars': 1754.6666666666667, 'chars:std': 1020.4785765769535, 'score:std': 0.48631931786709987, 'score': 0.6161616161616161} | |
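The "Decode batch" scheduler lines above record an instantaneous generation throughput figure as the batch drains. A minimal sketch of post-processing such a log to summarize throughput (the helper name and regex are illustrative, not part of sglang):

```python
import re

# Matches the throughput field in sglang "Decode batch" scheduler lines,
# e.g. "... cuda graph: True, gen throughput (token/s): 17223.27, #queue-req: 0,"
THROUGHPUT_RE = re.compile(r"gen throughput \(token/s\): ([0-9.]+)")

def mean_gen_throughput(log_lines):
    """Average the reported gen throughput over all 'Decode batch' lines."""
    values = [
        float(m.group(1))
        for line in log_lines
        if (m := THROUGHPUT_RE.search(line))
    ]
    return sum(values) / len(values) if values else 0.0
```

Non-matching lines (the `200 OK` request completions, warnings, progress bars) are simply skipped, so the whole log can be fed through unfiltered.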
| ================================================================================ | |
| Run 2: | |
| Auto-configed device: cuda | |
| WARNING:sglang.srt.server_args:Detected SM100 and MXFP4 quantization format for GPT-OSS model, enabling FlashInfer MXFP4 MOE kernel. | |
| WARNING:sglang.srt.server_args:TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64. | |
| [2025-09-06 08:29:13] server_args=ServerArgs(model_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8400, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.93, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=True, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=613488540, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/home/yiliu7/models/openai/gpt-oss-120b', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='flashinfer_mxfp4', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=200, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False) | |
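Both runs log the full `ServerArgs` repr at startup, and the only material difference between Run 1 and Run 2 is `random_seed`. A hedged sketch of flattening that repr into a dict so two runs can be diffed (`parse_server_args` is a hypothetical helper, and its simple regex assumes flat `key=value` pairs like the ones above, not nested structures):

```python
import ast
import re

# Captures key=value pairs where the value is either a quoted string or
# a comma-free token (numbers, None, True/False, [], etc.).
PAIR_RE = re.compile(r"(\w+)=('[^']*'|[^,()]+)")

def parse_server_args(repr_line):
    """Turn a logged "server_args=ServerArgs(...)" line into a dict."""
    inner = repr_line.split("ServerArgs(", 1)[1].rstrip(") ")
    out = {}
    for key, raw in PAIR_RE.findall(inner):
        try:
            out[key] = ast.literal_eval(raw.strip())  # int/float/str/None/bool
        except (ValueError, SyntaxError):
            out[key] = raw.strip()  # fall back to the raw token
    return out
```

With both runs parsed this way, `{k for k in a if a[k] != b[k]}` pinpoints the config drift between runs.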
| All deep_gemm operations loaded successfully! | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:29:13] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:29:13] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:29:13] Using default HuggingFace chat template with detected content format: string | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:29:20 TP2] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:29:20 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:29:20 TP0] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:29:20 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:29:20 TP1] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:29:20 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:29:20 TP3] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:29:20 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:29:20 TP2] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:29:20 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:29:20 TP0] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:29:20 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:29:20 TP0] Init torch distributed begin. | |
| [2025-09-06 08:29:21 TP1] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:29:21 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:29:21 TP3] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:29:21 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [2025-09-06 08:29:22 TP0] sglang is using nccl==2.27.3 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [2025-09-06 08:29:25 TP0] Init torch distributed ends. mem usage=1.46 GB | |
| [2025-09-06 08:29:25 TP0] Load weight begin. avail mem=176.28 GB | |
| All deep_gemm operations loaded successfully! | |
| Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 1585.51it/s] | |
| All deep_gemm operations loaded successfully! | |
| All deep_gemm operations loaded successfully! | |
| All deep_gemm operations loaded successfully! | |
| [2025-09-06 08:29:36 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.0.mlp.experts), it might take a while... | |
| [2025-09-06 08:29:39 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.1.mlp.experts), it might take a while... | |
| [2025-09-06 08:29:42 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.2.mlp.experts), it might take a while... | |
| [2025-09-06 08:29:45 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.3.mlp.experts), it might take a while... | |
| [2025-09-06 08:29:48 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.4.mlp.experts), it might take a while... | |
| [2025-09-06 08:29:51 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.5.mlp.experts), it might take a while... | |
| [2025-09-06 08:29:54 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.6.mlp.experts), it might take a while... | |
| [2025-09-06 08:29:58 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.7.mlp.experts), it might take a while... | |
| [2025-09-06 08:30:01 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.8.mlp.experts), it might take a while... | |
| [2025-09-06 08:30:04 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.9.mlp.experts), it might take a while... | |
| [2025-09-06 08:30:07 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.10.mlp.experts), it might take a while... | |
| [2025-09-06 08:30:10 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.11.mlp.experts), it might take a while... | |
| [2025-09-06 08:30:13 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.12.mlp.experts), it might take a while... | |
| [2025-09-06 08:30:16 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.13.mlp.experts), it might take a while... | |
| [2025-09-06 08:30:19 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.14.mlp.experts), it might take a while... | |
| [2025-09-06 08:30:22 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.15.mlp.experts), it might take a while... | |
| [2025-09-06 08:30:25 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.16.mlp.experts), it might take a while... | |
| [2025-09-06 08:30:28 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.17.mlp.experts), it might take a while... | |
| [2025-09-06 08:30:31 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.18.mlp.experts), it might take a while... | |
| [2025-09-06 08:30:34 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.19.mlp.experts), it might take a while... | |
| [2025-09-06 08:30:38 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.20.mlp.experts), it might take a while... | |
| [2025-09-06 08:30:41 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.21.mlp.experts), it might take a while... | |
| [2025-09-06 08:30:44 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.22.mlp.experts), it might take a while... | |
| [2025-09-06 08:30:47 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.23.mlp.experts), it might take a while... | |
| [2025-09-06 08:30:50 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.24.mlp.experts), it might take a while... | |
| [2025-09-06 08:30:53 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.25.mlp.experts), it might take a while... | |
| [2025-09-06 08:30:56 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.26.mlp.experts), it might take a while... | |
| [2025-09-06 08:30:59 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.27.mlp.experts), it might take a while... | |
| [2025-09-06 08:31:02 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.28.mlp.experts), it might take a while... | |
| [2025-09-06 08:31:05 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.29.mlp.experts), it might take a while... | |
| [2025-09-06 08:31:08 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.30.mlp.experts), it might take a while... | |
| [2025-09-06 08:31:11 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.31.mlp.experts), it might take a while... | |
| [2025-09-06 08:31:15 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.32.mlp.experts), it might take a while... | |
| [2025-09-06 08:31:18 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.33.mlp.experts), it might take a while... | |
| [2025-09-06 08:31:21 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.34.mlp.experts), it might take a while... | |
| [2025-09-06 08:31:24 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.35.mlp.experts), it might take a while... | |
| [2025-09-06 08:31:27 TP0] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=158.06 GB, mem usage=18.22 GB. | |
| [2025-09-06 08:31:28 TP0] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:31:28 TP3] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:31:28 TP0] Memory pool end. avail mem=10.23 GB | |
| [2025-09-06 08:31:28 TP2] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:31:28 TP1] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
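The logged K/V sizes can be roughly reconstructed from the token count; a sketch, assuming the gpt-oss-120b config (36 layers, 8 KV heads so 2 per GPU at tp=4, head_dim 64, bf16) — these config values are assumptions, not stated in the log:

```python
# Approximate the logged per-GPU K-cache size, assuming gpt-oss-120b:
# 36 layers, 8 KV heads (2 per GPU at tp=4), head_dim 64, bf16 (2 B/elem).
tokens = 8487040  # "#tokens" from the log
layers, kv_heads_per_gpu, head_dim, dtype_bytes = 36, 2, 64, 2
k_bytes = tokens * layers * kv_heads_per_gpu * head_dim * dtype_bytes
k_gib = k_bytes / 2**30  # ~72.85 GB, matching the log; V cache is the same size
```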
| [2025-09-06 08:31:28 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=9.54 GB | |
| [2025-09-06 08:31:28 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200] | |
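The 28-entry capture list above appears to follow a powers-of-two ramp up to 8, then steps of 8 up to `--cuda-graph-max-bs 200`; a sketch reproducing it (the generation rule is inferred from the logged list, not from SGLang source):

```python
# Reproduce the 28 batch sizes logged above: powers of two up to 8,
# then multiples of 8 up to --cuda-graph-max-bs 200.
max_bs = 200
capture_bs = [1, 2, 4, 8] + list(range(16, max_bs + 1, 8))
# → [1, 2, 4, 8, 16, 24, 32, ..., 192, 200], 28 entries
```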
| rank 1 allocated ipc_handles: [['0x77f2a8000000', '0x7818a4000000', '0x77f2a4000000', '0x77f2a0000000'], ['0x77f2a3000000', '0x77f2a2e00000', '0x77f2a3200000', '0x77f2a3400000'], ['0x77f28c000000', '0x77f296000000', '0x77f282000000', '0x77f278000000']] | |
| [2025-09-06 08:31:30.684] [info] lamportInitialize start: buffer: 0x77f296000000, size: 71303168 | |
| rank 3 allocated ipc_handles: [['0x7e64a4000000', '0x7e6448000000', '0x7e6444000000', '0x7e8a40000000'], ['0x7e6447000000', '0x7e6447200000', '0x7e6447400000', '0x7e6446e00000'], ['0x7e6430000000', '0x7e6426000000', '0x7e641c000000', '0x7e643a000000']] | |
| [2025-09-06 08:31:30.734] [info] lamportInitialize start: buffer: 0x7e643a000000, size: 71303168 | |
| rank 0 allocated ipc_handles: [['0x77a1e2000000', '0x777bf6000000', '0x777bf2000000', '0x777bee000000'], ['0x777bf0e00000', '0x777bf1000000', '0x777bf1200000', '0x777bf1400000'], ['0x777be4000000', '0x777bda000000', '0x777bd0000000', '0x777bc6000000']] | |
| [2025-09-06 08:31:30.783] [info] lamportInitialize start: buffer: 0x777be4000000, size: 71303168 | |
| rank 2 allocated ipc_handles: [['0x719f74000000', '0x719f10000000', '0x71c50a000000', '0x719f0c000000'], ['0x719f0f000000', '0x719f0f200000', '0x719f0ee00000', '0x719f0f400000'], ['0x719ef8000000', '0x719eee000000', '0x719f02000000', '0x719ee4000000']] | |
| [2025-09-06 08:31:30.833] [info] lamportInitialize start: buffer: 0x719f02000000, size: 71303168 | |
| [2025-09-06 08:31:30 TP0] FlashInfer workspace initialized for rank 0, world_size 4 | |
| [2025-09-06 08:31:30 TP1] FlashInfer workspace initialized for rank 1, world_size 4 | |
| [2025-09-06 08:31:30 TP3] FlashInfer workspace initialized for rank 3, world_size 4 | |
| [2025-09-06 08:31:30 TP2] FlashInfer workspace initialized for rank 2, world_size 4 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 0 workspace[0] 0x77a1e2000000 | |
| Rank 0 workspace[1] 0x777bf6000000 | |
| Rank 0 workspace[2] 0x777bf2000000 | |
| Rank 0 workspace[3] 0x777bee000000 | |
| Rank 0 workspace[4] 0x777bf0e00000 | |
| Rank 0 workspace[5] 0x777bf1000000 | |
| Rank 0 workspace[6] 0x777bf1200000 | |
| Rank 0 workspace[7] 0x777bf1400000 | |
| Rank 0 workspace[8] 0x777be4000000 | |
| Rank 0 workspace[9] 0x777bda000000 | |
| Rank 0 workspace[10] 0x777bd0000000 | |
| Rank 0 workspace[11] 0x777bc6000000 | |
| Rank 0 workspace[12] 0x77a7db264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 3 workspace[0] 0x7e64a4000000 | |
| Rank 3 workspace[1] 0x7e6448000000 | |
| Rank 3 workspace[2] 0x7e6444000000 | |
| Rank 3 workspace[3] 0x7e8a40000000 | |
| Rank 3 workspace[4] 0x7e6447000000 | |
| Rank 3 workspace[5] 0x7e6447200000 | |
| Rank 3 workspace[6] 0x7e6447400000 | |
| Rank 3 workspace[7] 0x7e6446e00000 | |
| Rank 3 workspace[8] 0x7e6430000000 | |
| Rank 3 workspace[9] 0x7e6426000000 | |
| Rank 3 workspace[10] 0x7e641c000000 | |
| Rank 3 workspace[11] 0x7e643a000000 | |
| Rank 3 workspace[12] 0x7e903b264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 2 workspace[0] 0x719f74000000 | |
| Rank 2 workspace[1] 0x719f10000000 | |
| Rank 2 workspace[2] 0x71c50a000000 | |
| Rank 2 workspace[3] 0x719f0c000000 | |
| Rank 2 workspace[4] 0x719f0f000000 | |
| Rank 2 workspace[5] 0x719f0f200000 | |
| Rank 2 workspace[6] 0x719f0ee00000 | |
| Rank 2 workspace[7] 0x719f0f400000 | |
| Rank 2 workspace[8] 0x719ef8000000 | |
| Rank 2 workspace[9] 0x719eee000000 | |
| Rank 2 workspace[10] 0x719f02000000 | |
| Rank 2 workspace[11] 0x719ee4000000 | |
| Rank 2 workspace[12] 0x71cb13264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 1 workspace[0] 0x77f2a8000000 | |
| Rank 1 workspace[1] 0x7818a4000000 | |
| Rank 1 workspace[2] 0x77f2a4000000 | |
| Rank 1 workspace[3] 0x77f2a0000000 | |
| Rank 1 workspace[4] 0x77f2a3000000 | |
| Rank 1 workspace[5] 0x77f2a2e00000 | |
| Rank 1 workspace[6] 0x77f2a3200000 | |
| Rank 1 workspace[7] 0x77f2a3400000 | |
| Rank 1 workspace[8] 0x77f28c000000 | |
| Rank 1 workspace[9] 0x77f296000000 | |
| Rank 1 workspace[10] 0x77f282000000 | |
| Rank 1 workspace[11] 0x77f278000000 | |
| Rank 1 workspace[12] 0x781ead264400 | |
| Capturing batches (bs=1 avail_mem=7.82 GB): 100%|██████████| 28/28 [00:04<00:00, 6.31it/s] | |
| [2025-09-06 08:31:33 TP0] Registering 56 cuda graph addresses | |
| [2025-09-06 08:31:33 TP3] Registering 56 cuda graph addresses | |
| [2025-09-06 08:31:33 TP2] Registering 56 cuda graph addresses | |
| [2025-09-06 08:31:33 TP1] Registering 56 cuda graph addresses | |
| [2025-09-06 08:31:33 TP0] Capture cuda graph end. Time elapsed: 4.95 s. mem usage=1.73 GB. avail mem=7.81 GB. | |
| [2025-09-06 08:31:34 TP0] max_total_num_tokens=8487040, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=131072, available_gpu_mem=7.81 GB | |
| [2025-09-06 08:31:34] INFO: Started server process [36995] | |
| [2025-09-06 08:31:34] INFO: Waiting for application startup. | |
| [2025-09-06 08:31:35] INFO: Application startup complete. | |
| [2025-09-06 08:31:35] INFO: Uvicorn running on http://127.0.0.1:8400 (Press CTRL+C to quit) | |
| [2025-09-06 08:31:36] INFO: 127.0.0.1:51504 - "GET /get_model_info HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:36 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:31:37] INFO: 127.0.0.1:51520 - "POST /generate HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:37] The server is fired up and ready to roll! | |
| [2025-09-06 08:31:37 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:31:38] INFO: 127.0.0.1:51528 - "GET /health_generate HTTP/1.1" 200 OK | |
| command=python3 -m sglang.launch_server --model-path /home/yiliu7/models/openai/gpt-oss-120b --tp 4 --cuda-graph-max-bs 200 --mem-fraction-static 0.93 --device cuda --host 127.0.0.1 --port 8400 | |
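The server launched above answers OpenAI-compatible `/v1/chat/completions` requests (visible in the access log); a minimal sketch of the request body the evaluation client plausibly sends, using the sampler settings logged for this run (temperature=0.1, max_tokens=4096, reasoning_effort='low') — the message content is illustrative, and the exact body shape is an assumption:

```python
import json

# Body for POST http://127.0.0.1:8400/v1/chat/completions, mirroring the
# ChatCompletionSampler settings from the evaluation log (assumed shape).
payload = {
    "model": "/home/yiliu7/models/openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "temperature": 0.1,
    "max_tokens": 4096,
    "reasoning_effort": "low",
}
body = json.dumps(payload)
```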
| Evaluation start: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 | |
| ChatCompletionSampler initialized with self.system_message=None self.temperature=0.1 self.max_tokens=4096 self.reasoning_effort='low' | |
| 0%| | 0/198 [00:00<?, ?it/s][2025-09-06 08:31:39 TP0] Prefill batch. #new-seq: 1, #new-token: 512, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:31:39 TP0] Prefill batch. #new-seq: 3, #new-token: 896, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0, | |
| [2025-09-06 08:31:39 TP0] Prefill batch. #new-seq: 14, #new-token: 4160, #cached-token: 896, token usage: 0.00, #running-req: 4, #queue-req: 0, | |
| [2025-09-06 08:31:39 TP0] Prefill batch. #new-seq: 17, #new-token: 3840, #cached-token: 1088, token usage: 0.00, #running-req: 18, #queue-req: 0, | |
| [2025-09-06 08:31:39 TP0] Prefill batch. #new-seq: 55, #new-token: 16128, #cached-token: 3520, token usage: 0.00, #running-req: 35, #queue-req: 23, | |
| [2025-09-06 08:31:39 TP0] Prefill batch. #new-seq: 46, #new-token: 14208, #cached-token: 2944, token usage: 0.00, #running-req: 90, #queue-req: 0, | |
| [2025-09-06 08:31:40 TP0] Prefill batch. #new-seq: 57, #new-token: 16000, #cached-token: 3712, token usage: 0.00, #running-req: 136, #queue-req: 0, | |
| [2025-09-06 08:31:40 TP0] Prefill batch. #new-seq: 5, #new-token: 1152, #cached-token: 320, token usage: 0.01, #running-req: 193, #queue-req: 0, | |
| [2025-09-06 08:31:40 TP0] Decode batch. #running-req: 198, #token: 62976, token usage: 0.01, cuda graph: True, gen throughput (token/s): 922.40, #queue-req: 0, | |
| [2025-09-06 08:31:40] INFO: 127.0.0.1:41312 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:40] INFO: 127.0.0.1:40926 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:41 TP0] Decode batch. #running-req: 196, #token: 70272, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17188.20, #queue-req: 0, | |
| [2025-09-06 08:31:41] INFO: 127.0.0.1:40114 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:41] INFO: 127.0.0.1:40706 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:41] INFO: 127.0.0.1:40156 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:41] INFO: 127.0.0.1:41216 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:41] INFO: 127.0.0.1:41038 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:41] INFO: 127.0.0.1:40550 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:41] INFO: 127.0.0.1:39952 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:41] INFO: 127.0.0.1:40468 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:41 TP0] Decode batch. #running-req: 188, #token: 72704, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17082.06, #queue-req: 0, | |
| [2025-09-06 08:31:41] INFO: 127.0.0.1:41490 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:41] INFO: 127.0.0.1:40572 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:41] INFO: 127.0.0.1:41026 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:41] INFO: 127.0.0.1:40216 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:41] INFO: 127.0.0.1:41222 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:41] INFO: 127.0.0.1:40902 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:41] INFO: 127.0.0.1:41350 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:41] INFO: 127.0.0.1:40590 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:41] INFO: 127.0.0.1:41608 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:41] INFO: 127.0.0.1:40338 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:41] INFO: 127.0.0.1:40722 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:42] INFO: 127.0.0.1:40620 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:42] INFO: 127.0.0.1:41118 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:42 TP0] Decode batch. #running-req: 175, #token: 73280, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16450.85, #queue-req: 0, | |
| [2025-09-06 08:31:42] INFO: 127.0.0.1:40004 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:42] INFO: 127.0.0.1:40324 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:42] INFO: 127.0.0.1:41550 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:42] INFO: 127.0.0.1:40594 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:42] INFO: 127.0.0.1:41424 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:42] INFO: 127.0.0.1:41528 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:42 TP0] Decode batch. #running-req: 169, #token: 76992, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15874.65, #queue-req: 0, | |
| [2025-09-06 08:31:42] INFO: 127.0.0.1:40272 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:42] INFO: 127.0.0.1:40146 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:42] INFO: 127.0.0.1:40950 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:42] INFO: 127.0.0.1:40396 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:42] INFO: 127.0.0.1:40014 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:42] INFO: 127.0.0.1:39990 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43 TP0] Decode batch. #running-req: 163, #token: 81216, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14920.13, #queue-req: 0, | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:40582 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:40868 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:40492 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:40742 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:39878 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:40910 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:40870 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:41634 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:40340 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:41368 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:40652 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:41138 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:40406 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43 TP0] Decode batch. #running-req: 151, #token: 80704, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13889.09, #queue-req: 0, | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:40782 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:41124 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:40698 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:41040 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:41560 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:41012 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:39824 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:40062 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:41210 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:43 TP0] Decode batch. #running-req: 141, #token: 81792, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13339.47, #queue-req: 0, | |
| [2025-09-06 08:31:43] INFO: 127.0.0.1:41456 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:44] INFO: 127.0.0.1:41196 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:44] INFO: 127.0.0.1:40154 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:44] INFO: 127.0.0.1:40350 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:44] INFO: 127.0.0.1:40064 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:44] INFO: 127.0.0.1:40768 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:44] INFO: 127.0.0.1:39862 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:44] INFO: 127.0.0.1:41380 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:44] INFO: 127.0.0.1:40798 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:44] INFO: 127.0.0.1:40130 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:44] INFO: 127.0.0.1:41252 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:44] INFO: 127.0.0.1:40838 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:44] INFO: 127.0.0.1:41476 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:44 TP0] Decode batch. #running-req: 128, #token: 79296, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12714.51, #queue-req: 0, | |
| [2025-09-06 08:31:44] INFO: 127.0.0.1:40200 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:44] INFO: 127.0.0.1:40306 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:44] INFO: 127.0.0.1:41624 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:44] INFO: 127.0.0.1:40882 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:44] INFO: 127.0.0.1:40102 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:44] INFO: 127.0.0.1:40432 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:44] INFO: 127.0.0.1:40762 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:44] INFO: 127.0.0.1:40460 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:31:44] INFO: 127.0.0.1:40420 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
[2025-09-06 08:31:44 TP0] Decode batch. #running-req: 119, #token: 78528, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16715.06, #queue-req: 0,
[2025-09-06 08:31:44] INFO: 127.0.0.1:40112 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40848 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40610 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40226 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:41536 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40250 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:40098 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44] INFO: 127.0.0.1:41596 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:44 TP0] Decode batch. #running-req: 112, #token: 77760, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15766.71, #queue-req: 0,
[2025-09-06 08:31:44] INFO: 127.0.0.1:40384 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40968 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40448 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45 TP0] Decode batch. #running-req: 108, #token: 79872, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15274.40, #queue-req: 0,
[2025-09-06 08:31:45] INFO: 127.0.0.1:40302 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40700 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:41298 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40184 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:41278 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:41304 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:41352 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40532 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:39892 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40618 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40670 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40812 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45 TP0] Decode batch. #running-req: 96, #token: 73856, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14447.63, #queue-req: 0,
[2025-09-06 08:31:45] INFO: 127.0.0.1:41588 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:41268 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:39842 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40650 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40408 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40888 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40630 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:41066 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40296 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:41180 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45 TP0] Decode batch. #running-req: 86, #token: 69632, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13476.54, #queue-req: 0,
[2025-09-06 08:31:45] INFO: 127.0.0.1:39908 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40560 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40040 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40732 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:39950 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:39976 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:41294 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:45] INFO: 127.0.0.1:40278 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:41378 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46 TP0] Decode batch. #running-req: 78, #token: 66048, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12489.34, #queue-req: 0,
[2025-09-06 08:31:46] INFO: 127.0.0.1:41094 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:40258 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:40520 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:40828 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:41406 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:40484 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:41166 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46 TP0] Decode batch. #running-req: 70, #token: 63360, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11531.30, #queue-req: 0,
[2025-09-06 08:31:46] INFO: 127.0.0.1:41632 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:40936 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:40714 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:40294 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:39814 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46 TP0] Decode batch. #running-req: 65, #token: 61248, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10813.85, #queue-req: 0,
[2025-09-06 08:31:46] INFO: 127.0.0.1:41098 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:41464 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46] INFO: 127.0.0.1:41232 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:46 TP0] Decode batch. #running-req: 63, #token: 60800, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10310.24, #queue-req: 0,
[2025-09-06 08:31:46] INFO: 127.0.0.1:39940 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47 TP0] Decode batch. #running-req: 61, #token: 62464, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10214.19, #queue-req: 0,
[2025-09-06 08:31:47] INFO: 127.0.0.1:40370 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40466 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40658 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40426 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40846 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47 TP0] Decode batch. #running-req: 56, #token: 57984, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9650.02, #queue-req: 0,
[2025-09-06 08:31:47] INFO: 127.0.0.1:41440 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:39826 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40028 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40608 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40676 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:39958 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:41290 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40178 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:41602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:41078 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:41338 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47 TP0] Decode batch. #running-req: 46, #token: 49408, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8536.76, #queue-req: 0,
[2025-09-06 08:31:47] INFO: 127.0.0.1:41412 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40414 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:41392 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40736 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:41504 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40240 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:41056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40958 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40542 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:40086 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47 TP0] Decode batch. #running-req: 35, #token: 39616, token usage: 0.00, cuda graph: True, gen throughput (token/s): 7112.32, #queue-req: 0,
[2025-09-06 08:31:47] INFO: 127.0.0.1:41154 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47] INFO: 127.0.0.1:41122 - "POST /v1/chat/completions HTTP/1.1" 200 OK
1%| | 1/198 [00:08<27:51, 8.49s/it][2025-09-06 08:31:47] INFO: 127.0.0.1:39856 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:47 TP0] Decode batch. #running-req: 32, #token: 37376, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6197.00, #queue-req: 0,
[2025-09-06 08:31:48] INFO: 127.0.0.1:41014 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:41086 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48 TP0] Decode batch. #running-req: 30, #token: 36480, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5783.69, #queue-req: 0,
[2025-09-06 08:31:48] INFO: 127.0.0.1:41114 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:39964 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:40880 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:40312 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48 TP0] Decode batch. #running-req: 26, #token: 32512, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5250.98, #queue-req: 0,
[2025-09-06 08:31:48] INFO: 127.0.0.1:40800 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:40140 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:40980 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:41330 - "POST /v1/chat/completions HTTP/1.1" 200 OK
6%|▌ | 12/198 [00:09<01:44, 1.78it/s][2025-09-06 08:31:48 TP0] Decode batch. #running-req: 22, #token: 28672, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4662.53, #queue-req: 0,
[2025-09-06 08:31:48] INFO: 127.0.0.1:39946 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:39924 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:41244 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48 TP0] Decode batch. #running-req: 19, #token: 25472, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4022.45, #queue-req: 0,
[2025-09-06 08:31:48] INFO: 127.0.0.1:40996 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:40360 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:40906 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:39900 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48] INFO: 127.0.0.1:40458 - "POST /v1/chat/completions HTTP/1.1" 200 OK
14%|█▎ | 27/198 [00:09<00:36, 4.63it/s][2025-09-06 08:31:48] INFO: 127.0.0.1:41514 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:48 TP0] Decode batch. #running-req: 13, #token: 17856, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3308.91, #queue-req: 0,
[2025-09-06 08:31:48] INFO: 127.0.0.1:40864 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:49] INFO: 127.0.0.1:40332 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:49 TP0] Decode batch. #running-req: 11, #token: 15360, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2396.18, #queue-req: 0,
[2025-09-06 08:31:49] INFO: 127.0.0.1:40172 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:49 TP0] Decode batch. #running-req: 10, #token: 14464, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2326.20, #queue-req: 0,
[2025-09-06 08:31:49] INFO: 127.0.0.1:40054 - "POST /v1/chat/completions HTTP/1.1" 200 OK
21%|██ | 42/198 [00:09<00:19, 7.89it/s][2025-09-06 08:31:49 TP0] Decode batch. #running-req: 9, #token: 13248, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1980.44, #queue-req: 0,
[2025-09-06 08:31:49] INFO: 127.0.0.1:41576 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:49] INFO: 127.0.0.1:40750 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:49] INFO: 127.0.0.1:41360 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:49 TP0] Decode batch. #running-req: 6, #token: 9088, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1691.00, #queue-req: 0,
[2025-09-06 08:31:49 TP0] Decode batch. #running-req: 6, #token: 9280, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1448.62, #queue-req: 0,
[2025-09-06 08:31:49] INFO: 127.0.0.1:41322 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:49] INFO: 127.0.0.1:40076 - "POST /v1/chat/completions HTTP/1.1" 200 OK
23%|██▎ | 45/198 [00:10<00:20, 7.40it/s][2025-09-06 08:31:50 TP0] Decode batch. #running-req: 4, #token: 6336, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1187.08, #queue-req: 0,
[2025-09-06 08:31:50] INFO: 127.0.0.1:40504 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:50 TP0] Decode batch. #running-req: 3, #token: 4864, token usage: 0.00, cuda graph: True, gen throughput (token/s): 873.48, #queue-req: 0,
[2025-09-06 08:31:50 TP0] Decode batch. #running-req: 3, #token: 4928, token usage: 0.00, cuda graph: True, gen throughput (token/s): 794.66, #queue-req: 0,
[2025-09-06 08:31:50 TP0] Decode batch. #running-req: 3, #token: 5056, token usage: 0.00, cuda graph: True, gen throughput (token/s): 793.97, #queue-req: 0,
[2025-09-06 08:31:50] INFO: 127.0.0.1:40686 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-09-06 08:31:50 TP0] Decode batch. #running-req: 2, #token: 3584, token usage: 0.00, cuda graph: True, gen throughput (token/s): 654.65, #queue-req: 0,
[2025-09-06 08:31:50 TP0] Decode batch. #running-req: 2, #token: 3584, token usage: 0.00, cuda graph: True, gen throughput (token/s): 579.22, #queue-req: 0,
[2025-09-06 08:31:50] INFO: 127.0.0.1:40152 - "POST /v1/chat/completions HTTP/1.1" 200 OK
27%|██▋ | 54/198 [00:11<00:17, 8.07it/s][2025-09-06 08:31:50 TP0] Decode batch. #running-req: 1, #token: 1920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 553.68, #queue-req: 0,
[2025-09-06 08:31:51 TP0] Decode batch. #running-req: 1, #token: 1920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.43, #queue-req: 0,
[2025-09-06 08:31:51 TP0] Decode batch. #running-req: 1, #token: 1984, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.17, #queue-req: 0,
[2025-09-06 08:31:51 TP0] Decode batch. #running-req: 1, #token: 2048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.30, #queue-req: 0,
[2025-09-06 08:31:51 TP0] Decode batch. #running-req: 1, #token: 2048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.20, #queue-req: 0,
[2025-09-06 08:31:51 TP0] Decode batch. #running-req: 1, #token: 2112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.45, #queue-req: 0,
[2025-09-06 08:31:51 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.09, #queue-req: 0,
[2025-09-06 08:31:51 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.97, #queue-req: 0,
[2025-09-06 08:31:51 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.19, #queue-req: 0,
[2025-09-06 08:31:52 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.97, #queue-req: 0,
[2025-09-06 08:31:52] INFO: 127.0.0.1:40646 - "POST /v1/chat/completions HTTP/1.1" 200 OK
56%|█████▌ | 110/198 [00:12<00:04, 21.58it/s] 100%|██████████| 198/198 [00:12<00:00, 15.65it/s]
/usr/lib/python3.12/subprocess.py:1127: ResourceWarning: subprocess 36995 is still running
_warn("subprocess %s is still running" % self.pid,
ResourceWarning: Enable tracemalloc to get the object allocation traceback
.
----------------------------------------------------------------------
Ran 1 test in 166.818s
OK
Writing report to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.html
{'chars': 1722.030303030303, 'chars:std': 987.6499088827325, 'score:std': 0.4824488175389596, 'score': 0.6313131313131313}
Writing results to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.json
Total latency: 12.707 s
Score: 0.631
Evaluation end: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 metrics={'chars': 1722.030303030303, 'chars:std': 987.6499088827325, 'score:std': 0.4824488175389596, 'score': 0.6313131313131313}
================================================================================
Run 3:
Auto-configed device: cuda
WARNING:sglang.srt.server_args:Detected SM100 and MXFP4 quantization format for GPT-OSS model, enabling FlashInfer MXFP4 MOE kernel.
WARNING:sglang.srt.server_args:TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64.
[2025-09-06 08:32:06] server_args=ServerArgs(model_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8400, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.93, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=True, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=713630635, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/home/yiliu7/models/openai/gpt-oss-120b', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, 
load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='flashinfer_mxfp4', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=200, 
cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False)
All deep_gemm operations loaded successfully!
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:32:06] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:32:06] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:32:06] Using default HuggingFace chat template with detected content format: string
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:32:13 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:32:13 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:32:13 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:32:13 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:32:13 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:32:13 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
`torch_dtype` is deprecated! Use `dtype` instead!
[2025-09-06 08:32:13 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:32:13 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:32:13 TP0] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:32:13 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:32:13 TP0] Init torch distributed begin.
[2025-09-06 08:32:14 TP3] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:32:14 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:32:14 TP1] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:32:14 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[2025-09-06 08:32:14 TP2] Downcasting torch.float32 to torch.bfloat16.
[2025-09-06 08:32:14 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:32:15 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2025-09-06 08:32:18 TP0] Init torch distributed ends. mem usage=1.46 GB
[2025-09-06 08:32:18 TP0] Load weight begin. avail mem=176.28 GB
All deep_gemm operations loaded successfully!
Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 1662.52it/s]
[2025-09-06 08:32:29 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.0.mlp.experts), it might take a while...
All deep_gemm operations loaded successfully!
All deep_gemm operations loaded successfully!
All deep_gemm operations loaded successfully!
[2025-09-06 08:32:33 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.1.mlp.experts), it might take a while...
[2025-09-06 08:32:36 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.2.mlp.experts), it might take a while...
[2025-09-06 08:32:39 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.3.mlp.experts), it might take a while...
[2025-09-06 08:32:42 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.4.mlp.experts), it might take a while...
[2025-09-06 08:32:45 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.5.mlp.experts), it might take a while...
[2025-09-06 08:32:48 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.6.mlp.experts), it might take a while...
[2025-09-06 08:32:51 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.7.mlp.experts), it might take a while...
[2025-09-06 08:32:54 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.8.mlp.experts), it might take a while...
[2025-09-06 08:32:57 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.9.mlp.experts), it might take a while...
[2025-09-06 08:33:00 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.10.mlp.experts), it might take a while...
[2025-09-06 08:33:03 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.11.mlp.experts), it might take a while...
[2025-09-06 08:33:06 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.12.mlp.experts), it might take a while...
[2025-09-06 08:33:09 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.13.mlp.experts), it might take a while...
[2025-09-06 08:33:12 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.14.mlp.experts), it might take a while...
[2025-09-06 08:33:15 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.15.mlp.experts), it might take a while...
[2025-09-06 08:33:18 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.16.mlp.experts), it might take a while...
[2025-09-06 08:33:21 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.17.mlp.experts), it might take a while...
[2025-09-06 08:33:24 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.18.mlp.experts), it might take a while...
[2025-09-06 08:33:27 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.19.mlp.experts), it might take a while...
[2025-09-06 08:33:30 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.20.mlp.experts), it might take a while...
[2025-09-06 08:33:33 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.21.mlp.experts), it might take a while...
[2025-09-06 08:33:36 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.22.mlp.experts), it might take a while...
[2025-09-06 08:33:39 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.23.mlp.experts), it might take a while...
[2025-09-06 08:33:42 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.24.mlp.experts), it might take a while...
[2025-09-06 08:33:45 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.25.mlp.experts), it might take a while...
[2025-09-06 08:33:49 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.26.mlp.experts), it might take a while...
[2025-09-06 08:33:52 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.27.mlp.experts), it might take a while...
[2025-09-06 08:33:55 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.28.mlp.experts), it might take a while...
[2025-09-06 08:33:58 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.29.mlp.experts), it might take a while...
[2025-09-06 08:34:01 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.30.mlp.experts), it might take a while...
[2025-09-06 08:34:04 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.31.mlp.experts), it might take a while...
[2025-09-06 08:34:07 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.32.mlp.experts), it might take a while...
[2025-09-06 08:34:10 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.33.mlp.experts), it might take a while...
[2025-09-06 08:34:13 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.34.mlp.experts), it might take a while...
[2025-09-06 08:34:16 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.35.mlp.experts), it might take a while...
[2025-09-06 08:34:19 TP0] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=158.06 GB, mem usage=18.22 GB.
[2025-09-06 08:34:23 TP2] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:34:23 TP3] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
[2025-09-06 08:34:23 TP0] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB
| [2025-09-06 08:34:23 TP0] Memory pool end. avail mem=10.23 GB | |
| [2025-09-06 08:34:23 TP1] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
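The per-rank KV-cache size logged above can be sanity-checked with a little arithmetic. The model dimensions below (36 layers, 8 KV heads, head_dim 64 for gpt-oss-120b) are assumptions, not stated in this log; tp=4 and the bfloat16 KV dtype come from the launch command and `server_args`:

```python
# Check the logged allocation: #tokens: 8487040, K size: 72.85 GB per rank.
NUM_TOKENS = 8_487_040                   # "#tokens" from the log
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 64   # assumed gpt-oss-120b config
TP_SIZE, BYTES_PER_ELEM = 4, 2           # --tp 4, bfloat16

# Per-token K bytes on one rank: layers * kv_heads * head_dim * 2 B / tp
k_bytes_per_token = LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM // TP_SIZE
k_cache_gib = NUM_TOKENS * k_bytes_per_token / 1024**3
print(f"{k_bytes_per_token} B/token -> K cache {k_cache_gib:.2f} GB")
```

Under these assumptions each rank stores 9216 bytes of K per token, giving ≈72.85 GB for 8,487,040 tokens (and the same again for V), which agrees with the log to within rounding.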
| [2025-09-06 08:34:23 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=9.54 GB | |
| [2025-09-06 08:34:24 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200] | |
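The 28-entry batch-size list above (1, 2, 4, 8, 16, then steps of 8 up to 200) looks like the default capture schedule under `--cuda-graph-max-bs 200`. A sketch of that apparent rule, inferred from this log rather than taken from the SGLang source:

```python
# Reproduce the captured batch-size schedule seen in the log:
# powers of two up to 16, then multiples of 8 up to --cuda-graph-max-bs.
def capture_batch_sizes(max_bs: int) -> list[int]:
    sizes = [1, 2, 4, 8, 16] + list(range(24, max_bs + 1, 8))
    return [bs for bs in sizes if bs <= max_bs]

print(capture_batch_sizes(200))  # 28 entries, 1 through 200
```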
| rank 1 allocated ipc_handles: [['0x7eb6fc000000', '0x7edcbc000000', '0x7eb6bc000000', '0x7eb6b8000000'], ['0x7eb6bb000000', '0x7eb6bae00000', '0x7eb6bb200000', '0x7eb6bb400000'], ['0x7eb6a4000000', '0x7eb6ae000000', '0x7eb69a000000', '0x7eb690000000']] | |
| [2025-09-06 08:34:26.044] [info] lamportInitialize start: buffer: 0x7eb6ae000000, size: 71303168 | |
| rank 0 allocated ipc_handles: [['0x774d82000000', '0x772796000000', '0x772792000000', '0x77278e000000'], ['0x772790e00000', '0x772791000000', '0x772791200000', '0x772791400000'], ['0x772784000000', '0x77277a000000', '0x772770000000', '0x772766000000']] | |
| [2025-09-06 08:34:26.092] [info] lamportInitialize start: buffer: 0x772784000000, size: 71303168 | |
| rank 3 allocated ipc_handles: [['0x76aaf0000000', '0x76aa8c000000', '0x76aa88000000', '0x76d08c000000'], ['0x76aa8b000000', '0x76aa8b200000', '0x76aa8b400000', '0x76aa8ae00000'], ['0x76aa74000000', '0x76aa6a000000', '0x76aa60000000', '0x76aa7e000000']] | |
| [2025-09-06 08:34:26.142] [info] lamportInitialize start: buffer: 0x76aa7e000000, size: 71303168 | |
| rank 2 allocated ipc_handles: [['0x7db43c000000', '0x7db3f4000000', '0x7dd9f2000000', '0x7db3f0000000'], ['0x7db3f3000000', '0x7db3f3200000', '0x7db3f2e00000', '0x7db3f3400000'], ['0x7db3dc000000', '0x7db3d2000000', '0x7db3e6000000', '0x7db3c8000000']] | |
| [2025-09-06 08:34:26.192] [info] lamportInitialize start: buffer: 0x7db3e6000000, size: 71303168 | |
| [2025-09-06 08:34:26 TP0] FlashInfer workspace initialized for rank 0, world_size 4 | |
| [2025-09-06 08:34:26 TP1] FlashInfer workspace initialized for rank 1, world_size 4 | |
| [2025-09-06 08:34:26 TP2] FlashInfer workspace initialized for rank 2, world_size 4 | |
| [2025-09-06 08:34:26 TP3] FlashInfer workspace initialized for rank 3, world_size 4 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 0 workspace[0] 0x774d82000000 | |
| Rank 0 workspace[1] 0x772796000000 | |
| Rank 0 workspace[2] 0x772792000000 | |
| Rank 0 workspace[3] 0x77278e000000 | |
| Rank 0 workspace[4] 0x772790e00000 | |
| Rank 0 workspace[5] 0x772791000000 | |
| Rank 0 workspace[6] 0x772791200000 | |
| Rank 0 workspace[7] 0x772791400000 | |
| Rank 0 workspace[8] 0x772784000000 | |
| Rank 0 workspace[9] 0x77277a000000 | |
| Rank 0 workspace[10] 0x772770000000 | |
| Rank 0 workspace[11] 0x772766000000 | |
| Rank 0 workspace[12] 0x77537b264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 3 workspace[0] 0x76aaf0000000 | |
| Rank 3 workspace[1] 0x76aa8c000000 | |
| Rank 3 workspace[2] 0x76aa88000000 | |
| Rank 3 workspace[3] 0x76d08c000000 | |
| Rank 3 workspace[4] 0x76aa8b000000 | |
| Rank 3 workspace[5] 0x76aa8b200000 | |
| Rank 3 workspace[6] 0x76aa8b400000 | |
| Rank 3 workspace[7] 0x76aa8ae00000 | |
| Rank 3 workspace[8] 0x76aa74000000 | |
| Rank 3 workspace[9] 0x76aa6a000000 | |
| Rank 3 workspace[10] 0x76aa60000000 | |
| Rank 3 workspace[11] 0x76aa7e000000 | |
| Rank 3 workspace[12] 0x76d685264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 2 workspace[0] 0x7db43c000000 | |
| Rank 2 workspace[1] 0x7db3f4000000 | |
| Rank 2 workspace[2] 0x7dd9f2000000 | |
| Rank 2 workspace[3] 0x7db3f0000000 | |
| Rank 2 workspace[4] 0x7db3f3000000 | |
| Rank 2 workspace[5] 0x7db3f3200000 | |
| Rank 2 workspace[6] 0x7db3f2e00000 | |
| Rank 2 workspace[7] 0x7db3f3400000 | |
| Rank 2 workspace[8] 0x7db3dc000000 | |
| Rank 2 workspace[9] 0x7db3d2000000 | |
| Rank 2 workspace[10] 0x7db3e6000000 | |
| Rank 2 workspace[11] 0x7db3c8000000 | |
| Rank 2 workspace[12] 0x7ddfff264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 1 workspace[0] 0x7eb6fc000000 | |
| Rank 1 workspace[1] 0x7edcbc000000 | |
| Rank 1 workspace[2] 0x7eb6bc000000 | |
| Rank 1 workspace[3] 0x7eb6b8000000 | |
| Rank 1 workspace[4] 0x7eb6bb000000 | |
| Rank 1 workspace[5] 0x7eb6bae00000 | |
| Rank 1 workspace[6] 0x7eb6bb200000 | |
| Rank 1 workspace[7] 0x7eb6bb400000 | |
| Rank 1 workspace[8] 0x7eb6a4000000 | |
| Rank 1 workspace[9] 0x7eb6ae000000 | |
| Rank 1 workspace[10] 0x7eb69a000000 | |
| Rank 1 workspace[11] 0x7eb690000000 | |
| Rank 1 workspace[12] 0x7ee2c9264400 | |
| Capturing batches (bs=200 avail_mem=9.39 GB): 0%| | 0/28 [00:00<?, ?it/s] | |
| Capturing batches (bs=1 avail_mem=7.82 GB): 100%|██████████| 28/28 [00:04<00:00, 6.29it/s] | |
| [2025-09-06 08:34:28 TP0] Registering 56 cuda graph addresses | |
| [2025-09-06 08:34:28 TP2] Registering 56 cuda graph addresses | |
| [2025-09-06 08:34:28 TP3] Registering 56 cuda graph addresses | |
| [2025-09-06 08:34:28 TP1] Registering 56 cuda graph addresses | |
| [2025-09-06 08:34:28 TP0] Capture cuda graph end. Time elapsed: 4.95 s. mem usage=1.73 GB. avail mem=7.81 GB. | |
| [2025-09-06 08:34:29 TP0] max_total_num_tokens=8487040, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=131072, available_gpu_mem=7.81 GB | |
| [2025-09-06 08:34:30] INFO: Started server process [39541] | |
| [2025-09-06 08:34:30] INFO: Waiting for application startup. | |
| [2025-09-06 08:34:30] INFO: Application startup complete. | |
| [2025-09-06 08:34:30] INFO: Uvicorn running on http://127.0.0.1:8400 (Press CTRL+C to quit) | |
| [2025-09-06 08:34:30] INFO: 127.0.0.1:51204 - "GET /health_generate HTTP/1.1" 503 Service Unavailable | |
| [2025-09-06 08:34:31] INFO: 127.0.0.1:51212 - "GET /get_model_info HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:31 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:34:32] INFO: 127.0.0.1:51228 - "POST /generate HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:32] The server is fired up and ready to roll! | |
| [2025-09-06 08:34:40 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:34:41] INFO: 127.0.0.1:42932 - "GET /health_generate HTTP/1.1" 200 OK | |
| command=python3 -m sglang.launch_server --model-path /home/yiliu7/models/openai/gpt-oss-120b --tp 4 --cuda-graph-max-bs 200 --mem-fraction-static 0.93 --device cuda --host 127.0.0.1 --port 8400 | |
| Evaluation start: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 | |
| ChatCompletionSampler initialized with self.system_message=None self.temperature=0.1 self.max_tokens=4096 self.reasoning_effort='low' | |
| 0%| | 0/198 [00:00<?, ?it/s][2025-09-06 08:34:42 TP0] Prefill batch. #new-seq: 1, #new-token: 448, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:34:42 TP0] Prefill batch. #new-seq: 1, #new-token: 192, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0, | |
| [2025-09-06 08:34:42 TP0] Prefill batch. #new-seq: 14, #new-token: 3648, #cached-token: 896, token usage: 0.00, #running-req: 2, #queue-req: 0, | |
| [2025-09-06 08:34:42 TP0] Prefill batch. #new-seq: 10, #new-token: 3648, #cached-token: 640, token usage: 0.00, #running-req: 16, #queue-req: 0, | |
| [2025-09-06 08:34:42 TP0] Prefill batch. #new-seq: 51, #new-token: 16192, #cached-token: 3264, token usage: 0.00, #running-req: 26, #queue-req: 47, | |
| [2025-09-06 08:34:42 TP0] Prefill batch. #new-seq: 60, #new-token: 16256, #cached-token: 3840, token usage: 0.00, #running-req: 77, #queue-req: 5, | |
| [2025-09-06 08:34:43 TP0] Prefill batch. #new-seq: 61, #new-token: 16320, #cached-token: 4032, token usage: 0.00, #running-req: 137, #queue-req: 0, | |
| [2025-09-06 08:34:43 TP0] Decode batch. #running-req: 198, #token: 62848, token usage: 0.01, cuda graph: True, gen throughput (token/s): 424.67, #queue-req: 0, | |
| [2025-09-06 08:34:43] INFO: 127.0.0.1:44440 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:43] INFO: 127.0.0.1:43052 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:43] INFO: 127.0.0.1:43842 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:43] INFO: 127.0.0.1:42990 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:43] INFO: 127.0.0.1:43898 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:43 TP0] Decode batch. #running-req: 193, #token: 68544, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17176.49, #queue-req: 0, | |
| [2025-09-06 08:34:44] INFO: 127.0.0.1:42958 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:44] INFO: 127.0.0.1:44156 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:44] INFO: 127.0.0.1:44030 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:44] INFO: 127.0.0.1:43372 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:44] INFO: 127.0.0.1:44584 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:44 TP0] Decode batch. #running-req: 188, #token: 73984, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16913.88, #queue-req: 0, | |
| [2025-09-06 08:34:44] INFO: 127.0.0.1:43384 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:44] INFO: 127.0.0.1:44636 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:44] INFO: 127.0.0.1:43320 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:44] INFO: 127.0.0.1:44038 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:44] INFO: 127.0.0.1:43412 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:44] INFO: 127.0.0.1:43790 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:44] INFO: 127.0.0.1:44138 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:44] INFO: 127.0.0.1:43464 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:44] INFO: 127.0.0.1:43970 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:44 TP0] Decode batch. #running-req: 179, #token: 77184, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16437.03, #queue-req: 0, | |
| [2025-09-06 08:34:44] INFO: 127.0.0.1:43600 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:44] INFO: 127.0.0.1:43582 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:44] INFO: 127.0.0.1:44292 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:45] INFO: 127.0.0.1:43146 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:45] INFO: 127.0.0.1:44200 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:45] INFO: 127.0.0.1:44112 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:45] INFO: 127.0.0.1:43828 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:45] INFO: 127.0.0.1:43226 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:45] INFO: 127.0.0.1:43888 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:45 TP0] Decode batch. #running-req: 170, #token: 77504, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15888.40, #queue-req: 0, | |
| [2025-09-06 08:34:45] INFO: 127.0.0.1:44662 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:45] INFO: 127.0.0.1:43190 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:45] INFO: 127.0.0.1:43364 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:45] INFO: 127.0.0.1:44004 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:45] INFO: 127.0.0.1:44652 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:45] INFO: 127.0.0.1:43162 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:45] INFO: 127.0.0.1:44628 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:45 TP0] Decode batch. #running-req: 163, #token: 80768, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15192.08, #queue-req: 0, | |
| [2025-09-06 08:34:45] INFO: 127.0.0.1:43072 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:45] INFO: 127.0.0.1:44694 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:45] INFO: 127.0.0.1:43744 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:45] INFO: 127.0.0.1:42976 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:45] INFO: 127.0.0.1:43754 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:43168 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:44364 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:42942 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:43432 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:43900 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46 TP0] Decode batch. #running-req: 153, #token: 81536, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13955.66, #queue-req: 0, | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:44136 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:44436 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:44178 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:43676 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:43880 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:43696 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:44724 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:44320 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:43032 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:42972 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46 TP0] Decode batch. #running-req: 143, #token: 82176, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13467.78, #queue-req: 0, | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:44460 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:44508 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:43292 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:43422 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:43522 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:43228 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:43044 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:46] INFO: 127.0.0.1:44254 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:42966 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47 TP0] Decode batch. #running-req: 135, #token: 83392, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12737.00, #queue-req: 0, | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:43606 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:43660 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:43622 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:44504 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:44022 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:44732 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:43124 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:43996 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:43088 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:43566 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:44756 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:44150 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:43806 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47 TP0] Decode batch. #running-req: 121, #token: 80000, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13645.15, #queue-req: 0, | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:44216 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:43646 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:43968 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:43224 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:44354 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:43780 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:44526 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:44406 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47 TP0] Decode batch. #running-req: 113, #token: 79360, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16010.52, #queue-req: 0, | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:43916 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:43698 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:43074 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:43450 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:43236 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:47] INFO: 127.0.0.1:44344 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:43346 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:43314 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48 TP0] Decode batch. #running-req: 105, #token: 77248, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15222.64, #queue-req: 0, | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:44680 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:43134 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:43776 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:44304 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:44562 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:43574 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:43636 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:43700 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48 TP0] Decode batch. #running-req: 97, #token: 76032, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14367.14, #queue-req: 0, | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:43106 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:43016 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:43382 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:43264 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:44578 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:44642 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:44382 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:44392 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:44116 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:43800 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:43330 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48 TP0] Decode batch. #running-req: 86, #token: 70464, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13311.84, #queue-req: 0, | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:44368 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:44774 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:43458 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:44492 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:43396 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:44510 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:44054 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:44230 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:43714 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:44634 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48 TP0] Decode batch. #running-req: 76, #token: 64576, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12469.15, #queue-req: 0, | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:43418 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:43280 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:44446 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:43618 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:48] INFO: 127.0.0.1:43956 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:44020 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:43244 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:43208 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:43584 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49 TP0] Decode batch. #running-req: 67, #token: 59840, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11205.85, #queue-req: 0, | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:43536 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:44132 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:44612 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:43814 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:44334 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:44420 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:44528 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:43144 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:43854 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:44308 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49 TP0] Decode batch. #running-req: 57, #token: 53248, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9851.14, #queue-req: 0, | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:44268 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:43728 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:43438 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:43098 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49 TP0] Decode batch. #running-req: 54, #token: 51520, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9352.81, #queue-req: 0, | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:43478 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:42994 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:43118 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:43180 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49 TP0] Decode batch. #running-req: 49, #token: 48832, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8635.47, #queue-req: 0, | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:42962 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:49] INFO: 127.0.0.1:44170 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:50] INFO: 127.0.0.1:44078 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:50 TP0] Decode batch. #running-req: 46, #token: 48704, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8221.97, #queue-req: 0, | |
| [2025-09-06 08:34:50] INFO: 127.0.0.1:44194 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:50] INFO: 127.0.0.1:43928 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:50] INFO: 127.0.0.1:44046 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:50] INFO: 127.0.0.1:43912 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:50] INFO: 127.0.0.1:43506 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:50] INFO: 127.0.0.1:43720 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:50] INFO: 127.0.0.1:43528 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:50 TP0] Decode batch. #running-req: 39, #token: 43072, token usage: 0.01, cuda graph: True, gen throughput (token/s): 7532.11, #queue-req: 0, | |
| [2025-09-06 08:34:50] INFO: 127.0.0.1:44128 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:50] INFO: 127.0.0.1:43158 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:50] INFO: 127.0.0.1:43006 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:50 TP0] Decode batch. #running-req: 36, #token: 41024, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6518.27, #queue-req: 0, | |
| [2025-09-06 08:34:50] INFO: 127.0.0.1:44558 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:50] INFO: 127.0.0.1:44278 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:50] INFO: 127.0.0.1:44668 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:50] INFO: 127.0.0.1:42974 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:50] INFO: 127.0.0.1:43302 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:50 TP0] Decode batch. #running-req: 33, #token: 36672, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5974.62, #queue-req: 0, | |
| [2025-09-06 08:34:50] INFO: 127.0.0.1:44246 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:50] INFO: 127.0.0.1:44538 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:50] INFO: 127.0.0.1:44064 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:50] INFO: 127.0.0.1:44542 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:50 TP0] Decode batch. #running-req: 27, #token: 33472, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5340.73, #queue-req: 0, | |
| [2025-09-06 08:34:50] INFO: 127.0.0.1:44758 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:51] INFO: 127.0.0.1:43680 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:51] INFO: 127.0.0.1:44708 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:51 TP0] Decode batch. #running-req: 24, #token: 30784, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4791.66, #queue-req: 0, | |
| [2025-09-06 08:34:51] INFO: 127.0.0.1:43350 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:51] INFO: 127.0.0.1:43762 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:51] INFO: 127.0.0.1:43876 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:51] INFO: 127.0.0.1:43852 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:51] INFO: 127.0.0.1:43884 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:51 TP0] Decode batch. #running-req: 19, #token: 25152, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4001.31, #queue-req: 0, | |
| [2025-09-06 08:34:51] INFO: 127.0.0.1:44080 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:51] INFO: 127.0.0.1:44412 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:51 TP0] Decode batch. #running-req: 17, #token: 23104, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3588.53, #queue-req: 0, | |
| [2025-09-06 08:34:51] INFO: 127.0.0.1:43866 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:51] INFO: 127.0.0.1:44096 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:51 TP0] Decode batch. #running-req: 15, #token: 20928, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3195.11, #queue-req: 0, | |
| [2025-09-06 08:34:51 TP0] Decode batch. #running-req: 15, #token: 21568, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3141.41, #queue-req: 0, | |
| [2025-09-06 08:34:52] INFO: 127.0.0.1:44362 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:52] INFO: 127.0.0.1:43062 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:52 TP0] Decode batch. #running-req: 13, #token: 19264, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2949.04, #queue-req: 0, | |
| [2025-09-06 08:34:52 TP0] Decode batch. #running-req: 13, #token: 19648, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2726.24, #queue-req: 0, | |
| [2025-09-06 08:34:52] INFO: 127.0.0.1:44616 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:52] INFO: 127.0.0.1:43554 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:52] INFO: 127.0.0.1:43980 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:52 TP0] Decode batch. #running-req: 10, #token: 15744, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2499.47, #queue-req: 0, | |
| [2025-09-06 08:34:52] INFO: 127.0.0.1:43206 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:52 TP0] Decode batch. #running-req: 9, #token: 14528, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2028.36, #queue-req: 0, | |
| [2025-09-06 08:34:52] INFO: 127.0.0.1:43944 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:52 TP0] Decode batch. #running-req: 8, #token: 13376, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1883.41, #queue-req: 0, | |
| [2025-09-06 08:34:52] INFO: 127.0.0.1:43628 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:53] INFO: 127.0.0.1:43250 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:53 TP0] Decode batch. #running-req: 6, #token: 10240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1682.40, #queue-req: 0, | |
| [2025-09-06 08:34:53 TP0] Decode batch. #running-req: 6, #token: 10432, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1441.35, #queue-req: 0, | |
| [2025-09-06 08:34:53 TP0] Decode batch. #running-req: 6, #token: 10688, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1445.57, #queue-req: 0, | |
| [2025-09-06 08:34:53 TP0] Decode batch. #running-req: 6, #token: 10944, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1439.31, #queue-req: 0, | |
| [2025-09-06 08:34:53 TP0] Decode batch. #running-req: 6, #token: 11072, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1438.83, #queue-req: 0, | |
| [2025-09-06 08:34:53] INFO: 127.0.0.1:43494 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:53] INFO: 127.0.0.1:44322 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 1%| | 1/198 [00:11<37:26, 11.40s/it] | |
| [2025-09-06 08:34:53 TP0] Decode batch. #running-req: 4, #token: 7552, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1331.77, #queue-req: 0, | |
| [2025-09-06 08:34:54 TP0] Decode batch. #running-req: 4, #token: 7744, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1056.35, #queue-req: 0, | |
| [2025-09-06 08:34:54 TP0] Decode batch. #running-req: 4, #token: 7872, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1059.88, #queue-req: 0, | |
| [2025-09-06 08:34:54] INFO: 127.0.0.1:44476 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 9%|▊ | 17/198 [00:11<01:30, 2.01it/s] | |
| [2025-09-06 08:34:54] INFO: 127.0.0.1:43542 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:34:54 TP0] Decode batch. #running-req: 2, #token: 3968, token usage: 0.00, cuda graph: True, gen throughput (token/s): 621.73, #queue-req: 0, | |
| [2025-09-06 08:34:54 TP0] Decode batch. #running-req: 2, #token: 4032, token usage: 0.00, cuda graph: True, gen throughput (token/s): 580.81, #queue-req: 0, | |
| [2025-09-06 08:34:54 TP0] Decode batch. #running-req: 2, #token: 4160, token usage: 0.00, cuda graph: True, gen throughput (token/s): 579.50, #queue-req: 0, | |
| [2025-09-06 08:34:54 TP0] Decode batch. #running-req: 2, #token: 4160, token usage: 0.00, cuda graph: True, gen throughput (token/s): 579.01, #queue-req: 0, | |
| [2025-09-06 08:34:54] INFO: 127.0.0.1:44598 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 15%|█▌ | 30/198 [00:12<00:44, 3.81it/s] | |
| [2025-09-06 08:34:54 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 535.38, #queue-req: 0, | |
| [2025-09-06 08:34:55 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.50, #queue-req: 0, | |
| [2025-09-06 08:34:55 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.40, #queue-req: 0, | |
| [2025-09-06 08:34:55] INFO: 127.0.0.1:44742 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 23%|██▎ | 45/198 [00:12<00:23, 6.61it/s] | |
| 100%|██████████| 198/198 [00:12<00:00, 15.54it/s] | |
| /usr/lib/python3.12/subprocess.py:1127: ResourceWarning: subprocess 39541 is still running | |
| _warn("subprocess %s is still running" % self.pid, | |
| ResourceWarning: Enable tracemalloc to get the object allocation traceback | |
| . | |
| ---------------------------------------------------------------------- | |
| Ran 1 test in 177.044s | |
| OK | |
| Writing report to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.html | |
| {'chars': 1735.459595959596, 'chars:std': 1063.0222380884343, 'score:std': 0.4824488175389596, 'score': 0.6313131313131313} | |
| Writing results to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.json | |
| Total latency: 12.798 s | |
| Score: 0.631 | |
| Evaluation end: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 metrics={'chars': 1735.459595959596, 'chars:std': 1063.0222380884343, 'score:std': 0.4824488175389596, 'score': 0.6313131313131313} | |
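The reported `score:std` is consistent with per-sample 0/1 correctness: for binary scores the population standard deviation is fully determined by the mean, std = sqrt(p(1 - p)). A minimal sketch checking this against the metrics logged above (the metric values are taken from the log; the binary-score assumption is mine):

```python
import math

# Reported GPQA metrics for this run (copied from the log above).
score = 0.6313131313131313      # mean per-sample score over 198 questions
score_std = 0.4824488175389596  # reported std of the per-sample scores

# If each per-sample score is 0 or 1, the population std is sqrt(p * (1 - p)),
# so the reported std should follow directly from the mean.
derived_std = math.sqrt(score * (1.0 - score))
print(f"derived std = {derived_std:.10f}")  # matches score:std to float precision
```

The match (0.4824488175…) suggests the harness scores each of the 198 questions as simply right or wrong before averaging.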
| ================================================================================ | |
| Run 4: | |
| Auto-configed device: cuda | |
| WARNING:sglang.srt.server_args:Detected SM100 and MXFP4 quantization format for GPT-OSS model, enabling FlashInfer MXFP4 MOE kernel. | |
| WARNING:sglang.srt.server_args:TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64. | |
| [2025-09-06 08:35:09] server_args=ServerArgs(model_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8400, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.93, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=True, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=615780304, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/home/yiliu7/models/openai/gpt-oss-120b', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='flashinfer_mxfp4', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=200, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False) | |
| All deep_gemm operations loaded successfully! | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:35:09] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:35:09] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:35:10] Using default HuggingFace chat template with detected content format: string | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:35:16 TP1] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:35:16 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:35:17 TP3] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:35:17 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:35:17 TP0] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:35:17 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:35:17 TP2] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:35:17 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:35:17 TP1] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:35:17 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:35:17 TP3] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:35:17 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:35:17 TP0] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:35:17 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:35:17 TP0] Init torch distributed begin. | |
| [2025-09-06 08:35:17 TP2] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:35:17 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [2025-09-06 08:35:19 TP0] sglang is using nccl==2.27.3 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [2025-09-06 08:35:21 TP0] Init torch distributed ends. mem usage=1.46 GB | |
| [2025-09-06 08:35:22 TP0] Load weight begin. avail mem=176.28 GB | |
| All deep_gemm operations loaded successfully! | |
| Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s] | |
| Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 1615.93it/s] | |
| All deep_gemm operations loaded successfully! | |
| All deep_gemm operations loaded successfully! | |
| [2025-09-06 08:35:37 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.0.mlp.experts), it might take a while... | |
| All deep_gemm operations loaded successfully! | |
| [2025-09-06 08:35:40 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.1.mlp.experts), it might take a while... | |
| [2025-09-06 08:35:43 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.2.mlp.experts), it might take a while... | |
| [2025-09-06 08:35:46 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.3.mlp.experts), it might take a while... | |
| [2025-09-06 08:35:50 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.4.mlp.experts), it might take a while... | |
| [2025-09-06 08:35:53 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.5.mlp.experts), it might take a while... | |
| [2025-09-06 08:35:56 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.6.mlp.experts), it might take a while... | |
| [2025-09-06 08:35:59 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.7.mlp.experts), it might take a while... | |
| [2025-09-06 08:36:02 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.8.mlp.experts), it might take a while... | |
| [2025-09-06 08:36:05 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.9.mlp.experts), it might take a while... | |
| [2025-09-06 08:36:08 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.10.mlp.experts), it might take a while... | |
| [2025-09-06 08:36:11 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.11.mlp.experts), it might take a while... | |
| [2025-09-06 08:36:15 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.12.mlp.experts), it might take a while... | |
| [2025-09-06 08:36:18 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.13.mlp.experts), it might take a while... | |
| [2025-09-06 08:36:21 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.14.mlp.experts), it might take a while... | |
| [2025-09-06 08:36:24 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.15.mlp.experts), it might take a while... | |
| [2025-09-06 08:36:27 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.16.mlp.experts), it might take a while... | |
| [2025-09-06 08:36:33 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.17.mlp.experts), it might take a while... | |
| [2025-09-06 08:36:37 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.18.mlp.experts), it might take a while... | |
| [2025-09-06 08:36:41 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.19.mlp.experts), it might take a while... | |
| [2025-09-06 08:36:44 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.20.mlp.experts), it might take a while... | |
| [2025-09-06 08:36:48 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.21.mlp.experts), it might take a while... | |
| [2025-09-06 08:36:51 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.22.mlp.experts), it might take a while... | |
| [2025-09-06 08:36:54 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.23.mlp.experts), it might take a while... | |
| [2025-09-06 08:36:57 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.24.mlp.experts), it might take a while... | |
| [2025-09-06 08:37:00 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.25.mlp.experts), it might take a while... | |
| [2025-09-06 08:37:03 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.26.mlp.experts), it might take a while... | |
| [2025-09-06 08:37:07 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.27.mlp.experts), it might take a while... | |
| [2025-09-06 08:37:10 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.28.mlp.experts), it might take a while... | |
| [2025-09-06 08:37:13 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.29.mlp.experts), it might take a while... | |
| [2025-09-06 08:37:16 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.30.mlp.experts), it might take a while... | |
| [2025-09-06 08:37:19 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.31.mlp.experts), it might take a while... | |
| [2025-09-06 08:37:22 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.32.mlp.experts), it might take a while... | |
| [2025-09-06 08:37:25 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.33.mlp.experts), it might take a while... | |
| [2025-09-06 08:37:28 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.34.mlp.experts), it might take a while... | |
| [2025-09-06 08:37:32 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.35.mlp.experts), it might take a while... | |
| [2025-09-06 08:37:35 TP0] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=158.06 GB, mem usage=18.22 GB. | |
| [2025-09-06 08:37:35 TP0] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:37:35 TP0] Memory pool end. avail mem=10.23 GB | |
| [2025-09-06 08:37:35 TP2] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:37:35 TP1] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:37:35 TP3] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
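The logged K/V sizes follow from the token count times the per-token cache footprint on each tensor-parallel rank. A back-of-the-envelope sketch, assuming the published gpt-oss-120b attention shape (36 layers, 8 KV heads, head_dim 64) sharded over tp_size=4 with a bf16 KV cache (kv_cache_dtype='auto' follows the model dtype); adjust these numbers if your checkpoint differs:

```python
# Sanity-check the logged "K size: 72.85 GB" per rank.
tokens = 8_487_040        # "#tokens" from the log
num_layers = 36           # assumed gpt-oss-120b config
num_kv_heads = 8          # assumed gpt-oss-120b config
head_dim = 64             # assumed gpt-oss-120b config
tp_size = 4               # KV heads are sharded across tensor-parallel ranks
bytes_per_elem = 2        # bf16

k_bytes_per_token = num_layers * (num_kv_heads // tp_size) * head_dim * bytes_per_elem
k_size_gib = tokens * k_bytes_per_token / 2**30
print(f"K size per rank ≈ {k_size_gib:.2f} GiB")  # ~72.8, matching the logged 72.85 GB
```

The V cache is the same shape, which is why K and V sizes are logged as equal; together they account for most of the 158 GB available after weight loading.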
| [2025-09-06 08:37:35 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=9.54 GB | |
| [2025-09-06 08:37:35 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200] | |
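The captured batch-size ladder above follows a simple pattern: powers of two up to 8, then multiples of 8 up to cuda_graph_max_bs=200. A sketch reproducing the logged list (the pattern is read off the log; SGLang's actual default schedule may be computed differently in its source):

```python
# Reproduce the cuda-graph capture batch sizes from the log:
# [1, 2, 4, 8] then 16, 24, ..., 200 in steps of 8.
cuda_graph_max_bs = 200  # from server_args in the log
capture_bs = [1, 2, 4, 8] + list(range(16, cuda_graph_max_bs + 1, 8))
print(len(capture_bs), capture_bs)  # 28 sizes, matching the "0/28" progress bar
```

Capturing one graph per bucket and padding each decode batch up to the nearest bucket is what lets every "cuda graph: True" decode step above run without re-capture.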
| Capturing batches (bs=200 avail_mem=9.39 GB): 0%| | 0/28 [00:00<?, ?it/s] | |
| rank 2 allocated ipc_handles: [['0x739ed0000000', '0x739e68000000', '0x73c466000000', '0x739e64000000'], ['0x739e67000000', '0x739e67200000', '0x739e66e00000', '0x739e67400000'], ['0x739e50000000', '0x739e46000000', '0x739e5a000000', '0x739e3c000000']] | |
| [2025-09-06 08:37:37.715] [info] lamportInitialize start: buffer: 0x739e5a000000, size: 71303168 | |
| rank 1 allocated ipc_handles: [['0x7a06d4000000', '0x7a2c70000000', '0x7a0670000000', '0x7a066c000000'], ['0x7a066f000000', '0x7a066ee00000', '0x7a066f200000', '0x7a066f400000'], ['0x7a0658000000', '0x7a0662000000', '0x7a064e000000', '0x7a0644000000']] | |
| [2025-09-06 08:37:37.765] [info] lamportInitialize start: buffer: 0x7a0662000000, size: 71303168 | |
| rank 3 allocated ipc_handles: [['0x771bbc000000', '0x771b82000000', '0x771b7e000000', '0x774172000000'], ['0x771b81000000', '0x771b81200000', '0x771b81400000', '0x771b80e00000'], ['0x771b6a000000', '0x771b60000000', '0x771b56000000', '0x771b74000000']] | |
| [2025-09-06 08:37:37.817] [info] lamportInitialize start: buffer: 0x771b74000000, size: 71303168 | |
| rank 0 allocated ipc_handles: [['0x734ac8000000', '0x7324d6000000', '0x7324d2000000', '0x7324ce000000'], ['0x7324d0e00000', '0x7324d1000000', '0x7324d1200000', '0x7324d1400000'], ['0x7324c4000000', '0x7324ba000000', '0x7324b0000000', '0x7324a6000000']] | |
| [2025-09-06 08:37:37.865] [info] lamportInitialize start: buffer: 0x7324c4000000, size: 71303168 | |
| [2025-09-06 08:37:37 TP0] FlashInfer workspace initialized for rank 0, world_size 4 | |
| [2025-09-06 08:37:37 TP1] FlashInfer workspace initialized for rank 1, world_size 4 | |
| [2025-09-06 08:37:37 TP3] FlashInfer workspace initialized for rank 3, world_size 4 | |
| [2025-09-06 08:37:37 TP2] FlashInfer workspace initialized for rank 2, world_size 4 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 0 workspace[0] 0x734ac8000000 | |
| Rank 0 workspace[1] 0x7324d6000000 | |
| Rank 0 workspace[2] 0x7324d2000000 | |
| Rank 0 workspace[3] 0x7324ce000000 | |
| Rank 0 workspace[4] 0x7324d0e00000 | |
| Rank 0 workspace[5] 0x7324d1000000 | |
| Rank 0 workspace[6] 0x7324d1200000 | |
| Rank 0 workspace[7] 0x7324d1400000 | |
| Rank 0 workspace[8] 0x7324c4000000 | |
| Rank 0 workspace[9] 0x7324ba000000 | |
| Rank 0 workspace[10] 0x7324b0000000 | |
| Rank 0 workspace[11] 0x7324a6000000 | |
| Rank 0 workspace[12] 0x7350c3264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 1 workspace[0] 0x7a06d4000000 | |
| Rank 1 workspace[1] 0x7a2c70000000 | |
| Rank 1 workspace[2] 0x7a0670000000 | |
| Rank 1 workspace[3] 0x7a066c000000 | |
| Rank 1 workspace[4] 0x7a066f000000 | |
| Rank 1 workspace[5] 0x7a066ee00000 | |
| Rank 1 workspace[6] 0x7a066f200000 | |
| Rank 1 workspace[7] 0x7a066f400000 | |
| Rank 1 workspace[8] 0x7a0658000000 | |
| Rank 1 workspace[9] 0x7a0662000000 | |
| Rank 1 workspace[10] 0x7a064e000000 | |
| Rank 1 workspace[11] 0x7a0644000000 | |
| Rank 1 workspace[12] 0x7a327b264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 3 workspace[0] 0x771bbc000000 | |
| Rank 3 workspace[1] 0x771b82000000 | |
| Rank 3 workspace[2] 0x771b7e000000 | |
| Rank 3 workspace[3] 0x774172000000 | |
| Rank 3 workspace[4] 0x771b81000000 | |
| Rank 3 workspace[5] 0x771b81200000 | |
| Rank 3 workspace[6] 0x771b81400000 | |
| Rank 3 workspace[7] 0x771b80e00000 | |
| Rank 3 workspace[8] 0x771b6a000000 | |
| Rank 3 workspace[9] 0x771b60000000 | |
| Rank 3 workspace[10] 0x771b56000000 | |
| Rank 3 workspace[11] 0x771b74000000 | |
| Rank 3 workspace[12] 0x77476d264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 2 workspace[0] 0x739ed0000000 | |
| Rank 2 workspace[1] 0x739e68000000 | |
| Rank 2 workspace[2] 0x73c466000000 | |
| Rank 2 workspace[3] 0x739e64000000 | |
| Rank 2 workspace[4] 0x739e67000000 | |
| Rank 2 workspace[5] 0x739e67200000 | |
| Rank 2 workspace[6] 0x739e66e00000 | |
| Rank 2 workspace[7] 0x739e67400000 | |
| Rank 2 workspace[8] 0x739e50000000 | |
| Rank 2 workspace[9] 0x739e46000000 | |
| Rank 2 workspace[10] 0x739e5a000000 | |
| Rank 2 workspace[11] 0x739e3c000000 | |
| Rank 2 workspace[12] 0x73ca6f264400 | |
| Capturing batches (bs=200 avail_mem=9.39 GB): 4%|▎ | 1/28 [00:02<01:00, 2.26s/it] ... Capturing batches (bs=1 avail_mem=7.82 GB): 93%|█████████▎| 26/28 [00:04<00:00, 11.94it/s] | |
| [2025-09-06 08:37:40 TP1] Registering 56 cuda graph addresses | |
| Capturing batches (bs=1 avail_mem=7.82 GB): 100%|██████████| 28/28 [00:04<00:00, 11.38it/s] Capturing batches (bs=1 avail_mem=7.82 GB): 100%|██████████| 28/28 [00:04<00:00, 6.05it/s] | |
| [2025-09-06 08:37:40 TP0] Registering 56 cuda graph addresses | |
| [2025-09-06 08:37:40 TP3] Registering 56 cuda graph addresses | |
| [2025-09-06 08:37:40 TP2] Registering 56 cuda graph addresses | |
| [2025-09-06 08:37:40 TP0] Capture cuda graph end. Time elapsed: 5.07 s. mem usage=1.73 GB. avail mem=7.81 GB. | |
| [2025-09-06 08:37:41 TP0] max_total_num_tokens=8487040, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=131072, available_gpu_mem=7.81 GB | |
| [2025-09-06 08:37:41] INFO: Started server process [42006] | |
| [2025-09-06 08:37:41] INFO: Waiting for application startup. | |
| [2025-09-06 08:37:42] INFO: Application startup complete. | |
| [2025-09-06 08:37:42] INFO: Uvicorn running on http://127.0.0.1:8400 (Press CTRL+C to quit) | |
| [2025-09-06 08:37:43] INFO: 127.0.0.1:53712 - "GET /get_model_info HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:43 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:37:44] INFO: 127.0.0.1:53722 - "POST /generate HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:44] The server is fired up and ready to roll! | |
| [2025-09-06 08:37:44 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:37:45] INFO: 127.0.0.1:53738 - "GET /health_generate HTTP/1.1" 200 OK | |
| command=python3 -m sglang.launch_server --model-path /home/yiliu7/models/openai/gpt-oss-120b --tp 4 --cuda-graph-max-bs 200 --mem-fraction-static 0.93 --device cuda --host 127.0.0.1 --port 8400 | |
| Evaluation start: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 | |
| ChatCompletionSampler initialized with self.system_message=None self.temperature=0.1 self.max_tokens=4096 self.reasoning_effort='low' | |
| 0%| | 0/198 [00:00<?, ?it/s] | |
| [2025-09-06 08:37:46 TP0] Prefill batch. #new-seq: 1, #new-token: 256, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:37:46 TP0] Prefill batch. #new-seq: 1, #new-token: 640, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0, | |
| [2025-09-06 08:37:46 TP0] Prefill batch. #new-seq: 15, #new-token: 4800, #cached-token: 960, token usage: 0.00, #running-req: 2, #queue-req: 0, | |
| [2025-09-06 08:37:46 TP0] Prefill batch. #new-seq: 55, #new-token: 16320, #cached-token: 3520, token usage: 0.00, #running-req: 17, #queue-req: 47, | |
| [2025-09-06 08:37:46 TP0] Prefill batch. #new-seq: 57, #new-token: 15296, #cached-token: 3648, token usage: 0.00, #running-req: 72, #queue-req: 0, | |
| [2025-09-06 08:37:46 TP0] Prefill batch. #new-seq: 58, #new-token: 16320, #cached-token: 3776, token usage: 0.00, #running-req: 129, #queue-req: 11, | |
| [2025-09-06 08:37:46 TP0] Prefill batch. #new-seq: 11, #new-token: 3136, #cached-token: 704, token usage: 0.01, #running-req: 187, #queue-req: 0, | |
| [2025-09-06 08:37:47 TP0] Decode batch. #running-req: 198, #token: 62976, token usage: 0.01, cuda graph: True, gen throughput (token/s): 986.46, #queue-req: 0, | |
| [2025-09-06 08:37:47] INFO: 127.0.0.1:55372 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:47] INFO: 127.0.0.1:54924 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:47] INFO: 127.0.0.1:55340 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:47] INFO: 127.0.0.1:54334 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:47 TP0] Decode batch. #running-req: 194, #token: 68928, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17158.29, #queue-req: 0, | |
| [2025-09-06 08:37:47] INFO: 127.0.0.1:54134 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:47] INFO: 127.0.0.1:55242 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:47] INFO: 127.0.0.1:54550 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:48] INFO: 127.0.0.1:55312 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:48] INFO: 127.0.0.1:54414 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:48] INFO: 127.0.0.1:53950 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:48] INFO: 127.0.0.1:54428 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:48] INFO: 127.0.0.1:55094 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:48] INFO: 127.0.0.1:53964 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:48] INFO: 127.0.0.1:54790 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:48 TP0] Decode batch. #running-req: 186, #token: 70464, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17032.28, #queue-req: 0, | |
| [2025-09-06 08:37:48] INFO: 127.0.0.1:53892 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:48] INFO: 127.0.0.1:54046 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:48] INFO: 127.0.0.1:54498 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:48] INFO: 127.0.0.1:54714 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:48] INFO: 127.0.0.1:53976 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:48] INFO: 127.0.0.1:54318 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:48] INFO: 127.0.0.1:54538 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:48] INFO: 127.0.0.1:53818 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:48] INFO: 127.0.0.1:54622 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:48] INFO: 127.0.0.1:54342 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:48] INFO: 127.0.0.1:53744 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:48] INFO: 127.0.0.1:54142 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:48 TP0] Decode batch. #running-req: 173, #token: 71424, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16273.08, #queue-req: 0, | |
| [2025-09-06 08:37:48] INFO: 127.0.0.1:55112 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:48] INFO: 127.0.0.1:53802 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:54982 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:53908 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:55380 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49 TP0] Decode batch. #running-req: 168, #token: 75840, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15646.05, #queue-req: 0, | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:54644 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:54720 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:53980 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:54704 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:55180 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:54150 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:53746 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:55290 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:54904 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:54278 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:55140 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49 TP0] Decode batch. #running-req: 156, #token: 77824, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14502.97, #queue-req: 0, | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:54816 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:53756 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:54526 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:55120 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:54338 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:55062 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:55178 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:55396 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:54910 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:49] INFO: 127.0.0.1:53844 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:50 TP0] Decode batch. #running-req: 146, #token: 78272, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13669.30, #queue-req: 0, | |
| [2025-09-06 08:37:50] INFO: 127.0.0.1:54436 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:50] INFO: 127.0.0.1:53766 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:50] INFO: 127.0.0.1:54386 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:50] INFO: 127.0.0.1:55334 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:50 TP0] Decode batch. #running-req: 142, #token: 82112, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13344.68, #queue-req: 0, | |
| [2025-09-06 08:37:50] INFO: 127.0.0.1:54844 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:50] INFO: 127.0.0.1:54860 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:50] INFO: 127.0.0.1:54504 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:50] INFO: 127.0.0.1:54086 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:50] INFO: 127.0.0.1:53928 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:50] INFO: 127.0.0.1:54232 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:50] INFO: 127.0.0.1:55348 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:50] INFO: 127.0.0.1:54114 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:50] INFO: 127.0.0.1:55412 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:50] INFO: 127.0.0.1:54272 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:50] INFO: 127.0.0.1:55216 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:50] INFO: 127.0.0.1:55214 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:50 TP0] Decode batch. #running-req: 131, #token: 81152, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12770.37, #queue-req: 0, | |
| [2025-09-06 08:37:50] INFO: 127.0.0.1:54196 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:50] INFO: 127.0.0.1:54018 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:50] INFO: 127.0.0.1:55052 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:54886 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:54486 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:54698 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:54180 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:54152 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:54658 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:54536 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51 TP0] Decode batch. #running-req: 120, #token: 79616, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15921.90, #queue-req: 0, | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:53876 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:55226 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:54302 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:54394 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:55014 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:54246 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:54986 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:54742 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:53934 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:55000 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:54838 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51 TP0] Decode batch. #running-req: 109, #token: 76416, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15876.24, #queue-req: 0, | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:54744 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:55404 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:54168 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:55082 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:53800 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:54006 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:54028 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:53862 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:54234 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51 TP0] Decode batch. #running-req: 100, #token: 74048, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14915.39, #queue-req: 0, | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:53994 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:54294 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:53874 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:53852 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:51] INFO: 127.0.0.1:54870 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:54900 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:54854 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:54124 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:54936 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:54952 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:54684 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52 TP0] Decode batch. #running-req: 89, #token: 69824, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13651.23, #queue-req: 0, | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:53886 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:54516 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:54798 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:54962 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:55384 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:54874 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52 TP0] Decode batch. #running-req: 84, #token: 67392, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12980.65, #queue-req: 0, | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:54378 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:53770 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:54834 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:55342 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:54970 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:54452 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:53954 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:54216 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:54408 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:54098 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:54746 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:55128 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:54912 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:53742 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:54670 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52 TP0] Decode batch. #running-req: 68, #token: 58624, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11580.63, #queue-req: 0, | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:53916 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52] INFO: 127.0.0.1:55352 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:52 TP0] Decode batch. #running-req: 66, #token: 59904, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10703.52, #queue-req: 0, | |
| [2025-09-06 08:37:53] INFO: 127.0.0.1:54258 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:53] INFO: 127.0.0.1:54572 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:53] INFO: 127.0.0.1:54938 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:53 TP0] Decode batch. #running-req: 63, #token: 59328, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10438.09, #queue-req: 0, | |
| [2025-09-06 08:37:53] INFO: 127.0.0.1:54208 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:53] INFO: 127.0.0.1:54990 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:53] INFO: 127.0.0.1:54350 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:53] INFO: 127.0.0.1:54160 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:53] INFO: 127.0.0.1:54806 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:53] INFO: 127.0.0.1:54566 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:53] INFO: 127.0.0.1:54496 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:53 TP0] Decode batch. #running-req: 56, #token: 54592, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9934.25, #queue-req: 0, | |
| [2025-09-06 08:37:53] INFO: 127.0.0.1:55118 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:53] INFO: 127.0.0.1:53784 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:53 TP0] Decode batch. #running-req: 54, #token: 54080, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9371.48, #queue-req: 0, | |
| [2025-09-06 08:37:53] INFO: 127.0.0.1:55360 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:53] INFO: 127.0.0.1:55320 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:53] INFO: 127.0.0.1:54128 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:53] INFO: 127.0.0.1:54774 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:53] INFO: 127.0.0.1:54634 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:53] INFO: 127.0.0.1:55172 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:53 TP0] Decode batch. #running-req: 48, #token: 50880, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8626.22, #queue-req: 0, | |
| [2025-09-06 08:37:53] INFO: 127.0.0.1:55156 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:53] INFO: 127.0.0.1:54324 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:53] INFO: 127.0.0.1:54256 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:55198 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54 TP0] Decode batch. #running-req: 44, #token: 48256, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8115.64, #queue-req: 0, | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:54034 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:55068 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:55100 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54 TP0] Decode batch. #running-req: 41, #token: 46528, token usage: 0.01, cuda graph: True, gen throughput (token/s): 7577.10, #queue-req: 0, | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:54458 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:55296 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:55406 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:54596 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:54482 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:54628 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:55274 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54 TP0] Decode batch. #running-req: 34, #token: 39104, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6738.34, #queue-req: 0, | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:54074 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:54444 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:54092 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:53872 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:54818 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:54768 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54 TP0] Decode batch. #running-req: 28, #token: 34880, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5616.97, #queue-req: 0, | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:53918 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:54362 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:54728 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:53754 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:55280 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:54 TP0] Decode batch. #running-req: 23, #token: 29184, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4789.55, #queue-req: 0, | |
| [2025-09-06 08:37:54] INFO: 127.0.0.1:54756 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:55] INFO: 127.0.0.1:54268 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:55] INFO: 127.0.0.1:55182 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:55] INFO: 127.0.0.1:54356 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:55 TP0] Decode batch. #running-req: 19, #token: 24832, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4037.65, #queue-req: 0, | |
| [2025-09-06 08:37:55] INFO: 127.0.0.1:55030 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:55 TP0] Decode batch. #running-req: 18, #token: 21568, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3674.82, #queue-req: 0, | |
| [2025-09-06 08:37:55] INFO: 127.0.0.1:53846 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:55] INFO: 127.0.0.1:55024 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:55] INFO: 127.0.0.1:53834 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:55] INFO: 127.0.0.1:54110 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:55] INFO: 127.0.0.1:53778 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:55] INFO: 127.0.0.1:54568 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:55 TP0] Decode batch. #running-req: 12, #token: 16576, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2813.33, #queue-req: 0, | |
| [2025-09-06 08:37:55] INFO: 127.0.0.1:54280 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:55 TP0] Decode batch. #running-req: 11, #token: 15744, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2427.20, #queue-req: 0, | |
| [2025-09-06 08:37:55] INFO: 127.0.0.1:55040 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:55] INFO: 127.0.0.1:54148 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:55 TP0] Decode batch. #running-req: 9, #token: 13248, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2133.90, #queue-req: 0, | |
| [2025-09-06 08:37:56 TP0] Decode batch. #running-req: 9, #token: 13440, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1900.71, #queue-req: 0, | |
| [2025-09-06 08:37:56] INFO: 127.0.0.1:54470 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:56] INFO: 127.0.0.1:54588 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:56] INFO: 127.0.0.1:54060 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:56] INFO: 127.0.0.1:54112 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:56] INFO: 127.0.0.1:55142 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:56 TP0] Decode batch. #running-req: 4, #token: 4736, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1559.00, #queue-req: 0, | |
| [2025-09-06 08:37:56] INFO: 127.0.0.1:54610 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:56 TP0] Decode batch. #running-req: 3, #token: 4864, token usage: 0.00, cuda graph: True, gen throughput (token/s): 803.60, #queue-req: 0, | |
| [2025-09-06 08:37:56] INFO: 127.0.0.1:55258 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:37:56 TP0] Decode batch. #running-req: 2, #token: 3392, token usage: 0.00, cuda graph: True, gen throughput (token/s): 610.98, #queue-req: 0, | |
| [2025-09-06 08:37:56] INFO: 127.0.0.1:54852 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 1%| | 1/198 [00:10<34:54, 10.63s/it] | |
| [2025-09-06 08:37:56 TP0] Decode batch. #running-req: 1, #token: 1664, token usage: 0.00, cuda graph: True, gen throughput (token/s): 512.37, #queue-req: 0, | |
| [2025-09-06 08:37:56 TP0] Decode batch. #running-req: 1, #token: 1728, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.50, #queue-req: 0, | |
| [2025-09-06 08:37:56 TP0] Decode batch. #running-req: 1, #token: 1792, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.95, #queue-req: 0, | |
| [2025-09-06 08:37:57 TP0] Decode batch. #running-req: 1, #token: 1792, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.45, #queue-req: 0, | |
| [2025-09-06 08:37:57 TP0] Decode batch. #running-req: 1, #token: 1856, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.02, #queue-req: 0, | |
| [2025-09-06 08:37:57 TP0] Decode batch. #running-req: 1, #token: 1920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.02, #queue-req: 0, | |
| [2025-09-06 08:37:57 TP0] Decode batch. #running-req: 1, #token: 1920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.09, #queue-req: 0, | |
| [2025-09-06 08:37:57 TP0] Decode batch. #running-req: 1, #token: 1984, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.74, #queue-req: 0, | |
| [2025-09-06 08:37:57 TP0] Decode batch. #running-req: 1, #token: 1984, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.23, #queue-req: 0, | |
| [2025-09-06 08:37:57 TP0] Decode batch. #running-req: 1, #token: 2048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.15, #queue-req: 0, | |
| [2025-09-06 08:37:57 TP0] Decode batch. #running-req: 1, #token: 2112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.99, #queue-req: 0, | |
| [2025-09-06 08:37:58 TP0] Decode batch. #running-req: 1, #token: 2112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.87, #queue-req: 0, | |
| [2025-09-06 08:37:58] INFO: 127.0.0.1:55070 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 100%|██████████| 198/198 [00:12<00:00, 16.45it/s] | |
| /usr/lib/python3.12/subprocess.py:1127: ResourceWarning: subprocess 42006 is still running | |
| _warn("subprocess %s is still running" % self.pid, | |
| ResourceWarning: Enable tracemalloc to get the object allocation traceback | |
| . | |
| ---------------------------------------------------------------------- | |
| Ran 1 test in 176.322s | |
| OK | |
| Writing report to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.html | |
| {'chars': 1670.3434343434344, 'chars:std': 964.280672515106, 'score:std': 0.46958834412435685, 'score': 0.6717171717171717} | |
| Writing results to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.json | |
| Total latency: 12.093 s | |
| Score: 0.672 | |
| Evaluation end: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 metrics={'chars': 1670.3434343434344, 'chars:std': 964.280672515106, 'score:std': 0.46958834412435685, 'score': 0.6717171717171717} | |
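The reported `score:std` above is consistent with the population standard deviation of binary pass/fail scores, sqrt(p * (1 - p)). A minimal sketch, assuming each of the 198 GPQA samples scores exactly 0 or 1 (133/198 correct):

```python
import math

# Mean score reported by the eval run above
p = 0.6717171717171717

# Population standard deviation of a 0/1 variable with mean p
std = math.sqrt(p * (1 - p))

print(std)  # matches the reported score:std of ~0.4696
```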
| ================================================================================ | |
| Run 5: | |
| Auto-configed device: cuda | |
| WARNING:sglang.srt.server_args:Detected SM100 and MXFP4 quantization format for GPT-OSS model, enabling FlashInfer MXFP4 MOE kernel. | |
| WARNING:sglang.srt.server_args:TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64. | |
| [2025-09-06 08:38:12] server_args=ServerArgs(model_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8400, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.93, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=True, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=861196426, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/home/yiliu7/models/openai/gpt-oss-120b', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, 
load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='flashinfer_mxfp4', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=200, 
cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False) | |
| All deep_gemm operations loaded successfully! | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:38:12] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:38:12] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:38:13] Using default HuggingFace chat template with detected content format: string | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:38:19 TP0] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:38:19 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:38:19 TP3] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:38:19 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:38:19 TP2] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:38:19 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:38:19 TP1] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:38:19 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:38:20 TP0] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:38:20 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:38:20 TP0] Init torch distributed begin. | |
| [2025-09-06 08:38:20 TP3] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:38:20 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:38:20 TP2] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:38:20 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:38:20 TP1] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:38:20 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [2025-09-06 08:38:21 TP0] sglang is using nccl==2.27.3 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [2025-09-06 08:38:24 TP0] Init torch distributed ends. mem usage=1.46 GB | |
| [2025-09-06 08:38:24 TP0] Load weight begin. avail mem=176.28 GB | |
| All deep_gemm operations loaded successfully! | |
| Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s] | |
| Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 1591.28it/s] | |
| [2025-09-06 08:38:33 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.0.mlp.experts), it might take a while... | |
| All deep_gemm operations loaded successfully! | |
| [2025-09-06 08:38:36 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.1.mlp.experts), it might take a while... | |
| All deep_gemm operations loaded successfully! | |
| All deep_gemm operations loaded successfully! | |
| [2025-09-06 08:38:39 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.2.mlp.experts), it might take a while... | |
| [2025-09-06 08:38:42 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.3.mlp.experts), it might take a while... | |
| [2025-09-06 08:38:46 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.4.mlp.experts), it might take a while... | |
| [2025-09-06 08:38:49 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.5.mlp.experts), it might take a while... | |
| [2025-09-06 08:38:52 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.6.mlp.experts), it might take a while... | |
| [2025-09-06 08:38:55 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.7.mlp.experts), it might take a while... | |
| [2025-09-06 08:38:58 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.8.mlp.experts), it might take a while... | |
| [2025-09-06 08:39:01 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.9.mlp.experts), it might take a while... | |
| [2025-09-06 08:39:04 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.10.mlp.experts), it might take a while... | |
| [2025-09-06 08:39:07 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.11.mlp.experts), it might take a while... | |
| [2025-09-06 08:39:10 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.12.mlp.experts), it might take a while... | |
| [2025-09-06 08:39:13 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.13.mlp.experts), it might take a while... | |
| [2025-09-06 08:39:16 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.14.mlp.experts), it might take a while... | |
| [2025-09-06 08:39:19 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.15.mlp.experts), it might take a while... | |
| [2025-09-06 08:39:22 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.16.mlp.experts), it might take a while... | |
| [2025-09-06 08:39:25 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.17.mlp.experts), it might take a while... | |
| [2025-09-06 08:39:28 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.18.mlp.experts), it might take a while... | |
| [2025-09-06 08:39:32 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.19.mlp.experts), it might take a while... | |
| [2025-09-06 08:39:35 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.20.mlp.experts), it might take a while... | |
| [2025-09-06 08:39:38 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.21.mlp.experts), it might take a while... | |
| [2025-09-06 08:39:41 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.22.mlp.experts), it might take a while... | |
| [2025-09-06 08:39:44 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.23.mlp.experts), it might take a while... | |
| [2025-09-06 08:39:47 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.24.mlp.experts), it might take a while... | |
| [2025-09-06 08:39:50 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.25.mlp.experts), it might take a while... | |
| [2025-09-06 08:39:53 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.26.mlp.experts), it might take a while... | |
| [2025-09-06 08:39:56 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.27.mlp.experts), it might take a while... | |
| [2025-09-06 08:39:59 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.28.mlp.experts), it might take a while... | |
| [2025-09-06 08:40:02 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.29.mlp.experts), it might take a while... | |
| [2025-09-06 08:40:05 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.30.mlp.experts), it might take a while... | |
| [2025-09-06 08:40:08 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.31.mlp.experts), it might take a while... | |
| [2025-09-06 08:40:11 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.32.mlp.experts), it might take a while... | |
| [2025-09-06 08:40:14 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.33.mlp.experts), it might take a while... | |
| [2025-09-06 08:40:18 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.34.mlp.experts), it might take a while... | |
| [2025-09-06 08:40:21 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.35.mlp.experts), it might take a while... | |
| [2025-09-06 08:40:24 TP0] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=158.06 GB, mem usage=18.22 GB. | |
| [2025-09-06 08:40:31 TP2] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:40:31 TP1] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:40:31 TP0] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:40:31 TP0] Memory pool end. avail mem=10.23 GB | |
| [2025-09-06 08:40:31 TP3] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:40:32 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=9.54 GB | |
| [2025-09-06 08:40:32 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200] | |
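The captured batch-size list follows a dense-small / strided-large pattern. A short sketch that reproduces it; note the construction rule here is inferred from the logged list, not taken from sglang source:

```python
# Inferred rule: capture bs 1, 2, 4, then multiples of 8 up to --cuda-graph-max-bs (200 here)
cuda_graph_max_bs = 200
capture_bs = [1, 2, 4] + list(range(8, cuda_graph_max_bs + 1, 8))

print(len(capture_bs))  # 28 graphs, matching the 0/28 progress bar in the log
print(capture_bs)
```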
| Capturing batches (bs=200 avail_mem=9.39 GB): 0%| | 0/28 [00:00<?, ?it/s] | |
| rank 0 allocated ipc_handles: [['0x73dcf8000000', '0x73b734000000', '0x73b700000000', '0x73b6fc000000'], ['0x73b6fee00000', '0x73b6ff000000', '0x73b6ff200000', '0x73b6ff400000'], ['0x73b6f2000000', '0x73b6e8000000', '0x73b6de000000', '0x73b6d4000000']] | |
| [2025-09-06 08:40:34.384] [info] lamportInitialize start: buffer: 0x73b6f2000000, size: 71303168 | |
| rank 1 allocated ipc_handles: [['0x75bc64000000', '0x75e1fa000000', '0x75bc00000000', '0x75bbfc000000'], ['0x75bbff000000', '0x75bbfee00000', '0x75bbff200000', '0x75bbff400000'], ['0x75bbe8000000', '0x75bbf2000000', '0x75bbde000000', '0x75bbd4000000']] | |
| [2025-09-06 08:40:34.433] [info] lamportInitialize start: buffer: 0x75bbf2000000, size: 71303168 | |
| rank 2 allocated ipc_handles: [['0x74be10000000', '0x74bdac000000', '0x74e3a6000000', '0x74bda8000000'], ['0x74bdab000000', '0x74bdab200000', '0x74bdaae00000', '0x74bdab400000'], ['0x74bd94000000', '0x74bd8a000000', '0x74bd9e000000', '0x74bd80000000']] | |
| [2025-09-06 08:40:34.483] [info] lamportInitialize start: buffer: 0x74bd9e000000, size: 71303168 | |
| rank 3 allocated ipc_handles: [['0x795024000000', '0x794fc8000000', '0x794fc4000000', '0x7975c0000000'], ['0x794fc7000000', '0x794fc7200000', '0x794fc7400000', '0x794fc6e00000'], ['0x794fb0000000', '0x794fa6000000', '0x794f9c000000', '0x794fba000000']] | |
| [2025-09-06 08:40:34.533] [info] lamportInitialize start: buffer: 0x794fba000000, size: 71303168 | |
| [2025-09-06 08:40:34 TP0] FlashInfer workspace initialized for rank 0, world_size 4 | |
| [2025-09-06 08:40:34 TP1] FlashInfer workspace initialized for rank 1, world_size 4 | |
| [2025-09-06 08:40:34 TP3] FlashInfer workspace initialized for rank 3, world_size 4 | |
| [2025-09-06 08:40:34 TP2] FlashInfer workspace initialized for rank 2, world_size 4 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 0 workspace[0] 0x73dcf8000000 | |
| Rank 0 workspace[1] 0x73b734000000 | |
| Rank 0 workspace[2] 0x73b700000000 | |
| Rank 0 workspace[3] 0x73b6fc000000 | |
| Rank 0 workspace[4] 0x73b6fee00000 | |
| Rank 0 workspace[5] 0x73b6ff000000 | |
| Rank 0 workspace[6] 0x73b6ff200000 | |
| Rank 0 workspace[7] 0x73b6ff400000 | |
| Rank 0 workspace[8] 0x73b6f2000000 | |
| Rank 0 workspace[9] 0x73b6e8000000 | |
| Rank 0 workspace[10] 0x73b6de000000 | |
| Rank 0 workspace[11] 0x73b6d4000000 | |
| Rank 0 workspace[12] 0x73e2eb264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 3 workspace[0] 0x795024000000 | |
| Rank 3 workspace[1] 0x794fc8000000 | |
| Rank 3 workspace[2] 0x794fc4000000 | |
| Rank 3 workspace[3] 0x7975c0000000 | |
| Rank 3 workspace[4] 0x794fc7000000 | |
| Rank 3 workspace[5] 0x794fc7200000 | |
| Rank 3 workspace[6] 0x794fc7400000 | |
| Rank 3 workspace[7] 0x794fc6e00000 | |
| Rank 3 workspace[8] 0x794fb0000000 | |
| Rank 3 workspace[9] 0x794fa6000000 | |
| Rank 3 workspace[10] 0x794f9c000000 | |
| Rank 3 workspace[11] 0x794fba000000 | |
| Rank 3 workspace[12] 0x797bbd264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 1 workspace[0] 0x75bc64000000 | |
| Rank 1 workspace[1] 0x75e1fa000000 | |
| Rank 1 workspace[2] 0x75bc00000000 | |
| Rank 1 workspace[3] 0x75bbfc000000 | |
| Rank 1 workspace[4] 0x75bbff000000 | |
| Rank 1 workspace[5] 0x75bbfee00000 | |
| Rank 1 workspace[6] 0x75bbff200000 | |
| Rank 1 workspace[7] 0x75bbff400000 | |
| Rank 1 workspace[8] 0x75bbe8000000 | |
| Rank 1 workspace[9] 0x75bbf2000000 | |
| Rank 1 workspace[10] 0x75bbde000000 | |
| Rank 1 workspace[11] 0x75bbd4000000 | |
| Rank 1 workspace[12] 0x75e801264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 2 workspace[0] 0x74be10000000 | |
| Rank 2 workspace[1] 0x74bdac000000 | |
| Rank 2 workspace[2] 0x74e3a6000000 | |
| Rank 2 workspace[3] 0x74bda8000000 | |
| Rank 2 workspace[4] 0x74bdab000000 | |
| Rank 2 workspace[5] 0x74bdab200000 | |
| Rank 2 workspace[6] 0x74bdaae00000 | |
| Rank 2 workspace[7] 0x74bdab400000 | |
| Rank 2 workspace[8] 0x74bd94000000 | |
| Rank 2 workspace[9] 0x74bd8a000000 | |
| Rank 2 workspace[10] 0x74bd9e000000 | |
| Rank 2 workspace[11] 0x74bd80000000 | |
| Rank 2 workspace[12] 0x74e9b1264400 | |
| Capturing batches (bs=1 avail_mem=7.82 GB): 100%|██████████| 28/28 [00:04<00:00, 5.63it/s] | |
| [2025-09-06 08:40:37 TP0] Registering 56 cuda graph addresses | |
| [2025-09-06 08:40:37 TP1] Registering 56 cuda graph addresses | |
| [2025-09-06 08:40:37 TP2] Registering 56 cuda graph addresses | |
| [2025-09-06 08:40:37 TP3] Registering 56 cuda graph addresses | |
| [2025-09-06 08:40:37 TP0] Capture cuda graph end. Time elapsed: 5.51 s. mem usage=1.73 GB. avail mem=7.81 GB. | |
| [2025-09-06 08:40:38 TP0] max_total_num_tokens=8487040, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=131072, available_gpu_mem=7.81 GB | |
| [2025-09-06 08:40:39] INFO: Started server process [44552] | |
| [2025-09-06 08:40:39] INFO: Waiting for application startup. | |
| [2025-09-06 08:40:39] INFO: Application startup complete. | |
| [2025-09-06 08:40:39] INFO: Uvicorn running on http://127.0.0.1:8400 (Press CTRL+C to quit) | |
| [2025-09-06 08:40:40] INFO: 127.0.0.1:40126 - "GET /get_model_info HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:40 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:40:41] INFO: 127.0.0.1:40140 - "POST /generate HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:41] The server is fired up and ready to roll! | |
| [2025-09-06 08:40:46 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:40:47] INFO: 127.0.0.1:40142 - "GET /health_generate HTTP/1.1" 200 OK | |
| command=python3 -m sglang.launch_server --model-path /home/yiliu7/models/openai/gpt-oss-120b --tp 4 --cuda-graph-max-bs 200 --mem-fraction-static 0.93 --device cuda --host 127.0.0.1 --port 8400 | |
| Evaluation start: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 | |
| ChatCompletionSampler initialized with self.system_message=None self.temperature=0.1 self.max_tokens=4096 self.reasoning_effort='low' | |
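The requests logged below hit the server's OpenAI-compatible /v1/chat/completions endpoint. A minimal client sketch using the sampler settings shown above; the payload fields follow the standard Chat Completions schema, and passing `reasoning_effort` through the payload is an assumption based on the sampler config, not a verified sglang API:

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8400"  # host/port from the launch command above

def build_payload(question: str) -> dict:
    # Mirrors the ChatCompletionSampler settings logged above
    return {
        "model": "/home/yiliu7/models/openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.1,
        "max_tokens": 4096,
        "reasoning_effort": "low",
    }

def ask(question: str) -> str:
    req = urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(build_payload(question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires the server from this log to be running):
# print(ask("What is 2 + 2?"))
```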
| 0%| | 0/198 [00:00<?, ?it/s][2025-09-06 08:40:48 TP0] Prefill batch. #new-seq: 1, #new-token: 256, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:40:48 TP0] Prefill batch. #new-seq: 2, #new-token: 576, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0, | |
| [2025-09-06 08:40:48 TP0] Prefill batch. #new-seq: 14, #new-token: 4096, #cached-token: 896, token usage: 0.00, #running-req: 3, #queue-req: 0, | |
| [2025-09-06 08:40:48 TP0] Prefill batch. #new-seq: 15, #new-token: 4736, #cached-token: 960, token usage: 0.00, #running-req: 17, #queue-req: 0, | |
| [2025-09-06 08:40:49 TP0] Prefill batch. #new-seq: 26, #new-token: 7168, #cached-token: 1664, token usage: 0.00, #running-req: 32, #queue-req: 0, | |
| [2025-09-06 08:40:49 TP0] Prefill batch. #new-seq: 10, #new-token: 2560, #cached-token: 640, token usage: 0.00, #running-req: 58, #queue-req: 0, | |
| [2025-09-06 08:40:49 TP0] Prefill batch. #new-seq: 32, #new-token: 10752, #cached-token: 2048, token usage: 0.00, #running-req: 68, #queue-req: 0, | |
| [2025-09-06 08:40:49 TP0] Prefill batch. #new-seq: 58, #new-token: 15744, #cached-token: 3776, token usage: 0.00, #running-req: 100, #queue-req: 0, | |
| [2025-09-06 08:40:49 TP0] Prefill batch. #new-seq: 40, #new-token: 10944, #cached-token: 2560, token usage: 0.01, #running-req: 158, #queue-req: 0, | |
| [2025-09-06 08:40:49 TP0] Decode batch. #running-req: 198, #token: 62976, token usage: 0.01, cuda graph: True, gen throughput (token/s): 528.97, #queue-req: 0, | |
| [2025-09-06 08:40:49] INFO: 127.0.0.1:40254 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:49] INFO: 127.0.0.1:41524 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:49] INFO: 127.0.0.1:40968 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:50 TP0] Decode batch. #running-req: 195, #token: 69824, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17214.69, #queue-req: 0, | |
| [2025-09-06 08:40:50] INFO: 127.0.0.1:40724 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:50] INFO: 127.0.0.1:41188 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:50] INFO: 127.0.0.1:40482 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:50] INFO: 127.0.0.1:41328 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:50] INFO: 127.0.0.1:40108 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:50] INFO: 127.0.0.1:40540 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:50] INFO: 127.0.0.1:41066 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:50] INFO: 127.0.0.1:41182 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:50] INFO: 127.0.0.1:41704 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:50] INFO: 127.0.0.1:40556 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:50 TP0] Decode batch. #running-req: 185, #token: 71744, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17078.90, #queue-req: 0, | |
| [2025-09-06 08:40:50] INFO: 127.0.0.1:40956 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:50] INFO: 127.0.0.1:41430 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:50] INFO: 127.0.0.1:41060 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:50] INFO: 127.0.0.1:40940 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:50] INFO: 127.0.0.1:41668 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:51] INFO: 127.0.0.1:40338 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:51] INFO: 127.0.0.1:40398 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:51] INFO: 127.0.0.1:41196 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:51] INFO: 127.0.0.1:40726 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:51] INFO: 127.0.0.1:40640 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:51] INFO: 127.0.0.1:40672 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:51] INFO: 127.0.0.1:40566 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:51 TP0] Decode batch. #running-req: 173, #token: 73216, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16408.26, #queue-req: 0, | |
| [2025-09-06 08:40:51] INFO: 127.0.0.1:40208 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:51] INFO: 127.0.0.1:41490 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:51] INFO: 127.0.0.1:41310 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:51] INFO: 127.0.0.1:41750 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:51 TP0] Decode batch. #running-req: 169, #token: 77568, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15698.13, #queue-req: 0, | |
| [2025-09-06 08:40:51] INFO: 127.0.0.1:40138 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:51] INFO: 127.0.0.1:40592 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:51] INFO: 127.0.0.1:40986 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:51] INFO: 127.0.0.1:40498 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52 TP0] Decode batch. #running-req: 165, #token: 82432, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15224.52, #queue-req: 0, | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:41262 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:41790 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:40896 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:41772 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:40348 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:41142 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:40448 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:41292 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52 TP0] Decode batch. #running-req: 157, #token: 84416, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14434.46, #queue-req: 0, | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:41166 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:40706 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:41036 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:41092 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:40560 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:40888 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:41556 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:40296 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:40222 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:40794 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:41482 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:41374 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:41074 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:41438 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52 TP0] Decode batch. #running-req: 145, #token: 82560, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13510.77, #queue-req: 0, | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:40164 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:40230 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:52] INFO: 127.0.0.1:40730 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:41682 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:40386 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:40666 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:41362 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:40080 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53 TP0] Decode batch. #running-req: 135, #token: 83392, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12758.89, #queue-req: 0, | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:40804 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:40756 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:40272 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:41734 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:40360 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:41616 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:41574 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:40824 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:41346 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:41644 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:41434 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:41402 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:40328 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53 TP0] Decode batch. #running-req: 122, #token: 80704, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14211.91, #queue-req: 0, | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:41604 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:41830 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:40104 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:41124 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:41330 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:40268 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:40408 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:40438 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:41352 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:53] INFO: 127.0.0.1:40148 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54 TP0] Decode batch. #running-req: 112, #token: 78400, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15988.90, #queue-req: 0, | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:41502 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:40480 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:40248 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:41136 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54 TP0] Decode batch. #running-req: 108, #token: 79616, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15405.37, #queue-req: 0, | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:40916 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:40624 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:40292 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:40648 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:40280 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:41560 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:40608 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:41510 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:40830 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:40524 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:40720 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:41508 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:41652 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:40580 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54 TP0] Decode batch. #running-req: 94, #token: 73088, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14289.78, #queue-req: 0, | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:40442 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:41518 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:40520 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:40928 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:41278 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:40270 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:40374 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:40778 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:41824 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54 TP0] Decode batch. #running-req: 85, #token: 69760, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13354.62, #queue-req: 0, | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:41598 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:54] INFO: 127.0.0.1:41760 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:41466 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:41584 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55 TP0] Decode batch. #running-req: 81, #token: 68416, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12492.43, #queue-req: 0, | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:41446 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:41544 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:41158 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:40980 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:40678 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:40772 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:40950 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:41210 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:41308 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55 TP0] Decode batch. #running-req: 72, #token: 63872, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12045.72, #queue-req: 0, | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:40844 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:41020 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:40656 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:41324 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:41630 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:41700 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:40124 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:40904 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55 TP0] Decode batch. #running-req: 64, #token: 60288, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10692.53, #queue-req: 0, | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:40746 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:41462 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:41572 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:40858 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:41636 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:40378 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55 TP0] Decode batch. #running-req: 58, #token: 56704, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10220.62, #queue-req: 0, | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:41732 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:40816 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:41690 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:55] INFO: 127.0.0.1:40096 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:41414 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56 TP0] Decode batch. #running-req: 53, #token: 54080, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9137.63, #queue-req: 0, | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:40880 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:41740 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:40300 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:41696 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56 TP0] Decode batch. #running-req: 49, #token: 51968, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8730.68, #queue-req: 0, | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:40820 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:41778 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:40462 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:41004 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:41098 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:40290 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56 TP0] Decode batch. #running-req: 43, #token: 47680, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8114.26, #queue-req: 0, | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:41264 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:41002 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:41122 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:40316 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:41422 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:41386 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56 TP0] Decode batch. #running-req: 37, #token: 41472, token usage: 0.00, cuda graph: True, gen throughput (token/s): 7218.83, #queue-req: 0, | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:41202 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:41108 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:40532 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:40988 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:41350 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:41260 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:40204 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:56] INFO: 127.0.0.1:41088 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:57] INFO: 127.0.0.1:41132 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:57 TP0] Decode batch. #running-req: 28, #token: 33536, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5858.74, #queue-req: 0, | |
| [2025-09-06 08:40:57] INFO: 127.0.0.1:40872 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:57] INFO: 127.0.0.1:40242 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:57] INFO: 127.0.0.1:41528 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:57 TP0] Decode batch. #running-req: 25, #token: 31168, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4823.55, #queue-req: 0, | |
| [2025-09-06 08:40:57] INFO: 127.0.0.1:41244 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:57] INFO: 127.0.0.1:40194 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:57] INFO: 127.0.0.1:41670 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:57] INFO: 127.0.0.1:40506 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:57] INFO: 127.0.0.1:40476 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:57 TP0] Decode batch. #running-req: 20, #token: 25152, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4278.43, #queue-req: 0, | |
| [2025-09-06 08:40:57 TP0] Decode batch. #running-req: 20, #token: 25920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3898.73, #queue-req: 0, | |
| [2025-09-06 08:40:57] INFO: 127.0.0.1:41712 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:57] INFO: 127.0.0.1:40596 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:57 TP0] Decode batch. #running-req: 18, #token: 24192, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3718.50, #queue-req: 0, | |
| [2025-09-06 08:40:58 TP0] Decode batch. #running-req: 18, #token: 24640, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3511.95, #queue-req: 0, | |
| [2025-09-06 08:40:58] INFO: 127.0.0.1:40424 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:58] INFO: 127.0.0.1:41720 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:58] INFO: 127.0.0.1:41236 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:58 TP0] Decode batch. #running-req: 15, #token: 21184, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3166.87, #queue-req: 0, | |
| [2025-09-06 08:40:58] INFO: 127.0.0.1:40788 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:58] INFO: 127.0.0.1:40350 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:58] INFO: 127.0.0.1:41804 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:58] INFO: 127.0.0.1:41222 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:58] INFO: 127.0.0.1:41460 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 1%| | 1/198 [00:09<31:52, 9.71s/it] | |
| [2025-09-06 08:40:58] INFO: 127.0.0.1:41050 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:58 TP0] Decode batch. #running-req: 9, #token: 13056, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2633.08, #queue-req: 0, | |
| [2025-09-06 08:40:58] INFO: 127.0.0.1:40512 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:58 TP0] Decode batch. #running-req: 8, #token: 11776, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1903.68, #queue-req: 0, | |
| [2025-09-06 08:40:58] INFO: 127.0.0.1:41822 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:58] INFO: 127.0.0.1:40694 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:58 TP0] Decode batch. #running-req: 6, #token: 9152, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1819.86, #queue-req: 0, | |
| [2025-09-06 08:40:58] INFO: 127.0.0.1:40156 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:58] INFO: 127.0.0.1:40370 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:58 TP0] Decode batch. #running-req: 4, #token: 6144, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1302.25, #queue-req: 0, | |
| [2025-09-06 08:40:59 TP0] Decode batch. #running-req: 4, #token: 6336, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1056.97, #queue-req: 0, | |
| [2025-09-06 08:40:59] INFO: 127.0.0.1:41820 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 23%|██▎ | 45/198 [00:10<00:26, 5.87it/s] | |
| [2025-09-06 08:40:59 TP0] Decode batch. #running-req: 3, #token: 4928, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1005.99, #queue-req: 0, | |
| [2025-09-06 08:40:59 TP0] Decode batch. #running-req: 3, #token: 4992, token usage: 0.00, cuda graph: True, gen throughput (token/s): 799.03, #queue-req: 0, | |
| [2025-09-06 08:40:59] INFO: 127.0.0.1:40680 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:59 TP0] Decode batch. #running-req: 2, #token: 3456, token usage: 0.00, cuda graph: True, gen throughput (token/s): 716.67, #queue-req: 0, | |
| [2025-09-06 08:40:59] INFO: 127.0.0.1:40642 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:40:59 TP0] Decode batch. #running-req: 1, #token: 1792, token usage: 0.00, cuda graph: True, gen throughput (token/s): 546.76, #queue-req: 0, | |
| [2025-09-06 08:40:59 TP0] Decode batch. #running-req: 1, #token: 1792, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.99, #queue-req: 0, | |
| [2025-09-06 08:40:59 TP0] Decode batch. #running-req: 1, #token: 1856, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.77, #queue-req: 0, | |
| [2025-09-06 08:41:00 TP0] Decode batch. #running-req: 1, #token: 1856, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.65, #queue-req: 0, | |
| [2025-09-06 08:41:00 TP0] Decode batch. #running-req: 1, #token: 1920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.68, #queue-req: 0, | |
| [2025-09-06 08:41:00 TP0] Decode batch. #running-req: 1, #token: 1984, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.19, #queue-req: 0, | |
| [2025-09-06 08:41:00 TP0] Decode batch. #running-req: 1, #token: 1984, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.51, #queue-req: 0, | |
| [2025-09-06 08:41:00 TP0] Decode batch. #running-req: 1, #token: 2048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.70, #queue-req: 0, | |
| [2025-09-06 08:41:00 TP0] Decode batch. #running-req: 1, #token: 2112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.81, #queue-req: 0, | |
| [2025-09-06 08:41:00 TP0] Decode batch. #running-req: 1, #token: 2112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.76, #queue-req: 0, | |
| [2025-09-06 08:41:00 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.45, #queue-req: 0, | |
| [2025-09-06 08:41:01 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.66, #queue-req: 0, | |
| [2025-09-06 08:41:01 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.81, #queue-req: 0, | |
| [2025-09-06 08:41:01 TP0] Decode batch. #running-req: 1, #token: 2304, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.68, #queue-req: 0, | |
| [2025-09-06 08:41:01 TP0] Decode batch. #running-req: 1, #token: 2304, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.63, #queue-req: 0, | |
| [2025-09-06 08:41:01 TP0] Decode batch. #running-req: 1, #token: 2368, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.50, #queue-req: 0, | |
| [2025-09-06 08:41:01 TP0] Decode batch. #running-req: 1, #token: 2432, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.94, #queue-req: 0, | |
| [2025-09-06 08:41:01 TP0] Decode batch. #running-req: 1, #token: 2432, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.84, #queue-req: 0, | |
| [2025-09-06 08:41:01 TP0] Decode batch. #running-req: 1, #token: 2496, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.78, #queue-req: 0, | |
| [2025-09-06 08:41:02 TP0] Decode batch. #running-req: 1, #token: 2496, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.12, #queue-req: 0, | |
| [2025-09-06 08:41:02 TP0] Decode batch. #running-req: 1, #token: 2560, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.85, #queue-req: 0, | |
| [2025-09-06 08:41:02] INFO: 127.0.0.1:40180 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 28%|██▊ | 56/198 [00:13<00:27, 5.09it/s] | |
| 100%|██████████| 198/198 [00:13<00:00, 14.66it/s] | |
| /usr/lib/python3.12/subprocess.py:1127: ResourceWarning: subprocess 44552 is still running | |
| _warn("subprocess %s is still running" % self.pid, | |
| ResourceWarning: Enable tracemalloc to get the object allocation traceback | |
| E | |
| ====================================================================== | |
| ERROR: test_mxfp4_120b (__main__.TestGptOss4Gpu.test_mxfp4_120b) | |
| ---------------------------------------------------------------------- | |
| Traceback (most recent call last): | |
| File "/home/yiliu7/sglang/python/sglang/srt/utils.py", line 2187, in retry | |
| return fn() | |
| ^^^^ | |
| File "/home/yiliu7/sglang/python/sglang/test/test_utils.py", line 1396, in <lambda> | |
| lambda: super(CustomTestCase, self)._callTestMethod(method), | |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
| AssertionError: 0.5707070707070707 not greater than or equal to 0.6 | |
| During handling of the above exception, another exception occurred: | |
| Traceback (most recent call last): | |
| File "/home/yiliu7/sglang/python/sglang/test/test_utils.py", line 1395, in _callTestMethod | |
| retry( | |
| File "/home/yiliu7/sglang/python/sglang/srt/utils.py", line 2190, in retry | |
| raise Exception(f"retry() exceed maximum number of retries.") | |
| Exception: retry() exceed maximum number of retries. | |
| ---------------------------------------------------------------------- | |
| Ran 1 test in 178.092s | |
| FAILED (errors=1) | |
| Writing report to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.html | |
| {'chars': 1742.560606060606, 'chars:std': 1075.4861263368614, 'score:std': 0.4949752621616814, 'score': 0.5707070707070707} | |
| Writing results to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.json | |
| Total latency: 13.566 s | |
| Score: 0.571 | |
| Evaluation end: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 metrics={'chars': 1742.560606060606, 'chars:std': 1075.4861263368614, 'score:std': 0.4949752621616814, 'score': 0.5707070707070707} | |
| ================================================================================ | |
| Run 6: | |
| Auto-configed device: cuda | |
| WARNING:sglang.srt.server_args:Detected SM100 and MXFP4 quantization format for GPT-OSS model, enabling FlashInfer MXFP4 MOE kernel. | |
| WARNING:sglang.srt.server_args:TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64. | |
| [2025-09-06 08:41:17] server_args=ServerArgs(model_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8400, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.93, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=True, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=88006138, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/home/yiliu7/models/openai/gpt-oss-120b', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='flashinfer_mxfp4', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=200, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False) | |
| All deep_gemm operations loaded successfully! | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:41:17] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:41:17] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:41:18] Using default HuggingFace chat template with detected content format: string | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:41:24 TP0] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:41:24 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:41:24 TP2] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:41:24 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:41:24 TP3] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:41:24 TP1] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:41:24 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:41:24 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:41:25 TP0] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:41:25 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:41:25 TP0] Init torch distributed begin. | |
| [2025-09-06 08:41:25 TP2] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:41:25 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:41:25 TP1] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:41:25 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:41:25 TP3] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:41:25 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [2025-09-06 08:41:26 TP0] sglang is using nccl==2.27.3 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [2025-09-06 08:41:29 TP0] Init torch distributed ends. mem usage=1.46 GB | |
| [2025-09-06 08:41:29 TP0] Load weight begin. avail mem=176.28 GB | |
| All deep_gemm operations loaded successfully! | |
| Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s] | |
| Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 1477.42it/s] | |
| All deep_gemm operations loaded successfully! | |
| [2025-09-06 08:41:39 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.0.mlp.experts), it might take a while... | |
| All deep_gemm operations loaded successfully! | |
| All deep_gemm operations loaded successfully! | |
| [2025-09-06 08:41:42 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.1.mlp.experts), it might take a while... | |
| [2025-09-06 08:41:45 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.2.mlp.experts), it might take a while... | |
| [2025-09-06 08:41:48 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.3.mlp.experts), it might take a while... | |
| [2025-09-06 08:41:51 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.4.mlp.experts), it might take a while... | |
| [2025-09-06 08:41:55 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.5.mlp.experts), it might take a while... | |
| [2025-09-06 08:41:58 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.6.mlp.experts), it might take a while... | |
| [2025-09-06 08:42:01 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.7.mlp.experts), it might take a while... | |
| [2025-09-06 08:42:04 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.8.mlp.experts), it might take a while... | |
| [2025-09-06 08:42:07 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.9.mlp.experts), it might take a while... | |
| [2025-09-06 08:42:10 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.10.mlp.experts), it might take a while... | |
| [2025-09-06 08:42:13 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.11.mlp.experts), it might take a while... | |
| [2025-09-06 08:42:16 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.12.mlp.experts), it might take a while... | |
| [2025-09-06 08:42:19 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.13.mlp.experts), it might take a while... | |
| [2025-09-06 08:42:22 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.14.mlp.experts), it might take a while... | |
| [2025-09-06 08:42:25 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.15.mlp.experts), it might take a while... | |
| [2025-09-06 08:42:28 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.16.mlp.experts), it might take a while... | |
| [2025-09-06 08:42:31 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.17.mlp.experts), it might take a while... | |
| [2025-09-06 08:42:34 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.18.mlp.experts), it might take a while... | |
| [2025-09-06 08:42:38 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.19.mlp.experts), it might take a while... | |
| [2025-09-06 08:42:41 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.20.mlp.experts), it might take a while... | |
| [2025-09-06 08:42:44 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.21.mlp.experts), it might take a while... | |
| [2025-09-06 08:42:47 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.22.mlp.experts), it might take a while... | |
| [2025-09-06 08:42:50 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.23.mlp.experts), it might take a while... | |
| [2025-09-06 08:42:53 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.24.mlp.experts), it might take a while... | |
| [2025-09-06 08:42:56 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.25.mlp.experts), it might take a while... | |
| [2025-09-06 08:42:59 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.26.mlp.experts), it might take a while... | |
| [2025-09-06 08:43:02 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.27.mlp.experts), it might take a while... | |
| [2025-09-06 08:43:05 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.28.mlp.experts), it might take a while... | |
| [2025-09-06 08:43:08 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.29.mlp.experts), it might take a while... | |
| [2025-09-06 08:43:11 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.30.mlp.experts), it might take a while... | |
| [2025-09-06 08:43:14 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.31.mlp.experts), it might take a while... | |
| [2025-09-06 08:43:18 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.32.mlp.experts), it might take a while... | |
| [2025-09-06 08:43:21 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.33.mlp.experts), it might take a while... | |
| [2025-09-06 08:43:24 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.34.mlp.experts), it might take a while... | |
| [2025-09-06 08:43:27 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.35.mlp.experts), it might take a while... | |
| [2025-09-06 08:43:30 TP0] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=158.06 GB, mem usage=18.22 GB. | |
| [2025-09-06 08:43:36 TP2] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:43:36 TP0] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:43:36 TP0] Memory pool end. avail mem=10.23 GB | |
| [2025-09-06 08:43:36 TP3] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:43:36 TP1] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
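A rough sanity check of the logged K-cache size. The layer/head counts below are assumptions about gpt-oss-120b (36 layers, 8 KV heads, head_dim 64; none of these are printed in this log), with bf16 KV entries and the KV heads sharded across tp_size=4:

```python
# Back-of-envelope: per-rank K-cache bytes = tokens * layers * (kv_heads / tp) * head_dim * 2 (bf16)
tokens = 8_487_040                                # "#tokens" from the log above
layers, kv_heads, head_dim, dtype_bytes, tp = 36, 8, 64, 2, 4  # assumed model config
k_bytes = tokens * layers * (kv_heads // tp) * head_dim * dtype_bytes
print(round(k_bytes / 2**30, 2))                  # close to the logged "K size: 72.85 GB"
```

Small differences from the logged value are expected from page-size padding and rounding in the allocator's own accounting.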
| [2025-09-06 08:43:36 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=9.54 GB | |
| [2025-09-06 08:43:36 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200] | |
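The 28 capture sizes above appear to follow a simple pattern (an assumption about SGLang's schedule, not confirmed by this log alone): powers of two up to 8, then steps of 8 up to `--cuda-graph-max-bs 200`:

```python
# Reproduce the logged cuda-graph batch-size list (assumed generation rule).
max_bs = 200                                       # from --cuda-graph-max-bs
sizes = [1, 2, 4, 8] + list(range(16, max_bs + 1, 8))
print(len(sizes), sizes[:6], sizes[-1])            # -> 28 [1, 2, 4, 8, 16, 24] 200
```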
| Capturing batches (bs=200 avail_mem=9.39 GB): 0% | 0/28 | |
| rank 1 allocated ipc_handles: [['0x7ecb5c000000', '0x7ef116000000', '0x7ecb18000000', '0x7ecb14000000'], ['0x7ecb17000000', '0x7ecb16e00000', '0x7ecb17200000', '0x7ecb17400000'], ['0x7ecb00000000', '0x7ecb0a000000', '0x7ecaf6000000', '0x7ecaec000000']] | |
| [2025-09-06 08:43:38.924] [info] lamportInitialize start: buffer: 0x7ecb0a000000, size: 71303168 | |
| rank 3 allocated ipc_handles: [['0x7cbe7c000000', '0x7cbe42000000', '0x7cbe3e000000', '0x7ce43c000000'], ['0x7cbe41000000', '0x7cbe41200000', '0x7cbe41400000', '0x7cbe40e00000'], ['0x7cbe2a000000', '0x7cbe20000000', '0x7cbe16000000', '0x7cbe34000000']] | |
| [2025-09-06 08:43:38.972] [info] lamportInitialize start: buffer: 0x7cbe34000000, size: 71303168 | |
| rank 0 allocated ipc_handles: [['0x7bf8d2000000', '0x7bd316000000', '0x7bd2e4000000', '0x7bd2e0000000'], ['0x7bd2e2e00000', '0x7bd2e3000000', '0x7bd2e3200000', '0x7bd2e3400000'], ['0x7bd2d6000000', '0x7bd2cc000000', '0x7bd2c2000000', '0x7bd2b8000000']] | |
| [2025-09-06 08:43:39.021] [info] lamportInitialize start: buffer: 0x7bd2d6000000, size: 71303168 | |
| rank 2 allocated ipc_handles: [['0x72c838000000', '0x72c834000000', '0x72ee34000000', '0x72c830000000'], ['0x72c833000000', '0x72c833200000', '0x72c832e00000', '0x72c833400000'], ['0x72c81c000000', '0x72c812000000', '0x72c826000000', '0x72c808000000']] | |
| [2025-09-06 08:43:39.071] [info] lamportInitialize start: buffer: 0x72c826000000, size: 71303168 | |
| [2025-09-06 08:43:39 TP0] FlashInfer workspace initialized for rank 0, world_size 4 | |
| [2025-09-06 08:43:39 TP1] FlashInfer workspace initialized for rank 1, world_size 4 | |
| [2025-09-06 08:43:39 TP2] FlashInfer workspace initialized for rank 2, world_size 4 | |
| [2025-09-06 08:43:39 TP3] FlashInfer workspace initialized for rank 3, world_size 4 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 0 workspace[0] 0x7bf8d2000000 | |
| Rank 0 workspace[1] 0x7bd316000000 | |
| Rank 0 workspace[2] 0x7bd2e4000000 | |
| Rank 0 workspace[3] 0x7bd2e0000000 | |
| Rank 0 workspace[4] 0x7bd2e2e00000 | |
| Rank 0 workspace[5] 0x7bd2e3000000 | |
| Rank 0 workspace[6] 0x7bd2e3200000 | |
| Rank 0 workspace[7] 0x7bd2e3400000 | |
| Rank 0 workspace[8] 0x7bd2d6000000 | |
| Rank 0 workspace[9] 0x7bd2cc000000 | |
| Rank 0 workspace[10] 0x7bd2c2000000 | |
| Rank 0 workspace[11] 0x7bd2b8000000 | |
| Rank 0 workspace[12] 0x7bfec7264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 3 workspace[0] 0x7cbe7c000000 | |
| Rank 3 workspace[1] 0x7cbe42000000 | |
| Rank 3 workspace[2] 0x7cbe3e000000 | |
| Rank 3 workspace[3] 0x7ce43c000000 | |
| Rank 3 workspace[4] 0x7cbe41000000 | |
| Rank 3 workspace[5] 0x7cbe41200000 | |
| Rank 3 workspace[6] 0x7cbe41400000 | |
| Rank 3 workspace[7] 0x7cbe40e00000 | |
| Rank 3 workspace[8] 0x7cbe2a000000 | |
| Rank 3 workspace[9] 0x7cbe20000000 | |
| Rank 3 workspace[10] 0x7cbe16000000 | |
| Rank 3 workspace[11] 0x7cbe34000000 | |
| Rank 3 workspace[12] 0x7cea3b264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 2 workspace[0] 0x72c838000000 | |
| Rank 2 workspace[1] 0x72c834000000 | |
| Rank 2 workspace[2] 0x72ee34000000 | |
| Rank 2 workspace[3] 0x72c830000000 | |
| Rank 2 workspace[4] 0x72c833000000 | |
| Rank 2 workspace[5] 0x72c833200000 | |
| Rank 2 workspace[6] 0x72c832e00000 | |
| Rank 2 workspace[7] 0x72c833400000 | |
| Rank 2 workspace[8] 0x72c81c000000 | |
| Rank 2 workspace[9] 0x72c812000000 | |
| Rank 2 workspace[10] 0x72c826000000 | |
| Rank 2 workspace[11] 0x72c808000000 | |
| Rank 2 workspace[12] 0x72f441264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 1 workspace[0] 0x7ecb5c000000 | |
| Rank 1 workspace[1] 0x7ef116000000 | |
| Rank 1 workspace[2] 0x7ecb18000000 | |
| Rank 1 workspace[3] 0x7ecb14000000 | |
| Rank 1 workspace[4] 0x7ecb17000000 | |
| Rank 1 workspace[5] 0x7ecb16e00000 | |
| Rank 1 workspace[6] 0x7ecb17200000 | |
| Rank 1 workspace[7] 0x7ecb17400000 | |
| Rank 1 workspace[8] 0x7ecb00000000 | |
| Rank 1 workspace[9] 0x7ecb0a000000 | |
| Rank 1 workspace[10] 0x7ecaf6000000 | |
| Rank 1 workspace[11] 0x7ecaec000000 | |
| Rank 1 workspace[12] 0x7ef71f264400 | |
| Capturing batches (bs=200 avail_mem=9.39 GB): 4% | 1/28 [00:02<00:57, 2.12s/it] ... Capturing batches (bs=1 avail_mem=7.82 GB): 100%|██████████| 28/28 [00:04<00:00, 6.13it/s] | |
| [2025-09-06 08:43:41 TP3] Registering 56 cuda graph addresses | |
| [2025-09-06 08:43:41 TP0] Registering 56 cuda graph addresses | |
| [2025-09-06 08:43:41 TP2] Registering 56 cuda graph addresses | |
| [2025-09-06 08:43:41 TP1] Registering 56 cuda graph addresses | |
| [2025-09-06 08:43:41 TP0] Capture cuda graph end. Time elapsed: 5.10 s. mem usage=1.73 GB. avail mem=7.81 GB. | |
| [2025-09-06 08:43:42 TP0] max_total_num_tokens=8487040, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=131072, available_gpu_mem=7.81 GB | |
| [2025-09-06 08:43:43] INFO: Started server process [47053] | |
| [2025-09-06 08:43:43] INFO: Waiting for application startup. | |
| [2025-09-06 08:43:43] INFO: Application startup complete. | |
| [2025-09-06 08:43:43] INFO: Uvicorn running on http://127.0.0.1:8400 (Press CTRL+C to quit) | |
| [2025-09-06 08:43:44] INFO: 127.0.0.1:58318 - "GET /get_model_info HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:44 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:43:45] INFO: 127.0.0.1:58322 - "POST /generate HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:45] The server is fired up and ready to roll! | |
| [2025-09-06 08:43:51 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:43:52] INFO: 127.0.0.1:52732 - "GET /health_generate HTTP/1.1" 200 OK | |
| command=python3 -m sglang.launch_server --model-path /home/yiliu7/models/openai/gpt-oss-120b --tp 4 --cuda-graph-max-bs 200 --mem-fraction-static 0.93 --device cuda --host 127.0.0.1 --port 8400 | |
| Evaluation start: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 | |
| ChatCompletionSampler initialized with self.system_message=None self.temperature=0.1 self.max_tokens=4096 self.reasoning_effort='low' | |
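The requests that follow can be reconstructed from the sampler settings logged above (temperature=0.1, max_tokens=4096, reasoning_effort='low'). A hypothetical payload sketch; the `reasoning_effort` field name is an assumption about what SGLang's `/v1/chat/completions` endpoint accepts for this model, and the prompt is illustrative:

```python
import json

# Sketch of one eval request against http://127.0.0.1:8400/v1/chat/completions
payload = {
    "model": "/home/yiliu7/models/openai/gpt-oss-120b",  # served_model_name from the log
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],  # illustrative prompt
    "temperature": 0.1,
    "max_tokens": 4096,
    "reasoning_effort": "low",   # assumed field name, mirroring the sampler config
}
print(json.dumps(payload)[:30])
```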
| 0%| | 0/198 [00:00<?, ?it/s][2025-09-06 08:43:53 TP0] Prefill batch. #new-seq: 1, #new-token: 320, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:43:53 TP0] Prefill batch. #new-seq: 3, #new-token: 768, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0, | |
| [2025-09-06 08:43:53 TP0] Prefill batch. #new-seq: 4, #new-token: 1152, #cached-token: 256, token usage: 0.00, #running-req: 4, #queue-req: 0, | |
| [2025-09-06 08:43:53 TP0] Prefill batch. #new-seq: 18, #new-token: 5568, #cached-token: 1152, token usage: 0.00, #running-req: 8, #queue-req: 0, | |
| [2025-09-06 08:43:53 TP0] Prefill batch. #new-seq: 18, #new-token: 5120, #cached-token: 1152, token usage: 0.00, #running-req: 26, #queue-req: 0, | |
| [2025-09-06 08:43:53 TP0] Prefill batch. #new-seq: 30, #new-token: 7872, #cached-token: 1920, token usage: 0.00, #running-req: 44, #queue-req: 0, | |
| [2025-09-06 08:43:53 TP0] Prefill batch. #new-seq: 24, #new-token: 8320, #cached-token: 1536, token usage: 0.00, #running-req: 74, #queue-req: 0, | |
| [2025-09-06 08:43:53 TP0] Prefill batch. #new-seq: 24, #new-token: 6592, #cached-token: 1536, token usage: 0.00, #running-req: 98, #queue-req: 0, | |
| [2025-09-06 08:43:54 TP0] Prefill batch. #new-seq: 33, #new-token: 8896, #cached-token: 2176, token usage: 0.00, #running-req: 122, #queue-req: 0, | |
| [2025-09-06 08:43:54 TP0] Prefill batch. #new-seq: 43, #new-token: 12224, #cached-token: 2816, token usage: 0.01, #running-req: 155, #queue-req: 0, | |
| [2025-09-06 08:43:54 TP0] Decode batch. #running-req: 198, #token: 62848, token usage: 0.01, cuda graph: True, gen throughput (token/s): 487.94, #queue-req: 0, | |
| [2025-09-06 08:43:54] INFO: 127.0.0.1:52948 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:54] INFO: 127.0.0.1:52900 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:54] INFO: 127.0.0.1:53670 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:55] INFO: 127.0.0.1:53800 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:55] INFO: 127.0.0.1:54298 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:55 TP0] Decode batch. #running-req: 194, #token: 68352, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17172.79, #queue-req: 0, | |
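The decode-batch line above gives aggregate generation throughput; dividing by the number of running requests gives a rough per-request decode speed (assuming, as seems likely, that the logged figure is summed across the batch):

```python
# Per-request decode speed from the logged batch: 17172.79 tok/s over 194 requests.
per_req = 17172.79 / 194
print(round(per_req, 1))  # -> 88.5 tokens/s per request
```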
| [2025-09-06 08:43:55] INFO: 127.0.0.1:52808 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:55] INFO: 127.0.0.1:53238 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:55] INFO: 127.0.0.1:52874 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:55] INFO: 127.0.0.1:53274 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:55] INFO: 127.0.0.1:54182 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:55] INFO: 127.0.0.1:53978 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:55] INFO: 127.0.0.1:53768 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:55 TP0] Decode batch. #running-req: 186, #token: 73792, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16945.77, #queue-req: 0, | |
| [2025-09-06 08:43:55] INFO: 127.0.0.1:54494 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:55] INFO: 127.0.0.1:53900 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:55] INFO: 127.0.0.1:53894 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:55] INFO: 127.0.0.1:53452 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:55] INFO: 127.0.0.1:53336 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:55] INFO: 127.0.0.1:53294 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:55] INFO: 127.0.0.1:53440 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:55] INFO: 127.0.0.1:53628 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:55] INFO: 127.0.0.1:53856 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:55] INFO: 127.0.0.1:53178 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:56 TP0] Decode batch. #running-req: 176, #token: 74048, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16353.88, #queue-req: 0, | |
| [2025-09-06 08:43:56] INFO: 127.0.0.1:54048 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:56] INFO: 127.0.0.1:53040 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:56] INFO: 127.0.0.1:54504 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:56] INFO: 127.0.0.1:54042 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:56] INFO: 127.0.0.1:53108 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:56] INFO: 127.0.0.1:52794 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:56] INFO: 127.0.0.1:53198 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:56 TP0] Decode batch. #running-req: 169, #token: 77760, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15971.45, #queue-req: 0, | |
| [2025-09-06 08:43:56] INFO: 127.0.0.1:52958 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:56] INFO: 127.0.0.1:53600 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:56] INFO: 127.0.0.1:53476 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:56] INFO: 127.0.0.1:53986 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:56] INFO: 127.0.0.1:54524 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:56] INFO: 127.0.0.1:53698 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:56] INFO: 127.0.0.1:53606 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:56 TP0] Decode batch. #running-req: 162, #token: 81472, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14919.82, #queue-req: 0, | |
| [2025-09-06 08:43:57] INFO: 127.0.0.1:53120 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:57] INFO: 127.0.0.1:54530 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:57] INFO: 127.0.0.1:54378 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:57] INFO: 127.0.0.1:53050 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:57] INFO: 127.0.0.1:53356 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:57] INFO: 127.0.0.1:53162 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:57] INFO: 127.0.0.1:53864 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:57] INFO: 127.0.0.1:52768 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:57] INFO: 127.0.0.1:52924 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:57 TP0] Decode batch. #running-req: 154, #token: 82560, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14149.30, #queue-req: 0, | |
| [2025-09-06 08:43:57] INFO: 127.0.0.1:53884 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:57] INFO: 127.0.0.1:52908 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:57] INFO: 127.0.0.1:54242 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:57] INFO: 127.0.0.1:54326 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:57] INFO: 127.0.0.1:53520 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:57] INFO: 127.0.0.1:53540 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:57] INFO: 127.0.0.1:53378 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:57] INFO: 127.0.0.1:53278 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:57] INFO: 127.0.0.1:54328 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:57] INFO: 127.0.0.1:52810 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:57] INFO: 127.0.0.1:53756 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:57 TP0] Decode batch. #running-req: 142, #token: 83264, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13411.74, #queue-req: 0, | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:52864 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:53172 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:53780 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:54050 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:54212 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58 TP0] Decode batch. #running-req: 137, #token: 85312, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12996.31, #queue-req: 0, | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:54372 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:54190 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:53426 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:53958 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:53516 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:53652 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:52766 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58 TP0] Decode batch. #running-req: 130, #token: 86720, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12626.29, #queue-req: 0, | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:54126 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:53928 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:53792 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:53636 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:53126 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:53066 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:53862 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:54260 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:54404 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:53966 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:53096 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:53506 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:58] INFO: 127.0.0.1:54096 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59 TP0] Decode batch. #running-req: 118, #token: 83264, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15464.60, #queue-req: 0, | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:53626 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:52788 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:52896 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:54354 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:53026 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:54226 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:53660 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:54290 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:54488 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:53322 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:53454 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:52840 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:54280 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:54432 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59 TP0] Decode batch. #running-req: 103, #token: 76160, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15313.38, #queue-req: 0, | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:52744 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:54454 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:53218 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:54172 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:54124 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:54274 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:53152 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:53312 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:53212 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:54156 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:54142 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:53342 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59 TP0] Decode batch. #running-req: 91, #token: 71104, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13987.12, #queue-req: 0, | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:53504 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:53082 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:54314 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:53380 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:53024 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:53432 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:54262 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59 TP0] Decode batch. #running-req: 84, #token: 68288, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13118.97, #queue-req: 0, | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:54442 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:53556 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:54344 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:54414 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:43:59] INFO: 127.0.0.1:52982 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:52970 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:53122 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00 TP0] Decode batch. #running-req: 77, #token: 65984, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12342.54, #queue-req: 0, | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:53142 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:54370 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:53258 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:53062 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:54386 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:54250 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00 TP0] Decode batch. #running-req: 71, #token: 63680, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11621.65, #queue-req: 0, | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:53462 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:53836 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:53720 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:54206 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:53014 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:54400 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:54512 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:53186 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:52964 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:53832 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:53306 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:53682 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00 TP0] Decode batch. #running-req: 59, #token: 55040, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10315.34, #queue-req: 0, | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:53580 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:53442 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:54478 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:54526 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:54066 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:53250 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00 TP0] Decode batch. #running-req: 54, #token: 51520, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9487.70, #queue-req: 0, | |
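A side note on the decode-batch lines: every `#token` value in this log (83264, 85312, 51520, ...) is a multiple of 64, consistent with the `page_size=64` forced by the TRT-LLM MHA backend; the KV cache appears to be accounted in whole 64-token pages per request. An illustrative helper (not SGLang's actual code) showing why page-rounded totals always land on a 64 boundary:

```python
PAGE_SIZE = 64  # forced by the TRT-LLM MHA attention backend in this run

def kv_tokens(seq_lens, page_size=PAGE_SIZE):
    """Total KV-cache tokens when each sequence is rounded up to whole pages."""
    return sum(-(-n // page_size) * page_size for n in seq_lens)

# Three in-flight requests of uneven length still yield a multiple of 64:
total = kv_tokens([100, 700, 513])  # 128 + 704 + 576
print(total)              # 1408
print(total % PAGE_SIZE)  # 0
```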
| [2025-09-06 08:44:00] INFO: 127.0.0.1:53528 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:53562 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:53762 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:00] INFO: 127.0.0.1:54080 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:01] INFO: 127.0.0.1:53816 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:01 TP0] Decode batch. #running-req: 48, #token: 48768, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8627.06, #queue-req: 0, | |
| [2025-09-06 08:44:01] INFO: 127.0.0.1:53000 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:01] INFO: 127.0.0.1:52882 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:01] INFO: 127.0.0.1:52828 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:01] INFO: 127.0.0.1:53734 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:01 TP0] Decode batch. #running-req: 45, #token: 45376, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8071.15, #queue-req: 0, | |
| [2025-09-06 08:44:01] INFO: 127.0.0.1:53620 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:01] INFO: 127.0.0.1:52754 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:01] INFO: 127.0.0.1:53368 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:01] INFO: 127.0.0.1:54412 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:01] INFO: 127.0.0.1:53844 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:01] INFO: 127.0.0.1:53872 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:01 TP0] Decode batch. #running-req: 38, #token: 41984, token usage: 0.00, cuda graph: True, gen throughput (token/s): 7190.87, #queue-req: 0, | |
| [2025-09-06 08:44:01] INFO: 127.0.0.1:53576 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:01] INFO: 127.0.0.1:54002 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:01] INFO: 127.0.0.1:54146 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:01 TP0] Decode batch. #running-req: 35, #token: 40256, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6559.52, #queue-req: 0, | |
| [2025-09-06 08:44:01] INFO: 127.0.0.1:54216 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 1%| | 1/198 [00:08<27:03, 8.24s/it] | |
| [2025-09-06 08:44:01 TP0] Decode batch. #running-req: 34, #token: 40192, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6195.61, #queue-req: 0, | |
| [2025-09-06 08:44:01] INFO: 127.0.0.1:53030 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:02] INFO: 127.0.0.1:53170 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:02] INFO: 127.0.0.1:53790 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:02] INFO: 127.0.0.1:53708 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:02] INFO: 127.0.0.1:52756 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:02] INFO: 127.0.0.1:53228 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:02 TP0] Decode batch. #running-req: 29, #token: 34112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5702.41, #queue-req: 0, | |
| [2025-09-06 08:44:02] INFO: 127.0.0.1:53416 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:02] INFO: 127.0.0.1:52850 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:02] INFO: 127.0.0.1:53392 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:02] INFO: 127.0.0.1:54462 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:02] INFO: 127.0.0.1:53914 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:02 TP0] Decode batch. #running-req: 23, #token: 29184, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4869.99, #queue-req: 0, | |
| [2025-09-06 08:44:02] INFO: 127.0.0.1:52812 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:02] INFO: 127.0.0.1:52890 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:02] INFO: 127.0.0.1:52936 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:02] INFO: 127.0.0.1:54422 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:02 TP0] Decode batch. #running-req: 19, #token: 24896, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3986.23, #queue-req: 0, | |
| [2025-09-06 08:44:02] INFO: 127.0.0.1:53492 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:02] INFO: 127.0.0.1:54218 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 1%| | 2/198 [00:09<12:50, 3.93s/it] | |
| [2025-09-06 08:44:02] INFO: 127.0.0.1:53826 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:02 TP0] Decode batch. #running-req: 16, #token: 21888, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3458.66, #queue-req: 0, | |
| [2025-09-06 08:44:02] INFO: 127.0.0.1:54016 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:02] INFO: 127.0.0.1:54138 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:02] INFO: 127.0.0.1:54108 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:02 TP0] Decode batch. #running-req: 13, #token: 18048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3017.86, #queue-req: 0, | |
| [2025-09-06 08:44:03] INFO: 127.0.0.1:53590 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:03] INFO: 127.0.0.1:52782 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:03 TP0] Decode batch. #running-req: 11, #token: 15680, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2492.21, #queue-req: 0, | |
| [2025-09-06 08:44:03] INFO: 127.0.0.1:54300 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 6%|▌ | 12/198 [00:09<01:26, 2.15it/s] | |
| [2025-09-06 08:44:03 TP0] Decode batch. #running-req: 10, #token: 14784, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2218.82, #queue-req: 0, | |
| [2025-09-06 08:44:03 TP0] Decode batch. #running-req: 10, #token: 14976, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2132.96, #queue-req: 0, | |
| [2025-09-06 08:44:03] INFO: 127.0.0.1:53124 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:03] INFO: 127.0.0.1:54028 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:03] INFO: 127.0.0.1:53400 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:03 TP0] Decode batch. #running-req: 7, #token: 10880, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1887.84, #queue-req: 0, | |
| [2025-09-06 08:44:03] INFO: 127.0.0.1:53076 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:03 TP0] Decode batch. #running-req: 6, #token: 9536, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1548.43, #queue-req: 0, | |
| [2025-09-06 08:44:03] INFO: 127.0.0.1:53944 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:04 TP0] Decode batch. #running-req: 5, #token: 8064, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1236.11, #queue-req: 0, | |
| [2025-09-06 08:44:04] INFO: 127.0.0.1:53746 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:04 TP0] Decode batch. #running-req: 4, #token: 4928, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1162.51, #queue-req: 0, | |
| [2025-09-06 08:44:04] INFO: 127.0.0.1:54486 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 16%|█▌ | 32/198 [00:10<00:27, 6.01it/s] | |
| [2025-09-06 08:44:04 TP0] Decode batch. #running-req: 3, #token: 4992, token usage: 0.00, cuda graph: True, gen throughput (token/s): 798.81, #queue-req: 0, | |
| [2025-09-06 08:44:04 TP0] Decode batch. #running-req: 3, #token: 5120, token usage: 0.00, cuda graph: True, gen throughput (token/s): 794.70, #queue-req: 0, | |
| [2025-09-06 08:44:04 TP0] Decode batch. #running-req: 3, #token: 5312, token usage: 0.00, cuda graph: True, gen throughput (token/s): 795.25, #queue-req: 0, | |
| [2025-09-06 08:44:04] INFO: 127.0.0.1:52852 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:44:04 TP0] Decode batch. #running-req: 3, #token: 3648, token usage: 0.00, cuda graph: True, gen throughput (token/s): 781.23, #queue-req: 0, | |
| 27%|██▋ | 54/198 [00:11<00:13, 10.80it/s] | |
| [2025-09-06 08:44:04] INFO: 127.0.0.1:52992 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 35%|███▍ | 69/198 [00:11<00:08, 15.68it/s] | |
| [2025-09-06 08:44:04 TP0] Decode batch. #running-req: 1, #token: 1920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 530.75, #queue-req: 0, | |
| [2025-09-06 08:44:05 TP0] Decode batch. #running-req: 1, #token: 1920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.02, #queue-req: 0, | |
| [2025-09-06 08:44:05 TP0] Decode batch. #running-req: 1, #token: 1984, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.12, #queue-req: 0, | |
| [2025-09-06 08:44:05 TP0] Decode batch. #running-req: 1, #token: 2048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.90, #queue-req: 0, | |
| [2025-09-06 08:44:05 TP0] Decode batch. #running-req: 1, #token: 2048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.23, #queue-req: 0, | |
| [2025-09-06 08:44:05 TP0] Decode batch. #running-req: 1, #token: 2112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.86, #queue-req: 0, | |
| [2025-09-06 08:44:05 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 324.77, #queue-req: 0, | |
| [2025-09-06 08:44:05 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.95, #queue-req: 0, | |
| [2025-09-06 08:44:05 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.11, #queue-req: 0, | |
| [2025-09-06 08:44:06] INFO: 127.0.0.1:53362 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 56%|█████▌ | 110/198 [00:12<00:03, 23.46it/s] | |
| 100%|██████████| 198/198 [00:12<00:00, 15.86it/s] | |
| /usr/lib/python3.12/subprocess.py:1127: ResourceWarning: subprocess 47053 is still running | |
| _warn("subprocess %s is still running" % self.pid, | |
| ResourceWarning: Enable tracemalloc to get the object allocation traceback | |
| . | |
| ---------------------------------------------------------------------- | |
| Ran 1 test in 177.127s | |
| OK | |
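The `ResourceWarning: subprocess 47053 is still running` above means the test tore down without reaping the process it launched. The usual teardown pattern, sketched with a stand-in child process (the sleep command is illustrative):

```python
import subprocess
import sys

# Stand-in for the long-lived process the test launches (pid 47053 in the log).
proc = subprocess.Popen(
    [sys.executable, "-c", "import time; time.sleep(60)"],
)
try:
    pass  # ... run the evaluation against the child here ...
finally:
    proc.terminate()       # ask the child to exit ...
    proc.wait(timeout=30)  # ... and reap it, so no ResourceWarning fires at GC
```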
| Writing report to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.html | |
| {'chars': 1697.6010101010102, 'chars:std': 924.1141307574765, 'score:std': 0.48379515211426455, 'score': 0.6262626262626263} | |
| Writing results to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.json | |
| Total latency: 12.546 s | |
| Score: 0.626 | |
| Evaluation end: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 metrics={'chars': 1697.6010101010102, 'chars:std': 924.1141307574765, 'score:std': 0.48379515211426455, 'score': 0.6262626262626263} | |
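The reported metrics are internally consistent: for a 0/1 per-question score over the 198 GPQA items (see the `198/198` progress bar above), `score` = 0.6263 and `score:std` = 0.4838 correspond to exactly 124 correct answers, since the population std of a 0/1 variable is sqrt(p*(1-p)). A quick check (the 124 is inferred from the mean, not printed by the harness):

```python
import statistics

n_total, n_correct = 198, 124  # 124/198 = 0.6262..., inferred from 'score'
scores = [1] * n_correct + [0] * (n_total - n_correct)

mean = statistics.mean(scores)
pstd = statistics.pstdev(scores)  # population std = sqrt(p * (1 - p))

print(mean)  # 0.6262626262626263 -> 'score'
print(pstd)  # 0.4837951521...    -> 'score:std'
```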
| ================================================================================ | |
| Run 7: | |
| Auto-configed device: cuda | |
| WARNING:sglang.srt.server_args:Detected SM100 and MXFP4 quantization format for GPT-OSS model, enabling FlashInfer MXFP4 MOE kernel. | |
| WARNING:sglang.srt.server_args:TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64. | |
| [2025-09-06 08:44:20] server_args=ServerArgs(model_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8400, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.93, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=True, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=917768611, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/home/yiliu7/models/openai/gpt-oss-120b', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='flashinfer_mxfp4', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=200, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False) | |
| All deep_gemm operations loaded successfully! | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:44:20] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:44:20] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:44:21] Using default HuggingFace chat template with detected content format: string | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:44:27 TP2] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:44:27 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:44:27 TP0] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:44:27 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:44:27 TP3] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:44:27 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:44:28 TP1] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:44:28 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:44:28 TP2] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:44:28 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:44:28 TP0] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:44:28 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:44:28 TP0] Init torch distributed begin. | |
| [2025-09-06 08:44:28 TP3] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:44:28 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:44:28 TP1] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:44:28 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [2025-09-06 08:44:30 TP0] sglang is using nccl==2.27.3 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [2025-09-06 08:44:32 TP0] Init torch distributed ends. mem usage=1.46 GB | |
| [2025-09-06 08:44:33 TP0] Load weight begin. avail mem=176.28 GB | |
| All deep_gemm operations loaded successfully! | |
| Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s] | |
| Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 1574.99it/s] | |
| [2025-09-06 08:44:42 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.0.mlp.experts), it might take a while... | |
| All deep_gemm operations loaded successfully! | |
| All deep_gemm operations loaded successfully! | |
| All deep_gemm operations loaded successfully! | |
| [2025-09-06 08:44:46 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.1.mlp.experts), it might take a while... | |
| [2025-09-06 08:44:49 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.2.mlp.experts), it might take a while... | |
| [2025-09-06 08:44:52 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.3.mlp.experts), it might take a while... | |
| [2025-09-06 08:44:55 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.4.mlp.experts), it might take a while... | |
| [2025-09-06 08:44:58 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.5.mlp.experts), it might take a while... | |
| [2025-09-06 08:45:01 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.6.mlp.experts), it might take a while... | |
| [2025-09-06 08:45:04 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.7.mlp.experts), it might take a while... | |
| [2025-09-06 08:45:07 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.8.mlp.experts), it might take a while... | |
| [2025-09-06 08:45:10 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.9.mlp.experts), it might take a while... | |
| [2025-09-06 08:45:13 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.10.mlp.experts), it might take a while... | |
| [2025-09-06 08:45:16 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.11.mlp.experts), it might take a while... | |
| [2025-09-06 08:45:19 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.12.mlp.experts), it might take a while... | |
| [2025-09-06 08:45:22 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.13.mlp.experts), it might take a while... | |
| [2025-09-06 08:45:25 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.14.mlp.experts), it might take a while... | |
| [2025-09-06 08:45:28 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.15.mlp.experts), it might take a while... | |
| [2025-09-06 08:45:31 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.16.mlp.experts), it might take a while... | |
| [2025-09-06 08:45:34 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.17.mlp.experts), it might take a while... | |
| [2025-09-06 08:45:37 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.18.mlp.experts), it might take a while... | |
| [2025-09-06 08:45:40 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.19.mlp.experts), it might take a while... | |
| [2025-09-06 08:45:43 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.20.mlp.experts), it might take a while... | |
| [2025-09-06 08:45:46 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.21.mlp.experts), it might take a while... | |
| [2025-09-06 08:45:49 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.22.mlp.experts), it might take a while... | |
| [2025-09-06 08:45:52 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.23.mlp.experts), it might take a while... | |
| [2025-09-06 08:45:55 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.24.mlp.experts), it might take a while... | |
| [2025-09-06 08:45:58 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.25.mlp.experts), it might take a while... | |
| [2025-09-06 08:46:01 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.26.mlp.experts), it might take a while... | |
| [2025-09-06 08:46:05 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.27.mlp.experts), it might take a while... | |
| [2025-09-06 08:46:08 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.28.mlp.experts), it might take a while... | |
| [2025-09-06 08:46:11 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.29.mlp.experts), it might take a while... | |
| [2025-09-06 08:46:14 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.30.mlp.experts), it might take a while... | |
| [2025-09-06 08:46:17 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.31.mlp.experts), it might take a while... | |
| [2025-09-06 08:46:20 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.32.mlp.experts), it might take a while... | |
| [2025-09-06 08:46:23 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.33.mlp.experts), it might take a while... | |
| [2025-09-06 08:46:26 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.34.mlp.experts), it might take a while... | |
| [2025-09-06 08:46:29 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.35.mlp.experts), it might take a while... | |
| [2025-09-06 08:46:32 TP0] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=158.06 GB, mem usage=18.22 GB. | |
| [2025-09-06 08:46:37 TP0] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:46:37 TP2] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:46:37 TP0] Memory pool end. avail mem=10.23 GB | |
| [2025-09-06 08:46:37 TP3] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:46:37 TP1] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
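The logged K/V sizes can be sanity-checked from the token count. A minimal sketch, assuming the GPT-OSS-120B shape (36 layers matches the layers 0..35 shuffled above; the KV-head count and head dim are assumptions from the public config, not stated in this log):

```python
# KV-cache size check against the "K size: 72.85 GB" lines above.
num_tokens = 8_487_040   # "#tokens" from the log
num_layers = 36          # layers 0..35 were shuffled above
num_kv_heads = 8         # assumption: GQA KV heads for GPT-OSS-120B
head_dim = 64            # assumption
tp_size = 4              # --tp 4 from the launch command
bytes_per_elem = 2       # bf16 cache ("kv_cache_dtype=auto", dtype=bfloat16)

# Each TP rank holds num_kv_heads / tp_size heads of the K (or V) cache.
k_bytes = num_tokens * num_layers * (num_kv_heads // tp_size) * head_dim * bytes_per_elem
print(f"K size per rank: {k_bytes / 2**30:.2f} GB")
```

This lands within ~0.01 GB of the logged 72.85 GB; the allocator may add a small page-aligned buffer on top of `max_total_num_tokens`, so an exact match is not expected.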
| [2025-09-06 08:46:38 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=9.54 GB | |
| [2025-09-06 08:46:38 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200] | |
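The 28 batch sizes above follow a simple pattern. A sketch that reproduces the observed schedule (this mirrors the logged list, not necessarily SGLang's exact internal logic; `cuda_graph_max_bs = 200` is taken from `--cuda-graph-max-bs 200` in the launch command):

```python
# Reproduce the CUDA-graph capture batch sizes logged above:
# powers of two up to 8, then multiples of 8 up to the configured maximum.
cuda_graph_max_bs = 200  # from --cuda-graph-max-bs 200

capture_bs = [1, 2, 4, 8] + list(range(16, cuda_graph_max_bs + 1, 8))
print(len(capture_bs), capture_bs)  # 28 sizes, ending at 200
```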
| rank 2 allocated ipc_handles: [['0x778cbc000000', '0x778c78000000', '0x77b276000000', '0x778c74000000'], ['0x778c77000000', '0x778c77200000', '0x778c76e00000', '0x778c77400000'], ['0x778c60000000', '0x778c56000000', '0x778c6a000000', '0x778c4c000000']] | |
| [2025-09-06 08:46:40.032] [info] lamportInitialize start: buffer: 0x778c6a000000, size: 71303168 | |
| rank 0 allocated ipc_handles: [['0x75212a000000', '0x74fb8e000000', '0x74fb38000000', '0x74fb34000000'], ['0x74fb36e00000', '0x74fb37000000', '0x74fb37200000', '0x74fb37400000'], ['0x74fb2a000000', '0x74fb20000000', '0x74fb16000000', '0x74fb0c000000']] | |
| [2025-09-06 08:46:40.082] [info] lamportInitialize start: buffer: 0x74fb2a000000, size: 71303168 | |
| rank 3 allocated ipc_handles: [['0x78650c000000', '0x7864ae000000', '0x7864aa000000', '0x788aa8000000'], ['0x7864ad000000', '0x7864ad200000', '0x7864ad400000', '0x7864ace00000'], ['0x786496000000', '0x78648c000000', '0x786482000000', '0x7864a0000000']] | |
| [2025-09-06 08:46:40.131] [info] lamportInitialize start: buffer: 0x7864a0000000, size: 71303168 | |
| rank 1 allocated ipc_handles: [['0x7f7614000000', '0x7f9baa000000', '0x7f75b0000000', '0x7f75ac000000'], ['0x7f75af000000', '0x7f75aee00000', '0x7f75af200000', '0x7f75af400000'], ['0x7f7598000000', '0x7f75a2000000', '0x7f758e000000', '0x7f7584000000']] | |
| [2025-09-06 08:46:40.181] [info] lamportInitialize start: buffer: 0x7f75a2000000, size: 71303168 | |
| [2025-09-06 08:46:40 TP0] FlashInfer workspace initialized for rank 0, world_size 4 | |
| [2025-09-06 08:46:40 TP1] FlashInfer workspace initialized for rank 1, world_size 4 | |
| [2025-09-06 08:46:40 TP3] FlashInfer workspace initialized for rank 3, world_size 4 | |
| [2025-09-06 08:46:40 TP2] FlashInfer workspace initialized for rank 2, world_size 4 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 3 workspace[0] 0x78650c000000 | |
| Rank 3 workspace[1] 0x7864ae000000 | |
| Rank 3 workspace[2] 0x7864aa000000 | |
| Rank 3 workspace[3] 0x788aa8000000 | |
| Rank 3 workspace[4] 0x7864ad000000 | |
| Rank 3 workspace[5] 0x7864ad200000 | |
| Rank 3 workspace[6] 0x7864ad400000 | |
| Rank 3 workspace[7] 0x7864ace00000 | |
| Rank 3 workspace[8] 0x786496000000 | |
| Rank 3 workspace[9] 0x78648c000000 | |
| Rank 3 workspace[10] 0x786482000000 | |
| Rank 3 workspace[11] 0x7864a0000000 | |
| Rank 3 workspace[12] 0x7890a3264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 0 workspace[0] 0x75212a000000 | |
| Rank 0 workspace[1] 0x74fb8e000000 | |
| Rank 0 workspace[2] 0x74fb38000000 | |
| Rank 0 workspace[3] 0x74fb34000000 | |
| Rank 0 workspace[4] 0x74fb36e00000 | |
| Rank 0 workspace[5] 0x74fb37000000 | |
| Rank 0 workspace[6] 0x74fb37200000 | |
| Rank 0 workspace[7] 0x74fb37400000 | |
| Rank 0 workspace[8] 0x74fb2a000000 | |
| Rank 0 workspace[9] 0x74fb20000000 | |
| Rank 0 workspace[10] 0x74fb16000000 | |
| Rank 0 workspace[11] 0x74fb0c000000 | |
| Rank 0 workspace[12] 0x75271f264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 1 workspace[0] 0x7f7614000000 | |
| Rank 1 workspace[1] 0x7f9baa000000 | |
| Rank 1 workspace[2] 0x7f75b0000000 | |
| Rank 1 workspace[3] 0x7f75ac000000 | |
| Rank 1 workspace[4] 0x7f75af000000 | |
| Rank 1 workspace[5] 0x7f75aee00000 | |
| Rank 1 workspace[6] 0x7f75af200000 | |
| Rank 1 workspace[7] 0x7f75af400000 | |
| Rank 1 workspace[8] 0x7f7598000000 | |
| Rank 1 workspace[9] 0x7f75a2000000 | |
| Rank 1 workspace[10] 0x7f758e000000 | |
| Rank 1 workspace[11] 0x7f7584000000 | |
| Rank 1 workspace[12] 0x7fa1b3264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 2 workspace[0] 0x778cbc000000 | |
| Rank 2 workspace[1] 0x778c78000000 | |
| Rank 2 workspace[2] 0x77b276000000 | |
| Rank 2 workspace[3] 0x778c74000000 | |
| Rank 2 workspace[4] 0x778c77000000 | |
| Rank 2 workspace[5] 0x778c77200000 | |
| Rank 2 workspace[6] 0x778c76e00000 | |
| Rank 2 workspace[7] 0x778c77400000 | |
| Rank 2 workspace[8] 0x778c60000000 | |
| Rank 2 workspace[9] 0x778c56000000 | |
| Rank 2 workspace[10] 0x778c6a000000 | |
| Rank 2 workspace[11] 0x778c4c000000 | |
| Rank 2 workspace[12] 0x77b881264400 | |
| Capturing batches (bs=1 avail_mem=7.82 GB): 100%|██████████| 28/28 [00:04<00:00, 6.74it/s] | |
| [2025-09-06 08:46:42 TP0] Registering 56 cuda graph addresses | |
| [2025-09-06 08:46:42 TP3] Registering 56 cuda graph addresses | |
| [2025-09-06 08:46:42 TP1] Registering 56 cuda graph addresses | |
| [2025-09-06 08:46:42 TP2] Registering 56 cuda graph addresses | |
| [2025-09-06 08:46:42 TP0] Capture cuda graph end. Time elapsed: 4.66 s. mem usage=1.73 GB. avail mem=7.81 GB. | |
| [2025-09-06 08:46:43 TP0] max_total_num_tokens=8487040, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=131072, available_gpu_mem=7.81 GB | |
| [2025-09-06 08:46:44] INFO: Started server process [49519] | |
| [2025-09-06 08:46:44] INFO: Waiting for application startup. | |
| [2025-09-06 08:46:44] INFO: Application startup complete. | |
| [2025-09-06 08:46:44] INFO: Uvicorn running on http://127.0.0.1:8400 (Press CTRL+C to quit) | |
| [2025-09-06 08:46:45] INFO: 127.0.0.1:42690 - "GET /health_generate HTTP/1.1" 503 Service Unavailable | |
| [2025-09-06 08:46:45] INFO: 127.0.0.1:42702 - "GET /get_model_info HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:45 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:46:46] INFO: 127.0.0.1:42716 - "POST /generate HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:46] The server is fired up and ready to roll! | |
| [2025-09-06 08:46:55 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:46:56] INFO: 127.0.0.1:47496 - "GET /health_generate HTTP/1.1" 200 OK | |
| command=python3 -m sglang.launch_server --model-path /home/yiliu7/models/openai/gpt-oss-120b --tp 4 --cuda-graph-max-bs 200 --mem-fraction-static 0.93 --device cuda --host 127.0.0.1 --port 8400 | |
| Evaluation start: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 | |
| ChatCompletionSampler initialized with self.system_message=None self.temperature=0.1 self.max_tokens=4096 self.reasoning_effort='low' | |
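The sampler settings above translate directly into OpenAI-compatible chat requests against this server. A hedged sketch of such a request payload, using only values visible in the log (the `reasoning_effort` pass-through field is an assumption about how the sampler forwards it; it is not confirmed by this log):

```python
import json

# Server endpoint from the launch command (--host 127.0.0.1 --port 8400).
BASE_URL = "http://127.0.0.1:8400/v1"

def build_request(prompt: str) -> dict:
    """Build a /v1/chat/completions payload matching the sampler settings
    logged above (temperature=0.1, max_tokens=4096, reasoning_effort='low')."""
    return {
        "model": "/home/yiliu7/models/openai/gpt-oss-120b",  # served_model_name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.1,
        "max_tokens": 4096,
        "reasoning_effort": "low",
    }

if __name__ == "__main__":
    # POSTing this JSON to f"{BASE_URL}/chat/completions" produces the
    # "POST /v1/chat/completions" lines below (requires the server running).
    print(json.dumps(build_request("What is 2+2?"), indent=2))
```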
| [2025-09-06 08:46:56 TP0] Prefill batch. #new-seq: 1, #new-token: 320, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:46:56 TP0] Prefill batch. #new-seq: 3, #new-token: 960, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0, | |
| [2025-09-06 08:46:57 TP0] Prefill batch. #new-seq: 9, #new-token: 2304, #cached-token: 576, token usage: 0.00, #running-req: 4, #queue-req: 0, | |
| [2025-09-06 08:46:57 TP0] Prefill batch. #new-seq: 44, #new-token: 13184, #cached-token: 2816, token usage: 0.00, #running-req: 13, #queue-req: 0, | |
| [2025-09-06 08:46:57 TP0] Prefill batch. #new-seq: 23, #new-token: 8448, #cached-token: 1472, token usage: 0.00, #running-req: 57, #queue-req: 0, | |
| [2025-09-06 08:46:57 TP0] Prefill batch. #new-seq: 45, #new-token: 11200, #cached-token: 2880, token usage: 0.00, #running-req: 80, #queue-req: 0, | |
| [2025-09-06 08:46:57 TP0] Prefill batch. #new-seq: 24, #new-token: 6848, #cached-token: 1600, token usage: 0.00, #running-req: 125, #queue-req: 0, | |
| [2025-09-06 08:46:57 TP0] Prefill batch. #new-seq: 49, #new-token: 13568, #cached-token: 3200, token usage: 0.01, #running-req: 149, #queue-req: 0, | |
| [2025-09-06 08:46:57 TP0] Decode batch. #running-req: 198, #token: 62976, token usage: 0.01, cuda graph: True, gen throughput (token/s): 422.71, #queue-req: 0, | |
| [2025-09-06 08:46:57] INFO: 127.0.0.1:48442 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:58] INFO: 127.0.0.1:48364 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:58] INFO: 127.0.0.1:47634 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:58 TP0] Decode batch. #running-req: 195, #token: 69760, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17213.36, #queue-req: 0, | |
| [2025-09-06 08:46:58] INFO: 127.0.0.1:47522 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:58] INFO: 127.0.0.1:47972 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:58] INFO: 127.0.0.1:47900 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:58] INFO: 127.0.0.1:48432 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:58] INFO: 127.0.0.1:47994 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:58 TP0] Decode batch. #running-req: 190, #token: 75968, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17152.43, #queue-req: 0, | |
| [2025-09-06 08:46:58] INFO: 127.0.0.1:47612 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:58] INFO: 127.0.0.1:48514 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:58] INFO: 127.0.0.1:48768 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:58] INFO: 127.0.0.1:47846 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:59] INFO: 127.0.0.1:49120 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:59] INFO: 127.0.0.1:48562 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:59] INFO: 127.0.0.1:48578 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:59] INFO: 127.0.0.1:48894 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:59] INFO: 127.0.0.1:48012 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:59] INFO: 127.0.0.1:48018 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:59] INFO: 127.0.0.1:48142 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:59] INFO: 127.0.0.1:48648 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:59 TP0] Decode batch. #running-req: 178, #token: 76928, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16526.97, #queue-req: 0, | |
| [2025-09-06 08:46:59] INFO: 127.0.0.1:49034 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:59] INFO: 127.0.0.1:47916 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:59] INFO: 127.0.0.1:48330 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:59] INFO: 127.0.0.1:48710 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:59] INFO: 127.0.0.1:48130 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:59] INFO: 127.0.0.1:48858 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:59 TP0] Decode batch. #running-req: 172, #token: 79296, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16031.50, #queue-req: 0, | |
| [2025-09-06 08:46:59] INFO: 127.0.0.1:47990 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:59] INFO: 127.0.0.1:48380 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:59] INFO: 127.0.0.1:48622 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:59] INFO: 127.0.0.1:48654 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:46:59] INFO: 127.0.0.1:49236 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:00] INFO: 127.0.0.1:48286 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:00] INFO: 127.0.0.1:47750 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:00] INFO: 127.0.0.1:48898 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:00 TP0] Decode batch. #running-req: 164, #token: 82368, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15193.77, #queue-req: 0, | |
| [2025-09-06 08:47:00] INFO: 127.0.0.1:47760 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:00] INFO: 127.0.0.1:48718 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:00] INFO: 127.0.0.1:49222 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:00] INFO: 127.0.0.1:48996 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:00] INFO: 127.0.0.1:49008 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:00] INFO: 127.0.0.1:47802 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:00] INFO: 127.0.0.1:47734 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:00 TP0] Decode batch. #running-req: 157, #token: 84032, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14316.44, #queue-req: 0, | |
| [2025-09-06 08:47:00] INFO: 127.0.0.1:47976 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:00] INFO: 127.0.0.1:47564 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:00] INFO: 127.0.0.1:47608 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:00] INFO: 127.0.0.1:48228 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:00] INFO: 127.0.0.1:48054 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:00] INFO: 127.0.0.1:49080 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:00] INFO: 127.0.0.1:48176 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:00] INFO: 127.0.0.1:48834 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:00] INFO: 127.0.0.1:48942 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:00] INFO: 127.0.0.1:48284 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:00] INFO: 127.0.0.1:48528 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:01] INFO: 127.0.0.1:48444 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:01 TP0] Decode batch. #running-req: 145, #token: 83904, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13590.98, #queue-req: 0, | |
| [2025-09-06 08:47:01] INFO: 127.0.0.1:48408 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:01] INFO: 127.0.0.1:49022 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:01] INFO: 127.0.0.1:49142 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:01] INFO: 127.0.0.1:48096 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:01] INFO: 127.0.0.1:47710 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:01] INFO: 127.0.0.1:47768 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:01] INFO: 127.0.0.1:47622 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:01] INFO: 127.0.0.1:48666 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:01 TP0] Decode batch. #running-req: 137, #token: 85312, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13062.56, #queue-req: 0, | |
| [2025-09-06 08:47:01] INFO: 127.0.0.1:47814 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:01] INFO: 127.0.0.1:48222 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:01] INFO: 127.0.0.1:47892 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:01] INFO: 127.0.0.1:48314 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:01] INFO: 127.0.0.1:49194 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:01] INFO: 127.0.0.1:47848 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:01 TP0] Decode batch. #running-req: 131, #token: 86528, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12544.55, #queue-req: 0, | |
| [2025-09-06 08:47:01] INFO: 127.0.0.1:48158 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:47646 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48302 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48520 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:49028 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48206 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48552 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:47694 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48766 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48914 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:47766 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48236 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:47586 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48016 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02 TP0] Decode batch. #running-req: 117, #token: 81664, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14231.44, #queue-req: 0, | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:47552 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:47502 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48626 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48350 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48732 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48982 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:47872 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48794 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48846 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48888 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:49174 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02 TP0] Decode batch. #running-req: 106, #token: 78528, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15304.65, #queue-req: 0, | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:49088 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48642 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48108 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:47950 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48076 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48742 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02 TP0] Decode batch. #running-req: 100, #token: 78528, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14633.89, #queue-req: 0, | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48186 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48010 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:02] INFO: 127.0.0.1:48618 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:48478 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:49046 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03 TP0] Decode batch. #running-req: 95, #token: 78272, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13824.30, #queue-req: 0, | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:47974 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:48750 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:48772 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:48864 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:48234 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:49184 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:47794 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:48046 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:48884 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:48818 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03 TP0] Decode batch. #running-req: 85, #token: 73408, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13331.56, #queue-req: 0, | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:47940 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:48896 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:48918 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:48592 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:48246 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:48402 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:48878 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:48780 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03 TP0] Decode batch. #running-req: 77, #token: 70080, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12444.69, #queue-req: 0, | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:48390 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:49010 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:49234 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:49128 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:48248 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:48134 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:48166 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:03 TP0] Decode batch. #running-req: 70, #token: 66176, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11601.78, #queue-req: 0, | |
| [2025-09-06 08:47:03] INFO: 127.0.0.1:47648 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:48406 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:49164 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:47656 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:49104 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04 TP0] Decode batch. #running-req: 66, #token: 63680, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11035.15, #queue-req: 0, | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:47574 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:48502 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:48682 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:47830 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:47924 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:48694 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04 TP0] Decode batch. #running-req: 60, #token: 60544, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10158.47, #queue-req: 0, | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:48004 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:48958 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:48404 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:48600 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:47514 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:48636 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:48340 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:48602 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04 TP0] Decode batch. #running-req: 51, #token: 53696, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9236.80, #queue-req: 0, | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:47628 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:48462 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:49156 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:47578 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:48262 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:47968 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:48474 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:47718 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:48798 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:49210 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:48542 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04 TP0] Decode batch. #running-req: 40, #token: 44032, token usage: 0.01, cuda graph: True, gen throughput (token/s): 7980.75, #queue-req: 0, | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:48456 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:47860 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:04] INFO: 127.0.0.1:47964 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:05] INFO: 127.0.0.1:48066 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:05 TP0] Decode batch. #running-req: 36, #token: 41280, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6785.10, #queue-req: 0, | |
| [2025-09-06 08:47:05] INFO: 127.0.0.1:48268 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:05] INFO: 127.0.0.1:48030 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:05] INFO: 127.0.0.1:47884 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:05] INFO: 127.0.0.1:48418 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:05] INFO: 127.0.0.1:47598 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:05 TP0] Decode batch. #running-req: 31, #token: 36864, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5993.33, #queue-req: 0, | |
| [2025-09-06 08:47:05] INFO: 127.0.0.1:48120 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:05] INFO: 127.0.0.1:48298 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:05] INFO: 127.0.0.1:48984 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:05] INFO: 127.0.0.1:48966 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:05] INFO: 127.0.0.1:48580 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:05] INFO: 127.0.0.1:47668 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:05 TP0] Decode batch. #running-req: 25, #token: 31104, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5349.57, #queue-req: 0, | |
| [2025-09-06 08:47:05] INFO: 127.0.0.1:48382 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:05] INFO: 127.0.0.1:47678 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:05] INFO: 127.0.0.1:48810 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 1%| | 1/198 [00:08<28:53, 8.80s/it][2025-09-06 08:47:05 TP0] Decode batch. #running-req: 22, #token: 28096, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4576.68, #queue-req: 0, | |
| [2025-09-06 08:47:05] INFO: 127.0.0.1:49074 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:05] INFO: 127.0.0.1:48230 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:05] INFO: 127.0.0.1:49090 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:05 TP0] Decode batch. #running-req: 19, #token: 25280, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4138.92, #queue-req: 0, | |
| [2025-09-06 08:47:06] INFO: 127.0.0.1:48492 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:06] INFO: 127.0.0.1:49058 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:06] INFO: 127.0.0.1:47858 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:06 TP0] Decode batch. #running-req: 16, #token: 21824, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3493.30, #queue-req: 0, | |
| [2025-09-06 08:47:06] INFO: 127.0.0.1:48926 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:06 TP0] Decode batch. #running-req: 15, #token: 20864, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3256.37, #queue-req: 0, | |
| [2025-09-06 08:47:06] INFO: 127.0.0.1:48460 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:06] INFO: 127.0.0.1:48716 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:06] INFO: 127.0.0.1:48850 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 1%| | 2/198 [00:09<13:20, 4.08s/it][2025-09-06 08:47:06] INFO: 127.0.0.1:49178 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:06 TP0] Decode batch. #running-req: 11, #token: 15808, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2734.59, #queue-req: 0, | |
| [2025-09-06 08:47:06] INFO: 127.0.0.1:48080 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:06 TP0] Decode batch. #running-req: 10, #token: 14976, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2231.28, #queue-req: 0, | |
| [2025-09-06 08:47:06] INFO: 127.0.0.1:47778 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:06 TP0] Decode batch. #running-req: 9, #token: 13632, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2032.86, #queue-req: 0, | |
| [2025-09-06 08:47:06] INFO: 127.0.0.1:49012 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 12%|█▏ | 24/198 [00:10<00:40, 4.32it/s][2025-09-06 08:47:07] INFO: 127.0.0.1:48744 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:07 TP0] Decode batch. #running-req: 7, #token: 11200, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1870.16, #queue-req: 0, | |
| [2025-09-06 08:47:07] INFO: 127.0.0.1:47756 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:07 TP0] Decode batch. #running-req: 6, #token: 9600, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1521.52, #queue-req: 0, | |
| [2025-09-06 08:47:07 TP0] Decode batch. #running-req: 6, #token: 9856, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1447.87, #queue-req: 0, | |
| [2025-09-06 08:47:07 TP0] Decode batch. #running-req: 6, #token: 10176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1450.22, #queue-req: 0, | |
| [2025-09-06 08:47:07] INFO: 127.0.0.1:47536 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:07] INFO: 127.0.0.1:48190 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:07 TP0] Decode batch. #running-req: 4, #token: 6848, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1383.04, #queue-req: 0, | |
| [2025-09-06 08:47:07 TP0] Decode batch. #running-req: 4, #token: 7104, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1055.55, #queue-req: 0, | |
| [2025-09-06 08:47:08] INFO: 127.0.0.1:47820 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:08 TP0] Decode batch. #running-req: 3, #token: 5440, token usage: 0.00, cuda graph: True, gen throughput (token/s): 969.66, #queue-req: 0, | |
| [2025-09-06 08:47:08] INFO: 127.0.0.1:48596 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:47:08] INFO: 127.0.0.1:49106 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 23%|██▎ | 45/198 [00:11<00:20, 7.56it/s][2025-09-06 08:47:08 TP0] Decode batch. #running-req: 1, #token: 1856, token usage: 0.00, cuda graph: True, gen throughput (token/s): 487.72, #queue-req: 0, | |
| [2025-09-06 08:47:08 TP0] Decode batch. #running-req: 1, #token: 1920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.44, #queue-req: 0, | |
| [2025-09-06 08:47:08 TP0] Decode batch. #running-req: 1, #token: 1920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 333.22, #queue-req: 0, | |
| [2025-09-06 08:47:08 TP0] Decode batch. #running-req: 1, #token: 1984, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.67, #queue-req: 0, | |
| [2025-09-06 08:47:08 TP0] Decode batch. #running-req: 1, #token: 2048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.29, #queue-req: 0, | |
| [2025-09-06 08:47:08 TP0] Decode batch. #running-req: 1, #token: 2048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.86, #queue-req: 0, | |
| [2025-09-06 08:47:08 TP0] Decode batch. #running-req: 1, #token: 2112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.52, #queue-req: 0, | |
| [2025-09-06 08:47:09 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.63, #queue-req: 0, | |
| [2025-09-06 08:47:09 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.19, #queue-req: 0, | |
| [2025-09-06 08:47:09 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.04, #queue-req: 0, | |
| [2025-09-06 08:47:09 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.00, #queue-req: 0, | |
| [2025-09-06 08:47:09 TP0] Decode batch. #running-req: 1, #token: 2304, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.07, #queue-req: 0, | |
| [2025-09-06 08:47:09] INFO: 127.0.0.1:48032 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 56%|█████▌ | 110/198 [00:12<00:04, 18.79it/s] 100%|██████████| 198/198 [00:12<00:00, 15.65it/s] | |
| /usr/lib/python3.12/subprocess.py:1127: ResourceWarning: subprocess 49519 is still running | |
| _warn("subprocess %s is still running" % self.pid, | |
| ResourceWarning: Enable tracemalloc to get the object allocation traceback | |
| . | |
| ---------------------------------------------------------------------- | |
| Ran 1 test in 176.921s | |
| OK | |
| Writing report to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.html | |
| {'chars': 1763.0353535353536, 'chars:std': 976.7007511699713, 'score:std': 0.4863193178670999, 'score': 0.6161616161616161} | |
| Writing results to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.json | |
| Total latency: 12.715 s | |
| Score: 0.616 | |
| Evaluation end: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 metrics={'chars': 1763.0353535353536, 'chars:std': 976.7007511699713, 'score:std': 0.4863193178670999, 'score': 0.6161616161616161} | |
| ================================================================================ | |
| Run 8: | |
| Auto-configed device: cuda | |
| WARNING:sglang.srt.server_args:Detected SM100 and MXFP4 quantization format for GPT-OSS model, enabling FlashInfer MXFP4 MOE kernel. | |
| WARNING:sglang.srt.server_args:TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64. | |
| [2025-09-06 08:47:24] server_args=ServerArgs(model_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8400, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.93, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=True, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=384373251, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/home/yiliu7/models/openai/gpt-oss-120b', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='flashinfer_mxfp4', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=200, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False) | |
| All deep_gemm operations loaded successfully! | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:47:24] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:47:24] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:47:25] Using default HuggingFace chat template with detected content format: string | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:47:31 TP0] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:47:31 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:47:31 TP3] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:47:31 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:47:32 TP2] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:47:32 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:47:32 TP0] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:47:32 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:47:32 TP0] Init torch distributed begin. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:47:32 TP1] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:47:32 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:47:32 TP3] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:47:32 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:47:32 TP2] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:47:32 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:47:32 TP1] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:47:32 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [2025-09-06 08:47:34 TP0] sglang is using nccl==2.27.3 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [2025-09-06 08:47:36 TP0] Init torch distributed ends. mem usage=1.46 GB | |
| [2025-09-06 08:47:36 TP0] Load weight begin. avail mem=176.28 GB | |
| All deep_gemm operations loaded successfully! | |
| Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s] | |
| Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 1621.43it/s] | |
| All deep_gemm operations loaded successfully! | |
| All deep_gemm operations loaded successfully! | |
| [2025-09-06 08:47:49 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.0.mlp.experts), it might take a while... | |
| All deep_gemm operations loaded successfully! | |
| [2025-09-06 08:47:52 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.1.mlp.experts), it might take a while... | |
| [2025-09-06 08:47:55 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.2.mlp.experts), it might take a while... | |
| [2025-09-06 08:47:58 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.3.mlp.experts), it might take a while... | |
| [2025-09-06 08:48:01 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.4.mlp.experts), it might take a while... | |
| [2025-09-06 08:48:05 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.5.mlp.experts), it might take a while... | |
| [2025-09-06 08:48:08 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.6.mlp.experts), it might take a while... | |
| [2025-09-06 08:48:11 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.7.mlp.experts), it might take a while... | |
| [2025-09-06 08:48:14 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.8.mlp.experts), it might take a while... | |
| [2025-09-06 08:48:17 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.9.mlp.experts), it might take a while... | |
| [2025-09-06 08:48:20 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.10.mlp.experts), it might take a while... | |
| [2025-09-06 08:48:23 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.11.mlp.experts), it might take a while... | |
| [2025-09-06 08:48:26 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.12.mlp.experts), it might take a while... | |
| [2025-09-06 08:48:30 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.13.mlp.experts), it might take a while... | |
| [2025-09-06 08:48:33 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.14.mlp.experts), it might take a while... | |
| [2025-09-06 08:48:36 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.15.mlp.experts), it might take a while... | |
| [2025-09-06 08:48:39 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.16.mlp.experts), it might take a while... | |
| [2025-09-06 08:48:42 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.17.mlp.experts), it might take a while... | |
| [2025-09-06 08:48:45 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.18.mlp.experts), it might take a while... | |
| [2025-09-06 08:48:48 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.19.mlp.experts), it might take a while... | |
| [2025-09-06 08:48:51 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.20.mlp.experts), it might take a while... | |
| [2025-09-06 08:48:54 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.21.mlp.experts), it might take a while... | |
| [2025-09-06 08:48:58 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.22.mlp.experts), it might take a while... | |
| [2025-09-06 08:49:01 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.23.mlp.experts), it might take a while... | |
| [2025-09-06 08:49:04 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.24.mlp.experts), it might take a while... | |
| [2025-09-06 08:49:07 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.25.mlp.experts), it might take a while... | |
| [2025-09-06 08:49:10 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.26.mlp.experts), it might take a while... | |
| [2025-09-06 08:49:13 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.27.mlp.experts), it might take a while... | |
| [2025-09-06 08:49:16 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.28.mlp.experts), it might take a while... | |
| [2025-09-06 08:49:19 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.29.mlp.experts), it might take a while... | |
| [2025-09-06 08:49:23 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.30.mlp.experts), it might take a while... | |
| [2025-09-06 08:49:26 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.31.mlp.experts), it might take a while... | |
| [2025-09-06 08:49:29 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.32.mlp.experts), it might take a while... | |
| [2025-09-06 08:49:32 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.33.mlp.experts), it might take a while... | |
| [2025-09-06 08:49:35 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.34.mlp.experts), it might take a while... | |
| [2025-09-06 08:49:38 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.35.mlp.experts), it might take a while... | |
| [2025-09-06 08:49:41 TP0] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=158.06 GB, mem usage=18.22 GB. | |
| [2025-09-06 08:49:41 TP0] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:49:41 TP0] Memory pool end. avail mem=10.23 GB | |
| [2025-09-06 08:49:41 TP1] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:49:41 TP2] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:49:41 TP3] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:49:42 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=9.54 GB | |
| [2025-09-06 08:49:42 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200] | |
| 0%| | 0/28 [00:00<?, ?it/s] Capturing batches (bs=200 avail_mem=9.39 GB): 0%| | 0/28 [00:00<?, ?it/s]rank 2 allocated ipc_handles: [['0x7fc778000000', '0x7fc774000000', '0x7fed74000000', '0x7fc770000000'], ['0x7fc773000000', '0x7fc773200000', '0x7fc772e00000', '0x7fc773400000'], ['0x7fc75c000000', '0x7fc752000000', '0x7fc766000000', '0x7fc748000000']] | |
| [2025-09-06 08:49:44.171] [info] lamportInitialize start: buffer: 0x7fc766000000, size: 71303168 | |
| rank 3 allocated ipc_handles: [['0x79b95c000000', '0x79b922000000', '0x79b91e000000', '0x79df12000000'], ['0x79b921000000', '0x79b921200000', '0x79b921400000', '0x79b920e00000'], ['0x79b90a000000', '0x79b900000000', '0x79b8f6000000', '0x79b914000000']] | |
| [2025-09-06 08:49:44.223] [info] lamportInitialize start: buffer: 0x79b914000000, size: 71303168 | |
| rank 0 allocated ipc_handles: [['0x7896b2000000', '0x7870f6000000', '0x7870c4000000', '0x7870c0000000'], ['0x7870c2e00000', '0x7870c3000000', '0x7870c3200000', '0x7870c3400000'], ['0x7870b6000000', '0x7870ac000000', '0x7870a2000000', '0x787098000000']] | |
| [2025-09-06 08:49:44.271] [info] lamportInitialize start: buffer: 0x7870b6000000, size: 71303168 | |
| rank 1 allocated ipc_handles: [['0x7956e2000000', '0x797c78000000', '0x79567c000000', '0x795678000000'], ['0x79567b000000', '0x79567ae00000', '0x79567b200000', '0x79567b400000'], ['0x795664000000', '0x79566e000000', '0x79565a000000', '0x795650000000']] | |
| [2025-09-06 08:49:44.320] [info] lamportInitialize start: buffer: 0x79566e000000, size: 71303168 | |
| [2025-09-06 08:49:44 TP0] FlashInfer workspace initialized for rank 0, world_size 4 | |
| [2025-09-06 08:49:44 TP1] FlashInfer workspace initialized for rank 1, world_size 4 | |
| [2025-09-06 08:49:44 TP3] FlashInfer workspace initialized for rank 3, world_size 4 | |
| [2025-09-06 08:49:44 TP2] FlashInfer workspace initialized for rank 2, world_size 4 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 0 workspace[0] 0x7896b2000000 | |
| Rank 0 workspace[1] 0x7870f6000000 | |
| Rank 0 workspace[2] 0x7870c4000000 | |
| Rank 0 workspace[3] 0x7870c0000000 | |
| Rank 0 workspace[4] 0x7870c2e00000 | |
| Rank 0 workspace[5] 0x7870c3000000 | |
| Rank 0 workspace[6] 0x7870c3200000 | |
| Rank 0 workspace[7] 0x7870c3400000 | |
| Rank 0 workspace[8] 0x7870b6000000 | |
| Rank 0 workspace[9] 0x7870ac000000 | |
| Rank 0 workspace[10] 0x7870a2000000 | |
| Rank 0 workspace[11] 0x787098000000 | |
| Rank 0 workspace[12] 0x789cab264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 1 workspace[0] 0x7956e2000000 | |
| Rank 1 workspace[1] 0x797c78000000 | |
| Rank 1 workspace[2] 0x79567c000000 | |
| Rank 1 workspace[3] 0x795678000000 | |
| Rank 1 workspace[4] 0x79567b000000 | |
| Rank 1 workspace[5] 0x79567ae00000 | |
| Rank 1 workspace[6] 0x79567b200000 | |
| Rank 1 workspace[7] 0x79567b400000 | |
| Rank 1 workspace[8] 0x795664000000 | |
| Rank 1 workspace[9] 0x79566e000000 | |
| Rank 1 workspace[10] 0x79565a000000 | |
| Rank 1 workspace[11] 0x795650000000 | |
| Rank 1 workspace[12] 0x79827f264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 2 workspace[0] 0x7fc778000000 | |
| Rank 2 workspace[1] 0x7fc774000000 | |
| Rank 2 workspace[2] 0x7fed74000000 | |
| Rank 2 workspace[3] 0x7fc770000000 | |
| Rank 2 workspace[4] 0x7fc773000000 | |
| Rank 2 workspace[5] 0x7fc773200000 | |
| Rank 2 workspace[6] 0x7fc772e00000 | |
| Rank 2 workspace[7] 0x7fc773400000 | |
| Rank 2 workspace[8] 0x7fc75c000000 | |
| Rank 2 workspace[9] 0x7fc752000000 | |
| Rank 2 workspace[10] 0x7fc766000000 | |
| Rank 2 workspace[11] 0x7fc748000000 | |
| Rank 2 workspace[12] 0x7ff383264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 3 workspace[0] 0x79b95c000000 | |
| Rank 3 workspace[1] 0x79b922000000 | |
| Rank 3 workspace[2] 0x79b91e000000 | |
| Rank 3 workspace[3] 0x79df12000000 | |
| Rank 3 workspace[4] 0x79b921000000 | |
| Rank 3 workspace[5] 0x79b921200000 | |
| Rank 3 workspace[6] 0x79b921400000 | |
| Rank 3 workspace[7] 0x79b920e00000 | |
| Rank 3 workspace[8] 0x79b90a000000 | |
| Rank 3 workspace[9] 0x79b900000000 | |
| Rank 3 workspace[10] 0x79b8f6000000 | |
| Rank 3 workspace[11] 0x79b914000000 | |
| Rank 3 workspace[12] 0x79e50f264400 | |
| Capturing batches: 1/28 (bs=200 avail_mem=9.39 GB) [00:02<00:57, 2.13s/it] ... 26/28 (bs=1 avail_mem=7.82 GB) [00:04<00:00, 11.39it/s] | |
| [2025-09-06 08:49:47 TP2] Registering 56 cuda graph addresses | |
| Capturing batches (bs=1 avail_mem=7.82 GB): 100%, 28/28 [00:04<00:00, 6.08it/s] | |
| [2025-09-06 08:49:47 TP1] Registering 56 cuda graph addresses | |
| [2025-09-06 08:49:47 TP0] Registering 56 cuda graph addresses | |
| [2025-09-06 08:49:47 TP3] Registering 56 cuda graph addresses | |
| [2025-09-06 08:49:47 TP0] Capture cuda graph end. Time elapsed: 5.11 s. mem usage=1.73 GB. avail mem=7.81 GB. | |
| [2025-09-06 08:49:47 TP0] max_total_num_tokens=8487040, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=131072, available_gpu_mem=7.81 GB | |
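The logged K and V sizes are consistent with per-rank KV-cache arithmetic under tp=4 and a bfloat16 cache. A sanity-check sketch, assuming gpt-oss-120b's published shape (36 layers, 8 KV heads, head_dim 64; treat these numbers as assumptions to verify against the checkpoint's config):

```python
# Sanity-check the logged "K size: 72.85 GB" per TP rank.
# Assumed model shape (from gpt-oss-120b's published config; verify
# against the checkpoint): 36 layers, 8 KV heads, head_dim 64.
num_layers, kv_heads, head_dim = 36, 8, 64
tp_size, dtype_bytes = 4, 2          # tp=4, bfloat16 = 2 bytes/element
tokens = 8_487_040                   # max_total_num_tokens from the log

bytes_per_token = num_layers * (kv_heads // tp_size) * head_dim * dtype_bytes
k_gib = tokens * bytes_per_token / 2**30
print(bytes_per_token, k_gib)  # ~9216 bytes/token for K; ~72.8 GiB, matching the logged 72.85 GB
```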
| [2025-09-06 08:49:48] INFO: Started server process [51985] | |
| [2025-09-06 08:49:48] INFO: Waiting for application startup. | |
| [2025-09-06 08:49:48] INFO: Application startup complete. | |
| [2025-09-06 08:49:48] INFO: Uvicorn running on http://127.0.0.1:8400 (Press CTRL+C to quit) | |
| [2025-09-06 08:49:49] INFO: 127.0.0.1:57250 - "GET /health_generate HTTP/1.1" 503 Service Unavailable | |
| [2025-09-06 08:49:49] INFO: 127.0.0.1:57262 - "GET /get_model_info HTTP/1.1" 200 OK | |
| [2025-09-06 08:49:49 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:49:50] INFO: 127.0.0.1:57266 - "POST /generate HTTP/1.1" 200 OK | |
| [2025-09-06 08:49:50] The server is fired up and ready to roll! | |
| [2025-09-06 08:49:59 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:50:00] INFO: 127.0.0.1:43482 - "GET /health_generate HTTP/1.1" 200 OK | |
| command=python3 -m sglang.launch_server --model-path /home/yiliu7/models/openai/gpt-oss-120b --tp 4 --cuda-graph-max-bs 200 --mem-fraction-static 0.93 --device cuda --host 127.0.0.1 --port 8400 | |
| Evaluation start: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 | |
| ChatCompletionSampler initialized with self.system_message=None self.temperature=0.1 self.max_tokens=4096 self.reasoning_effort='low' | |
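The sampler drives the eval through the server's OpenAI-compatible endpoint (the `POST /v1/chat/completions` lines that follow). A minimal sketch of one request payload using the settings logged above; how `reasoning_effort` is transported is an assumption (shown as a top-level field), the rest follows the standard chat-completions schema:

```python
import json

# One request as the eval's sampler might issue it, using the logged
# settings. The reasoning_effort field placement is an assumption;
# the remaining fields mirror the OpenAI chat-completions schema.
payload = {
    "model": "/home/yiliu7/models/openai/gpt-oss-120b",  # served_model_name
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "temperature": 0.1,
    "max_tokens": 4096,
    "reasoning_effort": "low",
}
body = json.dumps(payload)
# This body would be POSTed to http://127.0.0.1:8400/v1/chat/completions
# (host/port from the launch command in this log).
print(len(body))
```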
| 0/198 [00:00<?, ?it/s] | |
| [2025-09-06 08:50:00 TP0] Prefill batch. #new-seq: 1, #new-token: 320, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:50:00 TP0] Prefill batch. #new-seq: 1, #new-token: 384, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0, | |
| [2025-09-06 08:50:01 TP0] Prefill batch. #new-seq: 8, #new-token: 1920, #cached-token: 512, token usage: 0.00, #running-req: 2, #queue-req: 0, | |
| [2025-09-06 08:50:01 TP0] Prefill batch. #new-seq: 11, #new-token: 2688, #cached-token: 704, token usage: 0.00, #running-req: 10, #queue-req: 0, | |
| [2025-09-06 08:50:01 TP0] Prefill batch. #new-seq: 10, #new-token: 2432, #cached-token: 640, token usage: 0.00, #running-req: 21, #queue-req: 0, | |
| [2025-09-06 08:50:01 TP0] Prefill batch. #new-seq: 27, #new-token: 9728, #cached-token: 1728, token usage: 0.00, #running-req: 31, #queue-req: 0, | |
| [2025-09-06 08:50:01 TP0] Prefill batch. #new-seq: 15, #new-token: 3648, #cached-token: 960, token usage: 0.00, #running-req: 58, #queue-req: 0, | |
| [2025-09-06 08:50:01 TP0] Prefill batch. #new-seq: 36, #new-token: 9856, #cached-token: 2304, token usage: 0.00, #running-req: 73, #queue-req: 0, | |
| [2025-09-06 08:50:01 TP0] Prefill batch. #new-seq: 21, #new-token: 6144, #cached-token: 1408, token usage: 0.00, #running-req: 109, #queue-req: 0, | |
| [2025-09-06 08:50:01 TP0] Prefill batch. #new-seq: 47, #new-token: 12800, #cached-token: 3072, token usage: 0.00, #running-req: 130, #queue-req: 0, | |
| [2025-09-06 08:50:01 TP0] Prefill batch. #new-seq: 21, #new-token: 6720, #cached-token: 1408, token usage: 0.01, #running-req: 177, #queue-req: 0, | |
| [2025-09-06 08:50:01 TP0] Decode batch. #running-req: 198, #token: 62848, token usage: 0.01, cuda graph: True, gen throughput (token/s): 422.02, #queue-req: 0, | |
| [2025-09-06 08:50:01] INFO: 127.0.0.1:43990 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:02 TP0] Decode batch. #running-req: 197, #token: 70336, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17371.23, #queue-req: 0, | |
| [2025-09-06 08:50:02] INFO: 127.0.0.1:44218 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:02] INFO: 127.0.0.1:45000 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:02] INFO: 127.0.0.1:43588 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:02] INFO: 127.0.0.1:44960 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:02] INFO: 127.0.0.1:44392 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:02] INFO: 127.0.0.1:44352 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:02] INFO: 127.0.0.1:44078 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:02] INFO: 127.0.0.1:44466 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:02 TP0] Decode batch. #running-req: 190, #token: 74816, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17163.76, #queue-req: 0, | |
| [2025-09-06 08:50:02] INFO: 127.0.0.1:45142 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:02] INFO: 127.0.0.1:43690 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:02] INFO: 127.0.0.1:43774 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:02] INFO: 127.0.0.1:43610 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:02] INFO: 127.0.0.1:43786 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:02] INFO: 127.0.0.1:44138 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:02] INFO: 127.0.0.1:44304 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:02] INFO: 127.0.0.1:44778 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:02] INFO: 127.0.0.1:43626 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:02] INFO: 127.0.0.1:44232 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:02] INFO: 127.0.0.1:43944 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:03] INFO: 127.0.0.1:44650 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:03] INFO: 127.0.0.1:43508 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:03] INFO: 127.0.0.1:44698 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:03 TP0] Decode batch. #running-req: 175, #token: 72512, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16427.91, #queue-req: 0, | |
| [2025-09-06 08:50:03] INFO: 127.0.0.1:45202 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:03] INFO: 127.0.0.1:45070 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:03] INFO: 127.0.0.1:43628 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:03] INFO: 127.0.0.1:44836 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:03] INFO: 127.0.0.1:44066 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:03] INFO: 127.0.0.1:43518 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:03] INFO: 127.0.0.1:45014 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:03] INFO: 127.0.0.1:44338 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:03] INFO: 127.0.0.1:44012 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:03 TP0] Decode batch. #running-req: 167, #token: 76032, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15692.31, #queue-req: 0, | |
| [2025-09-06 08:50:03] INFO: 127.0.0.1:45040 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:44714 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:43904 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:44410 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:45106 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04 TP0] Decode batch. #running-req: 164, #token: 79872, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14907.71, #queue-req: 0, | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:44388 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:44790 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:45188 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:45110 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:44086 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:44762 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:43812 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:45148 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:43902 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:44206 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:45198 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:44374 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04 TP0] Decode batch. #running-req: 149, #token: 79872, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14000.20, #queue-req: 0, | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:44190 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:44846 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:44608 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:43752 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:43858 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:44536 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:43704 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:44048 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:44560 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:44624 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:44982 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04 TP0] Decode batch. #running-req: 138, #token: 78912, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13241.58, #queue-req: 0, | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:44932 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:04] INFO: 127.0.0.1:45036 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:43620 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:43866 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:44816 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:45008 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:43568 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:44362 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:44162 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:44420 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05 TP0] Decode batch. #running-req: 128, #token: 79424, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13042.26, #queue-req: 0, | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:45090 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:43834 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:44498 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:45116 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:44482 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:44334 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:43848 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:43664 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:44704 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05 TP0] Decode batch. #running-req: 119, #token: 78336, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16564.89, #queue-req: 0, | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:45252 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:43958 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:44914 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:45258 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:44774 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:44398 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:43554 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:44406 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:43874 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:44684 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:44948 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:43646 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:43830 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05] INFO: 127.0.0.1:44554 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:05 TP0] Decode batch. #running-req: 105, #token: 74176, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15339.23, #queue-req: 0, | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:44254 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:43816 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:44852 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:44116 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:44506 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:44126 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:44582 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06 TP0] Decode batch. #running-req: 98, #token: 72448, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14430.93, #queue-req: 0, | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:43886 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:44820 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:43986 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:43654 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:45210 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:43922 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:43896 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:43758 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:44728 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06 TP0] Decode batch. #running-req: 89, #token: 68352, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13443.96, #queue-req: 0, | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:43722 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:44546 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:44864 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:44272 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:43798 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:44640 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:45192 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:45266 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:45124 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:45058 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:44566 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:44726 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06 TP0] Decode batch. #running-req: 77, #token: 63232, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12632.89, #queue-req: 0, | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:43974 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:43636 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:44490 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:45084 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:44064 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:06] INFO: 127.0.0.1:44376 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07 TP0] Decode batch. #running-req: 71, #token: 60864, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11695.44, #queue-req: 0, | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:43602 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:43810 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:44036 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:43918 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:45052 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:45108 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:44030 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07 TP0] Decode batch. #running-req: 64, #token: 57792, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10662.01, #queue-req: 0, | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:44664 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:43996 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:43890 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:43762 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:44200 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:44306 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:44744 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07 TP0] Decode batch. #running-req: 57, #token: 53632, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9975.86, #queue-req: 0, | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:44980 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:45200 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:43528 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:45170 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:43510 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:44436 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:44094 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:44074 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07 TP0] Decode batch. #running-req: 50, #token: 47808, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9000.03, #queue-req: 0, | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:45138 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:45238 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:44154 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:44880 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:07] INFO: 127.0.0.1:44860 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:08 TP0] Decode batch. #running-req: 44, #token: 45056, token usage: 0.01, cuda graph: True, gen throughput (token/s): 8202.58, #queue-req: 0, | |
| [2025-09-06 08:50:08] INFO: 127.0.0.1:44968 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:08] INFO: 127.0.0.1:43688 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:08] INFO: 127.0.0.1:44670 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:08] INFO: 127.0.0.1:43574 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:08] INFO: 127.0.0.1:44800 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:08 TP0] Decode batch. #running-req: 39, #token: 41408, token usage: 0.00, cuda graph: True, gen throughput (token/s): 7408.45, #queue-req: 0, | |
| [2025-09-06 08:50:08] INFO: 127.0.0.1:44592 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:08] INFO: 127.0.0.1:43706 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:08] INFO: 127.0.0.1:44020 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:08 TP0] Decode batch. #running-req: 36, #token: 39680, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6741.61, #queue-req: 0, | |
| [2025-09-06 08:50:08] INFO: 127.0.0.1:43934 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:08] INFO: 127.0.0.1:44666 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:08] INFO: 127.0.0.1:44794 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:08] INFO: 127.0.0.1:44318 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:08 TP0] Decode batch. #running-req: 32, #token: 36544, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6085.82, #queue-req: 0, | |
| [2025-09-06 08:50:08] INFO: 127.0.0.1:45028 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:08] INFO: 127.0.0.1:43542 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:08] INFO: 127.0.0.1:44520 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:08] INFO: 127.0.0.1:44176 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:08 TP0] Decode batch. #running-req: 29, #token: 33024, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5870.70, #queue-req: 0, | |
| [2025-09-06 08:50:09 TP0] Decode batch. #running-req: 28, #token: 34176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5250.06, #queue-req: 0, | |
| [2025-09-06 08:50:09] INFO: 127.0.0.1:44088 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:09] INFO: 127.0.0.1:44248 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:09] INFO: 127.0.0.1:44916 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:09] INFO: 127.0.0.1:44750 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:09 TP0] Decode batch. #running-req: 24, #token: 30400, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4915.10, #queue-req: 0, | |
| [2025-09-06 08:50:09] INFO: 127.0.0.1:44444 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:09] INFO: 127.0.0.1:44908 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:09] INFO: 127.0.0.1:44504 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:09] INFO: 127.0.0.1:43496 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 1/198 [00:08<28:17, 8.62s/it] | |
| [2025-09-06 08:50:09 TP0] Decode batch. #running-req: 20, #token: 26112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4419.02, #queue-req: 0, | |
| [2025-09-06 08:50:09] INFO: 127.0.0.1:44770 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:09] INFO: 127.0.0.1:44984 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:09] INFO: 127.0.0.1:44100 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:09] INFO: 127.0.0.1:45064 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:09] INFO: 127.0.0.1:44584 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 6%|▌ | 12/198 [00:08<01:38, 1.88it/s] | |
| [2025-09-06 08:50:09] INFO: 127.0.0.1:43744 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:09 TP0] Decode batch. #running-req: 14, #token: 19328, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3433.39, #queue-req: 0, | |
| [2025-09-06 08:50:09] INFO: 127.0.0.1:44288 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:09] INFO: 127.0.0.1:44458 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:09 TP0] Decode batch. #running-req: 12, #token: 16896, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2626.78, #queue-req: 0, | |
| [2025-09-06 08:50:10] INFO: 127.0.0.1:44892 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:10] INFO: 127.0.0.1:44296 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:10 TP0] Decode batch. #running-req: 10, #token: 14464, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2343.56, #queue-req: 0, | |
| [2025-09-06 08:50:10] INFO: 127.0.0.1:43728 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:10 TP0] Decode batch. #running-req: 9, #token: 13504, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2011.41, #queue-req: 0, | |
| [2025-09-06 08:50:10] INFO: 127.0.0.1:45224 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:10] INFO: 127.0.0.1:45160 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:10 TP0] Decode batch. #running-req: 7, #token: 10688, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1811.20, #queue-req: 0, | |
| [2025-09-06 08:50:10 TP0] Decode batch. #running-req: 7, #token: 10880, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1671.46, #queue-req: 0, | |
| [2025-09-06 08:50:10 TP0] Decode batch. #running-req: 7, #token: 11264, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1669.39, #queue-req: 0, | |
| [2025-09-06 08:50:10] INFO: 127.0.0.1:44730 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 12%|█▏ | 24/198 [00:10<00:48, 3.59it/s] | |
| [2025-09-06 08:50:10 TP0] Decode batch. #running-req: 6, #token: 8448, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1617.26, #queue-req: 0, | |
| [2025-09-06 08:50:10] INFO: 127.0.0.1:44264 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:11 TP0] Decode batch. #running-req: 5, #token: 8576, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1209.93, #queue-req: 0, | |
| [2025-09-06 08:50:11 TP0] Decode batch. #running-req: 5, #token: 8832, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1203.85, #queue-req: 0, | |
| [2025-09-06 08:50:11 TP0] Decode batch. #running-req: 5, #token: 9024, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1194.25, #queue-req: 0, | |
| [2025-09-06 08:50:11] INFO: 127.0.0.1:45182 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:11] INFO: 127.0.0.1:43818 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:11 TP0] Decode batch. #running-req: 3, #token: 5312, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1010.18, #queue-req: 0, | |
| [2025-09-06 08:50:11 TP0] Decode batch. #running-req: 3, #token: 5440, token usage: 0.00, cuda graph: True, gen throughput (token/s): 800.43, #queue-req: 0, | |
| [2025-09-06 08:50:11 TP0] Decode batch. #running-req: 3, #token: 5504, token usage: 0.00, cuda graph: True, gen throughput (token/s): 801.81, #queue-req: 0, | |
| [2025-09-06 08:50:12] INFO: 127.0.0.1:45030 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:50:12] INFO: 127.0.0.1:44822 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 15%|█▌ | 30/198 [00:11<00:41, 4.01it/s] | |
| [2025-09-06 08:50:12 TP0] Decode batch. #running-req: 1, #token: 1920, token usage: 0.00, cuda graph: True, gen throughput (token/s): 607.06, #queue-req: 0, | |
| [2025-09-06 08:50:12 TP0] Decode batch. #running-req: 1, #token: 1984, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.95, #queue-req: 0, | |
| [2025-09-06 08:50:12 TP0] Decode batch. #running-req: 1, #token: 2048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.35, #queue-req: 0, | |
| [2025-09-06 08:50:12 TP0] Decode batch. #running-req: 1, #token: 2048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 331.17, #queue-req: 0, | |
| [2025-09-06 08:50:12 TP0] Decode batch. #running-req: 1, #token: 2112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.16, #queue-req: 0, | |
| [2025-09-06 08:50:12 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.91, #queue-req: 0, | |
| [2025-09-06 08:50:12 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.34, #queue-req: 0, | |
| [2025-09-06 08:50:12 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.84, #queue-req: 0, | |
| [2025-09-06 08:50:13 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.21, #queue-req: 0, | |
| [2025-09-06 08:50:13 TP0] Decode batch. #running-req: 1, #token: 2304, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.80, #queue-req: 0, | |
| [2025-09-06 08:50:13 TP0] Decode batch. #running-req: 1, #token: 2368, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.60, #queue-req: 0, | |
| [2025-09-06 08:50:13 TP0] Decode batch. #running-req: 1, #token: 2368, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.75, #queue-req: 0, | |
| [2025-09-06 08:50:13 TP0] Decode batch. #running-req: 1, #token: 2432, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.77, #queue-req: 0, | |
| [2025-09-06 08:50:13 TP0] Decode batch. #running-req: 1, #token: 2496, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.77, #queue-req: 0, | |
| [2025-09-06 08:50:13 TP0] Decode batch. #running-req: 1, #token: 2496, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.75, #queue-req: 0, | |
| [2025-09-06 08:50:13 TP0] Decode batch. #running-req: 1, #token: 2560, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.93, #queue-req: 0, | |
| [2025-09-06 08:50:14 TP0] Decode batch. #running-req: 1, #token: 2560, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.56, #queue-req: 0, | |
| [2025-09-06 08:50:14 TP0] Decode batch. #running-req: 1, #token: 2624, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.80, #queue-req: 0, | |
| [2025-09-06 08:50:14 TP0] Decode batch. #running-req: 1, #token: 2688, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.02, #queue-req: 0, | |
| [2025-09-06 08:50:14 TP0] Decode batch. #running-req: 1, #token: 2688, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.23, #queue-req: 0, | |
| [2025-09-06 08:50:14 TP0] Decode batch. #running-req: 1, #token: 2752, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.06, #queue-req: 0, | |
| [2025-09-06 08:50:14 TP0] Decode batch. #running-req: 1, #token: 2816, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.64, #queue-req: 0, | |
| [2025-09-06 08:50:14 TP0] Decode batch. #running-req: 1, #token: 2816, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.15, #queue-req: 0, | |
| [2025-09-06 08:50:14 TP0] Decode batch. #running-req: 1, #token: 2880, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.99, #queue-req: 0, | |
| [2025-09-06 08:50:14 TP0] Decode batch. #running-req: 1, #token: 2880, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.25, #queue-req: 0, | |
| [2025-09-06 08:50:15 TP0] Decode batch. #running-req: 1, #token: 2944, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.01, #queue-req: 0, | |
| [2025-09-06 08:50:15 TP0] Decode batch. #running-req: 1, #token: 3008, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.00, #queue-req: 0, | |
| [2025-09-06 08:50:15 TP0] Decode batch. #running-req: 1, #token: 3008, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.25, #queue-req: 0, | |
| [2025-09-06 08:50:15 TP0] Decode batch. #running-req: 1, #token: 3072, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.11, #queue-req: 0, | |
| [2025-09-06 08:50:15 TP0] Decode batch. #running-req: 1, #token: 3136, token usage: 0.00, cuda graph: True, gen throughput (token/s): 326.28, #queue-req: 0, | |
| [2025-09-06 08:50:15 TP0] Decode batch. #running-req: 1, #token: 3136, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.20, #queue-req: 0, | |
| [2025-09-06 08:50:15 TP0] Decode batch. #running-req: 1, #token: 3200, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.24, #queue-req: 0, | |
| [2025-09-06 08:50:15 TP0] Decode batch. #running-req: 1, #token: 3200, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.81, #queue-req: 0, | |
| [2025-09-06 08:50:16 TP0] Decode batch. #running-req: 1, #token: 3264, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.10, #queue-req: 0, | |
| [2025-09-06 08:50:16 TP0] Decode batch. #running-req: 1, #token: 3328, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.98, #queue-req: 0, | |
| [2025-09-06 08:50:16 TP0] Decode batch. #running-req: 1, #token: 3328, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.30, #queue-req: 0, | |
| [2025-09-06 08:50:16 TP0] Decode batch. #running-req: 1, #token: 3392, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.75, #queue-req: 0, | |
| [2025-09-06 08:50:16 TP0] Decode batch. #running-req: 1, #token: 3456, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.72, #queue-req: 0, | |
| [2025-09-06 08:50:16 TP0] Decode batch. #running-req: 1, #token: 3456, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.36, #queue-req: 0, | |
| [2025-09-06 08:50:16 TP0] Decode batch. #running-req: 1, #token: 3520, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.55, #queue-req: 0, | |
| [2025-09-06 08:50:16 TP0] Decode batch. #running-req: 1, #token: 3520, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.20, #queue-req: 0, | |
| [2025-09-06 08:50:17 TP0] Decode batch. #running-req: 1, #token: 3584, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.35, #queue-req: 0, | |
| [2025-09-06 08:50:17 TP0] Decode batch. #running-req: 1, #token: 3648, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.56, #queue-req: 0, | |
| [2025-09-06 08:50:17 TP0] Decode batch. #running-req: 1, #token: 3648, token usage: 0.00, cuda graph: True, gen throughput (token/s): 326.78, #queue-req: 0, | |
| [2025-09-06 08:50:17 TP0] Decode batch. #running-req: 1, #token: 3712, token usage: 0.00, cuda graph: True, gen throughput (token/s): 326.70, #queue-req: 0, | |
| [2025-09-06 08:50:17 TP0] Decode batch. #running-req: 1, #token: 3776, token usage: 0.00, cuda graph: True, gen throughput (token/s): 326.99, #queue-req: 0, | |
| [2025-09-06 08:50:17 TP0] Decode batch. #running-req: 1, #token: 3776, token usage: 0.00, cuda graph: True, gen throughput (token/s): 326.84, #queue-req: 0, | |
| [2025-09-06 08:50:17 TP0] Decode batch. #running-req: 1, #token: 3840, token usage: 0.00, cuda graph: True, gen throughput (token/s): 326.78, #queue-req: 0, | |
| [2025-09-06 08:50:17 TP0] Decode batch. #running-req: 1, #token: 3840, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.08, #queue-req: 0, | |
| [2025-09-06 08:50:18 TP0] Decode batch. #running-req: 1, #token: 3904, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.42, #queue-req: 0, | |
| [2025-09-06 08:50:18 TP0] Decode batch. #running-req: 1, #token: 3968, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.55, #queue-req: 0, | |
| [2025-09-06 08:50:18 TP0] Decode batch. #running-req: 1, #token: 3968, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.20, #queue-req: 0, | |
| [2025-09-06 08:50:18 TP0] Decode batch. #running-req: 1, #token: 4032, token usage: 0.00, cuda graph: True, gen throughput (token/s): 318.00, #queue-req: 0, | |
| [2025-09-06 08:50:18 TP0] Decode batch. #running-req: 1, #token: 4096, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.58, #queue-req: 0, | |
| [2025-09-06 08:50:18 TP0] Decode batch. #running-req: 1, #token: 4096, token usage: 0.00, cuda graph: True, gen throughput (token/s): 328.54, #queue-req: 0, | |
| [2025-09-06 08:50:18 TP0] Decode batch. #running-req: 1, #token: 4160, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.81, #queue-req: 0, | |
| [2025-09-06 08:50:18 TP0] Decode batch. #running-req: 1, #token: 4160, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.50, #queue-req: 0, | |
| [2025-09-06 08:50:19 TP0] Decode batch. #running-req: 1, #token: 4224, token usage: 0.00, cuda graph: True, gen throughput (token/s): 326.93, #queue-req: 0, | |
| [2025-09-06 08:50:19 TP0] Decode batch. #running-req: 1, #token: 4288, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.17, #queue-req: 0, | |
| [2025-09-06 08:50:19 TP0] Decode batch. #running-req: 1, #token: 4288, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.33, #queue-req: 0, | |
| [2025-09-06 08:50:19 TP0] Decode batch. #running-req: 1, #token: 4352, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.43, #queue-req: 0, | |
| [2025-09-06 08:50:19 TP0] Decode batch. #running-req: 1, #token: 4416, token usage: 0.00, cuda graph: True, gen throughput (token/s): 327.05, #queue-req: 0, | |
| [2025-09-06 08:50:19] INFO: 127.0.0.1:43678 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 56%|█████▌ | 110/198 [00:18<00:10, 8.49it/s] | |
| 100%|██████████| 198/198 [00:18<00:00, 10.57it/s] | |
| /usr/lib/python3.12/subprocess.py:1127: ResourceWarning: subprocess 51985 is still running | |
| _warn("subprocess %s is still running" % self.pid, | |
| ResourceWarning: Enable tracemalloc to get the object allocation traceback | |
| . | |
| ---------------------------------------------------------------------- | |
| Ran 1 test in 182.997s | |
| OK | |
| Writing report to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.html | |
| {'chars': 1658.3989898989898, 'chars:std': 1027.2337509987194, 'score:std': 0.478067053179767, 'score': 0.6464646464646465} | |
| Writing results to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.json | |
| Total latency: 18.779 s | |
| Score: 0.646 | |
| Evaluation end: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 metrics={'chars': 1658.3989898989898, 'chars:std': 1027.2337509987194, 'score:std': 0.478067053179767, 'score': 0.6464646464646465} | |
| ================================================================================ | |
| Run 9: | |
| Auto-configed device: cuda | |
| WARNING:sglang.srt.server_args:Detected SM100 and MXFP4 quantization format for GPT-OSS model, enabling FlashInfer MXFP4 MOE kernel. | |
| WARNING:sglang.srt.server_args:TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64. | |
| [2025-09-06 08:50:34] server_args=ServerArgs(model_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8400, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.93, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=True, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=8902787, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/home/yiliu7/models/openai/gpt-oss-120b', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='flashinfer_mxfp4', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=200, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False) | |
| All deep_gemm operations loaded successfully! | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:50:34] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:50:34] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:50:34] Using default HuggingFace chat template with detected content format: string | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:50:41 TP2] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:50:41 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:50:41 TP1] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:50:41 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:50:41 TP3] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:50:41 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:50:41 TP0] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:50:41 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:50:41 TP2] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:50:41 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:50:42 TP1] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:50:42 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:50:42 TP3] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:50:42 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:50:42 TP0] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:50:42 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:50:42 TP0] Init torch distributed begin. | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [2025-09-06 08:50:44 TP0] sglang is using nccl==2.27.3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [2025-09-06 08:50:45 TP0] Init torch distributed ends. mem usage=1.46 GB | |
| [2025-09-06 08:50:46 TP0] Load weight begin. avail mem=176.28 GB | |
| All deep_gemm operations loaded successfully! | |
| Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s] | |
| Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 1553.52it/s] | |
| All deep_gemm operations loaded successfully! | |
| [2025-09-06 08:51:01 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.0.mlp.experts), it might take a while... | |
| All deep_gemm operations loaded successfully! | |
| All deep_gemm operations loaded successfully! | |
| [2025-09-06 08:51:04 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.1.mlp.experts), it might take a while... | |
| [2025-09-06 08:51:07 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.2.mlp.experts), it might take a while... | |
| [2025-09-06 08:51:10 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.3.mlp.experts), it might take a while... | |
| [2025-09-06 08:51:13 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.4.mlp.experts), it might take a while... | |
| [2025-09-06 08:51:16 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.5.mlp.experts), it might take a while... | |
| [2025-09-06 08:51:19 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.6.mlp.experts), it might take a while... | |
| [2025-09-06 08:51:22 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.7.mlp.experts), it might take a while... | |
| [2025-09-06 08:51:25 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.8.mlp.experts), it might take a while... | |
| [2025-09-06 08:51:28 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.9.mlp.experts), it might take a while... | |
| [2025-09-06 08:51:32 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.10.mlp.experts), it might take a while... | |
| [2025-09-06 08:51:35 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.11.mlp.experts), it might take a while... | |
| [2025-09-06 08:51:38 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.12.mlp.experts), it might take a while... | |
| [2025-09-06 08:51:41 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.13.mlp.experts), it might take a while... | |
| [2025-09-06 08:51:44 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.14.mlp.experts), it might take a while... | |
| [2025-09-06 08:51:47 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.15.mlp.experts), it might take a while... | |
| [2025-09-06 08:51:50 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.16.mlp.experts), it might take a while... | |
| [2025-09-06 08:51:53 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.17.mlp.experts), it might take a while... | |
| [2025-09-06 08:51:56 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.18.mlp.experts), it might take a while... | |
| [2025-09-06 08:51:59 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.19.mlp.experts), it might take a while... | |
| [2025-09-06 08:52:02 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.20.mlp.experts), it might take a while... | |
| [2025-09-06 08:52:05 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.21.mlp.experts), it might take a while... | |
| [2025-09-06 08:52:08 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.22.mlp.experts), it might take a while... | |
| [2025-09-06 08:52:11 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.23.mlp.experts), it might take a while... | |
| [2025-09-06 08:52:15 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.24.mlp.experts), it might take a while... | |
| [2025-09-06 08:52:18 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.25.mlp.experts), it might take a while... | |
| [2025-09-06 08:52:21 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.26.mlp.experts), it might take a while... | |
| [2025-09-06 08:52:24 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.27.mlp.experts), it might take a while... | |
| [2025-09-06 08:52:27 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.28.mlp.experts), it might take a while... | |
| [2025-09-06 08:52:30 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.29.mlp.experts), it might take a while... | |
| [2025-09-06 08:52:33 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.30.mlp.experts), it might take a while... | |
| [2025-09-06 08:52:36 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.31.mlp.experts), it might take a while... | |
| [2025-09-06 08:52:39 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.32.mlp.experts), it might take a while... | |
| [2025-09-06 08:52:42 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.33.mlp.experts), it might take a while... | |
| [2025-09-06 08:52:45 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.34.mlp.experts), it might take a while... | |
| [2025-09-06 08:52:48 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.35.mlp.experts), it might take a while... | |
| [2025-09-06 08:52:51 TP0] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=158.06 GB, mem usage=18.22 GB. | |
| [2025-09-06 08:52:51 TP0] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:52:51 TP0] Memory pool end. avail mem=10.23 GB | |
| [2025-09-06 08:52:51 TP1] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:52:51 TP3] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:52:51 TP2] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
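A quick sanity check on the logged per-rank K-cache size (72.85 GB for 8,487,040 tokens): a minimal sketch, assuming gpt-oss-120b uses 36 layers, 8 KV heads, head_dim 64, and bfloat16 KV entries sharded across tp=4 — the model shape is inferred from the arithmetic, not confirmed by the log.

```python
# Sanity-check the logged per-rank K-cache size against an assumed model shape.
# Assumptions (inferred, not confirmed by the log): 36 layers, 8 KV heads,
# head_dim 64, bfloat16 entries (2 bytes), tensor-parallel size 4.
num_layers, num_kv_heads, head_dim = 36, 8, 64
bytes_per_elem, tp_size = 2, 4
num_tokens = 8_487_040  # "#tokens" from the KV-cache allocation line

# Per-token K bytes on one TP rank: layers * (kv_heads / tp) * head_dim * dtype size.
k_bytes_per_token = num_layers * (num_kv_heads // tp_size) * head_dim * bytes_per_elem
k_total_gib = k_bytes_per_token * num_tokens / 2**30
print(f"{k_total_gib:.2f} GiB")  # ~72.84 GiB, matching the logged 72.85 GB K size
```

The V cache is the same size, so K + V together account for the ~145.7 GB drop between "Load weight end" and "Memory pool end".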
| [2025-09-06 08:52:52 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=9.54 GB | |
| [2025-09-06 08:52:52 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200] | |
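The captured batch-size list above follows a simple pattern; here is a sketch that reproduces it, assuming the heuristic is "powers of two up to 8, then multiples of 8 up to --cuda-graph-max-bs" — inferred from this log, not taken from SGLang's source.

```python
# Reproduce the CUDA-graph capture batch sizes from the log above.
# Assumed heuristic (inferred from this log, not from SGLang source):
# powers of two up to 8, then multiples of 8 up to --cuda-graph-max-bs.
def capture_batch_sizes(max_bs: int) -> list[int]:
    sizes = [bs for bs in (1, 2, 4, 8) if bs <= max_bs]
    sizes += list(range(16, max_bs + 1, 8))
    return sizes

print(capture_batch_sizes(200))
# [1, 2, 4, 8, 16, 24, ..., 200] -- 28 sizes, matching the "0/28" progress bar
```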
| Capturing batches (bs=200 avail_mem=9.39 GB):   0%|          | 0/28 [00:00<?, ?it/s] | |
| rank 2 allocated ipc_handles: [['0x7db224000000', '0x7db1c0000000', '0x7dd7ba000000', '0x7db1bc000000'], ['0x7db1bf000000', '0x7db1bf200000', '0x7db1bee00000', '0x7db1bf400000'], ['0x7db1a8000000', '0x7db19e000000', '0x7db1b2000000', '0x7db194000000']] | |
| [2025-09-06 08:52:54.237] [info] lamportInitialize start: buffer: 0x7db1b2000000, size: 71303168 | |
| rank 3 allocated ipc_handles: [['0x77f504000000', '0x77f4a8000000', '0x77f4a4000000', '0x781aa0000000'], ['0x77f4a7000000', '0x77f4a7200000', '0x77f4a7400000', '0x77f4a6e00000'], ['0x77f490000000', '0x77f486000000', '0x77f47c000000', '0x77f49a000000']] | |
| [2025-09-06 08:52:54.285] [info] lamportInitialize start: buffer: 0x77f49a000000, size: 71303168 | |
| rank 0 allocated ipc_handles: [['0x771a6e000000', '0x76f4d2000000', '0x76f47c000000', '0x76f478000000'], ['0x76f47ae00000', '0x76f47b000000', '0x76f47b200000', '0x76f47b400000'], ['0x76f46e000000', '0x76f464000000', '0x76f45a000000', '0x76f450000000']] | |
| [2025-09-06 08:52:54.334] [info] lamportInitialize start: buffer: 0x76f46e000000, size: 71303168 | |
| rank 1 allocated ipc_handles: [['0x7f92a2000000', '0x7fb838000000', '0x7f923c000000', '0x7f9238000000'], ['0x7f923b000000', '0x7f923ae00000', '0x7f923b200000', '0x7f923b400000'], ['0x7f9224000000', '0x7f922e000000', '0x7f921a000000', '0x7f9210000000']] | |
| [2025-09-06 08:52:54.384] [info] lamportInitialize start: buffer: 0x7f922e000000, size: 71303168 | |
| [2025-09-06 08:52:54 TP0] FlashInfer workspace initialized for rank 0, world_size 4 | |
| [2025-09-06 08:52:54 TP1] FlashInfer workspace initialized for rank 1, world_size 4 | |
| [2025-09-06 08:52:54 TP3] FlashInfer workspace initialized for rank 3, world_size 4 | |
| [2025-09-06 08:52:54 TP2] FlashInfer workspace initialized for rank 2, world_size 4 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 0 workspace[0] 0x771a6e000000 | |
| Rank 0 workspace[1] 0x76f4d2000000 | |
| Rank 0 workspace[2] 0x76f47c000000 | |
| Rank 0 workspace[3] 0x76f478000000 | |
| Rank 0 workspace[4] 0x76f47ae00000 | |
| Rank 0 workspace[5] 0x76f47b000000 | |
| Rank 0 workspace[6] 0x76f47b200000 | |
| Rank 0 workspace[7] 0x76f47b400000 | |
| Rank 0 workspace[8] 0x76f46e000000 | |
| Rank 0 workspace[9] 0x76f464000000 | |
| Rank 0 workspace[10] 0x76f45a000000 | |
| Rank 0 workspace[11] 0x76f450000000 | |
| Rank 0 workspace[12] 0x772063264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 3 workspace[0] 0x77f504000000 | |
| Rank 3 workspace[1] 0x77f4a8000000 | |
| Rank 3 workspace[2] 0x77f4a4000000 | |
| Rank 3 workspace[3] 0x781aa0000000 | |
| Rank 3 workspace[4] 0x77f4a7000000 | |
| Rank 3 workspace[5] 0x77f4a7200000 | |
| Rank 3 workspace[6] 0x77f4a7400000 | |
| Rank 3 workspace[7] 0x77f4a6e00000 | |
| Rank 3 workspace[8] 0x77f490000000 | |
| Rank 3 workspace[9] 0x77f486000000 | |
| Rank 3 workspace[10] 0x77f47c000000 | |
| Rank 3 workspace[11] 0x77f49a000000 | |
| Rank 3 workspace[12] 0x78209b264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 1 workspace[0] 0x7f92a2000000 | |
| Rank 1 workspace[1] 0x7fb838000000 | |
| Rank 1 workspace[2] 0x7f923c000000 | |
| Rank 1 workspace[3] 0x7f9238000000 | |
| Rank 1 workspace[4] 0x7f923b000000 | |
| Rank 1 workspace[5] 0x7f923ae00000 | |
| Rank 1 workspace[6] 0x7f923b200000 | |
| Rank 1 workspace[7] 0x7f923b400000 | |
| Rank 1 workspace[8] 0x7f9224000000 | |
| Rank 1 workspace[9] 0x7f922e000000 | |
| Rank 1 workspace[10] 0x7f921a000000 | |
| Rank 1 workspace[11] 0x7f9210000000 | |
| Rank 1 workspace[12] 0x7fbe43264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 2 workspace[0] 0x7db224000000 | |
| Rank 2 workspace[1] 0x7db1c0000000 | |
| Rank 2 workspace[2] 0x7dd7ba000000 | |
| Rank 2 workspace[3] 0x7db1bc000000 | |
| Rank 2 workspace[4] 0x7db1bf000000 | |
| Rank 2 workspace[5] 0x7db1bf200000 | |
| Rank 2 workspace[6] 0x7db1bee00000 | |
| Rank 2 workspace[7] 0x7db1bf400000 | |
| Rank 2 workspace[8] 0x7db1a8000000 | |
| Rank 2 workspace[9] 0x7db19e000000 | |
| Rank 2 workspace[10] 0x7db1b2000000 | |
| Rank 2 workspace[11] 0x7db194000000 | |
| Rank 2 workspace[12] 0x7dddc3264400 | |
| Capturing batches (bs=200 avail_mem=9.39 GB):   4%|▎ | 1/28 [00:02<00:56, 2.08s/it] | |
| Capturing batches (bs=1 avail_mem=7.82 GB): 100%|██████████| 28/28 [00:04<00:00, 6.37it/s] | |
| [2025-09-06 08:52:57 TP0] Registering 56 cuda graph addresses | |
| [2025-09-06 08:52:57 TP1] Registering 56 cuda graph addresses | |
| [2025-09-06 08:52:57 TP3] Registering 56 cuda graph addresses | |
| [2025-09-06 08:52:57 TP2] Registering 56 cuda graph addresses | |
| [2025-09-06 08:52:57 TP0] Capture cuda graph end. Time elapsed: 4.91 s. mem usage=1.73 GB. avail mem=7.81 GB. | |
| [2025-09-06 08:52:57 TP0] max_total_num_tokens=8487040, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=131072, available_gpu_mem=7.81 GB | |
| [2025-09-06 08:52:58] INFO: Started server process [54406] | |
| [2025-09-06 08:52:58] INFO: Waiting for application startup. | |
| [2025-09-06 08:52:58] INFO: Application startup complete. | |
| [2025-09-06 08:52:58] INFO: Uvicorn running on http://127.0.0.1:8400 (Press CTRL+C to quit) | |
| [2025-09-06 08:52:58] INFO: 127.0.0.1:36566 - "GET /health_generate HTTP/1.1" 503 Service Unavailable | |
| [2025-09-06 08:52:59] INFO: 127.0.0.1:36570 - "GET /get_model_info HTTP/1.1" 200 OK | |
| [2025-09-06 08:52:59 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:53:00] INFO: 127.0.0.1:36572 - "POST /generate HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:00] The server is fired up and ready to roll! | |
| [2025-09-06 08:53:08 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:53:09] INFO: 127.0.0.1:46788 - "GET /health_generate HTTP/1.1" 200 OK | |
| command=python3 -m sglang.launch_server --model-path /home/yiliu7/models/openai/gpt-oss-120b --tp 4 --cuda-graph-max-bs 200 --mem-fraction-static 0.93 --device cuda --host 127.0.0.1 --port 8400 | |
| Evaluation start: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 | |
| ChatCompletionSampler initialized with self.system_message=None self.temperature=0.1 self.max_tokens=4096 self.reasoning_effort='low' | |
| 0%| | 0/198 [00:00<?, ?it/s][2025-09-06 08:53:10 TP0] Prefill batch. #new-seq: 1, #new-token: 320, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:53:10 TP0] Prefill batch. #new-seq: 10, #new-token: 3136, #cached-token: 640, token usage: 0.00, #running-req: 1, #queue-req: 0, | |
| [2025-09-06 08:53:10 TP0] Prefill batch. #new-seq: 1, #new-token: 256, #cached-token: 64, token usage: 0.00, #running-req: 11, #queue-req: 0, | |
| [2025-09-06 08:53:10 TP0] Prefill batch. #new-seq: 54, #new-token: 16192, #cached-token: 3456, token usage: 0.00, #running-req: 12, #queue-req: 31, | |
| [2025-09-06 08:53:10 TP0] Prefill batch. #new-seq: 49, #new-token: 13056, #cached-token: 3136, token usage: 0.00, #running-req: 66, #queue-req: 0, | |
| [2025-09-06 08:53:10 TP0] Prefill batch. #new-seq: 58, #new-token: 16320, #cached-token: 3776, token usage: 0.00, #running-req: 115, #queue-req: 1, | |
| [2025-09-06 08:53:11 TP0] Prefill batch. #new-seq: 25, #new-token: 7296, #cached-token: 1728, token usage: 0.01, #running-req: 173, #queue-req: 0, | |
| [2025-09-06 08:53:11 TP0] Decode batch. #running-req: 198, #token: 62016, token usage: 0.01, cuda graph: True, gen throughput (token/s): 390.03, #queue-req: 0, | |
| [2025-09-06 08:53:11] INFO: 127.0.0.1:48268 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:11] INFO: 127.0.0.1:48224 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:11 TP0] Decode batch. #running-req: 196, #token: 68800, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17298.89, #queue-req: 0, | |
| [2025-09-06 08:53:11] INFO: 127.0.0.1:47764 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:47106 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:48460 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:48182 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:46824 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:46920 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:48220 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:47478 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:47336 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:46986 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12 TP0] Decode batch. #running-req: 186, #token: 70656, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16979.59, #queue-req: 0, | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:47982 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:47408 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:47600 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:47710 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:46938 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:47628 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:47538 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:46946 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:47550 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:47498 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:48412 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:48310 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:47366 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12 TP0] Decode batch. #running-req: 173, #token: 71168, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16251.87, #queue-req: 0, | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:47114 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:12] INFO: 127.0.0.1:48048 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:13] INFO: 127.0.0.1:47368 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:13] INFO: 127.0.0.1:46972 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:13] INFO: 127.0.0.1:47662 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:13] INFO: 127.0.0.1:47724 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:13 TP0] Decode batch. #running-req: 167, #token: 75840, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15575.15, #queue-req: 0, | |
| [2025-09-06 08:53:13] INFO: 127.0.0.1:46860 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:13] INFO: 127.0.0.1:46950 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:13] INFO: 127.0.0.1:48162 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:13] INFO: 127.0.0.1:47128 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:13] INFO: 127.0.0.1:48444 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:13] INFO: 127.0.0.1:47672 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:13] INFO: 127.0.0.1:48016 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:13] INFO: 127.0.0.1:48070 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:13 TP0] Decode batch. #running-req: 159, #token: 78272, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14796.13, #queue-req: 0, | |
| [2025-09-06 08:53:13] INFO: 127.0.0.1:47268 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:13] INFO: 127.0.0.1:48384 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:13] INFO: 127.0.0.1:47822 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:14] INFO: 127.0.0.1:47512 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:14] INFO: 127.0.0.1:47528 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:14] INFO: 127.0.0.1:47278 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:14] INFO: 127.0.0.1:47612 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:14] INFO: 127.0.0.1:47198 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:14 TP0] Decode batch. #running-req: 152, #token: 80832, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14010.88, #queue-req: 0, | |
| [2025-09-06 08:53:14] INFO: 127.0.0.1:47334 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:14] INFO: 127.0.0.1:47084 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:14] INFO: 127.0.0.1:48236 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:14] INFO: 127.0.0.1:47016 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:14] INFO: 127.0.0.1:48196 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:14] INFO: 127.0.0.1:47886 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:14 TP0] Decode batch. #running-req: 145, #token: 83648, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13505.45, #queue-req: 0, | |
| [2025-09-06 08:53:14] INFO: 127.0.0.1:47402 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:14] INFO: 127.0.0.1:48102 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:14] INFO: 127.0.0.1:47756 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:14] INFO: 127.0.0.1:48128 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:14] INFO: 127.0.0.1:47502 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:14] INFO: 127.0.0.1:47172 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:47892 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15 TP0] Decode batch. #running-req: 139, #token: 84736, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13143.80, #queue-req: 0, | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:47994 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:47630 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:47174 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:46852 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:46998 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:48274 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:47870 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:47980 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:47300 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:47316 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:47976 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:48346 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:47806 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:47138 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:48352 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:46794 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:48138 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:46818 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:47226 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15 TP0] Decode batch. #running-req: 119, #token: 78144, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13663.45, #queue-req: 0, | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:48212 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:47792 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:47682 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15 TP0] Decode batch. #running-req: 116, #token: 80704, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16118.27, #queue-req: 0, | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:47848 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:47030 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:46926 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:47166 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:46814 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:47568 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:48254 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:46996 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15] INFO: 127.0.0.1:48430 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:15 TP0] Decode batch. #running-req: 107, #token: 78400, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15427.31, #queue-req: 0, | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:47922 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:47956 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:47054 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:47212 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:47804 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:48002 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:47422 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:47716 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:47776 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:47552 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:47092 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:48330 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:48390 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16 TP0] Decode batch. #running-req: 96, #token: 73408, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14358.06, #queue-req: 0, | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:47702 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:48234 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:47058 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:47690 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:46958 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16 TP0] Decode batch. #running-req: 89, #token: 72000, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13386.14, #queue-req: 0, | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:47606 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:47436 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:47964 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:48114 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:47486 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:47930 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:46868 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:46890 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:47556 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:46836 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:47832 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16 TP0] Decode batch. #running-req: 78, #token: 67264, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12540.26, #queue-req: 0, | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:48300 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:16] INFO: 127.0.0.1:48286 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:47112 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:47742 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17 TP0] Decode batch. #running-req: 74, #token: 65792, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11994.78, #queue-req: 0, | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:47880 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:47646 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:47492 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:48414 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:47196 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:47186 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:47938 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:48042 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:47350 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17 TP0] Decode batch. #running-req: 65, #token: 60992, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11233.39, #queue-req: 0, | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:48226 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:47638 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:47900 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:47718 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:46908 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17 TP0] Decode batch. #running-req: 60, #token: 59008, token usage: 0.01, cuda graph: True, gen throughput (token/s): 10051.77, #queue-req: 0, | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:48210 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:47514 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:46892 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:48370 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:47696 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:47464 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:47570 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:47990 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:48246 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:46880 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:47234 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17 TP0] Decode batch. #running-req: 49, #token: 49856, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9153.33, #queue-req: 0, | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:48160 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:47284 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:47070 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:48146 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:48320 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:17] INFO: 127.0.0.1:47390 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:47584 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:47432 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18 TP0] Decode batch. #running-req: 42, #token: 43328, token usage: 0.01, cuda graph: True, gen throughput (token/s): 7998.64, #queue-req: 0, | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:48054 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:48040 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18 TP0] Decode batch. #running-req: 39, #token: 43008, token usage: 0.01, cuda graph: True, gen throughput (token/s): 7167.57, #queue-req: 0, | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:47116 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:47244 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:47406 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:47452 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:47004 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:47828 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18 TP0] Decode batch. #running-req: 33, #token: 37888, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6712.80, #queue-req: 0, | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:48216 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:47608 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:47940 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:47626 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18 TP0] Decode batch. #running-req: 30, #token: 34560, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5671.46, #queue-req: 0, | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:47706 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:48086 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:47862 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:48380 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:48046 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:47082 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:46806 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18 TP0] Decode batch. #running-req: 22, #token: 27456, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4806.00, #queue-req: 0, | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:48032 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:48358 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:18] INFO: 127.0.0.1:47332 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:19] INFO: 127.0.0.1:47384 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:19 TP0] Decode batch. #running-req: 18, #token: 23360, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3878.30, #queue-req: 0, | |
| [2025-09-06 08:53:19] INFO: 127.0.0.1:47950 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:19] INFO: 127.0.0.1:47748 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:19 TP0] Decode batch. #running-req: 16, #token: 21504, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3382.74, #queue-req: 0, | |
| [2025-09-06 08:53:19] INFO: 127.0.0.1:47388 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:19] INFO: 127.0.0.1:47668 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:19 TP0] Decode batch. #running-req: 14, #token: 19584, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3287.37, #queue-req: 0, | |
| [2025-09-06 08:53:19] INFO: 127.0.0.1:48170 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:19 TP0] Decode batch. #running-req: 13, #token: 18624, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2878.57, #queue-req: 0, | |
| [2025-09-06 08:53:19] INFO: 127.0.0.1:46886 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:19 TP0] Decode batch. #running-req: 12, #token: 17792, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2658.46, #queue-req: 0, | |
| [2025-09-06 08:53:19] INFO: 127.0.0.1:47152 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:19] INFO: 127.0.0.1:47494 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:20 TP0] Decode batch. #running-req: 10, #token: 15296, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2273.44, #queue-req: 0, | |
| [2025-09-06 08:53:20] INFO: 127.0.0.1:47578 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:20] INFO: 127.0.0.1:47694 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:20 TP0] Decode batch. #running-req: 8, #token: 12416, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1975.55, #queue-req: 0, | |
| [2025-09-06 08:53:20 TP0] Decode batch. #running-req: 8, #token: 12864, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1922.54, #queue-req: 0, | |
| [2025-09-06 08:53:20] INFO: 127.0.0.1:47380 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:20 TP0] Decode batch. #running-req: 7, #token: 11392, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1674.58, #queue-req: 0, | |
| [2025-09-06 08:53:20] INFO: 127.0.0.1:48404 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:20 TP0] Decode batch. #running-req: 6, #token: 10048, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1481.07, #queue-req: 0, | |
| [2025-09-06 08:53:20] INFO: 127.0.0.1:46874 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:20 TP0] Decode batch. #running-req: 5, #token: 8640, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1395.08, #queue-req: 0, | |
| [2025-09-06 08:53:20] INFO: 127.0.0.1:48090 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:21 TP0] Decode batch. #running-req: 4, #token: 7040, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1068.41, #queue-req: 0, | |
| [2025-09-06 08:53:21] INFO: 127.0.0.1:47908 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:53:21] INFO: 127.0.0.1:47736 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 1%| | 1/198 [00:10<35:11, 10.72s/it] | |
| [2025-09-06 08:53:21 TP0] Decode batch. #running-req: 2, #token: 3456, token usage: 0.00, cuda graph: True, gen throughput (token/s): 814.09, #queue-req: 0, | |
| [2025-09-06 08:53:21 TP0] Decode batch. #running-req: 2, #token: 3584, token usage: 0.00, cuda graph: True, gen throughput (token/s): 576.20, #queue-req: 0, | |
| [2025-09-06 08:53:21 TP0] Decode batch. #running-req: 2, #token: 3648, token usage: 0.00, cuda graph: True, gen throughput (token/s): 580.52, #queue-req: 0, | |
| [2025-09-06 08:53:21 TP0] Decode batch. #running-req: 2, #token: 3712, token usage: 0.00, cuda graph: True, gen throughput (token/s): 579.37, #queue-req: 0, | |
| [2025-09-06 08:53:21 TP0] Decode batch. #running-req: 2, #token: 3776, token usage: 0.00, cuda graph: True, gen throughput (token/s): 581.49, #queue-req: 0, | |
| [2025-09-06 08:53:21 TP0] Decode batch. #running-req: 2, #token: 3840, token usage: 0.00, cuda graph: True, gen throughput (token/s): 580.49, #queue-req: 0, | |
| [2025-09-06 08:53:22 TP0] Decode batch. #running-req: 2, #token: 3968, token usage: 0.00, cuda graph: True, gen throughput (token/s): 580.24, #queue-req: 0, | |
| [2025-09-06 08:53:22 TP0] Decode batch. #running-req: 2, #token: 4032, token usage: 0.00, cuda graph: True, gen throughput (token/s): 581.03, #queue-req: 0, | |
| [2025-09-06 08:53:22 TP0] Decode batch. #running-req: 2, #token: 4096, token usage: 0.00, cuda graph: True, gen throughput (token/s): 579.36, #queue-req: 0, | |
| [2025-09-06 08:53:22] INFO: 127.0.0.1:47040 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 56%|█████▌ | 110/198 [00:11<00:06, 12.59it/s] | |
| [2025-09-06 08:53:22 TP0] Decode batch. #running-req: 1, #token: 2112, token usage: 0.00, cuda graph: True, gen throughput (token/s): 414.27, #queue-req: 0, | |
| [2025-09-06 08:53:22 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.78, #queue-req: 0, | |
| [2025-09-06 08:53:22 TP0] Decode batch. #running-req: 1, #token: 2176, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.47, #queue-req: 0, | |
| [2025-09-06 08:53:22 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.03, #queue-req: 0, | |
| [2025-09-06 08:53:22 TP0] Decode batch. #running-req: 1, #token: 2240, token usage: 0.00, cuda graph: True, gen throughput (token/s): 330.20, #queue-req: 0, | |
| [2025-09-06 08:53:23 TP0] Decode batch. #running-req: 1, #token: 2304, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.86, #queue-req: 0, | |
| [2025-09-06 08:53:23 TP0] Decode batch. #running-req: 1, #token: 2368, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.89, #queue-req: 0, | |
| [2025-09-06 08:53:23 TP0] Decode batch. #running-req: 1, #token: 2368, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.49, #queue-req: 0, | |
| [2025-09-06 08:53:23 TP0] Decode batch. #running-req: 1, #token: 2432, token usage: 0.00, cuda graph: True, gen throughput (token/s): 329.50, #queue-req: 0, | |
| [2025-09-06 08:53:23] INFO: 127.0.0.1:47256 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 100%|██████████| 198/198 [00:13<00:00, 15.21it/s] | |
| /usr/lib/python3.12/subprocess.py:1127: ResourceWarning: subprocess 54406 is still running | |
| _warn("subprocess %s is still running" % self.pid, | |
| ResourceWarning: Enable tracemalloc to get the object allocation traceback | |
| . | |
| ---------------------------------------------------------------------- | |
| Ran 1 test in 177.303s | |
| OK | |
| Writing report to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.html | |
| {'chars': 1690.560606060606, 'chars:std': 967.3517968413342, 'score:std': 0.48749802152178456, 'score': 0.6111111111111112} | |
| Writing results to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.json | |
| Total latency: 13.068 s | |
| Score: 0.611 | |
| Evaluation end: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 metrics={'chars': 1690.560606060606, 'chars:std': 967.3517968413342, 'score:std': 0.48749802152178456, 'score': 0.6111111111111112} | |
| ================================================================================ | |
| Run 10: | |
| Auto-configed device: cuda | |
| WARNING:sglang.srt.server_args:Detected SM100 and MXFP4 quantization format for GPT-OSS model, enabling FlashInfer MXFP4 MOE kernel. | |
| WARNING:sglang.srt.server_args:TensorRT-LLM MHA only supports page_size of 16, 32 or 64, changing page_size from None to 64. | |
| [2025-09-06 08:53:38] server_args=ServerArgs(model_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_path='/home/yiliu7/models/openai/gpt-oss-120b', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=8400, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='bfloat16', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.93, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=64, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=True, device='cuda', tp_size=4, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=460438675, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/home/yiliu7/models/openai/gpt-oss-120b', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, 
load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='trtllm_mha', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='flashinfer_mxfp4', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=200, 
cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, numa_node=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False) | |
| All deep_gemm operations loaded successfully! | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:53:38] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:53:38] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:53:39] Using default HuggingFace chat template with detected content format: string | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:53:45 TP0] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:53:45 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:53:45 TP3] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:53:45 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:53:45 TP1] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:53:45 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:53:45 TP0] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:53:45 TP0] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:53:45 TP0] Init torch distributed begin. | |
| `torch_dtype` is deprecated! Use `dtype` instead! | |
| [2025-09-06 08:53:45 TP2] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:53:45 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:53:46 TP3] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:53:46 TP3] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:53:46 TP1] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:53:46 TP1] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [2025-09-06 08:53:46 TP2] Downcasting torch.float32 to torch.bfloat16. | |
| [2025-09-06 08:53:46 TP2] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models. | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [2025-09-06 08:53:47 TP0] sglang is using nccl==2.27.3 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 | |
| [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3 | |
| [2025-09-06 08:53:50 TP0] Init torch distributed ends. mem usage=1.46 GB | |
| [2025-09-06 08:53:50 TP0] Load weight begin. avail mem=176.28 GB | |
| All deep_gemm operations loaded successfully! | |
| Loading safetensors checkpoint shards: 0% Completed | 0/15 [00:00<?, ?it/s] | |
| Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 1587.43it/s] | |
| All deep_gemm operations loaded successfully! | |
| [2025-09-06 08:54:05 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.0.mlp.experts), it might take a while... | |
| All deep_gemm operations loaded successfully! | |
| All deep_gemm operations loaded successfully! | |
| [2025-09-06 08:54:08 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.1.mlp.experts), it might take a while... | |
| [2025-09-06 08:54:11 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.2.mlp.experts), it might take a while... | |
| [2025-09-06 08:54:14 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.3.mlp.experts), it might take a while... | |
| [2025-09-06 08:54:17 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.4.mlp.experts), it might take a while... | |
| [2025-09-06 08:54:20 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.5.mlp.experts), it might take a while... | |
| [2025-09-06 08:54:23 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.6.mlp.experts), it might take a while... | |
| [2025-09-06 08:54:26 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.7.mlp.experts), it might take a while... | |
| [2025-09-06 08:54:30 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.8.mlp.experts), it might take a while... | |
| [2025-09-06 08:54:33 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.9.mlp.experts), it might take a while... | |
| [2025-09-06 08:54:36 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.10.mlp.experts), it might take a while... | |
| [2025-09-06 08:54:39 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.11.mlp.experts), it might take a while... | |
| [2025-09-06 08:54:42 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.12.mlp.experts), it might take a while... | |
| [2025-09-06 08:54:45 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.13.mlp.experts), it might take a while... | |
| [2025-09-06 08:54:48 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.14.mlp.experts), it might take a while... | |
| [2025-09-06 08:54:51 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.15.mlp.experts), it might take a while... | |
| [2025-09-06 08:54:54 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.16.mlp.experts), it might take a while... | |
| [2025-09-06 08:54:57 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.17.mlp.experts), it might take a while... | |
| [2025-09-06 08:55:00 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.18.mlp.experts), it might take a while... | |
| [2025-09-06 08:55:03 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.19.mlp.experts), it might take a while... | |
| [2025-09-06 08:55:06 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.20.mlp.experts), it might take a while... | |
| [2025-09-06 08:55:09 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.21.mlp.experts), it might take a while... | |
| [2025-09-06 08:55:12 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.22.mlp.experts), it might take a while... | |
| [2025-09-06 08:55:15 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.23.mlp.experts), it might take a while... | |
| [2025-09-06 08:55:18 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.24.mlp.experts), it might take a while... | |
| [2025-09-06 08:55:22 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.25.mlp.experts), it might take a while... | |
| [2025-09-06 08:55:25 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.26.mlp.experts), it might take a while... | |
| [2025-09-06 08:55:28 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.27.mlp.experts), it might take a while... | |
| [2025-09-06 08:55:31 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.28.mlp.experts), it might take a while... | |
| [2025-09-06 08:55:34 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.29.mlp.experts), it might take a while... | |
| [2025-09-06 08:55:37 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.30.mlp.experts), it might take a while... | |
| [2025-09-06 08:55:40 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.31.mlp.experts), it might take a while... | |
| [2025-09-06 08:55:43 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.32.mlp.experts), it might take a while... | |
| [2025-09-06 08:55:46 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.33.mlp.experts), it might take a while... | |
| [2025-09-06 08:55:49 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.34.mlp.experts), it might take a while... | |
| [2025-09-06 08:55:52 TP0] Shuffling MoE weights for FlashInfer MXFP4 moe kernel (layer: model.layers.35.mlp.experts), it might take a while... | |
| [2025-09-06 08:55:55 TP0] Load weight end. type=GptOssForCausalLM, dtype=torch.bfloat16, avail mem=158.06 GB, mem usage=18.22 GB. | |
| [2025-09-06 08:55:57 TP0] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:55:57 TP2] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:55:57 TP3] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:55:57 TP0] Memory pool end. avail mem=10.23 GB | |
| [2025-09-06 08:55:57 TP1] KV Cache is allocated. #tokens: 8487040, K size: 72.85 GB, V size: 72.85 GB | |
| [2025-09-06 08:55:57 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=9.54 GB | |
| [2025-09-06 08:55:57 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200] | |
| Capturing batches (bs=200 avail_mem=9.39 GB): 0%| | 0/28 [00:00<?, ?it/s] | |
| rank 1 allocated ipc_handles: [['0x709c34000000', '0x70c1ca000000', '0x709bd0000000', '0x709bcc000000'], ['0x709bcf000000', '0x709bcee00000', '0x709bcf200000', '0x709bcf400000'], ['0x709bb8000000', '0x709bc2000000', '0x709bae000000', '0x709ba4000000']] | |
| [2025-09-06 08:55:59.485] [info] lamportInitialize start: buffer: 0x709bc2000000, size: 71303168 | |
| rank 0 allocated ipc_handles: [['0x707408000000', '0x704e16000000', '0x704e12000000', '0x704e0e000000'], ['0x704e10e00000', '0x704e11000000', '0x704e11200000', '0x704e11400000'], ['0x704e04000000', '0x704dfa000000', '0x704df0000000', '0x704de6000000']] | |
| [2025-09-06 08:55:59.537] [info] lamportInitialize start: buffer: 0x704e04000000, size: 71303168 | |
| rank 2 allocated ipc_handles: [['0x73c650000000', '0x73c5ec000000', '0x73ebe6000000', '0x73c5e8000000'], ['0x73c5eb000000', '0x73c5eb200000', '0x73c5eae00000', '0x73c5eb400000'], ['0x73c5d4000000', '0x73c5ca000000', '0x73c5de000000', '0x73c5c0000000']] | |
| [2025-09-06 08:55:59.586] [info] lamportInitialize start: buffer: 0x73c5de000000, size: 71303168 | |
| rank 3 allocated ipc_handles: [['0x795490000000', '0x795432000000', '0x79542e000000', '0x797a26000000'], ['0x795431000000', '0x795431200000', '0x795431400000', '0x795430e00000'], ['0x79541a000000', '0x795410000000', '0x795406000000', '0x795424000000']] | |
| [2025-09-06 08:55:59.636] [info] lamportInitialize start: buffer: 0x795424000000, size: 71303168 | |
| [2025-09-06 08:55:59 TP0] FlashInfer workspace initialized for rank 0, world_size 4 | |
| [2025-09-06 08:55:59 TP1] FlashInfer workspace initialized for rank 1, world_size 4 | |
| [2025-09-06 08:55:59 TP3] FlashInfer workspace initialized for rank 3, world_size 4 | |
| [2025-09-06 08:55:59 TP2] FlashInfer workspace initialized for rank 2, world_size 4 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 1 workspace[0] 0x709c34000000 | |
| Rank 1 workspace[1] 0x70c1ca000000 | |
| Rank 1 workspace[2] 0x709bd0000000 | |
| Rank 1 workspace[3] 0x709bcc000000 | |
| Rank 1 workspace[4] 0x709bcf000000 | |
| Rank 1 workspace[5] 0x709bcee00000 | |
| Rank 1 workspace[6] 0x709bcf200000 | |
| Rank 1 workspace[7] 0x709bcf400000 | |
| Rank 1 workspace[8] 0x709bb8000000 | |
| Rank 1 workspace[9] 0x709bc2000000 | |
| Rank 1 workspace[10] 0x709bae000000 | |
| Rank 1 workspace[11] 0x709ba4000000 | |
| Rank 1 workspace[12] 0x70c7d3264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 3 workspace[0] 0x795490000000 | |
| Rank 3 workspace[1] 0x795432000000 | |
| Rank 3 workspace[2] 0x79542e000000 | |
| Rank 3 workspace[3] 0x797a26000000 | |
| Rank 3 workspace[4] 0x795431000000 | |
| Rank 3 workspace[5] 0x795431200000 | |
| Rank 3 workspace[6] 0x795431400000 | |
| Rank 3 workspace[7] 0x795430e00000 | |
| Rank 3 workspace[8] 0x79541a000000 | |
| Rank 3 workspace[9] 0x795410000000 | |
| Rank 3 workspace[10] 0x795406000000 | |
| Rank 3 workspace[11] 0x795424000000 | |
| Rank 3 workspace[12] 0x798021264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 0 workspace[0] 0x707408000000 | |
| Rank 0 workspace[1] 0x704e16000000 | |
| Rank 0 workspace[2] 0x704e12000000 | |
| Rank 0 workspace[3] 0x704e0e000000 | |
| Rank 0 workspace[4] 0x704e10e00000 | |
| Rank 0 workspace[5] 0x704e11000000 | |
| Rank 0 workspace[6] 0x704e11200000 | |
| Rank 0 workspace[7] 0x704e11400000 | |
| Rank 0 workspace[8] 0x704e04000000 | |
| Rank 0 workspace[9] 0x704dfa000000 | |
| Rank 0 workspace[10] 0x704df0000000 | |
| Rank 0 workspace[11] 0x704de6000000 | |
| Rank 0 workspace[12] 0x7079ff264400 | |
| set flag_ptr[3] = lamport_comm_size: 47185920 | |
| Rank 2 workspace[0] 0x73c650000000 | |
| Rank 2 workspace[1] 0x73c5ec000000 | |
| Rank 2 workspace[2] 0x73ebe6000000 | |
| Rank 2 workspace[3] 0x73c5e8000000 | |
| Rank 2 workspace[4] 0x73c5eb000000 | |
| Rank 2 workspace[5] 0x73c5eb200000 | |
| Rank 2 workspace[6] 0x73c5eae00000 | |
| Rank 2 workspace[7] 0x73c5eb400000 | |
| Rank 2 workspace[8] 0x73c5d4000000 | |
| Rank 2 workspace[9] 0x73c5ca000000 | |
| Rank 2 workspace[10] 0x73c5de000000 | |
| Rank 2 workspace[11] 0x73c5c0000000 | |
| Rank 2 workspace[12] 0x73f1f3264400 | |
| Capturing batches (bs=1 avail_mem=7.82 GB): 100%|██████████| 28/28 [00:04<00:00, 6.22it/s] | |
| [2025-09-06 08:56:02 TP3] Registering 56 cuda graph addresses | |
| [2025-09-06 08:56:02 TP0] Registering 56 cuda graph addresses | |
| [2025-09-06 08:56:02 TP1] Registering 56 cuda graph addresses | |
| [2025-09-06 08:56:02 TP2] Registering 56 cuda graph addresses | |
| [2025-09-06 08:56:02 TP0] Capture cuda graph end. Time elapsed: 5.03 s. mem usage=1.73 GB. avail mem=7.81 GB. | |
| [2025-09-06 08:56:02 TP0] max_total_num_tokens=8487040, chunked_prefill_size=16384, max_prefill_tokens=16384, max_running_requests=4096, context_len=131072, available_gpu_mem=7.81 GB | |
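The `token usage` figures printed by the decode batches later in this run appear consistent with `#token` divided by the `max_total_num_tokens` reported above (8,487,040 KV-cache token slots). A minimal sanity check, using the first decode batch's token count from the log (this is an illustrative reconstruction, not SGLang's own code):

```python
# "token usage" in the decode logs looks like #token / max_total_num_tokens.
max_total_num_tokens = 8_487_040   # from the TP0 line above
tokens_in_batch = 62_848           # first "Decode batch" line in this run
usage = tokens_in_batch / max_total_num_tokens
print(f"{usage:.2f}")  # -> 0.01, matching the logged "token usage: 0.01"
```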
| [2025-09-06 08:56:03] INFO: Started server process [56872] | |
| [2025-09-06 08:56:03] INFO: Waiting for application startup. | |
| [2025-09-06 08:56:03] INFO: Application startup complete. | |
| [2025-09-06 08:56:03] INFO: Uvicorn running on http://127.0.0.1:8400 (Press CTRL+C to quit) | |
| [2025-09-06 08:56:04] INFO: 127.0.0.1:33132 - "GET /get_model_info HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:04 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:56:05] INFO: 127.0.0.1:33148 - "POST /generate HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:05] The server is fired up and ready to roll! | |
| [2025-09-06 08:56:12 TP0] Prefill batch. #new-seq: 1, #new-token: 64, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:56:13] INFO: 127.0.0.1:34932 - "GET /health_generate HTTP/1.1" 200 OK | |
| command=python3 -m sglang.launch_server --model-path /home/yiliu7/models/openai/gpt-oss-120b --tp 4 --cuda-graph-max-bs 200 --mem-fraction-static 0.93 --device cuda --host 127.0.0.1 --port 8400 | |
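The server launched by the command above serves an OpenAI-compatible endpoint on 127.0.0.1:8400, which is what the many `POST /v1/chat/completions` lines in the log are hitting. A minimal stdlib-only client sketch (host, port, model path, and temperature are taken from the log; the payload fields follow the OpenAI chat-completions schema):

```python
import json
import urllib.request

# Host, port, and model path taken from the launch command above.
BASE_URL = "http://127.0.0.1:8400"
MODEL = "/home/yiliu7/models/openai/gpt-oss-120b"

def build_chat_request(prompt: str, max_tokens: int = 4096) -> urllib.request.Request:
    """Build an OpenAI-compatible chat-completions request for the local server."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.1,      # matches the sampler settings in the log
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

# With the server running, sending the request returns the usual
# OpenAI-style response body:
#   resp = urllib.request.urlopen(build_chat_request("Say hello."))
#   print(json.load(resp)["choices"][0]["message"]["content"])
```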
| Evaluation start: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 | |
| ChatCompletionSampler initialized with self.system_message=None self.temperature=0.1 self.max_tokens=4096 self.reasoning_effort='low' | |
| 0%| | 0/198 [00:00<?, ?it/s] | |
| [2025-09-06 08:56:14 TP0] Prefill batch. #new-seq: 1, #new-token: 256, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, | |
| [2025-09-06 08:56:14 TP0] Prefill batch. #new-seq: 2, #new-token: 640, #cached-token: 0, token usage: 0.00, #running-req: 1, #queue-req: 0, | |
| [2025-09-06 08:56:14 TP0] Prefill batch. #new-seq: 9, #new-token: 3328, #cached-token: 576, token usage: 0.00, #running-req: 3, #queue-req: 0, | |
| [2025-09-06 08:56:14 TP0] Prefill batch. #new-seq: 7, #new-token: 2304, #cached-token: 448, token usage: 0.00, #running-req: 12, #queue-req: 0, | |
| [2025-09-06 08:56:14 TP0] Prefill batch. #new-seq: 52, #new-token: 16192, #cached-token: 3328, token usage: 0.00, #running-req: 19, #queue-req: 35, | |
| [2025-09-06 08:56:15 TP0] Prefill batch. #new-seq: 60, #new-token: 16128, #cached-token: 3840, token usage: 0.00, #running-req: 71, #queue-req: 3, | |
| [2025-09-06 08:56:15 TP0] Prefill batch. #new-seq: 60, #new-token: 16000, #cached-token: 3968, token usage: 0.00, #running-req: 131, #queue-req: 0, | |
| [2025-09-06 08:56:15 TP0] Prefill batch. #new-seq: 7, #new-token: 1920, #cached-token: 448, token usage: 0.01, #running-req: 191, #queue-req: 0, | |
| [2025-09-06 08:56:15 TP0] Decode batch. #running-req: 198, #token: 62848, token usage: 0.01, cuda graph: True, gen throughput (token/s): 471.38, #queue-req: 0, | |
| [2025-09-06 08:56:15] INFO: 127.0.0.1:34996 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:15] INFO: 127.0.0.1:34970 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:15] INFO: 127.0.0.1:36314 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:16 TP0] Decode batch. #running-req: 195, #token: 69056, token usage: 0.01, cuda graph: True, gen throughput (token/s): 17246.54, #queue-req: 0, | |
| [2025-09-06 08:56:16] INFO: 127.0.0.1:35984 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:16] INFO: 127.0.0.1:35740 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:16] INFO: 127.0.0.1:35238 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:16] INFO: 127.0.0.1:34938 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:16] INFO: 127.0.0.1:36640 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:16] INFO: 127.0.0.1:35298 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:16] INFO: 127.0.0.1:35828 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:16] INFO: 127.0.0.1:36500 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:16] INFO: 127.0.0.1:35320 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:16] INFO: 127.0.0.1:35844 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:16 TP0] Decode batch. #running-req: 187, #token: 73600, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16961.41, #queue-req: 0, | |
| [2025-09-06 08:56:16] INFO: 127.0.0.1:36112 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:16] INFO: 127.0.0.1:35928 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:16] INFO: 127.0.0.1:35336 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:16] INFO: 127.0.0.1:35972 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:16] INFO: 127.0.0.1:35698 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:16] INFO: 127.0.0.1:35370 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:16 TP0] Decode batch. #running-req: 179, #token: 76032, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16309.95, #queue-req: 0, | |
| [2025-09-06 08:56:16] INFO: 127.0.0.1:35012 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:16] INFO: 127.0.0.1:35156 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:17] INFO: 127.0.0.1:36228 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:17] INFO: 127.0.0.1:36050 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:17] INFO: 127.0.0.1:35464 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:17] INFO: 127.0.0.1:35730 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:17] INFO: 127.0.0.1:35342 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:17] INFO: 127.0.0.1:36282 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:17] INFO: 127.0.0.1:36592 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:17] INFO: 127.0.0.1:36056 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:17] INFO: 127.0.0.1:36468 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:17] INFO: 127.0.0.1:35766 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:17] INFO: 127.0.0.1:35260 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:17 TP0] Decode batch. #running-req: 166, #token: 77760, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15304.53, #queue-req: 0, | |
| [2025-09-06 08:56:17] INFO: 127.0.0.1:36100 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:17] INFO: 127.0.0.1:35098 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:17] INFO: 127.0.0.1:35462 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:17] INFO: 127.0.0.1:35932 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:17] INFO: 127.0.0.1:35660 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:17] INFO: 127.0.0.1:35070 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:17 TP0] Decode batch. #running-req: 160, #token: 79232, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14478.16, #queue-req: 0, | |
| [2025-09-06 08:56:17] INFO: 127.0.0.1:35326 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:17] INFO: 127.0.0.1:35500 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:17] INFO: 127.0.0.1:36370 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:36614 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:36668 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:36246 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:35104 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:35218 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:36070 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:35408 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:35726 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18 TP0] Decode batch. #running-req: 149, #token: 79552, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13743.11, #queue-req: 0, | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:36358 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:35644 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:36182 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:36278 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:35560 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:36158 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:35374 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:35814 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18 TP0] Decode batch. #running-req: 141, #token: 81216, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13274.29, #queue-req: 0, | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:36116 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:34964 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:36140 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:35166 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:35228 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:35936 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:36518 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:18] INFO: 127.0.0.1:36414 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:35444 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:35480 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:35552 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:34972 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:35484 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:36472 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:35124 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19 TP0] Decode batch. #running-req: 126, #token: 77248, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13223.64, #queue-req: 0, | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:35172 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:36692 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:35564 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:36580 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:35856 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:36464 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:36566 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:36624 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:36654 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:36430 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:36234 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19 TP0] Decode batch. #running-req: 115, #token: 75712, token usage: 0.01, cuda graph: True, gen throughput (token/s): 16263.58, #queue-req: 0, | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:35538 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:35046 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:36630 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:35684 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:36404 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:35614 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:35364 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:36172 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19 TP0] Decode batch. #running-req: 107, #token: 74240, token usage: 0.01, cuda graph: True, gen throughput (token/s): 15262.25, #queue-req: 0, | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:35368 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:35148 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:35958 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:35024 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:35088 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:35574 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:36208 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:19] INFO: 127.0.0.1:35192 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:35038 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:35008 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20 TP0] Decode batch. #running-req: 97, #token: 71808, token usage: 0.01, cuda graph: True, gen throughput (token/s): 14424.31, #queue-req: 0, | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:36300 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:35870 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:36258 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:35528 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:36336 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:35268 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:35416 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:36138 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20 TP0] Decode batch. #running-req: 89, #token: 69952, token usage: 0.01, cuda graph: True, gen throughput (token/s): 13512.18, #queue-req: 0, | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:36298 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:35020 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:35452 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:36132 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:36000 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:35474 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:36408 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20 TP0] Decode batch. #running-req: 82, #token: 67136, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12757.55, #queue-req: 0, | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:35208 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:36504 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:36388 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:35782 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:35312 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20 TP0] Decode batch. #running-req: 77, #token: 65984, token usage: 0.01, cuda graph: True, gen throughput (token/s): 12161.34, #queue-req: 0, | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:35584 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:36210 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:34952 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:36528 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:35942 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:20] INFO: 127.0.0.1:36288 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:36026 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:36280 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:35252 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:36554 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21 TP0] Decode batch. #running-req: 67, #token: 59968, token usage: 0.01, cuda graph: True, gen throughput (token/s): 11380.33, #queue-req: 0, | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:36348 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:35598 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:36596 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:34958 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:35276 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:36236 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:35756 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:36454 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21 TP0] Decode batch. #running-req: 59, #token: 54592, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9912.63, #queue-req: 0, | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:36446 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:35896 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:35286 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21 TP0] Decode batch. #running-req: 56, #token: 54912, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9541.72, #queue-req: 0, | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:36010 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:35902 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:35672 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:35146 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:36086 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:34994 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21 TP0] Decode batch. #running-req: 50, #token: 50112, token usage: 0.01, cuda graph: True, gen throughput (token/s): 9111.40, #queue-req: 0, | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:36684 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:35112 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:36480 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:35712 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:36708 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:35282 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:21] INFO: 127.0.0.1:35392 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:22] INFO: 127.0.0.1:36036 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:22] INFO: 127.0.0.1:35884 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:22 TP0] Decode batch. #running-req: 41, #token: 42432, token usage: 0.00, cuda graph: True, gen throughput (token/s): 7750.34, #queue-req: 0, | |
| [2025-09-06 08:56:22] INFO: 127.0.0.1:35798 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:22] INFO: 127.0.0.1:36326 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:22] INFO: 127.0.0.1:36702 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:22] INFO: 127.0.0.1:35206 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:22] INFO: 127.0.0.1:35914 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:22] INFO: 127.0.0.1:34988 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:22 TP0] Decode batch. #running-req: 35, #token: 38464, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6543.68, #queue-req: 0, | |
| [2025-09-06 08:56:22] INFO: 127.0.0.1:35352 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:22] INFO: 127.0.0.1:36542 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:22] INFO: 127.0.0.1:36078 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:22 TP0] Decode batch. #running-req: 32, #token: 35520, token usage: 0.00, cuda graph: True, gen throughput (token/s): 6058.73, #queue-req: 0, | |
| [2025-09-06 08:56:22] INFO: 127.0.0.1:35860 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:22] INFO: 127.0.0.1:34976 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:22] INFO: 127.0.0.1:36490 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:22] INFO: 127.0.0.1:35628 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:22] INFO: 127.0.0.1:35784 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:22 TP0] Decode batch. #running-req: 28, #token: 31872, token usage: 0.00, cuda graph: True, gen throughput (token/s): 5606.79, #queue-req: 0, | |
| [2025-09-06 08:56:22] INFO: 127.0.0.1:36218 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:22] INFO: 127.0.0.1:35882 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:22] INFO: 127.0.0.1:35450 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:22 TP0] Decode batch. #running-req: 24, #token: 29888, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4809.02, #queue-req: 0, | |
| [2025-09-06 08:56:23] INFO: 127.0.0.1:35770 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:23] INFO: 127.0.0.1:36156 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:23] INFO: 127.0.0.1:36474 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:23 TP0] Decode batch. #running-req: 21, #token: 26880, token usage: 0.00, cuda graph: True, gen throughput (token/s): 4527.52, #queue-req: 0, | |
| [2025-09-06 08:56:23] INFO: 127.0.0.1:35056 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:23] INFO: 127.0.0.1:36622 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:23] INFO: 127.0.0.1:36276 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 1%| | 1/198 [00:08<28:36, 8.71s/it] | |
| [2025-09-06 08:56:23] INFO: 127.0.0.1:35514 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:23 TP0] Decode batch. #running-req: 17, #token: 22400, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3719.03, #queue-req: 0, | |
| [2025-09-06 08:56:23] INFO: 127.0.0.1:35640 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:23 TP0] Decode batch. #running-req: 16, #token: 21760, token usage: 0.00, cuda graph: True, gen throughput (token/s): 3341.24, #queue-req: 0, | |
| [2025-09-06 08:56:23] INFO: 127.0.0.1:36546 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:23] INFO: 127.0.0.1:35820 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:23] INFO: 127.0.0.1:35186 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:23] INFO: 127.0.0.1:35226 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:23 TP0] Decode batch. #running-req: 12, #token: 16896, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2893.92, #queue-req: 0, | |
| [2025-09-06 08:56:23] INFO: 127.0.0.1:36448 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:23 TP0] Decode batch. #running-req: 11, #token: 16192, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2433.17, #queue-req: 0, | |
| [2025-09-06 08:56:23] INFO: 127.0.0.1:36386 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:24] INFO: 127.0.0.1:35996 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:24 TP0] Decode batch. #running-req: 9, #token: 13376, token usage: 0.00, cuda graph: True, gen throughput (token/s): 2083.63, #queue-req: 0, | |
| [2025-09-06 08:56:24] INFO: 127.0.0.1:36194 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:24] INFO: 127.0.0.1:35428 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:24 TP0] Decode batch. #running-req: 7, #token: 9408, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1850.52, #queue-req: 0, | |
| [2025-09-06 08:56:24] INFO: 127.0.0.1:35130 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:24] INFO: 127.0.0.1:35488 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:24 TP0] Decode batch. #running-req: 5, #token: 8064, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1423.57, #queue-req: 0, | |
| [2025-09-06 08:56:24 TP0] Decode batch. #running-req: 5, #token: 8192, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1208.97, #queue-req: 0, | |
| [2025-09-06 08:56:24] INFO: 127.0.0.1:36600 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:24 TP0] Decode batch. #running-req: 4, #token: 6784, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1136.89, #queue-req: 0, | |
| [2025-09-06 08:56:24 TP0] Decode batch. #running-req: 4, #token: 6912, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1051.13, #queue-req: 0, | |
| [2025-09-06 08:56:25 TP0] Decode batch. #running-req: 4, #token: 7040, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1060.20, #queue-req: 0, | |
| [2025-09-06 08:56:25 TP0] Decode batch. #running-req: 4, #token: 7296, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1063.30, #queue-req: 0, | |
| [2025-09-06 08:56:25] INFO: 127.0.0.1:35380 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:25 TP0] Decode batch. #running-req: 3, #token: 5568, token usage: 0.00, cuda graph: True, gen throughput (token/s): 1036.20, #queue-req: 0, | |
| [2025-09-06 08:56:25] INFO: 127.0.0.1:36696 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:25 TP0] Decode batch. #running-req: 2, #token: 3968, token usage: 0.00, cuda graph: True, gen throughput (token/s): 634.01, #queue-req: 0, | |
| [2025-09-06 08:56:25 TP0] Decode batch. #running-req: 2, #token: 3968, token usage: 0.00, cuda graph: True, gen throughput (token/s): 570.67, #queue-req: 0, | |
| [2025-09-06 08:56:25 TP0] Decode batch. #running-req: 2, #token: 4096, token usage: 0.00, cuda graph: True, gen throughput (token/s): 578.23, #queue-req: 0, | |
| [2025-09-06 08:56:25 TP0] Decode batch. #running-req: 2, #token: 4224, token usage: 0.00, cuda graph: True, gen throughput (token/s): 578.82, #queue-req: 0, | |
| [2025-09-06 08:56:25] INFO: 127.0.0.1:35086 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| [2025-09-06 08:56:26] INFO: 127.0.0.1:36272 - "POST /v1/chat/completions HTTP/1.1" 200 OK | |
| 100%|██████████| 198/198 [00:11<00:00, 17.18it/s] | |
| /usr/lib/python3.12/subprocess.py:1127: ResourceWarning: subprocess 56872 is still running | |
| _warn("subprocess %s is still running" % self.pid, | |
| ResourceWarning: Enable tracemalloc to get the object allocation traceback | |
| . | |
| ---------------------------------------------------------------------- | |
| Ran 1 test in 175.811s | |
| OK | |
| Writing report to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.html | |
| {'chars': 1651.1565656565656, 'chars:std': 960.4908734139215, 'score:std': 0.47958527980756577, 'score': 0.6414141414141414} | |
| Writing results to /tmp/gpqa__home_yiliu7_models_openai_gpt-oss-120b.json | |
| Total latency: 11.574 s | |
| Score: 0.641 | |
| Evaluation end: model=/home/yiliu7/models/openai/gpt-oss-120b reasoning_effort=low expected_score=0.6 metrics={'chars': 1651.1565656565656, 'chars:std': 960.4908734139215, 'score:std': 0.47958527980756577, 'score': 0.6414141414141414} | |
| ================================================================================ | |
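The final metrics are internally consistent: a score of 0.6414141414141414 over 198 GPQA questions is exactly 127/198, and for 0/1 per-question scores the population standard deviation is sqrt(p * (1 - p)), which reproduces the logged `score:std`. A sketch verifying this (assumption: per-question scores are binary; `correct = 127` is the count implied by the logged mean, not read from the evaluator):

```python
# Sanity-check the reported GPQA metrics from the log above.
total = 198
correct = 127  # hypothetical count implied by score = 0.6414... = 127/198
scores = [1.0] * correct + [0.0] * (total - correct)
mean = sum(scores) / total
std = (sum((s - mean) ** 2 for s in scores) / total) ** 0.5  # population std
print(mean, std)  # -> mean ≈ 0.6414, std ≈ 0.4796, matching the report
```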