Config: https://huggingface.co/deepseek-ai/DeepSeek-V3-Base/blob/main/config.json
This configuration file defines the architecture and hyperparameters of DeepseekV3ForCausalLM, a causal language model (LM) built on the DeepSeek-V3 architecture. Below is an explanation of the key configurations:
- architectures: Specifies the model class, which is DeepseekV3ForCausalLM. This indicates the model is designed for causal language modeling (e.g., text generation).
- model_type: The type of model, which is deepseek_v3. This is used to identify the model architecture in the Hugging Face Transformers library.
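These two fields are enough to inspect the configuration programmatically without downloading any weights. A minimal sketch, assuming the Hugging Face Transformers library is installed (trust_remote_code is only needed on versions that predate native deepseek_v3 support):

```python
from transformers import AutoConfig

# Load only the config.json; no model weights are downloaded.
config = AutoConfig.from_pretrained(
    "deepseek-ai/DeepSeek-V3-Base",
    trust_remote_code=True,
)

print(config.model_type)      # deepseek_v3
print(config.architectures)   # ['DeepseekV3ForCausalLM']
```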
- attention_bias: Disables bias terms in the attention projections (false).
- attention_dropout: Dropout rate for attention layers (0.0, meaning no dropout).
- num_attention_heads: Number of attention heads in the multi-head attention mechanism (128).
- num_key_value_heads: Number of key/value heads (128). When this is smaller than num_attention_heads it enables grouped-query attention; here it equals the number of attention heads.
- qk_nope_head_dim: Per-head dimension of the query/key part that carries no positional encoding (128).
- qk_rope_head_dim: Per-head dimension of the query/key part that carries rotary positional encoding (64).
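To see how these head dimensions fit together, here is a back-of-the-envelope sketch (my own arithmetic; the variable names simply mirror the config keys, and v_head_dim is taken from further down in the config):

```python
num_attention_heads = 128
qk_nope_head_dim = 128   # query/key part without positional encoding
qk_rope_head_dim = 64    # query/key part that carries rotary embeddings
v_head_dim = 128         # value head dimension (listed later in the config)

qk_head_dim = qk_nope_head_dim + qk_rope_head_dim
print(qk_head_dim)                          # 192 per query/key head
print(num_attention_heads * qk_head_dim)    # 24576 total query/key width
print(num_attention_heads * v_head_dim)     # 16384 total value width
```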
- hidden_size: The size of the hidden layers in the model (7168).
- intermediate_size: The size of the intermediate layer in the dense feed-forward network (18432).
- num_hidden_layers: The number of hidden layers in the model (61).
- vocab_size: The size of the vocabulary (129280 tokens).
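These numbers alone pin down the size of the token embedding matrix. A rough sketch (illustrative arithmetic only):

```python
vocab_size = 129280
hidden_size = 7168

embedding_params = vocab_size * hidden_size
print(f"{embedding_params:,}")   # 926,679,040 ≈ 0.93B parameters
# tie_word_embeddings is false (see below), so the output LM head is a
# separate matrix of the same shape, roughly doubling this count.
```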
- max_position_embeddings: The maximum sequence length the model can handle (163840 tokens).
- rope_theta: The base value for rotary positional embeddings (10000).
- rope_scaling: Configuration for scaling rotary positional embeddings:
  - type: The scaling type is yarn.
  - factor: Scaling factor (40).
  - beta_fast and beta_slow: Parameters that set the boundaries of YaRN's ramp between interpolated and non-interpolated frequency dimensions.
  - mscale: Multiplicative attention-scaling factor (1.0).
  - original_max_position_embeddings: The original maximum sequence length before scaling (4096).
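The scaling factor and the original context length are consistent with the advertised maximum. A quick sanity check (my own arithmetic, not part of the config):

```python
original_max_position_embeddings = 4096   # pre-YaRN context length
factor = 40                               # YaRN scaling factor

print(original_max_position_embeddings * factor)   # 163840 == max_position_embeddings
```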
- moe_intermediate_size: Intermediate size of each expert MLP in the MoE layers (2048).
- moe_layer_freq: Frequency of MoE layers (1, meaning every layer after the initial dense layers is an MoE layer; see first_k_dense_replace below).
- n_routed_experts: Number of routed experts per MoE layer (256).
- n_shared_experts: Number of shared experts per MoE layer (1).
- num_experts_per_tok: Number of routed experts activated per token (8).
- routed_scaling_factor: Scaling factor applied to the routed experts' output (2.5).
- scoring_func: The scoring function used for expert routing (sigmoid).
- topk_method: The method used for selecting top-k experts (noaux_tc).
- topk_group: The number of expert groups a token's routing is restricted to (4).
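These routing settings imply that only a small fraction of the experts run for any given token. A rough sketch of the arithmetic (the gate/up/down SwiGLU structure of each expert is an assumption based on hidden_act being silu):

```python
n_routed_experts = 256
num_experts_per_tok = 8
n_shared_experts = 1
hidden_size = 7168
moe_intermediate_size = 2048

active_experts = num_experts_per_tok + n_shared_experts       # 9 expert MLPs per token
params_per_expert = 3 * hidden_size * moe_intermediate_size   # ≈ 44M (gate, up, down)

print(active_experts)
print(f"{n_routed_experts * params_per_expert:,}")  # ≈ 11.3B routed params per MoE layer
print(f"{active_experts * params_per_expert:,}")    # ≈ 0.4B of those are active per token
```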
- kv_lora_rank: The rank of the low-rank (latent) compression applied to the key/value projections in Multi-head Latent Attention (512). Despite the name, this is part of the base architecture, not a fine-tuning LoRA adapter.
- q_lora_rank: The rank of the low-rank compression applied to the query projections (1536).
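These ranks are what keep the KV cache small. A back-of-the-envelope comparison, following the MLA description in the DeepSeek-V2/V3 papers (my own arithmetic, not from the config file):

```python
num_attention_heads = 128
qk_head_dim = 128 + 64    # nope + rope parts per head
v_head_dim = 128
kv_lora_rank = 512
qk_rope_head_dim = 64

# Per token, per layer: what standard multi-head attention would have to cache ...
naive_kv_cache = num_attention_heads * (qk_head_dim + v_head_dim)   # 40960 values
# ... versus the compressed latent plus the shared RoPE key that MLA caches.
mla_kv_cache = kv_lora_rank + qk_rope_head_dim                      # 576 values

print(naive_kv_cache / mla_kv_cache)   # ≈ 71x smaller KV cache
```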
- rms_norm_eps: The epsilon value for RMS normalization (1e-06).
- initializer_range: The range (standard deviation) used for initializing model weights (0.02).
- hidden_act: The activation function used in the model (silu, also known as Swish).
- quantization_config: Configuration for quantization:
  - quant_method: The quantization method (fp8, 8-bit floating point).
  - fmt: The floating-point format (e4m3, i.e., 4 exponent bits and 3 mantissa bits).
  - activation_scheme: The scheme for quantizing activations (dynamic).
  - weight_block_size: The block size used for block-wise weight quantization ([128, 128]).
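If the config was loaded as in the earlier snippet, these fields are available directly on the config object. A minimal sketch (assuming Transformers exposes the block as a plain dict; newer versions may wrap it in a dedicated quantization-config class):

```python
qcfg = config.quantization_config
print(qcfg["quant_method"])        # fp8
print(qcfg["fmt"])                 # e4m3
print(qcfg["activation_scheme"])   # dynamic
print(qcfg["weight_block_size"])   # [128, 128]
```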
- bos_token_id: The ID of the beginning-of-sequence token (0).
- eos_token_id: The ID of the end-of-sequence token (1).
- pretraining_tp: Tensor parallelism degree used during pretraining (1, meaning no tensor parallelism).
- use_cache: Whether to use the key/value cache during inference (true).
- torch_dtype: The data type used for tensors (bfloat16).
- transformers_version: The version of the Hugging Face Transformers library used (4.33.1).
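For completeness, this is roughly how torch_dtype and use_cache surface when loading the model. A sketch only: the checkpoint is a 671B-parameter MoE, so actually running it requires a large multi-GPU setup, which the config itself says nothing about:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V3-Base",
    torch_dtype=torch.bfloat16,   # matches the config's torch_dtype
    trust_remote_code=True,
)
model.config.use_cache = True     # keep the KV cache enabled for faster generation
```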
- aux_loss_alpha: The weight for the auxiliary (load-balancing) loss (0.001).
- seq_aux: Whether to compute the auxiliary loss at the sequence level (true).
- ep_size: Expert parallelism size (1, meaning no expert parallelism).
- first_k_dense_replace: The number of initial layers that use a standard dense feed-forward network instead of MoE (3, i.e., layers 0-2 are dense).
- norm_topk_prob: Whether to normalize the top-k routing probabilities (true).
- num_nextn_predict_layers: The number of multi-token-prediction (MTP) layers used for next-n-token prediction (1).
- tie_word_embeddings: Whether to tie the input word embeddings to the output (LM head) weights (false).
- v_head_dim: The per-head dimension of the value projections in attention (128).
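Combined with moe_layer_freq above, first_k_dense_replace determines which layers are dense and which are MoE. An illustrative sketch of that layout (my own code; the real modeling code applies the same condition internally):

```python
num_hidden_layers = 61
first_k_dense_replace = 3
moe_layer_freq = 1

layer_types = [
    "moe" if i >= first_k_dense_replace and i % moe_layer_freq == 0 else "dense"
    for i in range(num_hidden_layers)
]
print(layer_types[:5])            # ['dense', 'dense', 'dense', 'moe', 'moe']
print(layer_types.count("moe"))   # 58 of the 61 layers are MoE layers
```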
This configuration defines a large-scale causal language model that combines a mixture-of-experts (MoE) architecture, YaRN-extended rotary positional embeddings, and Multi-head Latent Attention with low-rank query and key/value compression. It supports long sequences (up to 163840 tokens) and ships FP8-quantized (8-bit floating point) weights for efficiency. The model is designed for high-performance text generation tasks.