
This is the configuration class to store the configuration of a [`Qwen2_5OmniAudioEncoder`]. It is used to instantiate a
Qwen2.5-Omni-Thinker audio encoder according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the audio encoder of the Qwen2-Audio
architecture.

e.g. [Qwen/Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B)

Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.

Args:
    num_mel_bins (`int`, *optional*, defaults to 128):
        Number of mel-frequency bins in the input features. Should correspond to the value used in the
        `Qwen2_5OmniProcessor` class.
    encoder_layers (`int`, *optional*, defaults to 32):
        Number of encoder layers.
    encoder_attention_heads (`int`, *optional*, defaults to 20):
        Number of attention heads for each attention layer in the Transformer encoder.
    encoder_ffn_dim (`int`, *optional*, defaults to 5120):
        Dimensionality of the "intermediate" (often named feed-forward) layer in the encoder.
    d_model (`int`, *optional*, defaults to 1280):
        Dimensionality of the layers.
    dropout (`float`, *optional*, defaults to 0.0):
        The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
    attention_dropout (`float`, *optional*, defaults to 0.0):
        The dropout ratio for the attention probabilities.
    activation_function (`str`, *optional*, defaults to `"gelu"`):
        The non-linear activation function in the encoder and pooler, given as a string. `"gelu"`, `"relu"`,
        `"silu"` and `"gelu_new"` are supported.
    activation_dropout (`float`, *optional*, defaults to 0.0):
        The dropout ratio for activations inside the fully connected layer.
    scale_embedding (`bool`, *optional*, defaults to `False`):
        Scale embeddings by dividing by sqrt(d_model).
    initializer_range (`float`, *optional*, defaults to 0.02):
        The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
    max_source_positions (`int`, *optional*, defaults to 1500):
        The maximum sequence length of log-mel filter-bank features that this model might ever be used with.
    n_window (`int`, *optional*, defaults to 100):
        The chunk size used for the convolution and flash attention in the audio encoder.
    output_dim (`int`, *optional*, defaults to 3584):
        The output dimension of the audio encoder.

Example:

```python
>>> from transformers import Qwen2_5OmniAudioEncoderConfig, Qwen2_5OmniAudioEncoder

>>> # Initializing a Qwen2_5OmniAudioEncoderConfig
>>> configuration = Qwen2_5OmniAudioEncoderConfig()

>>> # Initializing a Qwen2_5OmniAudioEncoder (with random weights)
>>> model = Qwen2_5OmniAudioEncoder(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```
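
The default sizes listed under Args can be cross-checked with a few lines of plain Python. This is only a sketch: the
`defaults` dict is a hypothetical stand-in that copies values from the argument list rather than reading them from the
configuration class, and it assumes `d_model` is split evenly across attention heads, which is the usual Transformer
convention but is not stated in this docstring.

```python
# Default values copied by hand from the Args list (hypothetical stand-in, not the real config object).
defaults = {
    "d_model": 1280,
    "encoder_attention_heads": 20,
    "encoder_ffn_dim": 5120,
    "encoder_layers": 32,
    "output_dim": 3584,
}

# Assumption: d_model is divided evenly among the attention heads.
assert defaults["d_model"] % defaults["encoder_attention_heads"] == 0
head_dim = defaults["d_model"] // defaults["encoder_attention_heads"]

# Ratio of the feed-forward ("intermediate") width to the model width.
ffn_ratio = defaults["encoder_ffn_dim"] // defaults["d_model"]

print(head_dim, ffn_ratio)  # 64 4
```

When overriding `d_model` or `encoder_attention_heads`, the same divisibility check is a reasonable guard to run before
building a model from the modified configuration.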