Demystifying Mixture of Experts (MoE): A Scalable Solution for Large-Scale Deep Learning

As the complexity of deep learning tasks grows, the need for scalable and efficient models has led to increased interest in the Mixture of Experts (MoE) architecture. Known for its potential to harness the power of massive models while keeping computational costs low, MoE has become a staple in advanced applications, from natural language processing (NLP) to recommendation systems. In this article, we’ll dive deep into MoE, covering its fundamentals, benefits, challenges, practical implementations, and code examples.


What is Mixture of Experts?

Mixture of Experts (MoE) is a type of ensemble learning model architecture where multiple sub-models, known as experts, specialize in different aspects of a task. For any given input, only a subset of these experts is activated, making MoE models computationally efficient for large-scale problems. This selective activation is managed by a gating network, which determines the best experts to handle a specific input.


Why Mixture of Experts?

Deep learning models with billions of parameters, such as large language models, require significant computational power and memory. However, these models may not need all parameters activated for each input. MoE addresses this by:

  • Activating only a fraction of the parameters for each input.
  • Allowing multiple experts to focus on different parts of the input space.
  • Enabling large models with controlled computational costs, by activating only the relevant experts.

The sparsity and dynamic routing achieved by MoE make it suitable for tasks where different aspects of data are better served by specialized subnetworks.
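
To make the efficiency argument concrete, here is a back-of-the-envelope sketch in Python; the expert size and counts below are illustrative assumptions, not figures from any particular model:

# Illustrative numbers only; they are not taken from a real model.
params_per_expert = 50_000_000      # assumed parameters per expert
num_experts = 64                    # assumed number of experts in the MoE layer
top_k = 2                           # experts activated per input

total_params = params_per_expert * num_experts
active_params = params_per_expert * top_k

print(f"Total expert parameters: {total_params:,}")             # 3,200,000,000
print(f"Active parameters per input: {active_params:,}")        # 100,000,000
print(f"Fraction active: {active_params / total_params:.1%}")   # 3.1%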


Architecture of Mixture of Experts

An MoE model typically consists of three primary components:

  1. Experts: Multiple neural networks that specialize in specific areas of the input data space.
  2. Gating Network: A gating function determines which experts to activate for a given input.
  3. Aggregation: Outputs from the selected experts are combined to produce the final output.

Each expert is a separate neural network, which could be a simple feedforward layer, a convolutional network, or even an LSTM. The gating network often takes the form of a softmax layer, calculating a probability distribution over the experts and selecting those with the highest probabilities.


Architecture

Here's a high-level flow of an MoE model:

Input ----> Gating Network (chooses experts) ----> Experts (subset) ----> Aggregator ----> Output        


Let’s break down these components in detail.


1. Experts: The Specialized Subnetworks

In a Mixture of Experts (MoE) model, the "experts" are specialized subnetworks that each focus on a particular region of the input space or a specific subtask. Unlike traditional neural networks where every neuron or layer processes every part of the input, MoE experts are dynamically selected based on the input, making them both efficient and adaptable. These experts work in harmony to divide and conquer complex problems, creating a more powerful model that can handle diverse data or multiple tasks.

Let's delve deeper into how these experts function, how they’re structured, and how they improve the model’s performance.


Key Characteristics of Experts in MoE

Experts in MoE are designed to focus on different parts of a data distribution or subtask, leading to the following benefits:

  1. Specialization: Each expert can focus on a specific type of input, enabling the model to learn specialized features. For instance, in an NLP model, certain experts may specialize in syntactic structures, while others may focus on semantic nuances.
  2. Efficiency: By only activating a subset of experts, the model conserves computational resources. Instead of the entire model working on every input, only a few experts are actively engaged per instance.
  3. Scalability: The model can grow to a larger number of experts without significantly increasing the computation per forward pass, as only a few experts are selected each time.


Designing an Expert Network

Each expert in an MoE model is typically structured as a standalone neural network. The architecture of these experts can vary depending on the nature of the problem, ranging from simple feedforward layers for basic tasks to more complex architectures like convolutional neural networks (CNNs) or recurrent neural networks (RNNs) for image and sequence data, respectively.


1. Feedforward Experts

For tasks involving numerical or tabular data, feedforward networks are often sufficient. Each expert in a feedforward setup may look something like this:

import torch
import torch.nn as nn

class FeedforwardExpert(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedforwardExpert, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        hidden = self.relu(self.fc1(x))
        output = self.fc2(hidden)
        return output        
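
As a quick sanity check, such an expert could be exercised on a random batch like this (the dimensions are arbitrary and chosen only for illustration):

# Hypothetical dimensions for illustration only
expert = FeedforwardExpert(input_dim=16, hidden_dim=32, output_dim=4)
x = torch.randn(8, 16)     # batch of 8 samples with 16 features each
print(expert(x).shape)     # torch.Size([8, 4])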


2. Convolutional Experts for Image Data

When dealing with images, each expert can be a convolutional neural network (CNN) designed to detect specific image features like edges, shapes, or textures. Convolutional experts are particularly useful in scenarios where different regions or patterns in images require distinct processing.

class ConvExpert(nn.Module):
    def __init__(self, input_channels, output_dim):
        super(ConvExpert, self).__init__()
        self.conv1 = nn.Conv2d(input_channels, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc = nn.Linear(32 * 8 * 8, output_dim)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))
        x = self.pool(torch.relu(self.conv2(x)))
        x = x.view(-1, 32 * 8 * 8)  # Flatten for fully connected layer
        output = self.fc(x)
        return output        

In this ConvExpert, the convolutional layers extract features, and the fully connected layer at the end produces the final output. This setup enables each expert to specialize in different visual aspects of the data.
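
Note that the fully connected layer above assumes 32x32 inputs, since two rounds of 2x2 pooling reduce them to 8x8 feature maps. A quick shape check under that assumption might look like this:

# Assumes 32x32 input images, matching the 32 * 8 * 8 flattening above
expert = ConvExpert(input_channels=3, output_dim=10)
images = torch.randn(4, 3, 32, 32)   # batch of 4 RGB images
print(expert(images).shape)          # torch.Size([4, 10])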


3. Recurrent Experts for Sequence Data

For sequential tasks such as language modeling or time series forecasting, recurrent neural networks (RNNs), Long Short-Term Memory networks (LSTMs), or even Transformer-based experts are often employed. Here’s an example of an LSTM-based expert:

class LSTMExpert(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(LSTMExpert, self).__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        _, (hidden, _) = self.lstm(x)
        output = self.fc(hidden[-1])
        return output        

This LSTMExpert specializes in sequential dependencies, making it suitable for handling tasks like text, speech, or time series analysis.
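
A usage sketch with assumed dimensions (20-step sequences with 8 features per step):

# Hypothetical dimensions for illustration only
expert = LSTMExpert(input_dim=8, hidden_dim=32, output_dim=5)
sequences = torch.randn(4, 20, 8)    # (batch, seq_len, features), batch_first=True
print(expert(sequences).shape)       # torch.Size([4, 5])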


Training Experts for Specialization

The training process of MoE models encourages each expert to specialize. Here are some key strategies used to foster specialization:

  1. Sparse Activation: By limiting the number of experts that are active for each input, the model forces experts to specialize. Each expert only gets to see the subset of data it’s most suited for, which naturally leads to a division of labor among experts.
  2. Diverse Data Distribution: When training the MoE on varied data, the experts learn to handle specific distributions or features. For example, some experts might focus on simpler patterns, while others handle more complex or nuanced cases.
  3. Regularization Techniques: Techniques like L2 regularization or dropout can be applied within each expert to prevent overfitting. Additionally, encouraging diversity among experts (e.g., by penalizing correlations between experts’ outputs) can lead to more distinct specialization; a minimal sketch of such a penalty follows this list.
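
As one possible illustration of the diversity idea in point 3, the sketch below penalizes pairwise cosine similarity between expert outputs; the penalty and its coefficient are assumptions for demonstration, not a standard MoE loss:

import torch.nn.functional as F

def diversity_penalty(expert_outputs):
    # expert_outputs: list of tensors, each of shape (batch, output_dim).
    # Returns the mean absolute cosine similarity across expert pairs,
    # which can be added to the task loss with a small coefficient.
    penalty, num_pairs = 0.0, 0
    for i in range(len(expert_outputs)):
        for j in range(i + 1, len(expert_outputs)):
            sim = F.cosine_similarity(expert_outputs[i], expert_outputs[j], dim=-1)
            penalty = penalty + sim.abs().mean()
            num_pairs += 1
    return penalty / max(num_pairs, 1)

# Hypothetical usage: loss = task_loss + 0.01 * diversity_penalty(expert_outputs)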


Advantages of Expert Specialization

  1. Improved Performance on Diverse Data: Each expert’s focus on a subset of the problem allows the MoE to generalize better across varied data distributions. This adaptability is especially useful for tasks where the input data can have diverse patterns.
  2. Better Generalization: With each expert focusing on a specific subset of the data, MoE models can generalize well across tasks or data types. For instance, experts in a language model may generalize better by learning different linguistic aspects like grammar, vocabulary, or style.
  3. Enhanced Model Robustness: Because each expert specializes, the model can leverage the most relevant expert (or experts) when encountering out-of-distribution data. This can improve robustness to new or previously unseen data.


Challenges of Expert Specialization

  1. Underutilization: Some experts may remain underutilized if they are rarely selected by the gating network. This can lead to imbalances, where certain experts are heavily used while others receive limited training.
  2. Managing Overlap: Ensuring that experts specialize without excessive overlap can be difficult. If multiple experts start learning similar patterns, the efficiency of the MoE model is reduced.
  3. Complexity in Gradient Computation: During backpropagation, only the activated experts receive gradient updates. This selective activation requires careful management to avoid training imbalances.


Practical Example: Expert Specialization in a Text Classification Task

To see expert specialization in action, let’s consider a text classification task where each expert in the MoE specializes in different types of text (e.g., positive, negative, neutral sentiment). Suppose we have an MoE model with three experts:

  • Expert 1 specializes in positive sentiment patterns.
  • Expert 2 specializes in negative sentiment patterns.
  • Expert 3 focuses on neutral or ambiguous patterns.

The gating network dynamically selects the expert or combination of experts based on the input text, routing positive texts to Expert 1, negative to Expert 2, and neutral to Expert 3.

# Example routing of a single input through the MoE (illustrative sketch:
# gating_network, experts, and aggregate_expert_outputs are assumed to be defined)
input_text = torch.tensor(...)  # Some input representation (single sample, no batch dim)

gate_values = gating_network(input_text)  # Scores over all experts
top_k_values, top_k_indices = torch.topk(gate_values, k=2)  # Select top 2 experts

# Collect outputs from the selected experts and combine them
expert_outputs = [experts[idx](input_text) for idx in top_k_indices]
final_output = aggregate_expert_outputs(top_k_values, expert_outputs)

By allowing each expert to focus on a specific sentiment type, the MoE model can achieve high accuracy with reduced computational requirements.

In Mixture of Experts models, specialized subnetworks (experts) provide an efficient way to scale deep learning models to handle complex and varied data. The design of experts allows them to focus on unique aspects of a problem, promoting generalization and robustness in large-scale models.


2. Gating Network: Routing the Input to the Right Experts

In a Mixture of Experts (MoE) model, the gating network plays a critical role in dynamically selecting which experts to activate for a given input. This mechanism allows the MoE architecture to leverage sparsity by engaging only a subset of experts, which saves computational resources and enables the model to handle large and complex tasks efficiently. Let’s take a closer look at the inner workings, types, challenges, and practical implementations of the gating network.


Overview of the Gating Network

The gating network is responsible for determining which experts should handle a specific input, making MoE models flexible and adaptable to varying inputs. For each input sample, the gating network computes a set of probabilities or scores that indicate how relevant each expert is to the input. Typically, only the top-k experts—those with the highest probabilities—are activated for each forward pass.

In addition to determining expert selection, the gating network also provides the weight (or importance) for each selected expert’s output, which affects the final aggregation step. By assigning higher weights to the more relevant experts, the gating network ensures that the model's output is accurate and optimized for the input at hand.


Key Properties of the Gating Network

  1. Input-Dependent Selection: Unlike a static model where all neurons are always active, the gating network dynamically selects which experts to use based on the characteristics of the input data.
  2. Sparse Activation: Only a small subset of experts is activated for each input, reducing computational load and improving scalability.
  3. Weight Assignment: The gating network not only selects experts but also assigns importance weights to each selected expert, influencing the aggregation process.
  4. Adaptability: The gating network can be adjusted to select a different number of experts based on the complexity of the input, allowing more experts to be active for difficult inputs.


Types of Gating Networks

The design of the gating network can vary based on the specific requirements of the model and task. Here are some common types of gating networks used in MoE:


Softmax-Based Gating Network

  • The softmax function is commonly used to produce a probability distribution over the experts, assigning each expert a non-negative score, with the scores summing to one across all experts.
  • Given the input features, the gating network applies a fully connected layer followed by a softmax to produce probabilities for each expert.
  • A top-k operation selects the experts with the highest probabilities, and the selected experts’ outputs are weighted by these probabilities.

Example in PyTorch:

import torch
import torch.nn as nn

class SoftmaxGatingNetwork(nn.Module):
    def __init__(self, input_dim, num_experts):
        super(SoftmaxGatingNetwork, self).__init__()
        self.fc = nn.Linear(input_dim, num_experts)

    def forward(self, x):
        # Compute gating probabilities
        gate_values = torch.softmax(self.fc(x), dim=-1)
        return gate_values        


Sigmoid-Based Gating Network

  • In some cases, a sigmoid activation is used instead of softmax, producing independent probabilities for each expert rather than a normalized distribution.
  • This setup can be helpful when we want to allow several experts to be highly activated at once, since the scores are not constrained to sum to one, permitting overlap between expert activations.

class SigmoidGatingNetwork(nn.Module):
    def __init__(self, input_dim, num_experts):
        super(SigmoidGatingNetwork, self).__init__()
        self.fc = nn.Linear(input_dim, num_experts)

    def forward(self, x):
        gate_values = torch.sigmoid(self.fc(x))  # Independent probability for each expert
        return gate_values        


Threshold-Based Gating Network

  • This method applies a threshold to select experts. For each input, if an expert’s probability exceeds a certain threshold, it is activated.
  • This approach can allow more flexibility in the number of active experts per input, rather than a fixed top-k count.

def threshold_gating(gate_values, threshold=0.5):
    return (gate_values > threshold).float() * gate_values        


Learned Sparse Gating (Top-k Selection)

  • In a learned sparse gating approach, the gating network is designed to activate only the top-k experts for each input, ensuring sparsity.
  • This technique is ideal when computational efficiency is critical, as it guarantees that only a limited number of experts are activated.

def top_k_gating(gate_values, k):
    top_k_values, top_k_indices = torch.topk(gate_values, k)
    mask = torch.zeros_like(gate_values).scatter_(1, top_k_indices, 1.0)
    return mask * gate_values        
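
Putting the softmax gate and the top-k selection together, a usage sketch with assumed dimensions might look like this:

# Hypothetical dimensions: batch of 4 inputs, 16 features, 8 experts, top-2 routing
gating = SoftmaxGatingNetwork(input_dim=16, num_experts=8)
x = torch.randn(4, 16)

gate_values = gating(x)                    # (4, 8), each row sums to 1
sparse_gates = top_k_gating(gate_values, k=2)
print(sparse_gates)                        # only two non-zero weights per row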


Training the Gating Network

Training the gating network is essential to ensure that it learns to activate the right experts for different types of input. Here are some critical considerations in training the gating network:

  1. Backpropagation Through Sparse Selection: The gating network’s parameters are updated via backpropagation, but only the experts selected by the gating mechanism receive gradient updates. This selective updating requires careful handling to ensure all experts are sufficiently trained over time.
  2. Load Balancing: One common challenge in MoE models is load imbalance among experts. If some experts are rarely selected, they may not receive enough training, resulting in degraded performance. Load balancing strategies, such as regularizing the gating probabilities, can help encourage the gating network to select a more even distribution of experts over time.
  3. Regularization Techniques: Regularization can help ensure that the gating network avoids overfitting and generalizes well. Common approaches include:

  • Entropy Regularization: Adding an entropy-based loss term to encourage a balanced selection of experts. High entropy in the gating distribution indicates a more even spread, avoiding dominance by a few experts (a minimal sketch follows this list).
  • L2 Regularization: Applying L2 regularization to the gating network’s weights to prevent over-reliance on specific experts.
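
As an illustration of the entropy idea, one possible auxiliary term computes the entropy of the batch-averaged gating distribution and rewards it; this is a sketch under assumed conventions, not the exact loss used by any particular MoE paper:

def entropy_regularization(gate_values, eps=1e-9):
    # gate_values: (batch, num_experts) softmax probabilities from the gate.
    # Higher entropy of the average usage means more even expert utilization,
    # so this term is subtracted from the loss (with a small coefficient).
    mean_usage = gate_values.mean(dim=0)                      # average routing weight per expert
    entropy = -(mean_usage * (mean_usage + eps).log()).sum()  # entropy of the usage distribution
    return entropy

# Hypothetical usage: loss = task_loss - 0.01 * entropy_regularization(gate_values)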


Challenges in Designing and Training Gating Networks

Despite its advantages, designing an effective gating network comes with challenges:

  1. Sparse Gradient Flow: Since only the selected experts are activated, gradient updates only flow through the activated pathways, which may limit the training signal reaching other parts of the network. Techniques like entropy regularization or auxiliary loss functions can help encourage more balanced expert utilization.
  2. Dynamic Load Balancing: Ensuring an even load distribution across experts is challenging, especially in models with many experts. Without proper regularization, the gating network may favor a subset of experts, resulting in underutilization of others. Dynamic load balancing constraints or auxiliary loss terms can mitigate this issue.
  3. Scalability: The computational cost of selecting top-k experts grows with the number of experts. In very large models, optimizing the gating network to scale efficiently is essential. Recent MoE models often implement approximate top-k selection or use specialized hardware to manage the load.


The gating network is the heart of the Mixture of Experts model, dynamically routing inputs to the appropriate experts and making MoE models efficient, adaptable, and scalable. Through sparse activation, dynamic routing, and selective expert weighting, the gating network enables the MoE architecture to handle diverse data with specialized processing.


3. Aggregator: Combining Outputs from Active Experts

In a Mixture of Experts (MoE) model, after the gating network selects a subset of experts to handle an input, the outputs from these experts must be combined to produce a cohesive final output. This process, known as aggregation, is critical in the MoE architecture as it ensures that the model leverages the expertise of the selected experts while maintaining consistency in the output.

Let’s explore different aggregation methods, their advantages and trade-offs, implementation approaches, and considerations for ensuring optimal performance in an MoE model.


Overview of the Aggregation Process

The aggregation step follows the gating and expert selection processes, where:

  1. Selected Experts: The gating network identifies the top-k experts relevant for a given input, passing their indices and associated weights to the aggregator.
  2. Weighted Combination: The aggregator combines the outputs of these selected experts into a single, coherent output, often using weights assigned by the gating network.
  3. Final Output: The combined result is passed to subsequent layers in the network or used as the final prediction, depending on the model’s design.

The aggregation process is crucial because it allows the model to leverage the specialized knowledge of multiple experts, each contributing uniquely based on their expertise in certain aspects of the input space.


Types of Aggregation Methods

There are several ways to combine the outputs of selected experts. Each aggregation method has its own strengths, and the choice depends on the specific requirements of the task, the nature of the experts, and computational efficiency considerations. Here are some common aggregation techniques:


Weighted Sum Aggregation

  • In the weighted sum method, the outputs of each expert are scaled by the weights provided by the gating network and then summed.
  • This approach is commonly used because it is computationally efficient and preserves the proportional contributions of each expert based on their gating weights.


Mathematical Representation: if E_i(x) is the output of selected expert i and w_i its gating weight, the combined output is y = Σ_i w_i · E_i(x), with the sum taken over the top-k selected experts.

Code Example:

def weighted_sum_aggregation(gate_values, expert_outputs):
    # gate_values: weights from the gating network (top-k experts)
    # expert_outputs: list of outputs from the top-k experts
    combined_output = sum(w * output for w, output in zip(gate_values, expert_outputs))
    return combined_output        

In this example, each expert’s output is multiplied by its corresponding weight, and all weighted outputs are summed to create the final output.


Concatenation Aggregation

  • In the concatenation method, the outputs from each selected expert are concatenated along a specific dimension, creating a larger feature representation.
  • This approach is helpful when the task benefits from preserving the distinct features learned by each expert, as it maintains separate channels for each expert’s contribution.
  • Concatenation is often followed by additional layers (e.g., fully connected layers) that reduce the concatenated features to the desired output size.

Code Example:

def concatenation_aggregation(expert_outputs):
    # Concatenate expert outputs along the last dimension
    combined_output = torch.cat(expert_outputs, dim=-1)
    return combined_output        

This approach can be computationally more expensive but allows the model to retain more information about each expert’s output.
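
As noted above, concatenation is usually followed by a projection back to the desired output size; here is a minimal sketch with hypothetical dimensions:

# Hypothetical: 2 selected experts, each emitting 4-dim outputs, projected back to 4 dims
projection = nn.Linear(2 * 4, 4)

expert_outputs = [torch.randn(8, 4), torch.randn(8, 4)]   # outputs from two experts
combined = concatenation_aggregation(expert_outputs)      # shape (8, 8)
final_output = projection(combined)                       # shape (8, 4)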


Ensemble Averaging (Mean)

  • In ensemble averaging, the outputs from the selected experts are averaged (or weighted and averaged) to produce the final result.
  • This approach is simple and effective, especially when experts have learned similar or complementary features. It smooths out the predictions, which can improve robustness.
  • Ensemble averaging may ignore the individual weights provided by the gating network, but it can be particularly useful when each expert is expected to contribute equally.

Code Example:

def ensemble_averaging(expert_outputs):
    # Average the outputs from the selected experts
    combined_output = sum(expert_outputs) / len(expert_outputs)
    return combined_output        


Attention-Based Aggregation

  • Attention mechanisms can be used to compute dynamic weights for each expert’s output based on the input features, allowing the model to emphasize certain experts more flexibly.
  • This method is computationally more intensive but allows for highly adaptable expert contributions based on both the input and the specific requirements of the task.
  • Attention-based aggregation is especially useful when different parts of the input are likely to benefit from different levels of expert contribution.

Code Example:

import torch.nn.functional as F

def attention_based_aggregation(expert_outputs, attention_weights):
    # Compute weighted sum using dynamic attention weights
    combined_output = sum(w * output for w, output in zip(attention_weights, expert_outputs))
    return combined_output        

Here, attention_weights can be generated dynamically by a separate attention mechanism, adding another layer of control over the aggregation process.
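
One way such weights could be produced, sketched here with assumed shapes rather than any standard API, is a small scoring module applied to the expert outputs themselves:

class ExpertAttention(nn.Module):
    # Maps each expert output of shape (batch, output_dim) to a scalar score,
    # then softmaxes the scores across the selected experts.
    def __init__(self, output_dim):
        super(ExpertAttention, self).__init__()
        self.score = nn.Linear(output_dim, 1)

    def forward(self, expert_outputs):
        stacked = torch.stack(expert_outputs, dim=1)      # (batch, num_selected, output_dim)
        scores = self.score(stacked).squeeze(-1)          # (batch, num_selected)
        weights = torch.softmax(scores, dim=-1)           # attention weight per expert
        # Return one (batch, 1) weight column per expert for broadcasting
        return [weights[:, i:i + 1] for i in range(len(expert_outputs))]

Each returned (batch, 1) weight column broadcasts against its (batch, output_dim) expert output inside attention_based_aggregation.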


Key Considerations for Aggregation

  1. Balancing Information Retention and Computational Cost: Aggregation methods like concatenation preserve more information but require additional computation and memory. Weighted sum is computationally lighter and works well when expert outputs are similar in scale.
  2. Handling Varying Output Scales: If different experts produce outputs with significantly different magnitudes, normalizing or standardizing the expert outputs before aggregation can improve performance (see the sketch after this list).
  3. Regularization in Aggregation: When aggregating outputs from multiple experts, regularization techniques like dropout or layer normalization on the combined output can help prevent overfitting and ensure robust generalization.
  4. Dynamic Aggregation: Using attention-based or gating-weight-based aggregation allows for more adaptive aggregation, enabling the model to dynamically adjust the influence of each expert based on the input. This flexibility is useful in cases where the model needs to respond to diverse patterns in the data.
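
For the second point, a minimal sketch of standardizing each expert's output before the weighted sum could look like the following; the per-output standardization is just one reasonable choice, not a prescribed method:

def normalized_weighted_sum(gate_values, expert_outputs, eps=1e-6):
    # Standardize each expert's output before combining, so that experts with
    # larger output magnitudes do not dominate the aggregation.
    normalized = []
    for output in expert_outputs:
        mean = output.mean(dim=-1, keepdim=True)
        std = output.std(dim=-1, keepdim=True)
        normalized.append((output - mean) / (std + eps))
    return sum(w * out for w, out in zip(gate_values, normalized))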


Challenges in Aggregation

  1. Alignment of Output Dimensions: In MoE models where experts have different output shapes, it can be challenging to ensure alignment during aggregation. Ensuring that each expert’s output matches the expected shape is essential.
  2. Gradient Flow Through Sparse Aggregation: Aggregation in sparse MoE models can lead to limited gradient flow to inactive experts. Techniques such as auxiliary losses or regularization on unused experts may be necessary to ensure all experts receive sufficient training.
  3. Load Imbalance: Aggregation approaches that overly rely on a few experts may lead to load imbalance, where certain experts are frequently used while others are underutilized. Regularizing the gating weights to encourage more balanced expert utilization can address this issue.


The aggregator in an MoE model is the final piece that brings together the specialized knowledge of multiple experts to produce a unified output. Different aggregation strategies, from simple weighted sums to complex attention-based approaches, offer flexibility in balancing computational efficiency with model performance.


Full MoE Model Implementation

Here’s how an end-to-end MoE model can be constructed in PyTorch, reusing the FeedforwardExpert and SoftmaxGatingNetwork classes defined earlier:

class MixtureOfExperts(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_experts, k=2):
        super(MixtureOfExperts, self).__init__()
        # Reuse the feedforward expert and softmax gate defined earlier
        self.experts = nn.ModuleList(
            [FeedforwardExpert(input_dim, hidden_dim, output_dim) for _ in range(num_experts)]
        )
        self.gate = SoftmaxGatingNetwork(input_dim, num_experts)
        self.k = k                      # Number of experts to select per input
        self.output_dim = output_dim

    def forward(self, x):
        gate_values = self.gate(x)                                   # (batch, num_experts)
        top_k_values, top_k_indices = torch.topk(gate_values, self.k, dim=-1)

        output = torch.zeros(x.size(0), self.output_dim, device=x.device)
        for expert_id, expert in enumerate(self.experts):
            # Find the samples routed to this expert and their gating weights
            sample_idx, slot_idx = (top_k_indices == expert_id).nonzero(as_tuple=True)
            if sample_idx.numel() == 0:
                continue                                             # Expert not selected for this batch
            weights = top_k_values[sample_idx, slot_idx].unsqueeze(-1)
            output[sample_idx] += weights * expert(x[sample_idx])    # Weighted expert output
        return output

In this model:

  • torch.topk selects the k most relevant experts for each input based on the gating probabilities.
  • Each expert processes only the samples routed to it, so the per-input computation stays roughly constant as the number of experts grows.
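
A quick usage sketch with hypothetical dimensions:

# Hypothetical dimensions for illustration only
moe = MixtureOfExperts(input_dim=16, hidden_dim=32, output_dim=4, num_experts=8, k=2)
x = torch.randn(10, 16)    # batch of 10 samples
print(moe(x).shape)        # torch.Size([10, 4])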


Advantages of MoE

  1. Efficiency: MoE activates only a subset of experts, making it computationally efficient.
  2. Scalability: The model scales well with more experts, increasing capacity without proportional computation.
  3. Specialization: Experts learn specialized skills, improving model accuracy and generalization.


Challenges of MoE

  1. Load Balancing: Ensuring all experts are utilized optimally is challenging; some experts may be underused.
  2. Training Complexity: Managing gradients with sparsity is complex, and optimizing MoE models can be tricky.
  3. Implementation Overhead: Selecting and routing experts adds architectural complexity.


Real-World Applications of MoE

  1. Natural Language Processing (NLP): Google’s Switch Transformer, a large-scale MoE-based model, demonstrates how MoE architectures achieve high efficiency in NLP.
  2. Recommendation Systems: MoE enables systems to focus on specialized user preferences.
  3. Computer Vision: Specialized experts in MoE can focus on different image regions or patterns, enhancing image recognition tasks.


Conclusion

Mixture of Experts offers a powerful framework for scaling deep learning models while managing computational cost. By leveraging multiple specialized networks, MoE allows models to learn complex tasks efficiently. Though it poses unique challenges in training and balancing, MoE is a promising direction for advancing deep learning models to tackle ever-larger datasets and complex tasks.

The MoE architecture holds great promise for future innovations in AI, particularly as models grow in size and computational requirements continue to climb. By mastering MoE, data scientists and ML engineers can be at the forefront of scalable AI solutions.


Sources

1. Mixture of Experts Explained - Hugging Face: This article provides an in-depth explanation of MoE layers, including their structure and benefits in transformer models.

https://huggingface.co/blog/moe

2. What is Mixture of Experts? - IBM: IBM's overview of MoE discusses its origins, functionality, and applications in machine learning.

https://www.ibm.com/topics/mixture-of-experts

3. Mixture of Experts: How an Ensemble of AI Models Decide As One - Deepgram: This guide explores the ensemble learning aspect of MoE and its implementation in AI models.

https://deepgram.com/learn/mixture-of-experts-ml-model-guide

4. A Gentle Introduction to Mixture of Experts Ensembles - Machine Learning Mastery: This tutorial offers a comprehensive introduction to MoE ensembles, including their components and how they function together.

https://machinelearningmastery.com/mixture-of-experts/

5. Towards Understanding Mixture of Experts in Deep Learning - arXiv: This research paper delves into the theoretical aspects of MoE, providing insights into its performance and behavior in deep learning.

https://arxiv.org/abs/2208.02813

6. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer - arXiv: This paper introduces the sparsely-gated MoE layer, discussing its capacity and computational efficiency.

https://arxiv.org/abs/1701.06538

7. Mixture-of-Experts with Expert Choice Routing - Google Research: This article presents advancements in MoE models, focusing on expert choice routing mechanisms.

https://research.google/blog/mixture-of-experts-with-expert-choice-routing/

8. The Sparsely Gated Mixture of Experts Layer for PyTorch - GitHub: This GitHub repository provides a PyTorch implementation of the sparsely gated MoE layer, including examples and code.

https://github.com/davidmrau/mixture-of-experts


#MixtureOfExperts #DeepLearning #MachineLearning #NeuralNetworks #AIResearch #ModelEfficiency #MLArchitecture #GatingNetwork #ExpertSelection #ArtificialIntelligence #ScalableAI #FutureOfAI #TechInnovation #DataScience #NeuralNetworkArchitectures #ViewsMyOwn #GenerativeAI #LLM

