NewMind AI Journal #34

Evaluation of Large Language Models on Turkish Reasoning Datasets

By NewMind AI Team

A. Introduction

  • Recent advancements in large language models (LLMs) have enhanced their ability to tackle complex reasoning tasks across varied datasets.

  • Evaluating LLM competency across diverse reasoning datasets remains a significant challenge.

  • This study compares two advanced models, Qwen/QwQ-32B-Preview and DeepSeek-R1-Distill-Qwen-32B, on Turkish reasoning datasets: MMLU-TR, GPQA-TR, and ARC-TR.

  • These datasets, originally in English, were machine-translated into Turkish.

  • The evaluation examines accuracy, token efficiency, and latency to highlight each model’s strengths and trade-offs in processing Turkish-language tasks.

  • GPT-4o-mini serves as the judge model to ensure fair and consistent evaluation across datasets.

  • Structured prompt formatting is used to improve response consistency and alignment with the expected answer format.

Figure 1: Overview of the Reasoning Large Language Model

B. Structured Evaluation Pipeline for Machine-Translated Turkish Reasoning Datasets

I. Pipeline Overview

The evaluation pipeline for machine-translated Turkish reasoning datasets consists of three main components:

1. Inference Models

The evaluation involves two state-of-the-art models; a minimal client sketch follows the list:

  • DeepSeek-R1-Distill-Qwen-32B (accessed via Groq API)
  • Qwen/QwQ-32B-Preview (accessed via Nebius AI API)
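As a minimal sketch, assuming both providers expose OpenAI-compatible chat endpoints (the base URLs and model identifiers below are assumptions, not verbatim from the study), the two generator clients could be set up as follows:

```python
import os

from openai import OpenAI

# Groq's OpenAI-compatible endpoint (assumed); serves DeepSeek-R1-Distill-Qwen-32B.
groq_client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)

# Nebius AI Studio endpoint (assumed); serves Qwen/QwQ-32B-Preview.
nebius_client = OpenAI(
    api_key=os.environ["NEBIUS_API_KEY"],
    base_url="https://api.studio.nebius.ai/v1",
)

def generate(client: OpenAI, model: str, prompt: str) -> str:
    """Send one formatted multiple-choice question and return the raw model reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Hypothetical model identifiers, following each provider's naming conventions:
# generate(groq_client, "deepseek-r1-distill-qwen-32b", question_prompt)
# generate(nebius_client, "Qwen/QwQ-32B-Preview", question_prompt)
```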

2. Scoring Mechanism

Once model responses are generated, they are evaluated using GPT-4o-mini as a judge model to determine their correctness; a minimal judge sketch appears after the feature list below.

Key Features:

  • Automated Response Parsing: Extracts the selected answer from text-based model outputs.
  • Comparison with Ground Truth: Checks whether the selected answer matches the expected label.
  • JSON-Formatted Evaluation Output: Ensures structured and interpretable evaluation results.
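A minimal sketch of this judge step, assuming an OpenAI-style chat call to GPT-4o-mini with JSON output enforced; the judge prompt and field names are illustrative, not the study's exact wording:

```python
import json
import os

from openai import OpenAI

judge = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Illustrative judge prompt; the study's exact prompt is not reproduced here.
JUDGE_PROMPT = """You are an evaluator. The ground-truth answer index is {gold}.
Model response:
---
{response}
---
Extract the answer index the model selected and decide whether it matches the
ground truth. Reply with JSON only: {{"selected_index": <int or null>, "correct": <true or false>}}"""

def score_response(model_response: str, gold_index: int) -> dict:
    """Ask GPT-4o-mini to parse the reply and compare it with the expected label."""
    reply = judge.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(gold=gold_index, response=model_response),
        }],
        response_format={"type": "json_object"},  # structured, machine-readable verdict
    )
    return json.loads(reply.choices[0].message.content)
```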

3. Dataset Processing & Evaluation Execution

The pipeline uses machine-translated Turkish reasoning datasets for model evaluation. The dataset processing component includes the following steps; a loading sketch appears after the list:

  • Dataset Loading: Retrieves the reasoning datasets (MMLU-TR, GPQA-TR, ARC-TR) via the Hugging Face datasets library.
  • Standardized Formatting: Converts each dataset sample into a uniform multiple-choice question format.
  • Evaluation Execution: Calls the inference models and scoring mechanism to generate structured evaluation reports.
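A loading and standardization sketch, assuming the Hugging Face datasets library and MMLU-style field names; the repository id follows reference [1], but the config, split, and field names are assumptions that may need adjusting per dataset:

```python
from datasets import load_dataset

def standardize(sample: dict) -> dict:
    """Map a raw sample onto the uniform multiple-choice record used downstream."""
    return {
        "question": sample["question"],        # assumed field name
        "options": list(sample["choices"]),    # assumed field name
        "gold_index": int(sample["answer"]),   # assumed field name and type
    }

# Repository id from the references; split/config names may differ in practice.
mmlu_tr = load_dataset("malhajar/mmlu_tr-v0.2", split="test")
records = [standardize(s) for s in mmlu_tr]
```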

4. Dataset Details

4.1. MMLU-TR (Massive Multitask Language Understanding - Turkish Version)

Source: MMLU-TR Dataset

  • Description: The MMLU-TR dataset is a Turkish adaptation of the MMLU benchmark, covering diverse topics. In this study, the dataset specifically focuses on professional law questions, evaluating LLMs’ ability to process and reason about legal concepts, principles, and case-based scenarios.
  • Task Type: Multiple-choice reasoning
  • Evaluation Scope: Assesses model competency in professional law-related knowledge and legal reasoning.


Table 1: MMLU-TR-v0.2 Dataset Details

4.2. GPQA-TR (Graduate-Level Google-Proof Q&A - Turkish Version)

Source: GPQA-Formatted-TR Dataset

  • Description: The GPQA-TR dataset is a Turkish adaptation of the diamond subset of GPQA (Jegger/GPQA). It includes various scientific and general-knowledge multiple-choice questions.
  • Task Type: Open-domain reasoning questions
  • Evaluation Scope: Tests LLMs’ ability to process scientific reasoning and factual knowledge in Turkish.

Table 2: GPQA-formatted-TR Dataset Details

4.3. ARC-TR (AI2 Reasoning Challenge - Turkish Version)

Source: ARC-TR-v0.2 Dataset

  • Description: The ARC-TR dataset is a Turkish version of the AI2 Reasoning Challenge (ARC), a dataset designed to test complex reasoning skills in AI models. It focuses on scientific question answering using structured multiple-choice formats.

  • Task Type: Complex reasoning on science-related questions.

Table 3: ARC-TR-v0.2 Dataset Details

II. Model & Inference Platform Summary

In this evaluation pipeline, large language models (LLMs) serve as generators, producing responses for machine-translated Turkish reasoning datasets. These models generate outputs based on structured prompts and predefined evaluation settings. Below is a summary of the inference platforms and per-model parameter listings used; a placeholder-only request sketch follows the list:

1. Nebius AI Studio (Qwen-QwQ-32B-Preview)

2. Groq Cloud (DeepSeek-R1-Distill-Qwen-32B)

3. Qwen/QwQ-32B-Preview Model Parameters

4. DeepSeek-R1-Distill-Qwen-32B Model Parameters
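As a hedged sketch of where these per-model settings apply, decoding parameters would be supplied per request in an OpenAI-compatible call roughly as below. The numeric values are placeholders for illustration only, not the settings used in this study:

```python
from openai import OpenAI

def generate_with_params(client: OpenAI, model: str, prompt: str) -> str:
    """One request with explicit decoding parameters (placeholder values only)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.6,   # placeholder, not the study's value
        top_p=0.95,        # placeholder, not the study's value
        max_tokens=4096,   # placeholder cap on reasoning plus final answer
    )
    return response.choices[0].message.content
```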

III. Evaluation Insights

1. DeepSeek-R1-Distill-Qwen-32B outperformed Qwen/QwQ-32B-Preview on GPQA-formatted-TR but was weaker on MMLU-TR and ARC-TR.

  • GPQA-formatted-TR: DeepSeek (46.6%) vs. Qwen (40.6%).
  • MMLU-TR-v0.2: Qwen achieved a higher score (45%) than DeepSeek (44.1%).
  • ARC-TR-v0.2: Qwen led significantly with 87.2%, while DeepSeek scored 81.1%.

2. Latency Comparison:

  • DeepSeek-R1-Distill-Qwen-32B exhibited significantly lower latency across all datasets.

  • On GPQA-formatted-TR, DeepSeek completed evaluation in 1h 58m, whereas Qwen took 3h 24m.

  • On MMLU-TR-v0.2 and ARC-TR-v0.2, DeepSeek was nearly twice as fast.

3. Token Consumption:

  • DeepSeek processed ARC-TR-v0.2 with fewer tokens (650K vs. 1.07M), showing better token efficiency.

  • In MMLU-TR-v0.2, however, Qwen used more tokens (1.75M) but achieved a marginally better score.

IV. Prompt Formatting and Answer Selection

To keep model responses aligned with the multiple-choice format (four options with indexed answers), structured formatting was used. An index was placed before each answer choice, and the answer index was marked within \boxed{}. The model was instructed to select an answer by index, which improved consistency and prevented deviations from the expected format.
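A sketch of this convention, with indexed options and the final answer expected inside \boxed{}; the Turkish instruction text, helper names, and regex are illustrative, not the study's exact prompt:

```python
import re

def build_prompt(question: str, options: list[str]) -> str:
    """Format a question with indexed options and a \\boxed{} answer instruction."""
    lines = [question] + [f"{i}) {opt}" for i, opt in enumerate(options)]
    lines.append("Cevabını yalnızca seçtiğin indeks olarak \\boxed{} içinde ver.")
    return "\n".join(lines)

def extract_boxed_index(response: str) -> int | None:
    r"""Pull the last \boxed{N} answer index from a model response, if any."""
    matches = re.findall(r"\\boxed\{\s*(\d+)\s*\}", response)
    return int(matches[-1]) if matches else None

# Example (hypothetical content):
# prompt = build_prompt("Aşağıdakilerden hangisi ...?", ["Seçenek A", "Seçenek B", "Seçenek C", "Seçenek D"])
# extract_boxed_index("... Sonuç olarak cevap \\boxed{2}")  # -> 2
```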

1. MMLU-TR-v0.2 Data

2. GPQA-formatted-TR Data

3. ARC-TR-v0.2 Data

C. Our Mind

  • DeepSeek-R1-Distill-Qwen-32B provided notable advantages in speed and token efficiency, particularly on GPQA-formatted-TR.

  • Qwen/QwQ-32B-Preview performed better on MMLU-TR and ARC-TR, suggesting it may be better suited for general knowledge and professional law-related tasks.

  • While DeepSeek was significantly faster, Qwen’s higher accuracy on two of the three datasets implies a trade-off between speed and precision.

  • Further fine-tuning and parameter optimization may enhance both models’ performance in specific domains.

  • Prompt Formatting with \boxed{}: When providing dataset samples to the models, we enclosed key information within \boxed{} to produce structured prompts. This approach improved model consistency in reasoning tasks.

D. Key Takeaways

1. Model Performance Trade-offs

  • Qwen-QwQ-32B-Preview (Nebius AI) performs better on professional law questions (MMLU-TR), demonstrating strong reasoning over domain-specific knowledge.

  • DeepSeek-R1-Distill-Qwen-32B (Groq Cloud) exhibits higher token throughput (388 tokens/sec), making it faster and more efficient for large-scale inference, but its accuracy trade-offs need further analysis.

2. Speed vs. Accuracy Considerations

  • Groq’s model is significantly faster, making it attractive for real-time inference; however, its reasoning accuracy across datasets requires additional benchmarking.

  • Qwen-QwQ-32B-Preview is slower but achieves higher accuracy, particularly on structured reasoning tasks, making it more suitable for legal and scientific reasoning.

3. Dataset Challenges in Turkish Adaptation

  • Machine-translated datasets introduce linguistic challenges that affect model comprehension; certain legal and scientific terminology may not be optimally translated.

  • Evaluation on professional law (MMLU-TR), general knowledge (GPQA-TR), and scientific reasoning (ARC-TR) shows varying performance trends, suggesting that models may need further domain adaptation for Turkish reasoning tasks.


E. References

[1] Malhajar, M. MMLU-TR: Massive Multitask Language Understanding - Turkish Version [Dataset]. Available at: https://huggingface.co/datasets/malhajar/mmlu_tr-v0.2

[2] Jegger, A. GPQA-TR: Graduate-Level Google-Proof Q&A - Turkish Version [Dataset]. Adapted from the GPQA diamond subset. Available at: https://huggingface.co/datasets/Jegger/GPQA

[3] AI2 (Allen Institute for AI). ARC-TR: AI2 Reasoning Challenge - Turkish Version [Dataset]. Adapted using machine translation. Available at: https://allenai.org/data/arc

[4] OpenAI. GPT-4o Mini: Optimized Judge Model for Large-Scale Evaluations. OpenAI API Documentation, 2024. Available at: https://openai.com/research

[5] Groq Inc. DeepSeek-R1-Distill-Qwen-32B Model: High-Speed LLM Serving with Optimized Throughput. Groq API Documentation, 2024. Available at: https://groq.com

[6] Nebius AI. Qwen/QwQ-32B-Preview: Large-Scale Reasoning Model for Turkish NLP Tasks. Nebius AI API Documentation, 2024. Available at: https://nebius.ai

[7] Grootendorst, M. A Visual Guide to Reasoning LLMs, 2024. Figure 1 from: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-reasoning-llms

[8] Qwen Team. QwQ-32B-Preview: Reflect Deeply on the Boundaries of the Unknown, Qwen Blog, 2024. Available at: https://qwenlm.github.io/blog/qwq-32b-preview

