NewMind AI Journal #34
Evaluation of Large Language Models on Turkish Reasoning Datasets
By NewMind AI Team
A. Introduction
B. Structured Evaluation Pipeline for Machine-Translated Turkish Reasoning Datasets
I. Pipeline Overview
The evaluation pipeline for machine-translated Turkish reasoning datasets consists of three main components:
1. Inference Models
The evaluation involves two state-of-the-art reasoning models:
- Qwen/QwQ-32B-Preview, served via Nebius AI Studio
- DeepSeek-R1-Distill-Qwen-32B, served via Groq Cloud
2. Scoring Mechanism
Once model responses are generated, they are evaluated using GPT-4o-mini as a judge model to determine their correctness.
Key Features:
- Uses GPT-4o-mini as an automated judge, avoiding manual grading at scale
- Produces a correctness verdict for each generated response
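As an illustration of the judge step, the sketch below assumes the OpenAI Python SDK and a hypothetical judge prompt; the article's exact prompt wording and grading criteria are not reproduced here.

```python
# Minimal sketch of judge-based scoring with GPT-4o-mini.
# JUDGE_PROMPT is a hypothetical stand-in for the pipeline's actual prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a multiple-choice answer.
Question (Turkish): {question}
Gold answer index: {gold}
Model response: {response}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge(question: str, gold: int, response: str) -> bool:
    """Ask GPT-4o-mini whether the generated response matches the gold index."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # deterministic grading
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, gold=gold, response=response)}],
    )
    return completion.choices[0].message.content.strip().upper().startswith("CORRECT")
```

Setting temperature to 0 keeps the judge's verdicts reproducible across runs.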
3. Dataset Processing & Evaluation Execution
The pipeline utilizes machine-translated Turkish reasoning datasets for model evaluation. The dataset processing component loads each dataset, formats every question into the indexed multiple-choice prompt described in Section IV, sends it to the generator model, and passes the response to the judge; a sketch of this loop follows below.
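A minimal sketch of that orchestration, assuming the judge helper sketched above, the build_prompt formatter sketched in Section IV, and assumed field names (question, choices, answer):

```python
# Hypothetical end-to-end evaluation loop; generate() stands in for the
# platform-specific inference call, judge() for the GPT-4o-mini judge, and
# build_prompt() for the indexed prompt formatter from Section IV.
def evaluate(dataset, generate, judge) -> float:
    """Run every item through the generator and score it with the judge."""
    correct = 0
    for item in dataset:
        prompt = build_prompt(item["question"], item["choices"])  # indexed options
        response = generate(prompt)
        if judge(item["question"], item["answer"], response):
            correct += 1
    return correct / len(dataset)  # dataset-level accuracy
```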
4. Dataset Details
4.1. MMLU-TR (Massive Multitask Language Understanding - Turkish Version)
Source: MMLU-TR Dataset [1]
4.2. GPQA-TR (General-Purpose Question Answering - Turkish Version)
Source: GPQA-Formatted-TR Dataset [2]
4.3. ARC-TR (AI2 Reasoning Challenge - Turkish Version)
Source: ARC-TR-v0.2 Dataset [3]
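As an example of pulling one of these datasets, the MMLU-TR repository id below comes from reference [1]; the split name and field layout are assumptions, so check the dataset card for the real schema:

```python
# Loading the Turkish MMLU variant from the Hugging Face Hub.
from datasets import load_dataset

mmlu_tr = load_dataset("malhajar/mmlu_tr-v0.2", split="test")  # split name assumed
print(mmlu_tr[0])  # expected fields: question text, answer choices, gold answer
```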
II. Model & Inference Platform Summary
In this evaluation pipeline, large language models (LLMs) serve as generators, producing responses for machine-translated Turkish reasoning datasets. These models generate outputs based on structured prompts and predefined evaluation settings. Below is a summary of the inference platforms used:
1. Nebius AI Studio (Qwen-QwQ-32B-Preview)
2. Groq Cloud (DeepSeek-R1-Distill-Qwen-32B)
3. Qwen/QwQ-32B-Preview Model Parameters
4. DeepSeek-R1-Distill-Qwen-32B Model Parameters
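Both platforms offer OpenAI-compatible chat endpoints, so a single client pattern covers generation on either side. The base URL, model id string, and sampling parameters below are illustrative assumptions, not the article's exact settings:

```python
# Sketch of a generator call through Groq's OpenAI-compatible endpoint.
import os
from openai import OpenAI

groq = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

def generate(prompt: str) -> str:
    """Query DeepSeek-R1-Distill-Qwen-32B and return its full response text."""
    completion = groq.chat.completions.create(
        model="deepseek-r1-distill-qwen-32b",  # assumed id; check Groq's model list
        temperature=0.6,   # assumed; DeepSeek-R1-style models often recommend ~0.6
        max_tokens=4096,   # leave room for long chain-of-thought traces
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```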
III. Evaluation Insights
1. DeepSeek-R1-Distill-Qwen-32B outperformed Qwen/QwQ-32B-Preview on GPQA-formatted-TR but was weaker on MMLU-TR and ARC-TR.
2. Latency Comparison:
3. Token Consumption:
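The latency and token figures behind insights 2 and 3 can be collected per request. A minimal sketch with an OpenAI-compatible client follows; the usage field names are the standard ones returned by such endpoints:

```python
# Capture per-request latency and token usage alongside the response.
import time

def timed_generate(client, model: str, prompt: str):
    """Return the response text plus wall-clock latency and total tokens used."""
    start = time.perf_counter()
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    latency_s = time.perf_counter() - start
    usage = completion.usage  # prompt_tokens, completion_tokens, total_tokens
    return completion.choices[0].message.content, latency_s, usage.total_tokens
```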
IV. Prompt Formatting and Answer Selection
To ensure model responses aligned with the prompt's multiple-choice format (four options with indexed answers), structured formatting was used. Indices were placed before each answer choice, and the model was instructed to return the index of its selected answer inside \boxed{}, improving consistency and preventing format deviations; a sketch of this formatting appears after the dataset examples below.
1. MMLU-TR Data
2. GPQA-formatted-TR Data
3. ARC-TR-v0.2 Data
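A minimal sketch of the prompt construction and \boxed{} extraction, with an assumed Turkish instruction line (the article does not show its exact wording):

```python
# Build an indexed multiple-choice prompt and parse the model's \boxed{} answer.
import re

def build_prompt(question: str, choices: list[str]) -> str:
    """Prefix each option with its index and ask for a \\boxed{index} answer."""
    lines = [question]
    lines += [f"{i}) {choice}" for i, choice in enumerate(choices)]
    # Assumed instruction: "Give your answer in the form \boxed{index}."
    lines.append("Cevabını \\boxed{indeks} biçiminde ver.")
    return "\n".join(lines)

def extract_answer(response: str) -> int | None:
    """Return the index inside the first \\boxed{...}, or None if absent."""
    match = re.search(r"\\boxed\{(\d+)\}", response)
    return int(match.group(1)) if match else None
```

Anchoring the answer to a single regex-matchable token makes scoring far less sensitive to the verbose reasoning traces these models produce.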
C. Our Mind
D. Key Takeaways
1. Model Performance Trade-offs
2. Speed vs. Accuracy Considerations
3. Dataset Challenges in Turkish Adaptation
E. References
[1] Malhajar, M. MMLU-TR: Massive Multitask Language Understanding - Turkish Version [Dataset]. Available at: https://huggingface.co/datasets/malhajar/mmlu_tr-v0.2
[2] Jegger, A. GPQA-TR: General-Purpose Question Answering - Turkish Version [Dataset]. Adapted from DIAMOND dataset. Available at: https://huggingface.co/datasets/Jegger/GPQA
[3] AI2 (Allen Institute for AI). ARC-TR: AI2 Reasoning Challenge - Turkish Version [Dataset]. Adapted using machine learning-based translation techniques. Available at: https://allenai.org/data/arc
[4] OpenAI. GPT-4o Mini: Optimized Judge Model for Large-Scale Evaluations. OpenAI API Documentation, 2024. Available at: https://openai.com/research
[5] Groq Inc. DeepSeek-R1-Distill-Qwen-32B Model: High-Speed LLM Serving with Optimized Throughput. Groq API Documentation, 2024. Available at: https://groq.com
[6] Nebius AI. Qwen/QwQ-32B-Preview: Large-Scale Reasoning Model for Turkish NLP Tasks. Nebius AI API Documentation, 2024. Available at: https://nebius.ai
[7] Grootendorst, M. A Visual Guide to Reasoning LLMs, 2024. First image from: https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-reasoning-llms
[8] Qwen Team. QwQ-32B-Preview: Reflect Deeply on the Boundaries of the Unknown, Qwen Blog, 2024. Available at: https://qwenlm.github.io/blog/qwq-32b-preview