Advent of Code Solutions Dataset
Introduction
This dataset contains over 10,000 solutions and input data for the Advent of Code programming puzzles from 2015 to 2023. Advent of Code is an annual set of programming challenges that can be solved in any language. At the moment, the dataset contains solutions to every puzzle in Python and Go, and many solutions in JavaScript, CoffeeScript, TypeScript, Java, Scala, Kotlin, Groovy, Clojure, C#, F#, Swift, Objective-C, R, Haskell, OCaml, Racket, Scheme, Ruby, Erlang, Elixir, Rust, C, C++, Zig, Fortran90, Perl, Pascal, Crystal, Julia, Lua, PHP, Dart, Bash, AWK, Nim, D, V, Prolog, Tcl, and Wren.
You can access the dataset on Hugging Face as isavita/advent-of-code.
The dataset stores all years of Advent of Code puzzles together in a single file, "train.json".
Dataset Structure
Data Fields
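Each entry in the dataset consists of the following fields:
answer: the expected answer to the puzzle, stored as a string
input: the puzzle input
name: an identifier combining day, part, and year (e.g. day4_part1_2015)
solution: the full source code of the solution
solution_lang: the programming language of the solution (e.g. go)
task: the text of the puzzle description
year: the Advent of Code year (e.g. 2015)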
Sample Entry
{
"answer": "117946",
"input": "ckczppom",
"name": "day4_part1_2015",
"solution": "package main\n\nimport (\n\t\"crypto/md5\"\n\t\"fmt\"\n\t\"log\"\n\t\"os\"\n\t\"strconv\"\n\t\"strings\"\n)\n\nfunc main() {\n\tdata, err := os.ReadFile(\"input.txt\")\n\tif err != nil {\n\t\tlog.Fatal(err)\n\t}\n\n\tsecretKey := strings.TrimSpace(string(data))\n\tvar number int\n\tfor {\n\t\thash := md5.Sum([]byte(secretKey + strconv.Itoa(number)))\n\t\thashString := fmt.Sprintf(\"%x\", hash)\n\n\t\tif strings.HasPrefix(hashString, \"00000\") {\n\t\t\tfmt.Printf(\"%d\\n\", number)\n\t\t\tbreak\n\t\t}\n\t\tnumber++\n\t}\n}",
"solution_lang": "go",
"task": "--- Day 4: The Ideal Stocking Stuffer ---\nSanta needs help mining some AdventCoins (very similar to bitcoins) to use as gifts for all the economically forward-thinking little girls and boys.\n\nTo do this, he needs to find MD5 hashes which, in hexadecimal, start with at least five zeroes. The input to the MD5 hash is some secret key (your puzzle input, given below) followed by a number in decimal. To mine AdventCoins, you must find Santa the lowest positive number (no leading zeroes: 1, 2, 3, ...) that produces such a hash.\n\nFor example:\n\nIf your secret key is abcdef, the answer is 609043, because the MD5 hash of abcdef609043 starts with five zeroes (000001dbbfa...), and it is the lowest such number to do so.\nIf your secret key is pqrstuv, the lowest number it combines with to make an MD5 hash starting with five zeroes is 1048970; that is, the MD5 hash of pqrstuv1048970 looks like 000006136ef....",
"year": 2015
}
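As a rough illustration of working with these fields, the sketch below checks that an entry carries the keys seen in the sample and splits the name into day, part, and year. The helper names and the name pattern are assumptions inferred from the single sample entry (day4_part1_2015), not a documented schema.

```python
import re

# Field names taken from the sample entry above.
EXPECTED_FIELDS = {
    "answer", "input", "name", "solution", "solution_lang", "task", "year",
}

# Pattern inferred from the sample name "day4_part1_2015" (an assumption).
NAME_PATTERN = re.compile(r"^day(\d+)_part(\d+)_(\d{4})$")


def missing_fields(entry):
    """Return the set of expected field names absent from an entry."""
    return EXPECTED_FIELDS - entry.keys()


def parse_name(name):
    """Split a name such as 'day4_part1_2015' into day, part, and year."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected name format: {name!r}")
    day, part, year = (int(group) for group in match.groups())
    return {"day": day, "part": part, "year": year}


if __name__ == "__main__":
    print(parse_name("day4_part1_2015"))
    # prints {'day': 4, 'part': 1, 'year': 2015}
```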
Creation Process
I implemented and verified solutions for various Advent of Code challenges. For each challenge, I either solved the puzzle myself using my personal input data from Advent of Code, or generated solutions with open-source models (e.g. Llama 3.1 70B, DeepSeek Coder, Qwen2.5-Coder-32B, and others) and then tested and modified them. This dataset contains my verified solutions and the associated input data for these challenges. Every solution is verified and runs in under 20 seconds. However, the solutions are not highly optimised, and much faster ones are possible in many cases.
Input Handling
All solutions read their input from a file named input.txt located in the same directory as the solution script. Here is the first solution in Python as an example:
with open("input.txt", "r") as file:
    data = file.read()
floor = 0
for char in data:
    if char == "(":
        floor += 1
    elif char == ")":
        floor -= 1
print(floor)
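Because every solution expects input.txt next to it, one way to re-check a Python entry is to write its input and solution fields into a temporary directory, run the script there, and compare its standard output with the answer field. The following is a minimal sketch under those assumptions; check_python_solution and the file name solution.py are hypothetical, and the 20-second default simply mirrors the runtime bound mentioned above.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def check_python_solution(entry, timeout=20):
    """Run a Python solution against its input and compare stdout to the answer.

    `entry` is a dict with "input", "solution", and "answer" keys, as in the
    sample entry. Raises subprocess.TimeoutExpired if the run exceeds `timeout`.
    """
    with tempfile.TemporaryDirectory() as workdir:
        work = Path(workdir)
        # The solution reads input.txt from its working directory.
        (work / "input.txt").write_text(entry["input"])
        (work / "solution.py").write_text(entry["solution"])
        result = subprocess.run(
            [sys.executable, "solution.py"],
            cwd=work,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    # Compare on stripped text; the answer field is stored as a string.
    return result.stdout.strip() == entry["answer"].strip()
```

Solutions in other languages would need their own compile and run commands in place of the Python interpreter call.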
Output Verification
Some solutions have complex outputs that should form readable patterns. For example, the solution for day8_part2_2016 should produce the following patterns to verify correctness:
Additionally, language-specific formatting for big numbers should be considered. It is better to check them in a few different formats. For example:
instead of 3465154.
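One possible way to make such comparisons tolerant of number formatting is to fall back to a numeric comparison when the strings differ. This is only a sketch: answers_match is a hypothetical helper, and the formats it accepts (scientific notation or a trailing ".0") are assumptions about what a language might print, not a list taken from the dataset.

```python
from decimal import Decimal, InvalidOperation


def answers_match(expected, actual):
    """Compare two answers as strings first, then as exact decimal numbers."""
    expected, actual = expected.strip(), actual.strip()
    if expected == actual:
        return True
    try:
        # Decimal parses forms such as "3.465154e+06" and "3465154.0" exactly,
        # so large values are not subject to floating-point rounding.
        return Decimal(expected) == Decimal(actual)
    except InvalidOperation:
        # At least one side is not a number; the strings already differ.
        return False
```

Decimal is used here rather than float so that very large integer answers compare exactly.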
Usage
Filtering Solutions by Programming Language
Here's an example of how to filter solutions by programming language (e.g. Elixir) using the Hugging Face datasets library:
from datasets import load_dataset
import pandas as pd
# Load the dataset from Hugging Face
dataset = load_dataset("isavita/advent-of-code", split="train")
# Filter the dataset for solutions written in Elixir
elixir_solutions = dataset.filter(lambda example: example['solution_lang'].lower() == 'elixir')
# Convert the filtered dataset to a Pandas DataFrame
elixir_solutions_df = elixir_solutions.to_pandas()
# Display the filtered solutions
pd.set_option('display.max_colwidth', None)
print(elixir_solutions_df)
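The same fields can be used to summarise coverage, for instance by counting solutions per language, optionally for one year. The helper below is a sketch that works on any iterable of entry dictionaries (iterating over the loaded dataset yields such dictionaries); count_by_language is a hypothetical name, not part of the datasets library.

```python
from collections import Counter


def count_by_language(entries, year=None):
    """Count entries per solution language, optionally for a single year."""
    counts = Counter()
    for entry in entries:
        if year is None or entry["year"] == year:
            # Lower-case the language so "Go" and "go" are counted together.
            counts[entry["solution_lang"].lower()] += 1
    return counts


# Example usage with the dataset loaded above:
# print(count_by_language(dataset, year=2015).most_common(5))
```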
How the dataset can be used to assess the performance of LLMs
The Advent of Code Solutions Dataset provides a comprehensive resource for evaluating the performance of Large Language Models (LLMs) in various programming tasks. Here are some ways the dataset can be utilised:
By leveraging this dataset, researchers and developers can gain valuable insights into the strengths and weaknesses of LLMs in programming tasks, ultimately contributing to the development of more robust and versatile language models.
Future Expansion
The dataset currently includes data for the years 2015 to 2023 and contains all solutions in Go and Python, along with solutions in over 40 other languages. There are plans to expand it to include additional years and more solutions in different languages. As new years are added, the dataset structure will remain consistent.