LeetCode System Design: A Deep Dive into Scalable Code Execution Platforms
Image: System design overview for the LeetCode platform


Functional Requirements

  1. Users should be able to view coding problems
  2. Users should be able to submit solutions in various programming languages
  3. The system should compile and run the submitted code
  4. The system should test the code against predefined test cases
  5. Users should receive results (success/failure, runtime, memory usage)
  6. The platform should support multiple users simultaneously

Non-Functional Requirements

  1. Scalability: The system should handle millions of submissions daily
  2. Security: User-submitted code must run in isolated environments
  3. Performance: Results should be returned quickly (ideally within seconds)
  4. Availability: The platform should be accessible 99.99% of the time
  5. Consistency: Same code should produce the same results every time

Basic Data Design

  1. Problem Table: ProblemID (Primary Key), Title, Description, Difficulty, TestCases
  2. Submission Table: SubmissionID (Primary Key), UserID (Foreign Key), ProblemID (Foreign Key), SubmittedCode, Language, Status (Pending/Completed), Result (Success/Failure), Runtime, MemoryUsage, SubmissionTime


Image 1: The database design for a basic LeetCode-style system
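
A minimal sketch of this schema, using SQLite for illustration (a production system would more likely use PostgreSQL, as discussed in the Databases section below; the column types are assumptions based on the field names):

import sqlite3

# Minimal illustration of the two tables above; types are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Problem (
    ProblemID   INTEGER PRIMARY KEY,
    Title       TEXT NOT NULL,
    Description TEXT NOT NULL,
    Difficulty  TEXT NOT NULL,   -- e.g. 'Easy' / 'Medium' / 'Hard'
    TestCases   TEXT NOT NULL    -- serialized (e.g. JSON) test cases
);

CREATE TABLE Submission (
    SubmissionID   INTEGER PRIMARY KEY,
    UserID         INTEGER NOT NULL,  -- foreign key to a Users table
    ProblemID      INTEGER NOT NULL REFERENCES Problem(ProblemID),
    SubmittedCode  TEXT NOT NULL,
    Language       TEXT NOT NULL,
    Status         TEXT NOT NULL DEFAULT 'Pending',  -- Pending/Completed
    Result         TEXT,   -- Success/Failure, set once completed
    Runtime        REAL,   -- e.g. seconds
    MemoryUsage    REAL,   -- e.g. megabytes
    SubmissionTime TEXT DEFAULT CURRENT_TIMESTAMP
);
""")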

API Design

  1. Get Problem:

GET /api/problems/{problem_id}
Response:
{
  "problem_id": int,
  "title": string,
  "description": string,
  "difficulty": string
}

  2. Submit Solution:

POST /api/submit
Request:
{
  "problem_id": int,
  "code": string,
  "language": string
}
Response:
{
  "submission_id": string
}

  3. Check Submission Status:

GET /api/submissions/{submission_id}
Response:
{
  "status": string,
  "result": string,
  "runtime": float,
  "memory_usage": float
}

  4. List Problems:

GET /api/problems?page={page_number}&limit={limit}
Response:
{
  "problems": [
    { "problem_id": int, "title": string, "difficulty": string },
    ...
  ],
  "total_pages": int
}

System Architecture and Flow

Now, let's dive into how the system actually works and why it's designed this way.

  1. Client (Web Browser): Users interact with a web-based IDE (Integrated Development Environment), typically a lightweight code editor embedded in the webpage.
  2. API Gateway: All requests first go through an API Gateway.

An API Gateway is like a smart receptionist for your system. It directs incoming requests to the right places, handles security checks, and can even do some basic request processing.

  3. Application Servers: These servers handle the core logic of the application. They process requests, interact with databases, and manage the overall flow of operations.

Here's where we encounter our first major design decision.

Why can't we just run user code directly on these servers?

The problem lies in the limits of vertical scaling.

Vertical scaling means adding more power (CPU, RAM) to an existing server. It's like trying to make a car go faster by putting in a bigger engine.

While this can work up to a point, it has limitations:

  • There's a limit to how powerful a single machine can be
  • It's expensive
  • It doesn't provide isolation between different users' code

This is where the concept of horizontal scaling comes in.

Horizontal scaling means adding more machines to your system instead of making one machine more powerful. It's like adding more cars to a delivery fleet instead of trying to make one car do all the deliveries.

  1. Peak Load Scenario: Imagine a coding competition with 10,000 submissions at once. Each submission needs to run against 100 test cases. Each test case takes about 100 milliseconds (0.1 seconds) to run.
  2. Total Workload Calculation: For one submission: 100 test cases × 0.1 seconds = 10 seconds. For all 10,000 submissions: 10,000 × 10 seconds = 100,000 seconds.
  3. Time to Process on a Single Core: 100,000 seconds is about 27.8 hours (100,000 / 3600)
  4. Cores Needed for Quick Processing: If we want to process all this in just 1 minute (60 seconds): cores required = total processing time / target time = 100,000 seconds / 60 seconds ≈ 1,667 cores.
  5. Comparison with Available Hardware: The most powerful Amazon server instances offer up to 128 cores. To get 1,667 cores, we'd need about 13 of these top-tier instances (1,667 / 128).
  6. Why This Matters: It's not practical or cost-effective to use 13 of the most powerful servers for this task. Even if we could, managing the workload across these servers would be complex. This shows why we can't simply use bigger, more powerful computers (vertical scaling).
  7. The Solution: Instead of a few very powerful machines, LeetCode uses many smaller, containerized environments. This allows for better distribution of work, improved security, and easier scaling. It's more flexible: we can easily add or remove processing power as needed.

In essence, the sheer amount of processing power needed makes it impractical to run a system like LeetCode on a handful of traditional servers. This is why such platforms use cloud services, containerization, and distributed computing to handle the massive workload efficiently and cost-effectively.
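
Here is the same back-of-envelope calculation in a few lines of Python:

# Back-of-envelope capacity math from the peak-load scenario above.
submissions = 10_000            # simultaneous submissions
tests_per_submission = 100
seconds_per_test = 0.1

total_cpu_seconds = submissions * tests_per_submission * seconds_per_test
print(total_cpu_seconds)         # 100000.0 CPU-seconds of work
print(total_cpu_seconds / 3600)  # ~27.8 hours on a single core

target_seconds = 60              # we want results within a minute
cores_needed = total_cpu_seconds / target_seconds
print(cores_needed)              # ~1666.7, i.e. about 1,667 cores
print(cores_needed / 128)        # ~13 of the largest 128-core instances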

  4. Code Execution Environment

To solve the problems of vertical scaling and provide the necessary isolation, we use containerization. Let's break down the evolution of this approach:

a. Virtual Machines (VMs)

Virtual Machines are like having several separate computers running inside your physical computer.

Each VM has its own operating system and resources.

Pros:

  • Good isolation
  • Can run different operating systems

Cons:

  • Heavy resource usage
  • Slow to start up
  • Inefficient for running small, short-lived processes (like code submissions)

b. Docker Containers

Docker containers are like lightweight, standardized boxes for your code and all its dependencies.

They're much lighter than VMs because they share the host system's kernel.

Pros:

  • Lightweight and fast to start
  • Efficient resource usage
  • Consistent environment across different systems

Cons:

  • Slightly less isolation than VMs (though still very good)
  • Requires careful configuration for security
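
As a rough illustration, here is a minimal Python sketch of how a worker might launch a submission in a locked-down container via the Docker CLI. The image name, resource limits, and paths are assumptions for illustration, not LeetCode's actual configuration:

import subprocess

def run_in_container(code_path: str, timeout_s: int = 10) -> subprocess.CompletedProcess:
    # Run an untrusted Python file in a throwaway, locked-down container.
    # All limits below are illustrative assumptions.
    cmd = [
        "docker", "run",
        "--rm",                # delete the container when it exits
        "--network", "none",   # no network access for user code
        "--memory", "256m",    # cap memory usage
        "--cpus", "0.5",       # cap CPU usage
        "--pids-limit", "64",  # blunt fork bombs
        "--read-only",         # read-only root filesystem
        "-v", f"{code_path}:/sandbox/main.py:ro",
        "python:3.12-slim",
        "python", "/sandbox/main.py",
    ]
    # Raises subprocess.TimeoutExpired if the code runs too long.
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)

Flags like --network none and --memory are what make containers practical for untrusted code: the process is fenced off from the network and cannot exhaust the host's resources.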


c. Amazon ECS (Elastic Container Service)

Amazon ECS is like a smart manager for your Docker containers.

It handles deploying containers, scaling them based on demand, and ensuring they're healthy.

Pros:

  • Automated management of containers
  • Easy scaling
  • Integration with other AWS services

Cons:

  • Vendor lock-in to AWS
  • Can be complex to set up initially

ECS keeps track of container health through regular checks.

If a container becomes unresponsive or unhealthy, ECS can automatically replace it.

  5. Queue System (e.g., Amazon SQS)

A queue system is like a buffer or waiting line for tasks.

When a user submits code, instead of running it immediately, we place it in a queue. This queue system allows for asynchronous processing.

Asynchronous means "not occurring at the same time". In our system, it means that the user doesn't have to wait for their code to finish running before getting a response.

Here's how it works:

  1. User submits code
  2. System immediately returns a submission ID
  3. Code is placed in the queue
  4. Worker processes pick up code from the queue and execute it
  5. Results are stored in the database
  6. User can check the status using the submission ID

This approach has several benefits:

  • Users get an immediate response
  • The system can handle traffic spikes more easily
  • We can prioritize certain tasks if needed
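
A minimal sketch of the enqueue step, assuming Amazon SQS via boto3 (the queue URL, region, and message shape are illustrative assumptions):

import json
import uuid
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/submissions"  # placeholder

def enqueue_submission(problem_id: int, code: str, language: str) -> str:
    # Assign an ID and hand the job to the queue; execution happens later.
    submission_id = str(uuid.uuid4())
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({
            "submission_id": submission_id,
            "problem_id": problem_id,
            "code": code,
            "language": language,
        }),
    )
    return submission_id  # returned to the user immediately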

  6. Worker Processes: These are the processes that actually execute the code. They pick up submissions from the queue, set up the necessary environment (using Docker containers), run the code with the test cases, and record the results (see the worker sketch after this list).
  7. Databases: We use databases to store problems, submissions, and results. For a system like this, we might use a combination of a relational database (e.g., PostgreSQL) for structured data like user accounts and problems, and a NoSQL database (e.g., MongoDB) for more flexible data like submissions and results.
  8. Caching Layer (e.g., Redis): A cache is like a short-term memory for our system. It stores frequently accessed data to reduce database load and speed up responses (a cache-aside sketch also follows below).
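
Here is a hedged sketch of a worker loop against the SQS queue from the earlier example; run_in_container_and_grade and save_result are hypothetical placeholders standing in for the Docker execution and database write steps:

import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/submissions"  # placeholder

def run_in_container_and_grade(job: dict) -> dict:
    # Hypothetical placeholder: in practice this would run the code in a
    # locked-down container (see the Docker sketch above) and compare its
    # output against the problem's test cases.
    return {"result": "Success", "runtime": 0.0, "memory_usage": 0.0}

def save_result(submission_id: str, result: dict) -> None:
    # Hypothetical placeholder: in practice this would update the
    # Submission row in the database.
    print(submission_id, result)

def worker_loop() -> None:
    # Poll the queue, execute each submission, record the result.
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,  # long polling avoids busy-waiting
        )
        for msg in resp.get("Messages", []):
            job = json.loads(msg["Body"])
            result = run_in_container_and_grade(job)
            save_result(job["submission_id"], result)
            # Deleting the message acknowledges successful processing;
            # if the worker crashes first, SQS redelivers the job.
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg["ReceiptHandle"],
            )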
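
And a minimal cache-aside sketch for problem lookups, assuming Redis via the redis-py client (the key format, TTL, and database helper are illustrative assumptions):

import json
import redis

r = redis.Redis(host="localhost", port=6379)

def load_problem_from_db(problem_id: int) -> dict:
    # Hypothetical placeholder for a real database query
    # (see the schema sketch earlier in the article).
    return {"problem_id": problem_id, "title": "Two Sum", "difficulty": "Easy"}

def get_problem(problem_id: int) -> dict:
    # Cache-aside: try Redis first, fall back to the database on a miss.
    key = f"problem:{problem_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    problem = load_problem_from_db(problem_id)
    r.setex(key, 300, json.dumps(problem))  # cache for 5 minutes
    return problem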

Putting It All Together

When a user submits code, here's what happens:

  1. The submission goes through the API Gateway
  2. The Application Server receives it and generates a submission ID
  3. The submission is placed in the queue
  4. The user immediately receives the submission ID
  5. A worker picks up the submission from the queue
  6. The worker spins up a Docker container in ECS
  7. The code is executed in the container
  8. Results are stored in the database
  9. The user can check the status and results using their submission ID

This architecture allows LeetCode to:

  • Handle millions of submissions daily (scalability)
  • Run untrusted code safely (security)
  • Provide quick feedback to users (performance)
  • Handle traffic spikes gracefully (availability)
  • Ensure consistent results (consistency)

By leveraging technologies like Docker and AWS services, and using patterns like queuing and asynchronous processing, LeetCode creates a robust, scalable system capable of handling the demands of millions of users practicing coding problems every day.
