Generate and Insert Massive Data into SQLite Databases with Ease

In this article, we’ll dive into the process of generating and inserting large-scale data into an SQLite database. Whether you’re building a backend project, testing application performance, or simply honing your database skills, this guide will help you efficiently populate your database with realistic datasets. We’ll walk you through setting up your database and using a powerful tool to automate data insertion, ensuring that you can scale your project with ease. Let’s get started!

Step 1: Clone the Repository

Before proceeding, you need to clone my repository to your local machine. Follow these steps:

  1. Visit the following link to access the repository: [Your Repository URL].
  2. Ensure that you have Git installed on your system. If you haven't installed it yet, you can download it from the official site, git-scm.com.
  3. Open your terminal (or Git Bash) and navigate to the directory where you'd like to store the repository.
  4. Run the following command to clone the repository:

git clone [Your Repository URL]

  5. Once the repository is cloned, navigate to the project folder by running:

cd [Repository Folder Name]

With these steps, you will have successfully cloned the repository to your local environment and are ready to continue with the rest of the setup or instructions in the article!

Step 2: Copy the data Folder and data_generator.py to the Root of Your Project

Once you’ve cloned the repository, the next step is to copy one folder and one file into the root directory of your project.

  1. From the repository, locate the data folder and the data_generator.py file.
  2. Copy both the data folder and the data_generator.py file.
  3. Paste them directly into the root directory of your project. This should be the same directory where your project’s main files are located.

By doing this, you will ensure that your project has access to the necessary data and script for generating or processing it.
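If you prefer to script this step, a small Python sketch using shutil can do the copy. The source path here is hypothetical and should point at your local clone of the repository:

import shutil

# Hypothetical source path; point it at your local clone of the repository.
repo = "path/to/cloned-repo"
shutil.copytree(f"{repo}/data", "data", dirs_exist_ok=True)     # copy the data folder into the project root
shutil.copy(f"{repo}/data_generator.py", "data_generator.py")   # copy the generator script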

Step 3: Copy the Contents of models.py into Your Project's models.py

Next, you need to ensure your project is set up with the correct database models.

  1. Open the models.py file from the repository.
  2. Copy all the contents of this file.
  3. Paste the contents into the models.py file in the root directory of your project. If your project doesn't already have a models.py file, create a new one and paste the contents there.

Alternatively, if you choose not to copy the contents directly, you can create your own SQLite database based on the models provided in my models.py. Ensure that the models match the structure defined in the original file for compatibility.

By completing this step, you’ll ensure that your project has the proper database schema set up.
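For reference, here is a minimal sketch of a schema that matches the INSERT statements used later in this article. It is inferred from those statements rather than copied from the repository, so the column types and constraints are assumptions; the models.py in the repository remains the authoritative definition:

import sqlite3

# Hypothetical schema inferred from the INSERT statements shown later;
# the repository's models.py is the authoritative source.
cnn = sqlite3.connect("db.sqlite3")
cnn.executescript("""
CREATE TABLE IF NOT EXISTS actors    (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT, born_year TEXT);
CREATE TABLE IF NOT EXISTS directors (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT, born_year TEXT);
CREATE TABLE IF NOT EXISTS countries (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE IF NOT EXISTS genres    (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE IF NOT EXISTS movies    (id INTEGER PRIMARY KEY, title TEXT, description TEXT, year TEXT, rating TEXT);
CREATE TABLE IF NOT EXISTS movies_actors    (movie_id INTEGER, actor_id INTEGER);
CREATE TABLE IF NOT EXISTS movies_directors (movie_id INTEGER, director_id INTEGER);
CREATE TABLE IF NOT EXISTS movies_genres    (movie_id INTEGER, genre_id INTEGER);
CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, amount REAL, country_id INTEGER, movie_id INTEGER);
""")
cnn.commit()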

Explanation of the Code

The following Python script is designed to generate a large volume of fake data and insert it into a database, simulating a movie-related dataset. This can be useful for testing purposes or working with big data in a database environment.

1. Importing Required Libraries

The script imports several libraries to handle tasks like file reading, database connection, and random data generation:

import itertools
import datetime
import json
import numpy as np 
import sqlite3
import random        

  • itertools: For generating combinations of names, words, movies, etc.
  • datetime: For handling date-related functions, like calculating years.
  • json: To parse JSON data (e.g., genres).
  • numpy: For numerical operations, particularly to create a range of ratings.
  • sqlite3: To interact with an SQLite database.
  • random: For generating random data for each movie, actor, and director.

2. Loading Data Files

The script reads various external text files to generate random data:

with open("data/names/first-names.txt", "r") as f1:
    names_file = f1.read()
names = names_file.split()        

Here, the script reads the file first-names.txt, which contains a list of first names, and stores them in a list called names. It does the same for the family names.

with open("data/names/first-names.txt", "r") as f2:
    families_file = f2.read()
families = families_file.split()        

Similarly, it loads other files like countries.txt, genres.json, movie_list.txt, and wiki-100k.txt for various data needed in the simulation. These files provide information on countries, genres, movies, and random words used for descriptions.
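The loading code for those files is not shown above, but it likely follows the same pattern. Here is a hedged sketch, assuming countries.txt and movie_list.txt are newline-separated, genres.json is a JSON array of titles, and all files live under the data folder (the exact paths and formats are defined by the repository):

import json

with open("data/countries.txt", "r") as f3:    # assumed path within the data folder
    countries = f3.read().splitlines()

with open("data/genres.json", "r") as f4:      # assumed to be a JSON array of genre titles
    genres = json.load(f4)

with open("data/movie_list.txt", "r") as f5:   # assumed newline-separated movie titles
    movies_choices = f5.read().splitlines()

with open("data/wiki-100k.txt", "r") as f6:    # random words used for descriptions
    words = f6.read().split()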

3. Generating Combinations and Selections

The script generates combinations of names and families, which will later be used to assign actors and directors randomly.

names_families_combo = list(itertools.product(names, families))        

It also prepares a pool of candidate movie titles: every two-title combination from the movie list is joined into a single new title.

movies_combo = list(itertools.combinations(movies_choices, 2))
movies = [movie[0] + " " + movie[1] for movie in movies_combo]        

Descriptions for movies are generated by combining random words from a list:

descriptions_1 = [word[0] + " " + word[1] + " " + word[2] + " " + word[3] for word in words_1]        

These will be used to generate random movie descriptions.
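The years and ratings used later are drawn from precomputed pools. Here is a minimal sketch of how those pools might be built with datetime and numpy; the exact bounds are assumptions, not the repository's values:

import datetime
import numpy as np

current_year = datetime.datetime.now().year
years = list(range(1950, current_year + 1))                  # assumed year range
ratings = [round(r, 1) for r in np.arange(0.0, 10.1, 0.1)]   # ratings 0.0 to 10.0 in 0.1 steps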

4. Database Connection

The script connects to an SQLite database (db.sqlite3) and sets up a cursor to execute SQL commands:

cnn = sqlite3.connect("db.sqlite3")
cur = cnn.cursor()        

5. Inserting Data into the Database

The core of this script is the insertion of random data into various tables in the database. Here’s how it handles the data:

  • Actors: 10,000 random actors are generated by combining first names, family names, and birth years. Each actor is inserted into the actors table (a full loop sketch follows this list):

a = f"INSERT INTO actors (`firstname`, `lastname`, `born_year`) VALUES ('{actor_name.replace("'" , "")}', '{actor_family.replace("'" , "")}', '{actor_born}')"        

  • Directors: Similar to actors, 10,000 directors are randomly generated and inserted into the directors table.

b = f"INSERT INTO directors (`firstname`, `lastname`, `born_year`) VALUES ('{director_name.replace("'" , "")}', '{director_family.replace("'" , "")}', '{director_born}')"        

  • Countries: Every country in the countries.txt file is inserted into the countries table.

a = f"INSERT INTO countries (`name`) VALUES ('{country.replace("'" , "")}')"        

  • Genres: All genres from the genres.json file are inserted into the genres table.

a = f"INSERT INTO genres (`title`) VALUES ('{genre.replace("'" , "")}')"        

  • Movies: 1,000 movie pairs (combinations of movie titles) are created and inserted into the movies table. Each movie is associated with a random description, year, and rating.

a = f"INSERT INTO movies (`title`, `description`, `year`, `rating`) VALUES ('{movie.replace("'" , "")}', '{description.replace("'" , "")}', '{year}', '{rating}')"        

6. Associating Data

After inserting movies, the script creates associations between movies and actors, directors, genres, and countries.

  • Movies and Actors: Each movie is randomly associated with 5 to 15 actors (a loop sketch follows this list).

a = f"INSERT INTO movies_actors (`movie_id`, `actor_id`) VALUES ('{i}', '{actr}')"        

  • Movies and Directors: Each movie is randomly associated with 2 to 5 directors.

a = f"INSERT INTO movies_directors (`movie_id`, `director_id`) VALUES ('{i}', '{drct}')"        

  • Movies and Genres: Each movie is randomly assigned 3 to 5 genres.

a = f"INSERT INTO movies_genres (`movie_id`, `genre_id`) VALUES ('{i}', '{gnr}')"        

  • Movies and Countries: Each movie is randomly associated with 5 to 10 countries, and each pairing is stored in the sales table with a random sales figure.

a = f"INSERT INTO sales (`amount`, `country_id`, `movie_id`) VALUES ({amount}, '{cntr}', '{i}')"        

7. Committing the Data to the Database

Each insert statement is executed using the cur.execute(a) command, and changes are committed to the database using cnn.commit(). This ensures that the data is saved in the database after every insertion.
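Committing after every insertion is safe but slow for this many rows. A standard sqlite3 speed-up, sketched below for the actors table, is to batch the rows with executemany and commit once per batch; this is a general pattern, not necessarily how the repository script is written:

# Build all 10,000 rows in memory, insert them in one call, commit once.
rows = [
    (random.choice(names), random.choice(families), random.randint(1930, 2005))
    for _ in range(10000)
]
cur.executemany("INSERT INTO actors (`firstname`, `lastname`, `born_year`) VALUES (?, ?, ?)", rows)
cnn.commit()  # a single commit per batch is far faster than one per row

The ? placeholders also make the apostrophe-stripping unnecessary, since sqlite3 escapes the values itself.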

Conclusion

This script is an efficient way to generate a large-scale movie database populated with random but realistic data, simulating a big data environment. By inserting thousands of records into multiple tables (e.g., actors, directors, movies, genres), it provides a comprehensive dataset to work with. This is especially useful for testing database performance, machine learning applications, and big data analysis.
