Generate and Insert Massive Data into SQLite Databases with Ease
In this article, we’ll dive into the process of generating and inserting large-scale data into an SQLite database. Whether you’re building a backend project, testing application performance, or simply honing your database skills, this guide will help you efficiently populate your database with realistic datasets. We’ll walk you through setting up your database and using a powerful tool to automate data insertion, ensuring that you can scale your project with ease. Let’s get started!
Step 1: Clone the Repository
Before proceeding, you need to clone my repository to your local machine: open a terminal, move to the directory where you want the project to live, run git clone followed by the repository's URL, and then cd into the cloned folder.
With that done, you will have the repository in your local environment and are ready to continue with the rest of the setup in this article!
Step 2: Copy the data Folder and data_generator.py to the Root of Your Project
Once you’ve cloned the repository, the next step is to copy two items into the root directory of your project: the data folder and the data_generator.py script.
By doing this, you will ensure that your project has access to the necessary data and script for generating or processing it.
Step 3: Copy the Contents of models.py into Your Project's models.py
Next, you need to ensure your project is set up with the correct database models.
Alternatively, if you choose not to copy the contents directly, you can create your own SQLite database based on the models provided in my models.py. Ensure that the models match the structure defined in the original file for compatibility.
By completing this step, you’ll ensure that your project has the proper database schema set up.
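If you prefer to build the schema yourself, the table layout can be read off the INSERT statements shown later in this article. Here is a minimal sketch of the equivalent SQLite schema; the column types are assumptions inferred from the inserted values, so treat the models in models.py as the source of truth:

import sqlite3

cnn = sqlite3.connect("db.sqlite3")
cnn.executescript("""
CREATE TABLE actors (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT, born_year TEXT);
CREATE TABLE directors (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT, born_year TEXT);
CREATE TABLE countries (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE genres (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT, description TEXT, year TEXT, rating TEXT);
CREATE TABLE movies_actors (movie_id INTEGER, actor_id INTEGER);
CREATE TABLE movies_directors (movie_id INTEGER, director_id INTEGER);
CREATE TABLE movies_genres (movie_id INTEGER, genre_id INTEGER);
CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL, country_id INTEGER, movie_id INTEGER);
""")
cnn.commit()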
Explanation of the Code
The following Python script is designed to generate a large volume of fake data and insert it into a database, simulating a movie-related dataset. This can be useful for testing purposes or working with big data in a database environment.
1. Importing Required Libraries
The script imports several libraries to handle tasks like file reading, database connection, and random data generation:
import itertools
import datetime
import json
import numpy as np
import sqlite3
import random
2. Loading Data Files
The script reads various external text files to generate random data:
with open("data/names/first-names.txt", "r") as f1:
names_file = f1.read()
names = names_file.split()
Here, the script reads the file first-names.txt, which contains a list of first names, and stores them in a list called names. It does the same for the family names, reading them from their own file (the exact filename below is an assumption; check the data/names folder in the cloned repository):
with open("data/names/last-names.txt", "r") as f2:  # filename assumed
    families_file = f2.read()
families = families_file.split()
Similarly, it loads other files like countries.txt, genres.json, movie_list.txt, and wiki-100k.txt for various data needed in the simulation. These files provide information on countries, genres, movies, and random words used for descriptions.
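Those loads follow the same pattern; here is a sketch of what they might look like (the paths under data/ are assumptions, so match them to the cloned repository's layout):

with open("data/countries.txt", "r") as f:  # path assumed
    countries = f.read().splitlines()

with open("data/genres.json", "r") as f:  # path assumed
    genres_choices = json.load(f)

with open("data/movie_list.txt", "r") as f:  # path assumed
    movies_choices = f.read().splitlines()

with open("data/wiki-100k.txt", "r") as f:  # path assumed
    words = f.read().split()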
3. Generating Combinations and Selections
The script generates combinations of names and families, which will later be used to assign actors and directors randomly.
names_families_combo = list(itertools.product(names, families))
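For example, itertools.product pairs every first name with every family name:

names = ["Ada", "Alan"]
families = ["Lovelace", "Turing"]
list(itertools.product(names, families))
# [('Ada', 'Lovelace'), ('Ada', 'Turing'), ('Alan', 'Lovelace'), ('Alan', 'Turing')]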
It also prepares a list of movie combinations by pairing up movie titles randomly.
movies_combo = list(itertools.combinations(movies_choices, 2))
movies = [movie[0] + " " + movie[1] for movie in movies_combo]
Descriptions for movies are generated by combining random words from a list:
descriptions_1 = [" ".join(word) for word in words_1]
These will be used to generate random movie descriptions.
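The script's exact sampling isn't shown above, but words_1 can be built as groups of four words drawn from the loaded word list; a sketch (the group count of 100000 is an illustrative assumption):

words_1 = [random.sample(words, 4) for _ in range(100000)]  # hypothetical count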
4. Database Connection
The script connects to an SQLite database (db.sqlite3) and sets up a cursor to execute SQL commands:
cnn = sqlite3.connect("db.sqlite3")
cur = cnn.cursor()
5. Inserting Data into the Database
The core of this script is the insertion of random data into various tables in the database. Before each statement is built, single quotes are stripped from the text values so they don't terminate the SQL string literals early; a small q() helper (introduced here for readability, since nesting the same quote type inside an f-string expression only became legal in Python 3.12) does the stripping:
def q(value):
    # Remove single quotes so they can't break the hand-built SQL strings.
    return value.replace("'", "")

a = f"INSERT INTO actors (`firstname`, `lastname`, `born_year`) VALUES ('{q(actor_name)}', '{q(actor_family)}', '{actor_born}')"
b = f"INSERT INTO directors (`firstname`, `lastname`, `born_year`) VALUES ('{q(director_name)}', '{q(director_family)}', '{director_born}')"
a = f"INSERT INTO countries (`name`) VALUES ('{q(country)}')"
a = f"INSERT INTO genres (`title`) VALUES ('{q(genre)}')"
a = f"INSERT INTO movies (`title`, `description`, `year`, `rating`) VALUES ('{q(movie)}', '{q(description)}', '{year}', '{rating}')"
6. Associating Data
After inserting movies, the script creates associations between movies and actors, directors, genres, and countries.
a = f"INSERT INTO movies_actors (`movie_id`, `actor_id`) VALUES ('{i}', '{actr}')"
a = f"INSERT INTO movies_directors (`movie_id`, `director_id`) VALUES ('{i}', '{drct}')"
a = f"INSERT INTO movies_genres (`movie_id`, `genre_id`) VALUES ('{i}', '{gnr}')"
a = f"INSERT INTO sales (`amount`, `country_id`, `movie_id`) VALUES ({amount}, '{cntr}', '{i}')"
7. Committing the Data to the Database
Each insert statement is executed with cur.execute(a), and changes are committed to the database with cnn.commit(), so the data is durably saved after every insertion. Keep in mind that committing per row is the safest but slowest approach, since each commit forces a write to disk.
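If per-row commits become a bottleneck at this scale, the same data can be inserted in batches with executemany and a single commit at the end. A minimal sketch, assuming the actors table from above (the born_year range is an illustrative assumption):

rows = [(first, last, random.randint(1930, 2005)) for first, last in names_families_combo]
cur.executemany("INSERT INTO actors (firstname, lastname, born_year) VALUES (?, ?, ?)", rows)
cnn.commit()  # one commit for the whole batch is far faster than committing per row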
Conclusion
This script is an efficient way to generate a large-scale movie database populated with random but realistic data, simulating a big data environment. By inserting thousands of records into multiple tables (e.g., actors, directors, movies, genres), it provides a comprehensive dataset to work with. This is especially useful for testing database performance, machine learning applications, and big data analysis.