Generate and Insert Massive Data into SQLite Databases with Ease

In this article, we’ll dive into the process of generating and inserting large-scale data into an SQLite database. Whether you’re building a backend project, testing application performance, or simply honing your database skills, this guide will help you efficiently populate your database with realistic datasets. We’ll walk you through setting up your database and using a powerful tool to automate data insertion, ensuring that you can scale your project with ease. Let’s get started!

Step 1: Clone the Repository

Before proceeding, you need to clone my repository to your local machine. Follow these steps:

  1. Visit the following link to access the repository: [Your Repository URL].
  2. Ensure that you have Git installed on your system. If you haven't installed it yet, you can download it from the official site, git-scm.com.
  3. Open your terminal (or Git Bash) and navigate to the directory where you'd like to store the repository.
  4. Run the following command to clone the repository:

git clone [Your Repository URL]

  5. Once the repository is cloned, navigate to the project folder by running:

cd [Repository Folder Name]

With these steps, you will have successfully cloned the repository to your local environment and are ready to continue with the rest of the setup or instructions in the article!

Step 2: Copy the data Folder and data_generator.py to the Root of Your Project

Once you’ve cloned the repository, the next step is to copy one folder and one file into the root directory of your project.

  1. From the repository, locate the data folder and the data_generator.py file.
  2. Copy both the data folder and the data_generator.py file.
  3. Paste them directly into the root directory of your project. This should be the same directory where your project’s main files are located.

By doing this, you will ensure that your project has access to the necessary data and script for generating or processing it.
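If you prefer to script this step, a small Python sketch using shutil can do the copy. The source path here is hypothetical and should point at your local clone of the repository:

import shutil

# Hypothetical source path; point it at your local clone of the repository.
repo = "path/to/cloned-repo"
shutil.copytree(f"{repo}/data", "data", dirs_exist_ok=True)     # copy the data folder into the project root
shutil.copy(f"{repo}/data_generator.py", "data_generator.py")   # copy the generator script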

Step 3: Copy the Contents of models.py into Your Project's models.py

Next, you need to ensure your project is set up with the correct database models.

  1. Open the models.py file from the repository.
  2. Copy all the contents of this file.
  3. Paste the contents into the models.py file in the root directory of your project. If your project doesn't already have a models.py file, create a new one and paste the contents there.

Alternatively, if you choose not to copy the contents directly, you can create your own SQLite database based on the models provided in my models.py. Ensure that the models match the structure defined in the original file for compatibility.

By completing this step, you’ll ensure that your project has the proper database schema set up.
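For reference, here is a minimal sketch of a schema that matches the INSERT statements used later in this article. It is inferred from those statements rather than copied from the repository, so the column types and constraints are assumptions; the models.py in the repository remains the authoritative definition:

import sqlite3

# Hypothetical schema inferred from the INSERT statements shown later;
# the repository's models.py is the authoritative source.
cnn = sqlite3.connect("db.sqlite3")
cnn.executescript("""
CREATE TABLE IF NOT EXISTS actors    (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT, born_year TEXT);
CREATE TABLE IF NOT EXISTS directors (id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT, born_year TEXT);
CREATE TABLE IF NOT EXISTS countries (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE IF NOT EXISTS genres    (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE IF NOT EXISTS movies    (id INTEGER PRIMARY KEY, title TEXT, description TEXT, year TEXT, rating TEXT);
CREATE TABLE IF NOT EXISTS movies_actors    (movie_id INTEGER, actor_id INTEGER);
CREATE TABLE IF NOT EXISTS movies_directors (movie_id INTEGER, director_id INTEGER);
CREATE TABLE IF NOT EXISTS movies_genres    (movie_id INTEGER, genre_id INTEGER);
CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, amount REAL, country_id INTEGER, movie_id INTEGER);
""")
cnn.commit()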

Explanation of the Code

The following Python script is designed to generate a large volume of fake data and insert it into a database, simulating a movie-related dataset. This can be useful for testing purposes or working with big data in a database environment.

1. Importing Required Libraries

The script imports several libraries to handle tasks like file reading, database connection, and random data generation:

import itertools
import datetime
import json
import numpy as np 
import sqlite3
import random        

  • itertools: For generating combinations of names, words, movies, etc.
  • datetime: For handling date-related functions, like calculating years.
  • json: To parse JSON data (e.g., genres).
  • numpy: For numerical operations, particularly to create a range of ratings.
  • sqlite3: To interact with an SQLite database.
  • random: For generating random data for each movie, actor, and director.

2. Loading Data Files

The script reads various external text files to generate random data:

with open("data/names/first-names.txt", "r") as f1:
    names_file = f1.read()
names = names_file.split()        

Here, the script reads the file first-names.txt, which contains a list of first names, and stores them in a list called names. It does the same for the family names.

with open("data/names/first-names.txt", "r") as f2:
    families_file = f2.read()
families = families_file.split()        

Similarly, it loads other files like countries.txt, genres.json, movie_list.txt, and wiki-100k.txt for various data needed in the simulation. These files provide information on countries, genres, movies, and random words used for descriptions.
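The loading code for those files is not shown above, but it likely follows the same pattern. Here is a hedged sketch, assuming countries.txt and movie_list.txt are newline-separated, genres.json is a JSON array of titles, and all files live under the data folder (the exact paths and formats are defined by the repository):

import json

with open("data/countries.txt", "r") as f3:    # assumed path within the data folder
    countries = f3.read().splitlines()

with open("data/genres.json", "r") as f4:      # assumed to be a JSON array of genre titles
    genres = json.load(f4)

with open("data/movie_list.txt", "r") as f5:   # assumed newline-separated movie titles
    movies_choices = f5.read().splitlines()

with open("data/wiki-100k.txt", "r") as f6:    # random words used for descriptions
    words = f6.read().split()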

3. Generating Combinations and Selections

The script generates combinations of names and families, which will later be used to assign actors and directors randomly.

names_families_combo = list(itertools.product(names, families))        

It also prepares a pool of candidate movie titles: every two-title combination from the movie list is joined into a single new title.

movies_combo = list(itertools.combinations(movies_choices, 2))
movies = [movie[0] + " " + movie[1] for movie in movies_combo]        

Descriptions for movies are generated by combining random words from a list:

descriptions_1 = [word[0] + " " + word[1] + " " + word[2] + " " + word[3] for word in words_1]        

These will be used to generate random movie descriptions.
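The years and ratings used later are drawn from precomputed pools. Here is a minimal sketch of how those pools might be built with datetime and numpy; the exact bounds are assumptions, not the repository's values:

import datetime
import numpy as np

current_year = datetime.datetime.now().year
years = list(range(1950, current_year + 1))                  # assumed year range
ratings = [round(r, 1) for r in np.arange(0.0, 10.1, 0.1)]   # ratings 0.0 to 10.0 in 0.1 steps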

4. Database Connection

The script connects to an SQLite database (db.sqlite3) and sets up a cursor to execute SQL commands:

cnn = sqlite3.connect("db.sqlite3")
cur = cnn.cursor()        

5. Inserting Data into the Database

The core of this script is the insertion of random data into various tables in the database. Here’s how it handles the data:

  • Actors: 10,000 random actors are generated by combining first names, family names, and birth years. Each actor is inserted into the actors table (a full loop sketch follows this list):

a = f"INSERT INTO actors (`firstname`, `lastname`, `born_year`) VALUES ('{actor_name.replace("'" , "")}', '{actor_family.replace("'" , "")}', '{actor_born}')"        

  • Directors: Similar to actors, 10,000 directors are randomly generated and inserted into the directors table.

b = f"INSERT INTO directors (`firstname`, `lastname`, `born_year`) VALUES ('{director_name.replace("'" , "")}', '{director_family.replace("'" , "")}', '{director_born}')"        

  • Countries: Every country in the countries.txt file is inserted into the countries table.

a = f"INSERT INTO countries (`name`) VALUES ('{country.replace("'" , "")}')"        

  • Genres: All genres from the genres.json file are inserted into the genres table.

a = f"INSERT INTO genres (`title`) VALUES ('{genre.replace("'" , "")}')"        

  • Movies: 1,000 movie pairs (combinations of movie titles) are created and inserted into the movies table. Each movie is associated with a random description, year, and rating.

a = f"INSERT INTO movies (`title`, `description`, `year`, `rating`) VALUES ('{movie.replace("'" , "")}', '{description.replace("'" , "")}', '{year}', '{rating}')"        

6. Associating Data

After inserting movies, the script creates associations between movies and actors, directors, genres, and countries.

  • Movies and Actors: Each movie is randomly associated with 5 to 15 actors (a loop sketch follows this list).

a = f"INSERT INTO movies_actors (`movie_id`, `actor_id`) VALUES ('{i}', '{actr}')"        

  • Movies and Directors: Each movie is randomly associated with 2 to 5 directors.

a = f"INSERT INTO movies_directors (`movie_id`, `director_id`) VALUES ('{i}', '{drct}')"        

  • Movies and Genres: Each movie is randomly assigned 3 to 5 genres.

a = f"INSERT INTO movies_genres (`movie_id`, `genre_id`) VALUES ('{i}', '{gnr}')"        

  • Movies and Countries: Each movie is randomly associated with 5 to 10 countries, and each pairing is stored in the sales table with a random sales figure.

a = f"INSERT INTO sales (`amount`, `country_id`, `movie_id`) VALUES ({amount}, '{cntr}', '{i}')"        

7. Committing the Data to the Database

Each insert statement is executed using the cur.execute(a) command, and changes are committed to the database using cnn.commit(). This ensures that the data is saved in the database after every insertion.
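Committing after every insertion is safe but slow for this many rows. A standard sqlite3 speed-up, sketched below for the actors table, is to batch the rows with executemany and commit once per batch; this is a general pattern, not necessarily how the repository script is written:

# Build all 10,000 rows in memory, insert them in one call, commit once.
rows = [
    (random.choice(names), random.choice(families), random.randint(1930, 2005))
    for _ in range(10000)
]
cur.executemany("INSERT INTO actors (`firstname`, `lastname`, `born_year`) VALUES (?, ?, ?)", rows)
cnn.commit()  # a single commit per batch is far faster than one per row

The ? placeholders also make the apostrophe-stripping unnecessary, since sqlite3 escapes the values itself.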

Conclusion

This script is an efficient way to generate a large-scale movie database populated with random but realistic data, simulating a big data environment. By inserting thousands of records into multiple tables (e.g., actors, directors, movies, genres), it provides a comprehensive dataset to work with. This is especially useful for testing database performance, machine learning applications, and big data analysis.
