Issue 4 - The Spaghetti Files
You didn't expect it, but here it is! The fourth issue of your favourite code quality newsletter - Spaghetti Code - has landed in your inbox.
A brief introduction to UNIX filesystems for software engineers
If you learned software development in the past few years, chances are you are "cloud-native". You've learned how to develop for the cloud, on the cloud. Your development environment consists of online, collaborative Jupyter notebooks that live on the web. If that's the case, there are likely a few fundamental ideas you have missed out on, and that may be hampering your growth.
One of those is the idea of filesystems.
Understanding how modern filesystems work - and where they came from - will help you get a better grasp of how your code interacts with the underlying operating system (OS). It will also help you diagnose issues with more precision, and avoid any mishaps when deploying code within various environments.
Before we dive in - what do we mean by "UNIX-like"?
"UNIX-like systems" are operating systems that resemble the original UNIX system in design and functionality. Examples include Linux and BSD. These systems have similar features, commands, and structures, making it easier for developers to work across different UNIX-like systems. But here I am using the term "UNIX-like" more broadly, to also include operating systems you are familiar with.
macOS is related to UNIX-like systems because it's built on a foundation called "Darwin", which is based on BSD. So, macOS has many similarities with UNIX-like systems and shares a common command-line interface.
Windows is somewhat different from UNIX-like systems because it has its own architecture and conventions. However, many concepts are similar and influenced by UNIX. In addition, Windows now offers the Windows Subsystem for Linux (WSL) which allows users to run a Linux environment. This way, Windows users can enjoy the benefits of both Windows and UNIX-like environments.
In UNIX-like systems, the filesystem is organized into a tree-like structure (a "hierarchy"), with the root ("/") being the starting point of your journey. From the root, you'll find various directories that serve specific purposes. Here are some common ones and what they're used for:
- /bin: Contains essential command-line binaries (programs) that every user can access.
- /etc: Contains configuration files for the system and various software packages.
- /home: Contains a directory for every user on the system with their personal files.
- /lib: Contains shared libraries used by system binaries.
- /tmp: A temporary storage space for files that may be cleared upon reboot.
- /usr: Contains non-essential command-line binaries, documentation, and header files needed for software development.
- /var: Contains variable data like log files, databases, and other files that change over time.
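If you'd like to see this hierarchy from Python rather than from a terminal, here is a minimal sketch. The helper name `list_subdirectories` is made up for this example, and the exact set of directories you see will differ from system to system.

```python
from pathlib import Path


def list_subdirectories(root="/"):
    """Return the sorted names of the directories directly under `root`."""
    return sorted(entry.name for entry in Path(root).iterdir() if entry.is_dir())


if __name__ == "__main__":
    # On a typical Linux machine the result includes names such as
    # bin, etc, home, tmp, usr and var.
    print(list_subdirectories("/"))
```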
A few more words about "/home". A "home directory" is a special directory designated for each user on a computer. It's a personal space where you can store your files, documents, and configurations. On Linux and most other UNIX-like systems, the home directory is commonly located in the "/home" folder and named with your username. For example, if your username is "mojo", your home directory is usually "/home/mojo". On macOS, home directories live in the "/Users" folder instead, so it would be "/Users/mojo".
The "~" shortcut, also known as the "tilde", is a convenient way to refer to your home directory in UNIX-like systems and macOS when using the command line. Instead of typing out the full path to your home directory, you can simply use "~" as a shorthand. For example, you can navigate to your home directory by running the command "cd ~".
On Windows, the concept of a home directory exists as well, typically located in "C:\Users\mojo". While the tilde doesn't work as a shortcut by default in the Windows command prompt, it does work in the Windows Subsystem for Linux (WSL) and other UNIX-like environments on Windows.
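In Python you rarely need to hard-code any of these locations. The sketch below uses `Path.home()` and `expanduser()` from the standard library; the file name `notes.txt` is only an illustration. Note that `open()` does not expand "~" by itself, so the expansion has to be done explicitly.

```python
import os
from pathlib import Path

# The current user's home directory, whatever the operating system:
# e.g. /home/mojo on Linux, /Users/mojo on macOS, C:\Users\mojo on Windows.
home = Path.home()

# "~" is a shell convention; in Python it must be expanded explicitly.
expanded = Path("~/notes.txt").expanduser()

print(home)
print(expanded)
print(os.path.expanduser("~"))  # the os.path equivalent of Path.home()
```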
An important aspect of filesystems is that they control who has access to what. In UNIX-like systems, permissions on files and directories are assigned to three classes of user: the owner, the group, and others. Each class can be granted read, write, or execute permission on the file. To list a file's permissions, you can use the command 'ls -l'. Here's a sample output:
-rwxrw-r-- 1 user group 1024 Jan 1 00:00 myfile.txt
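The first set of characters indicates the type and the permissions. The very first character is the type ('-' for a regular file, 'd' for a directory), and the nine characters that follow are the permissions for owner, group, and others, three characters each. In the example above, 'rwx' means the owner has read, write, and execute permissions, 'rw-' means the group has read and write permissions, and 'r--' means others only have read permission.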
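The same permission string can be produced from Python, which is handy when a script needs to check permissions before touching a file. This is a small sketch using `os.stat()` and `stat.filemode()`; the helper name `permission_string` and the temporary file are assumptions made for the example.

```python
import os
import stat
import tempfile


def permission_string(path):
    """Return the 'ls -l' style type-and-permission string for `path`."""
    return stat.filemode(os.stat(path).st_mode)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myfile.txt")
    with open(path, "w") as f:
        f.write("hello")

    # 0o764 = rwx for the owner, rw- for the group, r-- for others.
    os.chmod(path, 0o764)

    # On Linux or macOS this prints -rwxrw-r--
    # (Windows only honours the read-only bit, so the result differs there).
    print(permission_string(path))
```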
As a cloud-native developer, you might not deal with these concepts daily (although there is quite a bit of this in DevOps!), but understanding how they work can help with system interactions, configuration files, and application deployments. You'll be better equipped to work with containerization technologies like Docker, which rely on OS-level knowledge to efficiently package applications.
Lastly, you should also have working familiarity with some basic UNIX commands for working with files and directories, such as `pwd`, `ls`, `cd`, `mkdir`, `cp`, `mv`, `rm`, `cat` and `chmod`.
By learning and practising these commands, you'll become more comfortable navigating and managing files and directories in a UNIX-like file system, which will make your coding experience smoother and more enjoyable.
Remember that building a strong foundation in these primary concepts will make you a more versatile software engineer. So, even if you're working on cloud-native applications, you'll be well-prepared to handle any OS surprises that come your way.
Working with files in Python
Handling files is an essential programming skill that allows you to read, write, and interact with the most common way of storing data - whether that's on the local filesystem, on a shared network location, or in the cloud.
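The first thing you need, in order to access the content of files in Python, is the built-in function `open()`. Its two most important arguments are the file's name (possibly including its path) and the mode in which you want to open the file. The common modes are `'r'` (read, the default), `'w'` (write, replacing any existing content), `'a'` (append to the end of the file) and `'b'` (binary, combined with one of the others, as in `'rb'`).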
Here's the most basic example of opening a file for reading: `file = open('some_file.txt')`
Here, we only gave a simple filename, without specifying a full path.
In the context of file handling, paths are used to specify the location of a file or directory on the filesystem. There are two types of paths: relative and absolute.
Relative paths are defined relative to the current working directory, which is the directory your script was started from (not necessarily the directory the script file lives in). When you're dealing with relative paths, you specify the file's location in relation to this current directory. For example, if you run your script from a folder called "project" and you have a file named "data.txt" in a subfolder called "data", the relative path would be "data/data.txt".
Absolute paths, on the other hand, include the file's location starting from the root directory of the filesystem, providing the full path. For example, on a UNIX-like system, an absolute path might look like "/home/user/project/data/data.txt".
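To make the distinction concrete, here is a short sketch with `pathlib`. The paths are the ones from the examples above; `PurePosixPath` is used for the absolute one so that the check gives the same answer on any operating system.

```python
from pathlib import Path, PurePosixPath

relative = Path("data/data.txt")
absolute = PurePosixPath("/home/user/project/data/data.txt")

print(relative.is_absolute())  # False
print(absolute.is_absolute())  # True

# A relative path only becomes a real location once it is combined with
# the current working directory, which is what resolve() does here.
print(Path.cwd())
print(relative.resolve())
```

Running this from two different folders prints two different resolved paths, which is exactly why relative paths can surprise you.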
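When you pass a path to `open()`, Python hands it to the operating system, which uses an absolute path exactly as given and resolves a relative path against the current working directory.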
Understanding the difference between relative and absolute paths is essential in managing and accessing files properly in your Python script. It ensures that the files are located and accessed from the correct paths, especially when your script's directory structure changes or your code is run on different systems. So, always be mindful of whether you're working with relative or absolute paths when using `open()` and other file handling functions.
My rule of thumb is - prefer to work with absolute paths, where possible, in order to avoid confusion and to make your software easier to refactor.
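One way to follow this rule without hard-coding machine-specific paths is to build the absolute path at runtime, anchored at the script's own location. This is a sketch of that idea; the helper name `path_beside` and the `data/data.txt` layout are assumptions for illustration.

```python
from pathlib import Path


def path_beside(anchor_file, *parts):
    """Return an absolute path to `parts`, starting from the directory
    that contains `anchor_file`."""
    return Path(anchor_file).resolve().parent.joinpath(*parts)


# In a real script you would pass __file__ as the anchor, for example:
#   DATA_FILE = path_beside(__file__, "data", "data.txt")
# The result no longer depends on the directory the script is started from.
```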
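Once you've opened a file with the `open()` function, you can perform various operations, such as reading its content, writing new data, or appending information. A few helpful methods for working with files are `read()` (return the whole content), `readline()` (return the next line), `readlines()` (return all lines as a list) and `write()` (write a string to the file).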
Here's an example of reading a file's content and printing it:
with open('file.txt', 'r') as file:
    content = file.read()
print(content)
The `with ... as ...` statement in the example is a good practice when working with files. It's called a "context manager" and ensures the file is properly closed after the operations within the block are completed, even if an error occurs.
When you're done working with a file, it's essential to close it using the `file.close()` method. This ensures the file is no longer taking up system resources and any changes you made are saved. If you're using the `with` statement, you don't need to explicitly close the file - the context manager does that for you.
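To see what the context manager saves you from writing, here is a sketch that does the same job both ways. The temporary directory and variable names are only there to keep the example self-contained.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.txt")

    # Manual version: try/finally guarantees close() runs even on error.
    f = open(path, "w")
    try:
        f.write("Hello")
    finally:
        f.close()
    manual_closed = f.closed

    # Context-manager version: the file is closed when the block ends.
    with open(path) as f2:
        content = f2.read()
    with_closed = f2.closed

print(manual_closed, with_closed, content)  # True True Hello
```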
Here's a simple example of writing some text to a file:
with open('file.txt', 'w') as file:
    file.write('Hello, World!')
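Be aware that the 'w' mode replaces whatever the file already contained, whereas 'a' adds to the end. The sketch below shows the difference on a throwaway file; the variable names are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.txt")

    with open(path, "w") as f:   # creates the file (or empties it)
        f.write("Hello, World!\n")

    with open(path, "a") as f:   # keeps existing content, adds to the end
        f.write("Goodbye!\n")

    with open(path) as f:
        after_append = f.read().splitlines()

    with open(path, "w") as f:   # opening with "w" again discards both lines
        f.write("Fresh start\n")

    with open(path) as f:
        after_rewrite = f.read().splitlines()

print(after_append)   # ['Hello, World!', 'Goodbye!']
print(after_rewrite)  # ['Fresh start']
```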
To wrap up, here are a few guidelines to keep in mind when working with files in Python:
1. Use context managers: Always use the `with` statement when opening files, as it ensures proper file handling and helps avoid potential errors. The `with` statement automatically closes files after the indented block of code is executed, even if an error occurs. This prevents issues related to unclosed files and saves system resources.
2. Prefer absolute over relative paths: When working with file paths, using absolute paths can lead to more predictable behavior, especially when your code is run in different environments or redistributed. Absolute paths specify the file location starting from the root directory, making it less prone to errors due to changes in folder structure. However, to maintain portability, consider dynamically generating absolute paths using built-in libraries like `os.path` or `pathlib`.
3. Handle exceptions: When working with files, it's important to handle exceptions gracefully. File handling operations such as opening, reading, and writing can raise errors, like FileNotFoundError or PermissionError. Use `try`, `except`, and `finally` blocks to deal with the errors you expect and can recover from, so that a missing or unreadable file doesn't crash your script.
4. Test and validate user-provided paths: When your script deals with file paths provided by users, ensure that you validate and sanitize the input. This can prevent security issues such as directory traversal attacks and reduces the likelihood of encountering errors when attempting to access files. Utilize built-in libraries such as `os.path` or `pathlib` to verify and manipulate file paths.
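As an illustration of guideline 3, here is a minimal sketch of a read helper that catches the two errors named above. The function name `read_text_or_none` and the choice to return `None` are assumptions for the example; in your own code you might log the problem or re-raise instead.

```python
def read_text_or_none(path):
    """Return the file's text, or None if it is missing or unreadable."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        print(f"No such file: {path}")
    except PermissionError:
        print(f"Not allowed to read: {path}")
    return None
```

Only the errors the caller can reasonably recover from are caught; anything unexpected still surfaces as an exception.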
By following these guidelines, you'll be able to write better, more reliable Python code for file handling and improve the overall quality of your software engineering.
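For guideline 4, one common approach is to resolve the user's input against a base directory and refuse anything that ends up outside it. The sketch below assumes a hypothetical helper called `resolve_inside`; it is a starting point rather than a complete security measure.

```python
from pathlib import Path


def resolve_inside(base_dir, user_path):
    """Resolve `user_path` against `base_dir` and reject results that
    fall outside the base directory (e.g. via '..' or an absolute path)."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"Path escapes {base}: {user_path!r}")
    return candidate
```

With a base of "/srv/files", an input of "reports/q1.txt" is accepted, while "../etc/passwd" raises `ValueError`.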
Python challenge: Business file organiser
You are hired by a large software company that has a centralized file server. Over time, the server has become disorganized as employees have been saving various types of files without following naming conventions or folder hierarchy.
Your task is to develop a Python script to analyze the server's filesystem and organize it to make it easier for employees to locate relevant files quickly.
The server has a mixture of files and folders, and they may contain files of various types (e.g., .txt, .pdf, .jpg, .docx).
Your script should perform the following tasks:
1. Traverse the entire file server and identify files along with their creation timestamps.
2. Group files based on their creation year and month (e.g., 2023-04) and create folders with this naming convention if they don't already exist.
3. Within each year-month folder, create subfolders for each file type (extension) encountered (e.g., txt, pdf, jpg, docx).
4. Move each file into its corresponding year-month and file type subfolder, without modifying its original filename.
Note that the file type subfolders should be flat, meaning no further organization is needed inside them.
Tip 1: You can utilize libraries such as `os` and `shutil` to efficiently handle file and directory operations, and `datetime` to manage dates and times. These libraries provide methods that simplify the task of traversing and organizing the file server.
Tip 2: See if you can use a single-pass approach to traverse through the directories and process files. During the traversal, extract the file's creation timestamp and file type, create the required directories if they don't exist, and directly move the files to their corresponding year-month and file type subfolders. This way, you avoid having to traverse the directories multiple times, reducing the overall processing runtime.
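If you'd like a starting point, here is one possible sketch that follows the two tips. It rests on a few assumptions that are not part of the task description: the function names `destination_for` and `organise` are made up, the modification time (`st_mtime`) stands in for the creation time because `os.stat()` does not expose a creation time portably, files without an extension go into a folder called "no_extension", and a file whose name already exists at the destination is left where it is.

```python
import os
import shutil
from datetime import datetime
from pathlib import Path


def destination_for(file_path, root):
    """Return root/YYYY-MM/<extension>/<filename> for the given file."""
    file_path = Path(file_path)
    # Assumption: modification time is used in place of creation time.
    stamp = datetime.fromtimestamp(file_path.stat().st_mtime)
    extension = file_path.suffix.lstrip(".").lower() or "no_extension"
    return Path(root) / stamp.strftime("%Y-%m") / extension / file_path.name


def organise(root):
    """Move every file under `root` into its year-month/extension folder.
    Returns the number of files moved."""
    root = Path(root)
    # Walk the tree once and snapshot the file list, so that folders
    # created while moving are not visited again.
    files = [Path(d) / name for d, _, names in os.walk(root) for name in names]
    moved = 0
    for source in files:
        target = destination_for(source, root)
        if source == target or target.exists():
            continue  # already in place, or a name collision: leave it
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(source), str(target))
        moved += 1
    return moved
```

Try it on a copy of a small test folder first, for example `organise("/tmp/test_server")`, and decide how you want to handle name collisions and the empty folders left behind.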