Issue 4 - The Spaghetti Files

You didn't expect it, but here it is! The fourth issue of your favourite code quality newsletter - Spaghetti Code - has landed in your inbox.

A brief introduction to UNIX filesystems for software engineers

If you learned software development in the past few years, chances are you are "cloud-native". You've learned how to develop for the cloud, on the cloud. Your development environment consists of online, collaborative Jupyter notebooks that live on the web. If that's the case, there are likely a few fundamental ideas you have missed out on, and this is hampering your growth.

One of those is the idea of filesystems.

Understanding how modern filesystems work - and where they came from - will help you get a better grasp of how your code interacts with the underlying operating system (OS). It will also help you diagnose issues with more precision, and avoid any mishaps when deploying code within various environments.

Before we dive in - what do we mean by "UNIX-like"?

"UNIX-like systems" are operating systems that resemble the original UNIX system in design and functionality. Examples include Linux and BSD. These systems have similar features, commands, and structures, making it easier for developers to work across different UNIX-like systems. But here I am using the term "UNIX-like" more broadly, to also include operating systems you are familiar with.

macOS is related to UNIX-like systems because it's built on a foundation called "Darwin", which is based on BSD. So, macOS has many similarities with UNIX-like systems and shares a common command-line interface.

Windows is somewhat different from UNIX-like systems because it has its own architecture and conventions. However, many concepts are similar and influenced by UNIX. In addition, Windows now offers the Windows Subsystem for Linux (WSL) which allows users to run a Linux environment. This way, Windows users can enjoy the benefits of both Windows and UNIX-like environments.

In UNIX-like systems, the filesystem is organized into a tree-like structure (a "hierarchy"), with the root ("/") being the starting point of your journey. From the root, you'll find various directories that serve specific purposes. Here are some common ones and what they're used for:

- /bin: Contains essential command-line binaries (programs) that every user can access.

- /etc: Contains configuration files for the system and various software packages.

- /home: Contains a directory for every user on the system with their personal files.

- /lib: Contains shared libraries used by system binaries.

- /tmp: A temporary storage space for files that may be cleared upon reboot.

- /usr: Contains non-essential command-line binaries, documentation, and header files needed for software development.

- /var: Contains variable data like log files, databases, and other files that change over time.

A few more words about "/home". A "home directory" is a special directory designated for each user on a computer. It's a personal space where you can store your files, documents, and configurations. On Linux and most other UNIX-like systems, the home directory is commonly located in the "/home" folder and named with your username. For example, if your username is "mojo", your home directory is usually "/home/mojo". On macOS, home directories live in the "/Users" folder instead, so it would be "/Users/mojo".

The "~" shortcut, also known as the "tilde", is a convenient way to refer to your home directory in UNIX-like systems and macOS when using the command line. Instead of typing out the full path to your home directory, you can simply use "~" as a shorthand. For example, you can navigate to your home directory by running the command "cd ~".

On Windows, the concept of a home directory exists as well, typically located in "C:\Users\mojo". While the tilde doesn't work as a shortcut by default in the Windows command prompt, it does work in the Windows Subsystem for Linux (WSL) and other UNIX-like environments on Windows.
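
If you need the home directory from inside a Python script, here is a minimal sketch using the standard `pathlib` module. The file name `notes.txt` is just a placeholder for illustration. Note that Python's `open()` does not expand "~" by itself, so you have to do it explicitly:

```python
from pathlib import Path

# Path.home() returns the current user's home directory,
# e.g. /home/mojo on Linux, /Users/mojo on macOS, C:\Users\mojo on Windows.
home = Path.home()

# expanduser() replaces a leading "~" with that same directory,
# which is what the shell does for you on the command line.
# "notes.txt" is a hypothetical file name used only for this example.
expanded = Path("~/notes.txt").expanduser()

print(home)
print(expanded)
```

Because this goes through `pathlib` rather than the shell, the same code works in the Windows command prompt, where "~" is not a shortcut.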

An important aspect of filesystems is that they control who has access to what. In UNIX-like systems, every file and directory carries permissions for three classes of users: the owner, the group, and others. Each class can be granted read, write, or execute permission. To list a file's permissions, you can use the command `ls -l`. Here's a sample output:

-rwxrw-r-- 1 user group 1024 Jan 1 00:00 myfile.txt

The first character indicates the type ('-' for a regular file, 'd' for a directory), and the next nine characters are the permissions for owner, group, and others, three characters each. In the example above, 'rwx' means the owner has read, write, and execute permissions, 'rw-' means the group has read and write permissions, and 'r--' means others only have read permission.
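
The same information is available from Python. The sketch below, which assumes a UNIX-like system, uses the standard `stat` module to turn a numeric mode into the string that `ls -l` prints, and then checks individual permission bits on a throwaway file created in a temporary directory:

```python
import os
import stat
import tempfile

# 0o100764 = regular file (0o100000) + rwx (7) for the owner,
# rw- (6) for the group, r-- (4) for others.
print(stat.filemode(0o100764))  # -rwxrw-r--

# Inspect a real file: create a temporary one, set its mode, read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "myfile.txt")
    open(path, "w").close()          # create an empty file
    os.chmod(path, 0o764)            # same permissions as the sample above

    mode = os.stat(path).st_mode
    listing = stat.filemode(mode)    # the string form, as in `ls -l`

    # Individual bits can be tested with the constants from `stat`.
    owner_can_execute = bool(mode & stat.S_IXUSR)
    others_can_write = bool(mode & stat.S_IWOTH)

print(listing, owner_can_execute, others_can_write)
```

On Windows, `os.chmod` only controls the read-only flag, so the second half of this example is meaningful only on UNIX-like systems.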

As a cloud-native developer, you might not deal with these concepts daily (although there is quite a bit of this in DevOps!), but understanding how they work can help with system interactions, configuration files, and application deployments. You'll be better equipped to work with containerization technologies like Docker, which rely on OS-level knowledge to efficiently package applications.

Lastly, you should also have working familiarity with some basic UNIX commands for working with files and directories.

  • `pwd`: Stands for "print working directory", this command shows the current directory you're in. Just type `pwd` and press enter to see the full path.
  • `ls`: Lists files and directories in your current directory. Simply type `ls` and press enter to see the contents. You can also list files and directories in another folder by typing `ls folder_name`. This command supports many different command-line options that can be quite powerful and versatile - you should learn about those (e.g. we used `ls -l` earlier to show a longer listing, with more details about each file, instead of just the file names.)
  • `cd`: Changes the current working directory. To move to a different directory, type `cd directory_name`. For example, if you want to move to a directory called "Documents", you would type `cd Documents`. To move up to the parent directory, use `cd ..`
  • `cp`: Copies a file or directory from one location to another. The command structure is `cp source destination`. For example, to copy a file called "file.txt" to a folder called "backup", you would type `cp file.txt backup/`.
  • `mv`: Moves a file or directory from one location to another. The command structure is similar to `cp`: `mv source destination`. For example, to move a file called "file.txt" to a folder called "backup", you would type `mv file.txt backup/`. This command can also be used to rename files, by "moving" the file to a new name: `mv old_name.txt new_name.txt`.
  • `rm`: Deletes a file or an empty directory. To remove a file, use the command `rm file_name`. For example, to delete a file called "file.txt", you would type `rm file.txt`. To remove an empty directory, use `rm -d directory_name`.
  • `touch`: Creates an empty file with a given name. To create a new file called "file.txt", you would type `touch file.txt`.
  • `mkdir`: Creates a new directory with a given name. To create a directory called "new_folder", you would type `mkdir new_folder`.
  • `rmdir`: Deletes an empty directory. To remove an empty directory called "empty_folder", you would type `rmdir empty_folder`.
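
Most of these commands have direct counterparts in Python's standard library. The following is a rough sketch of the mapping, run inside a temporary directory so nothing on your machine is touched. The file and folder names are the ones from the examples above:

```python
import shutil
import tempfile
from pathlib import Path

print(Path.cwd())                                        # pwd

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    (root / "file.txt").touch()                          # touch file.txt
    (root / "backup").mkdir()                            # mkdir backup
    shutil.copy(str(root / "file.txt"),
                str(root / "backup"))                    # cp file.txt backup/
    shutil.move(str(root / "file.txt"),
                str(root / "new_name.txt"))              # mv file.txt new_name.txt

    listing = sorted(p.name for p in root.iterdir())     # ls
    print(listing)

    (root / "backup" / "file.txt").unlink()              # rm backup/file.txt
    (root / "backup").rmdir()                            # rmdir backup

    after_cleanup = sorted(p.name for p in root.iterdir())
    print(after_cleanup)
```

There is no real equivalent of `cd` shown here on purpose: `os.chdir()` exists, but changing the working directory inside a script tends to make relative paths harder to reason about.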

By learning and practising these commands, you'll become more comfortable navigating and managing files and directories in a UNIX-like file system, which will make your coding experience smoother and more enjoyable.

Remember that building a strong foundation in these primary concepts will make you a more versatile software engineer. So, even if you're working on cloud-native applications, you'll be well-prepared to handle any OS surprises that come your way.


Working with files in Python

Handling files is an essential programming skill that allows you to read, write, and interact with the most common way of storing data - whether that's on the local filesystem, on a shared network location, or in the cloud.

The first thing you need, in order to access the content of files in Python, is the built-in function `open()`. Its two most important arguments are the file's name (possibly including its path) and the mode in which you want to open the file. The common modes are:

  • `'r'`: Read mode, for reading the file's content (this is the default mode if not specified)
  • `'w'`: Write mode, for creating a new file or overwriting an existing one
  • `'a'`: Append mode, for adding new data to an existing file
  • `'x'`: Create mode, for creating a new file but raising an error if the file already exists
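
To see how the four modes differ, here is a small sketch that runs them in sequence against one file in a temporary directory. The file name `notes.txt` is an arbitrary choice for the example:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.txt")

    with open(path, "w") as f:       # 'w' creates the file...
        f.write("first\n")
    with open(path, "w") as f:       # ...and overwrites it the second time
        f.write("second\n")
    with open(path, "a") as f:       # 'a' keeps existing content and adds to it
        f.write("third\n")
    with open(path) as f:            # no mode given, so this is 'r'
        content = f.read()

    try:
        open(path, "x").close()      # 'x' refuses to touch an existing file
        created = True
    except FileExistsError:
        created = False

print(repr(content), created)
```

The text "first" is gone by the end because the second `'w'` call truncated the file before writing.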

Here's the most basic example of opening a file for reading: `file = open('some_file.txt')`

Here, we only gave a simple filename, without specifying a full path.

In the context of file handling, paths are used to specify the location of a file or directory on the filesystem. There are two types of paths: relative and absolute.

Relative paths are interpreted relative to the current working directory, which is the directory your script was launched from (not necessarily the directory the script file lives in). When you're dealing with relative paths, you specify the file's location in relation to this current directory. For example, if your current working directory is a folder called "project" and you have a file named "data.txt" in a subfolder called "data", the relative path would be "data/data.txt".

Absolute paths, on the other hand, include the file's location starting from the root directory of the filesystem, providing the full path. For example, on a UNIX-like system, an absolute path might look like "/home/user/project/data/data.txt".

When using `open()` with a path name, Python interprets the path as follows:

  1. If you provide only a filename without any path, such as `open("some_file.txt")`, Python assumes the file is located in the current working directory.
  2. If you provide a relative path, such as `open("data/data.txt")`, Python locates the file relative to the current working directory.
  3. If you provide an absolute path, such as `open("/home/user/project/data/data.txt")`, Python accesses the file directly at the specified location.
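
You can check how a given path will be interpreted before you ever call `open()`. This sketch uses `pathlib` with the example paths from above and assumes a UNIX-like system (on Windows, a path starting with "/" but lacking a drive letter is not considered absolute):

```python
from pathlib import Path

relative = Path("data/data.txt")
absolute = Path("/home/user/project/data/data.txt")

print(relative.is_absolute())   # False
print(absolute.is_absolute())   # True on UNIX-like systems

# A relative path is resolved against the current working directory,
# so this is the location open("data/data.txt") would actually look at.
resolved = Path.cwd() / relative
print(resolved)
```

Printing `resolved` is a quick way to debug a `FileNotFoundError` that only happens when the script is started from a different directory.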

Understanding the difference between relative and absolute paths is essential in managing and accessing files properly in your Python script. It ensures that the files are located and accessed from the correct paths, especially when your script's directory structure changes or your code is run on different systems. So, always be mindful of whether you're working with relative or absolute paths when using `open()` and other file handling functions.

My rule of thumb is - prefer to work with absolute paths, where possible, in order to avoid confusion and to make your software easier to refactor.

Once you've opened a file with the `open()` function, you can perform various operations, such as reading its content, writing new data, or appending information. A few helpful methods for working with files are:

  • `file.read()`: Reads the entire content of the file.
  • `file.readline()`: Reads a single line from the file.
  • `file.readlines()`: Reads all lines in the file and returns them as a list.
  • `file.write('text')`: Writes the given text to the file (provided the file was opened in a writable mode such as `'w'`, `'a'`, or `'x'`).
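
Here is a short sketch that exercises all four methods on a three-line file created in a temporary directory. The file content is made up for the example:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.txt")

    with open(path, "w") as file:
        # write() returns the number of characters written
        chars_written = file.write("alpha\nbeta\ngamma\n")

    with open(path) as file:
        first_line = file.readline()     # reads up to and including the first newline
        remaining = file.readlines()     # continues from where readline() stopped

    with open(path) as file:
        everything = file.read()         # the whole file as one string

print(chars_written, repr(first_line), remaining)
```

Note that `readline()` and `readlines()` keep the trailing newline on each line, and that reads continue from the current position in the file rather than starting over.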

Here's an example of reading a file's content and printing it:

with open('file.txt', 'r') as file:
    content = file.read()
    print(content)

The `with ... as ...` statement in the example is a good practice when working with files. It's called a "context manager" and ensures the file is properly closed after the operations within the block are completed, even if an error occurs.

When you're done working with a file, it's essential to close it using the `file.close()` method. This ensures the file is no longer taking up system resources and any changes you made are saved. If you're using the `with` statement, you don't need to explicitly close the file - the context manager does that for you.
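
To make the comparison concrete, this sketch writes a file with an explicit `close()` in a `finally` block, then reads it back with `with`, and checks the `closed` attribute in both cases. It uses a temporary directory so it can be run anywhere:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.txt")

    # Manual version: close() goes in `finally` so it runs even if write() fails.
    file = open(path, "w")
    try:
        file.write("Hello, World!")
    finally:
        file.close()
    closed_manually = file.closed

    # Context-manager version: the file is closed when the block ends.
    with open(path) as file:
        content = file.read()
    closed_by_with = file.closed

print(closed_manually, closed_by_with, content)
```

Both approaches leave the file closed, but the `with` version is shorter and harder to get wrong.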

Here's a simple example of writing some text to a file:

with open('file.txt', 'w') as file:
    file.write('Hello, World!')

To wrap up, here are a few guidelines to keep in mind when working with files in Python:

1. Use context managers: Always use the `with` statement when opening files, as it ensures proper file handling and helps avoid potential errors. The `with` statement automatically closes files after the indented block of code is executed, even if an error occurs. This prevents issues related to unclosed files and saves system resources.

2. Prefer absolute over relative paths: When working with file paths, using absolute paths can lead to more predictable behavior, especially when your code is run in different environments or redistributed. Absolute paths specify the file location starting from the root directory, making it less prone to errors due to changes in folder structure. However, to maintain portability, consider dynamically generating absolute paths using built-in libraries like `os.path` or `pathlib`.

3. Handle exceptions: When working with files, it's important to handle exceptions gracefully. File handling operations such as opening, reading, and writing can raise errors like `FileNotFoundError` or `PermissionError`. Use `try`, `except`, and `finally` blocks to handle these situations and keep your script from crashing.

4. Test and validate user-provided paths: When your script deals with file paths provided by users, ensure that you validate and sanitize the input. This can prevent security issues such as directory traversal attacks and reduces the likelihood of encountering errors when attempting to access files. Utilize built-in libraries such as `os.path` or `pathlib` to verify and manipulate file paths.
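
The guidelines above can be combined in a few lines. The sketch below is one possible approach, not a complete security solution: the helper names `resolve_inside` and `read_text_or_none` are made up for this example, and the policy of returning `None` on a missing or unreadable file is just an assumption about what your application wants.

```python
import tempfile
from pathlib import Path


def resolve_inside(base_dir, user_path):
    """Return an absolute path for user_path, or raise ValueError if it escapes base_dir."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()   # collapses ".." and follows symlinks
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"{user_path!r} is outside {base}")
    return candidate


def read_text_or_none(path):
    """Read a text file, returning None if it is missing or not readable."""
    try:
        with open(path, encoding="utf-8") as file:   # context manager closes the file
            return file.read()
    except (FileNotFoundError, PermissionError):
        return None


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "report.txt").write_text("quarterly numbers", encoding="utf-8")

    found = read_text_or_none(resolve_inside(tmp, "report.txt"))
    missing = read_text_or_none(resolve_inside(tmp, "missing.txt"))

    try:
        resolve_inside(tmp, "../../etc/passwd")   # a directory traversal attempt
        escaped = True
    except ValueError:
        escaped = False

print(found, missing, escaped)
```

Resolving both paths before comparing them is what catches the "../" trick: the check is made on where the path really points, not on how it was spelled.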

By following these guidelines, you'll be able to write better, more reliable Python code for file handling and improve the overall quality of your software engineering.

Python challenge: Business file organiser

You are hired by a large software company that has a centralized file server. Over time, the server has become disorganized as employees have been saving various types of files without following naming conventions or a folder hierarchy.

Your task is to develop a Python script to analyze the server's filesystem and organize it to make it easier for employees to locate relevant files quickly.

The server has a mixture of files and folders, and they may contain files of various types (e.g., .txt, .pdf, .jpg, .docx).

Your script should perform the following tasks:

1. Traverse the entire file server and identify files along with their creation timestamps.

2. Group files based on their creation year and month (e.g., 2023-04) and create folders with this naming convention if they don't already exist.

3. Within each year-month folder, create subfolders for each file type (extension) encountered (e.g., txt, pdf, jpg, docx).

4. Move each file into its corresponding year-month and file type subfolder, without modifying its original filename.

Note that the file type subfolders should be flat, meaning no further organization is needed inside them.


Tip 1: You can utilize libraries such as `os` and `shutil` to efficiently handle file and directory operations, and `datetime` to manage dates and times. These libraries provide methods that simplify the task of traversing and organizing the file server.

Tip 2: See if you can use a single-pass approach to traverse through the directories and process files. During the traversal, extract the file's creation timestamp and file type, create the required directories if they don't exist, and directly move the files to their corresponding year-month and file type subfolders. This way, you avoid having to traverse the directories multiple times, reducing the overall processing runtime.
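
If you'd like a starting point, here is one possible shape for the single-pass approach, offered as a sketch rather than a reference solution. It makes several assumptions you should revisit: the function names `target_folder` and `organise` are invented for this example, the organised tree is written to a separate destination folder outside the source, name clashes are simply skipped, and the modification time (`st_mtime`) stands in for the creation time, because a true creation timestamp is not available in a portable way (on Linux, `st_ctime` is the time of the last metadata change).

```python
import os
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def target_folder(timestamp, filename):
    """Return the relative folder for a file, e.g. 2023-04/txt."""
    year_month = datetime.fromtimestamp(timestamp).strftime("%Y-%m")
    extension = Path(filename).suffix.lstrip(".").lower() or "no_extension"
    return Path(year_month) / extension


def organise(source_root, destination_root):
    """Walk source_root once and move every file into destination_root."""
    destination_root = Path(destination_root)
    moved = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            source = Path(dirpath) / name
            # Assumption: modification time is used in place of creation time.
            folder = destination_root / target_folder(source.stat().st_mtime, name)
            folder.mkdir(parents=True, exist_ok=True)
            target = folder / name
            if target.exists():
                continue                     # name clash: leave the file where it is
            shutil.move(str(source), str(target))
            moved.append(target)
    return moved


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "server"
    destination = Path(tmp) / "organised"
    (source / "team" / "old").mkdir(parents=True)
    sample = source / "team" / "old" / "report.PDF"
    sample.write_text("demo")
    stamp = datetime(2023, 4, 15, 12, 0).timestamp()
    os.utime(sample, (stamp, stamp))         # pretend the file dates from April 2023

    moved = organise(source, destination)
    relative_targets = [p.relative_to(destination).as_posix() for p in moved]
    print(relative_targets)
```

Try it on a copy of real data first, and decide what should happen to files that share a name before pointing anything like this at a live server.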
