Python's "with" keyword - in context
In Python, everything is an object. To some this may be a tad cliché to repeat, but underneath this rather innocuous statement lies a paradigm that underpins every object developers interact with in the Python programming language. This paradigm is the Python data model.
The Python data model[1] provides developers an interface with which to hook into some of the core functionality of the Python programming language. The model provides an explicit API through which all Python objects can be interacted with. Knowing this model, and how to leverage it to create our own hooks for customized object abstractions, is one of the core tenets of advanced Python programming.
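As a small illustration of what "hooking in" means, consider how a plain class can participate in built-in language behavior just by defining the right dunder methods. This is a minimal sketch; the Playlist class and its contents are made up for illustration:

class Playlist:
    def __init__(self, tracks):
        self.tracks = list(tracks)

    def __len__(self):
        # hooks into the built-in len()
        return len(self.tracks)

    def __contains__(self, track):
        # hooks into the 'in' operator
        return track in self.tracks

songs = Playlist(['intro', 'outro'])
print(len(songs))        # 2
print('intro' in songs)  # True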
One of my favorite abstractions that demonstrates this data model paradigm is the context manager. Context managers are designed to enforce the execution of a set of protocols around the scope of a given block of code. Most commonly this is observed in the context of resource management, but in theory, the context manager can be applied to any block of code which requires an 'enter and exit' strategy to interact 'with'.
I'm choosing these words rather playfully, as the three elements which context managers require are the 'with' keyword, which invokes the context manager (CM for short), and the __enter__() and __exit__() methods, which Python executes as the CM is entered and exited, respectively. The result is that a very technical protocol gets wrapped in beautiful, idiomatic Python code.
class ContextManager:
    def __enter__(self):
        print('entering...')

    def __exit__(self, *args, **kwargs):
        print('exiting...')

with ContextManager():
    print('in context')

In []: python -u d:\programming\sandbox\contextmanager.py
Out[]:
entering...
in context
exiting...
As a quick point, the CM methods emulate the functionality of a try/finally block, where __exit__() is called regardless of the state of the block inside the context manager. Any exception raised inside the block will still ultimately propagate, but the benefit here is that we get the exit protocol to execute no matter the success or failure of the encompassed code block.
class ContextManager:
    def __enter__(self):
        print('entering...')

    def __exit__(self, *args, **kwargs):
        print('exiting...')

with ContextManager():
    raise ValueError

In []: python -u d:\programming\sandbox\contextmanager.py
Out[]:
entering...
exiting...
Traceback (most recent call last):
  File "d:\programming\sandbox\contextmanager.py", line 9, in <module>
    raise ValueError
ValueError
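For completeness: the *args that __exit__() quietly absorbs above are actually three values — the exception type, the exception instance, and its traceback (all None when the block exits cleanly). Returning True from __exit__() tells Python to suppress the exception entirely. Here's a minimal sketch (the Suppressing class name is my own, for illustration):

class Suppressing:
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # all three arguments are None on a clean exit;
        # returning True swallows the in-flight exception
        if exc_type is ValueError:
            print('suppressed:', exc_value)
            return True
        return False

with Suppressing():
    raise ValueError('oops')
print('still running')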
As I mentioned before, the most common examples of CMs come in the form of resource managers. Many resource-management objects in Python are, by design, context managers, built to abstract away from developers the need to incessantly call open, close, read, write, save, etc., over I/O resources that would otherwise stay exclusively tethered to your Python process during runtime. For example, open(path) can be used as a CM; in that case the CM handles the I/O setup and teardown protocols under the hood, so the developer doesn't have to worry about resource management themselves.
with open(r'D:\Programming\csv_reader\this.txt', 'w') as file:
    file.writelines(['this\n', 'that\n', 'this and that\n'])

with open(r'D:\Programming\csv_reader\this.txt', 'r') as file:
    print(file.readlines())

---vs---

file = open(r'D:\Programming\csv_reader\this.txt', 'w')
file.writelines(['this\n', 'that\n', 'this and that\n'])
file.close()

file = open(r'D:\Programming\csv_reader\this.txt', 'r')
print(file.readlines())
file.close()
Example: CSVdb
CSV files are ubiquitous in the realm of data science and analytics. They're easy to generate and simple to send and receive. Oftentimes data in a CSV file will be arranged as a table, where a line of headers describes the subsequent columns of data in the file.
The one major drawback to these types of CSV files, however, is that they're row-oriented. We can easily read a row of text as a string from a CSV file via file.readline(). However, as mentioned above, the data in CSVs is commonly consumed by column. Yet there's no 'file.readcolumn()' method to be found! That's annoying. Now, this may be a non-issue if we can simply read the entire thing into memory using something like pandas.read_csv(), but this is a non-starter if our hardware doesn't have the RAM to hold the whole file at once.
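To make the mismatch concrete, here's a minimal sketch (the 'data.csv' path is hypothetical): streaming rows is one line of code, but extracting even a single column forces a full pass over every row.

import csv

with open('data.csv') as f:  # hypothetical file
    reader = csv.reader(f)
    headers = next(reader)
    # rows stream one at a time...
    # ...but grabbing just column 0 still means touching every row
    first_column = [row[0] for row in reader]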
For Python, a great yet simple solution to this problem is to read our CSV file into a relational database. While a CSV gives us no way to read by column, a relational db does. Python ships natively with a relational database module called sqlite3, which, combined with Python's csv module, provides us with the necessary functionality for reading our data into Python by column.
import csv
import os.path
import sqlite3
import tempfile

import pandas as pd

# read the header row off the top of the CSV
file = open(r'D:\Programming\csv_reader\this.txt')
reader = csv.reader(file, delimiter=',')
headers = next(reader)

# spin up a throwaway SQLite db in a temporary directory
temp_dir = tempfile.TemporaryDirectory()
conn = sqlite3.connect(os.path.join(temp_dir.name, 'temp.db'))
sql = conn.cursor()

# create a table whose columns mirror the CSV headers
sql.execute(
    """
    CREATE TABLE IF NOT EXISTS csv (
        table_id INTEGER PRIMARY KEY,
        {}
    )
    """.format(
        ', '.join([h + " TEXT" for h in headers])
    )
)
conn.commit()

# stream the remaining rows straight from the reader into the table
sql.executemany(
    "INSERT INTO csv({}) VALUES({})".format(
        ', '.join(headers),
        ', '.join(['?' for _ in range(len(headers))])
    ),
    reader
)
conn.commit()
file.close()

# pull a single column back out as a pandas Series
target_header = headers[0]
df = pd.read_sql_query(
    "SELECT {} from csv".format(target_header),
    conn
)
series = df[target_header]
print(series)

# teardown
sql.close()
conn.close()
temp_dir.cleanup()
As we can see here, without reading the entire file into memory, we can push the contents of our CSV file into a relational database, and subsequently use SQL to read out a column of data from the *.db file into a pandas series.
This is a great start on our problem, but we can clean this up tremendously if we consider that, other than the target_header, df, and series objects, everything else codified here is simply setup and teardown protocol. In other words, we can generalize this protocol with a context manager.
import csv
import os.path
import random
import sqlite3
import tempfile

import pandas as pd


# If the CSV has a blank header (like for an index column)
# we'll just give it a random one here, since
# headers in SQLite can't be blank
def string_generator(string_length: int, char=False) -> str:
    if not char:
        return ''.join(
            [chr(random.randint(65, 90)) for _ in range(string_length)]
        )
    return char * string_length


class CSVdb:

    def __init__(self, input_file: "file"):
        self.file = input_file

    def __enter__(self) -> 'CSVdb':
        reader = csv.reader(self.file, delimiter=',')
        self.headers = [
            x if x != '' else string_generator(5)
            for x in next(reader)
        ]
        self.temp_dir = tempfile.TemporaryDirectory()
        self.conn = sqlite3.connect(
            os.path.join(self.temp_dir.name, 'temp.db')
        )
        self.sql = self.conn.cursor()
        self._execute_and_commit(self._create_table())
        self._executemany_and_commit(self._insert_values(), reader)
        self.file.close()
        return self

    def __exit__(self, *args, **kwargs):
        self.sql.close()
        self.conn.close()
        self.temp_dir.cleanup()

    @staticmethod
    def from_filepath(path: str) -> 'CSVdb':
        return CSVdb(open(path))

    def pull_column(self, header: str) -> pd.Series:
        if header not in self.headers:
            raise ValueError("header requested not in database")
        df = pd.read_sql_query(
            "SELECT {} from csv".format(header),
            self.conn
        )
        return df[header]

    def _create_table(self) -> str:
        return """
        CREATE TABLE IF NOT EXISTS csv (
            table_id INTEGER PRIMARY KEY,
            {}
        )
        """.format(
            ', '.join([h + " TEXT" for h in self.headers])
        )

    def _insert_values(self) -> str:
        return """
        INSERT INTO csv({}) VALUES({})
        """.format(
            ', '.join(self.headers),
            ', '.join(['?' for _ in range(len(self.headers))])
        )

    def _execute_and_commit(self, *args, **kwargs) -> None:
        self.sql.execute(*args, **kwargs)
        self.conn.commit()

    def _executemany_and_commit(self, *args, **kwargs) -> None:
        self.sql.executemany(*args, **kwargs)
        self.conn.commit()
With this CSVdb class, we've completely abstracted away any interaction with the setup and teardown of these file and directory objects. Instead of having to keep track of every I/O instance, we can simply call the context manager's pull_column() method to extract the necessary column(s) of data within a 'with' block.
with CSVdb.from_filepath(r'D:\Database\csv\NOAA\california.csv') as c:
    series = c.pull_column(<TARGET_HEADER>)
    df = pd.DataFrame(
        {h: c.pull_column(h) for h in <ITERABLE_TARGET_HEADERS>}
    )
Summary
The Python data model provides developers with a powerful way to hook into some of the core functionality of the Python programming language. With these protocols, we have the capacity to create our own objects that interact with the language at a more fundamental level. Here, we saw how the __enter__() and __exit__() methods let us hook into the context manager protocol, allowing us to abstract away the setup and teardown of I/O instances using the 'with' keyword. The result is beautiful, idiomatic Python code.