Migrating Scripts From Imperative JavaScript to Functional Python (with a Guest Appearance by SQLite)

I recently migrated a bunch of local scripts from Node.js to Python. I also started storing all of the raw data for these scripts in SQLite as an intermediate step.

Here's why I did this and what I learned.

What Am I Working on?

I'm working on By The Topics, a platform to help people learn about national and local politicians, the policies they create, what they stand for, and how those policies change our lives.

Specifically, about a year ago I created a set of scripts that take raw congress member data and transform it into a format where a single web page on bythetopics.com only needs to retrieve data from one small JSON file.

I've set up the site as if it were created with a Static Site Generator (SSG) like Gatsby, but a little different. All the access patterns that a user can make on the site are, so far, very specific and known. This makes static site generation a worthwhile approach, and the site is fast because of it.

The frontend for By The Topics uses Solid Start, a meta-framework built on top of SolidJS. SolidJS is a UI library that's known for its best-in-class performance. Solid's ecosystem is quite small compared to React's, and because of this, I decided to roll my own SSG solution. This also made the solution more performant since it was specific to my data model.

I started writing this SSG scripting in Node.js about a year ago. In the last few weeks, I've migrated the scripts, first to TypeScript with more of a functional programming approach, and then finally to Python (still maintaining the functional programming approach). SQLite was also a major part of this migration.

Why Did I Start with Node?

I started with Node purely because of my familiarity with it. I think this is a perfectly good reason to choose any tool. Ultimately, when you start building something, you're making a best guess as to how it should go and the problems you might face along the way.

Here's what I knew to begin with:

  • I wanted to be able to run these scripts locally from my computer.
  • The performance needed to be "good enough". It didn't need to be ultra fast. It needed to be fast enough that the scripts could run as a cron job, or be kicked off manually and finish in the time it took me to make a cup of coffee.
  • A scripting language would make it very easy to scaffold things and get going quickly. I don't have to worry about compile times. I can just run the scripts with no effort.

The majority of these scripts involved file I/O and transforming data. I could do these things in Node with no effort, and so I went with that.

I also tried Bun when I started writing these scripts. Bun was noticeably faster, but I ultimately stuck with Node across most of my scripts because, at the time, I found it difficult to quickly scaffold certain things in Bun. I kept just using Node's native modules within Bun (Bun can interop with native Node modules, although I'm not sure how much parity there is).

On at least one occasion, whatever I was doing was just not working in Bun while it did work in Node, so I gave up on using it. I would probably give Bun (and Deno) another shot in the future, but that was my experience with it about a year ago (granted, I believe 1.0 had just been released at the time). I imagine Bun and Deno will be even more popular in the coming years, and I can see myself giving them another try.

Finally, I just wasn't great at writing Python at the speed I could write JavaScript. At the time I created the scripts, I was writing tons of JavaScript for my job, so I felt more comfortable with it.

Everything was going great with the Node scripts for a while. All in all, I wrote about 10 or 15 scripts that all did various things. One particular script took several minutes to run. I knew it would be a problem over time, but nonetheless I ran this sequence of scripts usually several times a day, and hundreds of times over the past year. They worked well!

Why Did I Migrate to Python?

The members of U.S. Congress generate several thousand bills every two years. The current Congress has generated around 10,000 bills. In the past decade, millions and millions of individual votes have been cast on bills. These bills and votes are a big part of my data set.

Recently I decided to fetch all of the congressional bills for the current Congress. I had started working with a data scientist on the project, and I figured it would be a good idea to try to fill out as much of the data set as possible to prepare for the upcoming work.

Before I scraped all this new data, I had a few thousand bills. Afterward, I had more than double the number of bills.

This slowed down my scripts. Like, dramatically. That one problem script I mentioned earlier now took upwards of 15 minutes to finish. What was once an "I'll fix it later" problem became a fire that needed to be put out.

The system was having trouble scaling, and it was concerning because I knew I'd be pulling more and more data in the near future.

Upgrade 1: Cache Data That I've Already Transformed

So how was I going to fix it? The most immediate and obvious solution was to cache data that I had already transformed.

You might be thinking, "Why didn't you already do that?" The answer is that I'm lazy and I had other user-facing features to build.

This was one of those things that I was going to do eventually. It's a no-brainer. One of the easiest ways to make something more performant is to memoize the result of a computation and then reuse it later on. I could detect changes to certain data files and only re-transform them based on whether certain raw data had changed.
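Here's a minimal sketch of that kind of change detection: hash each raw file and only re-transform the ones whose hash changed since the last run. The manifest path, file names, and helper names below are hypothetical, not my actual setup.

import hashlib
import json

# hypothetical helper: fingerprint a raw data file so we can tell if it changed
def file_hash(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# hypothetical helper: compare against the hash recorded on the previous run
def needs_retransform(path, manifest):
    return manifest.get(path) != file_hash(path)

with open("./cache/manifest.json", "r") as f:
    manifest = json.load(f)

bill_files = ["./data/bills/us/hr-1.json", "./data/bills/us/s-42.json"]
stale_files = [p for p in bill_files if needs_retransform(p, manifest)]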

Unfortunately, this is where the plot thickens in a bad way. There was another problem with that script that took 15 minutes to run.

The code for the script was bad. It was truly difficult for me to understand everything going on -- and I wrote it!

This is one of the big problems with easy scripting languages like JavaScript and Python. You write a hundred lines of imperative code to quickly get something up and running. You put a few pieces of global state somewhere at the top of the file without a care. You throw some for loops into the global context, and you nest other for loops within those, using outer pieces of state as you go. Who cares? It's only a hundred lines of code!

But then the code grows. And grows. And before you know it, you have a thousand lines of spaghetti code. You leave the file alone for several months and then come back, and what happens? You feel completely lost!

"I'll just move some of these functions into a util file", you say to yourself, but you realize your half-baked solution has its flaws. Some of the functions are using global state in the file, and the global state would now be needed in both files.

And while you can remedy that problem by passing the state into the functions as arguments, other nastier problems start to show up. Your imperative code is modifying global state in ways that are difficult to keep track of, and there are side effects reading and writing files all over the place.

So what started off as a quick cache implementation would have to be handled with a nuclear approach. I decided to buckle down and refactor all of the imperative code in hopes that it would be easier to read and be more maintainable, and I decided to be much more principled about how I wrote it.

Upgrade 2: Refactor The Code: Functional Core, Imperative Shell

In my refactoring efforts, I tried to be more strict about my approach to functional programming. To prepare, I bought two books: Effective Haskell by Rebecca Skinner and Grokking Simplicity by Eric Normand. Both books are about functional programming. It was time to get serious.

I wish I could sit here and say that I read both books twice front to back, but honestly I only read about 30 pages total between them (I'm lazy, remember?). Nonetheless, I picked up enough from those pages and from various sources on the internet to approach my messy code with a few core principles:

  1. "Functional Core, Imperative Shell" -- the "main" module of the code can still remain imperative, but most of the business logic would live in pure functions
  2. Isolate side effects (like file I/O) to their own impure functions. These functions should aim to do nothing else except perform the side effect in question.
  3. Avoid relying on closure over global state within functions. Anything that the function relies on should go into the function through its inputs.
  4. If only one function uses some large constant (like a dictionary, map, or object), then enclose the constant within the function's scope. If the constant is large or has some kind of up-front computation cost where declaring only once in memory makes sense, then wrap the function in an IIFE with the constant in the outer scope and have it return a function which has access to the constant through closure so that the constant is initialized in memory only once. (This will make more sense with an example later on)
  5. Avoid looping logic in the imperative shell, and instead rely on helpers like map, reduce, and filter, where state is enclosed within the scope of the callback function passed to the looper method.
  6. Avoid "multi-purpose" functions and instead use function composition. The function should do what its name says. If it does more than that, then split the business logic up into multiple functions. If the desire is to then process a piece of data sequentially through these functions, then use a `compose` function to orchestrate this logic. (This will make more sense with an example later on)

There may be some functional programming purists out there who feel this list doesn't cut the mustard, and I'm sure I'm missing things here. For me, I was looking for something that would be a compass for the code I write going forward, a compass that would help me write better code while also feeling practical and easy to implement. This felt good.

With my new approach, it was slow going at first, but as I wrote primitives and learned patterns, things picked up quickly.

Here's a quick look at what each of these looks like in practice.

1. "Functional Core, Imperative Shell"

import sqlite3

from utils import get_transformed_bill_data, read_folder

# connect to db
con = sqlite3.connect("./sqlite/main.db")

# establish db cursor
cur = con.cursor()

bill_names = read_folder("./data/bills/us/")

data_to_insert = list(
    map(
        get_transformed_bill_data,
        bill_names,
    )
)

cur.executemany(
    "INSERT OR REPLACE INTO bill VALUES(?, ?, ?, ?, ?, ?)",
    data_to_insert,
)

con.commit()        

Here's an example of a script that inserts bill data into a SQLite table. It's short and readable. This is the imperative shell. The business logic is encapsulated into functions with names that tell you exactly what they do, and any impure functionality is isolated.

`read_folder` is an impure function, and because of that, it's treated as radioactive. It does one thing and only one thing, and it's not mixed with other functionally pure business logic.
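I haven't shown `read_folder` in this post, but it's roughly this shape. This is a sketch rather than the exact implementation:

from os import listdir

# perform exactly one side effect (listing a directory) and return plain data
# for the pure functions downstream to work with
def read_folder(folder_path):
    try:
        return listdir(folder_path)
    except OSError:
        return []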

`get_transformed_bill_data` is encapsulated business logic in a pure function. It's actually a set of pure functions wrapped in a `compose()` function (more on that later). From this file, the important parts are obvious: we're generating a new list from the raw list `bill_names` and storing the new list in `data_to_insert`.

Altogether, the script is imperative, but parts of logic that can easily be encapsulated and named (particularly sections that are looped) are abstracted away into their own functions.

2. Isolate side effects (like file I/O) to their own impure functions

from json import load

# read and parse a JSON file; fall back to an empty dict if anything goes wrong
def read_json(file_name):
    try:
        with open(file_name, "r") as file:
            return load(file)
    except (OSError, ValueError):
        return {}

Side effects are radioactive. If you throw them around willy nilly, then you'll find yourself mocking external APIs in your tests, unexpected things will happen in your functions, and it's generally harder to pin down what comes out of a function. When a pure function is given the same input twice, then the output will be the same for both calls. You can't make that claim with an impure function.
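Here's a contrived illustration of that claim, with made-up functions:

from datetime import datetime

# pure: the same input always produces the same output
def label_bill(bill_id):
    return f"bill-{bill_id}"

# impure: the output depends on the clock, so two calls with the same
# argument can return different values
def label_bill_with_timestamp(bill_id):
    return f"bill-{bill_id}-{datetime.now().isoformat()}"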

3. Avoid relying on closure over global state within functions

my_cool_map = {
  "cheetos": 255,
  "doritos": 9000,
  ...
}

# dont
def get_foo(key):
  return my_cool_map[key]

# do
def get_bar(key, my_cool_map):
  return my_cool_map[key]        

This ridiculously contrived example seems innocent, but suppose multiple functions are setting and getting values in a piece of global state. As logic grows, this becomes potentially unmanageable.

4. If only one function uses some large constant (like a dictionary, map, or object), then enclose the constant within the function's scope.

Here's a somewhat contrived example:

def get_formatted_vote_date_with_state():
    month_map = {
        "Jan": "January",
        "Feb": "February",
        "Mar": "March",
        "Apr": "April",
        "May": "May",
        "Jun": "June",
        "Jul": "July",
        "Aug": "August",
        "Sep": "September",
        "Oct": "October",
        "Nov": "November",
        "Dec": "December",
    }

    def format_date(date_str):
        if not date_str:
            return ""

        if "-" not in date_str:
            return date_str

        day, month, year = date_str.split("-")

        formatted_day = day[1:] if len(day) == 2 and day[0] == "0" else day

        formatted_month = month_map[month]

        formatted_date = f"{formatted_month} {formatted_day}, {year}"

        return formatted_date

    return format_date

# Create the function
get_formatted_vote_date = get_formatted_vote_date_with_state()        

I'll be honest. If there's one of my six principles that I felt might be a little extra, it's this one. But nonetheless, I try to practice this principle when it makes sense.

`month_map` is only created once in memory given the code above. If there were instead a `get_formatted_vote_date()` function with `month_map` defined inside the function body, the constant would be created in memory on every invocation of the function.
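For contrast, here's the version that paragraph describes, where the literal is rebuilt on every call (truncated for brevity):

def get_formatted_vote_date(date_str):
    # this dictionary is constructed again on every invocation
    month_map = {
        "Jan": "January",
        "Feb": "February",
        "Mar": "March",
        # ...and the rest of the months
    }
    day, month, year = date_str.split("-")
    return f"{month_map[month]} {day.lstrip('0')}, {year}"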

Perhaps overkill under certain circumstances, but it's a "principle". It's good to aim for it.

5. Avoid looping logic in the imperative shell, and instead rely on helpers like map, reduce, and filter, where state is enclosed within the scope of the callback function passed to the looper method.

# dont
transformed_actions = []
for raw_action in raw_actions:
  ...

# do
transformed_actions = list(map(get_transformed_actions, raw_actions))        
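The same idea applies to filtering and reducing; the running state lives inside the callback instead of the shell. The function and field names here are placeholders:

from functools import reduce

# keep only the bills that passed, with no flag variables in the shell
passed_bills = list(filter(is_passed_bill, raw_bills))

# tally yea votes with no running total sitting in module scope
total_yea_votes = reduce(lambda acc, bill: acc + bill["yea_count"], passed_bills, 0)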

6. Avoid "multi-purpose" functions and instead use function composition.

I admit that `compose` functions are typically opaque and difficult to understand at first glance:

# python
from functools import reduce

def compose(*fns):
    def composed(x):
        return reduce(lambda v, f: f(v), reversed(fns), x)

    return composed

// typescript
const compose =
  (...fns: any) =>
  (x: any) =>
    fns.reduceRight((v: any, f: any) => f(v), x)        

They're much easier to understand when you see how they're used.

get_transformed_goals = compose(transform_goal, read_goals_file)

transformed_goals = list(map(get_transformed_goals, raw_list))        

Each item in `raw_list` is sequentially piped through the `get_transformed_goals` composition function. This means the item is first passed to `read_goals_file`, and then the output of that function is passed as the input to `transform_goal`. (It's traditional that the first function in the pipe is the right-most function, and the data moves right to left through the pipe. Don't ask me why. Math or something. I don't make the rules.)

What's nice about this is that I'm now sequentially processing a single item, and I've also isolated side effects to an impure function. These composition functions can also be made up of many, many smaller functions. It opens up a world of possibilities.

This is particularly useful in data pipelines if you want to process a single item at a time, especially if each item takes up a good chunk of memory. Rather than requiring tons of memory to hold many items at once before all the data is sent off in a side effect, you only use the memory needed by a single item.
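To make that concrete with the bill script from earlier: `map()` in Python 3 is lazy, and sqlite3's `executemany` accepts any iterable, so dropping the `list()` call means each bill is read and transformed only as the insert consumes it. A sketch:

# without the list() call, map() yields one transformed bill at a time as
# executemany pulls from the iterator, so the whole set never sits in memory
cur.executemany(
    "INSERT OR REPLACE INTO bill VALUES(?, ?, ?, ?, ?, ?)",
    map(get_transformed_bill_data, bill_names),
)

con.commit()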

Using point-free style, state that gets used in business logic (like `raw_action` in the loop example from principle 5) doesn't get exposed in the imperative shell. This makes it easier to compose functionality and prevents some unwitting programmer from coming in and introducing a side effect to the for loop, or mutating `raw_action` in some way that ends up breaking other logic.

This principle reinforces the practice of well-named functions describing the task at hand.


Refactoring my code with these principles was great, and it's something that I'm going to be doing as much as possible where it makes sense to do so (given that it fits my team's style).

But there was one more problem. I now had so much more data, and I wanted to be able to see it in tables, to sort it, and to easily query things from it.

A year ago, I didn't think to store my data in a database. My line of thinking was that it would either be expensive to set up in AWS, or it would be annoying to set up locally. Sweet lady liberty, was I wrong.

Upgrade 3: SQLite and Python

Claude made it clear that my line of thinking was wrong, and that I should try SQLite. So I did. 11 tables and millions of rows later, I can say that SQLite is awesome. I can already tell it's going to revolutionize the data story for By The Topics.

With SQLite, I can quickly create and destroy tables locally, and the database is just a single file. I can also partition my data into multiple database files (I haven't done this yet, but I plan on it). It's incredibly easy to use, and it's opened up new doors for the project.
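If I do end up partitioning, SQLite can attach multiple database files to one connection and query across them, which is roughly the idea. A sketch, with made-up file, table, and column names:

import sqlite3

con = sqlite3.connect("./sqlite/main.db")

# attach a second database file to the same connection; its tables are then
# addressed through the "votes" schema prefix (file and table names are made up)
con.execute("ATTACH DATABASE './sqlite/votes.db' AS votes")

rows = con.execute(
    "SELECT b.bill_id, v.total_yea "
    "FROM bill AS b JOIN votes.vote_total AS v ON v.bill_id = b.bill_id"
).fetchall()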

This is where the pivot to Python also occurred. Initially I tried setting up SQLite with Node, and I found that the most popular NPM package was written in CommonJS. My current scripts were using ES modules. If you're familiar with the various module systems in JavaScript, you'll know that trying to work with multiple module types at the same time can be a royal pain.

I rolled my eyes very hard when I found out the NPM package was using CommonJS, and I didn't want to deal with it. Oh, JavaScript. You rascal.

I had rewritten a few other scripts in Python at this point, and I decided to see what Python had for SQLite. It turns out that Python has a built-in module dedicated to SQLite (sqlite3). And on top of that, it's dead simple to use.

So I decided to rewrite one script in Python using the built-in SQLite module. And another script. And another. As I kept going, I would find more and more things that made me want to go all in on Python scripting. I found the itertools and functools modules, which helped me write pure functional code.
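For example, `functools.reduce` and `itertools.groupby` can cover a lot of the aggregation you'd otherwise write as nested loops. The data shape here is illustrative, not my actual schema:

from functools import reduce
from itertools import groupby

votes = [
    {"bill": "hr-1", "position": "yea"},
    {"bill": "hr-1", "position": "nay"},
    {"bill": "s-42", "position": "yea"},
]

# group votes by bill, then reduce each group to a yea count, with no mutable
# counters sitting in module scope
by_bill = groupby(sorted(votes, key=lambda v: v["bill"]), key=lambda v: v["bill"])

yea_counts = {
    bill: reduce(lambda acc, v: acc + (v["position"] == "yea"), group, 0)
    for bill, group in by_bill
}

print(yea_counts)  # {'hr-1': 1, 's-42': 1}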

I also discovered that Python doesn't impose a default heap size limit on scripts the way Node's V8 engine does. If you've used Node before, you may have come across having to set `--max-old-space-size` and fussing with the memory. I certainly did while working with the older scripts.

And the cherry on top was that Python is just really easy to write. So is JavaScript, but Python is, like, ridiculously easy. And it's very readable.


So that was my recent journey to functional programming, SQLite, and Python. As lame as it sounds, it's felt like a personal renaissance with how I write scripts and work with data. I've found myself not only writing more efficient and maintainable code but also approaching problem-solving with a fresh perspective that I'm excited to apply to future projects.
