On the Importance of Good Design
MGA 1500

On the Importance of Good Design

The image above is a late 50’s MGA convertible. I picked this because I happen to think that this car is one of the most elegantly designed cars ever made. Certainly in the top 50. While we as people place a lot of emphasis on design when it comes to physical objects that we use, when it comes to software, a lot of our software’s design looks more like the car below: This vehicle looks like it was designed by a committee that couldn’t decide whether they were designing a door stop or a golf cart.

No alt text provided for this image

I’ve been doing a lot of thinking lately (for reasons that will become apparent in a few weeks) about the lack of good software engineering practices in the data science space. The more I think about it, the more I am shocked with the questionable design that exists in the data science ecosystem, particularly in Python. What’s even more impressive is that the python language is actually designed to promote good design, and the people building these modules are all great developers.

As an example of this, I recently read an article about coding without if statements and it made me cringe. Here is a quote:

When I teach beginners to program and present them with code challenges, one of my favorite follow-up challenges is: Now solve the same problem without using if-statements (or ternary operators, or switch statements).
You might ask why would that be helpful? Well, I think this challenge forces your brain to think differently and in some cases, the different solution might be better.
There is nothing wrong with using if-statements, but avoiding them can sometimes make the code a bit more readable to humans. This is definitely not a general rule as sometimes avoiding if-statements will make the code a lot less readable. You be the judge.
https://medium.com/edge-coders/coding-tip-try-to-code-without-if-statements-d06799eed231

Well, I am going to be the judge and every example the author cited made the code more difficult to follow. The way I see it, hard to follow code means that you as the developer are more likely to introduce unintended bugs and errors. Since the code is difficult to follow, not only will you have more bugs, but these bugs will take you more time to find. Even worse, if you didn’t write this code and are asked to debug it!! AAAAHHHH!! I know if that were me, I wouldn’t even bother. I’d probably just rewrite it. The only exception would be if it was heavily commented.

I’ll also add that when you have code like this, it is more difficult to write unit tests for it that take into account all the various things that could go wrong. Consider the following code:

if x > 100:
   # do something
elif: x >= 50:
   # do something else
else:
   # All else failed...do this

This code is super easy to test and prove that it it works in all cases. Clearly you have to write a unit test for each if clause, and then for some edge cases, such as x is undefined or x is not a number. Even a non-coder would not have much difficulty figuring out what is going on here.

Python Encourages Good Design

The developers of Python recognized that good design is a vital component of good coding and structured the language to encourage good practices. As an example, I’m going to quote PEP20: The Zen of Python.

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

These are very good principles to think about when writing Python code. Unfortunately despite the founders best intentions, developers found many ways to create difficult and even horrendous code with Python.

Let’s take a look at Pandas. Let me start by saying that Pandas is a wonderful module. It’s simply brilliant and a huge game changer in how you code. With that said, Pandas regularly violates the principle of “There should be one– and preferably only one– obvious way to do it.”

In pandas, you can access columns in a dataframe either with dotted notation:

df.col1

OR you can use bracket notation:

df['col1']

The problem here is that these are not functionally equivalent! There are several major gotcha’s involved with the dot notation, but the fact that both exist is problematic. Additionally, there are many methods such as unique() and drop_duplicates() which seem functionally equivalent but in fact are not. Unfortunately, usually you don’t discover the difference between these kinds of functions until 2AM when you’ve been coding all day and are trying to debug some code that looks like it should work, but doesn’t.

No alt text provided for this image

Let’s consider another bit of awesomeness: Matplotlib. On the one hand, this library is awesome. On the other hand, the design is incredibly confusing. For me, MatPlotLib was a complete mystery until I read Effectively Using Matplotlibby Chris Moffitt, which explained that there are two interfaces for MatPlotLib: a state based interface and an object oriented interface. WHAT??!?

The image on the left illustrates all the components of a visualization rendered with MatPlotLib. There’s a lot here, but the issue is that there are many ways to modify each one. If there was one, and preferably one obvious way to change these, I suspect that MatPlotLib would make a lot more sense to a lot more people.

I almost forgot, I have another gripe with Pandas and that is the json.normalize function. This is an awesome package that, from my experience, a lot of people don’t know about. The ability to properly read nested JSON data is hugely important, and yet, in order to do that, you have to know about this package and specially import it.

Design is Important, Especially for Software

It seems silly to state this, but good design is really important. From a practical perspective, good design saves money. How so you might ask? Well, let’s start with the obvious, a well designed product will take less time for a user to learn and adopt, thereby increasing productivity. A well designed product will take less training to teach users how to use it and fewer resources to support.

For the developers of open source software, you have a vested interest in good design as well. If you create a well designed product, more people will use it, and fewer people will bother you with “how do I?” questions ??

Learn More at my Upcoming Workshop!

If you’re interested in learning more about how to write well designed code that will get through code reviews and into production easier, then you might want to check out my upcoming workshop on Safari Online: https://learning.oreilly.com/live-training/courses/first-steps-writing-cleaner-code-for-data-science-using-python-and-r/0636920391661/


要查看或添加评论,请登录

Charles Givre的更多文章

  • All Great Things Part 2: The Founder's Dilemma

    All Great Things Part 2: The Founder's Dilemma

    I recently posted an article about the demise of DataDistillr.?It was painful to write and I was worried that by doing…

    4 条评论
  • All Great Things...

    All Great Things...

    Well, this is the post I’d hoped to never write, but alas, we’ve reached the conclusion that it’s time to shut down…

    65 条评论
  • Why You Shouldn't Rely on GPT to Write Code

    Why You Shouldn't Rely on GPT to Write Code

    A lot of people have tried out ChatGPT and other LLMs for code their code writing abilities. My theory was that the…

    20 条评论
  • Tests in a GenAI World

    Tests in a GenAI World

    I teach a graduate level data management class at the University of Maryland, Baltimore County (UMBC). Let me preface…

    5 条评论
  • Five Things I Learned Writing SQL with Gen AI

    Five Things I Learned Writing SQL with Gen AI

    ChatGPT has been all over the news for the last few months and again with the release of GPT-4. At DataDistillr, we…

    7 条评论
  • It's The Assumptions That Get You

    It's The Assumptions That Get You

    I’ve had a number of conversations recently that have highlighted to me how not understanding people’s assumptions can…

    4 条评论
  • ChatGPT, Meet DataDistillr! You’ll have lots to discuss!

    ChatGPT, Meet DataDistillr! You’ll have lots to discuss!

    Happy New Year everyone! I’m pretty excited about this. Like every other tech geek out there, I was experimenting with…

    24 条评论
  • Five Technologies That I Think Are Bullshit

    Five Technologies That I Think Are Bullshit

    This is going to piss people off. I took a road trip a few weeks ago to New York and listened to an interview with Mark…

    49 条评论
  • We Launched! (Beta)

    We Launched! (Beta)

    Well, that day has finally come! After months of testing, speaking with customers and investors, our public beta is…

    13 条评论
  • Joining Difficult Data: How to Join Data on Extracted Domains

    Joining Difficult Data: How to Join Data on Extracted Domains

    2 条评论

社区洞察

其他会员也浏览了