You can’t and shouldn’t do Everything in Python

With all the buzz around AI and machine learning, lots of people have learned Python. Which is great, because Python is an excellent programming language for that: you have libraries like Pandas for statistical calculations, and loads of powerful NLP libraries.

However, since what people really want is to process data at scale and serve the results of that processing – there’s a lot to build around the Python programs themselves.

And here’s where it gets problematic: when inexperienced yet overconfident Python programmers decide that they can just do Everything in Python, and convince their non-technical managers that it’s possible. When it really, really isn’t.

Let’s start with the fact that Python is simply not good at data typing, data conversions, applying and maintaining a schema, or even copying a vector.

As an example, here's something that sounds easy enough but isn't:

Suppose you have a list of integers to process, and you want to put them in a vector. But you’re reading them from a text file. So first you have to convert the strings to integers, but to which kind? Python’s default integer? PyArrow’s int64? NumPy’s int32?

Each of these options can produce different results, depending on the library version and the system architecture your code is running on. For example: Python’s own int doesn’t overflow (it’s arbitrary-precision in Python 3, and even Python 2’s bounded int auto-promoted to long), but NumPy’s default integer type maps to the C long, which is 32 bits on Windows and 64 bits on most Linux systems. So if you were counting on 64-bit integers – you’ll get a silent overflow for large enough values on Windows.
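To make this concrete, here’s a minimal sketch (the input values are hypothetical) of how the same numbers stay exact as Python ints and as NumPy int64, but silently wrap around once they’re cast to int32:

```python
import numpy as np

# Values read from a text file (hypothetical input).
values = ["3000000000", "4000000000"]

# Python's own int is arbitrary-precision: always exact.
as_python = [int(v) for v in values]

# NumPy's fixed-width integers are not. int64 holds these values,
# but casting to int32 silently wraps around (no exception is raised).
as_int64 = np.array(as_python, dtype=np.int64)
as_int32 = as_int64.astype(np.int32)

print(as_python)  # [3000000000, 4000000000]
print(as_int64)   # [3000000000 4000000000]
print(as_int32)   # [-1294967296  -294967296]  <- silent overflow
```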

And if this didn’t deter you because you work with Python 3.x on Linux – in the Pandas documentation you’ll find that indexing can return either a view or a copy of your data, so copying or mutating values doesn’t always behave as expected and can silently fail to change anything at all. If you’ve seen SettingWithCopyWarning messages while building your code – that’s exactly what the warning refers to.
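Here’s a minimal sketch of the pattern that triggers the warning – chained indexing, where it’s ambiguous whether the assignment hits a view or a throwaway copy (the dataframe is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Chained indexing: df[df["a"] > 1] may be a view or a copy, so this
# assignment may silently do nothing - pandas emits
# SettingWithCopyWarning for exactly this pattern.
df[df["a"] > 1]["b"] = 0

# The unambiguous spelling goes through a single .loc call:
df.loc[df["a"] > 1, "b"] = 0
```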

So if your plan is to load data as text into a Pandas dataframe and then convert columns to numbers – just don’t. Even if you manage to convert the data, every value that fails to parse becomes a “Not a Number” (NaN), the column is silently promoted to floats to hold it, and those NaNs then propagate through any mathematical operation you apply to the vectors...
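For illustration, a small sketch (with a hypothetical column) of how one unparseable value becomes a NaN that then rides along through the arithmetic:

```python
import pandas as pd

# One malformed value in an otherwise numeric text column.
df = pd.DataFrame({"x": ["1", "2", "oops", "4"]})

# errors="coerce" turns anything unparseable into NaN instead of
# raising, and the column silently becomes float64 to hold the NaN.
df["x"] = pd.to_numeric(df["x"], errors="coerce")

print(df["x"])      # 1.0, 2.0, NaN, 4.0 - note the dtype change
print(df["x"] + 1)  # the NaN propagates through elementwise arithmetic
```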

There’s a reason why Spark provides compile-time typed Datasets only in Java and Scala, but not in Python. Ingestion of data, typing and applying a schema should all be done in one of those strongly-typed languages, which run on a JVM and hence come far closer to “write once, run anywhere”: the same code producing the same results on every platform. Python does not guarantee this, and don’t let your software developers find this out by trying and failing to deliver even the ETL part of the system. Once they have clean data in the correct schema as output from the Java/Scala part of the pipeline – then they can pass it through the Python code.
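To be clear about what Python does give you here: PySpark lets you declare a schema, but it’s enforced only at runtime – there’s no compile-time-checked Dataset[T] as in Scala or Java. A sketch, assuming a local Spark installation and a hypothetical events.csv:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# The schema is declared up front, but violations only surface when
# the job actually runs - as nulls or runtime errors, never as
# compile-time errors the way a Scala Dataset[T] would catch them.
schema = StructType([
    StructField("id", LongType(), nullable=False),
    StructField("name", StringType(), nullable=True),
])

df = spark.read.schema(schema).csv("events.csv")  # hypothetical path
df.printSchema()

spark.stop()
```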

And last but not least – when it comes to the web services around your data:

Do you want things like input validation, JSON support, schema-evolution support, connections to pretty much any storage type, CSRF protection, CORS, any type of authentication method, roles and permissions, seamless back-end/front-end integration – all out of the box? Do you want your developers to spin up a full-blown web application in minutes and then just have to configure it and add your business logic? Then you want Java’s Spring Boot. There’s no such thing in Python – or in any other programming language that I know of.

Bottom line – my strong advice is to always use best practices, even if there’s a learning curve for your developers. In the long run it will really save you a lot of time and money.

