Tools of the Trade (an overview)

This post was originally published on The Sampler, our in-house blog at Applied AI. Subscribe to our RSS or email feed for regular posts on machine learning, statistics, insurance and fintech.

The practising data scientist will be familiar with a wide range of software for scientific programming, data acquisition, storage & carpentry, lightweight application development, and visualisation. Above all, agile iteration, proper source control, and good communication are vital.

As we outlined in the previous post, the discipline of data science exists at a confluence of applied mathematics, software engineering, information visualisation, storytelling and domain expertise. A lot of one's time will be spent at a computer and a lot of that time will be spent writing code, so it's critical to use the best tools available: powerful hardware, modern software technologies and established engineering methodologies.

Let's take a quick, high-level tour of some of the technical considerations:

The Core Equipment

Fast, flexible data analysis starts with excellent software.

Over the past ten years, R and Python have become two of the most important core technologies in the data science toolbox[^n]. Both are open-source programming languages with a huge ecosystem of supporting code libraries and packages for data processing and statistical modelling.

  • R has grown organically from the statistics community and is widely used and praised for a rich set of industry-standard algorithms, publication-quality plotting, rapid prototyping and native functional programming. Whilst powerful, it is perhaps best thought of as an interactive environment for statistics with a high-level programming language attached. There is almost a tradition within academia of releasing an implementation of one's novel algorithm as a new R package, which, coupled with R's inherently muddled syntax and culture of poor documentation, can make for a daunting initiation for newcomers and regular frustration for software engineers.
  • Python is a very popular general-purpose high-level programming language with syntax that's considered intuitive and aesthetic, but a runtime that can be slow compared to compiled languages like C++ and Java. The creation in 2005 of NumPy, a library for very fast numeric matrix computation, spurred the use of Python within the computer science and machine learning communities, who might traditionally have used MATLAB or C. In the years since, a wealth of best-in-class open-source libraries has been developed for data manipulation, efficient computation and statistical modelling. Coupled with Python's tradition of excellent documentation, well-maintained open-source code, strong developer communities, consistent syntax and an ethos of 'batteries included', this has made it an increasingly common default choice for data scientists (see the short sketch after this list).
  • Data visualisation in both R and Python can be made accurate and beautiful, but it's worth also noting D3.js: a comprehensive and powerful JavaScript library that makes it quite simple to develop rich, interactive, web- and production-ready visualisations. Tools for web-based data visualisation tend to evolve and specialise extremely rapidly, but D3 has become something of a standard and is frequently used alongside and within other newer libraries.
  • Acquiring, cleaning and storing data will often involve a whole host of additional tools and languages, including Unix command-line tools (awk, sed, cut, grep, curl and so on), Perl, SQL, Pig, JavaScript and many more. There's a good conversation on Quora with more details.
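
To give a flavour of the Python stack in practice, here's a minimal sketch of the load-clean-model loop described above, using pandas, NumPy and scikit-learn. The file name, column names and model choice are hypothetical, invented purely for illustration:

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Hypothetical claims dataset: the file and column names are illustrative only
    df = pd.read_csv("claims.csv", parse_dates=["claim_date"])

    # Light data carpentry: drop incomplete rows and derive a simple feature
    df = df.dropna(subset=["claim_amount", "policy_age_years"])
    df["log_amount"] = np.log(df["claim_amount"].clip(lower=1))

    # Quick first model: does policy age explain anything about claim size?
    model = LinearRegression()
    model.fit(df[["policy_age_years"]], df["log_amount"])
    print(model.coef_, model.intercept_)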


Numerical processing has of course been around for many years

...and there's a whole suite of different legacy environments and languages available including MATLAB, SPSS, Stata and SAS. These closed-source tools commonly have expensive licensing, surprisingly conservative development cycles[^n] and reduced functionality when compared to open source. The high economic barriers to entry limit the size of the user base, leading to fewer contributors, a smaller community and reduced sense of ownership for practitioners.

There are a handful of large companies further undermining the case for closed-source software by packaging, customising and selling enterprise-ready distributions of the above open-source tools, bundled with their own technical support, consulting and library extensions. Two such companies are Revolution Analytics, recently acquired by Microsoft, and Continuum Analytics, who continue to make major contributions to the Python community.


Finally, just to mention MS Excel

we've all been through the pain of trying to use spreadsheets for something too complex. It's initially very tempting to 'use what you know', and businesses also often rely on Excel files as a primary datastore for accounts, marketing, reports and more. To put it simply, spreadsheets are the wrong tool for data analysis & predictive modelling and should be avoided wherever possible. In more detail, spreadsheets are a poor choice because:

  • spreadsheets are stored in binary format and can't be easily used with source control to provide critical audit and versioning
  • calculations are written individually to cells and not visible in bulk, thereby encouraging accidental bugs and making code maintenance and review very difficult
  • calculations are performed upon specific cells, and this lack of variable substitution makes it difficult to test the logic against dummy data (see the short sketch after this list)
  • code and data are stored in the same file, risking the loss of everything in the event of a runtime or user error
  • advanced users will typically end up writing VBA code when calculations become sufficiently complex. VBA is a macro-scripting procedural language very ill-suited to numerical computation and software engineering; it has a whole host of issues and is no longer under active development by Microsoft. Resorting to VBA is a sure sign that your spreadsheet is too complicated and that it's worth investing time to tackle the problem properly.
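
To make the testing point concrete, here is a minimal sketch of the same kind of calculation expressed as a plain Python function. The premium formula is invented purely for illustration; the point is that because the logic lives in one named function rather than being scattered across cells, it can be version controlled, reviewed and exercised against dummy data:

    import math

    def annual_premium(base_rate, sum_insured, loading=0.1):
        """Toy premium calculation: the formula is purely illustrative."""
        return base_rate * sum_insured * (1 + loading)

    def test_annual_premium():
        # Dummy inputs exercise the logic without touching any real figures
        assert math.isclose(annual_premium(0.01, 100000), 1100.0)
        assert math.isclose(annual_premium(0.01, 100000, loading=0.0), 1000.0)

    test_annual_premium()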

"I just don't care about proprietary software. It's not 'evil' or 'immoral', it just doesn't matter. I think that Open Source can do better ... it's just a superior way of working together and generating code." - Linus Torvalds, Interview on GPLv2, 2007

How does Big Data fit into this?

"Big data" is often mentioned alongside data science and while there are certainly technical crossovers and shared goals, it's important to treat the subjects separately.

The ability to store and efficiently manipulate a huge amount of data is incredibly useful, but 'big data analytics' often concerns itself simply with providing counts of events rather than statistical modelling, e.g. counting sales volume of an item by location by date, or counting page requests on a website.
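
That kind of counting is often a one-line aggregation once the data fits in a DataFrame; a minimal pandas sketch, with a hypothetical file and column names, might look like:

    import pandas as pd

    # Hypothetical sales log: one row per transaction, names illustrative only
    sales = pd.read_csv("sales.csv", parse_dates=["date"])

    # 'Big data analytics' in miniature: sales volume by item, location and day
    volume = (sales
              .groupby(["item_id", "location", sales["date"].dt.date])
              .size()
              .rename("n_sales"))
    print(volume.head())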

NoSQL storage and map-reduce data processing have been around for a long time now; there are many ways to do it and many tools available. Hadoop, HBase, Cassandra, MongoDB, Redis, Riak, Redshift, BigQuery, Mahout, Spark and others all have worthwhile use cases depending on the nature and volume of data to be stored and processed, and we won't go into them here.

In a recent talk, Wes McKinney observed that the Python data science ecosystem still doesn't have a great story to tell about 'big data', and that there's very much a need to interface well with high-performance big data systems. We agree, but it's worth remembering that one can gain deep insight and develop highly predictive models with only small-to-medium datasets. Intelligent surveying, balanced subsampling, advanced modelling and even simple human communication can often solve the business issue without requiring us to process terabytes of mean averages.
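
As an illustration of the subsampling point, here is a minimal sketch that draws a class-balanced sample from a synthetic, purely illustrative imbalanced dataset so it can be modelled comfortably on a single machine; the column name and sizes are hypothetical:

    import numpy as np
    import pandas as pd

    # Toy stand-in for a large dataset with a heavily imbalanced binary outcome
    rng = np.random.default_rng(42)
    df = pd.DataFrame({
        "converted": rng.choice([0, 1], size=200000, p=[0.97, 0.03]),
        "feature": rng.normal(size=200000),
    })

    # Draw an equal number of rows from each class so a simple model isn't
    # swamped by the majority class
    n_per_class = int(df["converted"].value_counts().min())
    balanced = df.groupby("converted").sample(n=n_per_class, random_state=1)

    print(balanced["converted"].value_counts())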

"One only needs two tools in life: WD-40 to make things go, and duct tape to make them stop." - G. Weilacher

A lot of data science looks like software engineering

Statistical data processing can take a lot of horsepower, but it will certainly also require a great deal of thought and human-computer interaction.

Fortunately we now live in a world where memory and storage are fast and cheap, processors are multi-core and large high-resolution displays are available. Getting the right tools for the job is essential.

The data analytics function within any company needs to have excellent desktop hardware and capital expenditure here will be rewarded many times over in improved speed and sophistication of computation, breadth of analysis possible, and depth of knowledge gained.


Smaller datasets and simpler algorithms may not pose difficulties when using your well-specced local machine, but when dealing with larger datasets or complex models, it's wise to consider separate server hardware.

As noted above, RAM and processing power are quite cheap these days, so building a powerful in-house server is reasonable. External cloud-based servers are worth considering for their ability to scale on demand, reducing capital outlay for short-term projects. That said, holding certain data outside the corporate firewall often requires a layer of legal arrangements and regulatory compliance that may make it unviable altogether. We'll write about our approach to massive and efficient data anonymisation in a future blog post.


Source control has long been an integral part of software engineering and is naturally of vital importance in data science.

As teams grow and models are increasingly implemented in production systems rather than one-off analyses, proper source control is critical to provide code versioning, code review, auditability, continuous integration testing and more. Even for one-off analyses undertaken by just one person, these standard working methodologies will preserve the code and may help greatly on the next project.

Distributed version control tools like Git and Mercurial are the way to go; they're powerful, widely supported and easy to implement into the development process.


"The key word in 'data science' is not data, it's science" Jeff Leek, SimplyStatistics.org, 2013

Know your Toolset


Good tools for data science provide a framework for discovering new insights and solving problems not previously possible.

To recap the technical considerations:

  • use open source tools with a strong community and solid implementations of basic and cutting-edge analytical techniques
  • maintain a well-organised, version controlled code base with issue tracking and wikis
  • strive for repeatability, code review, testing and audit
  • use powerful local machines and consider scalable hardware where possible
  • iterate quickly and try to simplify the problem before throwing more processing at it.

We'll no doubt elaborate on the above in future posts about the technical aspects of running a data science department and certainly when discussing particular examples of our work. For now, thanks for reading!

 
