Are Data Structures and Algorithms obsolete today?

Data structures and algorithms are considered the core of computer engineering and software development. Every programmer is expected to have a firm grasp of basic data structures for organizing data and of algorithms for building logic and functions. Despite their fundamental role in modern technology, many believe that programmers no longer need to understand these topics. A major reason is that a typical data scientist spends most of their time in high-level languages such as Python, R, and SQL and rarely needs to think about underlying implementations [1]. This is because the majority of data analysis and machine learning algorithms come packaged in ready-to-use, heavily optimized libraries such as scikit-learn, NumPy, and pandas.

But the truth is that machine learning is heavily grounded in mathematics, and solving those mathematical problems optimally requires a good understanding of data structures and algorithms. They reveal how a problem is represented internally, how the actual storage pattern works, and what is happening under the hood [2]. While many prefer to rely on powerful libraries, those libraries are themselves built on basic algorithms, and knowledge of these algorithms becomes essential when looking at the big picture.

Even though the theoretical study of algorithms may seem to be fading behind advanced libraries and modern development tools, awareness of it is necessary for optimized development. One such crucial concept is algorithmic complexity, a way of understanding how well a piece of code scales with the data. By understanding an algorithm's complexity, one can reason about its performance regardless of the language or the architecture on which it runs. This matters to data scientists because of the ever-increasing amount of information they must process. Consider a machine learning model for a classification task. An obvious choice would be a Support Vector Machine (SVM), given its ability to handle non-linear decision boundaries [1]. But since training an SVM typically costs between O(n²) and O(n³), it is worth reconsidering SVMs for large data sets, as the execution time grows polynomially rather than linearly with the data size. Hence, it is vital to understand algorithmic complexities and their asymptotic behavior on large data rather than blindly using predefined models.
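The impact of complexity on scaling can be seen even in something as simple as search. As a minimal sketch (the function names are illustrative, not from any library beyond the standard `bisect` module), compare an O(n) linear scan with an O(log n) binary search: doubling the data doubles the work for the former but adds only one extra step for the latter.

```python
from bisect import bisect_left

def linear_search(items, target):
    # O(n): scan every element until a match is found
    for i, value in enumerate(items):
        if value == target:
            return i
    return -1

def binary_search(sorted_items, target):
    # O(log n): repeatedly halve the search interval (data must be sorted)
    i = bisect_left(sorted_items, target)
    if i < len(sorted_items) and sorted_items[i] == target:
        return i
    return -1

data = list(range(1_000_000))
# Both find the element, but binary search touches ~20 items, not ~765,000
assert linear_search(data, 765_432) == 765_432
assert binary_search(data, 765_432) == 765_432
```

The same reasoning, applied to a model's training cost instead of a search, is what tells you whether an SVM is a realistic choice for your data size.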

Algorithm theory defines generic models, or frameworks, that describe the design of whole classes of algorithms. These are called algorithmic paradigms, and they sit at a level of abstraction above the notion of an individual algorithm [3]. Modern algorithms are built on these paradigms, and understanding the characteristics and workings of each is important for creating complex, efficient algorithms and for tackling a wide variety of problems. Some of the most widely used paradigms are dynamic programming, greedy algorithms, divide and conquer, and randomized and sublinear algorithms. The advantage of learning these paradigms is that they describe the design and working of an algorithm independently of any programming language, which makes them easy to apply in any kind of application.
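As a small illustration of how a paradigm changes an algorithm's behavior, consider the classic Fibonacci example: the naive recursive version is plain divide and conquer and takes exponential time, while the dynamic programming version caches overlapping subproblems and runs in linear time. This is a textbook sketch, not tied to any particular library beyond `functools`:

```python
from functools import lru_cache

def fib_naive(n):
    # Divide and conquer with no reuse: recomputes the same
    # subproblems again and again, O(2^n) calls
    return n if n < 2 else fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_dp(n):
    # Dynamic programming: each subproblem is solved once and
    # cached, so the total work is O(n)
    return n if n < 2 else fib_dp(n - 1) + fib_dp(n - 2)

assert fib_naive(10) == 55          # fine for small n
assert fib_dp(50) == 12_586_269_025  # instant; the naive version would take hours
```

The recurrence is identical in both functions; only the paradigm applied to it differs, which is exactly why paradigms are worth studying independently of any one problem.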

Algorithms are used heavily throughout machine learning and data science. Graph algorithms are among the most common, with applications ranging from recommendation systems to fraud detection. The most significant graph algorithms include generic graph search, which allows general exploration of graphs; shortest path algorithms, which compute the shortest path between any two vertices; and minimum spanning trees, which are useful for clustering and image segmentation [1]. Genetic algorithms are widely used in machine learning to generate high-quality solutions to optimization and search problems. Reinforcement learning, in turn, draws on the concept of dynamic programming to explore the space of possible actions and choose the most promising one at each step of the computation [2]. Beyond these, machine learning rests on algorithms such as linear regression, decision trees, k-means, and Naive Bayes, spanning supervised, unsupervised, and reinforcement learning. A proper understanding of the inner workings and limitations of these algorithms helps one choose the right and most efficient algorithm for the task.
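To make the shortest-path case concrete, here is a minimal sketch of Dijkstra's algorithm, one standard way to compute shortest paths in a weighted graph (the adjacency-list format and names here are illustrative assumptions, not from the cited sources). Note that it also leans on a heap, one of the data structures discussed below:

```python
import heapq

def dijkstra(graph, source):
    # graph: {node: [(neighbor, edge_weight), ...]}
    # Returns the shortest distance from source to every reachable node.
    dist = {source: 0}
    heap = [(0, source)]  # min-heap of (distance, node)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale entry; a shorter path was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

g = {"A": [("B", 1), ("C", 4)], "B": [("C", 2)], "C": []}
# The direct edge A->C costs 4, but going via B costs 1 + 2 = 3
assert dijkstra(g, "A")["C"] == 3
```

Even when a library computes this for you, knowing that the algorithm runs in roughly O((V + E) log V) with a heap tells you whether it will scale to your graph.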

Most machine learning applications today are written in Python, as it is easy, flexible, and has versatile built-in data structures. Because of this convenience, many programmers reach for Python's list as the default data structure for most of their applications. This can lead to significant performance bottlenecks [1], as many other data structures exist, each with its own strengths, weaknesses, and suitable tasks. For example, hash tables are useful for lookup-heavy applications, and learning about them clarifies how sets and dictionaries are implemented behind the scenes. Binary search trees are highly efficient for search operations. Heaps support fast access to maximum and minimum values. A Bloom filter [4] is a lesser-known, space-efficient probabilistic data structure for testing whether an element is a member of a set; it may return false positives but never false negatives. While arrays, stacks, and lists are commonly used, there are many more containers with unique functionality, and choosing the appropriate data structure can speed up certain applications by an order of magnitude.
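The list-versus-hash-table point is easy to demonstrate. In this sketch (sizes and repetition counts are arbitrary choices for illustration), membership testing on a list is an O(n) scan, while on a hash-based set it is O(1) on average:

```python
import timeit

n = 100_000
as_list = list(range(n))
as_set = set(as_list)
target = n - 1  # worst case for the list: the last element

# 'in' on a list scans elements one by one: O(n)
t_list = timeit.timeit(lambda: target in as_list, number=100)
# 'in' on a set hashes the key and jumps to its bucket: O(1) average
t_set = timeit.timeit(lambda: target in as_set, number=100)

print(f"list membership: {t_list:.4f}s, set membership: {t_set:.6f}s")
```

On typical hardware the set lookup is several orders of magnitude faster here, and the gap widens as n grows, which is exactly the kind of bottleneck that swapping a list for the right container removes.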

To summarize, even though powerful libraries and ready-to-use machine learning models may seem to eliminate the need to know the underlying implementation of an application, it remains essential to learn the fundamentals of algorithms and data structures. Machine learning demands high accuracy and optimization, which can only be achieved with appropriate data structures and adequate algorithmic knowledge. Modern, highly efficient algorithms obey the rules of algorithmic complexity and fall under one of the paradigms described above. Data structures and algorithms are the most fundamental concepts of software development, and they will retain their importance in programming regardless of technological advancement.



References

[1] Denis Lapchev. “Why Data Scientists Should Learn Algorithms and Data Structures?” Medium, Oct 6, 2020, https://medium.com/swlh/why-data-scientists-should-learn-algorithms-and-data-structures-4d93237a1026.

[2] “Need of Data Structures and Algorithms for Deep Learning and Machine Learning.” GeeksforGeeks, 14 Oct 2020, https://www.geeksforgeeks.org/need-of-data-structures-and-algorithms-for-deep-learning-and-machine-learning/.

[3] “Algorithmic Paradigms.” Wikipedia, https://en.wikipedia.org/wiki/Algorithmic_paradigm#cite_note-1.

[4] “Bloom Filter.” Wikipedia, https://en.wikipedia.org/wiki/Bloom_filter.
