Big Data tools at the Master in Business Analytics and Big Data at IE

During the Master in Business Analytics and Big Data here at IE, we have been introduced to many different big data technologies. For the IE Big Data Club newsletter, I have summarized the tools we have seen so far and how we used them.

R: R is a programming language whose strengths lie in statistics, and it is a popular choice for data analysis. RStudio is the interface we use when programming in R. R is free software, supported by a large community that contributes many ready-made ‘packages’. During our studies we have also come across R in other subjects such as Time Series Forecasting, Recommendation Engines, and Social Network Analysis. One example of how we used R was to create a time series forecast for a stock price.

SQL: Structured Query Language (SQL) is a language for communicating with relational database management systems. We got familiar with creating, updating, and deleting tables in a database that holds important business information such as customer data. We also looked into extracting relevant business information from large databases, for example finding employee-specific information in your company. Overall, we got familiar with the structure of:

SELECT <select list> FROM <table> WHERE <predicates> GROUP BY <expression> HAVING <condition> ORDER BY <…>
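As a minimal sketch of that clause order, the following uses Python's built-in sqlite3 module with an in-memory database. The employees table, its columns, and its rows are made up for illustration and are not taken from the course material.

```python
import sqlite3

# In-memory database; the table and its rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, department TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Ana", "Sales", 42000),
        ("Ben", "Sales", 48000),
        ("Carla", "IT", 55000),
        ("David", "IT", 61000),
        ("Eva", "HR", 39000),
    ],
)

# The clauses appear in the order listed above, with no commas between them.
query = """
SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE salary > 40000
GROUP BY department
HAVING COUNT(*) >= 2
ORDER BY avg_salary DESC
"""
rows = conn.execute(query).fetchall()
for department, avg_salary in rows:
    print(department, avg_salary)
conn.close()
```

WHERE filters individual rows before grouping, while HAVING filters the groups afterwards, which is why the two conditions sit in different clauses.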

Hadoop: Hadoop, or rather Apache Hadoop, is software that allows large amounts of data to be stored and processed across a network of many computers. The big data value chain consists of sources, ingestion, storage, processing, and serving, and Hadoop can be used for several of these parts, notably storage and processing. It is also open source, and its strength lies in batch processing.
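To give a feel for the batch processing model, here is a toy word count in plain Python that mimics the map, shuffle, and reduce phases of MapReduce, Hadoop's classic processing framework. It runs on a single machine and does not use Hadoop itself; the function names and sample documents are assumptions chosen for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values of each key to get the final count per word."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input; in a real cluster each document could sit on a different machine.
documents = ["big data tools", "big data club", "data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

In a real cluster the map and reduce steps would run in parallel on many computers, which is what makes the approach suitable for very large batches.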

Dataiku: Dataiku is a web-based platform that facilitates data analytics and building machine learning models, among other things. We used it for our machine learning classes as well as for data competitions. One project used machine learning models to predict housing prices in a given neighbourhood.

Python: Python is a programming language that can be used for many data science tasks such as data mining, data visualization, and machine learning. Like R, it is open source and supported by a global community. We used it in many different classes such as Recommendation Engines, Machine Learning, and Data Visualization.

NoSQL: Given the complexity of modern data, SQL (even though it is still used by many) is not always enough. NoSQL refers to databases that do not follow the relational model behind SQL; the name is also read as “Not Only SQL”. In broad terms, NoSQL databases can be divided into four types: document-oriented, column-family, key-value store, and graph-oriented.
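The four families can be pictured with plain Python data structures. This is only a sketch of the data shapes, not of any particular database product, and the customer records are invented for illustration.

```python
# Key-value store: an opaque value looked up by a single key.
key_value = {"customer:1": "Ana"}

# Document-oriented: each record is a self-describing, possibly nested document.
document = {"_id": 1, "name": "Ana", "orders": [{"item": "laptop", "qty": 1}]}

# Column-family: row key -> column family -> individual columns.
column_family = {
    "customer:1": {
        "profile": {"name": "Ana", "city": "Madrid"},
        "activity": {"last_login": "2019-01-15"},
    }
}

# Graph-oriented: nodes plus labelled edges between them.
nodes = {1: "Ana", 2: "Ben", 3: "Carla"}
edges = [(1, "FOLLOWS", 2), (2, "FOLLOWS", 3)]

def followed_by(node_id):
    """Return the names of the nodes that the given node follows."""
    return [nodes[dst] for src, rel, dst in edges if src == node_id and rel == "FOLLOWS"]

print(key_value["customer:1"])
print(document["orders"][0]["item"])
print(column_family["customer:1"]["profile"]["city"])
print(followed_by(1))
```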

Spark: Apache Spark, like Hadoop, is a tool for handling large amounts of data. However, unlike Hadoop with its batch processing, Spark is also able to handle streaming data, in other words data that is processed in near real time as it arrives. A use case for Spark could be fraud detection, where the incoming data needs to be processed immediately and cannot wait until the next batch in an hour.
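The difference between the two styles can be sketched in plain Python. This example does not use Spark; the transactions, the function names, and the simple amount threshold are hypothetical stand-ins for a real fraud model.

```python
def detect_fraud_stream(transactions, limit=1000.0):
    """Yield an alert as soon as a suspicious transaction arrives."""
    for tx in transactions:
        if tx["amount"] > limit:
            yield tx["id"]

def detect_fraud_batch(transactions, limit=1000.0):
    """Collect everything first, then scan the whole batch in one go."""
    batch = list(transactions)
    return [tx["id"] for tx in batch if tx["amount"] > limit]

# Hypothetical incoming transactions.
incoming = [
    {"id": "t1", "amount": 25.0},
    {"id": "t2", "amount": 4800.0},
    {"id": "t3", "amount": 60.0},
]

# Streaming style: the first alert is available before t3 has even been read.
stream_alerts = detect_fraud_stream(iter(incoming))
first_alert = next(stream_alerts)

# Batch style: the same alert only appears after all transactions were collected.
batch_alerts = detect_fraud_batch(incoming)
print(first_alert, batch_alerts)
```

Both functions flag the same transaction; what changes is how long the alert has to wait for the rest of the data.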

Power BI: Power BI is a popular data visualization tool, used for example to create dashboards that keep track of project developments. Data can be imported from Excel, among other sources, which makes it easy to use. We were introduced to Power BI in our Data Visualization class, and also used it for data competition projects.

Written by Dennis Pedersen
