
Data Analysis with R Programming

Mothilal Jami

Student Peer Mentor at KL University | | Samsung Prism Research Intern || Competitive Coder || Student Developer || Data Science Aspirant

Published: May 4, 2022

Why learn R?

There are many reasons to learn R. The major ones are listed below.

1. Why is R important for Data Science?

R plays an important role in Data Science. Its main benefits are:

You can run your code without a compiler: R is an interpreted language, so code runs without a separate compile step. This shortens the cycle of writing and testing code.
Many calculations are done with vectors: R is a vector language, so a function or operator can be applied to a whole vector without writing a loop. Vectorized code is more concise and usually faster than an equivalent explicit loop in R.
Statistical language: R is used in biology and genetics as well as in statistics. It is a Turing-complete language, so any computational task can be expressed in it.
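
The point about vectors can be sketched as follows. The vector prices and the 10% markup are made-up values used only for illustration; the loop and the vectorized form compute the same result.

```r
# Hypothetical values, chosen only for illustration
prices <- c(10, 20, 30)

# Loop version: update one element at a time
with_loop <- numeric(length(prices))
for (i in seq_along(prices)) {
  with_loop[i] <- prices[i] * 1.1
}

# Vectorized version: the multiplication applies to every element at once
vectorized <- prices * 1.1

# Both approaches give the same values
print(all.equal(with_loop, vectorized))
```

The vectorized line replaces the whole loop, which is why idiomatic R code tends to contain few explicit loops.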

2. Why is R Good for Business?

R helps not only in technical fields but also in business.

The major reason is that R is open-source, so it can be modified and redistributed to suit the user's needs. It is also strong at visualization and offers capabilities that many other tools lack.
For data-driven businesses, the shortage of Data Scientists is a major concern. Companies are using R as their core platform and are recruiting trained R programmers.

3. R is a Gateway to a Lucrative Career

R is used extensively in Data Science, a field that offers some of the highest-paying jobs today. Data Scientists who are proficient in R make more than $117,000 (Rs 80,56,093) per year on average. If you want to enter the field of Data Science and earn a lucrative salary, R is well worth learning.

Wondering why R is important for Data Science? See the article "Reasons to Choose R for Data Science".

4. Open-source

R is an open-source language. It is maintained by a community of active users and is available for free. You can modify functions in R and build your own packages. Since R is released under the GNU General Public License (GPL), you are free to use, modify, and redistribute it under the terms of that license.

5. Popularity

R has become one of the most popular programming languages in industry. Traditionally, R was used mostly in academia, but with the emergence of Data Science the need for R in industry became evident. R is used at Facebook for social network analysis. It is used at Twitter for semantic analysis as well as visualization.

6. Robust Visualization Library

R offers libraries like ggplot2 and plotly that produce polished graphical plots. R is widely recognized for its visualizations, which gives it an edge over other Data Science programming languages.

7. With R, you can develop amazing Web-Apps

R lets you build web applications. Using the R Shiny package, you can develop interactive dashboards straight from the console of your R IDE. You can embed your visualizations in these dashboards and strengthen the storytelling of your data analysis.


8. R enjoys a vast Community Support

R is supported by a vast community that maintains and updates it. If you face any trouble with your R code, you can get help from the community on places like Stack Overflow. There are several communities around the world that organize bootcamps and R meetups.

9. A go-to language for Statistics and Data Science

R is a standard language for Statistics and Data Science. It was developed for statistics, by statisticians, and was in use even before the term "Data Science" was coined. Many statisticians and Data Scientists are more familiar with R than with any other programming language. R supports a wide range of statistical operations through its thousands of packages.

It's the right time to learn about Statistical Programming in R.

10. R is being used in almost every industry

R is one of the most widely used programming languages in the world today. It is used in almost every industry, from finance and banking to medicine and manufacturing. In finance and banking, R is used for portfolio management and risk analytics. In bioinformatics, it is used for drug discovery and genomic analysis. R is also used to implement statistical measures that optimize industrial processes.

ESSENTIALS OF R PROGRAMMING

Understand and practice this section thoroughly. This is the building block of your R programming knowledge. If you get this right, you will face less trouble when debugging.

R has five basic or 'atomic' classes of objects. Wait, what is an object?

Everything you see or create in R is an object. A vector, a matrix, a data frame, even a variable is an object. R treats it that way. So, R has 5 basic classes of objects. These are:

Character
Numeric (Real Numbers)
Integer (Whole Numbers)
Complex
Logical (True / False)

Since the names of these classes are self-explanatory, I won't elaborate on them. Objects of these classes have attributes. Think of attributes as their 'identifier', a name or number which aptly identifies them. An object can have the following attributes:

names, dimension names
dimensions
class
length

Attributes of an object can be accessed using the attributes() function. More on this in the following section.

Let's understand the concept of objects and attributes practically. The most basic object in R is known as a vector. You can create an empty vector using vector(). Remember, a vector contains objects of the same class.

For example, let's create vectors of different classes. We can also create a vector using the c() (concatenate) command.

> a <- c(1.8, 4.5) #numeric

> b <- c(1 + 2i, 3 - 6i) #complex

> d <- c(23L, 44L) #integer (the L suffix is required; c(23, 44) would be numeric)

> e <- vector("logical", length = 5) #logical, five FALSE values

Similarly, you can create vector of various classes.
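
As a quick check of the classes described above, the sketch below builds one vector per basic class and asks class() for each. The character vector s is an extra name added here for illustration; it does not appear in the examples above.

```r
a <- c(1.8, 4.5)                    # numeric (double)
b <- c(1 + 2i, 3 - 6i)              # complex
d <- c(23L, 44L)                    # integer: the L suffix marks integer literals
e <- vector("logical", length = 5)  # logical vector of five FALSE values
s <- c("x", "y")                    # character (hypothetical example vector)

# class() reports the class of each vector
classes <- sapply(list(a = a, b = b, d = d, e = e, s = s), class)
print(classes)

# Without the L suffix, whole numbers are stored as numeric
print(class(c(23, 44)))
```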

Data Types in R

R has various 'data types', which include vectors (numeric, integer etc.), matrices, data frames and lists. Let's understand them one by one.

Vector: As mentioned above, a vector contains objects of the same class. You can still pass objects of different classes to c(), but when objects of different classes are mixed in a vector, coercion occurs: R converts all the elements to a single class that can represent every one of them. For example:

> qt <- c("Time", 24, "October", TRUE, 3.33) #character

> ab <- c(TRUE, 24) #numeric

> cd <- c(2.5, "May") #character

To check the class of any object, pass the object's name to the class() function.

> class(qt)

[1] "character"

To convert the class of a vector, use the as.*() family of functions (as.numeric(), as.character() and so on). These functions return a converted copy, so assign the result back to change the object.

> bar <- 0:5

> class(bar)

[1] "integer"

> bar <- as.numeric(bar)

> class(bar)

[1] "numeric"

> bar <- as.character(bar)

> class(bar)

[1] "character"

Similarly, you can change the class of any vector. But you should pay attention here. If you convert a "character" vector to "numeric", every element that is not a valid number becomes NA, and R issues a warning. Hence, you should use these commands with care.
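
The NA behaviour described above can be seen in a small sketch. The vector mixed is a made-up input; suppressWarnings() is used only to keep the output quiet and is not required.

```r
mixed <- c("12", "3.5", "May")  # hypothetical input; "May" is not a number

# as.numeric() returns NA for strings it cannot parse and emits a warning
converted <- suppressWarnings(as.numeric(mixed))
print(converted)

# is.na() shows which elements failed to convert
print(is.na(converted))

# The original vector is unchanged because the result went to a new name
print(class(mixed))
```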

List: A list is a special type of vector which can contain elements of different data types. For example:

> my_list <- list(22, "ab", TRUE, 1 + 2i)

> my_list

[[1]]

[1] 22

[[2]]

[1] "ab"

[[3]]

[1] TRUE

[[4]]

[1] 1+2i

As you can see, the output of a list is different from that of a vector. This is because the elements are of different types. The double bracket [[1]] shows the index of the first element, and so on. Hence, you can easily extract an element of a list by its index. Like this:

> my_list[[3]]

[1] TRUE

You can use the single bracket [] too. But that returns a list of length one containing the element, instead of the element itself. Like this:

> my_list[3]

[[1]]

[1] TRUE
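
The difference between the two bracket forms is easiest to see by checking the class of what each one returns. A minimal sketch using the same my_list (the names element and sublist are chosen here for illustration):

```r
my_list <- list(22, "ab", TRUE, 1 + 2i)

element <- my_list[[3]]  # double brackets return the element itself
sublist <- my_list[3]    # single brackets return a list of length one

print(class(element))
print(class(sublist))
print(length(sublist))
```

In practice this means [[ ]] is the form to use when the value is needed for a calculation, while [ ] is useful for taking a subset of a list.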

Matrices: When a vector is given a dimension attribute (a number of rows and columns), it becomes a matrix. A matrix is a 2-dimensional data structure made up of rows and columns, and all of its elements have the same class. Let's create a matrix of 3 rows and 2 columns:

> my_matrix <- matrix(1:6, nrow=3, ncol=2)

> my_matrix

     [,1] [,2]

[1,]    1    4

[2,]    2    5

[3,]    3    6

> dim(my_matrix)

[1] 3 2

> attributes(my_matrix)

$dim

[1] 3 2

As you can see, the dimensions of a matrix can be obtained using either the dim() or the attributes() command. To extract a particular element from a matrix, simply use the index shown above. For example (try this at your end):

> my_matrix[,2] #extracts second column

> my_matrix[,1] #extracts first column

> my_matrix[2,] #extracts second row


> my_matrix[1,] #extracts first row
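
The same indexing also reaches a single cell. The sketch below reuses my_matrix; the drop = FALSE argument is an addition not mentioned above, shown because extracting one column otherwise returns a plain vector rather than a matrix.

```r
my_matrix <- matrix(1:6, nrow = 3, ncol = 2)

single <- my_matrix[2, 1]             # row 2, column 1
second_col <- my_matrix[, 2]          # a whole column comes back as a plain vector
kept <- my_matrix[, 2, drop = FALSE]  # drop = FALSE keeps the matrix shape

print(single)
print(second_col)
print(dim(kept))
```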

As an interesting fact, you can also create a matrix from a vector. All you need to do is assign its dimensions afterwards using dim(). Like this:

> age <- c(23, 44, 15, 12, 31, 16)

> age

[1] 23 44 15 12 31 16

> dim(age) <- c(2,3)

> age

     [,1] [,2] [,3]

[1,]   23   15   31

[2,]   44   12   16

> class(age) #R 4.0.0 and later; earlier versions print only "matrix"

[1] "matrix" "array"

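You can also join two vectors using the cbind() and rbind() functions. Make sure that both vectors have the same number of elements. If they do not, R does not fill the gap with NA values; it recycles the shorter vector and issues a warning.

> x <- c(1, 2, 3, 4, 5, 6)

> y <- c(20, 30, 40, 50, 60)

> cbind(x, y)

     x  y

[1,] 1 20

[2,] 2 30

[3,] 3 40

[4,] 4 50

[5,] 5 60

[6,] 6 20

Warning message:

In cbind(x, y) :

  number of rows of result is not a multiple of vector length (arg 2)

> class(cbind(x, y)) #repeats the warning above

[1] "matrix" "array"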
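
For comparison, here is a minimal sketch of cbind() and rbind() on two vectors of equal length, where no recycling is needed. The values are made up for illustration.

```r
x <- c(1, 2, 3)
y <- c(20, 30, 40)

by_col <- cbind(x, y)  # x and y become columns: 3 rows, 2 columns
by_row <- rbind(x, y)  # x and y become rows: 2 rows, 3 columns

print(dim(by_col))
print(dim(by_row))
print(by_row)
```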

Data Frame: This is the most commonly used member of the data types family. It is used to store tabular data. It is different from a matrix: in a matrix, every element must have the same class, but a data frame is a list of equal-length vectors, and each vector (column) can have a different class. Functions that read tabular data into R, such as read.csv(), return a data frame. Hence, it is important to understand the commonly used commands on a data frame:

> df <- data.frame(name = c("ash","jane","paul","mark"), score = c(67,56,87,91), stringsAsFactors = TRUE) #since R 4.0.0, strings stay character unless stringsAsFactors = TRUE

> df

  name score

1  ash    67

2 jane    56

3 paul    87

4 mark    91

> dim(df)

[1] 4 2

> str(df)

'data.frame': 4 obs. of 2 variables:

$ name : Factor w/ 4 levels "ash","jane","mark",..: 1 2 4 3

$ score: num 67 56 87 91

> nrow(df)

[1] 4

> ncol(df)

[1] 2

Let's understand the code above. df is the name of the data frame. dim() returns the dimensions of the data frame as 4 rows and 2 columns. str() returns the structure of the data frame, i.e. the list of variables stored in it. nrow() and ncol() return the number of rows and the number of columns respectively.

Here you see "name" is a factor variable and "score" is numeric. In data science, a variable can be categorized into two types: Continuous and Categorical.

Continuous variables are those which can take any numeric value within a range, such as 1, 2, 3.5, 4.66 etc. Categorical variables are those which take only a limited set of distinct values (categories), such as names, grades, or numeric codes like 2, 5, 11, 15. In R, categorical values are represented by factors. In df, name is a factor variable having 4 unique levels. Factor or categorical variables are treated specially in a data set.
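
A small sketch of how the factor column can be inspected. It rebuilds the same df with stringsAsFactors = TRUE set explicitly, since from R 4.0.0 onwards data.frame() keeps strings as character by default.

```r
# stringsAsFactors = TRUE is set explicitly; since R 4.0.0 the default is FALSE
df <- data.frame(name = c("ash", "jane", "paul", "mark"),
                 score = c(67, 56, 87, 91),
                 stringsAsFactors = TRUE)

print(is.factor(df$name))
print(levels(df$name))      # levels are sorted alphabetically
print(as.integer(df$name))  # integer codes that point into the levels
print(is.numeric(df$score))
```

The integer codes match the "1 2 4 3" shown by str(): "paul" is the fourth level alphabetically and "mark" the third.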

Let's now understand the concept of missing values in R. This is one of the most painful yet crucial parts of predictive modeling. You must be aware of the techniques to deal with them.

Missing values in R are represented by NA ("not available"). NaN ("not a number") marks an undefined numeric result such as 0/0, and is.na() returns TRUE for both. Now we'll check if a data set has missing values (using the same data frame df).

> df[1:2,2] <- NA #injecting NA at 1st, 2nd row and 2nd column of df

> df

name score

1 ash NA

2 jane NA

3 paul 87

4 mark 91

> is.na(df) #checks the entire data set for NAs and returns logical output

name score

[1,] FALSE TRUE

[2,] FALSE TRUE

[3,] FALSE FALSE

[4,] FALSE FALSE

> table(is.na(df)) #returns a table of logical output

FALSE  TRUE

    6     2

> df[!complete.cases(df),] #returns the list of rows having missing values

  name score

1  ash    NA

2 jane    NA

Missing values hinder normal calculations in a data set. For example, let’s say, we want to compute the mean of score. Since there are two missing values, it can’t be done directly. Let’s see:

> mean(df$score)

[1] NA

> mean(df$score, na.rm = TRUE)

[1] 89

The na.rm = TRUE parameter tells R to ignore the NAs and compute the mean of the remaining values in the selected column (score). To remove rows with NA values from a data frame, you can use na.omit():

> new_df <- na.omit(df)

> new_df

name score

3 paul 87

4 mark 91
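
The checks above can be combined into a per-column count of missing values. The sketch below also fills the gaps with the column mean; mean imputation is one simple alternative to dropping rows, added here for illustration and not taken from the text above.

```r
df <- data.frame(name = c("ash", "jane", "paul", "mark"),
                 score = c(NA, NA, 87, 91))

# Count the missing values in each column
na_per_column <- colSums(is.na(df))
print(na_per_column)

# Illustrative alternative to na.omit(): replace NAs with the column mean
score_mean <- mean(df$score, na.rm = TRUE)
imputed <- df
imputed$score[is.na(imputed$score)] <- score_mean
print(imputed)
```

Unlike na.omit(), this keeps all four rows, at the cost of replacing unknown scores with an estimate.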

Exploratory Data Analysis in R

From this section onwards, we'll dive deep into the various stages of predictive modeling. Hence, make sure you understand every aspect of this section. In case you find anything difficult to understand, ask me in the comments section.

Data exploration is a crucial stage of predictive modeling. You can't build great and practical models unless you learn to explore the data from beginning to end. This stage forms a concrete foundation for data manipulation (the very next stage). Let's understand it in R.

In this tutorial, I've taken the data set from Big Mart Sales Prediction. Before we start, you must get familiar with these terms:

Response Variable (a.k.a. Dependent Variable): In a data set, the response variable (y) is the one on which we make predictions. In this case, we'll predict 'Item_Outlet_Sales'.

Predictor Variable (a.k.a. Independent Variable): In a data set, predictor variables (Xi) are those used to make the prediction on the response variable.
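
To make the two terms concrete, here is a minimal sketch that separates a response from its predictors. The data frame sales is a toy stand-in with made-up numbers, and the predictor column names Item_MRP and Item_Weight are assumptions; only Item_Outlet_Sales is named in the text above.

```r
# Toy rows with made-up numbers; not the real Big Mart data
sales <- data.frame(
  Item_MRP = c(50, 120, 200, 80),              # assumed predictor column
  Item_Weight = c(9.3, 5.9, 17.5, 8.9),        # assumed predictor column
  Item_Outlet_Sales = c(700, 1900, 3100, 1200) # response named in the text
)

response <- "Item_Outlet_Sales"
predictors <- setdiff(names(sales), response)

y <- sales[[response]]                  # response variable as a vector
X <- sales[, predictors, drop = FALSE]  # predictor variables as a data frame

# reformulate() builds the formula: Item_Outlet_Sales ~ Item_MRP + Item_Weight
model_formula <- reformulate(predictors, response = response)
print(model_formula)
```

A formula of this shape is what R modeling functions such as lm() take to tell the response apart from the predictors.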

Advantages and Disadvantages of R Programming

There are several benefits and some limitations of the R programming language. Let us discuss them one by one:

Pros of R Language

R is one of the most comprehensive statistical analysis environments, and new statistical techniques often appear first as R packages.
R is open-source, so you can run it anywhere at any time, and even redistribute or sell it under the conditions of the license.
It is cross-platform and runs on many operating systems, including GNU/Linux, macOS and Microsoft Windows.
In R, everyone is welcome to provide bug fixes, code enhancements, and new packages.

Cons of R Language

The quality of some packages in R is less than perfect.
There is no official customer support for R to turn to if something doesn't work.
R gives the programmer little direct control over memory management and keeps objects in memory, so it can consume all the available memory.

Summary

Data Science is one of the most popular fields in the world today. Since it relies heavily on statistics, R is a lingua franca of this field. We went through the points that explain why R is a strong first choice for mastering Data Science. In the end, we conclude that learning R gives you the right tools to deal with data on a large scale.
