Graphical Data Analysis in R Programming

Malini Shukla

Senior Data Scientist || Hiring || 6M+ impressions || Trainer || Top Data Scientist || Speaker || Top content creator on LinkedIn || Tech Evangelist

Published: March 29, 2018

Graphical Data Analysis with R

Much of statistical analysis is based on numerical techniques, such as confidence intervals, hypothesis testing, regression analysis, and so on. In many cases, these techniques are based on assumptions about the data being used. One way to determine if data conform to these assumptions is to analyze its graph, as a graph can provide many insights into the properties of the plotted dataset.

Graphs are also useful for non-numerical data, such as colors, flavors, and brand names. When numerical measures are difficult or impossible to compute, graphs play an especially important role.

Producing high-quality graphics is one of the main aims of statistical computing in R.

The main types of plots drawn in R are:

Plots with a single variable – You can plot a graph for a single variable.
Plots with multiple variables – You can plot a graph with two or more variables.
Special plots – R has both low-level and high-level graphics facilities.

Plots for a Single Variable

You may need to plot a single variable, for example the daily sales values of a particular product over a period of time. You can also plot the time series of month-by-month sales.

The choice of plots is more restricted when you have just one variable to plot. R offers the following plotting functions for single variables:

hist(y) – Histograms to show a frequency distribution
plot(y) – Index plots to show the values of y in sequence
plot.ts(y) – Time series plots
pie(x) – Compositional plots like pie diagrams

The single-variable plot types are:

Histograms – Used to display the mode, spread, and symmetry of a set of data.
Index Plots – Here, plot() takes a single argument. This kind of plot is especially useful for error checking.
Time Series Plots – Used to join the dots in an ordered set of y values recorded over a period of time.
Pie Charts – Useful to illustrate the proportional makeup of a sample in presentations.

A common mistake among beginners is to confuse histograms and bar charts. Histograms have the response variable on the x-axis, and the y-axis shows the frequency of different values of the response. In contrast, a bar chart has the response variable on the y-axis and a categorical explanatory variable on the x-axis.
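The difference can be sketched with a few lines of R. The data here are made up for illustration (the names response and group are assumptions, not part of the original article); the point is only which variable ends up on which axis.

```r
# Hypothetical data, invented for this sketch
set.seed(1)
response <- rnorm(100, mean = 50, sd = 10)               # continuous response
group <- factor(rep(c("A", "B", "C", "D"), each = 25))   # categorical explanatory variable

# Histogram: response values on the x-axis, frequencies on the y-axis
h <- hist(response, main = "Histogram", xlab = "Response")

# Bar chart: categories on the x-axis, (mean) response on the y-axis
group_means <- tapply(response, group, mean)
barplot(group_means, main = "Bar chart", xlab = "Group", ylab = "Mean response")
```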

Histograms

Histograms display the mode, the spread, and the symmetry of a set of data. The R function hist() is used to plot histograms.

The x-axis is divided into intervals, called bins, into which the values of the response variable are sorted and then counted. Histograms are tricky because the picture you see depends on subjective judgments about where exactly to put the bin margins. Wide bins produce one picture, narrow bins produce a different picture, and unequal bins produce confusion.

Narrow bins tend to produce multimodality (several peaks), whereas broad bins tend to produce unimodality (a single peak). When the bins have different widths, R by default converts the counts into densities.
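A minimal sketch of how bin choice changes the picture, using the faithful dataset that ships with R (the choice of dataset and of break points is an assumption made for illustration only):

```r
# Built-in dataset, used here only as a convenient example
x <- faithful$eruptions   # 272 eruption durations, roughly 1.6 to 5.1 minutes

# Wide bins: four bins of width 1
wide <- hist(x, breaks = seq(1.5, 5.5, by = 1), main = "Wide bins")

# Narrow bins: forty bins of width 0.1
narrow <- hist(x, breaks = seq(1.5, 5.5, by = 0.1), main = "Narrow bins")

# Unequal bins: the y-axis switches from counts to densities
unequal <- hist(x, breaks = c(1.5, 2, 3, 5.5), main = "Unequal bins")
```

Passing a vector of break points, rather than a single number, gives exact control over where the bin margins fall.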

The convention adopted in R for showing bin boundaries is to employ square and round brackets, so that:

[a,b) means ‘greater than or equal to a but less than b’ (square, then round)
(a,b] means ‘greater than a but less than or equal to b’ (round, then square)

You need to take care that the bins can accommodate both your minimum and maximum values.

The cut() function takes a continuous vector and cuts it up into bins that can then be used for counting.
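The bracket convention and the need to accommodate the minimum and maximum can be seen in a small sketch with cut(). The values and bin edges below are invented for illustration.

```r
# Hypothetical values and bin edges
values <- c(0, 2.5, 5, 7.5, 10)
edges <- c(0, 5, 10)

# Default (right = TRUE): bins are (0,5] and (5,10], so the minimum 0 falls outside
right_closed <- cut(values, breaks = edges)

# right = FALSE: bins are [0,5) and [5,10), so the maximum 10 falls outside
left_closed <- cut(values, breaks = edges, right = FALSE)

# include.lowest = TRUE closes the first bin at both ends: [0,5] and (5,10]
both_ends <- cut(values, breaks = edges, include.lowest = TRUE)

# Count the values in each bin
bin_counts <- table(both_ends)
print(bin_counts)
```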

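The hist() function in R is not guaranteed to take your advice about the number or width of bars: when breaks is given as a single number, R treats it only as a suggestion. Specifying the break points explicitly helps when you want to view several histograms with a similar range side by side. For small integer data, you can have one bin for each value.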

In R, the parameter k of the negative binomial distribution is known as size and the mean is known as mu.
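As a sketch of how these two arguments are used, the following simulates negative binomial counts and draws a histogram with one bin per integer value. The seed, sample size, and parameter values are arbitrary assumptions chosen for illustration.

```r
set.seed(42)                                 # arbitrary seed, for reproducibility only
counts <- rnbinom(1000, size = 2, mu = 3)    # size plays the role of k; mu is the mean

# One bin per integer: edges at -0.5, 0.5, 1.5, ...
bin_edges <- seq(-0.5, max(counts) + 0.5, by = 1)
h <- hist(counts, breaks = bin_edges, main = "Negative binomial counts", xlab = "Count")
```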

Drawing histograms of continuous variables is more challenging than drawing them for discrete variables. The problem is one of density estimation, which is an important issue for statisticians. To deal with this problem, you can approximate the continuous model with a discrete one, using linear approximation to evaluate the density at the specified points.
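One way to explore density estimation in R is the density() function, whose bw argument sets the smoothing bandwidth. The sketch below uses the built-in faithful dataset, and the bandwidth values 0.05 and 1 are arbitrary choices for illustration, not recommendations.

```r
x <- faithful$eruptions             # built-in dataset, used only as an example

d_default <- density(x)             # bandwidth chosen by R's default rule
d_narrow  <- density(x, bw = 0.05)  # small bandwidth: many small bumps
d_wide    <- density(x, bw = 1)     # large bandwidth: real peaks are smoothed away

# Histogram on the density scale with the three estimates overlaid
hist(x, freq = FALSE, main = "Density estimates", xlab = "Eruption time")
lines(d_default)
lines(d_narrow, lty = 2)
lines(d_wide, lty = 3)
```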

The choice of bandwidth is a compromise between smoothing away insignificant bumps and smoothing away real peaks. The general rule for bandwidth is:

Index Plots

For plotting single samples, index plots can be used. The plot function takes a single argument, a continuous variable, and plots its values on the y-axis, with the x coordinate determined by the position of each number in the vector. Index plots are especially useful for error checking.
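A small sketch of error checking with an index plot. The vector below is invented, with one deliberately mistyped value, and the threshold of 50 is an assumption that only makes sense for this made-up data.

```r
# Hypothetical measurements; the fifth value looks like a data-entry slip
y <- c(10.2, 9.8, 10.5, 10.1, 101, 9.9, 10.3)

# Index plot: value on the y-axis, position in the vector on the x-axis
plot(y, xlab = "Index", ylab = "y")

# Locate the suspicious value by its position
suspect <- which(y > 50)
print(suspect)
```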

Time Series Plot

The time series plot joins the dots in an ordered set of y values, and works best when the series is complete over the period. Issues arise when there are missing values in the time series (e.g., if sales values for two months are missing during the last five years), and particularly when whole groups of values are missing (e.g., if sales values for two quarters are missing during the last five years), because we typically know nothing about the behavior of the time series during those periods.

ts.plot and plot.ts are the two functions for plotting time series data in R.
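A minimal sketch of a monthly series with two missing values. The sales figures and the start date are invented for illustration.

```r
# Hypothetical monthly sales for one year, with two missing months
sales <- c(120, 135, 128, NA, NA, 150, 162, 158, 170, 165, 180, 190)

# Build a monthly time series starting in January 2017 (assumed start date)
sales_ts <- ts(sales, start = c(2017, 1), frequency = 12)

# The line is broken where values are missing
plot.ts(sales_ts, ylab = "Sales", main = "Monthly sales")
```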

Pie Chart

You can use pie charts to illustrate the proportional makeup of a sample in presentations. Here the function pie takes a vector of numbers and turns them into proportions. It then divides the circle on the basis of those proportions.

To indicate each segment of the pie, it is essential to use a label. The label is provided as a vector of character strings, here called data$names.

If the list of names contains blank spaces, you cannot use read.table with a tab-delimited text file to enter the data. Instead, you can save the file called piedata as a comma-delimited file with a “.csv” extension and read the data into R using read.csv in place of read.table:

data <- read.csv("c:\\temp\\piedata.csv")
data

The pie chart can be created, using the following command:

pie(data$amounts,labels=as.character(data$names))

Note: The color for the segments can also be changed in R.
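As a self-contained sketch of these steps, the following writes a small comma-delimited file to a temporary location, reads it back with read.csv, and draws the pie chart with custom segment colors. The region names, amounts, and colors are invented for illustration; only the column names names and amounts follow the text above.

```r
# Write a hypothetical comma-delimited file; note the blank spaces in the names
csv_path <- tempfile(fileext = ".csv")
writeLines(c("names,amounts",
             "North region,40",
             "South region,25",
             "East region,20",
             "West region,15"), csv_path)

data <- read.csv(csv_path)

# pie() converts the amounts to proportions of the total
proportions <- data$amounts / sum(data$amounts)

# col changes the segment colors
pie(data$amounts, labels = as.character(data$names),
    col = c("steelblue", "tomato", "gold", "seagreen"))
```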

Plots with Two Variables

The two types of variables used in graphical data analysis with R are:

Response variable
Explanatory variable

The response variable is represented on the y-axis and the explanatory variable is represented on the x-axis. The nature of the explanatory variable determines the kind of plot produced. When the explanatory variable is a continuous variable, such as length or weight or altitude, the appropriate plot to use is a scatterplot.

When an explanatory variable is categorical, like genotype or color or gender, the appropriate plot is either a box-and-whisker plot or a barplot.

A box-and-whisker plot is a graphical means of representing sets of numeric data using the minimum and maximum values, the median, and the upper and lower quartiles.
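The numbers behind a box-and-whisker plot can be inspected directly. This sketch uses an invented sample; note that with R's default settings the whiskers stop at the most extreme point within 1.5 times the box length, so a far-away value is drawn as a separate point rather than as the end of a whisker.

```r
# Hypothetical sample with one unusually large value
y <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 30)

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(y)
print(five)

# boxplot() draws the plot and returns the statistics it used
b <- boxplot(y, horizontal = TRUE)
print(b$stats)   # whisker ends, box edges, and median
print(b$out)     # points drawn beyond the whiskers
```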

A barplot represents data as bars whose heights correspond to the values being plotted.

The most frequently used plotting functions for two variables in R are:

plot(x, y): Scatterplot of y against x
plot(factor, y): Box-and-whisker plot of y at each factor level
barplot(y): Heights from a vector of y values (one bar per factor level)
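The three calls listed above can be tried on simulated data. Everything in this sketch (the seed, the variable names, the treatment labels) is an assumption made for illustration.

```r
set.seed(7)                                   # arbitrary seed
x <- runif(30, min = 0, max = 10)             # continuous explanatory variable
y <- 2 + 0.5 * x + rnorm(30)                  # response
treatment <- factor(rep(c("control", "low", "high"), each = 10))  # categorical explanatory variable

# Continuous explanatory variable: scatterplot
plot(x, y)

# Categorical explanatory variable: box-and-whisker plot at each factor level
plot(treatment, y)

# One bar per factor level, with the height given by the mean response
means <- tapply(y, treatment, mean)
barplot(means, ylab = "Mean of y")
```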

The two-variable plot types are:

Scatterplots – Used when the explanatory variable is a continuous variable.
Stepped Lines – Used to plot data distinctly and provide a clear view.
Boxplots – Show the location and spread of the data and indicate skewness.
Barplots – Show the heights of the mean values from the different treatments.

Scatterplots

Scatterplots show a graphical representation of the relationship between two sets of numbers. The plot function draws the axes and adds a scatterplot of points. You can also add extra points or lines to an existing plot by using the functions points and lines.
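A brief sketch of adding to an existing plot. The coordinates are invented, and the reference line y = 2x is an arbitrary choice for illustration.

```r
# Hypothetical data
x <- 1:10
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2)

# plot() draws the axes and the first set of points
plot(x, y)

# points() adds extra points to the existing plot
extra_x <- c(2.5, 7.5)
extra_y <- c(15, 5)
points(extra_x, extra_y, pch = 16, col = "blue")

# lines() adds a line to the existing plot
lines(x, 2 * x, lty = 2)
```

Because points and lines draw on the current plot, they must be called after plot and only affect what falls within the axis limits that plot set up.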

The plot, points, and lines functions can be specified in the following two ways:

Cartesian, plot(x, y) – A Cartesian coordinate specifies the location of a point in a two-dimensional plane with the help of two perpendicular lines known as axes. The origin of the Cartesian coordinate system is the point where the two axes cross, and its location is (0,0).
Formula, plot(y ~ x) – The formula-based plot represents the relationship between variables in graphical form. For example, the equation y = mx + c describes a straight line in the Cartesian coordinate system.

The advantage of the formula-based plot is that the plot function and the model fit look and feel the same. Cartesian plots are specified as “x then y”, while the formula and the model fit use “y then x”.
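The two calling styles can be compared side by side. In this sketch the data are invented to lie close to the line y = 2x, and lm is used as one example of a model fit that shares the formula notation.

```r
# Hypothetical data lying close to y = 2x
x <- 1:10
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2)

# Cartesian style: x then y
plot(x, y)

# Formula style: y then x, read as "y as a function of x"
plot(y ~ x)

# The model fit uses the same formula, so the two calls look alike
fit <- lm(y ~ x)
abline(fit)        # add the fitted straight line to the current plot
print(coef(fit))   # intercept c and slope m
```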

The plot function uses the following arguments:

The name of the explanatory variable
The name of the response variable

The syntax for the plot function looks like plot(x, y). The data you want to plot is read into R from a file, as shown in the following commands:

data1 <- read.table("c:\\temp\\scatter1.txt", header=T)
attach(data1)
names(data1)
[1] "x1" "y1"

To produce the scatter plot, type the following command:

plot(x1, y1, col="red")

Unless you supply explicit labels, the variable names are used to label the axes. You can use the command below to change the x-axis label from x1 to the longer label ‘Explanatory variable’ and the y-axis label from y1 to ‘Response variable’.

plot(x1, y1, col="red", xlab="Explanatory variable", ylab="Response variable")

The argument pch refers to the plotting character or plotting symbol. Varying pch adds variety to scatterplots.
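A common use of pch is to give each group its own symbol. In this sketch the groups, the data, and the particular symbol numbers (1, 16, 17) are arbitrary assumptions for illustration.

```r
# Hypothetical grouped data
group <- factor(c("a", "a", "b", "b", "c", "c"))
x <- 1:6
y <- c(3, 4, 2, 5, 1, 6)

# Map each factor level to a plotting symbol: open circle, filled circle, filled triangle
symbols <- c(1, 16, 17)[as.integer(group)]

plot(x, y, pch = symbols)
```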

As the value of pch changes, the plotting character also changes. There are 256 different plotting symbols used in R (0 to 255). A graphic showing all of them in sequence, from bottom left to top right, can be built as follows:

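# Set up an empty plotting region with no axes or labels
plot(0:10, 0:10, xlim=c(0,32), ylim=c(0,40), type="n", xaxt="n", yaxt="n", xlab="", ylab="")
x <- seq(1, 31, 2)   # 16 x positions per row
s <- -16             # first symbol number in the current row
f <- -1              # last symbol number in the current row
for (y in seq(2, 40, 2.5)) {
  s <- s + 16
  f <- f + 16
  y2 <- rep(y, 16)
  points(x, y2, pch=s:f, cex=0.7)            # draw the 16 symbols of this row
  text(x, y-1, as.character(s:f), cex=0.6)   # label each symbol with its number
}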
