Graphical Data Analysis in R Programming
Malini Shukla
Graphical Data Analysis with R
Much of statistical analysis is based on numerical techniques, such as confidence intervals, hypothesis testing, regression analysis, and so on. In many cases, these techniques are based on assumptions about the data being used. One way to determine if data conform to these assumptions is to analyze its graph, as a graph can provide many insights into the properties of the plotted dataset.
Graphs are useful for non-numerical data, such as colors, flavors, brand names, and more. When numerical measures are difficult or impossible to compute, graphs play an important role.
R is an environment for statistical computing that also aims to produce high-quality graphics.
Various types of plots drawn in R are:
- Plots with single variables – You can plot a graph for a single variable.
- Plots with multiple variables – You can plot a graph with multiple variables.
- Special plots – R has low and high-level graphics facilities.
Plots for a Single Variable
You may need to plot a single variable, for example the daily sales values of a particular product over a period of time. You can also plot a time series of month-by-month sales.
The choice of plots is more restricted when you have just one variable to plot. R offers the following plotting functions for single variables:
- hist(y) – Histograms to show a frequency distribution
- plot(y) – Index plots to show the values of y in sequence
- plot.ts(y) – Time series plots
- pie(x) – Compositional plots like pie diagrams
The types of plots available in R for a single variable are:
- Histograms – Used to display the mode, spread, and symmetry of a set of data.
- Index Plots – Here, the plot function takes a single argument. This kind of plot is especially useful for error checking.
- Time Series Plots – When the period of time is complete, the time series plot can be used to join the dots in an ordered set of y values.
- Pie Charts – Useful to illustrate the proportional makeup of a sample in presentations.
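As a rough illustration of the four single-variable functions listed above, the sketch below draws each of them in one 2 x 2 layout. The data are simulated, and the names y and shares are made up for this example only.

```r
# Simulated data; y and shares are illustrative names, not from the article
set.seed(1)
y <- rnorm(60, mean = 100, sd = 15)     # e.g. 60 monthly sales values
shares <- c(A = 40, B = 35, C = 25)     # made-up composition of a sample

op <- par(mfrow = c(2, 2))              # arrange four plots in a 2 x 2 grid
hist(y, main = "hist(y)")               # frequency distribution
plot(y, main = "plot(y)")               # index plot: value against position
plot.ts(y, main = "plot.ts(y)")         # values joined in sequence
pie(shares, main = "pie(x)")            # proportional makeup
par(op)                                 # restore the previous layout
```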
A common mistake among beginners is to confuse histograms and bar charts. Histograms have the response variable on the x-axis, and the y-axis shows the frequency of different values of the response. In contrast, a bar chart has the response variable on the y-axis and a categorical explanatory variable on the x-axis.
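The difference can be seen by drawing both plots from the same data. This is a minimal sketch on simulated values: growth and treatment are hypothetical names, and using tapply() to compute the bar heights is one possible choice, not something the article prescribes.

```r
# Simulated response with a three-level categorical explanatory variable
set.seed(2)
growth <- rnorm(90, mean = 10, sd = 2)
treatment <- factor(rep(c("control", "low", "high"), each = 30))

op <- par(mfrow = c(1, 2))
# Histogram: response on the x-axis, frequency on the y-axis
hist(growth, xlab = "growth", main = "Histogram")
# Bar chart: category on the x-axis, response (here the group mean) on the y-axis
means <- tapply(growth, treatment, mean)
barplot(means, ylab = "mean growth", main = "Bar chart")
par(op)
```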
Histograms
Histograms display the mode, the spread, and the symmetry of a set of data. The R function hist() is used to plot histograms.
The x-axis is divided into intervals, called bins, and the values of the response variable are sorted into these bins and counted. Histograms are tricky because the picture you get depends on subjective judgments about where exactly to put the bin margins. Wide bins produce one picture, narrow bins produce a different picture, and unequal bins produce confusion.
Narrow bins tend to produce multimodality (several peaks), whereas broad bins tend to produce unimodality (a single peak). When the bins have different widths, R by default plots densities rather than counts.
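The effect of bin width can be checked on simulated data. In this sketch the sample, the bin widths, and the variable names are all assumptions chosen for illustration.

```r
set.seed(3)
x <- rnorm(200)
lo <- floor(min(x))       # integer limits so every value falls inside a bin
hi <- ceiling(max(x))

# Same data, two bin widths (plot = FALSE returns the counts without drawing)
narrow <- hist(x, breaks = seq(lo, hi, by = 0.25), plot = FALSE)
wide   <- hist(x, breaks = seq(lo, hi, by = 1), plot = FALSE)
length(narrow$counts)     # number of narrow bins
length(wide$counts)       # number of wide bins

# Unequal bin widths: the y-axis shows densities instead of counts
uneq <- hist(x, breaks = c(lo, -1, 0, 0.5, hi),
             main = "Unequal bins: densities, not counts")
uneq$equidist                           # FALSE when widths differ
sum(uneq$density * diff(uneq$breaks))   # bar areas add up to 1
```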
The convention adopted in R for showing bin boundaries is to employ square and round brackets, so that:
- [a,b) means ‘greater than or equal to a but less than b’ (square then round)
- (a,b] means ‘greater than a but less than or equal to b’ (round then square)
You need to take care that the bins can accommodate both your minimum and maximum values.
The cut() function takes a continuous vector and cuts it up into bins that can then be used for counting.
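A small sketch of both bracket conventions with cut() follows. The vector y and the break points are invented for the example. The right argument switches between the two conventions, and include.lowest is used in the second call so that the minimum value is not dropped.

```r
y <- c(0, 1.5, 2, 3.9, 4, 5)     # illustrative values
breaks <- c(0, 2, 4, 6)          # bins must cover both min(y) and max(y)

# [a,b): closed on the left, open on the right
left_closed <- cut(y, breaks = breaks, right = FALSE)
levels(left_closed)
table(left_closed)

# (a,b]: open on the left, closed on the right;
# include.lowest = TRUE keeps the value 0 in the first bin
right_closed <- cut(y, breaks = breaks, right = TRUE, include.lowest = TRUE)
levels(right_closed)
table(right_closed)
```

Note how the value 2 is counted in the second bin under the first convention and in the first bin under the second.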
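When breaks is given as a single number, hist() treats it only as a suggestion, so it does not necessarily take your advice about the number or the width of bars. To fix the bins exactly, supply a vector of break points. Using the same break points also makes it easier to compare several histograms that cover a similar range. For small integer data, you can have one bin for each value.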
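For small integer data, one way to get one bin per value is to place the break points at the half-integers. The sketch below assumes simulated Poisson counts. The name counts is hypothetical.

```r
set.seed(4)
counts <- rpois(150, lambda = 3)     # simulated small integer data

# Break points at -0.5, 0.5, 1.5, ... so each integer sits in its own bin
h <- hist(counts,
          breaks = seq(-0.5, max(counts) + 0.5, by = 1),
          main = "One bin per integer value")

h$mids      # bin midpoints are the integers themselves
h$counts    # frequency of each integer
```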
In R, the parameter k of the negative binomial distribution is known as size and the mean is known as mu.
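As a quick check of this parameterization, the sketch below draws negative binomial counts with rnbinom(). The values size = 2 and mu = 5 are arbitrary choices for illustration.

```r
set.seed(5)
# size plays the role of k, mu is the mean
x <- rnbinom(10000, size = 2, mu = 5)

mean(x)     # should be close to mu
var(x)      # theoretical variance is mu + mu^2 / size

# Small integer data again, so use one bin per value
hist(x, breaks = seq(-0.5, max(x) + 0.5, by = 1),
     main = "Negative binomial counts")
```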
Drawing histograms of continuous variables is more challenging than drawing them for small integer data. The underlying problem is density estimation, which is an important issue for statisticians. To deal with this problem, you can approximate the continuous model by a discrete one, using a linear approximation to evaluate the density at a set of specified points.
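One way to explore density estimation in base R is the density() function, which is not named in the text above and is used here as an assumption. The sketch overlays three kernel density estimates on a histogram of simulated two-peaked data. The bandwidth values 0.1 and 2 are arbitrary and only meant to show under-smoothing and over-smoothing.

```r
set.seed(6)
# Simulated sample with two real peaks, near 0 and near 4
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 4))

d_default <- density(x)             # bandwidth chosen by R's default rule
d_narrow  <- density(x, bw = 0.1)   # small bandwidth: many small bumps
d_wide    <- density(x, bw = 2)     # large bandwidth: the two peaks merge

hist(x, freq = FALSE, main = "Histogram with density estimates")
lines(d_default, lwd = 2)
lines(d_narrow, lty = 2)
lines(d_wide, lty = 3)

length(d_default$x)   # the density is evaluated at a fixed grid of points
```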
The choice of bandwidth is a compromise between smoothing enough to remove insignificant bumps and smoothing so much that real peaks disappear. The general rule for bandwidth is:
Index Plots
Index plots can be used for plotting single samples. Here the plot function takes a single argument, a continuous variable, and plots its values on the y-axis, with the x coordinate determined by the position of each number in the vector. Index plots are especially useful for error checking.
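The sketch below shows how an index plot can expose a data-entry error. The data are simulated, the error at position 17 is planted deliberately, and the threshold of 100 is an arbitrary choice for this example.

```r
set.seed(7)
y <- rnorm(50, mean = 20, sd = 2)   # simulated measurements
y[17] <- 200                        # planted error, e.g. a misplaced decimal point

plot(y)                             # the bad value stands far above the rest

# The x coordinate is the position in the vector, so the culprit is easy to find
suspect <- which(y > 100)
suspect
```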
Time Series Plot
The time series plot joins the dots in an ordered set of y values when the period of time is complete. Issues arise when there are missing values in the time series (e.g., if sales values for two months are missing during the last five years), and particularly when there are groups of missing values (e.g., if sales values for two quarters are missing during the last five years), because we typically know nothing about the behavior of the series during those periods.
ts.plot and plot.ts are the two functions for plotting time series data in R.
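A minimal sketch of both functions on simulated monthly sales follows. The series, its start date, and the two missing months are invented for illustration. Missing values appear as gaps in the line.

```r
set.seed(8)
# Five years of simulated monthly sales
sales <- round(100 + cumsum(rnorm(60, mean = 0, sd = 5)))
sales[c(20, 21)] <- NA                       # two consecutive missing months

# Convert to a monthly time series; the start date is an arbitrary assumption
sales_ts <- ts(sales, start = c(2019, 1), frequency = 12)

op <- par(mfrow = c(2, 1))
plot.ts(sales_ts, ylab = "Monthly sales", main = "plot.ts")
ts.plot(sales_ts, gpars = list(ylab = "Monthly sales", main = "ts.plot"))
par(op)
```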
Pie Chart
You can use pie charts to illustrate the proportional makeup of a sample in presentations. Here the function pie takes a vector of numbers and turns them into proportions. It then divides the circle on the basis of those proportions.
To indicate each segment of the pie, it is essential to use a label. The label is provided as a vector of character strings, here called data$names.
If the names contain blank spaces, you cannot read them from a tab-delimited text file using read.table with its default settings, because the default separator is any white space. Instead, you can save the file, called piedata, as a comma-delimited file with a “.csv” extension, and input the data to R using read.csv in place of read.table.
data <- read.csv("c:\\temp\\piedata.csv")
data
The pie chart can be created, using the following command:
pie(data$amounts,labels=as.character(data$names))
Note: The color for the segments can also be changed in R.
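Segment colors can be set with the col argument of pie(). Since the contents of piedata.csv are not shown, the sketch below builds a small stand-in data frame with the same two columns. All of its values and the chosen colors are made up.

```r
# Hypothetical stand-in for the contents of piedata.csv
data <- data.frame(
  names   = c("rent and rates", "food", "travel", "other"),
  amounts = c(40, 30, 20, 10)
)

# One color per segment, in the same order as the amounts
pie(data$amounts,
    labels = as.character(data$names),
    col = c("tomato", "gold", "skyblue", "grey80"))

# pie() works with proportions of the total
props <- data$amounts / sum(data$amounts)
props
```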
Plots with Two Variables
The two types of variables used in graphical data analysis with R are:
- Response variable
- Explanatory variable
The response variable is represented on the y-axis and the explanatory variable is represented on the x-axis. The nature of the explanatory variable determines the kind of plot produced. When the explanatory variable is a continuous variable, such as length or weight or altitude, the appropriate plot to use is a scatterplot.
When an explanatory variable is categorical, like genotype or color or gender, the appropriate plot is either a box-and-whisker plot or a barplot.
A box-and-whisker plot is a graphical means of representing a set of numeric data through its quartiles: the box spans the lower and upper quartiles with a line at the median, and the whiskers extend toward the minimum and maximum values.
A barplot represents data as bars, with the height of each bar giving the value for one category.
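The numbers behind a box-and-whisker plot can be inspected directly. This sketch uses a small invented vector with one extreme value. It relies on fivenum() and on the list returned by boxplot(), both from base R.

```r
y <- c(2, 4, 4, 5, 7, 9, 10, 12, 30)   # made-up sample with one extreme value

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(y)
five

# boxplot() returns the values it draws
b <- boxplot(y, horizontal = TRUE, main = "Box-and-whisker plot")
b$stats   # whisker ends, box edges, and median
b$out     # points beyond the whiskers, drawn individually
```

In this sample the upper whisker stops at 12 and the value 30 is drawn as a separate point, because it lies more than 1.5 times the box length beyond the upper edge of the box.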
The most frequently used plotting functions for two variables in R are:
- plot(x, y): Scatterplot of y against x
- plot(factor, y): Box-and-whisker plot of y at each factor level
- barplot(y): Heights from a vector of y values (one bar per factor level)
The types of plots available in R for two variables are:
- Scatterplots – Used when the explanatory variable is a continuous variable.
- Stepped Lines – Used to plot data distinctly and provide a clear view.
- Boxplots – Show the location and spread of the data and indicate skewness.
- Barplots – Show the heights of the mean values from the different treatments.
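The four plot types above can be sketched side by side. Everything here is simulated. The names x, y, and group are illustrative, and drawing the stepped line as a running count of the sorted x values is just one possible use of type = "s".

```r
set.seed(9)
x <- runif(40, min = 0, max = 10)                 # continuous explanatory variable
y <- 2 + 0.5 * x + rnorm(40)                      # response variable
group <- factor(rep(c("A", "B", "C", "D"), each = 10))   # categorical explanatory variable

op <- par(mfrow = c(2, 2))
plot(x, y, main = "Scatterplot")
plot(sort(x), seq_along(x), type = "s",
     xlab = "x", ylab = "running count", main = "Stepped line")
plot(group, y, main = "Boxplot")                  # factor on the x-axis gives boxplots
group_means <- tapply(y, group, mean)
barplot(group_means, main = "Barplot of means")
par(op)
```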
Scatterplots
Scatterplots show a graphical representation of the relationship between two sets of numbers. The plot function draws the axes and adds a scatterplot of points. You can also add extra points or lines to an existing plot by using the functions points and lines.
The variables for plot, points, and lines can be specified in the following two ways:
- Cartesian, plot(x, y) – A Cartesian coordinate specifies the location of a point in a two-dimensional plane with the help of two perpendicular lines known as axes. The origin of the Cartesian coordinate system is the point where the two axes cross, and its location is (0,0).
- Formula, plot(y ~ x) – The formula-based plot expresses the relationship between the variables as a model formula, with the response on the left of the tilde and the explanatory variable on the right. For example, the equation y = mx + c describes a straight line in the Cartesian coordinate system.
The advantage of the formula-based plot is that the plot function and the model fit look and feel the same: the Cartesian plot takes “x then y”, while the formula plot and the model fit both take “y then x”.
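The two styles can be compared directly. The sketch below uses simulated data around a straight line. The slope, intercept, and noise level are arbitrary, and lm() with abline() is one common way to add a fitted line rather than the only one.

```r
set.seed(10)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 3)     # simulated straight-line relationship

op <- par(mfrow = c(1, 2))

# Cartesian style: x first, then y
plot(x, y, main = "plot(x, y)")
points(5, 30, pch = 16, col = "red")   # add an extra point to the existing plot
lines(x, 3 + 2 * x, lty = 2)           # add the line used in the simulation

# Formula style: response ~ explanatory, the same form as the model fit
plot(y ~ x, main = "plot(y ~ x)")
model <- lm(y ~ x)
abline(model)                          # fitted straight line

par(op)
coef(model)                            # estimated intercept and slope
```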
The plot function uses the following arguments:
- The name of the explanatory variable
- The name of the response variable
The syntax for the plot function looks like plot(x, y). The data you want to plot is read into R from a file, as shown in the following commands:
data1 <- read.table("c:\\temp\\scatter1.txt", header=TRUE)
attach(data1)
names(data1)
[1] "x1" "y1"
To produce the scatter plot, type the following command:
plot(x1, y1, col="red")
Unless you specify explicit labels, the variable names are used to label the axes. You could use the command below to change the label on the x-axis from x1 to the longer label ‘Explanatory variable’ and the label on the y-axis from y1 to ‘Response variable’.
plot(x1, y1, col="red", xlab="Explanatory variable", ylab="Response variable")
The argument pch refers to the plotting character or plotting symbol. Changing pch adds variation to scatterplots.
As the value of pch changes, the plotting character also changes. There are 256 different plotting symbols used in R (0 to 255). A graphic showing all of them in sequence, from bottom left to top right, can be built as follows:
# Empty plotting region with no axes or labels
plot(0:10, 0:10, xlim=c(0,32), ylim=c(0,40), type="n", xaxt="n", yaxt="n", xlab="", ylab="")
x <- seq(1, 31, 2)
s <- -16
f <- -1
# Each row shows the next 16 symbols, with the pch number printed below
for (y in seq(2, 40, 2.5)) {
  s <- s + 16
  f <- f + 16
  y2 <- rep(y, 16)
  points(x, y2, pch=s:f, cex=0.7)
  text(x, y-1, as.character(s:f), cex=0.6)
}
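In everyday use, the symbols most often needed are the standard ones numbered 0 to 25. The sketch below is a smaller variant that shows only those. The layout, the grey fill, and the name pch_values are choices made for this example.

```r
pch_values <- 0:25                       # the standard plotting symbols

# One row of symbols; bg fills symbols 21 to 25, which have a separate fill color
plot(pch_values, rep(1, 26), pch = pch_values, cex = 1.5, bg = "grey",
     ylim = c(0, 2), yaxt = "n", xlab = "pch value", ylab = "")

# Print each pch number under its symbol
text(pch_values, rep(0.5, 26), labels = pch_values, cex = 0.6)
```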