Descriptive Statistics in R
Malini Shukla
R Descriptive Statistics/Summary Statistics
Data gathered for an analysis is only useful when it is presented so that everyone can understand it easily and use it to make decisions. After analyzing data, producing a summary therefore plays a vital role. This is known as summarizing the data.
We can summarize data in several ways, either as text or as a pictorial representation.
Below are the ways of summarizing data in R:
- Descriptive/Summary Statistics – Descriptive statistics (summary statistics) are the first figures used to describe nearly every dataset. They also form the foundation for more complicated computations and analyses, so although the methods are simple, they are essential to the analysis process.
- Tabulation – Presenting the analyzed data in tabular form for easy understanding.
- Graphical – Presenting the data as plots and charts.
In this Descriptive Statistics in R tutorial, we will now look at the summary commands in R.
Summary Commands in R
Whenever you start working on a data set, you need an overview of what you are dealing with. There are a few ways of doing this:
As we saw in the earlier session, the ls() command lists the named objects in your workspace, so it is a good place to start.
Once you know which objects are available, you can type the name of an object to view its contents. However, if the object contains a lot of data, the display may be quite large and you may want a more concise way to examine it.
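As a minimal sketch of this first step, the snippet below lists named objects with ls() and then looks at only part of a large object. The object names (weights, items, big) and the separate environment are illustrative assumptions, used so the listing is predictable; in an interactive session you would simply call ls() with no arguments.

```r
# Work in a fresh environment so the listing is predictable
# (the object names here are hypothetical, not from the article).
env <- new.env()
assign("weights", c(61.5, 72.0, 58.3), envir = env)
assign("items", c("Pen", "Pencil", "Rubber"), envir = env)

# ls() returns the names of the objects, sorted alphabetically.
object_names <- ls(envir = env)
print(object_names)

# Typing an object's name prints all of it; head() is one way to
# look at just the first few elements of a large object.
big <- seq_len(1000)
first_values <- head(big, 5)
print(first_values)
```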
You could use the str() command, which is designed to show the structure of a data object rather than a statistical summary. For a data frame, it reports the number of rows (observations) and columns (variables), along with each column's name, type, and first few values.
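To illustrate, here is a short sketch that calls str() on iris, a data set that ships with R and stands in here for your own data. The variable names n_rows, n_cols, and col_classes are assumptions made for the example; they show how the same structural facts can be retrieved programmatically.

```r
# iris is a built-in data set; it is used here only as a stand-in.
str(iris)

# The same structural information is available piece by piece.
n_rows <- nrow(iris)                 # number of observations
n_cols <- ncol(iris)                 # number of variables
col_classes <- sapply(iris, class)   # type of each column
print(col_classes)
```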
To get a quick statistical summary of a data object, you can use the summary() command.
The output of summary() depends on the object you are looking at. For a numeric vector, it reports the minimum, first quartile, median, mean, third quartile, and maximum.
For example, if you have below data:
S.No. Item Quantity
1 Pen 5
2 Pencil 10
3 Rubber 12
The str() command gives output describing:
- 3 obs. of 2 variables
- Item: Pen Pencil Rubber
- Quantity: 5 10 12
The summary() command, applied to the Quantity column, gives output in the form below:
- Min: 5
- Median: 10
- Mean: 9
- Max: 12
The summary() command is therefore more useful when you want statistics, since it shows the minimum, maximum, mean, and similar values. It works for both matrix and data frame objects, summarizing the columns rather than the rows.
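The small table above can be rebuilt as a data frame to try both commands. This is a sketch only: the object name items is an assumption, and the S.No. column is treated as the row number rather than as a variable.

```r
# Rebuild the example table; S.No. is left as the implicit row number.
items <- data.frame(
  Item = c("Pen", "Pencil", "Rubber"),
  Quantity = c(5, 10, 12),
  stringsAsFactors = FALSE
)

# Structure: 3 observations of 2 variables.
str(items)

# Statistical summary of the numeric column.
qty_summary <- summary(items$Quantity)
print(qty_summary)

# Summary of the whole data frame, column by column.
print(summary(items))
```

For the quantities 5, 10, and 12, the mean is 27 / 3 = 9 and the median is 10.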
Name Commands in R
The names() command and its variants are used to find or set the names of rows and columns of data structures.
A few of these commands are explained below:
- names() – Works on list or data frame objects. It gets or sets the names of the elements of a list or the columns (variables) of a data frame.
- row.names() – Works on matrix or data frame objects and is used to get or set row names.
- rownames() – Works on matrix or data frame objects and is used to get or set row names.
- colnames() – Works on matrix or data frame objects and is used to get or set column names.
- dimnames() – Gets or sets both the row and column names of matrix or data frame objects, returned as a list. To see the dimensions themselves, use dim().
rownames and row.names return the same values for data frames and matrices; the only difference is that where there aren’t any names, rownames prints “NULL” (as does colnames), whereas row.names returns it invisibly.
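The following sketch exercises the name commands on a small data frame and a small matrix. The objects items and m and the labels assigned to them are hypothetical, chosen only for illustration. Note that R is case-sensitive, so the commands are written in lowercase.

```r
# A small data frame and a small matrix to experiment with.
items <- data.frame(Item = c("Pen", "Pencil", "Rubber"),
                    Quantity = c(5, 10, 12))
m <- matrix(1:6, nrow = 2)

# names() returns the column names of a data frame.
df_names <- names(items)
print(df_names)

# A freshly created matrix has no row or column names yet.
print(rownames(m))

# rownames() and colnames() can also set names.
rownames(m) <- c("r1", "r2")
colnames(m) <- c("c1", "c2", "c3")

# dimnames() returns both sets of names as a list;
# dim() returns the dimensions themselves.
m_dimnames <- dimnames(m)
m_dim <- dim(m)
print(m_dimnames)
print(m_dim)
```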
Descriptive statistics is used to analyze data in various types of industries, such as education, information technology, entertainment, retail, agriculture, transport, sales and marketing, psychology, demography, and advertising. In a broader sense, it is used as a tool to interpret and analyze data. For example, with the help of descriptive statistics, a production engineer can uncover the truth behind breakdowns of motors and a manager can supervise the quality of the production process.
Summarizing Samples in R Programming Language
When measurements are repeated, we generally want to summarize the data with measures such as the average. R provides a variety of commands that operate on samples. These samples might be individual vectors, columns in a data frame, or part of a matrix or list.
For example, suppose a survey is conducted to find the average weight of people living in a country. Since it is not possible to weigh every person in the country, a sample of a few thousand individuals is collected. If the sample is representative, its average weight is expected to be close to the average weight of the entire population.
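The survey idea can be sketched with simulated data. Everything here is an assumption made for illustration: the population is generated from a normal distribution with a mean of 70 kg and a standard deviation of 12 kg, and the sample size of 5000 stands in for "a few thousand individuals".

```r
# Simulated population of weights in kg (hypothetical values).
set.seed(42)
population <- rnorm(100000, mean = 70, sd = 12)

# Draw a random sample of a few thousand individuals.
survey_sample <- sample(population, size = 5000)

# Compare the sample average with the population average.
pop_mean <- mean(population)
sample_mean <- mean(survey_sample)
print(c(population = pop_mean, sample = sample_mean))
```

With a random sample of this size, the two averages should differ by only a fraction of a kilogram.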
A variety of simple summary statistics can be applied to a vector of numbers. Two kinds of summary commands are used:
- Commands for Single Value Results – Produce a single value as a result.
- Commands for Multiple Value Results – Produce multiple values as output.
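To make the distinction concrete, the sketch below applies one command of each kind to the same vector. The vector x is made up, and range() and quantile() are chosen here as examples of multiple-value commands; the article does not name them.

```r
# A made-up vector of measurements.
x <- c(3, 5, 7, 5, 3, 2, 6, 8, 5, 6, 9)

# Single value result: one number comes back.
single_result <- mean(x)
print(single_result)

# Multiple value results: several numbers come back.
range_result <- range(x)        # minimum and maximum
quantile_result <- quantile(x)  # 0%, 25%, 50%, 75%, 100% quantiles
print(range_result)
print(quantile_result)
```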
Summary Commands with Single Value Results in R
Let us now look at the summary commands that produce a single value as a result.
There are many such commands. Let us look at a few of them:
- max(x, na.rm = FALSE) – Shows the maximum value. By default, NA values are not removed, so if the vector contains NA, the result is NA unless na.rm = TRUE is used.
- min(x, na.rm = FALSE) – Shows the minimum value in a vector. If there are NA values, NA is returned unless na.rm = TRUE is used.
- length(x) – Gives the length of the vector, including NA values. The na.rm instruction does not work with this command.
- sum(x, na.rm = FALSE) – Shows the sum of the vector elements
- mean(x, na.rm = FALSE) – Shows the arithmetic mean
- median(x, na.rm = FALSE) – Shows the median value of the vector
- sd(x, na.rm = FALSE) – Shows the standard deviation
- var(x, na.rm = FALSE) – Shows the variance
- mad(x, na.rm = FALSE) – Shows the median absolute deviation (scaled by the constant 1.4826 by default)
These commands operate on a vector of values and return a single result; however, if NA items are present, the result will also be NA. For most of these commands, you can ensure that NA items are ignored by adding the na.rm = TRUE instruction, which gives a result computed from the non-missing values only.
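The effect of na.rm can be seen in a short sketch. The vector x below is made up for illustration and contains a single NA. Since length() does not accept na.rm, counting the non-missing values with sum(!is.na(x)) is shown as one possible workaround.

```r
# A made-up vector with one missing value.
x <- c(4, 8, NA, 15, 16, 23, 42)

# With the default na.rm = FALSE, a single NA makes the result NA.
max_default <- max(x)
print(max_default)                  # NA

# With na.rm = TRUE, NA items are ignored.
max_clean    <- max(x, na.rm = TRUE)     # 42
min_clean    <- min(x, na.rm = TRUE)     # 4
sum_clean    <- sum(x, na.rm = TRUE)     # 108
mean_clean   <- mean(x, na.rm = TRUE)    # 18
median_clean <- median(x, na.rm = TRUE)  # 15.5
var_clean    <- var(x, na.rm = TRUE)     # 182
sd_clean     <- sd(x, na.rm = TRUE)      # square root of the variance
mad_clean    <- mad(x, na.rm = TRUE)     # scaled median absolute deviation

# length() counts NA items too and has no na.rm argument.
n_all <- length(x)                  # 7
n_non_missing <- sum(!is.na(x))     # 6
```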