R Data Manipulating and Processing
Manipulating and processing data in R

Data structures provide the way to represent data in data analytics. We can manipulate data in R for analysis and visualization.

Before we start playing with data in R, let us see how to import data in R and ways to export data from R to different external sources like SAS, SPSS, text file or CSV file.

One of the most important aspects of computing with data in R is its ability to manipulate data and enable its subsequent analysis and visualization. Let us look at a few basic data structures in R:

a. Vectors in R

Vectors are ordered containers of primitive elements of a single type and are used for one-dimensional data.

Types – integer, numeric, logical, character, complex
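
As a minimal sketch of the types listed above, the following creates one vector of each type with c(). The variable names are illustrative only and do not come from the article.

```r
# One vector per basic type; c() combines values into a vector
ints  <- c(1L, 2L, 3L)         # integer (the L suffix marks integer literals)
nums  <- c(1.5, 2.25, 3)       # numeric (double)
flags <- c(TRUE, FALSE, TRUE)  # logical
words <- c("a", "b", "c")      # character
cplx  <- c(1+2i, 3-1i)         # complex

class(ints)   # "integer"
class(nums)   # "numeric"

# A vector holds a single type, so mixed input is coerced
mixed <- c(1, "a", TRUE)
class(mixed)  # "character"
```

The last two lines show why a vector is described as a container of one type: mixing values silently converts them all to the most general type present.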

b. Matrices in R

Matrices are rectangular collections of elements and are useful when all the data is of a single class, such as numeric or character.

Dimensions – two. Arrays generalize matrices to three or more dimensions.
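
A short sketch of a two-dimensional matrix and a three-dimensional array, assuming small integer sequences as illustrative data:

```r
# matrix() fills column by column unless byrow = TRUE is given
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)    # 2 3
m[2, 3]   # element in row 2, column 3: 6

# array() takes a vector of dimension sizes for more than two dimensions
a <- array(1:24, dim = c(2, 3, 4))
dim(a)    # 2 3 4
```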

c. Lists in R

Lists are ordered containers for arbitrary elements and are used for data with a nested or irregular structure, such as the customer information of an organization. When data cannot be represented as an array or a data frame, a list is the best choice. This is because lists can contain all kinds of other objects, including other lists or data frames, which makes them very flexible.
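
The flexibility described above can be sketched with a small customer record. All names and values here are hypothetical and chosen only for illustration.

```r
# A list mixing a string, a number, a data frame and another list
customer <- list(
  name   = "Asha",
  age    = 34,
  orders = data.frame(id = c(101, 102), amount = c(25.5, 40)),
  tags   = list("retail", "online")
)

customer$name                 # access an element by name
customer[["orders"]]$amount   # reach into the nested data frame

customer$city <- "Pune"       # add a new element
length(customer)              # 5

# c() merges two lists into one
combined <- c(customer, list(active = TRUE))
length(combined)              # 6
```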

d. Data frames

Data frames are two-dimensional containers for records (rows) and variables (columns) and are used for representing data from spreadsheets and similar sources. A data frame is similar to a single table in a database.
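
As a small illustration, a data frame can be built from equal-length vectors, one per column. The column names and values below are made up for the example.

```r
# Each argument becomes a column; all columns must have the same length
df <- data.frame(
  id    = 1:3,
  name  = c("a", "b", "c"),
  score = c(90.5, 72, 88),
  stringsAsFactors = FALSE   # keep text columns as character
)

str(df)          # compact summary of columns and their types
nrow(df)         # 3 records
ncol(df)         # 3 variables
mean(df$score)   # 83.5
```

Unlike a matrix, each column can have its own type, which is what makes a data frame resemble a database table.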

Creating Subsets of Data in R

Data sets keep growing, and analyzing the complete data can be very time-consuming. The data is therefore often divided into smaller samples, and the analysis is done on those samples. The process of selecting part of the data is called subsetting.

Different methods of subsetting in R are:

a. $

The dollar sign operator selects a single element of data by name. When you use this operator to select a column of a data frame, the result is a vector rather than a data frame.

b. [[

Similar to $, the double square brackets operator in R also returns a single element, but it offers the flexibility of referring to the element by position as well as by name. It can be used for data frames and lists.

c. [

The single square bracket operator in R returns multiple elements of data. The index within the square brackets can be a numeric vector, a logical vector, or a character vector.

For example, to retrieve the first 5 rows and all columns of the built-in data set iris, the command below is used:

> iris[1:5, ]
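
To compare the three operators side by side, here is a brief sketch on the built-in iris data set. The variable names are illustrative.

```r
col_dollar <- iris$Sepal.Length        # numeric vector, selected by name
col_double <- iris[["Sepal.Length"]]   # the same vector, selected by name
col_pos    <- iris[[1]]                # the same vector, selected by position
sub_df     <- iris["Sepal.Length"]     # single bracket keeps a data frame

identical(col_dollar, col_double)      # TRUE
class(sub_df)                          # "data.frame"

# A logical vector picks rows; a character vector picks columns
wide <- iris[iris$Sepal.Width > 4, c("Sepal.Width", "Species")]
```

The key difference is the return type: $ and [[ extract the column itself, while [ returns a smaller object of the same kind as the original.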

sample() command in R

As we have seen, samples are created from data for analysis. To create a sample, the sample() command is used, and the number of values to be drawn is passed as an argument.

For example, to simulate 10 rolls of a die, the command below is used:

> sample(1:6, 10, replace=TRUE)

It gives output as:

[1] 2 2 5 3 5 3 5 6 3 5
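
The same command can also draw a random subset of rows from a data frame. The sketch below assumes the built-in iris data set and an arbitrary sample size of 10; no output is shown because the selected rows differ from run to run.

```r
n <- 10                           # illustrative sample size
rows <- sample(nrow(iris), n)     # n distinct row numbers, without replacement
iris_sample <- iris[rows, ]       # keep only the sampled rows, all columns
nrow(iris_sample)                 # 10
```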

Because sample() draws random values, it returns a different result each time it runs. This is inconvenient when testing code, where results need to be repeatable. If a seed value is set first with set.seed(), the sample() command produces the same sequence of random values on every run.

The seed value is the starting point for the random number generator. It determines both the initialization of the generator and the sequence of values that follows.

Let us see how seed value is used.

> set.seed(1)  # set the seed value before calling sample()
> sample(1:6, 10, replace=TRUE)

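This gives output such as the following. The exact values depend on the R version, because the default sampling algorithm changed in R 3.6.0: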

[1] 2 3 4 6 2 6 6 4 4 1
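
A quick way to check that a seed makes sample() repeatable is to reset the same seed and compare two draws. The seed value 42 below is arbitrary.

```r
set.seed(42)
first <- sample(1:6, 10, replace = TRUE)

set.seed(42)                      # reset to the same starting point
second <- sample(1:6, 10, replace = TRUE)

identical(first, second)          # TRUE: same seed, same sequence
```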

Applications of Subsetting Data

Let us now see a few applications of subsetting data in R:

a. Duplicate data can be removed during analysis using the duplicated() function in R

The duplicated() function finds duplicate values and returns a logical vector that tells you whether each value is a duplicate of a previous value. The command below shows how to find duplicate data:

> duplicated(c(1,2,1,3,1,4))

This gives output as below:

[1] FALSE FALSE TRUE FALSE TRUE FALSE

TRUE is returned for every value that repeats an earlier value; the first occurrence of each value is marked FALSE.
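
To actually remove the duplicates, the logical vector can be negated and used as a subset index. This is a minimal sketch; the data frame is a made-up example.

```r
x <- c(1, 2, 1, 3, 1, 4)
x[!duplicated(x)]          # 1 2 3 4: keeps the first occurrence of each value
unique(x)                  # 1 2 3 4: shortcut with the same result

# duplicated() also works row-wise on a data frame
df <- data.frame(id = c(1, 2, 1), value = c("a", "b", "a"))
deduped <- df[!duplicated(df), ]
nrow(deduped)              # 2: the third row repeats the first
```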

b. Missing data can be identified using the complete.cases() function in R

During analysis, any row with missing data can be identified and removed as shown below.

The complete.cases() command in R is used to find rows which are complete. It returns a logical vector with the value TRUE for rows that are complete and FALSE for rows that have one or more NA values.

Rows which have NA values can be removed using the na.omit() function as below, where file_name is the data frame to clean and row_name receives the rows without missing values:

> row_name <- na.omit(file_name)
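
As a concrete sketch, the following applies both functions to a small data frame with missing values. The column names and numbers are invented for the example.

```r
df <- data.frame(
  id     = 1:4,
  height = c(1.7, NA, 1.6, 1.8),
  weight = c(65, 70, NA, 80)
)

ok <- complete.cases(df)   # TRUE FALSE FALSE TRUE
df[ok, ]                   # keep only the complete rows

cleaned <- na.omit(df)     # same rows, removed in one step
nrow(cleaned)              # 2
```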

Adding Calculated Fields to Data

After you have created the appropriate subset of your data, the next step in your analysis is to perform some calculations. R makes it easy to perform calculations on columns of a data frame because each column is itself a vector.

Let us see data manipulation in R with the help of an example: calculating the ratio between the length and width of the sepals in the iris data set.

The commands for this are:

> x <- iris$Sepal.Length / iris$Sepal.Width
> head(x)  # display the first six elements of the result

It gives the output as:

[1] 1.457143 1.633333 1.468750 1.483871 1.388889 1.384615
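
To store such a result as a calculated field, it can be assigned to a new column of the data frame. The sketch below works on a copy of iris; the names flowers, Sepal.Ratio and Petal.Ratio are illustrative and not part of the data set.

```r
flowers <- iris   # copy, so the built-in data set stays unchanged

# Assigning to a new name with $ adds the column
flowers$Sepal.Ratio <- flowers$Sepal.Length / flowers$Sepal.Width

# transform() is an alternative that refers to columns directly
flowers <- transform(flowers, Petal.Ratio = Petal.Length / Petal.Width)

head(flowers[, c("Sepal.Length", "Sepal.Width", "Sepal.Ratio")])
```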
