
R Data Manipulating and Processing

Malini Shukla

Senior Data Scientist || Hiring || 6M+ impressions || Trainer || Top Data Scientist || Speaker || Top content creator on LinkedIn || Tech Evangelist

Published: February 22, 2018

Manipulating and processing data in R

Data structures determine how data is represented for analysis. In R, we manipulate these structures to prepare data for analysis and visualization.

Before working with data in R, it helps to know how to import data into R and how to export it from R to external formats such as SAS, SPSS, text, or CSV files.

One of R's most important strengths is its ability to manipulate data in preparation for analysis and visualization. Let us look at a few basic data structures in R:


a. Vectors in R

Vectors are ordered containers of primitive elements and are used for one-dimensional data.

Types – integer, numeric, logical, character, complex
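As a minimal sketch of the types listed above, the snippet below builds one vector of each type and checks its class. The variable names are illustrative only and do not come from the article.

```r
# One vector per type listed above; all names here are illustrative.
int_vec <- c(1L, 2L, 3L)         # integer (the L suffix marks an integer literal)
num_vec <- c(1.5, 2.25, 3)       # numeric (double precision)
lgl_vec <- c(TRUE, FALSE, TRUE)  # logical
chr_vec <- c("a", "b", "c")      # character
cpl_vec <- c(1+2i, 3-1i)         # complex

# class() reports the type of each vector
vec_classes <- sapply(list(int_vec, num_vec, lgl_vec, chr_vec, cpl_vec), class)
print(vec_classes)

# A vector holds a single type, so mixing types coerces every element
# to the most general one (here: character)
mixed <- c(1L, 2.5, "x")
print(class(mixed))
```

Because a vector can hold only one type, the mixed example ends up as a character vector rather than keeping three different types.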


b. Matrices in R

Matrices are rectangular collections of elements and are useful when all the data is of a single class, such as numeric or character.

Dimensions – two. An array generalizes the matrix to three or more dimensions.
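The following sketch shows a two-dimensional matrix and, for comparison, a three-dimensional array built with base R's matrix() and array(). The object names and values are made up for illustration.

```r
# A 2 x 3 matrix; matrix() fills column by column by default
m <- matrix(1:6, nrow = 2, ncol = 3)
print(m)
print(dim(m))   # number of rows, then number of columns

# Index with [row, column]
corner <- m[2, 3]
print(corner)

# For three or more dimensions, base R uses array()
a <- array(1:24, dim = c(2, 3, 4))
print(dim(a))
```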


c. Lists in R

Lists are ordered containers for arbitrary elements and are used for data with a mixed or nested structure, such as an organization's customer records. When data cannot be represented as an array or a data frame, a list is the best choice, because lists can contain all kinds of other objects, including other lists and data frames. In that sense, they are very flexible.
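To make the flexibility concrete, here is a small sketch of a customer record stored as a list. The field names and values are hypothetical and chosen only for illustration.

```r
# A hypothetical customer record: each element has a different type,
# and one element is itself a list
customer <- list(
  name    = "Example Customer",
  orders  = c(120.5, 80, 42.75),
  active  = TRUE,
  address = list(city = "Springfield", zip = "00000")
)

# Access elements by name or through nested lists
print(customer$name)
second_order <- customer[["orders"]][2]
print(second_order)
print(customer$address$city)

# Adding a new element is a simple assignment
customer$email <- "customer@example.com"
print(length(customer))
```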


d. Data frames

Data frames are two-dimensional containers for records (rows) and variables (columns) and are used to represent data from spreadsheets and similar sources. A data frame is similar to a single table in a database.
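A minimal sketch of a data frame follows, with each column holding one variable and each row one record. The column names and values are invented for the example.

```r
# Each argument becomes a column; all columns must have the same length
scores <- data.frame(
  id    = 1:3,
  name  = c("Ann", "Ben", "Cal"),
  score = c(90.5, 72, 85),
  stringsAsFactors = FALSE  # keep text as character on older R versions
)

str(scores)           # compact overview of the column types
print(nrow(scores))   # number of records
print(ncol(scores))   # number of variables
```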


Creating Subsets of Data in R

Data sets keep growing, and analyzing a complete data set can be very time-consuming. The data is therefore often divided into smaller samples, and the analysis is run on those samples. Selecting part of a data set in this way is called subsetting.

Different methods of subsetting in R are:

a. $

The dollar sign operator selects a single element of data by name. When you use this operator with a data frame, the result is a single column, which is normally a vector.

b. [[

Like $, the double square brackets operator returns a single element, but it also lets you refer to the element by position rather than by name. It can be used with data frames and lists.

c. [

The single square bracket operator in R returns multiple elements of data. The index within the square brackets can be a numeric vector, a logical vector, or a character vector.

For example, to retrieve the first five rows and all columns of the built-in data set iris, use the command below:

> iris[1:5, ]
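The three operators can be compared side by side on the same built-in iris data set. This is only a sketch; the object names are illustrative.

```r
# $ and [[ both return a single column as a plain vector
by_dollar   <- iris$Sepal.Length
by_name     <- iris[["Sepal.Length"]]
by_position <- iris[[1]]           # Sepal.Length is the first column

# [ with a single column name keeps the data frame structure
one_col_df <- iris["Sepal.Length"]

# [ with row and column indices: numeric rows, character columns
first_rows <- iris[1:5, c("Sepal.Length", "Species")]

# [ with a logical vector keeps only the rows where the condition is TRUE
long_setosa <- iris[iris$Species == "setosa" & iris$Sepal.Length > 5.5, ]

print(first_rows)
print(long_setosa)
```

Using [[ by position is handy in loops, while $ is usually more readable in interactive work.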


sample() command in R

As we have seen, samples are drawn from data for analysis. To draw a sample, use the sample() command and specify how many values to draw.

For example, to simulate 10 rolls of a die, use the command below:

> sample(1:6, 10, replace=TRUE)

It gives output such as the following (the values change on every run):

[1] 2 2 5 3 5 3 5 6 3 5

sample() produces different random values each time it is called. That is a problem in test code, where results need to be repeatable. If you set a seed value first, sample() produces the same sequence of random values on every run.

The seed value is the starting point for the random number generator. It defines both the initial state of the generator and the sequence of values that follows.

Let us see how seed value is used.

> set.seed(1)  # set the seed value used by sample()
> sample(1:6, 10, replace=TRUE)

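This gives output as below. The exact values depend on the R version: R 3.6.0 changed the default sampling algorithm, so newer versions print a different sequence, but one that is still identical on every run with the same seed: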

[1] 2 3 4 6 2 6 6 4 4 1
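The effect of the seed can be checked directly by resetting it and drawing again. The sketch below does not assume any particular output values, since those vary between R versions; it only checks that the two runs match.

```r
# Draw 10 die rolls twice, resetting the seed before each draw
set.seed(1)
first_run <- sample(1:6, 10, replace = TRUE)

set.seed(1)
second_run <- sample(1:6, 10, replace = TRUE)

# Same seed, same call: the two results are identical
print(identical(first_run, second_run))

# Without resetting the seed, the generator continues from its current
# state, so the next draw is a new set of values
third_run <- sample(1:6, 10, replace = TRUE)
print(third_run)
```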

Applications of Subsetting Data

Let us now look at a few applications of subsetting data in R:

a. Duplicate data can be removed during analysis using the duplicated() function in R

The duplicated() function finds duplicate values and returns a logical vector that tells you whether each value is a duplicate of a previous value. The command below shows how to find duplicate data:

> duplicated(c(1,2,1,3,1,4))

This gives output as below:

[1] FALSE FALSE TRUE FALSE TRUE FALSE

TRUE is returned for every value that repeats an earlier value; the first occurrence of each value is marked FALSE.
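Since duplicated() returns a logical vector, negating it inside [ removes the repeats. A short sketch, using the same vector as above plus a small made-up data frame:

```r
x <- c(1, 2, 1, 3, 1, 4)

# Keep only the values that are not duplicates of an earlier value
deduped <- x[!duplicated(x)]
print(deduped)

# unique() is a shortcut that gives the same result for vectors
print(unique(x))

# On a data frame, duplicated() compares whole rows
orders <- data.frame(id = c(1, 2, 2, 3), item = c("a", "b", "b", "c"))
unique_orders <- orders[!duplicated(orders), ]
print(unique_orders)
```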


b. Missing data can be identified using the complete.cases() function in R

During analysis, any row with missing data can be identified and removed as shown below.

The complete.cases() command in R finds the rows that are complete. It returns a logical vector with the value TRUE for rows that are complete and FALSE for rows that have one or more NA values.

Rows that have NA values can be removed using the na.omit() function as below, where file_name is the data frame to clean and row_name receives the result:

> row_name <- na.omit(file_name)
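Both functions can be tried on a small data frame with a few missing values. The data below is invented purely for illustration.

```r
# A made-up data frame with NA values in rows 2 and 3
readings <- data.frame(
  id       = 1:4,
  temp     = c(21.5, NA, 19.0, 22.1),
  humidity = c(40, 55, NA, 48)
)

# TRUE for rows without any NA, FALSE otherwise
is_complete <- complete.cases(readings)
print(is_complete)

# Use the logical vector to keep only the complete rows
complete_rows <- readings[is_complete, ]

# na.omit() drops the incomplete rows in one step
cleaned <- na.omit(readings)
print(cleaned)
```

The subsetting approach is useful when you also want to inspect the incomplete rows, for example with readings[!is_complete, ].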


Adding Calculated Fields to Data

After you have created the appropriate subset of your data, the next step in your analysis is to perform some calculations. R makes it easy to perform calculations on columns of a data frame because each column is itself a vector.

Let us look at data manipulation in R with an example: calculating the ratio between the length and width of the sepals in the iris data set. The command is:

> x <- iris$Sepal.Length / iris$Sepal.Width
> head(x)  # display the first six elements of the result

It gives the output as:

[1] 1.457143 1.633333 1.468750 1.483871 1.388889 1.384615
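To keep the calculated values alongside the original data, the result can be stored as a new column. The sketch below works on a copy of iris; the column name Sepal.Ratio is an illustrative choice, not something defined in the article.

```r
# Work on a copy so the built-in iris data set stays unchanged
iris_ratio <- iris

# Assigning to a new column name adds the calculated field
iris_ratio$Sepal.Ratio <- iris_ratio$Sepal.Length / iris_ratio$Sepal.Width
print(head(iris_ratio[, c("Sepal.Length", "Sepal.Width", "Sepal.Ratio")]))

# transform() from base R does the same in a single call
iris_ratio2 <- transform(iris, Sepal.Ratio = Sepal.Length / Sepal.Width)
```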

