登录查看更多内容

Importance of Long Format Data

Parfait Gasana

Research Analyst

发布日期: 2020年1月24日

Often in the world of data science, there is the need to reshape data. Nearly every data analytics tool and language (SAS, Python, Stata, SQL, Matlab, R, Julia, SPSS, other) maintains supported methods to transform data from wide to long and long to wide. However, newcomers are unaware of best practices in data format. Usually, they work with how raw data originally derived.

Generally, in most data analytics needs, long-formatted data is preferred as the more efficient, maintainable, and scalable format for operational work.

Consider the following example of wide format containing yearly average spot prices of precious metals.

Notice several limitations of this format:

Inability to use precious metal indicators for any analysis since the data elements reside in column headers.
Calculations like sum, mean, min, max, etc. must be repeated in loop form across all needed columns.
To add other metals into data, requires restructuring the data set with new column(s) rather than the inexpensive addition of new rows.
Treats a data set or data frame as a matrix which limits the usefulness of diversity in data types, required for modeling and plotting purposes.

However, in long format as shown below

Data is much more manageable and maintainable to run various operations such as in R:

Merge/Append

properties_df <- data.frame(
  metal = c("gold", "silver", "platinum", "palladium", "rhodium"),
  atomic_number = c(79, 47, 78, 46, 45),
  symbol = c("Au", "Ag", "Pt" , "Pd", "Rh"),
  atomic_weight = c(196.967, 107.868, 195.09, 106.40, 102.905),
  melting_point = c(1063, 960.8, 1769, 1554.9, 1966),
  boiling_point = c(2966, 2212, 3827, 2963, 3727)
)


merged_df <- merge(metals_long_df, properties_df, by="metal")

Aggregation

agg_df <- do.call(data.frame, 
                  aggregate(avg_spot_price ~ metal, metals_long_df, function(x) 
                            c(min=min(x), 
                              median=median(x), 
                              mean=mean(x), 
                              max=max(x),
                              sd=sd(x)))
)

Testing/Modeling

corr_matrix <- cor(with(metals_long_df, 
                        tapply(avg_price, list(year, metal), sum)))

Plotting

boxplot(avg_price ~ metal, metals_long_df, col=seaborn_palette,
        main = paste0("Precious Metals Yearly Average\n",
                      "Price Distribution, 2011 - 2020"))

Kinyetta Nance, Ph.D.

5 年

Parfait Gasana?Long time friend!?

Zaynaib (Ola) Giwa

I'm a data analyst who enjoys using R, Python and JavaScript to tell interesting stories and to develop tools

5 年

This was a good read. Do you mind if I link your article to my own blog post on the subject?

查看更多评论

要查看或添加评论，请登录

Parfait Gasana的更多文章

Relational Databases in Data Science

2019年3月23日

Relational Databases in Data Science

In the age of big data, data scientists should leverage relational databases in their workflow. Doing so, analysts can…

2 条评论
Vectorized Functions in R and Python

2016年11月4日

Vectorized Functions in R and Python

Data analytics tools such as the popular open source languages, R and Python, often have nuanced functions and…

1 条评论

Importance of Long Format Data

Parfait Gasana

Research Analyst

Parfait Gasana的更多文章

社区洞察

其他会员也浏览了

Check Regional Information via Coarsen

Feature Selection for faster analytics

Decision Tree Classifier - Explained

A Reference Notebook for (+30) Statistical Charts in Seaborn

R Packages and Libraries in Data Science

Welcome to The Statistics Revolution

?? Mastering Tree Data Structures: The Backbone of Efficient Algorithms

A l g o r i t h m s a n d ML (M a c h i n e L e a r n i n g ), P y t h o n a n d D e s c r i p t i v e S t a t i s t i c s

Practical Time Series Analysis Review

Poisson distribution and Probability, a data science perspective

Parfait Gasana的更多文章

Relational Databases in Data Science

Vectorized Functions in R and Python

社区洞察

其他会员也浏览了

Check Regional Information via Coarsen

Feature Selection for faster analytics

Decision Tree Classifier - Explained

A Reference Notebook for (+30) Statistical Charts in Seaborn

R Packages and Libraries in Data Science

Welcome to The Statistics Revolution

?? Mastering Tree Data Structures: The Backbone of Efficient Algorithms

A l g o r i t h m s a n d ML (M a c h i n e L e a r n i n g ), P y t h o n a n d D e s c r i p t i v e S t a t i s t i c s

Practical Time Series Analysis Review

Poisson distribution and Probability, a data science perspective