Importance of Long Format Data

Importance of Long Format Data

Often in the world of data science, there is the need to reshape data. Nearly every data analytics tool and language (SAS, Python, Stata, SQL, Matlab, R, Julia, SPSS, other) maintains supported methods to transform data from wide to long and long to wide. However, newcomers are unaware of best practices in data format. Usually, they work with how raw data originally derived.

Generally, in most data analytics needs, long-formatted data is preferred as the more efficient, maintainable, and scalable format for operational work.

Consider the following example of wide format containing yearly average spot prices of precious metals.

wide formatted table

Notice several limitations of this format:

  • Inability to use precious metal indicators for any analysis since the data elements reside in column headers.
  • Calculations like sum, mean, min, max, etc. must be repeated in loop form across all needed columns.
  • To add other metals into data, requires restructuring the data set with new column(s) rather than the inexpensive addition of new rows.
  • Treats a data set or data frame as a matrix which limits the usefulness of diversity in data types, required for modeling and plotting purposes.

However, in long format as shown below

long formatted table

Data is much more manageable and maintainable to run various operations such as in R:

  • Merge/Append
properties_df <- data.frame(
  metal = c("gold", "silver", "platinum", "palladium", "rhodium"),
  atomic_number = c(79, 47, 78, 46, 45),
  symbol = c("Au", "Ag", "Pt" , "Pd", "Rh"),
  atomic_weight = c(196.967, 107.868, 195.09, 106.40, 102.905),
  melting_point = c(1063, 960.8, 1769, 1554.9, 1966),
  boiling_point = c(2966, 2212, 3827, 2963, 3727)
)


merged_df <- merge(metals_long_df, properties_df, by="metal")
merge output table
  • Aggregation
agg_df <- do.call(data.frame, 
                  aggregate(avg_spot_price ~ metal, metals_long_df, function(x) 
                            c(min=min(x), 
                              median=median(x), 
                              mean=mean(x), 
                              max=max(x),
                              sd=sd(x)))
)

aggregation output table
  • Testing/Modeling
corr_matrix <- cor(with(metals_long_df, 
                        tapply(avg_price, list(year, metal), sum)))
correlation output table
  • Plotting
boxplot(avg_price ~ metal, metals_long_df, col=seaborn_palette,
        main = paste0("Precious Metals Yearly Average\n",
                      "Price Distribution, 2011 - 2020"))
box plot image


Zaynaib (Ola) Giwa

I'm a data analyst who enjoys using R, Python and JavaScript to tell interesting stories and to develop tools

5 年

This was a good read. Do you mind if I link your article to my own blog post on the subject?

回复

要查看或添加评论,请登录

Parfait Gasana的更多文章

  • Relational Databases in Data Science

    Relational Databases in Data Science

    In the age of big data, data scientists should leverage relational databases in their workflow. Doing so, analysts can…

    2 条评论
  • Vectorized Functions in R and Python

    Vectorized Functions in R and Python

    Data analytics tools such as the popular open source languages, R and Python, often have nuanced functions and…

    1 条评论

社区洞察

其他会员也浏览了