Importance of Long Format Data
Often in the world of data science, there is the need to reshape data. Nearly every data analytics tool and language (SAS, Python, Stata, SQL, Matlab, R, Julia, SPSS, other) maintains supported methods to transform data from wide to long and long to wide. However, newcomers are unaware of best practices in data format. Usually, they work with how raw data originally derived.
Generally, in most data analytics needs, long-formatted data is preferred as the more efficient, maintainable, and scalable format for operational work.
Consider the following example of wide format containing yearly average spot prices of precious metals.
Notice several limitations of this format:
- Inability to use precious metal indicators for any analysis since the data elements reside in column headers.
- Calculations like sum, mean, min, max, etc. must be repeated in loop form across all needed columns.
- To add other metals into data, requires restructuring the data set with new column(s) rather than the inexpensive addition of new rows.
- Treats a data set or data frame as a matrix which limits the usefulness of diversity in data types, required for modeling and plotting purposes.
However, in long format as shown below
Data is much more manageable and maintainable to run various operations such as in R:
- Merge/Append
properties_df <- data.frame( metal = c("gold", "silver", "platinum", "palladium", "rhodium"), atomic_number = c(79, 47, 78, 46, 45), symbol = c("Au", "Ag", "Pt" , "Pd", "Rh"), atomic_weight = c(196.967, 107.868, 195.09, 106.40, 102.905), melting_point = c(1063, 960.8, 1769, 1554.9, 1966), boiling_point = c(2966, 2212, 3827, 2963, 3727) ) merged_df <- merge(metals_long_df, properties_df, by="metal")
- Aggregation
agg_df <- do.call(data.frame, aggregate(avg_spot_price ~ metal, metals_long_df, function(x) c(min=min(x), median=median(x), mean=mean(x), max=max(x), sd=sd(x))) )
- Testing/Modeling
corr_matrix <- cor(with(metals_long_df, tapply(avg_price, list(year, metal), sum)))
- Plotting
boxplot(avg_price ~ metal, metals_long_df, col=seaborn_palette, main = paste0("Precious Metals Yearly Average\n", "Price Distribution, 2011 - 2020"))
Parfait Gasana?Long time friend!?
I'm a data analyst who enjoys using R, Python and JavaScript to tell interesting stories and to develop tools
5 年This was a good read. Do you mind if I link your article to my own blog post on the subject?