Vectorized Functions in R and Python

Parfait Gasana

Research Analyst

Published: November 4, 2016

Data analytics tools such as the popular open source languages R and Python often have nuanced functions and procedures that provide efficiency, scalability, and maintainability in application code. Among these are vectorized methods, which let analysts manipulate large serialized objects in single calls.

R

A recent StackOverflow R question by @commissar-vasili-karlovic shows the familiar path newcomers to R take in multi-step procedures: nested for loops (available in most general-purpose languages) build large data objects iteratively by row and column indexing, as shown below:

a1 <- data.frame(); b1 <- data.frame(); rs <- data.frame()

k <- ncol(company)/levels
l <- 1 - k
for (j in 1:levels) {
  l <- l + k
  k <- k + k
  for (i in l:k) {
    mod <- lm(company[,i] ~ benchmark[,j])

    a1[i,j] <- mod$coefficients[1]
    b1[i,j] <- mod$coefficients[2]
    rs[i,j] <- summary(mod)$adj.r.squared
  }
}

table <- data.frame('Alpha_Coef' = a1, 'Beta_Coef' = b1, 
                    'Adj.R_Squared' = rs)

My answer (@parfait) suggested using vectorized functions such as Map(), lapply(), and do.call() to manipulate entire objects in single calls:

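# Fit one regression and return its coefficients and adjusted R-squared.
# x and y are column names (or indices) into company and benchmark.
reghandle <- function(x, y){
    mod <- lm(company[[x]] ~ benchmark[[y]])

    return(list(Alpha_coef = unname(mod$coefficients[1]),
                Beta_coef = unname(mod$coefficients[2]),
                Adj.R_Squared = unname(summary(mod)$adj.r.squared)))
}

# Map() pairs each company column with the matching entry of benchmarknames
# (assumed to be defined beforehand, one entry per company column).
tablelist <- Map(reghandle, names(company), benchmarknames)
# Bind all per-model rows into one data frame in a single call.
table <- do.call(rbind, lapply(tablelist, data.frame))
table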
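
For readers who think in Python first, the same map-then-bind pattern can be sketched as below. This is only an illustration under stated assumptions: the company, benchmark, and benchmark_names objects are small hypothetical stand-ins, and np.polyfit is swapped in for R's lm(), with the adjusted R-squared computed by hand for a single predictor.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the company and benchmark data frames.
benchmark = pd.DataFrame({"b1": [1.0, 2.0, 3.0, 4.0, 5.0]})
company = pd.DataFrame({
    "c1": [3.0, 5.0, 7.0, 9.0, 11.0],  # exactly 2 * b1 + 1
    "c2": [2.0, 1.5, 3.5, 3.0, 5.0],
})
benchmark_names = ["b1", "b1"]  # one benchmark column per company column


def reg_handle(x, y):
    """Fit company[x] ~ benchmark[y] and return one row of results."""
    xs = benchmark[y].to_numpy()
    ys = company[x].to_numpy()
    beta, alpha = np.polyfit(xs, ys, 1)  # slope first, then intercept
    residuals = ys - (alpha + beta * xs)
    r2 = 1 - (residuals ** 2).sum() / ((ys - ys.mean()) ** 2).sum()
    n = len(ys)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)  # one predictor plus intercept
    return {"Alpha_coef": alpha, "Beta_coef": beta, "Adj_R_Squared": adj_r2}


# Map the function over the paired names, then bind all rows in one call.
rows = list(map(reg_handle, company.columns, benchmark_names))
table = pd.DataFrame(rows, index=company.columns)
print(table)
```

As in the R version, no result object is grown inside a loop: each model yields one row, and the table is assembled once at the end.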

Python

In another example, from a StackOverflow Python pandas question, @patthebug iteratively appends rows from a list source to a pandas dataframe in a for loop (again reminiscent of most programming languages):

finalResults = pd.DataFrame({'Concept1': itemsets_dct[0][0][0], 
                             'Concept2': itemsets_dct[0][0][1], 
                             'Concept3': itemsets_dct[0][0][2], 
                             'Concept4': itemsets_dct[0][0][3], 
                             'Count': itemsets_dct[0][1]}, index=[0])

for i in range(1,len(itemsets_dct)):
    tempResult = pd.DataFrame({'Concept1': itemsets_dct[i][0][0], 
                               'Concept2': itemsets_dct[i][0][1], 
                               'Concept3': itemsets_dct[i][0][2], 
                               'Concept4': itemsets_dct[i][0][3], 
                               'Count': itemsets_dct[i][1]}, index=[i])
    finalResults.append(tempResult)

My answer (@parfait) suggested converting the nested list into a list of dictionaries using a list comprehension, which is then cast to a dataframe in one call:

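import pandas as pd

# Build one dictionary per row; i[0] holds the four concepts, i[1] the count.
dfDict = [{'Concept1': i[0][0],
           'Concept2': i[0][1],
           'Concept3': i[0][2],
           'Concept4': i[0][3],
           'Count': i[1]} for i in itemsets_dct]

# Cast the whole list to a dataframe in a single call.
finalResults = pd.DataFrame(dfDict)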
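
To try the approach end to end, here is a minimal self-contained sketch. The contents of itemsets_dct are hypothetical sample data shaped like the nested list in the question (each element holding four concepts and a count), and the columns helper list is introduced only for illustration.

```python
import pandas as pd

# Hypothetical sample shaped like the nested list in the question:
# each element is ([four concepts], count).
itemsets_dct = [
    (["a", "b", "c", "d"], 10),
    (["e", "f", "g", "h"], 7),
    (["i", "j", "k", "l"], 3),
]

columns = ["Concept1", "Concept2", "Concept3", "Concept4"]

# Build every row as a plain dictionary first...
rows = [
    {**dict(zip(columns, concepts)), "Count": count}
    for concepts, count in itemsets_dct
]

# ...then construct the data frame in a single call.
final_results = pd.DataFrame(rows)
print(final_results)
```

One practical reason to prefer this form: DataFrame.append returned a new object rather than modifying the frame in place, so a loop that discards its return value never grows the frame, and the method was removed altogether in pandas 2.0.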
