登录查看更多内容

rdata: Read R datasets from Python

pyOpenSci

pyOpenSci creates diverse community around the open source python tools that drive open science.

发布日期: 2024年4月25日

This guest article from Carlos Ramos Carre?o was originally published on the pyOpenSci blog.

In the last years Python has solidified itself as the main language for data science and machine learning. However, the R programming language is also a widely used language in statistics, offering thousands of specialized software packages in this field. As it is common in the R community, many of these packages include accompanying data, often stored as R objects in the .rds and .rda formats.

This variety of datasets is normally inaccessible directly from Python programs. Instead, programmers typically have to open the data in R and manually convert it to another format, such as CSV, which can be understood by Python libraries. This process requires an R interpreter, it is not fast, nor easy, and may even be impossible to do without information loss. In particular, for nested structures and complex or custom R types there is no common file exchange format for Python and R that can be used to this purpose.

The Python package rdata allows Python programmers to read R datasets in the .rds and .rda formats directly, facilitating the sharing of examples and results between the two languages. It is a small, pure Python solution that does not require an R interpreter to work, nor external libraries in other languages. This makes it suitable to be used in most contexts where Python can be found, including the web using Pyodide. The package rdata has been accepted by pyOpenSci as part of its ecosystem in March 2024.

A library for reading R datasets

To illustrate rdata capabilities, lets take the role of a Python developer implementing new statistical techniques. In order to compare her results with prior approaches, she needs to use the same data as them. Unfortunately for her, the people who implemented the original techniques do not have the data in a format directly readable by Python, as they were all statisticians using the R programming language. They are eager to share their data with her, in the form of a .rda file called dataset.rda.

A cursory inspection of the file in R shows the following:

The dataset contains 3 variables, of which just one is a dataframe. Moreover the dataframe contains a column with instances of a custom R class, and a boolean column with some NA values. Thus, converting this data to a format such as CSV to read it in Python would be a difficult endeavor. Luckily, this Python programmer has heard of the rdata Python package, which could be useful to solve this task.

Our programmer quickly installs rdata with pip or conda, and executes the following command in the Python interpreter:

dataset = rdata.read_rda("dataset.rda")

This loads the file into a Python dictionary that she can now explore. She now starts by checking the “model” and “freq” variables:

Benjamin Bennett Alexander 10 个月前

How to Use Python for Machine Learning Projects

Analytics Insight? 4 个月前

Python Fundamentals: The Building Blocks of Python…

Benjamin Bennett Alexander 2 年前

>>> dataset["model"]
array(['NovTech200'], dtype='<U10')

>>> dataset["freq"]
array([1200.])

The original R vectors have been now converted to equivalent NumPy objects with their corresponding dtype. Then, she checks the default conversion of the dataframe itself:

>>> dataset["frame"]
    level                                            subject  control
1       8  namespace(name=array(['Louis'], dtype='<U5'), ...    False
2       4  namespace(name=array(['Martin'], dtype='<U6'),...     <NA>
3       6  namespace(name=array(['Miriam'], dtype='<U6'),...     True
4       3  namespace(name=array(['Yoshi'], dtype='<U5'), ...    False
5       9  namespace(name=array(['John'], dtype='<U4'), a...     True
6       2  namespace(name=array(['Karl'], dtype='<U4'), a...    False
7       7  namespace(name=array(['Eusebio'], dtype='<U7')...    False
8      10  namespace(name=array(['Rosa'], dtype='<U4'), a...    False
9       1  namespace(name=array(['Lilly'], dtype='<U5'), ...     <NA>
10      5  namespace(name=array(['Mark'], dtype='<U4'), a...     <NA>

Support for custom classes

The dataframe has been converted by default to a Pandas dataframe, and the NA have been preserved. However, there is a problem with the subject column: as it contains instances of a custom R class, the default conversion did not know to which class it should be mapped in Python, returning a SimpleNamespace instead.

Our programmer has developed a Python class, Person, which she plans to use in her code. Instead of modifying the dataframe to manually convert the instances, she uses the following code to load the data applying the desired transformations:

def person_constructor(obj, attrs):
	return Person(name=str(obj.name[0]), age=int(obj.age[0]))

constructor_dict = {
    **rdata.conversion.DEFAULT_CLASS_MAP,
    "Person": person_constructor,
}

dataset = rdata.read_rda(
    "dataset.rda",
    constructor_dict=constructor_dict,
)

This code defines a “constructor” of a Python object and assigns it to a dictionary of constructors, using “Person” as the key. During the R to Python conversion process, whenever an object is found with a class attribute of “Person”, the constructor function will be called with the partially converted object and its attributes, and the return value will be used as the converted value of that object. The mapping rdata.conversion.DEFAULT_CLASS_MAP contains a default set of constructors, including the one for converting R dataframes into Pandas dataframes. Now, showing the dataset again, we can verify that every instance of the original custom class has been replaced with its Python counterpart:

>>> dataset["frame"]
    level                         subject  control
1       8    Person(name='Louis', age=57)    False
2       4   Person(name='Martin', age=22)     <NA>
3       6   Person(name='Miriam', age=41)     True
4       3    Person(name='Yoshi', age=29)    False
5       9     Person(name='John', age=63)     True
6       2     Person(name='Karl', age=34)    False
7       7  Person(name='Eusebio', age=77)    False
8      10     Person(name='Rosa', age=33)    False
9       1    Person(name='Lilly', age=28)     <NA>
10      5     Person(name='Mark', age=30)     <NA>

What’s next?

rdata is under active development, and new functionalities will be added as users need them. Contributions are welcome in the Github repository!

pyOpenSci Newsletter

441 位关注者

Aaron Ezeala

Software Developer | ?? Python, JavaScript ??, Go?? | Django, Git, Google Cloud ?? , HTML, CSS ?? |Engineering??? scalable software solutions with finesse ??

7 个月

Great article! ?? The `rdata` package facilitates seamless access to R datasets from Python, streamlining workflows for data scientists. The examples provided illustrate its ease of use and the importance of interoperability between Python and R. The mention of `rioxarray` extending functionality for raster datasets adds further value. Overall, a valuable resource for enhancing productivity and collaboration across languages. ??

1 次回应

要查看或添加评论，请登录

rdata: Read R datasets from Python

pyOpenSci

pyOpenSci creates diverse community around the open source python tools that drive open science.

A library for reading R datasets

领英推荐

Support for custom classes

What’s next?

pyOpenSci Newsletter

441 位关注者

更多精彩文章

社区洞察

其他会员也浏览了

How to implement the linked list in Python - NareshIT

How to Get the Index of an Item in a List in Python

What is the use of self in Python - NareshIT

02. Unleashing the Power of Python Strings: From Basics to Advanced Manipulation

Beyond Data Types: Demystifying Mutable and Immutable Objects for Python Mastery

What is string formatting in python?- A Beginner's Guide

An Introduction to Python Data Types and Built-in Functions

Choropleth maps using Plotly in Python

Importance of Python in the Realms of Data Analytics

Python For Artificial Intelligence

A library for reading R datasets

领英推荐

Support for custom classes

What’s next?

pyOpenSci Newsletter

441 位关注者

Why Python package health metrics are a priority for pyOpenSci and other open source communities

2024年9月19日

Why should Python open source package health matter to scientists? (and to you!)

2024年9月12日

It’s Been a Long Short Road: The Monumental Past 2 Years of pyOpenSci

2024年9月5日

?? Announcing the pyOpenSci Fall Festival!

2024年8月22日

pyOpenSci @ SciPy 2024: Python Packaging Tutorials, Talks and Community ??

2024年8月15日

pyOpenSci Community News: August 2024

2024年8月8日

pyOpenSci beginner-friendly sprints at PyCon US 2024

2024年8月1日

automata: Simulation and manipulation

2024年7月18日

Friends don’t let friends package alone- my Python packaging talk at PyCon 2024

2024年6月27日

SLEPLET: Slepian Scale-Discretised Wavelets in Python

2024年6月20日

社区洞察

其他会员也浏览了

How to implement the linked list in Python - NareshIT

How to Get the Index of an Item in a List in Python

What is the use of self in Python - NareshIT

02. Unleashing the Power of Python Strings: From Basics to Advanced Manipulation

Beyond Data Types: Demystifying Mutable and Immutable Objects for Python Mastery

What is string formatting in python?- A Beginner's Guide

An Introduction to Python Data Types and Built-in Functions

Choropleth maps using Plotly in Python

Importance of Python in the Realms of Data Analytics

Python For Artificial Intelligence