
Python 3.7 - Join the rows that share the same ID in a DataFrame

Marco Antonio Pereira, MBA

Data Architect @ Tekever | Azure Certified and Specialist | AWS Certified | Google Cloud Platform (GCP) Certified | Databricks Certified | Scrum Certified

Published: May 29, 2019

I take part in a few social media groups and keep running into the same question from people who are getting started with Python: "I have a DataFrame with repeated IDs and I want to gather all the rows with the same ID into a single row/column. How do I do that?"

First we need to understand what a DataFrame is. A DataFrame is a data structure in which data is stored in rows and columns, and the columns can hold different types, similar to a matrix. This gives us functions that operate on rows, columns, keys, values, and so on.
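As a minimal sketch of that idea, the snippet below builds a tiny two-column DataFrame and reads back its columns, one row, and its shape. The variable names (df, column_names, first_row) are illustrative only and are not used elsewhere in this article.

```python
import pandas as pd

# A small DataFrame: one integer column and one string column
df = pd.DataFrame({'id': [1, 2], 'name': ['marco', 'maria']})

column_names = list(df.columns)   # column labels
first_row = df.iloc[0].to_dict()  # one row, as a column -> value mapping
n_rows, n_cols = df.shape         # (number of rows, number of columns)

print(column_names)
print(first_row)
print(n_rows, n_cols)
```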

We are going to use Pandas:

import pandas as pd

Now that we have imported the Pandas library, let's create a sample DataFrame (I am assuming you already know Python's collections):

data = {'id': [1, 2, 3, 1, 2, 1], 'name': ['marco', 'maria', 'john', 'antonio', 'jack', 'pereira']}

dfInput = pd.DataFrame(data=data)

What do we get when we print dfInput?

We used a dictionary to create a DataFrame with two columns, id and name. Notice that some IDs are repeated. The user will probably want the rows with the same ID gathered into a single row/column, since together they make up a user's full name.
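To make the repetition visible, here is a small sketch that rebuilds the same sample DataFrame, prints it, and counts how many rows share each ID. The use of value_counts and the name id_counts are my additions for illustration, not part of the original walkthrough.

```python
import pandas as pd

data = {'id': [1, 2, 3, 1, 2, 1],
        'name': ['marco', 'maria', 'john', 'antonio', 'jack', 'pereira']}
dfInput = pd.DataFrame(data=data)

# Six rows, two columns; ids 1 and 2 appear more than once
print(dfInput)

# Number of rows per id, ordered by id
id_counts = dfInput['id'].value_counts().sort_index()
print(id_counts)
```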

Now we loop over the rows and reduce the repeated keys to a single key. If a key appears only once, no problem: it is simply stored on its own. In Python 2.7 we had the dict.has_key(key) method, which returned True if the key existed in the dictionary and False otherwise. That method was removed in Python 3, so it is not available in Python 3.7. Instead we can use dict.__contains__(key), which behaves the same way as has_key and already existed in Python 2.7; it is the method that the more idiomatic expression `key in dict` calls. Here we check whether the repeated key already exists in the dictionary and, if it does, we append the new value to what is already stored under that key.
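A quick sketch of the membership test on its own, using a throwaway dictionary d (an illustrative name). It shows that the `in` operator and __contains__ agree, and that has_key is no longer an attribute of dict on Python 3.

```python
d = {1: 'marco'}

in_present = 1 in d                  # key exists
in_missing = 2 in d                  # key does not exist
dunder_present = d.__contains__(1)   # same check, spelled as a method call
has_has_key = hasattr(d, 'has_key')  # dict.has_key is gone in Python 3

print(in_present, in_missing, dunder_present, has_has_key)
```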

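d1 = {}
# Each row of dfInput.values is an [id, name] pair
for x, y in dfInput.values:
    # `x in d1` is the idiomatic spelling of d1.__contains__(x)
    if x in d1:
        # ID already seen: append this name to the stored string
        d1[x] = d1[x] + ', ' + y
    else:
        # First time this ID appears
        d1[x] = y
print(d1)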

Printing the dict d1 shows one entry per ID, with the names joined by commas: {1: 'marco, antonio, pereira', 2: 'maria, jack', 3: 'john'}
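The same grouping can also be sketched without an explicit membership test. This variant uses collections.defaultdict from the standard library, which is a technique I am swapping in, not the one used in this article. It collects the names in a list per ID and joins them once at the end. The names rows, names_by_id, and d1_alt are illustrative, and rows stands in for the (id, name) pairs of dfInput.

```python
from collections import defaultdict

# Stand-in for the (id, name) pairs of the sample DataFrame
rows = [(1, 'marco'), (2, 'maria'), (3, 'john'),
        (1, 'antonio'), (2, 'jack'), (1, 'pereira')]

# Missing keys start as an empty list, so no "is the key there?" check is needed
names_by_id = defaultdict(list)
for key, name in rows:
    names_by_id[key].append(name)

# Join each list of names into a single comma-separated string
d1_alt = {key: ', '.join(names) for key, names in names_by_id.items()}
print(d1_alt)
```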

Now let's turn the dict d1, which was built from the DataFrame dfInput, into a new DataFrame.

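# Use ids/names rather than id, which would shadow the built-in id()
ids = []
names = []
for x, y in d1.items():
    ids.append(x)
    names.append(y)
# Build the column-oriented dict once, after the loop
d2 = {'id': ids, 'name': names}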

What do we get if we print the new dict d2? One list per column: {'id': [1, 2, 3], 'name': ['marco, antonio, pereira', 'maria, jack', 'john']}

Now let's build the new DataFrame with Pandas:

dfOutput=pd.DataFrame(data=d2)

Now we have a new, consolidated DataFrame with two columns.
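If the intermediate dict d2 is not needed, the key/value pairs of d1 can be passed to the DataFrame constructor directly. This is a shortcut I am suggesting, not a step from the walkthrough above, and d1 is written out literally here so that the snippet runs on its own.

```python
import pandas as pd

# The consolidated dict from the previous step, written out literally
d1 = {1: 'marco, antonio, pereira', 2: 'maria, jack', 3: 'john'}

# Each (key, value) pair becomes one row; columns= names the two columns
dfOutput = pd.DataFrame(list(d1.items()), columns=['id', 'name'])
print(dfOutput)
```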

Instead of building the first DataFrame by hand, we could have loaded a CSV/TXT file into a DataFrame and applied the same processing afterwards. This is not the only way to get this result; there are many other ways to solve the same problem.
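As one such alternative, here is a minimal sketch that reads the data from CSV text and lets Pandas do the grouping with groupby. The CSV content is made up to mirror the sample data, and io.StringIO stands in for a real file path so that the snippet is self-contained. The sort=False argument keeps the IDs in order of first appearance.

```python
import io
import pandas as pd

# Hypothetical CSV content mirroring the sample data; a file path would work the same way
csv_text = "id,name\n1,marco\n2,maria\n3,john\n1,antonio\n2,jack\n1,pereira\n"
dfInput = pd.read_csv(io.StringIO(csv_text))

# Group rows by id and join the names of each group with ', '
dfOutput = (
    dfInput.groupby('id', sort=False)['name']
    .agg(', '.join)
    .reset_index()
)
print(dfOutput)
```

With this approach the intermediate dictionaries are not needed; the trade-off is that the grouping logic is hidden inside Pandas instead of being spelled out in a loop.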

Feedback is welcome!
