How to create a simple and powerful data pipeline in Google Cloud.
Rodolfo Marcos
Sr Data / Full-Stack Engineer | Tech Writer | Google Cloud Certified | AWS @ Freestar
In this post, we'll learn how to create a simple and powerful data pipeline using only Google Cloud managed services. You'll only need basic Python and SQL to build it!
There are currently many tools for building data pipelines. Some are fully customizable, others are fully managed and expensive, and others require expertise in a specific language.
Whichever option you pick from the market, choosing a cloud solution puts you on the path to success. Data and AI work are closely tied to the cloud nowadays because it is scalable, fast, and, if well designed, cost-effective.
Simplicity, carried to an extreme, becomes elegance.
The solution described in this article can be used as a starting point for complex data pipelines or to solve specific ELT jobs. There are no dataframes or collections here; you can start with basic Python and SQL in Google Cloud.
Problem to be solved
Suppose you have to ingest employee information into a BigQuery table. Every time a .csv file is placed in a Cloud Storage bucket, you want it to automatically upload the information to the final table and process it according to rules based on the employee ID.
id,name,birth date,occupation,gender
1,John,05/23/1985,Doctor,M
2,Mary,09/12/1992,Engineer,F
3,Joe,02/08/1978,Lawyer,M
4,Anna,07/31/1989,Programmer,F
5,Paul,11/17/1996,Student,M
6,Renata,04/02/1981,Nurse,F
7,Luke,08/19/1993,Architect,M
8,Fernanda,12/27/1975,Psychologist,F
9,Gus,03/14/1990,Veterinarian,M
10,Julie,06/21/1982,Programmer,F
11,Raphael,10/09/1995,Student,M
12,Carol,01/25/1987,Journalist,F
13,Matthew,05/07/1979,Entrepreneur,M
14,Bruna,08/01/1994,Designer,F
15,Fabio,12/18/1986,Pharmacist,M
16,Camille,02/06/1980,Teacher,F
17,Andrew,06/22/1991,Engineer,M
18,Maryann,11/13/1983,Psychiatrist,F
19,Pete,03/30/1998,Student,M
20,Luana,07/15/1976,Doctor,F
Suppose, for instance, we upload a file with one line changed (employee id 20).
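For illustration, assuming the new occupation is Teacher (any value works; this row is hypothetical), the updated line could look like:
20,Luana,07/15/1976,Teacher,F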
Notice that only the occupation changed, so we need to update that specific employee's row to reflect it. That is our pipeline's goal: when a file is loaded, update existing employees' information and, if an employee doesn't exist yet, create an entry for them.
Benefits of the solution
Implementation
4. Hit "Add Eventarc Trigger". Set the event provider to "Cloud Storage" and the event to "google.cloud.storage.object.v1.finalized", then choose the input file bucket and hit "Save trigger". Now, every time a file is created inside the bucket, the Cloud Function runs. A pop-up may ask you to enable some APIs; hit "Enable". Finally, hit the "Next" button.
5. Under runtime, select "Python 3.11" and set the entry point to "main". Then, in the left panel, create each of the files listed; you can find the code to copy and paste below. Replace the Google Cloud project ID and BigQuery dataset where necessary.
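As a minimal sketch, assuming the function loads the CSV into a staging table and then runs the statement kept in merge.sql, main.py can look roughly like this (the project ID, dataset, and table names are placeholders to replace with your own, and requirements.txt would need functions-framework and google-cloud-bigquery):

import functions_framework
from google.cloud import bigquery

# Placeholders: replace with your own Google Cloud project ID and BigQuery dataset.
PROJECT_ID = "your-project-id"
DATASET = "your_dataset"
STAGING_TABLE = f"{PROJECT_ID}.{DATASET}.employees_staging"

@functions_framework.cloud_event
def main(cloud_event):
    # Triggered by the google.cloud.storage.object.v1.finalized event.
    bucket = cloud_event.data["bucket"]
    file_name = cloud_event.data["name"]
    uri = f"gs://{bucket}/{file_name}"

    client = bigquery.Client(project=PROJECT_ID)

    # Load the CSV into a staging table, overwriting it on every run.
    load_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    client.load_table_from_uri(uri, STAGING_TABLE, job_config=load_config).result()

    # Run the MERGE statement from merge.sql to upsert into the final table.
    with open("merge.sql") as sql_file:
        client.query(sql_file.read()).result()

    print(f"Processed {uri}")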
When you finish, hit "Deploy". Deployment takes a few minutes to complete.
Replace the variables at lines 10, 11, and 46 in main.py. Before testing, let's discuss some points in the code.
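As a rough sketch, assuming a final table named employees keyed on id and the staging table from the snippet above, merge.sql can look like this (table and column names are illustrative and should match your own schema):

-- Upsert staging rows into the final table, keyed on employee id.
MERGE `your-project-id.your_dataset.employees` T
USING `your-project-id.your_dataset.employees_staging` S
ON T.id = S.id
WHEN MATCHED THEN
  UPDATE SET
    name = S.name,
    birth_date = S.birth_date,
    occupation = S.occupation,
    gender = S.gender
WHEN NOT MATCHED THEN
  INSERT (id, name, birth_date, occupation, gender)
  VALUES (S.id, S.name, S.birth_date, S.occupation, S.gender);

The MERGE is what keeps the final table deduplicated: matching IDs are updated in place, and unknown IDs are inserted as new rows.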
Testing
Open your Cloud Function in the Google Cloud console, go to the "LOGS" section, and keep that tab open.
Now, open another browser tab, open your Cloud Storage bucket, and drop a .csv file there (you can download the example .csv from this article here; remember to name it in the right format, for example sample_20230101.csv, before loading it into Cloud Storage).
Once you place the file, the Cloud Function execution begins. Switch back to the logs tab and follow along until it finishes, then open BigQuery and check your final table. If it's the first time you run the script, the table is created from scratch with the initial records.
Now, if you change your employees' CSV file by editing a value in a random column and upload it again, you should see that just that column is updated for the given employee ID. If you include a new employee with a non-existent employee ID, a new row is created in the table.
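To double-check a single record, a quick query along these lines in the BigQuery console does the job (the table name matches the illustrative sketches above):

SELECT id, name, occupation
FROM `your-project-id.your_dataset.employees`
WHERE id = 20;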
Congratulations!
In this article, you learned how to create a simple pipeline that can import, process, and deduplicate data in Google Cloud. This automated solution is very cost-effective and flexible, and it can take advantage of BigQuery SQL directly in your pipeline. Try changing your merge.sql file with new features and explore the possibilities!
Post-notes
Thank you so much! I hope it will be useful for you!
Rodolfo Marcos, Sr. Data Engineer