Friday fun - ChatGPT once again - framework generation
Bipin Patwardhan
Solution Architect, Solution Creator, Cloud, Big Data, TOGAF 9
Recently, I was working on an RFP for a customer, covering the data engineering area. As has been the case with RFPs for many years, the customer asked how we would accelerate their activities. Over the years, most of us have developed accelerators and frameworks for standard data engineering tasks like ingestion, data validation, data cleaning and more. For this RFP too, we wanted to share our repository of tools and frameworks that would help the customer achieve their goal of a high-quality deliverable that goes hand in hand with on-time delivery.
As has been the case in recent weeks and months, I decided to ask ChatGPT to generate something for me. I have been experimenting with prompts on ChatGPT, and I have found that for programming tasks, we need to give quite precise instructions to have ChatGPT generate code for us - more so if we want ChatGPT to generate an entire application. Providing a specification that covers all aspects of an application is tough. It is as complex as, or even more complex than, writing a murder mystery. We need to cover all angles so that the perpetrator has no way to escape - in this case, our instructions have to be such that ChatGPT has no choice but to generate something that meets our requirements. It cannot get away with giving snippets and leaving the job of integration to us.
My prompt
Generate a pyspark application framework for ingestion of data. The application framework should be modular in nature made up of classes. It should be possible to configure behaviour using parameters. Framework should support data ingestion for full load as well as incremental load. Framework should use logging. Framework has to support multiple sources. Ingested data should be written to Delta tables. framework should allow us to create transformations as well as apply data cleaning and cleansing rules
When I executed this prompt, ChatGPT gave an error, as below:
An error occurred. Either the engine you requested does not exist or there was another issue processing your request. If this issue persists please contact us through our help center at help.openai.com.
I asked ChatGPT to re-execute the prompt. This was its response:
Sure! Below is a basic outline of a PySpark application framework for data ingestion that meets your requirements. This framework is modular, configurable, supports full and incremental loads, logging, multiple sources, writes data to Delta tables, and allows for data transformations, cleaning, and cleansing. This framework can be extended further to add support for more data sources, additional data cleaning and transformation rules, and other functionalities as per your specific requirements.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType


class DataIngestionFramework:
    def __init__(self, spark, source_type, source_config, destination_path, delta_table_name):
        self.spark = spark
        self.source_type = source_type
        self.source_config = source_config
        self.destination_path = destination_path
        self.delta_table_name = delta_table_name

    def ingest_data(self):
        # Implement logic for ingesting data based on source type
        if self.source_type == "csv":
            df = self.spark.read.csv(self.source_config["path"], header=True, inferSchema=True)
        elif self.source_type == "json":
            df = self.spark.read.json(self.source_config["path"])
        # Add support for other source types as needed

        # Apply data transformations, cleaning, and cleansing rules
        df_transformed = self.transform_data(df)

        # Write ingested data to Delta table
        df_transformed.write.format("delta").mode("append").saveAsTable(self.delta_table_name)

    def transform_data(self, df):
        # Implement data transformations, cleaning, and cleansing rules
        # Example: df_transformed = df.withColumn(...)
        return df


class IncrementalDataIngestionFramework(DataIngestionFramework):
    def __init__(self, spark, source_type, source_config, destination_path, delta_table_name, last_ingestion_time):
        super().__init__(spark, source_type, source_config, destination_path, delta_table_name)
        self.last_ingestion_time = last_ingestion_time

    def ingest_data(self):
        # Implement logic for incremental data ingestion
        if self.source_type == "csv":
            df = self.spark.read.csv(self.source_config["path"], header=True, inferSchema=True)
        elif self.source_type == "json":
            df = self.spark.read.json(self.source_config["path"])
        # Add support for other source types as needed

        # Apply data transformations, cleaning, and cleansing rules
        df_transformed = self.transform_data(df)

        # Filter data based on last ingestion time
        df_incremental = df_transformed.filter(F.col("timestamp_column") > self.last_ingestion_time)

        # Write incremental data to Delta table
        df_incremental.write.format("delta").mode("append").saveAsTable(self.delta_table_name)


class DataTransformation:
    @staticmethod
    def transform_data(df):
        # Implement generic data transformations
        # Example: df_transformed = df.withColumn(...)
        return df


class DataCleaning:
    @staticmethod
    def clean_data(df):
        # Implement data cleaning rules
        # Example: df_cleaned = df.filter(...)
        return df


# Example usage:
if __name__ == "__main__":
    # Initialize Spark session
    spark = SparkSession.builder \
        .appName("DataIngestionFramework") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()

    # Define source configuration
    source_type = "csv"
    source_config = {"path": "path/to/source/data.csv"}

    # Define destination configuration
    destination_path = "path/to/destination"
    delta_table_name = "delta_table"

    # Create instance of DataIngestionFramework
    ingestion_framework = DataIngestionFramework(spark, source_type, source_config, destination_path, delta_table_name)

    # Ingest data
    ingestion_framework.ingest_data()

    # Stop Spark session
    spark.stop()
Does the code generated above meet my requirements? Yes and no. The 'no' is not a strong objection, but neither is it a weak one.
The positives
What can be improved
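One concrete example: the prompt asked that the framework should use logging, but the generated code never creates or uses a logger. Below is a minimal sketch of how Python's standard logging could be wired into the base class. This is my own illustration, not ChatGPT's output, and the class and column names are only indicative.

import logging

# Configure logging once for the application; level and format are illustrative
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(name)s %(levelname)s %(message)s")


class DataIngestionFrameworkWithLogging:
    def __init__(self, spark, source_type, source_config, delta_table_name):
        self.spark = spark
        self.source_type = source_type
        self.source_config = source_config
        self.delta_table_name = delta_table_name
        # One logger per class keeps log lines attributable to the framework component
        self.logger = logging.getLogger(self.__class__.__name__)

    def ingest_data(self):
        self.logger.info("Reading %s source from %s",
                         self.source_type, self.source_config.get("path"))
        if self.source_type == "csv":
            df = self.spark.read.csv(self.source_config["path"], header=True, inferSchema=True)
        elif self.source_type == "json":
            df = self.spark.read.json(self.source_config["path"])
        else:
            raise ValueError(f"Unsupported source type: {self.source_type}")
        self.logger.info("Writing to Delta table %s", self.delta_table_name)
        df.write.format("delta").mode("append").saveAsTable(self.delta_table_name)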
For an experienced person, this structure is obvious. I have been building code with such structures for many years, so I can relate to what ChatGPT has generated. I can also improve the prompt to get code that aligns more closely with my expectations. Experience is critical.
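As an illustration, a more configuration-driven layout might look like the hypothetical sketch below, where each source type sits behind a small reader registry and every job parameter comes from a single config dictionary. The function names, paths and table names are purely illustrative, and the sketch assumes a Spark environment with Delta Lake support, as the generated code does.

from pyspark.sql import SparkSession


# Each supported source type maps to a small reader function
def read_csv(spark, source_conf):
    return spark.read.csv(source_conf["path"], header=True, inferSchema=True)


def read_json(spark, source_conf):
    return spark.read.json(source_conf["path"])


READERS = {"csv": read_csv, "json": read_json}


def run_ingestion(spark, job_config):
    # Behaviour is driven entirely by the config dictionary
    reader = READERS[job_config["source_type"]]
    df = reader(spark, job_config["source"])
    df.write.format("delta").mode("append").saveAsTable(job_config["delta_table"])


if __name__ == "__main__":
    spark = SparkSession.builder.appName("ConfigDrivenIngestion").getOrCreate()
    job_config = {
        "source_type": "csv",
        "source": {"path": "path/to/source/data.csv"},
        "delta_table": "delta_table",
    }
    run_ingestion(spark, job_config)
    spark.stop()

With this kind of layout, supporting a new source means adding one reader function and one registry entry, rather than editing ingest_data in two classes.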
For a beginner, gaining experience is important, and experience can be gained only by writing code and trying things out. ChatGPT can generate code for us, but it cannot make us understand it or debug it. Though I am not much of a cricket fan, this being cricket season, a cricket analogy is called for.
Sitting on a couch and asking a player to take a run or hit a four or a six off each ball is very easy. It takes real strength to be on the field for two to three hours and face a ball hurtling at you. Sitting at home, we get to watch replays in slow motion; on the field, the ball arrives in a fraction of a second, taxing our decision-making ability. Being on the field and playing in varying circumstances is the only solution.
#codegeneration #fridayfun #chatgpt #framework #pyspark #generativeai