Why You Shouldn't Rely on GPT to Write Code

Charles Givre

Experienced cyber security data scientist and data engineer. CISSP | Ex CIA, JP Morgan. GenAI | NLP | Python | SQL | Java | Speaker | Blackhat Instructor and O'Reilly Author | Classic car enthusiast.

Published: July 26, 2023

A lot of people have tried out ChatGPT and other LLMs for their code-writing abilities. My theory was that LLMs would be really good at writing code to do things they've seen before, but not so good at things that were completely new. I started my experiments by asking ChatGPT to write me a function in Python to geolocate a phone number. ChatGPT 3.5 did a relatively poor job of this, so I tried again using the playground and gpt-3.5-turbo. This time it was more successful.

That's not bad. I like that it used the phonenumbers library rather than calling some external service.

Something Harder: A Drill UDF

Apache Drill, my favorite query engine, has an interface that lets users define User Defined Functions (UDFs), which are essentially custom functions. The interface is tricky to use, but once you've built UDFs, they can be extremely powerful for data cleanup and other functionality. I thought I'd try this out to see if I could get GPT to generate an Apache Drill UDF to do the same thing.

Note: I'd actually done this already as part of DataDistillr. The code is proprietary; however, I wanted to see how GPT would do compared with the code that I wrote. Here's what gpt-3.5-turbo generated:

import org.apache.drill.exec.expr.fn.FunctionHolder;
import org.apache.drill.exec.expr.holders.VarCharHolder;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;

import com.maxmind.geoip2.DatabaseReader;
import com.maxmind.geoip2.model.CityResponse;

import java.io.File;
import java.io.IOException;
import java.net.InetAddress;

@FunctionTemplate(
    name = "geolocate_phone",
    scope = FunctionTemplate.FunctionScope.SIMPLE,
    nulls = FunctionTemplate.NullHandling.NULL_IF_NULL
)
public class GeoLocatePhoneNumbersUDF implements DrillSimpleFunc {

  @Param
  VarCharHolder phoneNumber;

  @Output
  VarCharHolder out;

  private static final String GEOIP_DATABASE_PATH = "/path/to/GeoLite2-City.mmdb"; // Replace with the actual path to the GeoLite2 database

  @Override
  public void setup() {
  }

  @Override
  public void eval() {
    String phoneNumberStr = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(phoneNumber.start, phoneNumber.end, phoneNumber.buffer);

    try {
      File database = new File(GEOIP_DATABASE_PATH);
      DatabaseReader dbReader = new DatabaseReader.Builder(database).build();
      CityResponse response = dbReader.city(InetAddress.getByName(phoneNumberStr));

      String city = response.getCity().getName();
      String countryCode = response.getCountry().getIsoCode();
      String countryName = response.getCountry().getName();

      String result = phoneNumberStr + " is located in " + city + ", " + countryCode + ", " + countryName;

      byte[] bytes = result.getBytes();
      out.buffer = buffer;
      out.start = 0;
      out.end = bytes.length;
      buffer.setBytes(0, bytes);
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}

Wrong, Wrong, Wrong...

I was impressed that GPT was able to generate anything that remotely resembled a UDF. Now that I've said something positive... let's pick this apart.

The first issue I have is that the functionality is just plain wrong. This class does not geolocate phone numbers. It geolocates IP addresses, and we know this because it uses the MaxMind libraries to do so. The second issue is with the code itself. It demonstrates a really poor understanding of how this SDK actually works, and if it were deployed, it would perform terribly.
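To make the first point concrete: locating a phone number is a lookup on the number's own prefix, with no IP database involved. The sketch below is a toy illustration only. The class `CallingCodeLookup`, its method `regionFor`, and the five-entry table are assumptions made up for this example; it is not how the phonenumbers library or any Drill UDF is implemented, and a real implementation would rely on a maintained library's metadata.

```java
import java.util.Map;
import java.util.Optional;

// Toy illustration only: a hand-written table of a few country calling codes.
// Hypothetical class; a real implementation would use a maintained library instead.
class CallingCodeLookup {

  // A handful of country calling codes (digits only, no leading '+').
  static final Map<String, String> REGIONS = Map.of(
      "1", "North American Numbering Plan",
      "33", "France",
      "44", "United Kingdom",
      "49", "Germany",
      "81", "Japan");

  // Returns the region for an international-format number such as "+44 20 7946 0958",
  // trying the longest candidate prefix first (calling codes are 1 to 3 digits long).
  static Optional<String> regionFor(String phoneNumber) {
    if (phoneNumber == null || !phoneNumber.trim().startsWith("+")) {
      return Optional.empty(); // this sketch only handles international format
    }
    String digits = phoneNumber.replaceAll("[^0-9]", "");
    for (int len = Math.min(3, digits.length()); len >= 1; len--) {
      String region = REGIONS.get(digits.substring(0, len));
      if (region != null) {
        return Optional.of(region);
      }
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    System.out.println(regionFor("+44 20 7946 0958").orElse("unknown"));
    System.out.println(regionFor("+1 410 555 0100").orElse("unknown"));
    // An IP address is not a phone number, so nothing matches.
    System.out.println(regionFor("192.168.1.1").orElse("unknown"));
  }
}
```

The input here is the phone number string itself, whereas the generated UDF passes that string to an IP lookup, which is a different problem entirely.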

The Drill SDK has two methods that must be implemented, setup() and eval(). Drill functions are meant to be run as part of a SQL query, and as such the assumption is that they will be executed on collections rather than individual items. To make UDFs scalable, anything that only needs to happen once should be placed in the setup() method, and anything that you do on each row goes in the eval() method.

The issue with the GPT code is that it creates a new DatabaseReader object in the eval() method, which means that every function call gets a new DatabaseReader. The same goes for the File object. This is horrendously inefficient and would likely not work for a large dataset.
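The cost difference between the two placements can be sketched without any Drill or MaxMind dependencies. What follows is a minimal illustration, not Drill code: `RowFunction`, `ExpensiveReader`, `PerRowReader`, `SharedReader`, and the construction counter are hypothetical stand-ins, and the loop in `readersBuiltFor` only assumes that an engine calls setup() once and eval() once per row.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-ins for illustration only; none of these are Drill or MaxMind classes.
class SetupVsEvalDemo {

  // Counts how many times the "expensive" reader is constructed.
  static final AtomicInteger READERS_BUILT = new AtomicInteger();

  // Stand-in for a costly resource such as a database reader.
  static final class ExpensiveReader {
    ExpensiveReader() {
      READERS_BUILT.incrementAndGet();
    }

    String lookup(String key) {
      return "result-for-" + key;
    }
  }

  // Minimal two-method contract mirroring the setup()/eval() split.
  interface RowFunction {
    void setup();

    String eval(String row);
  }

  // Anti-pattern: builds a new reader for every row.
  static final class PerRowReader implements RowFunction {
    public void setup() {}

    public String eval(String row) {
      return new ExpensiveReader().lookup(row);
    }
  }

  // Preferred: builds the reader once in setup() and reuses it for every row.
  static final class SharedReader implements RowFunction {
    private ExpensiveReader reader;

    public void setup() {
      reader = new ExpensiveReader();
    }

    public String eval(String row) {
      return reader.lookup(row);
    }
  }

  // Calls setup() once and eval() per row, then reports how many readers were built.
  static int readersBuiltFor(RowFunction fn, List<String> rows) {
    READERS_BUILT.set(0);
    fn.setup();
    for (String row : rows) {
      fn.eval(row);
    }
    return READERS_BUILT.get();
  }

  public static void main(String[] args) {
    List<String> rows = List.of("a", "b", "c", "d");
    System.out.println("per-row: " + readersBuiltFor(new PerRowReader(), rows)); // per-row: 4
    System.out.println("shared:  " + readersBuiltFor(new SharedReader(), rows)); // shared:  1
  }
}
```

With four rows, the per-row variant constructs four readers and the shared variant constructs one. The gap grows linearly with the number of rows, which is why the placement matters at query scale.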

Is This Stolen?

So here's the thing. When I first saw this, I thought the code looked remarkably familiar, and the reason is that I actually wrote a Drill UDF that performs IP geolocation (https://github.com/datadistillr/drill-geoip-functions). Mine actually works and is efficient.

With that said, my code is the only possible source for this, as it is the only library out there that does IP geolocation for Apache Drill. There are a few forks, but I'm fairly certain that my code is the original source. Additionally, in my repositories, the Apache 2.0 license is attached to each file and to the repo itself. This license allows redistribution but requires that the license be included with the code. The response from GPT did not include the original license.

What Am I Going to Do?

Sadly, probably nothing. I'm not using this library to make money, but to me, it is yet another example of OSS being abused by a large company. All this does is make me question whether OSS is going to be viable if the AI providers are simply going to abuse copyrights and licenses.

Khurrum Mahmood

In Stealth Mode

1 year ago

I imagine that a lot of opensource code will be outright written by GPTs in a year or so.

Shawn Mittal

AI/ML Engineer

1 year ago

I've begun using it as a knowledge repository for programming. For instance, I was putting together a flask app and forgot how flask blueprints worked. Instead of a Google search, I asked ChatGPT to write up an example. I would never copy code directly, but man is it a handy reference for all of the stuff I used to know, but have since forgotten. This usage pattern also circumvents some of the ethical concerns you've brought up here. But yeah, lots of open questions in the legal and moral realms definitely remain.

Ted Dunning

Fellow in Hewlett Packard Enterprise Labs working on usable and provable data security

1 year ago

1. Oh well 2. Absolutely you are rolling the dice (and they are loaded against you) 3. sure. I write open source because I am willing to share. I don't see why to limit that to sharing with humans.

Derek Murawsky

VP, Head of DevOps and DevEx at Nations Benefits | Cloud Architect | Innovator

1 year ago

Rely on? Absolutely not. Use to advance, learn, train, and explain, while keeping a critical eye on its outputs? Absolutely! It's a great tool to have in your arsenal.

Dr. Rupal Mittal

Assistant Vice President & Lead Data Scientist at CCS Fundraising

1 year ago

Thanks, Charles Givre, very insightful.
