GenAI n00b, Part 2
This article is Part 2 of the multi-part series “GenAI n00b”. Part 1 is here - https://www.dhirubhai.net/pulse/genai-n00b-part-1-dmitry-grinberg-cdzye/
I decided to give it a bit of space after Part 1 to iterate on a few of the concepts I started with and see what would matter more to share in detail. I ended up building a Chain-of-Thought / Reason-Act Proof of Concept (POC) with a few very interesting takeaways. Though I am compelled to follow through on Part 1’s promise to unwrap all the briefly described points in a lot more detail, I believe it is more valuable for me to share some of the nuances of my POC. Most of these nuances build on the previous topics, so it's not entirely a deviation.
For the POC, I wanted to build a bot that can answer complex questions by relying on a mix of a structured dataset, the ability to look up/search the internet where needed, and a few simple functions where appropriate, all decided by the LLM itself (not a programmed algorithm).
I wanted to be able to use either OpenAI or a custom LLM, so the route I took had to work for both, so I could compare, etc. Since this was a POC, I didn’t want to build any façades or a polymorphic implementation to generate different representations based on the model used.
Here is what the setup looks like for the POC, simplified to a single "tool" that looks at a specific dataset for answers or falls back on the LLM's own knowledge. Adding more tools is just a matter of enumerating them (use your imagination reading the template in Step 4 below) - I've implemented half a dozen, and the limit is obvious (max tokens).
Step 1. Using Colab, get things installed. Make sure to pick the correct GPU up front, since switching resets everything (everything has to be reinstalled and redownloaded). I mostly used the V100, as the A100 burns through credits too fast; though if you need to iterate through a lot of open-LLM chatter, the A100 will save a ton of time (it responds much faster). Use the cheapest option if you're only calling OpenAI (it doesn't need a GPU).
# in order to use the GPU, you need this command for the llama-cpp-python wrapper
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python
# OpenAI API
!pip install openai
# using Jinja2 to templatize prompts. Concatenating or
# formatting strings is pathetic, even for POC.
!pip install Jinja2
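A quick sanity check is worth doing before anything else, since the GPU you get determines both speed and credit burn; nvidia-smi is available in Colab GPU runtimes and shows which card you were allocated.
# confirm which GPU Colab allocated (V100, A100, T4, ...)
!nvidia-smi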
Step 2. I’ve settled on llama-cpp-python for a couple of reasons. I’ve tried many options and determined this is the best way to go: partially because it supports streaming (not much else does), and it just works (I ended up not really needing the kitchen sink that many libs offer). Going beyond a POC might change my mind, though that's unlikely (I'll describe it in later parts, if I ever get that far).
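To illustrate the streaming support that tipped the scale, here is a minimal sketch; the model path is a placeholder, and the call pattern is the standard llama-cpp-python completion API.
from llama_cpp import Llama

# placeholder path; any GGUF model works the same way
llm = Llama(model_path='some-model.gguf', n_ctx=4000)

# stream=True yields chunks as they are generated instead of one final blob
for chunk in llm('### Instruction:\nSay hello.\n### Response:\n',
                 max_tokens=64, stream=True):
    print(chunk['choices'][0]['text'], end='', flush=True)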
Step 3. Get the dataset that you want to make sure the bot understands. It could be of many different kinds. I’ve tested mostly with “drgilermo/nba-players-stats” from Kaggle, which has 3 CSVs (the name explains what’s in there). I also loaded Insurance Claims, Stocks, Housing Data, and many others. It works equally well, as long as you don't go nuts with the number of CSVs and the columns in each.
import os

dataset_name = 'some name of the dataset from Kaggle'
dest_path = 'destination folder, where dataset will be unzipped'
# set credentials before importing the kaggle package
# (some versions authenticate at import time)
os.environ['KAGGLE_USERNAME'] = 'your Kaggle username'
os.environ['KAGGLE_KEY'] = 'your Kaggle key'

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()
# Now, we exercise our power to summon the dataset from the Kaggle dimension.
api.dataset_download_files(dataset_name, path=dest_path, unzip=True)
# then code, that I'm omitting here, which iterates over the
# downloaded csv datasets from dest_path and loads each as a
# pandas DataFrame using pd.read_csv into a local dict
# called 'sandbox_locals'.
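That omitted loader might look roughly like this. It's only a sketch: the naming convention (Seasons_Stats.csv becoming Seasons_Stats_csv) and exposing each frame as a top-level variable are my inference from the example output further down.
import glob
import os
import pandas as pd

sandbox_locals = {}
frames = {}
for path in glob.glob(os.path.join(dest_path, '*.csv')):
    # Seasons_Stats.csv -> Seasons_Stats_csv, a valid python identifier
    name = os.path.basename(path).replace('.', '_')
    frames[name] = pd.read_csv(path)
    # expose each frame under its own name so generated python can reference it
    sandbox_locals[name] = frames[name]
# 'df' is what the Jinja2 template below expects: a single dataframe or a dict of them
sandbox_locals['df'] = next(iter(frames.values())) if len(frames) == 1 else frames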
Step 4. Set up the prompt for the DataFrame data as a Jinja2 template:
prompt = """
### Instruction:
You should respond in valid JSON format only.
{{history}}
You should use the tools below to answer the question posed of you:
python_repl_ast: A Python shell. Use this to execute python commands.
Input should be a valid python command, that does not return errors. When
using this tool, sometimes output is abbreviated - make sure it does not
look abbreviated before using it in your answer.
Always use the following format for response:
{
"Question": "the input question you must answer",
"Thought": "you should always think about what to do",
"Actions":
[
{
"Function": "The action to take, should be one of
[python_repl_ast]",
"Input": "The input to the action.",
},
# you can have multiple Actions
],
"Answer": "The final answer to the original input question. Provide
details.",
"Is Answer Final": "Yes if the answer is final, No if not."
}
{% if df is mapping %}{% for key, value in df.items() %}
You are working with a pandas dataframe in Python. The name of the
dataframe is `{{key}}`.
This is the result of `print({{key}}.head(2))`:
{{value.head(2)}}
{% endfor %}
{% else %}
You are working with a pandas dataframe in Python. The name of the
dataframe is `df`.
This is the result of `print(df.head(2))`:
{{df.head(2)}}
{% endif %}
Begin!
Question:
{{user_question}}
### Response:
"""
When the actual text prompt needs to be generated, all that needs to be done is this:
from jinja2 import Template

template = Template(prompt)
# the sandbox_locals dict should have everything needed to render;
# in this POC, the dataframe is loaded under the key 'df' as either a single
# DataFrame or a dict of DataFrames
use_prompt = template.render(sandbox_locals)
Step 5. Load either OpenAI or the custom LLM.
# download model if using local model
model = 'deepseek-coder-33b-instruct.Q4_K_M.gguf'
model_path = 'TheBloke/deepseek-coder-33B-instruct-GGUF'
# this will use huggingface-cli to download the proper model from huggingface
!huggingface-cli download {model_path} {model} --local-dir . --local-dir-use-symlinks False
import os
from llama_cpp import Llama
# GPU_LAYERS is the number of layers offloaded to the GPU; more layers
# increase performance, but it depends on available memory
llm = Llama(model_path=model, n_ctx=4000, n_threads=4,
            n_gpu_layers=int(os.environ['GPU_LAYERS']))
# -----------------------------
# OR
# -----------------------------
from openai import OpenAI
llm = OpenAI()
# the logic elsewhere does the switching, etc.; omitting here for brevity
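That switching logic is omitted from the POC description, but a minimal sketch of what such a dispatcher could look like is below; the function name, the flag, and the OpenAI model name are mine, not the actual POC code.
def complete(use_prompt, use_openai=False, max_tokens=1500):
    # single entry point regardless of backend; returns the raw text response
    if use_openai:
        # llm is the OpenAI() client here; model name is just an example
        resp = llm.chat.completions.create(
            model='gpt-4',
            messages=[{'role': 'user', 'content': use_prompt}],
            max_tokens=max_tokens,
        )
        return resp.choices[0].message.content
    # llm is the llama-cpp-python Llama instance here
    out = llm(use_prompt, max_tokens=max_tokens, temperature=0)
    return out['choices'][0]['text']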
Step 6. Implement the function that takes the user's input and passes it to the LLM, which follows the instructions from the prompt. If the response is not a final answer, the call is repeated with the results included (as 'history'). You can't repeat too many times before the token limit is reached; I've gone up to 10 iterations (as long as the datasets are not too complex). This implementation is a few pages long, so I'm skipping it here, but I might share it later as a link to the source.
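Since the full implementation is a few pages long, here is a heavily condensed sketch of the loop instead, assuming the complete() dispatcher sketched above and a run_python() helper (sketched further down, under point B) that executes a string as python against sandbox_locals:
import json
from jinja2 import Template

MAX_TURNS = 10  # stop before the context window fills up

def print_answer(user_question):
    history = ''
    for _ in range(MAX_TURNS):
        sandbox_locals['user_question'] = user_question
        sandbox_locals['history'] = history
        use_prompt = Template(prompt).render(sandbox_locals)
        raw = complete(use_prompt)
        # a real implementation needs more robust JSON extraction / retries here
        response = json.loads(raw)
        if response.get('Is Answer Final') == 'Yes':
            print(response['Answer'])
            return
        # run every action the model asked for and feed the output back as history
        results = []
        for action in response.get('Actions', []):
            if action['Function'] == 'python_repl_ast':
                results.append(run_python(action['Input']))
        history += raw + '\n' + '\n'.join(results) + '\n'
    print('No final answer within the turn limit.')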
Here is a prompt example for the NBA Stats dataset, using the deepseek-coder-33b-instruct.Q4_K_M.gguf model.
print_answer("Take player stats that determine MVP, then use the stats to
figure out who is MVP in 2015.")
This is what it responds with:
{
"Question": "Take player stats that determine MVP, then use the stats
to figure out who is MVP in 2015.",
"Thought": "To find the MVP of a specific year, we need to filter the
dataframe by the year and sort it based on some criteria. The most common
criteria are points (PTS), assists (AST), rebounds (TRB) etc. For
simplicity, let's use PTS.",
"Actions": [
{
"Function": "python_repl_ast",
"Input": "Seasons_Stats_csv['Year'] == 2015"
},
{
"Function": "python_repl_ast",
"Input": "MVP_2015 =
Seasons_Stats_csv[Seasons_Stats_csv['Year'] == 2015].sort_values('PTS',
ascending=False)"
}
],
"Answer": "The player with the most points in 2015 is MVP. The name of
the player can be found by accessing the 'Player' column of the first row
of the sorted dataframe: MVP_2015['Player'].iloc[0].",
"Is Answer Final": "No"
}
So now a few things will happen:
A. The tool it thinks should be used for the 2 actions above is python_repl_ast, which in both cases should execute the python code that it thinks will get to the proper result.
B. The code is executed, with the end result being the MVP_2015 dataframe it created (printed to show it, and also to be used by the LLM subsequently). More specifically, I have written a bunch of code that takes a string (say "Seasons_Stats_csv['Year'] == 2015") and executes it as python code (I might go into the nuances of how this is done some other time).
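The run_python() helper referenced in the loop sketch above could be as simple as this; it's a sketch without the guardrails a real version needs, and it assumes sandbox_locals is the dict holding the dataframes.
import contextlib
import io

def run_python(code):
    # capture anything printed while the generated code runs
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        try:
            # try eval() first so bare expressions return a value we can print
            result = eval(code, {}, sandbox_locals)
            if result is not None:
                print(result)
        except SyntaxError:
            # assignments and other statements go through exec();
            # new variables (like MVP_2015) land in sandbox_locals
            exec(code, {}, sandbox_locals)
    return buf.getvalue()
The MVP_2015 dataframe it created looks like this: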
print(MVP_2015)
Unnamed: 0 Year Player Pos Age Tm G GS \
23114 23114 2015.0 James Harden SG 25.0 HOU 81.0 81.0
23000 23000 2015.0 Stephen Curry PG 26.0 GSW 80.0 80.0
23480 23480 2015.0 Russell Westbrook PG 26.0 OKC 67.0 67.0
23154 23154 2015.0 LeBron James SF 30.0 CLE 69.0 69.0
23215 23215 2015.0 Damian Lillard PG 24.0 POR 82.0 82.0
... ... ... ... .. ... ... ... ...
22999 22999 2015.0 Seth Curry PG 24.0 PHO 2.0 0.0
23476 23476 2015.0 David Wear PF 24.0 SAC 2.0 0.0
22916 22916 2015.0 Jerrelle Benimon PF 23.0 UTA 2.0 0.0
23222 23222 2015.0 Kalin Lucas PG 25.0 MEM 1.0 0.0
23210 23210 2015.0 Malcolm Lee SG 24.0 PHI 1.0 0.0
MP PER ... FT% ORB DRB TRB AST STL BLK \
23114 2981.0 26.7 ... 0.868 75.0 384.0 459.0 565.0 154.0 60.0
23000 2613.0 28.0 ... 0.914 56.0 285.0 341.0 619.0 163.0 16.0
23480 2302.0 29.1 ... 0.835 124.0 364.0 488.0 574.0 140.0 14.0
23154 2493.0 25.9 ... 0.710 51.0 365.0 416.0 511.0 109.0 49.0
23215 2925.0 20.7 ... 0.864 49.0 329.0 378.0 507.0 97.0 21.0
... ... ... ... ... ... ... ... ... ... ...
22999 8.0 -11.4 ... NaN 0.0 2.0 2.0 1.0 0.0 0.0
23476 7.0 2.4 ... NaN 2.0 0.0 2.0 1.0 0.0 0.0
22916 3.0 4.7 ... NaN 1.0 2.0 3.0 0.0 0.0 0.0
23222 6.0 -0.7 ... NaN 0.0 0.0 0.0 0.0 1.0 0.0
23210 2.0 -19.7 ... NaN 0.0 0.0 0.0 0.0 0.0 0.0
TOV PF PTS
23114 321.0 208.0 2217.0
23000 249.0 158.0 1900.0
23480 293.0 184.0 1886.0
23154 272.0 135.0 1743.0
23215 222.0 164.0 1720.0
... ... ... ...
22999 0.0 2.0 0.0
23476 0.0 1.0 0.0
22916 1.0 0.0 0.0
23222 0.0 1.0 0.0
23210 0.0 0.0 0.0
If you look at the response, the Answer is not final, because the LLM doesn't know what the result is yet (we need to pass it back), but its own hint for the Answer is to look at the first position of the MVP_2015 dataframe it created (MVP_2015['Player'].iloc[0]).
So we take the printout above and pass it back as 'history', appended after its original response (omitted for brevity; all the details above are enough to figure out what it looks like).
After this call, the LLM responds with the final answer: James Harden was the MVP in 2015. It basically understands that functions were run on this data and the results were provided back to it to answer the question.
A few interesting bits:
The key to this approach, if you understand what this thing is doing, is that you basically give it multiple ways it can choose to get to the answer using different means (tools), by explaining what these tools can do - and it is very clever about how it picks which tools to use and for what. I've played around with complex asks where it used multiple tools for different parts of the question, and also where it used one tool and then fed its results into another (with the above example, I asked for the MVP as above and then to write it out using ascii art, which I had added as another tool, and it did).
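One way to keep that enumeration manageable as the tool count grows is to describe the tools as data and let Jinja2 render them into the prompt. A sketch follows; python_repl_ast is the tool from this POC, while the other names are made up for illustration.
from jinja2 import Template

# hypothetical tool registry; the descriptions are what the LLM sees
tools = {
    'python_repl_ast': 'A Python shell. Use this to execute python commands.',
    'web_search': 'Looks up current information on the internet.',
    'ascii_art': 'Renders a short piece of text as ascii art.',
}

tool_section = Template("""
You should use the tools below to answer the question posed of you:
{% for name, description in tools.items() %}
{{name}}: {{description}}
{% endfor %}
""").render(tools=tools)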
Finally, what's next:
For Part 3, I'm going to go into fine-tuning, to improve results for smaller models. I will also likely experiment with long-term memory (which I'd add to this setup; basically a database) and use it to continuously and automatically fine-tune as well (based on successful task completion, etc.).