How I learnt to build an LLM in 32 hours (no, like really!)
Greetings Network! It’s been a while. I hope you are all well :) I’m thrilled to share a recent adventure: participating in the University of Cambridge GenAI Hackathon (1st to 3rd March).
Fueled by inspiration from a talk by Jonas Templestein, the former CTO of Monzo Bank, my teammates (Hafsa K. and Uzair Ahmed) and I set our sights on a project that seemed almost beyond reach: building an MVP of a large language model (LLM) designed to detect and fix security vulnerabilities within a mere 32 hours!
From Zero to Hero in LLM Development
The task was daunting. With no prior experience in LLMs between us, the complexity of the challenge was immense. Yet, driven by a mix of determination and a pinch of audacity, we began right away. Our project, “ArmouAI,” was an LLM crafted to detect security vulnerabilities in code and generate fixes for them. The project had promising scalability and the potential to evolve into a ground-breaking start-up.
Our strategy was multifaceted. While I tackled building the LLM from start to finish, Uzair took on web development, and Hafsa managed data collection, cleaning, and processing. Our journey was filled with learning and adaptation, from pivoting away from a no-code approach (due to its lack of customisability) to leveraging the ‘starcoder’ model on Hugging Face, training it and then fine-tuning it to fit our custom needs.
The (good) technical stuff
Because the ‘starcoder’ model is a purely code-generating LLM, I first had to train it on the “open-platypus” dataset, which improved the model’s comprehension ability. However, this meant that the LLM could no longer be used commercially due to licensing issues. Despite this, I believe it was a pivotal sacrifice: the dataset is quite extensive, with over 15,000 training cases, yet relatively lightweight, and both qualities were crucial given the minimal resources we had to train the model.
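For context, pulling that dataset down is straightforward with the datasets library. Here is a minimal sketch, assuming the garage-bAInd/Open-Platypus release on the Hugging Face Hub is the copy in question:

```python
from datasets import load_dataset

# Load the Open-Platypus instruction-tuning dataset from the Hugging Face Hub
# (the exact repo id "garage-bAInd/Open-Platypus" is my assumption)
platypus = load_dataset("garage-bAInd/Open-Platypus", split="train")

print(len(platypus))          # training cases
print(platypus.column_names)  # instruction/response-style fields
```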
After training the model, it was time to fine-tune it with a niche, custom dataset from GitHub, focused on analysing security vulnerabilities in C and C++ code (a way to test the model’s viability on two concrete use cases). To carry out these steps, I leveraged the “sentence_transformers” library with the “thenlper/gte-base” model (lighter on GPU usage than gte-large, but with relatively similar outcomes). Using gte-base, I encoded the examples into embeddings, then identified and eliminated near-duplicates (deduplication) via FAISS, discarding anything with >0.95 similarity.
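To illustrate, here is a minimal sketch of that deduplication pass; the variable names and the choice of k nearest neighbours are illustrative, not our exact script:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("thenlper/gte-base")
examples = ["int main() { ... }", "..."]  # collected C/C++ vulnerability examples

# Normalised embeddings make inner product equivalent to cosine similarity
embeddings = np.asarray(
    encoder.encode(examples, normalize_embeddings=True), dtype="float32"
)

index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# For each example, look up its nearest neighbours and drop any example
# whose similarity to an *earlier* one exceeds the 0.95 threshold
scores, neighbours = index.search(embeddings, k=5)
to_drop = set()
for i in range(len(examples)):
    for score, j in zip(scores[i], neighbours[i]):
        if j != i and j < i and score > 0.95:
            to_drop.add(i)
            break

deduped = [ex for i, ex in enumerate(examples) if i not in to_drop]
```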
Further dataset optimisation involved tokenisation checks to retain examples within manageable token limits. Ultimately, the top 3,000 entries by token count were selected to emphasise complex, detailed examples. This curated dataset not only underscores the model’s capacity to parse and understand intricate code patterns but also enriches its training foundation. Finally, the dataset was pushed to the Hugging Face Hub for use in the actual training of the model.
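A sketch of that curation step might look like the following; the token limit, the repo id, and the use of the starcoder tokenizer for counting are my assumptions:

```python
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")  # gated model; needs an HF token
dataset = Dataset.from_dict({"text": deduped})  # deduplicated examples from the previous step

# Record a token count per example, then keep those within a manageable limit
def add_token_count(example):
    example["n_tokens"] = len(tokenizer(example["text"])["input_ids"])
    return example

dataset = dataset.map(add_token_count)
dataset = dataset.filter(lambda ex: ex["n_tokens"] <= 2048)  # assumed limit

# Keep the 3,000 longest examples to favour complex, detailed cases
dataset = dataset.sort("n_tokens", reverse=True)
dataset = dataset.select(range(min(3000, len(dataset))))

dataset.push_to_hub("your-username/armouai-security-dataset")  # placeholder repo id
```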
As for the implementation and fine-tuning of the model, the journey began with integrating the transformers library to use its AutoTokenizer on the dataset prepared earlier. To address the computational and memory demands of training the starcoder model, I employed the peft library, which let us fine-tune by adjusting only a small subset of model parameters, significantly optimising the training process.
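In code, the peft setup is roughly as below; the LoRA rank, dropout, and target modules are assumptions (the listed modules are the attention projections commonly targeted for starcoder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

lora_config = LoraConfig(
    r=16,                # low-rank adapter dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_proj", "c_attn", "q_attn"],  # starcoder attention layers
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of weights is trainable
```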
Similarly, the trl library provided the SFTTrainer, a flexible tool for supervised fine-tuning that enhanced our model’s learning from the targeted dataset. Quantisation was the next critical step, and the one that took me the longest to understand. It was handled adeptly by the bitsandbytes library, allowing us to bypass the need for full precision and reduce the computational load.
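Putting quantisation and supervised fine-tuning together looks roughly like this; the 4-bit NF4 settings are a common choice rather than a record of our exact configuration, and argument names can differ between trl versions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTTrainer

# Load the base model in 4-bit to avoid full-precision memory costs
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder",
    quantization_config=bnb_config,
    device_map="auto",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,       # the curated dataset pushed to the Hub earlier
    peft_config=lora_config,     # the LoRA config from the peft step
    dataset_text_field="text",
    max_seq_length=2048,
)
trainer.train()
```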
The most difficult bit was tuning the training arguments to our specific needs, i.e. testing to find optimal learning rates (starting at 0.0001 and moving up in roughly multiples of 3), batch sizes, and evaluation strategies.
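As a sketch, the sweep amounted to looping over candidate settings like these (values illustrative; each run also needs an evaluation set to compare against):

```python
from transformers import TrainingArguments

# Learning rates rising in rough multiples of 3, as described above
for lr in [1e-4, 3e-4, 1e-3]:
    args = TrainingArguments(
        output_dir=f"armouai-lr-{lr}",
        learning_rate=lr,
        per_device_train_batch_size=4,   # assumed batch size
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        evaluation_strategy="steps",
        eval_steps=100,
        logging_steps=50,
    )
    # ...build an SFTTrainer with `args` and compare eval loss across runs
```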
After completing all these steps, the model was ready to go. I tested it on several cases, and the results were phenomenal. However, this was my initial training run on a small dataset of around 100 instances. On the second day of the hackathon, I focused on training the actual model on the far more extensive dataset of around 18,000+ instances. However, I ran into a slight issue: I had reached my limit of virtual GPU usage. As a result, when the competition ended, we only had a handful of functional use cases. I was distraught; all that hard work, and the competition ended with no working model. But things took a turn…
The Judging
Watching all the groups present, I was confident in the innovation and novelty of our idea, but what worried me was that we had no viable MVP to show for it. Nevertheless, we presented and delivered with absolute conviction in our concept, explaining every detail of our thought process and our production process. What captivated the judges was the start-up potential of our idea. They were interested in our vision. Taking inspiration from GenAI leaders like IBM, CrowdStrike, and NVIDIA, we wanted to develop a “white-collar” start-up offering an ALL-IN-ONE solution: a Cloud-Adaptive Threat Simulation and Immune System for cybersecurity.
The model we had just built was only the start. We want to create a complete solution for clients: one that not only parses their entire software system with our AI to detect vulnerabilities, but also leverages another machine learning model, some form of Generative Adversarial Network (GAN), that takes a base “cyber-attack method” and adapts it to the client's software. It then launches the attack and gauges the severity of the vulnerability using a reinforcement learning model that treats each instance of model usage as an agent, with the score or “reward” being the amount of data retrieved or the load it can place on servers, which we can track from the client side.
Back at the hackathon, when the judging began, we felt a sliver of relief. All that hard work and time spent building was finally coming to an end. It was a peaceful few moments; I felt as though a weight had been lifted off my shoulders. That all changed when the judges announced the result. I felt my heart drop… I was excited but also on edge. Had we just gone and pulled off a masterclass?
The END?!
WE PLACED IN THE TOP 6 across the entire competition!!! To me, this was a massive sign of success; starting with zero knowledge of LLMs across the entire team and still managing to beat teams with quite extensive backgrounds in machine learning was a remarkable feat.
I was over the moon. This really proved to me that with the right motivation and the right mindset, anything and everything is possible. And it reinforced for me, even more so, that self-doubt does not serve you; rather, it hinders you and pulls you back. What started off with doubts and a sliver of chaos ended up as one of my most remarkable learning experiences, and with a profound feeling of accomplishment.
What’s Next?
To me, this experience was more than just a hackathon; it was a profound journey of growth, learning, and overcoming self-doubt. It reaffirmed my belief in the boundless possibilities when passion meets perseverance.
The possibilities are endless with my newfound love for the field. Just before the competition, I was diving into machine learning as a whole, with a main focus on computer vision. As for my journey, there is a lot more hard work and striving to come, with many goals to reach and hurdles to overcome, the most significant of which will be documented through LinkedIn posts - stay tuned for that.
The future of ArmouAI is still an exciting endeavour; we hope that, Inshallah (God Willing), once we gain more skills, we can turn our visions into reality.
Concluding statement
I extend my deepest gratitude to Tanmay Gupta, PhD, and GetSeen Ventures for organising this incredible event.
I would also like to take a moment to sincerely thank the judges Andreas Kollegger, Chad Toerien, Ade Famoti, George Neville-Jones, and Murat Ozer for your time and engagement. I hope you enjoyed our presentation as much as we enjoyed being in your presence :)
And finally to my LinkedIn network: this journey has only just begun. I hope you all stay safe and in good health. Until we meet again.
Kind regards,
Abdel L.
Extra:
Copy of our presentation = https://shorturl.at/hjqBX