Watch #1: About LLMs Browsing on Your Phone, Unit Tests and Code Reviews
In this issue:
1. Empowering LLMs to use Smartphones for Intelligent Task Automation
What problem does it solve? Traditional bot software is rather limited: rule-based systems are hard to scale and often produce behavior that will quickly get you flagged as a bad actor - even if you aren't doing anything bad - because any excessive non-organic usage pattern looks shady to app owners.
How does it solve the problem? AutoDroid tackles smartphone task automation with the power of LLMs. It supports popular cloud-based models, such as GPT-3.5 and GPT-4, as well as local 7B models that can be finetuned to work with specific apps. App data is fed into the LLM’s context in order to simulate memory. Just like in a chat, the model enters a dialogue with the app, except that this “dialogue” is more complex and consists of several intermediary steps - think of two people who can't communicate directly. The LLM acts as a guide for the Task Executor, and the app sends back feedback after every action.
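To make the "dialogue" a bit more concrete, here is a minimal sketch of such a loop. Everything in it is hypothetical and only for illustration: `FakeApp` is a toy stand-in for a phone app, `choose_action` is a hard-coded stub where a real system would call an LLM, and `run_task` plays the role of the Task Executor. It is not AutoDroid's actual code or API.

```python
class FakeApp:
    """Toy stand-in for a phone app (hypothetical, not a real Android API)."""

    # Which tap leads from which screen to which.
    TRANSITIONS = {
        ("home", "Settings"): "settings",
        ("settings", "Wi-Fi"): "wifi",
    }
    # UI elements visible on each screen.
    ELEMENTS = {
        "home": ["Settings", "Camera"],
        "settings": ["Wi-Fi", "Bluetooth"],
        "wifi": ["Toggle Wi-Fi"],
    }

    def __init__(self):
        self.screen = "home"
        self.wifi_on = False

    def describe(self):
        """Return the app data that would be placed in the LLM's context."""
        return {"screen": self.screen, "elements": self.ELEMENTS[self.screen]}

    def tap(self, element):
        """Perform an action and return textual feedback."""
        if self.screen == "wifi" and element == "Toggle Wi-Fi":
            self.wifi_on = not self.wifi_on
            return "toggled"
        next_screen = self.TRANSITIONS.get((self.screen, element))
        if next_screen is None:
            return "nothing happened"
        self.screen = next_screen
        return f"opened {next_screen}"


def choose_action(task, ui_state, history):
    """Stub for the LLM call: follows a canned plan instead of reasoning."""
    plan = ["Settings", "Wi-Fi", "Toggle Wi-Fi"]
    already_done = [action for action, _ in history]
    for step in plan:
        if step in ui_state["elements"] and step not in already_done:
            return step
    return None  # nothing left to do on this screen


def run_task(task, app, max_steps=5):
    """Task Executor: ask for an action, apply it, feed the result back."""
    history = []
    for _ in range(max_steps):
        action = choose_action(task, app.describe(), history)
        if action is None:
            break
        feedback = app.tap(action)
        history.append((action, feedback))
    return history


if __name__ == "__main__":
    app = FakeApp()
    for action, feedback in run_task("turn on Wi-Fi", app):
        print(f"{action} -> {feedback}")
    print("Wi-Fi on:", app.wifi_on)
```

The point of the sketch is the shape of the loop: the current screen goes in, one action comes out, and the app's feedback is appended to the history before the next step.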
What’s next? There's certainly still a high barrier to entry for this technology, and the failure rate is non-trivial. But I'm excited to see smartphone automation making big strides, and personally, I’m looking forward to automating some things on my phone.
2. An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation
Watching: TestPilot (paper)
What problem does it solve? Coding LLMs have been all the rage lately, and reasonably so! But with coding assistants evolving, there’s a demand for assistants that can handle additional tasks, such as debugging and designing tests. In terms of unit testing, current automated test generation suffers from a lack of readability (e.g., due to crude variable naming) and a lack of assertions.
How does it solve the problem? Most previous methods relied on conventional techniques based on symbolic execution, evolutionary methods or search. LLMs are trained with human or human-like instructions, so they excel at mimicking natural language and code. This mitigates the two main problems mentioned above: the lack of readability and the lack of assertions.
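A small hand-written contrast shows what these two problems look like in practice. Neither test below is real tool output: `apply_discount` and all test names are made up for illustration, the first test only imitates the opaque style described above, and the other two imitate the human-like style an LLM tends to reproduce.

```python
def apply_discount(price, percent):
    """Return price reduced by percent, rounded to 2 decimals."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


# Opaque style: generic names, and the result is never checked.
def test0():
    var0 = 80.0
    var1 = 25
    var2 = apply_discount(var0, var1)  # computed, but no assertion follows


# Human-like style: descriptive names and an explicit expectation.
def test_apply_discount_reduces_price_by_percentage():
    original_price = 80.0
    discounted = apply_discount(original_price, percent=25)
    assert discounted == 60.0


def test_apply_discount_rejects_percent_above_100():
    try:
        apply_discount(80.0, percent=150)
    except ValueError:
        return  # expected: invalid input is rejected
    raise AssertionError("expected ValueError for percent > 100")


if __name__ == "__main__":
    test0()
    test_apply_discount_reduces_price_by_percentage()
    test_apply_discount_rejects_percent_above_100()
    print("all three tests ran without failing")
```

Note that `test0` passes no matter what `apply_discount` returns, which is exactly why a test without assertions tells you so little.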
What’s next? Tests are often seen as binary - either they pass or they fail. But there’s more to it from a developer's point of view. What if a test fails because it’s the wrong test for your code? Which generated tests are useful and only need a little fixing? Which ones are simply bad generations? Having a more fine-grained evaluation method will be crucial to further improve the user experience.
3. LLaMA-Reviewer: Advancing Code Review Automation with Large Language Models through Parameter-Efficient Fine-Tuning
Watching: LLaMA-Reviewer (paper)
What problem does it solve? While code reviewing can be an effective way to learn collaborative coding, it’s also often perceived as tedious. Current SOTA methods, such as CodeReviewer, are based on pre-trained Transformer models that are large in both storage (~850MB) and parameter count (220M). In the age of GPT-4 this might not seem like much, but keep in mind that for coding, we’d ideally want to finetune the model for each of our code bases.
How does it solve the problem? LLaMA itself is way too big to fully finetune with its ~7B parameters. But luckily, Parameter-Efficient Fine-Tuning (PEFT) is getting better and better. For LLaMA-Reviewer, the researchers explored Prefix-Tuning (PT) and LoRA. The latter performed significantly better - on par with the current SOTA with 26x fewer trainable parameters, taking up 50x less storage space.
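Some back-of-the-envelope arithmetic shows why LoRA ends up so small. LoRA freezes each adapted weight matrix and trains two thin low-rank matrices next to it, so a d_in x d_out matrix contributes only rank * (d_in + d_out) trainable parameters. The hidden size (4096) and layer count (32) below are those of LLaMA-7B. The rank, the number of adapted matrices per layer and the fp16 storage are assumptions picked for illustration, not settings confirmed from the paper.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds for one d_in x d_out weight matrix."""
    return rank * (d_in + d_out)


HIDDEN = 4096            # LLaMA-7B hidden size
LAYERS = 32              # LLaMA-7B transformer layers
RANK = 8                 # assumption: LoRA rank
ADAPTED_PER_LAYER = 4    # assumption: four attention projections per layer
BYTES_PER_PARAM = 2      # assumption: adapter stored in fp16
BASELINE_PARAMS = 220_000_000  # 220M baseline quoted above

per_matrix = lora_trainable_params(HIDDEN, HIDDEN, RANK)
total = per_matrix * ADAPTED_PER_LAYER * LAYERS
size_mb = total * BYTES_PER_PARAM / 1e6
ratio = BASELINE_PARAMS / total

if __name__ == "__main__":
    print(f"per adapted matrix: {per_matrix:,} vs {HIDDEN * HIDDEN:,} frozen")
    print(f"total trainable:    {total:,}")
    print(f"adapter size:       {size_mb:.1f} MB")
    print(f"vs 220M baseline:   {ratio:.1f}x fewer parameters")
```

Under these assumptions the adapter has about 8.4M trainable parameters and weighs roughly 17 MB, which lands in the same ballpark as the 26x and 50x figures quoted above. The actual configuration in the paper may differ.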
What’s next? As this was only done with the smallest version of LLaMA and before LLaMA-2 even existed, there’s still a lot of room for quick improvements. This might be a good time to develop consumer-grade code review software. Sounds like an awesome VSCode plugin (or Cursor if you’re feeling hip).
Thanks for reading LLM Watch! Subscribe for free to receive new posts and support my work - here on LinkedIn or on my substack.