Simplifying the Complexity of GenAI Model Customization
Praveen Jayachandran
Senior Technical Staff Member and Senior Manager at IBM Research
Enterprises are embracing generative AI as a differentiator for their core businesses. Rather than relying on a generic, jack-of-all-trades generative AI model, there is a growing trend of customizing a pre-trained model with enterprise-specific data so that it excels in a particular domain or task. This involves curating and processing high-quality data; choosing a pre-trained model that is right-sized for the use case (larger models cost more to train and serve, and may not justify the return on investment of using GenAI); applying one of several fine-tuning techniques; and building a robust evaluation methodology to ensure the tuned model performs well on the domains and tasks of interest. This is complex work that today requires highly skilled scientists and engineers.
The model customization team at IBM Research is building a tuning platform based on leading open-source projects from the PyTorch and Hugging Face (HF) ecosystems, and we actively contribute back to them. We are not only building the best ingredients and tools for model customization in the open, but are also on a mission to curate the best recipes that simplify model customization, so that domain experts with little knowledge of LLM internals can customize models to suit their needs. This tuning platform forms the basis of the tuning capabilities in Red Hat OpenShift AI and IBM watsonx.ai.
We support a wide range of model architectures, including transformer-based models such as Granite and Llama as well as structured state space models such as Mamba, without the user having to know anything about their internals. We support multiple tuning techniques, including full supervised fine-tuning (SFT), LoRA-based parameter-efficient fine-tuning (PEFT), and quantized LoRA (QLoRA), with support for other techniques such as preference tuning and model distillation planned for the coming months. We recently open-sourced our model optimizer framework for developing reduced-precision models, with support for GPTQ and FP8 quantization, quantization-aware training (QAT), and post-training quantization (PTQ).
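For illustration, here is a minimal sketch of PEFT LoRA on a causal language model using the HF PEFT library. The checkpoint name and hyperparameters are illustrative, and target module names vary by architecture.

```python
# Minimal PEFT LoRA sketch (illustrative checkpoint and hyperparameters).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3.0-2b-instruct")

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank adapters
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; names vary by architecture
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable, typically <1% of weights
```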
Model Customization is Efficient
We have made several advancements over the past year to make model customization more resource-efficient. The usual way to collate multiple sequences during supervised fine-tuning is to pad them to the same length; these padding tokens introduce inefficiency because they result in unproductive computation. By packing sequences together without padding, carrying token position information, and making this work with FlashAttention, we obtained roughly 2x throughput improvement across models and tuning techniques. This work was contributed to HF Transformers and TRL (HF blog, PR, PR).
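To illustrate the padding-free approach, here is a minimal sketch using the DataCollatorWithFlattening collator in HF Transformers. It assumes a recent transformers release that ships this collator, the flash-attn package, and a CUDA GPU; the checkpoint name and the tiny inline dataset are purely illustrative.

```python
# Padding-free packing sketch: sequences are concatenated into a single row and
# position_ids restart at 0 for each original sequence, so FlashAttention 2 keeps
# packed sequences from attending to each other without any padding tokens.
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorWithFlattening,
    Trainer,
    TrainingArguments,
)

model_name = "ibm-granite/granite-3.0-2b-instruct"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # required for the position-id trick
)

# Tiny pre-tokenized dataset purely for demonstration.
texts = ["A short example document.", "Another, slightly longer example for packing."]
train_ds = Dataset.from_dict({"input_ids": [tokenizer(t).input_ids for t in texts]})

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=2, bf16=True),
    train_dataset=train_ds,
    data_collator=DataCollatorWithFlattening(),  # no padding tokens are added
)
trainer.train()
```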
We also helped establish parity between the FSDP and DeepSpeed backends in HF so that users can move between them seamlessly (HF blog, Accelerate documentation, PR). The two backends previously produced different loss behavior because DeepSpeed internally upcasts weights loaded in bf16 to fp32. We raised a PR in HF Accelerate to perform the same automatic upcast for FSDP when mixed precision is enabled, and included a guide to help users achieve equivalence between the two backends.
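To illustrate the idea (a simplified sketch, not the actual Accelerate implementation), the upcast amounts to promoting bf16 trainable weights to fp32 when mixed precision is enabled, so the optimizer works on full-precision master weights just as DeepSpeed does:

```python
import torch

def upcast_trainable_params(model: torch.nn.Module) -> None:
    # With mixed precision, the forward/backward pass still runs in bf16, but the
    # optimizer should update fp32 master weights; upcasting here aligns FSDP's
    # loss behaviour with DeepSpeed's automatic upcast.
    for param in model.parameters():
        if param.requires_grad and param.dtype == torch.bfloat16:
            param.data = param.data.float()
```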
We have implemented fused operations and custom Triton kernels, adapted from Unsloth, to accelerate tuning, including fused low-rank adapter operations and fast kernels for RoPE, layer norm, and cross-entropy, all available through our acceleration library. As an example, these kernels improve throughput by ~40% and reduce memory requirements by ~30% for Mistral-7B. We also have Triton kernels for parallelizing expert computations in mixture-of-experts (MoE) models.
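To give a flavor of what such a kernel looks like, here is a minimal, illustrative Triton kernel for RMSNorm, the layer-norm variant used in Llama-family models. It is a simplified forward-only sketch, not the kernel shipped in our acceleration library.

```python
# Forward-only RMSNorm in Triton: one program instance normalizes one row.
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    # Root-mean-square of the row, computed in a single pass.
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    y = x / rms * w
    tl.store(out_ptr + row * n_cols + cols, y, mask=mask)

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Expects a contiguous 2D CUDA tensor of shape (n_rows, n_cols).
    n_rows, n_cols = x.shape
    out = torch.empty_like(x, dtype=torch.float32)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    rmsnorm_kernel[(n_rows,)](x, weight, out, n_cols, eps, BLOCK_SIZE=BLOCK_SIZE)
    return out.to(x.dtype)
```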
Model Customization is Simple
We have developed and adopted several tools to make model customization simpler for our users. A resource estimator predicts memory requirements and training time before a tuning job is executed. It uses a hybrid method that combines theoretical knowledge of training computations for different architectures and tuning techniques with learned regression models fit to empirical observations from actual tuning runs. This helps users navigate the trade-off between how long a tuning job will take and how much resource (and cost) they are willing to expend.
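As a concrete illustration of the theoretical half of such an estimator (a simplified sketch, not our actual tool), full fine-tuning with AdamW in mixed precision needs roughly 16 bytes per parameter before activations are counted:

```python
def estimate_full_finetune_memory_gb(num_params_billion: float) -> float:
    """Back-of-the-envelope GPU memory for full fine-tuning with AdamW in mixed precision."""
    params = num_params_billion * 1e9
    weights = 2 * params           # bf16 model weights
    gradients = 2 * params         # bf16 gradients
    optimizer_state = 12 * params  # fp32 master weights + Adam first/second moments
    return (weights + gradients + optimizer_state) / 1e9  # bytes -> GB (decimal)

# A 7B model needs ~112 GB before activations, which is why PEFT/LoRA
# or sharding across multiple GPUs is usually required.
print(f"~{estimate_full_finetune_memory_gb(7):.0f} GB for a 7B-parameter model")
```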
We developed a trainer-controller framework that controls the training loop and performs actions such as stopping training, checkpointing, and logging, based on user-defined metrics and rules implemented via HF trainer callbacks. This lets users automatically stop training early if it is not progressing well, or to avoid overfitting. We also have a pluggable mechanism to integrate experiment tracking tools such as AimStack, Weights & Biases, and MLflow, making it easy to monitor training in near real time.
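As a minimal illustration of the underlying mechanism (a plain HF TrainerCallback sketch, not the trainer-controller rule syntax itself), a callback can stop training when the loss stops improving:

```python
from transformers import TrainerCallback

class StopOnPlateau(TrainerCallback):
    """Stop training if the logged training loss has not improved for `patience` logs."""

    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best_loss = float("inf")
        self.stale_logs = 0

    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is None:
            return
        if loss < self.best_loss:
            self.best_loss = loss
            self.stale_logs = 0
        else:
            self.stale_logs += 1
            if self.stale_logs >= self.patience:
                control.should_training_stop = True  # gracefully ends the training loop

# Usage: Trainer(..., callbacks=[StopOnPlateau(patience=10)])
```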
We are just getting started, and there are several other advancements we are working on jointly with others in the community. These include tensor parallelism support with PyTorch DTensors in HF Transformers and Accelerate, PyTorch compile optimizations for our HF-based tuning stack, advances in cluster-level job management, improved tuning efficiency on the IBM Spyre AI accelerator chip, and tools to improve training data quality. I am extremely proud of, and thankful to, the global research team behind these advancements. We intend to publish more detailed blogs on each of these topics in the coming months.
If you have prior experience in this area and are passionate about building the best platform for generative AI, we are hiring for our team at IBM's India Research Lab and would love to hear from you!