Simplifying the Complexity of GenAI Model Customization


Enterprises are embracing generative AI as a differentiator for their core businesses. Rather than simply using a generic, jack-of-all-trades generative AI model, there is a growing trend of customizing a pre-trained model with additional enterprise-specific data so that it excels in a particular domain or task within the enterprise. This involves curating and processing high-quality data, choosing a good pre-trained model that is right-sized for the use case (larger models are more expensive to train and serve, and may not justify the return on investment of using GenAI), applying one of several fine-tuning techniques, and building a robust evaluation methodology to ensure the trained model will perform well in the domain and tasks of interest. This is complex work that requires highly skilled scientists and engineers.

The model customization team at IBM Research is building a tuning platform based on leading open-source community projects from the PyTorch and Hugging Face (HF) ecosystems, and we actively contribute back to them. We are not only building the best ingredients and tools for model customization in the open, but we are also on a mission to curate the best recipes that simplify model customization and enable domain experts with little knowledge of LLM internals to customize models to suit their needs. This tuning platform forms the basis of the tuning capabilities in Red Hat OpenShift AI and IBM watsonx.ai.

We support a wide range of model architectures, including transformer-based models such as Granite and Llama as well as structured state space models such as Mamba, without the user having to know anything about their internals. We support multiple tuning techniques, including full supervised fine-tuning (SFT), PEFT LoRA, and quantized LoRA, with support for other techniques such as preference tuning and model distillation planned for the coming months. We recently open-sourced our model optimizer framework to develop reduced-precision models, with support for GPTQ and FP8 quantization, quantization-aware training (QAT), and post-training quantization (PTQ). A minimal LoRA example is shown below.
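As a concrete illustration of one of these techniques, the sketch below shows how a LoRA adapter is typically attached to a causal language model using the Hugging Face PEFT library. The model name and hyperparameters are placeholders chosen for illustration, not the defaults of our platform.

```python
# Minimal LoRA fine-tuning setup with Hugging Face PEFT (illustrative values only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "ibm-granite/granite-3.0-2b-instruct"  # placeholder; any causal LM works
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LoRA trains small low-rank matrices added to selected projection layers,
# leaving the base weights frozen.
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which linear layers get adapters
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

The resulting model can be passed to a standard HF Trainer or TRL trainer; only the adapter weights are updated and saved.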


Model Customization is Efficient

We have made several advancements over the past year to make model customization more resource efficient. The usual method to collate multiple sequences during supervised fine-tuning is to pad them to the same length. These padding tokens introduce inefficiencies, as they result in unproductive computation. By packing sequences together without padding, using token position information, and making this work with FlashAttention, we obtained roughly a 2x throughput improvement across models and tuning techniques. This was contributed to HF Transformers and TRL (HF blog, PR, PR). A minimal sketch of the packing idea is shown below.
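The sketch below is a conceptual illustration of padding-free packing using plain PyTorch tensors: position ids restart at each sequence boundary, and cumulative sequence lengths tell variable-length attention kernels where one sequence ends and the next begins. It is not the actual collator contributed to HF Transformers/TRL.

```python
import torch

def pack_without_padding(sequences):
    """Concatenate variable-length token sequences into one row with no pad tokens.

    Returns a flat token stream plus (a) position ids that restart at 0 for each
    sequence and (b) cumulative sequence lengths, which variable-length
    FlashAttention kernels use to keep attention from crossing boundaries.
    """
    input_ids, position_ids, cu_seqlens = [], [], [0]
    for seq in sequences:
        input_ids.extend(seq)
        position_ids.extend(range(len(seq)))
        cu_seqlens.append(cu_seqlens[-1] + len(seq))
    return (
        torch.tensor([input_ids]),     # shape (1, total_tokens), zero padding
        torch.tensor([position_ids]),  # restarts at 0 per packed sequence
        torch.tensor(cu_seqlens),      # boundaries for varlen attention
    )

# Example: three short "tokenized" sequences packed into a single row.
ids, pos, cu = pack_without_padding([[5, 9, 2], [7, 8], [3, 1, 4, 6]])
print(ids.shape, pos.tolist(), cu.tolist())
# torch.Size([1, 9]) [[0, 1, 2, 0, 1, 0, 1, 2, 3]] [0, 3, 5, 9]
```

Every position in the packed row now contributes useful gradient signal, which is where the throughput gain comes from.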

We helped establish parity between the FSDP and DeepSpeed backends in HF so that users can move between them seamlessly (HF blog, Accelerate documentation, PR). The two backends previously behaved differently because DeepSpeed internally upcasts weights loaded in bf16 to fp32, causing the loss curves to diverge. We raised a PR in HF Accelerate to upcast automatically for FSDP when mixed precision is enabled, and included a guide to help users achieve equivalence between the two backends. A sketch of the general pattern is shown below.
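As a rough illustration (not the exact code path of the Accelerate fix), the snippet below shows the pattern the guide describes: run compute in bf16 mixed precision while keeping trainable parameters in fp32, mirroring what DeepSpeed does internally. The checkpoint name is a placeholder.

```python
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

accelerator = Accelerator(mixed_precision="bf16")  # bf16 compute, as with DeepSpeed

model = AutoModelForCausalLM.from_pretrained(
    "ibm-granite/granite-3.0-2b-instruct",  # placeholder checkpoint stored in bf16
    torch_dtype=torch.bfloat16,
)

# Keeping trainable (master) weights in fp32 mirrors DeepSpeed's internal upcast,
# so optimizer updates accumulate with the same numerics under FSDP.
for param in model.parameters():
    if param.requires_grad:
        param.data = param.data.to(torch.float32)

model = accelerator.prepare(model)
```

With this setup, loss curves under FSDP and DeepSpeed track each other closely instead of drifting apart because of precision differences in the optimizer step.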

We have implemented fused operations and custom Triton kernels, adapted from Unsloth, to accelerate tuning, including fused low-rank adapter operations and fast kernels for RoPE, layer norm, and cross-entropy, and made them available through our acceleration library. As an example, these kernels improve throughput by ~40% and reduce memory requirements by ~30% for Mistral-7B. We also have Triton kernels for parallelizing expert computations in mixture-of-experts (MoE) models. A toy example of kernel fusion is shown below.
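The toy Triton kernel below, following the standard Triton tutorial pattern, only illustrates the idea of fusion: performing an add and an activation in a single pass over memory instead of two. It is not one of the kernels shipped in our acceleration library, which fuse RoPE, layer norm, cross-entropy, and LoRA operations.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Fusion: add + ReLU in one kernel avoids materializing the
    # intermediate (x + y) tensor in GPU memory.
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
assert torch.allclose(fused_add_relu(x, y), torch.relu(x + y))
```

The same principle, applied to heavier operations such as cross-entropy over large vocabularies, is what drives the throughput and memory savings quoted above.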


Model Customization is Simple

We have developed and adopted several tools to make model customization simpler for our users. We built a resource estimator that estimates memory requirements and training time before a tuning job is even executed. It is based on a hybrid method that combines theoretical knowledge of model-training computations for different architectures and tuning techniques with regression models learned from empirical observations of actual tuning runs. This helps quantify the trade-off a user faces between how long a tuning job will take to complete and how much compute (and cost) they are willing to expend; a rough back-of-the-envelope version of such an estimate is sketched below.
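For intuition only, the sketch below shows the kind of first-order memory estimate the theoretical half of such an estimator starts from. It uses the well-known ~16 bytes per parameter for mixed-precision Adam (bf16 weights and gradients plus fp32 master weights and optimizer moments) and ignores activations, which the real estimator models per architecture and tuning technique.

```python
def estimate_full_finetune_memory_gib(num_params_billions: float) -> float:
    """Rough lower bound on GPU memory for full fine-tuning with Adam.

    Per parameter (mixed precision): 2 bytes bf16 weights + 2 bytes bf16
    gradients + 4 bytes fp32 master weights + 8 bytes fp32 Adam moments
    = 16 bytes. Activations, temporary buffers, and framework overhead
    come on top and depend on batch size, sequence length, and architecture.
    """
    bytes_per_param = 16
    return num_params_billions * 1e9 * bytes_per_param / 1024**3

# A 7B-parameter model needs on the order of ~104 GiB for model states alone,
# which is why such jobs are sharded (FSDP/DeepSpeed) or tuned with LoRA instead.
print(f"{estimate_full_finetune_memory_gib(7):.0f} GiB")
```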

We developed a trainer-controller framework that controls the training loop and performs actions such as stopping training, checkpointing, and logging, based on user-defined metrics and rules, using HF Trainer callbacks. This gives users the ability to automatically stop training early if it is not progressing well, or to avoid overfitting. We also have a pluggable mechanism to integrate experiment-tracking tools such as AimStack, wandb, and MLflow tracking, to monitor training in near real time. A minimal callback along these lines is sketched below.
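The minimal callback below illustrates the underlying HF Trainer callback mechanism with a single hard-coded rule (stop if the training loss stalls). The actual trainer-controller framework expresses such rules declaratively through user-defined metrics rather than custom code like this, and the patience and threshold values here are made up for illustration.

```python
from transformers import TrainerCallback

class StopOnStalledLoss(TrainerCallback):
    """Stop training early if the logged training loss stops improving."""

    def __init__(self, patience: int = 3, min_delta: float = 1e-3):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.stalled_steps = 0

    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is None:
            return control
        if loss < self.best_loss - self.min_delta:
            self.best_loss = loss
            self.stalled_steps = 0
        else:
            self.stalled_steps += 1
            if self.stalled_steps >= self.patience:
                control.should_training_stop = True  # Trainer exits its loop cleanly
        return control

# Usage: trainer = Trainer(..., callbacks=[StopOnStalledLoss(patience=5)])
```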



We are just getting started, and there are several other advancements we are working on jointly with others in the community. These include adding tensor parallelism support with PyTorch DTensors in HF Transformers and Accelerate, PyTorch compile optimizations for our tuning stack built on HF libraries, advances in cluster-level job management, improving tuning efficiency on the IBM Spyre AI accelerator chip, and tools to improve training data quality. I am extremely proud of and thankful to the global research team behind these advancements. We intend to publish more detailed blogs on each of these topics in the coming months.

If you have prior experience in this area and are passionate about building the best platform for generative AI, we are hiring for our team at IBM’s India Research Lab and would love to hear from you!

