Predecessor of Phi-3 – Textbooks Are All You Need | Speech-to-Speech Translation with Monolingual Data

Hello All,

This is Raghul Gopal, an AWS Community Builder (ML & GenAI) and a research enthusiast in AI and AGI. Welcome to the Learn with Me newsletter, where I focus on advancements in Generative AI.

1. Textbooks Are All You Need – Demystifying Microsoft Phi-1

With the release of Microsoft Phi-3, Small Language Models (SLMs) have made a big impact and now offer serious competition to other state-of-the-art Large Language Models, so it is worth understanding the ancestor of Phi-3: Phi-1. Phi-1 is a language model designed specifically for code. It is a transformer-based model with 1.3B parameters, trained for 4 days on 8 A100 GPUs. Its training data is composed of two types:

  1. Textbook-quality data from the web (6B tokens)
  2. Synthetically generated textbooks and exercises produced with GPT-3.5 (1B tokens)

Phi-1 reaches pass@1 accuracy of 50.6% on HumanEval and 55.5% on MBPP. The family also includes phi-1-small, a 350M-parameter model trained with the same pipeline as Phi-1, which reaches 45% on HumanEval. The Phi-1 team focused on data quality, which leads to better results, for example through data cleaning. With cleaner data, smaller datasets have advantages of their own, and in particular they allow more passes over the dataset. The recent work of Eldan and Li on TinyStories, a high-quality synthetically generated dataset used to teach English to neural networks, paved the way for this approach: it dramatically changes the shape of the scaling laws and allows the performance of large-scale models to be matched with much leaner training and much smaller models.
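Since pass@1 is the headline metric here, a quick aside: below is a minimal Python sketch of the standard unbiased pass@k estimator popularized alongside HumanEval, evaluated at k=1. The sample counts in the example are made up for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k sampled
    completions passes, given n generations of which c pass the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: 10 generations for a problem, 5 pass -> pass@1 = 0.5
print(pass_at_k(n=10, c=5, k=1))
```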

Phi-1 shows that training on high-quality data can improve the state of the art for LLMs while reducing both dataset size and training compute. A smaller training-compute footprint also reduces the environmental cost of building the model. Phi-1 is focused mainly on code, more specifically simple Python functions.

Let’s have a look at the training details and the importance of high-quality data. Data from The Stack (a code repository) plus web-based datasets (Stack Overflow and code contests) is not optimal for teaching the model how to reason and plan. Code data with lots of noise, ambiguity, and incompleteness reduces both the quality and the quantity of the signal that maps natural language to code. To mitigate these issues, the researchers built a "textbook": a clear, self-contained, instructive, and balanced dataset composed of three parts:

  1. Filtered code-language dataset – a subset of The Stack and Stack Overflow (6B tokens)
  2. Synthetic textbook – under 1B tokens of GPT-3.5-generated Python textbooks
  3. Small synthetic exercises – roughly 180M tokens of Python exercises and solutions

Figure: example from the synthetic textbook dataset.

Figure: example from the small synthetic exercises dataset.

The filtered code-language dataset plus the synthetic textbook is termed CodeTextbook and is used to pretrain the base model, phi-1-base, which reaches a HumanEval performance of 29%. The synthetic exercises are termed CodeExercises and are used to finetune phi-1-base into the final model, Phi-1.

Training details of the Phi-1 family of models

Existing code datasets (The Stack and Stack Overflow) are annotated with GPT-4 using a prompt along the lines of "determine its educational value for a student whose goal is to learn basic coding concepts". These annotations are then used to train a random forest classifier that predicts the quality of a file/sample from its output embedding, with a pretrained CodeGen model providing the embedding features. A minimal sketch of this filtering step is given after the note below.


Make a note that:

  • GPT-3.5 is used to generate the synthetic data.
  • GPT-4 is used to annotate the existing data, avoiding tedious human-annotation effort.
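As a concrete (and heavily simplified) illustration of the filtering step, here is a minimal sketch: LLM-style labels on a small seed set, embeddings from a pretrained model, and a random forest trained on top. The embedder, seed examples, and labels are stand-ins chosen for this sketch; the paper uses output embeddings from a pretrained CodeGen model rather than a sentence-transformer.

```python
# Minimal sketch of the quality-filtering step (illustrative only; the model
# names, seed data, and labels below are assumptions, not the paper's pipeline).
from sklearn.ensemble import RandomForestClassifier
from sentence_transformers import SentenceTransformer  # stand-in embedder

# 1) A small seed set of code files, labeled by prompting an LLM to
#    "determine its educational value for a student whose goal is to learn
#    basic coding concepts" -> 1 (educational) / 0 (not).
seed_files = [
    "def add(a, b):\n    # well-explained helper\n    return a + b",
    "x=1;y=2;print(x+y)  # unexplained one-liner",
]
seed_labels = [1, 0]  # hypothetical LLM annotations

# 2) Embed each file with a pretrained model (the paper uses a pretrained
#    CodeGen model's output embeddings; a generic embedder stands in here).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
X = embedder.encode(seed_files)

# 3) Train a random forest on the embeddings to predict educational value.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, seed_labels)

# 4) Score the rest of the corpus and keep only high-value files.
corpus = ["def square(n):\n    return n * n"]
keep = clf.predict(embedder.encode(corpus))
filtered_corpus = [f for f, k in zip(corpus, keep) if k == 1]
print(len(filtered_corpus))
```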

Let’s have a look at the Phi-1 model architecture.

It is a decoder-only architecture using the FlashAttention implementation of multi-head attention (MHA). MHA and MLP layers are used in a parallel configuration, following CodeGen, PaLM, and GPT-NeoX. The model has 24 layers, a hidden dimension of 2048, an MLP inner dimension of 8192, and 32 attention heads of dimension 64 each.

phi-1-small has 20 layers, a hidden dimension of 1024, an MLP inner dimension of 4096, and 16 attention heads of dimension 64 each. Rotary position embeddings with a rotary dimension of 32 are used. Phi-1 makes the same choices as CodeGen, reusing the tokenizer of CodeGen-350M-mono. Training uses fp16 with the AdamW optimizer, a linear-warmup/linear-decay learning-rate schedule, and attention and residual dropout of 0.1. Phi-1 was trained on 8 NVIDIA A100 GPUs using DeepSpeed.
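To make the parallel MHA/MLP configuration concrete, here is a minimal PyTorch sketch of a single decoder block with Phi-1's published dimensions. This is not the actual Phi-1 implementation: FlashAttention, rotary position embeddings, and the causal mask are all omitted for brevity.

```python
import torch
import torch.nn as nn

class ParallelDecoderBlock(nn.Module):
    """Decoder block with attention and MLP applied in parallel to the same
    normalized input (as in CodeGen/PaLM/GPT-NeoX), using Phi-1's sizes."""

    def __init__(self, d_model=2048, n_heads=32, d_mlp=8192, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_mlp), nn.GELU(), nn.Linear(d_mlp, d_model))
        self.drop = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        # Parallel configuration: one residual stream receives both branches.
        return x + self.drop(attn_out) + self.drop(self.mlp(h))

# Phi-1 stacks 24 such blocks; phi-1-small uses 20 blocks with d_model=1024,
# d_mlp=4096, and 16 heads. Rotary embeddings (rotary dim 32) are omitted here.
block = ParallelDecoderBlock()
out = block(torch.randn(1, 16, 2048))
print(out.shape)  # torch.Size([1, 16, 2048])
```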

Let’s see the benefits of finetuning phi-1-base into Phi-1:

  1. Improves Model Understanding

Figure: sample outputs of Phi-1, phi-1-base, and phi-1-small; note how finetuning phi-1-base into Phi-1 changes the model's response.

2. Improves the model’s ability to use external libraries.

Figure: sample output showing the finetuned model's improved ability to use external libraries.

3. Data pruning has been applied, removing training data that overlaps with the evaluation problems so that performance evaluation remains unbiased. The pruning techniques are listed below (a small sketch of the n-gram check follows the list):

  • N-gram overlap
  • Embedding and syntax-based similarity analysis
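For intuition, here is a tiny sketch of an n-gram-overlap check between training exercises and benchmark problems. The whitespace tokenization and the choice of n are simplifications made for this illustration, not the paper's exact procedure.

```python
def ngrams(text: str, n: int = 4) -> set:
    """Word-level n-grams of a snippet (whitespace tokenization for simplicity)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlaps(train_snippet: str, eval_snippets: list[str], n: int = 4) -> bool:
    """True if the training snippet shares any n-gram with an evaluation problem."""
    train_grams = ngrams(train_snippet, n)
    return any(train_grams & ngrams(e, n) for e in eval_snippets)

# Keep only training exercises that do not overlap with the benchmark.
train_set = ["def add(a, b):\n    return a + b  # toy exercise"]
humaneval_like = ["def add(a, b):\n    return a + b"]
pruned = [ex for ex in train_set if not overlaps(ex, humaneval_like)]
print(len(pruned))  # 0 -> the overlapping exercise was pruned
```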

Figure: comparison of Phi-1 with other state-of-the-art models on the HumanEval and MBPP benchmarks.

Access the paper using this link: https://arxiv.org/abs/2306.11644

2. Translatotron 3: Speech-to-Speech Translation with Monolingual Data

Translatotron 3 is a novel approach to speech-to-speech translation (S2ST) trained from monolingual speech-text datasets, combining a masked autoencoder, unsupervised embedding mapping, and back-translation. Previous S2ST research primarily used supervised learning that relies on bilingual speech datasets, which raises the following issues:

  1. Supporting low-resource languages is difficult because collecting bilingual speech datasets that include these languages is hard.
  2. Because bilingual speech datasets lack the corresponding para-/non-linguistic information, these characteristics of the source speech cannot be transferred to the translated speech.

To resolve this, an unsupervised machine-translation approach is needed that does not depend on bilingual speech datasets.

Here are the core ideas behind Translatotron 3:

  • Pretraining the entire model as a masked autoencoder with SpecAugment.
  • Unsupervised embedding mapping based on multilingual unsupervised embeddings (MUSE).
  • A reconstruction loss based on back-translation, used to train the encoder-decoder direct S2ST model (inherited from Translatotron 2) without bilingual supervision.
  • Overall, the model is trained with the unsupervised MUSE embedding loss, the reconstruction loss, and the S2S back-translation loss.

Let’s have a look at the Architecture of Translatotron 3 in detail

Training proceeds in two phases: Phase 1 uses the reconstruction loss via the auto-encoding path, while Phase 2 employs the reconstruction loss via back-translation.

  1. Auto-encoding reconstruction phase – the network learns to generate meaningful multilingual representations.
  2. Back-translation phase – the network is further trained to translate the input spectrogram via back-translation.

Finally, the model combines three losses: the MUSE loss, the reconstruction loss, and the back-translation loss. A rough sketch of how they might be combined is shown below.
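As a rough illustration only, here is a minimal PyTorch sketch of how the three losses could be combined into a single training objective. The specific loss functions, tensor shapes, and weights are assumptions for this sketch, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def translatotron3_loss(enc_out, muse_emb, recon_spec, target_spec,
                        bt_spec, bt_target, w_muse=1.0, w_rec=1.0, w_bt=1.0):
    """Illustrative weighted sum of the three training losses.
    Loss choices, tensor shapes, and weights are assumptions for this sketch."""
    # MUSE loss: pull encoder outputs toward pretrained multilingual embeddings.
    muse_loss = F.l1_loss(enc_out, muse_emb)
    # Reconstruction loss: the auto-encoding path should reproduce the input spectrogram.
    recon_loss = F.mse_loss(recon_spec, target_spec)
    # Back-translation loss: a round-trip translation should recover the original.
    bt_loss = F.mse_loss(bt_spec, bt_target)
    return w_muse * muse_loss + w_rec * recon_loss + w_bt * bt_loss

# Dummy tensors just to show the call; real inputs come from the encoder/decoder.
def dummy():
    return torch.randn(2, 50, 256)

print(translatotron3_loss(dummy(), dummy(), dummy(), dummy(), dummy(), dummy()))
```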

Access the paper using this link: https://arxiv.org/abs/2305.17547

That’s it for Week 4. Happy Day, Happy AI.

Follow me, Raghul Gopal, to stay up to date on releases in AI and AGI, explained clearly.




Interested in learning the foundations of code language models? The next issue focuses on a base model behind code generation and code debugging with LLMs. Guess what? It is DeepSeekCoder.
