Dataformer

Technology, Information and Internet

Create, Curate & Clean Datasets for Large Language Models.


About us

Solving data for LLMs - Create quality synthetic datasets! Open Source & Local. Contact us for enterprise solutions.

Website: https://dataformer.ai/
Industry: Technology, Information and Internet
Company size: 2-10 employees
Type: Privately Held

Employees at Dataformer

Akshit Gautam

Software Engineer - LLMs

SOURABH SINGH

M.L Engineer


Updates

Dataformer

2,586 followers
5 months ago
Find out how you can leverage synthetic data to create your own local but powerful LLMs.
Satpal Singh Rathore

Building human friendly AI | Machine Learning, AI, Synthetic Data
5 months ago

Thrilled to announce that I'll be presenting synthetic data generation with Dataformer at the upcoming MagicBall event. Maybe you can use it to create your own CoT and reasoning datasets to build a local mini o1? Who knows? Join me and hundreds of other AI makers to figure this out at India's largest AI festival.
Check out Dataformer: https://lnkd.in/gd2siUtP
Event details:
Date: 30th Sept - 4th October, 2024
Location: Bangalore
Link: https://lnkd.in/gCsGV2du
Thanks to Siddharth Verma and Rohan Sood from Grayscale Ventures for hosting this for all.
Dataformer

2,586 followers
7 months ago
Llama3.1 + Together AI + Dataformer = Powerful Synthetic Datasets!
Satpal Singh Rathore

Building human friendly AI | Machine Learning, AI, Synthetic Data
7 months ago

The Llama3.1 license allows creating synthetic data to train more powerful models. Here's how you can leverage this in the easiest way: Dataformer: https://lnkd.in/gdg27vm6
Dataformer

2,586 followers
7 months ago
Generate synthetic datasets locally with Ollama & Dataformer
1 comment
Dataformer

2,586 followers
7 months ago
2 mins vs 40 mins. Dataformer makes it easy.
Together AI

46,244 followers
7 months ago

We finetuned Llama-3-8B on math problems and pushed accuracy from 47% to 65%, getting over 90% of GPT-4's performance in an 8B model at a fraction of the cost. Our small 8B fine-tuned model outperformed the base model by nearly 20 percentage points, beat out top OSS model Llama-3-70B, and achieved over 90% of GPT-4o's accuracy! In the blog post, we go through all the code to finetune Llama-3 models on your own data, from data cleaning to finetuning to running evals. Read more: https://lnkd.in/e7VjSSJh
Dataformer

2,586 followers
7 months ago
We are working to integrate the best synthetic data generation recipes in Dataformer.
Satpal Singh Rathore

Building human friendly AI | Machine Learning, AI, Synthetic Data
7 months ago

Synthetic Data won the AI Mathematical Olympiad 2024 Progress Prize!
-> Good data is all you need!
-> Note: I'll be integrating the data creation pipeline in https://lnkd.in/gd2siUtP soon.
Overall system used by NuminaMath for winning:
1. Used the deepseek_ai math base 7B model
2. Custom synthetic datasets:
> Stage 1: 100k+ CoT dataset from Math PDFs
> Stage 2: 60k problems (Python REPL), GPT-4 generated w/ ToRA
3. Two-stage fine-tuning (MuMath-Code paper), no LoRA/DoRA, packed to 2048:
> Stage 1: SFT on CoT data
> Stage 2: SFT on TIR w/ Python REPL
4. Quantized the model to 8-bit for T4 GPUs
5. SC-TIR for inference:
> Generate 48 candidates w/ Python REPL
> Repeat gen/exec up to 3x for errors
> Majority vote for final answer
6. Multiple validation sets to avoid overfitting.
Synthetic dataset generation:
1. Collect a diverse dataset of natural language math problems and solutions -> Template each solution with CoT for reasoning. The base model is fine-tuned on this.
2. Then, a synthetic dataset with tool-integrated reasoning is created. Each math problem is decomposed into a sequence of rationales, Python programs, and their outputs. This is the same as in Microsoft's ToRA paper. GPT-4 was prompted to produce solutions in the ToRA format with code execution feedback. The stage 1 model was then fine-tuned on this dataset.
#AI #Mathematics #DataScience #Innovation #DataCreation #ArtificialIntelligence #SyntheticData
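The SC-TIR inference step described in the post (generate candidates, retry generation/execution on errors, majority vote) can be sketched roughly as below. This is an illustration under stated assumptions, not NuminaMath's or Dataformer's actual code: `generate_program` is a hypothetical stand-in for an LLM call that returns Python source, the function names are invented for this example, and `exec` is used without any sandboxing, which a real pipeline would need.

```python
import contextlib
import io
from collections import Counter
from typing import Callable, Optional, Tuple

# Hypothetical generator signature: (problem, error_feedback) -> Python source.
Generator = Callable[[str, str], str]


def run_program(source: str) -> Tuple[bool, str]:
    """Run a candidate program and capture what it prints (no sandboxing)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {})
    except Exception as exc:
        # Return the error text so it can be fed back to the generator.
        return False, f"{type(exc).__name__}: {exc}"
    return True, buffer.getvalue().strip()


def solve_one(problem: str, generate_program: Generator,
              max_rounds: int = 3) -> Optional[str]:
    """One candidate: repeat generate/execute up to max_rounds on errors."""
    feedback = ""
    for _ in range(max_rounds):
        source = generate_program(problem, feedback)
        ok, output = run_program(source)
        if ok and output:
            return output
        feedback = output or "program printed nothing"
    return None


def sc_tir(problem: str, generate_program: Generator,
           num_candidates: int = 48, max_rounds: int = 3) -> Optional[str]:
    """Majority vote over the answers of all candidates that succeeded."""
    answers = [solve_one(problem, generate_program, max_rounds)
               for _ in range(num_candidates)]
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]
```

In use, `generate_program` would wrap a call to the fine-tuned model and pass the previous error message back in the prompt; candidates that never produce an answer are simply left out of the vote.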
Dataformer

2,586 followers
7 months ago
Turn a passage into multi-turn conversational dataset with Dataformer. Link to colab notebook in comments.
2 comments
Dataformer

2,586 followers
8 months ago
At Dataformer, we are integrating techniques to leverage LLM intelligence for data creation.
Satpal Singh Rathore

Building human friendly AI | Machine Learning, AI, Synthetic Data
8 months ago

Intelligence of LLMs will continue to improve. Human intelligence will not. OpenAI has trained a model, based on GPT-4, called CriticGPT to catch errors in ChatGPT's code output. It was observed that when people get help from CriticGPT to review ChatGPT code they outperform those without help 60% of the time. They are working to integrate CriticGPT-like models into their RLHF labeling pipeline, providing their trainers with explicit AI assistance. This is a step towards being able to evaluate outputs from advanced AI systems that can be difficult for people to rate without better tools.


Similar pages

Bhabha AI

SocialPost.ai

Savormetrics

M0

Kumori.ai

Easework AI

Dishcare

ClearML

Simplismart

Ikan Inc.
