Dataformer

Technology, Information and Internet

Create, Curate & Clean Datasets for Large Language Models.

About us

Solving data for LLMs - Create quality synthetic datasets! Open Source & Local. Contact us for enterprise solutions.

Website
https://dataformer.ai/
Industry
Technology, Information and Internet
Company size
2-10 employees
Type
Privately held

Updates

  • Dataformer

    2,586 followers

    Find out how you can leverage synthetic data to create your own local but powerful LLMs.

    Satpal Singh Rathore

    Building human friendly AI | Machine Learning, AI, Synthetic Data

    Thrilled to announce that I'll be presenting synthetic data generation with Dataformer at the upcoming MagicBall event. Maybe you can use it to create your own CoT & reasoning datasets to build a local mini o1? Who knows? Join me and hundreds of other AI makers to figure this out at India's largest AI festival.

    Check out Dataformer: https://lnkd.in/gd2siUtP

    Event details:
    Date: 30th Sept - 4th October, 2024
    Location: Bangalore
    Link: https://lnkd.in/gCsGV2du

    Thanks to Siddharth Verma and Rohan Sood from Grayscale Ventures for hosting this for all.

  • Dataformer

    2,586 followers

    2 mins vs 40 mins. Dataformer makes it easy.

    Together AI

    46,244 followers

    We fine-tuned Llama-3-8B on math problems and pushed accuracy from 47% to 65%, getting over 90% of GPT-4's performance in an 8B model at a fraction of the cost. Our small 8B fine-tuned model outperformed the base model by nearly 20%, beat out the top OSS model Llama-3-70B, and achieved over 90% of GPT-4o's accuracy! In the blog post, we go through all the code to fine-tune Llama-3 models on your own data, from data cleaning to fine-tuning to running evals. Read more: https://lnkd.in/e7VjSSJh

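The accuracy figures quoted in the post above can be read in a few ways. The short calculation below works only from the two numbers stated there (47% and 65%) and the "over 90%" claim; the variable names are illustrative, and the implied bound on GPT-4o's accuracy is an inference from those claims, not a figure given in the post.

```python
# Figures taken from the post: base and fine-tuned accuracy on math problems.
base_acc = 0.47
tuned_acc = 0.65

# Absolute gain in percentage points (what "nearly 20%" most plausibly refers to).
abs_gain_points = (tuned_acc - base_acc) * 100

# Relative gain over the base model's accuracy.
rel_gain_percent = (tuned_acc - base_acc) / base_acc * 100

# If 65% is "over 90%" of GPT-4o's accuracy, GPT-4o's accuracy on the same
# benchmark would have to be below tuned_acc / 0.90 (an inferred upper bound).
gpt4o_upper_bound_percent = tuned_acc / 0.90 * 100

print(f"absolute gain: {abs_gain_points:.0f} points")
print(f"relative gain: {rel_gain_percent:.1f}%")
print(f"implied GPT-4o accuracy: below {gpt4o_upper_bound_percent:.1f}%")
```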
  • Dataformer

    2,586 followers

    We are working to integrate the best synthetic data generation recipes in Dataformer.

    Satpal Singh Rathore

    Building human friendly AI | Machine Learning, AI, Synthetic Data

    Synthetic data won the AI Mathematical Olympiad 2024 Progress Prize!
    -> Good data is all you need!
    -> Note: I'll be integrating the data creation pipeline in https://lnkd.in/gd2siUtP soon.

    Overall system used by NuminaMath to win:
    1. Used the deepseek_ai math base 7B model.
    2. Custom synthetic datasets:
       > Stage 1: 100k+ CoT dataset from math PDFs
       > Stage 2: 60k problems (Python REPL), GPT-4 generated w/ ToRA
    3. Two-stage fine-tuning (MuMath-Code paper), no LoRA/DoRA, packed to 2048:
       > Stage 1: SFT on CoT data
       > Stage 2: SFT on TIR w/ Python REPL
    4. Quantized the model to 8-bit for T4 GPUs.
    5. SC-TIR for inference:
       > Generate 48 candidates w/ Python REPL
       > Repeat gen/exec up to 3x for errors
       > Majority vote for final answer
    6. Multiple validation sets to avoid overfitting.

    Synthetic dataset generation:
    1. Collect a diverse dataset of natural language math problems and solutions, then template each solution with CoT for reasoning. The base model is fine-tuned on this.
    2. Then, a synthetic dataset with tool-integrated reasoning is created. Each math problem is decomposed into a sequence of rationales, Python programs, and their outputs. This is the same as in Microsoft's ToRA paper. GPT-4 was prompted to produce solutions in the ToRA format with code execution feedback. The stage 1 model was then fine-tuned on this data.

    #AI #Mathematics #DataScience #Innovation #DataCreation #ArtificialIntelligence #SyntheticData

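The SC-TIR inference step described in the post above (sample many candidates, re-run generation and execution when the code errors, then take a majority vote) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not NuminaMath's actual implementation: `sc_tir`, `run_python`, and `make_fake_model` are hypothetical names, the "model" is a scripted stand-in rather than an LLM, and the convention that generated programs store their result in a variable called `answer` is invented for the example.

```python
import itertools
from collections import Counter
from typing import Callable, Optional, Tuple


def run_python(code: str) -> Tuple[bool, str]:
    """Run a generated program; return (ok, answer or error text).

    Assumed convention: the program stores its result in `answer`.
    Note: exec() is not a sandbox; real model output needs isolation.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        return True, str(namespace["answer"])
    except Exception as exc:
        return False, repr(exc)


def sc_tir(problem: str,
           generate: Callable[[str], str],
           num_candidates: int = 48,
           max_rounds: int = 3) -> Optional[str]:
    """Self-consistency with tool use: sample, execute, retry, then vote."""
    answers = []
    for _ in range(num_candidates):
        context = problem
        for _ in range(max_rounds):
            ok, output = run_python(generate(context))
            if ok:
                answers.append(output)
                break
            # Feed the error back so the next attempt can correct it.
            context += "\n# previous attempt failed: " + output
    if not answers:
        return None
    # Majority vote over the candidates that produced an answer.
    return Counter(answers).most_common(1)[0][0]


def make_fake_model(programs):
    """Hypothetical stand-in for an LLM: cycles through scripted programs."""
    scripted = itertools.cycle(programs)
    return lambda context: next(scripted)


# Two correct programs, one that crashes, one that is wrong but runs.
model = make_fake_model([
    "answer = 6 * 7",
    "answer = 6 * 7",
    "answer = 1 / 0",
    "answer = 6 + 7",
])
result = sc_tir("What is 6 times 7?", model, num_candidates=8)
print(result)
```

The vote is what makes the approach tolerant of individual bad samples: a candidate that crashes gets retried, and a candidate that runs but is wrong is outvoted as long as most candidates agree on the correct answer.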
  • Dataformer

    2,586 followers

    At Dataformer, we are integrating techniques to leverage LLM intelligence for data creation.

    Satpal Singh Rathore

    Building human friendly AI | Machine Learning, AI, Synthetic Data

    The intelligence of LLMs will continue to improve. Human intelligence will not. OpenAI has trained a model, based on GPT-4, called CriticGPT to catch errors in ChatGPT's code output. It was observed that when people get help from CriticGPT to review ChatGPT code, they outperform those without help 60% of the time. OpenAI is working to integrate CriticGPT-like models into its RLHF labeling pipeline, providing its trainers with explicit AI assistance. This is a step towards being able to evaluate outputs from advanced AI systems that can be difficult for people to rate without better tools.

