June 2024 DVC Pulse!

Hi Friend!

Summer is upon us, and with it come lots of great happenings in the DVC Community. Let's jump in!

The Latest: DVCx is now DataChain

Over the last six months or so, we've been dripping out info on the tool we've been building specifically to deal with the challenges and scale of Generative AI data. We have been referring to it as DVCx but, true to our Iterative spirit, have landed on a new name that will make things clearer: DataChain.

While we have no new videos or blog posts this month, I can share that the team has been heads down working on content for you ahead of DataChain's release. That includes a new online course, to be released this summer, designed to get you up and running with the open-source version of DataChain for Computer Vision, LLM, and Multimodal projects!

We are currently looking for beta testers for the open-source version of DataChain, so if you are interested in trying it out and giving us feedback in the next couple of weeks, please reach out to me at [email protected] or here: Jenifer De Figueiredo.


What we're looking at

  • Claude 3.5 Sonnet - Anthropic recently released Claude 3.5 Sonnet, and benchmarks are higher across the board for the latest model while costs have decreased! Comments and reviews have been positive, especially about its improved natural language and humor and its improved ability to help with code. I have found the experience superior to ChatGPT 4o as well in my initial trials. There have been some complaints about it stalling, but gosh, aren't we getting a little entitled? Do you remember dial-up internet?! Cut some major innovators some slack!

  • Safe Superintelligence Inc. (SSI) - With a simple letter on their website, Ilya Sutskever, OpenAI's co-founder and Chief Scientist until his departure from the company in May, along with Daniel Gross and Daniel Levy, announced Safe Superintelligence Inc. (SSI). The company was created to build super-intelligent AI with a strong emphasis on safety. [Three cheers!] There are questions about how effective they will be in obtaining funding for a company that is committed to pure development without commercial motivation, despite the team's strong pedigree. This will be interesting and informative to watch.

  • Nebius and DVC Partnership! - Nebius, the cloud platform designed specifically to train AI models, has formed a technology partnership with DVC. Look out for new features in the future as we collaborate with Nebius to streamline version control of ML workloads! Find more info here.
  • Customer success: Exscientia - Congratulations to the team at Exscientia, who recently released their research (De novo antibody design with SE(3) Diffusion) creating IgDiff, an antibody variable domain diffusion model. They found that IgDiff produces highly designable antibodies that can have novel binding regions, and that the new model outperforms current state-of-the-art generative backbone diffusion models across a variety of tasks. To see how Exscientia integrates DVC with SLURM for high-performance computing in its work, see this blog post.


What's Coming!

As noted above, we are looking for beta testers of DataChain (formerly DVCx). If you are dealing with GenAI unstructured data and would like to test out the open-source version of DataChain, please reach out to me at [email protected] and we'll get you set up!


Community-Generated Content

Videos:

  • Project of the week: DIY Data Version Control (DVC): This video covers a "project of the week" from DataTalks.Club, presented by Antonis Stellas. This week's project is on DVC: participants learn more about the tool by following a plan and working on tasks throughout the week, and they can collaborate and ask questions on Slack. Antonis explains what DVC is and how it can be useful for data scientists and machine learning engineers. He also shows how DVC enables reproducibility for those times when it's difficult to remember exactly which data and model versions were used to generate particular results.
  • MLOps: From Jupyter to Production: In this Conf42 ML talk, Pablo Tomas Fernandez covers the tools and practices needed to take a machine learning model from development in Jupyter Notebooks to deployment in production. He argues that MLOps offers several benefits, including reduced work for data scientists, enabling them to focus on bigger tasks; the ability to scale workloads to handle larger datasets and more complex models; and improved consistency and traceability of models. The talk focuses on a use case where a ResNet convolutional neural network is used to classify cat and dog images. DVC is used to manage the large datasets involved in training the model: it lets you track the data used to train a model with your Git version control system while storing the data itself in your storage of choice.
  • Using DVC to version data, Demo lesson of the “MLOps” course: [Russian Language] In this webinar, Igor Stureiko explains how to configure DVC repositories within Git and store artifacts on S3 storage, how to switch between different versions of artifacts, how to upload them to external storage, and how to create reproducible model training pipelines using DVC pipelines.
  • Data versioning Explained: In this video, George Yates explains how data versioning with tools like DVC is useful in data-centric projects and supports collaboration within data teams, which is crucial for machine learning and data engineering workflows where data evolves.

Articles:

  • Finding the Right Embedding Model for Your RAG Application: In this article, Roman Purkhart discusses finding the right embedding model for a Retrieval Augmented Generation (RAG) system. The authors built a RAG chatbot for their product documentation and experimented with different embedding models to improve the retrieval component. DVC was useful for managing the data and processes for evaluating the embedding models. It allowed the authors to track changes, dependencies, and pipelines throughout experimentation, which made it easier to adjust their setup systematically and led to a comprehensive evaluation of different embedding vector strategies.
  • Marketing Measurement series: Marketing Mix Modeling at Qonto | Part VI: In this article, Louis Magowan discusses MLOps and its usefulness in Marketing Mix Modeling (MMM). The article focuses in particular on the benefits of using DVC and MLflow for MMM projects. It highlights DVC as a useful tool for MMM because it helps with data version control. MMMs often require a broad range of inputs, many of which may not be readily available online. DVC allows you to track changes to this data and store different versions of it, which ensures reproducibility and persistence of the data used in your MMM. For instance, if you are asked to remodel an MMM you previously delivered, you can simply go back in time in Git and use DVC to download the version of the input data from that time.
  • Essential Deep Learning Checklist: Best Practices Unveiled: In this article, Tarana Murtuzova shares a checklist for deep learning projects, covering various aspects from code organization to infrastructure. It emphasizes the importance of well-documented, efficient, and reproducible deep learning projects. It mentions DVC, alongside other tools, as useful for managing the different versions of your data throughout a deep learning project. This ensures that you can reproduce your experiments and results by using the same version of the data that was used to train the model.
  • Top 10 Coding Mistakes by Data Scientists: In this article, Parvez Shah Shaik discusses coding errors that data scientists commonly make and the solutions to these problems, including proper documentation and using tools like DVC for data versioning.
  • Ten Open source tools for building MLOps pipeline: In this article, Jesse Williams discusses MLOps tools and how they can help streamline the building of ML projects. It contrasts traditional software development with ML project development, highlighting the iterative nature of ML projects and the numerous steps involved. The article emphasizes the importance of MLOps in managing the complexities of ML projects and introduces MLOps tools like DVC, CML, Hydra, etc. It concludes by offering advice on choosing the right MLOps tools based on factors like team expertise, budget, scope, documentation, and support.
  • Understanding Data Version Control (DVC). Why is it essential in MLOps?: In this article, Ashwin S explains that DVC is an open-source version control system designed specifically for managing and versioning machine learning models, datasets, and other large files. It tracks changes in the datasets and models by creating .dvc files that are stored in Git. These .dvc files contain metadata such as the hash value, data path, etc. DVC supports integration with various storage options (local, remote, cloud-based) for data versioning.
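
For readers curious about what the "small metadata file in Git, big data elsewhere" idea from the articles above looks like in practice, here is a minimal Python sketch. It is an illustration only, not DVC's implementation: the function names `make_pointer` and `write_pointer` and the `.pointer.json` suffix are hypothetical, and real .dvc files are YAML rather than JSON. The point is simply that a content hash changes whenever the data changes, so committing the tiny pointer file to Git is enough to pin an exact data version.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def make_pointer(data_path: Path) -> dict:
    """Build a small metadata record for a data file: content hash, size, path."""
    content = data_path.read_bytes()
    return {
        "md5": hashlib.md5(content).hexdigest(),  # fingerprint of the file content
        "size": len(content),                     # size in bytes
        "path": data_path.name,                   # file name relative to the pointer
    }


def write_pointer(data_path: Path) -> Path:
    """Write the record next to the data file (hypothetical '.pointer.json' suffix)."""
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(make_pointer(data_path), indent=2))
    return pointer_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "train.csv"

        # Version 1 of the dataset.
        data.write_text("id,label\n1,cat\n")
        before = make_pointer(data)

        # Version 2: one more row, so the content hash must differ.
        data.write_text("id,label\n1,cat\n2,dog\n")
        after = make_pointer(data)

        print("hash changed:", before["md5"] != after["md5"])
        print("pointer file:", write_pointer(data).name)
```

In a real project you would let DVC create and manage these files for you; the sketch only shows why a few lines of metadata are enough for Git to keep track of a large dataset.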

Thanks for the read! Many thanks to Gift Ojeabulu for his help finding and writing about the Community content! We'll see you next month!

To your continued success,

Jenifer De Figueiredo

Community Manager at DVC

