iterative.ai's post

7,623 followers

2 months ago (edited)

Under the hood, DataChain combines the power of data warehouses with distributed clusters and proper data access patterns to process millions of video, image, and audio files:
- Never copy data. Store references to files instead (while still preserving versioning, data loading, and efficient processing).
- Use warehouses under the hood (e.g. ClickHouse) to store metadata and perform as many operations as possible inside them (e.g. filters).
- Distributed compute that runs close to the data to execute Python-based UDFs.
- Data access: pre-fetching, batching, caching, streaming. Different workloads require different ways of using data.
#unstructured #datachain #dvc #machinelearning #opensource
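The pattern in this post can be sketched in plain Python. This is a minimal illustration under stated assumptions, not DataChain's API: an in-memory SQLite table stands in for the warehouse, the `s3://bucket/...` URIs are made up, and `name_length` is a hypothetical stand-in for a real UDF such as an embedding model. It shows references being stored instead of file bytes, the filter running inside the database, and the UDF receiving only the matching references in batches.

```python
import sqlite3
from typing import Callable, Iterator

# Hypothetical object references: a URI plus cheap metadata. No file bytes are copied.
REFS = [
    ("s3://bucket/images/cat.jpg", "image", 120_000),
    ("s3://bucket/images/dog.jpg", "image", 95_000),
    ("s3://bucket/audio/talk.wav", "audio", 4_000_000),
    ("s3://bucket/video/demo.mp4", "video", 50_000_000),
]


def build_index(refs: list[tuple[str, str, int]]) -> sqlite3.Connection:
    """Store file references and metadata in a table (SQLite stands in for a warehouse)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (uri TEXT PRIMARY KEY, kind TEXT, size INTEGER)")
    conn.executemany("INSERT INTO files VALUES (?, ?, ?)", refs)
    return conn


def batched_uris(
    conn: sqlite3.Connection, kind: str, max_size: int, batch_size: int
) -> Iterator[list[str]]:
    """Filter inside the database and yield matching URIs in fixed-size batches."""
    cur = conn.execute(
        "SELECT uri FROM files WHERE kind = ? AND size <= ? ORDER BY uri",
        (kind, max_size),
    )
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        yield [uri for (uri,) in rows]


def run_udf(
    conn: sqlite3.Connection,
    udf: Callable[[str], int],
    kind: str,
    max_size: int,
    batch_size: int = 2,
) -> dict[str, int]:
    """Apply a Python UDF only to the references that survived the filter."""
    results: dict[str, int] = {}
    for batch in batched_uris(conn, kind, max_size, batch_size):
        for uri in batch:
            results[uri] = udf(uri)
    return results


def name_length(uri: str) -> int:
    """Stand-in UDF: a real one would stream the file and compute, e.g., an embedding."""
    return len(uri.rsplit("/", 1)[-1])


conn = build_index(REFS)
out = run_udf(conn, name_length, kind="image", max_size=1_000_000)
print(out)
```

Because the filter runs in SQL, the audio and video references never reach Python. In a distributed setting, each batch could be handed to a worker located near the storage, which is the part this single-process sketch leaves out.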

3 comments

Transcript

[unintelligible] work with millions of files in your storage. This is where DataChain can help you handle large-scale datasets. So let's take a look at this dataset. It's a dataset of images, and it contains 100 million images. DataChain can help you efficiently work with and process this amount of data and extract metadata, including, for example, embeddings, as well as other types of metadata, at massive scale.

iterative.ai

2 months ago

Try it here: https://datachain.ai/ or try the less scalable open-source version here: https://github.com/iterative/datachain

Cameron Price

2 months ago

It's impressive how DataChain optimizes unstructured data processing. Could you share more about how the distributed compute works next to the data?


Most relevant posts

ActionIQ

11,788 followers
10 months ago (edited)
From our engineering team's keyboard: our latest Tech Blog is out now, featuring the brilliant minds of Patrick Conway and Philip Catterall. They dive into a new view caching strategy that revolutionized our ELT pipelines at ActionIQ, reducing compute usage by 50%. Check out the full blog here: https://hubs.li/Q02tY90Q0. #Tech #ELTpipelines #TeamWork

Data Pipeline Optimizations: Implementing a View Caching Layer in an ELT Platform

medium.com
Mitch Becker

Sr. Storage Specialist, Amazon Web Services (AWS) HPC-Containers-AI/ML | NetApp Alum
5 months ago
If you need to efficiently process huge datasets for HPC, machine learning or other applications, watch this video to see how FSx for Lustre (FSxL) can save you time and money by seamlessly integrating S3 object storage with a high performance filesystem. #AWS #machinelearning #bigdata #HPC https://buff.ly/4d9XhN7 Marcos Perez Seoane Laura Shepard Brendan Bouffler

Linking Lustre and object storage - Data repository associations in Amazon FSx

https://www.youtube.com/
Rafi Kurlansik

Data Science - ML - Developer Experience
10 months ago
Great new article by Li (Luke) Yu on how to use #ray to perform feature extraction for #llms on #databricks: https://lnkd.in/d5qv3Xmk Key insights include:
- How to set up and configure a Ray cluster on Databricks with #GPUs
- How to overcome tight context windows with various summarization approaches

Feature Extraction Made Easy with LLMs and Ray on Databricks

community.databricks.com

4 comments
Youssef Mrini

Data & AI Architect | NextGenLakehouse | Opinions are my own.
2 months ago (edited)
The new Dedicated access mode (previously Single user) allows you to assign dedicated all-purpose compute to a group or a single user. It's very interesting for people using the ML runtime, since that runtime is not yet supported on Shared Clusters. Martin Grund Stefania Leone #Databricks
12 comments
Juicedata

301 followers
7 months ago
As a multinational tech company, vivo's #AITraining platform faced #StorageChallenges with #GlusterFS. By switching to a distributed #FileStorageSystem, integrated with #JuiceFS, vivo boosted its training performance and efficiency. Key enhancements:
- High-performance distributed #MetadataServices
- Flexible #caching strategies
- Efficient capacity #LoadBalancing
Learn more: https://lnkd.in/gKrr3zqJ #ArtificialIntelligence #AISolutions #DistributedFileSystem #DistributedStorage #DataStorage #CloudFileSystem #ArtificialIntelligenceStorage

vivo Migrated from GlusterFS to a Distributed File System for AI Training

juicefs.com
Ohad Levi

Co-founder & CEO at Hyperspace | High-Performance Search | Domain-Specific-Computing
8 months ago
You often hear me speak about domain-specific computing and why it is so critical in reconstructing search to support modern real-time data retrieval. I’ll start by emphasizing again that legacy software-based search solutions have long reached their glass ceiling and are not able to support real time search at a billion-scale without compromise of price, speed, or relevancy. The reason domain-specific computing is so powerful is that it skips the standard software semantics, cache hierarchy, and other CPU abstractions. It then implements the core parts as a custom datapath processor. Together with a new software stack, this runs search and information retrieval workloads hundreds of times faster. Furthermore, Hyperspace Cloud processing unit includes tens of dedicated search cores running proprietary instruction sets to filter, rank, and aggregate search results in a super-efficient way. These custom search instructions, along with advanced data prefetching, enable speeds that can’t be matched by general-purpose CPUs. Interested in learning more about how you can shatter the limits of your search? https://lnkd.in/dGVtdfC2 #elasticsearch #dataretrieval #vectorsearch #genai #llm #keywordsearch #lexicalsearch #database.
NeuReality

5,403 followers
8 months ago (edited)
Compare LLMs and SLMs for AI data center cost and energy efficiency. LLMs need more resources for complex tasks, while SLMs are lighter for simpler tasks. Consider the difference when choosing your AI inference servers. Invest in energy-efficient AI data centers for a greener future! #AI #DataCenter #EnergyEfficiency #Inferencing #LLM #SLM #SustainableTech

LLM vs SLM: OPTIMIZING YOUR INFERENCING SOLUTION FOR GENERATIVE
Naveh Grofi

I am a Results-driven HW/SW Integration Engineer with extensive experience in technical project management and projects leading
8 months ago
Exciting insights on AI data center efficiency! LLMs vs. SLMs: weighing resource needs for complex tasks against lighter options for simpler ones. Make informed choices for greener, sustainable AI infrastructure. #AI #DataCenter #EnergyEfficiency #Inferencing #SustainableTech

Anjli K.

Immediate Joiner | Data Engineer | Python | SQL | Spark | AWS CCP |
3 months ago
#ApacheSpark is a lightning-fast, distributed computing system for processing large-scale data. It's known for its versatility, supporting batch, streaming, and machine learning workloads.

Each layer of Spark's architecture handles a particular part of the process. The first layer is the interpreter, which is a tweaked version of Scala's interpreter. When you type code in Spark, it builds an operator graph. Once you run an action, like collect, this graph is sent to the DAG Scheduler.

The DAG Scheduler splits the operator graph into stages of map and reduce tasks. Each stage is made up of tasks based on your data's partitions. To make things faster, the DAG Scheduler pipelines operators together, so several map operations can run in a single stage. The end result is a set of stages that gets passed to the Task Scheduler.

The Task Scheduler's job is to launch these tasks through a cluster manager, like Spark Standalone, YARN, or Mesos. It handles tasks independently and doesn't need to know how stages are connected.

Understanding this flow is the foundation of mastering Spark. What's your favorite Spark feature? Let's discuss in the comments! #Sparkchallenge #DataEngineering #ApacheSpark
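The stage-splitting step described in this post can be illustrated with a toy model in plain Python. This is a sketch under simplifying assumptions, not Spark's scheduler code: the `Op` class, `split_into_stages`, and `tasks_for` are hypothetical names, the operator graph is a straight chain, and every stage is assumed to have the same number of partitions. Narrow operators (such as map and filter) are pipelined into the current stage, and each operator that needs a shuffle (such as reduceByKey) starts a new one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One operator in a linear operator graph (toy model)."""
    name: str
    wide: bool  # True if the operator needs a shuffle, e.g. reduceByKey


def split_into_stages(ops: list[Op]) -> list[list[str]]:
    """Pipeline consecutive narrow operators; begin a new stage at each wide one."""
    stages: list[list[str]] = [[]]
    for op in ops:
        if op.wide and stages[-1]:
            stages.append([])  # shuffle boundary: close the current stage
        stages[-1].append(op.name)
    return [stage for stage in stages if stage]


def tasks_for(stages: list[list[str]], num_partitions: int) -> list[tuple[int, int]]:
    """One task per (stage index, partition index) pair."""
    return [(s, p) for s in range(len(stages)) for p in range(num_partitions)]


graph = [
    Op("map", wide=False),
    Op("filter", wide=False),
    Op("reduceByKey", wide=True),
    Op("map", wide=False),
]
stages = split_into_stages(graph)
tasks = tasks_for(stages, num_partitions=3)
print(stages)
print(len(tasks))
```

Here the first map and filter share one stage, the shuffle introduces a second stage, and with three partitions the task list has six entries. A task scheduler could launch each entry independently, without knowing how the stages relate to one another.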
GHA Technologies, Inc

62,393 followers
10 months ago
Increasing volumes of data and larger workloads mean you need more processing power and storage - especially if you plan to run AI models. You need Apache Spark to accelerate your workloads. Download this NVIDIA eBook to learn how Spark accelerates AI and data processing.

Accelerating Apache Spark 3

ghamarketing.lll-ll.com