HPC job scripts are simple and accessible to almost anyone. Oddly, they're pretty hard to replicate on the cloud, so we replicated the API with Coiled. Learn more about the new API in our latest newsletter, plus performance improvements in Dask-backed Xarray workflows and how to receive up to $200 in Amazon gift cards in our end-of-year referral program.
About us
Python, but big. Churn through a ton of data, no cloud expertise needed.
- Website
- https://coiled.io
- Industry
- Software Development
- Company size
- 11-50 employees
- Headquarters
- New York, New York
- Type
- Privately Held
- Founded
- 2020
Locations
-
Primary
New York, New York 10018, US
Updates
-
Coiled reposted this
We've started working more on Dask's array integration, especially for #Geospatial workloads. So far this has yielded promising results that make Dask faster and, above all, more scalable. I wrote a short blog post explaining how an internal change in Dask that improves data selection operations has a widespread impact on many of Xarray's methods: https://lnkd.in/dXXsn3HN We are still interested in feedback about what isn't working well for you with Xarray and Dask. Please reach out if anything bugs you!
Improving GroupBy.map with Dask and Xarray
xarray.dev
-
Coiled reposted this
New Post: SLURM-Style Job Arrays on the Cloud
HPC job scripts were the first form of parallelism I ever used as a graduate student. They're dead simple and accessible to almost anyone. Oddly, they're pretty hard to replicate on the cloud (AWS Batch/GCP Cloud Run/Azure Batch try, but aren't easy to use). We replicated the API with Coiled. It feels pretty slick to me.
SLURM-Style Job Arrays on the Cloud with Coiled
docs.coiled.io
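For readers who have not used job arrays: the scheduler launches the same script many times, and each copy reads its task index from the environment to decide which slice of the work to take. The sketch below is a minimal illustration of that pattern using SLURM's `SLURM_ARRAY_TASK_ID` and `SLURM_ARRAY_TASK_COUNT` variables. It assumes a 0-based array (for example `--array=0-9`); the input file names and the `select_shard` helper are hypothetical, and this is not Coiled's API (see the linked docs for that).

```python
import os


def select_shard(items, task_id, task_count):
    """Return every `task_count`-th item, starting at `task_id` (0-based)."""
    return items[task_id::task_count]


def main():
    # SLURM sets these for each task in a job array; the defaults let the
    # script also run as a single local task outside the scheduler.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    task_count = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", "1"))

    # Hypothetical list of inputs; a real script would list a directory
    # or a cloud bucket instead.
    files = [f"input_{i:03d}.csv" for i in range(10)]

    for path in select_shard(files, task_id, task_count):
        print(f"task {task_id} would process {path}")


if __name__ == "__main__":
    main()
```

Because each task derives its own shard from the index alone, the tasks need no coordination with one another, which is what makes the pattern so approachable.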
-
Siemens Case Study: Data Processing with Airflow + Dask
The data engineering + analytics team at Siemens often relies on SQL for manipulating large datasets, but recently tackled a project that stretched beyond SQL: identifying trends in employee training records using a fuzzy algorithm. With Dask, the team reduced ETL runtime by 80%, cutting execution from over an hour to just 10 minutes.
Traditional SQL works well for many tasks, but more complex use cases, like fuzzy matching and advanced aggregations, require the flexibility of Python. Scaling these Python workloads on large datasets, however, can be challenging. The team chose to use Coiled + Dask for a few reasons:
- Minimal code rewrites: Dask DataFrame's similarity to pandas made it easy to parallelize existing code.
- No need to manage cloud infrastructure: Coiled's managed Dask clusters made it easy to deploy on the cloud.
- Integration with their current stack: Coiled + Dask integrated with their existing Airflow workflows, making it quick to get up and running.
Learn more in the case study from Stephen Schneider and Franco Bosetti: https://lnkd.in/gksgayaE
Airflow, Dask, & Coiled: Adding Big Data Processing to Your Cloud Toolkit
docs.coiled.io
-
Lots of great talks at the Cloud-Native Geospatial Forum (CNG) virtual conference today! There's still time to register and learn more about open source geospatial tools like GeoParquet, Dask, and Icechunk.
I'm excited to be speaking at #CloudNativeGeo2024 later today. Join me for "Building Large Scale Geospatial Benchmarks" at 1:25 PM CT. Looking forward to seeing folks there. Register here: https://lnkd.in/gkWUkvcc
Virtual Conference 2024
cloudnativegeo.org
-
Coiled reposted this
Heading to my first PyData NYC tomorrow! Join us and hear more about how you can run pandas on hundreds of GBs of data (or just gripe to Patrick Höfler about why your latest pandas PR hasn't been merged yet). We'll be at the Coiled booth all day Thursday and Friday, on the 5th floor right by the registration area and Quansight.
-
We're often asked how Dask + Coiled fit into existing machine learning pipelines. Recently, we worked with Hugging Face to put together an example processing the FineWeb dataset using the HF FineWeb-Edu classifier. Scaling to the full dataset (>200 million rows) was possible with Dask deployed on a multi-GPU Coiled cluster. Learn more about this workflow in our latest newsletter, plus other updates on geospatial benchmarks and upcoming events (looking forward to PyData NYC next week!).
October Updates
Coiled, published on LinkedIn
-
Coiled reposted this
As we started building out the LLM capabilities for finding activating antibodies for GPCRs at Abalone Bio, we needed to go from local notebooks to code running on GPU clusters quickly, cost-efficiently, and ideally without locking ourselves into a cloud vendor. After evaluating vendors across the spectrum, we chose Coiled because it seemed to check all the boxes. I am glad we did: they have proven to be amazing not just in the technical product they deliver but also in the high-touch, human support they provide. If you want to spend time shipping code and models and not thinking about k8s (eek!), all while working with great people, I can easily recommend Coiled!
We have successfully integrated AI into our FAST platform, using powerful protein large language models to discover and design activating #antibodies for #GPCR targets. In our journey, Coiled has served as an indispensable partner enabling us to seamlessly scale from ideation to production. Check out this post by the Coiled team on how they have supported us: https://lnkd.in/gKxd7KNj
Abalone Bio: Accelerating Antibody Discovery
coiled.io
-
Coiled reposted this
Using Hugging Face models and datasets is powerful for machine learning, but scaling tasks like model inference on large datasets can be challenging. Dask handles out-of-core computing, breaking up datasets into manageable chunks so that even large-scale tasks can run smoothly.
In this example, we processed the FineWeb dataset (~715 GB in memory) using the HF FineWeb-Edu classifier. Locally, processing 100 rows with pandas took ~10 seconds, but scaling up to 211M rows was possible with Dask on multi-GPU clusters deployed with Coiled.
Results:
- Handled large-scale text classification, filtering, and saved results to Hugging Face storage
- Optimized GPU utilization to efficiently use expensive hardware
This example could be adapted for other workflows like:
- Genomic data filtering
- Large-scale content extraction
- Multimodal AI (audio, image, text)
Had a lot of fun learning about Hugging Face and putting together this example with Quentin Lhoest, James Bourbeau, and Daniel van Strien.
Blog post: https://lnkd.in/gcV348fA
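To make the "manageable chunks" idea concrete, here is a minimal plain-Python sketch of chunk-at-a-time processing: only one chunk is held in memory, scored, and filtered before the next is read. It is not Dask's API or the code from the blog post. Dask applies the same idea per partition and runs partitions in parallel across workers. The `score_chunk` function is a hypothetical stand-in for the classifier (it just counts words), and all names are illustrative assumptions.

```python
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most `chunk_size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def score_chunk(texts):
    """Hypothetical stand-in for a model: score each text by its word count."""
    return [len(text.split()) for text in texts]


def filter_stream(rows, chunk_size, min_score):
    """Score rows one chunk at a time and yield those meeting `min_score`."""
    for chunk in iter_chunks(rows, chunk_size):
        # Scoring a whole chunk at once mirrors batched model inference.
        scores = score_chunk(chunk)
        for text, score in zip(chunk, scores):
            if score >= min_score:
                yield text
```

Because `filter_stream` is a generator, memory use is bounded by the chunk size rather than by the size of the dataset.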
-
Coiled reposted this
New video release: Pandas + Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars
Watch how #Dask DataFrame 2.0's improved performance and new features compare to #Spark, #DuckDB, and #Polars, offering a faster and more robust system for big data processing. Watch the video on YouTube: https://lnkd.in/em9c2Qba
Florian Jetter and Patrick Höfler discussed the significant enhancements to Dask, a Python library for distributed computing that integrates well with pandas. Historically, Dask was user-friendly but lacked robust performance. The re-implementation of the DataFrame API has addressed these concerns, making Dask faster and more efficient.
Patrick Höfler, a pandas core team member and Dask maintainer at Coiled, highlighted the improvements in Dask, including a new shuffle algorithm, a logical query planning layer, and a reduced memory footprint. These changes have led to a better user experience and a more robust system overall, especially when compared to tools like Spark, DuckDB, and Polars.
The speakers emphasized the seamless integration of Dask with pandas and other PyData stack libraries, making it a compelling option for big data applications. They compared Dask's performance against other tools using TPC-H benchmarks. They also discussed future developments, including extending the logical query planning layer to frameworks like Dask Array and Xarray.
Pandas + Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars [PyCon DE & PyData Berlin 2024]
https://www.youtube.com/