Data Platform News (January 2025)

Welcome to the first edition of my newsletter in 2025. January was an interesting month in terms of what was happening in the tech world. You know, DeepSeek and stuff like that. However, in this newsletter I try to stick to my three favorite platforms: Snowflake, Fabric, and Databricks. And there was a lot of interesting stuff going on in those platforms last month.

Snowflake

  • Anthropic's Claude 3.5 Sonnet model is now available in Snowflake Cortex AI for use in SQL and Python (the COMPLETE function) as well as in Cortex Playground (where you can compare it with other available models and apply Cortex Guard rules to it). For now, the model is available only for Snowflake accounts running in the AWS US West 2 region. Read the announcement: Anthropic’s Claude 3.5 Sonnet now available in Snowflake Cortex AI.
  • After all the rumors around the DeepSeek model last week, I was wondering if and when we would see this model natively in Snowflake and other platforms. First, I read the article Running DeepSeek-R1 in Snowpark Container Services: A Complete Guide by James Cha-Earley, in which James provides a ready-to-use recipe for a custom DeepSeek environment built on top of Snowpark Container Services. One day later, Snowflake officially announced DeepSeek-R1 in private preview on Snowflake Cortex AI. Wow, that was fast (at the same time, Microsoft announced DeepSeek availability in Azure AI Foundry).
  • Snowflake officially announced switching from a shared responsibility model to a shared destiny model for security. In the new model, Snowflake takes a more proactive role in helping customers secure their accounts. Read the whole announcement with detailed plans for 2025: Shared Destiny with Snowflake Horizon catalog Built-in Security.
  • The SnowConvert tool, which supports migrations from Oracle, Teradata, SQL Server, and Redshift to Snowflake by performing code conversions, is now publicly available for download after completing the required free training. For more details, visit this site: SnowConvert | Snowflake.
  • You can now define join policies in Snowflake (in addition to aggregation and projection policies). A join policy applied to a table specifies whether queries against it must include a join and, when joins are required, can restrict them to specific joining columns. For more information, read the release notes.
  • Outbound private connectivity announced in January lets you create private endpoints in Snowflake to access a cloud platform using the platform’s private connectivity solution rather than the Internet. It's available for the following Snowflake features: external network locations, external functions, external stages, external tables, external volumes for Iceberg tables, Snowpipe automation. For more details check the documentation: Private connectivity for outbound network traffic.
  • Data metric functions in Snowflake can now accept multiple tables as arguments (great for referential integrity, matching and comparison, or conditional checking across different datasets). See release notes.
  • In January Snowflake announced general availability of support for automated refreshes of Apache Iceberg™ tables that use an external catalog. See release notes.
  • Two new switches were added to the CREATE and ALTER DYNAMIC TABLE commands: 1) for the CREATE command it's REQUIRE USER, which ensures that a dynamic table cannot refresh unless a user is specified via COPY SESSION; 2) for the ALTER command it's COPY SESSION, which lets you run a refresh operation in a copy of the current session, using the same user and warehouse. For more details, read the release notes.
  • Snowflake Demos (snowflake.demos) is a new library (plus API) that lets you spin up an entire Notebooks environment for a Snowflake quickstart, including all resources (e.g. roles, permissions, Snowflake objects), in a single command. More in this article: Launch Snowflake Notebooks with One Line of Code.
  • Last month Snowflake shared an interesting blog post showing performance improvements for ingestion processes over the last few years (even 80% time/cost reduction thanks to the USE_VECTORIZED_SCANNER option). Read the full blog post for details: Loading Terabytes Into Snowflake: Speeds, Feeds and Techniques.
  • As a big fan of Snowflake Horizon Catalog I was super happy to find a whole playlist on YouTube presenting features for governance, security and compliance in Snowflake: Snowflake Horizon Catalog - YouTube.
  • I found an interesting case study of migrating real-time click-to-purchase recommendation processing from Kafka + Flink + ScyllaDB on AWS to Snowflake + Airflow: How we saved thousands with Snowflake and Airflow.
  • Cesar Segura Martín wrote a very useful blog post on Extending a full QUERY_HISTORY. This may be a great addition to the Snowflake observability toolset of every platform administrator!
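The multi-table data metric functions mentioned above make checks like referential integrity expressible across datasets. As a rough illustration, the core of such a check is just counting rows whose key has no match in a reference table. This is plain Python with made-up table and column names, not Snowflake's DMF syntax:

```python
# Hypothetical illustration of the kind of check a multi-table data metric
# function enables: counting "orphan" rows whose foreign key has no match
# in a reference table. All names and data here are invented for the example.

def count_orphans(child_rows, child_key, parent_rows, parent_key):
    """Return how many child rows reference a key absent from the parent."""
    parent_keys = {row[parent_key] for row in parent_rows}
    return sum(1 for row in child_rows if row[child_key] not in parent_keys)

orders = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": 11},
    {"order_id": 3, "customer_id": 99},  # no matching customer
]
customers = [{"customer_id": 10}, {"customer_id": 11}]

print(count_orphans(orders, "customer_id", customers, "customer_id"))  # 1
```

In Snowflake the same idea would run as a scheduled data metric function over the two tables, with results landing in the monitoring event table.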

Microsoft Fabric

  • A massive monthly update landed in Fabric last month. Some of my favorite features in this update: semantic model version history, the TMDL scripting experience and Direct Lake model live editing in Power BI Desktop, folder support in Git, tenant-level Private Link, open in SSMS and VS Code for warehouse, JSON aggregates + spatial analytics functions + SET SHOWPLAN_XML + join and query hints for warehouse (a really big update for warehouse this month). For more features, have a look at the blog post: Microsoft Fabric January 2025 update.
  • But that was not all! Separately from the monthly update, we got plenty of important announcements and features! One of my favorites was Surge Protection (Preview). Surge Protection helps protect capacities from excess usage by background workloads. It acts as a resource governor, rejecting background operations when the capacity reaches a limit set by the capacity admin. I'd say it's a nice first step toward addressing the "noisy neighbor" problem well known from Power BI and affecting many Fabric workspaces today.
  • Another important feature announced in January was the introduction of ownership takeover for Fabric items. Unfortunately, the feature works only in the UI at the moment. Support for taking over ownership via APIs is on the product team's backlog.
  • An interesting piece of news last month was the introduction of Fabric Copilot capacity. Now you can use Copilot in workspaces backed by capacities smaller than F64. Does it mean you no longer need a P1/F64 or higher capacity to use Copilot in Fabric? Unfortunately, no. You still need to assign a P1/F64 or higher capacity as a Fabric Copilot capacity for users to ensure all their Copilot usage is charged to that capacity.
  • A nice update for OneLake shortcuts - the ability to define security settings on sub-folders within a shortcut. Read more in this blog post: Define security on folders within a shortcut using OneLake data access roles.
  • Two more important updates for Fabric warehouse: 1) service principal support for Fabric Data Warehouse and 2) granular permissions for COPY INTO command in Fabric Data Warehouse.
  • Romain Casteres published a comprehensive article FinOps for Microsoft Fabric discussing how to address different FinOps principles in Fabric. Great read!
  • Nick Salch shared a useful tool - Fabric Workspace Monitoring Dashboards. You can leverage these dashboards on top of data collected by the Workspace Monitoring feature. Visit this GitHub repository to get the tool: fabric-toolbox/monitoring.
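To build intuition for the Surge Protection idea described above, here is a toy resource-governor sketch. It is purely illustrative: the class name, the unit model, and the threshold logic are my assumptions, not Fabric's actual implementation. The one behavior it mirrors from the announcement is that only background operations get rejected once usage crosses an admin-configured limit:

```python
# Toy sketch of a surge-protection-style resource governor (NOT Fabric's
# implementation). Background operations are rejected once background usage
# would exceed an admin-set percentage of total capacity.

class SurgeProtector:
    def __init__(self, capacity_units: float, background_limit_pct: float):
        self.capacity_units = capacity_units          # total capacity size
        self.background_limit_pct = background_limit_pct  # admin-set threshold
        self.background_usage = 0.0                   # units consumed so far

    def try_start_background(self, cost_units: float) -> bool:
        """Admit a background operation only if it stays under the limit."""
        limit = self.capacity_units * self.background_limit_pct / 100
        if self.background_usage + cost_units > limit:
            return False  # rejected: over the admin-configured threshold
        self.background_usage += cost_units
        return True

governor = SurgeProtector(capacity_units=64, background_limit_pct=50)
print(governor.try_start_background(20))  # True  (20 of 32 allowed units)
print(governor.try_start_background(20))  # False (would reach 40 of 32)
```

The real feature works on smoothed capacity usage and per-workload rules, but the admit-or-reject decision at a configured threshold is the essence of it.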

Databricks

  • Clean Rooms in Databricks became generally available in January. The feature uses Delta Sharing and serverless compute to provide a secure, privacy-protecting environment where multiple parties can work together on sensitive enterprise data without direct access to each other’s data. Read more details in the release notes. Also, watch this video to see the feature in action: Databricks Clean Rooms Product Demo.
  • Delta Live Tables now support publishing to tables in multiple schemas and catalogs. Read the release notes.
  • Code comments in notebooks now support email notifications and @ mentions. This is a great improvement to collaboration. Read more in release notes.
  • You can now easily upload files into Databricks workspaces using drag & drop in the UI. Read more in release notes.
  • AI agent tools in Databricks can now use APIs to connect to external applications, like Slack or Google Calendar. See here for more details: Connect AI agent tools to external services | Databricks on AWS.
  • A lot of updates to AI/BI (Genie and Dashboards) showed up in January. The ones that caught my attention: calculated measures for dashboards (yes, you guessed it right - measures represented by simple formulas), more and more cross-highlighting options in dashboards (this time support for point maps), downloading dashboards to PDF, and improved query descriptions for Genie. For more, check out the release notes.
  • Teradata joined the list of data sources supported by Lakehouse Federation in Databricks. Several pushdowns are supported: filters, projections, limit, aggregates, cast, contains, startswith, endswith, like. Read more in release notes.
  • New SQL functions appeared in Databricks SQL: DAYNAME (three-letter day-of-week name), UNIFORM (random number from a specified range), RANDSTR (random string of a given length).
  • Egress control for Databricks serverless and Mosaic AI Model Serving workloads is available in Public Preview. Read the official announcement: Announcing egress control for serverless and model serving workloads.
  • Collations are now available in Public Preview with Databricks Runtime 16.1 (coming soon to Databricks SQL Preview Channel with version 2024.50 and Databricks Delta Live Tables). Collations streamline data processing by defining rules for sorting and comparing text in ways that respect language and case sensitivity. Read the official announcement: Introducing Collations to Databricks.
  • Oh, did I mention that you can easily run DeepSeek-R1 in Databricks? ;-) Simply follow the instructions in this blog post: DeepSeek R1 on Databricks. Wow, that was fast as well!
  • My discovery of the month in the Databricks world was DQX - Data Quality Framework by Databricks Labs. DQX is a data quality framework for Apache Spark that enables defining, monitoring, and reacting to data quality issues in data pipelines. DQX GitHub repository: databrickslabs/dqx.
  • Dustin Vannoy shared another useful video on Databricks developer best practices for version control, automated tests and CI/CD. Watch the video here: Developer Best Practices on Databricks: Git, Tests, and Automated Deployment.
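For intuition, the three new Databricks SQL functions mentioned above can be approximated in plain Python. These are rough equivalents only; the SQL versions accept an optional seed/generator argument and their exact signatures and semantics differ:

```python
# Rough Python approximations of the new Databricks SQL functions
# DAYNAME, UNIFORM, and RANDSTR - for intuition only, not exact semantics.
import datetime
import random
import string

def dayname(d: datetime.date) -> str:
    """Three-letter day-of-week name, like SQL DAYNAME."""
    return ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")[d.weekday()]

def uniform(lo: int, hi: int, rng: random.Random) -> int:
    """Random integer in [lo, hi], like SQL UNIFORM."""
    return rng.randint(lo, hi)

def randstr(length: int, rng: random.Random) -> str:
    """Random alphanumeric string of the given length, like SQL RANDSTR."""
    alphabet = string.ascii_letters + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))

rng = random.Random(42)
print(dayname(datetime.date(2025, 1, 31)))  # Fri
print(uniform(1, 10, rng))
print(randstr(8, rng))
```

UNIFORM and RANDSTR are particularly handy for generating reproducible test data directly in SQL when you pass a fixed seed.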


Sources of news and updates

If you are looking for resources useful for staying up to date with Snowflake, Fabric and Databricks, see a list I shared in the August 2024 edition of this newsletter.


My quick summary

It's time to wrap things up for January:

  1. Snowflake. Key areas of investments in January: 1) AI model integration - the introduction of Anthropic's Claude 3.5 Sonnet model and DeepSeek-R1 in Snowflake Cortex AI, 2) security - a shift from shared responsibility to shared destiny model, emphasizing proactive security measures for customers, investments in outbound private connectivity, join policies, 3) support for migrations - the SnowConvert tool made publicly available.
  2. Fabric. Key areas of improvements last month: 1) Power BI development - TMDL view and Direct Lake live editing in Power BI Desktop, folder support in Git, 2) Fabric warehouse - tons of improvements including spatial analytics, support for service principals and better integration with external tools, 3) administration, security and governance - Surge Protection, tenant-level Private Link and Fabric item ownership takeover. Plus a lot of rumors around announcements to be shared at The Microsoft Fabric Community Conference later this year.
  3. Databricks. Key areas of updates in January: 1) security and compliance - Clean Rooms generally available, egress control for Databricks serverless and Mosaic AI Model Serving, 2) data engineering experience (also in SQL) - investments in Delta Live Tables, enhanced experience for comments, collations, new features of Databricks SQL, 3) AI/BI - both Genie and Dashboards get better every month, 4) AI external integrations - APIs for connecting AI agents with external apps, 5) query federation - support for Teradata in Lakehouse Federation (potential support for incremental migrations).

One general thought - in my opinion, companies like Snowflake and Databricks may be the winners of the recent turmoil around DeepSeek as they do not focus on their own generative AI models but instead make use of great models available on the market. But of course, I may be wrong.

That's all folks. As always, share in the comments all interesting updates, articles, videos etc. you found last month. Thanks for reading and until next time.


More articles by Pawel Potasinski: Data Platform News (February 2025), TPCH test: Databricks vs Fabric vs Snowflake, Data Platform News (December 2024), and earlier monthly editions.