Data Platform News (January 2025)

Welcome to the first edition of my newsletter in 2025. January was an interesting month in terms of what was happening in the tech world. You know, DeepSeek and stuff like that. However, in this newsletter I try to stick to my three favorite platforms: Snowflake, Fabric, and Databricks. And there was a lot of interesting stuff going on in those platforms last month.

Snowflake

  • Anthropic's Claude 3.5 Sonnet model is now available in Snowflake Cortex AI for use in SQL and Python (the COMPLETE function) as well as in Cortex Playground (where you can compare it with other available models and apply Cortex Guard rules to it). For now, the model is available only for Snowflake accounts running in the AWS US West 2 region. Read the announcement: Anthropic’s Claude 3.5 Sonnet now available in Snowflake Cortex AI.
  • After all the rumors around the DeepSeek model last week, I was wondering if and when we would see this model natively in Snowflake and other platforms. First, I read the article Running DeepSeek-R1 in Snowpark Container Services: A Complete Guide by James Cha-Earley, in which James provides a ready-to-use recipe for a custom DeepSeek environment built on top of Snowpark Container Services. One day later, Snowflake officially announced DeepSeek-R1 in private preview on Snowflake Cortex AI. Wow, that was fast (at the same time, Microsoft announced DeepSeek availability in Azure AI Foundry).
  • Snowflake officially announced switching from a shared responsibility model to a shared destiny model for security. In the new model, Snowflake takes a more proactive role in helping customers secure their accounts. Read the whole announcement with detailed plans for 2025: Shared Destiny with Snowflake Horizon catalog Built-in Security.
  • The SnowConvert tool, which supports migrations from Oracle, Teradata, SQL Server, and Redshift to Snowflake by performing code conversions, is now publicly available for download after completing the required free training. For more details, visit this site: SnowConvert | Snowflake.
  • You can now define join policies in Snowflake (in addition to aggregation and projection policies). A join policy applied to a table specifies whether queries against it must include a join and, when joins are required, can restrict them to specific joining columns. For more information, read the release notes.
  • Outbound private connectivity announced in January lets you create private endpoints in Snowflake to access a cloud platform using the platform’s private connectivity solution rather than the Internet. It's available for the following Snowflake features: external network locations, external functions, external stages, external tables, external volumes for Iceberg tables, Snowpipe automation. For more details check the documentation: Private connectivity for outbound network traffic.
  • Data metric functions in Snowflake can now accept multiple tables as arguments (great for referential integrity, matching and comparison, or conditional checking across different datasets). See release notes.
  • In January Snowflake announced general availability of support for automated refreshes of Apache Iceberg™ tables that use an external catalog. See release notes.
  • Two new switches were added to the CREATE and ALTER DYNAMIC TABLE commands: 1) for the CREATE command it's REQUIRE USER, which ensures that a dynamic table cannot refresh unless a user is specified via COPY SESSION; 2) for the ALTER command it's COPY SESSION, which lets you run a refresh operation in a copy of the current session, using the same user and warehouse. For more details, read the release notes.
  • Snowflake Demos (snowflake.demos) is a new library (plus API) that lets you spin up an entire Notebooks environment for a Snowflake quickstart, including all resources (e.g. roles, permissions, Snowflake objects), in a single command. More in this article: Launch Snowflake Notebooks with One Line of Code.
  • Last month Snowflake shared an interesting blog post showing performance improvements for ingestion processes over the last few years (even 80% time/cost reduction thanks to the USE_VECTORIZED_SCANNER option). Read the full blog post for details: Loading Terabytes Into Snowflake: Speeds, Feeds and Techniques.
  • As a big fan of Snowflake Horizon Catalog I was super happy to find a whole playlist on YouTube presenting features for governance, security and compliance in Snowflake: Snowflake Horizon Catalog - YouTube.
  • I found an interesting case study of migrating real-time click-to-purchase recommendation processing from Kafka + Flink + ScyllaDB on AWS to Snowflake + Airflow: How we saved thousands with Snowflake and Airflow.
  • Cesar Segura Martín wrote a very useful blog post on Extending a full QUERY_HISTORY. This may be a great addition to the Snowflake observability toolset of every platform administrator!
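The multi-table data metric functions mentioned above make checks like referential integrity expressible across datasets. As a rough illustration, the core of such a check is just counting rows whose key has no match in a reference table. This is plain Python with made-up table and column names, not Snowflake's DMF syntax:

```python
# Hypothetical illustration of the kind of check a multi-table data metric
# function enables: counting "orphan" rows whose foreign key has no match
# in a reference table. All names and data here are invented for the example.

def count_orphans(child_rows, child_key, parent_rows, parent_key):
    """Return how many child rows reference a key absent from the parent."""
    parent_keys = {row[parent_key] for row in parent_rows}
    return sum(1 for row in child_rows if row[child_key] not in parent_keys)

orders = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": 11},
    {"order_id": 3, "customer_id": 99},  # no matching customer
]
customers = [{"customer_id": 10}, {"customer_id": 11}]

print(count_orphans(orders, "customer_id", customers, "customer_id"))  # 1
```

In Snowflake the same idea would run as a scheduled data metric function over the two tables, with results landing in the monitoring event table.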

Microsoft Fabric

  • A massive monthly update landed in Fabric last month. Some of my favorite features in this update: semantic model version history, the TMDL scripting experience and Direct Lake model live editing in Power BI Desktop, folder support in Git, tenant-level Private Link, open in SSMS and VS Code for warehouse, JSON aggregates + spatial analytics functions + SET SHOWPLAN_XML + join and query hints for warehouse (a really big update for warehouse this month). For more features, have a look at the blog post: Microsoft Fabric January 2025 update.
  • But that was not all! Separately from the monthly update, we got plenty of important announcements and features! One of my favorites was Surge Protection (Preview). Surge Protection helps protect capacities from excess usage by background workloads. It acts as a resource governor, rejecting background operations when the capacity reaches a limit set by the capacity admin. I'd say it's a nice first step toward addressing the "noisy neighbor" problem well known from Power BI and affecting many Fabric workspaces today.
  • Another important feature announced in January was the introduction of ownership takeover for Fabric items. Unfortunately, the feature works only in the UI at the moment. Support for taking over ownership via APIs is on the product team's backlog.
  • An interesting piece of news last month was the introduction of Fabric Copilot capacity. Now you can use Copilot in workspaces backed by capacities smaller than F64. Does it mean you no longer need a P1/F64 or higher capacity to use Copilot in Fabric? Unfortunately, no. You still need to assign a P1/F64 or higher capacity as a Fabric Copilot capacity for users to ensure all their Copilot usage is charged to that capacity.
  • A nice update for OneLake shortcuts - the ability to define security settings on sub-folders within a shortcut. Read more in this blog post: Define security on folders within a shortcut using OneLake data access roles.
  • Two more important updates for Fabric warehouse: 1) service principal support for Fabric Data Warehouse and 2) granular permissions for COPY INTO command in Fabric Data Warehouse.
  • Romain Casteres published a comprehensive article FinOps for Microsoft Fabric discussing how to address different FinOps principles in Fabric. Great read!
  • Nick Salch shared a useful tool - Fabric Workspace Monitoring Dashboards. You can leverage these dashboards on top of data collected by the Workspace Monitoring feature. Visit this GitHub repository to get the tool: fabric-toolbox/monitoring.
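To build intuition for the Surge Protection idea described above, here is a toy resource-governor sketch. It is purely illustrative: the class name, the unit model, and the threshold logic are my assumptions, not Fabric's actual implementation. The one behavior it mirrors from the announcement is that only background operations get rejected once usage crosses an admin-configured limit:

```python
# Toy sketch of a surge-protection-style resource governor (NOT Fabric's
# implementation). Background operations are rejected once background usage
# would exceed an admin-set percentage of total capacity.

class SurgeProtector:
    def __init__(self, capacity_units: float, background_limit_pct: float):
        self.capacity_units = capacity_units          # total capacity size
        self.background_limit_pct = background_limit_pct  # admin-set threshold
        self.background_usage = 0.0                   # units consumed so far

    def try_start_background(self, cost_units: float) -> bool:
        """Admit a background operation only if it stays under the limit."""
        limit = self.capacity_units * self.background_limit_pct / 100
        if self.background_usage + cost_units > limit:
            return False  # rejected: over the admin-configured threshold
        self.background_usage += cost_units
        return True

governor = SurgeProtector(capacity_units=64, background_limit_pct=50)
print(governor.try_start_background(20))  # True  (20 of 32 allowed units)
print(governor.try_start_background(20))  # False (would reach 40 of 32)
```

The real feature works on smoothed capacity usage and per-workload rules, but the admit-or-reject decision at a configured threshold is the essence of it.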

Databricks

  • Clean Rooms in Databricks became generally available in January. The feature uses Delta Sharing and serverless compute to provide a secure, privacy-protecting environment where multiple parties can work together on sensitive enterprise data without direct access to each other’s data. Read more details in the release notes. Also, watch this video to see the feature in action: Databricks Clean Rooms Product Demo.
  • Delta Live Tables now support publishing to tables in multiple schemas and catalogs. Read the release notes.
  • Code comments in notebooks now support email notifications and @ mentions. This is a great improvement to collaboration. Read more in release notes.
  • You can now easily upload files into Databricks workspaces using drag & drop in the UI. Read more in release notes.
  • AI agent tools in Databricks can now use APIs to connect to external applications, like Slack or Google Calendar. See here for more details: Connect AI agent tools to external services | Databricks on AWS.
  • A lot of updates to AI/BI (Genie and Dashboards) showed up in January. The ones that caught my attention: calculated measures for dashboards (yes, you guessed it right - measures represented by simple formulas), more and more cross-highlighting options in dashboards (this time support for point maps), downloading dashboards to PDF, and improved query descriptions for Genie. For more, check out the release notes.
  • Teradata joined the list of data sources supported by Lakehouse Federation in Databricks. Several pushdowns are supported: filters, projections, limit, aggregates, cast, contains, startswith, endswith, like. Read more in release notes.
  • New SQL functions appeared in Databricks SQL: DAYNAME (three-letter day-of-week name), UNIFORM (random number from a specified range), RANDSTR (random string of a given length).
  • Egress control for Databricks serverless and Mosaic AI Model Serving workloads is available in Public Preview. Read the official announcement: Announcing egress control for serverless and model serving workloads.
  • Collations are now available in Public Preview with Databricks Runtime 16.1 (coming soon to Databricks SQL Preview Channel with version 2024.50 and Databricks Delta Live Tables). Collations streamline data processing by defining rules for sorting and comparing text in ways that respect language and case sensitivity. Read the official announcement: Introducing Collations to Databricks.
  • Oh, did I mention that you can easily run DeepSeek-R1 in Databricks? ;-) Simply follow the instructions in this blog post: DeepSeek R1 on Databricks. Wow, that was fast as well!
  • My discovery of the month in the Databricks world was DQX - Data Quality Framework by Databricks Labs. DQX is a data quality framework for Apache Spark that enables defining, monitoring, and reacting to data quality issues in data pipelines. DQX GitHub repository: databrickslabs/dqx.
  • Dustin Vannoy shared another useful video on Databricks developer best practices for version control, automated tests and CI/CD. Watch the video here: Developer Best Practices on Databricks: Git, Tests, and Automated Deployment.
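For intuition, the three new Databricks SQL functions mentioned above can be approximated in plain Python. These are rough equivalents only; the SQL versions accept an optional seed/generator argument and their exact signatures and semantics differ:

```python
# Rough Python approximations of the new Databricks SQL functions
# DAYNAME, UNIFORM, and RANDSTR - for intuition only, not exact semantics.
import datetime
import random
import string

def dayname(d: datetime.date) -> str:
    """Three-letter day-of-week name, like SQL DAYNAME."""
    return ("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")[d.weekday()]

def uniform(lo: int, hi: int, rng: random.Random) -> int:
    """Random integer in [lo, hi], like SQL UNIFORM."""
    return rng.randint(lo, hi)

def randstr(length: int, rng: random.Random) -> str:
    """Random alphanumeric string of the given length, like SQL RANDSTR."""
    alphabet = string.ascii_letters + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))

rng = random.Random(42)
print(dayname(datetime.date(2025, 1, 31)))  # Fri
print(uniform(1, 10, rng))
print(randstr(8, rng))
```

UNIFORM and RANDSTR are particularly handy for generating reproducible test data directly in SQL when you pass a fixed seed.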


Sources of news and updates

If you are looking for resources useful for staying up to date with Snowflake, Fabric and Databricks, see a list I shared in the August 2024 edition of this newsletter.


My quick summary

It's time to wrap things up for January:

  1. Snowflake. Key areas of investments in January: 1) AI model integration - the introduction of Anthropic's Claude 3.5 Sonnet model and DeepSeek-R1 in Snowflake Cortex AI, 2) security - a shift from shared responsibility to shared destiny model, emphasizing proactive security measures for customers, investments in outbound private connectivity, join policies, 3) support for migrations - the SnowConvert tool made publicly available.
  2. Fabric. Key areas of improvements last month: 1) Power BI development - TMDL view and Direct Lake live editing in Power BI Desktop, folder support in Git, 2) Fabric warehouse - tons of improvements including spatial analytics, support for service principals and better integration with external tools, 3) administration, security and governance - Surge Protection, tenant-level Private Link and Fabric item ownership takeover. Plus a lot of rumors around announcements to be shared at The Microsoft Fabric Community Conference later this year.
  3. Databricks. Key areas of updates in January: 1) security and compliance - Clean Rooms generally available, egress control for Databricks serverless and Mosaic AI Model Serving, 2) data engineering experience (also in SQL) - investments in Delta Live Tables, enhanced experience for comments, collations, new features of Databricks SQL, 3) AI/BI - both Genie and Dashboards get better every month, 4) AI external integrations - APIs for connecting AI agents with external apps, 5) query federation - support for Teradata in Lakehouse Federation (potential support for incremental migrations).

One general thought - in my opinion, companies like Snowflake and Databricks may be the winners of the recent turmoil around DeepSeek as they do not focus on their own generative AI models but instead make use of great models available on the market. But of course, I may be wrong.

That's all folks. As always, share in the comments all interesting updates, articles, videos etc. you found last month. Thanks for reading and until next time.


More articles by Pawel Potasinski: Data Platform News (February 2025), TPCH test: Databricks vs Fabric vs Snowflake, Data Platform News (December 2024), and earlier monthly editions.