Azure Synapse Analytics - Quick Bites!

Hi Friends,

This week, based on my learnings, I will share some high-level best practices, information, and limitations in Azure Synapse Analytics.

You can use this as a reference point and do further research for your own use case.

Azure Synapse Analytics - Important Pointers

1. IP firewall rules – In the portal, under Networking, configure a specific IP address range and associate it with a rule name.
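As a hedged illustration, the same rule could be created programmatically with the azure-mgmt-synapse Python SDK; the subscription ID, resource group, workspace name, and IP range below are all placeholders for your own environment:

```python
# Minimal sketch, assuming the azure-identity and azure-mgmt-synapse packages.
from azure.identity import DefaultAzureCredential
from azure.mgmt.synapse import SynapseManagementClient
from azure.mgmt.synapse.models import IpFirewallRuleInfo

client = SynapseManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Associate a specific IP address range with a named firewall rule,
# mirroring Portal -> Synapse workspace -> Networking.
poller = client.ip_firewall_rules.begin_create_or_update(
    resource_group_name="my-rg",
    workspace_name="my-synapse-ws",
    rule_name="office-range",
    ip_firewall_rule_info=IpFirewallRuleInfo(
        start_ip_address="203.0.113.0",
        end_ip_address="203.0.113.255",
    ),
)
print(poller.result().provisioning_state)
```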

2. By default, when we create a Synapse workspace, public network access is enabled (not recommended).

3. Endpoints in Synapse -

-> Dedicated SQL pool

-> Serverless SQL pool

-> Development endpoint

4. From a local machine, TCP ports 80, 443 & 1433 and UDP port 53 should be open in order to connect to Synapse Studio. TCP 1433 (outbound) is required to connect from SSMS & Power BI.
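As a quick local sanity check, a small Python snippet like the sketch below can probe whether the relevant TCP ports are reachable. The hostnames are illustrative patterns (replace <workspace> with your workspace name), and UDP 53/DNS is not covered by this TCP-only check:

```python
# Minimal TCP reachability check for the Synapse Studio / SQL endpoints.
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

targets = [
    ("web.azuresynapse.net", 443),                         # Synapse Studio web UI
    ("<workspace>.sql.azuresynapse.net", 1433),            # dedicated SQL (SSMS, Power BI)
    ("<workspace>-ondemand.sql.azuresynapse.net", 1433),   # serverless SQL
    ("<workspace>.dev.azuresynapse.net", 443),             # development endpoint
]
for host, port in targets:
    print(f"{host}:{port} -> {'open' if can_connect(host, port) else 'blocked'}")
```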

5. A managed VNet is recommended for securing data in Synapse. It is managed by Synapse itself: there is no need to create inbound NSG rules on your own VNet to allow Synapse to connect to it, and no need to create a subnet for Spark clusters.

6. A managed VNet, together with managed private endpoints, makes protection against data exfiltration possible.

7. The dedicated SQL pool & serverless SQL pool sit outside the managed VNet. Intra-workspace communication between the workspace and the dedicated/serverless SQL pools uses Azure Private Link, which is created automatically when the managed VNet is created.

8. Azure Synapse Analytics enables connecting to its various components through endpoints.

9. Private Link enables access to other services such as Azure Storage. You establish a private link to a resource by creating a private endpoint. Private endpoints are mapped to a specific resource, not to the entire service.

10. PolyBase can't load rows that have more than 1,000,000 bytes of data. When you put data into text files in Azure Blob Storage or Azure Data Lake Store, each row must contain fewer than 1,000,000 bytes. This byte limit applies regardless of the table schema.
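A simple pre-flight check for this limit might look like the sketch below (the file name is a placeholder; the row is measured in encoded bytes, which is what the limit applies to):

```python
# Sketch: flag rows that exceed PolyBase's 1,000,000-byte per-row limit
# before the file is staged in Blob storage / Data Lake.
MAX_ROW_BYTES = 1_000_000

with open("export.csv", "rb") as f:          # read as bytes: the limit is in bytes
    for lineno, raw in enumerate(f, start=1):
        if len(raw) > MAX_ROW_BYTES:
            print(f"row {lineno}: {len(raw)} bytes - exceeds the PolyBase limit")
```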

11. Split large, compressed files into smaller compressed files.
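One way to do the splitting is sketched below with Python's standard gzip module; the paths and chunk size are placeholders to tune for your files:

```python
# Sketch: stream one large gzipped text file into several smaller gzipped
# parts so the load can parallelize across files.
import gzip

def split_gzip(src: str, dst_prefix: str, lines_per_chunk: int = 1_000_000) -> None:
    """Write consecutive chunks of lines_per_chunk lines to gzipped part files."""
    part, count, out = 0, 0, None
    with gzip.open(src, "rt", encoding="utf-8") as reader:
        for line in reader:
            if out is None:
                out = gzip.open(f"{dst_prefix}.part{part:04d}.csv.gz", "wt", encoding="utf-8")
            out.write(line)
            count += 1
            if count >= lines_per_chunk:
                out.close()
                out, count, part = None, 0, part + 1
    if out is not None:
        out.close()

split_gzip("big_export.csv.gz", "big_export")
```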

12. For the fastest loading speed, run only one load job at a time. If that is not feasible, run a minimal number of loads concurrently. If you expect a large loading job, consider scaling up your dedicated SQL pool before the load.

13. To run loads with appropriate compute resources, create loading users designated for running loads. Assign each loading user to a specific resource class or workload group. To run a load, sign in as one of the loading users and then run the load; it runs with that user's resource class.
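A minimal sketch of that setup via pyodbc follows; the server, passwords, schema, and the choice of staticrc60 are all placeholder assumptions (CREATE LOGIN runs against master, the rest against the dedicated SQL pool database):

```python
# Sketch: create a dedicated loading user and pin it to a static resource class.
import pyodbc

def run(database: str, *statements: str) -> None:
    """Execute statements against the given database on the workspace server."""
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        f"SERVER=<workspace>.sql.azuresynapse.net;DATABASE={database};"
        "UID=sqladmin;PWD=<admin-password>",
        autocommit=True,
    )
    cur = conn.cursor()
    for statement in statements:
        cur.execute(statement)
    conn.close()

run("master", "CREATE LOGIN loader_etl WITH PASSWORD = '<strong-password>';")
run("mydw",
    "CREATE USER loader_etl FOR LOGIN loader_etl;",
    "GRANT INSERT ON SCHEMA::staging TO loader_etl;",
    "EXEC sp_addrolemember 'staticrc60', 'loader_etl';")  # a larger static resource class
```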

14. Allow multiple users to load data; to limit what a loading user can touch, DENY permissions at the schema level.

15. Load to a staging table – Load data into a staging table first. Define the staging table as a heap and use round-robin for the distribution option. Loading is usually a two-step process: you first load to a staging table and then insert the data into a production data warehouse table. If the production table uses a hash distribution, the total time to load and insert might be faster if you define the staging table with the same hash distribution. Loading to the staging table takes longer, but the second step of inserting the rows into the production table then incurs no data movement across the distributions.
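Sketching the two-step pattern (all table names, columns, and the connection string are illustrative placeholders):

```python
# Sketch of pointer 15: land data in a ROUND_ROBIN heap staging table,
# then insert into the hash-distributed production table.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<workspace>.sql.azuresynapse.net;DATABASE=mydw;"
    "UID=loader_etl;PWD=<password>", autocommit=True)
cur = conn.cursor()

# Staging table: heap (no clustered columnstore) + ROUND_ROBIN for fast loading.
cur.execute("""
CREATE TABLE staging.Sales
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM prod.Sales WHERE 1 = 0;
""")

# ... load into staging.Sales here (COPY, PolyBase, etc.) ...

# Second step: the insert redistributes rows into the hash-distributed table.
cur.execute("INSERT INTO prod.Sales SELECT * FROM staging.Sales;")
conn.close()
```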

16. Load to a columnstore index

-> To ensure the loading user has enough memory to achieve maximum compression rates, use loading users that are members of a medium or large resource class.

-> Load enough rows to completely fill new row groups. During a bulk load, every 1,048,576 rows get compressed directly into the columnstore as a full row group. Loads with fewer than 102,400 rows send the rows to the delta store, where they are held in a b-tree index. If you load too few rows, they might all go to the delta store and not get compressed immediately into columnstore format.
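For intuition, the back-of-envelope arithmetic below shows what those thresholds mean across a dedicated SQL pool's 60 distributions:

```python
# A dedicated SQL pool spreads a table over 60 distributions, so the rowgroup
# thresholds above multiply out per load:
DISTRIBUTIONS = 60
FULL_ROWGROUP = 1_048_576   # rows compressed directly into a full rowgroup
DELTA_THRESHOLD = 102_400   # below this, rows land in the delta store

print("rows to fill one rowgroup per distribution:", DISTRIBUTIONS * FULL_ROWGROUP)
# -> 62,914,560
print("minimum rows to bypass the delta store everywhere:", DISTRIBUTIONS * DELTA_THRESHOLD)
# -> 6,144,000
```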

17. Loading with the COPY statement provides the highest throughput with dedicated SQL pools. If you cannot use COPY to load and must use the SqlBulkCopy API or bcp, a batch size between 100 K and 1 M rows is the recommended baseline for determining optimal batch size capacity.
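A hedged sketch of a COPY-based load through pyodbc is below; the storage URL, table, and authentication are placeholders, and a managed identity is assumed for the storage credential:

```python
# Sketch: run a COPY INTO load against the dedicated SQL pool.
import pyodbc

copy_sql = """
COPY INTO staging.Sales
FROM 'https://<account>.blob.core.windows.net/loads/sales/*.csv.gz'
WITH (
    FILE_TYPE = 'CSV',
    COMPRESSION = 'GZIP',
    FIRSTROW = 2,
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);
"""

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<workspace>.sql.azuresynapse.net;DATABASE=mydw;"
    "UID=loader_etl;PWD=<password>", autocommit=True)
conn.cursor().execute(copy_sql)
conn.close()
```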

18. PolyBase is the best choice when you are loading or exporting large volumes of data, or you need faster performance.

19. Azure Active Directory (Azure AD)-only authentication is not supported for dedicated SQL pools with Azure Synapse features enabled.

20. A private endpoint is a network interface that uses a private IP address from your virtual network. This network interface connects you privately and securely to a service that's powered by Azure Private Link.

Inbound connections – connections coming in to access Synapse resources.

Endpoints provide the entry point for these incoming connections:

Dedicated SQL endpoint – dedicated SQL databases (TCP port 1433)

Serverless SQL endpoint – serverless SQL (TCP port 1433)

Development endpoint – Apache Spark pools & pipelines/data flows (TCP port 443)

How are we going to secure these 3 endpoints?

By default, all these endpoints are public. The Synapse workspace provides the below options to protect these endpoints from the public domain -

1. IP firewall

2. DDoS protection (Basic) - includes:

-> Active traffic monitoring

-> Always-on detection

-> Automatic attack mitigation

3. All inbound traffic to the workspace endpoints is encrypted in transit with TLS 1.2.

4. Disable public access (only applicable with a managed VNet). This ensures all inbound traffic to Synapse workspace resources goes through private endpoints only.

5. Each Synapse endpoint discussed above should have its own private endpoint – 3 separate private endpoints in total. These private endpoints are powered by Azure Private Link, and traffic stays entirely on the Microsoft backbone (see the sketch after this list).

6. These private endpoints can be accessed only from within the same virtual network, from other VNets that are globally or regionally peered to the VNet containing the private endpoint, or from on-premises resources via ExpressRoute or a VPN gateway.

7. Private endpoints prevent data leakage.

8. Private endpoint connections across tenants and regions are possible.
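As a rough illustration of item 5 above, a private endpoint for the dedicated SQL endpoint might be created with the azure-mgmt-network SDK as sketched below; every ID and name is a placeholder, and the group_ids value selects which Synapse endpoint the private endpoint fronts ('Sql', 'SqlOnDemand', or 'Dev'):

```python
# Sketch: create a private endpoint against a Synapse workspace's Sql endpoint.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.private_endpoints.begin_create_or_update(
    "my-rg",
    "pe-synapse-sql",
    {
        "location": "eastus",
        "subnet": {
            "id": "/subscriptions/<sub>/resourceGroups/my-rg/providers/"
                  "Microsoft.Network/virtualNetworks/my-vnet/subnets/pe-subnet"
        },
        "private_link_service_connections": [{
            "name": "synapse-sql",
            "private_link_service_id": "/subscriptions/<sub>/resourceGroups/my-rg/"
                                       "providers/Microsoft.Synapse/workspaces/my-ws",
            "group_ids": ["Sql"],  # or "SqlOnDemand" / "Dev" for the other two endpoints
        }],
    },
)
print(poller.result().provisioning_state)
```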

Outbound connections – connections going out of the Synapse workspace to access other resources and services.

Possible outgoing connections –

From the dedicated SQL pool & serverless SQL pool – connect to Azure Storage accounts

From Apache Spark pools & pipelines – connect to 90+ services via linked services (within the managed VNet)

1. Outbound connections from the managed VNet to other Azure services that support private endpoints can be made using managed private endpoint connections.

2. Outbound connections to ADLS Gen2 can be restricted using firewall restrictions on the storage account (a sketch follows this list).

3. Outbound network security is important to avoid data exfiltration. It is available only when the managed VNet is enabled, and this setting cannot be modified after the workspace has been created.
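A minimal sketch of item 2's storage-side restriction with the azure-mgmt-storage SDK; the subscription, resource group, and account names are placeholders:

```python
# Sketch: lock an ADLS Gen2 account's firewall so public traffic is denied
# by default; private endpoints and trusted Azure services still get through.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.storage_accounts.update(
    "my-rg",
    "mydatalake",
    {
        "network_rule_set": {
            "default_action": "Deny",
            "bypass": "AzureServices",
        }
    },
)
```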

21. When data exfiltration protection is enabled, the following scenarios are not possible –

1. From a Synapse pipeline, we cannot connect to a REST API or any service that is hosted publicly outside the approved tenant list / organization.

2. Authenticating through a service principal or OAuth.

3. Connecting to a Machine Learning workspace to run pipelines.

4. Many future scenarios where the outbound connection goes outside the approved tenants.

We can overcome these limitations by using either of the below approaches -

1. Use a self-hosted integration runtime – it is deployed outside the managed VNet and is fully managed by the customer in their own VNet.

2. Create separate Synapse workspaces: a secure zone (with access to secure data sources and exfiltration prevented) and a non-secure zone (with restricted access to the secure zone and the ability to access data from public data sources).

Conclusion

It's always a best practice to start with research, note down all the recommendations from Microsoft, and then map them to your requirements.

These pointers were a perfect fit for my use case, and I am sure they will be helpful for you as well.

Please feel free to reach out to me to discuss any of the topics mentioned above.

I would love to help. Enjoy learning and sharing!

Thank You All!


