ADF CUSTOM ACTIVITY

WHAT IS AN ADF CUSTOM ACTIVITY?

So very quickly, in case you don’t know, an Azure Data Factory Custom Activity is simply a bespoke command or application created by you, in your preferred language and wrapped up in an Azure platform compute service that ADF can call as part of an orchestration pipeline. If you want to draw a comparison to SSIS, you can think of an ADF Custom Activity as a Script Task within an SSIS Control Flow.

For the ADF example in this post I’m going to talk about creating a Custom Activity using:

  • a console application
  • developed in Visual Studio
  • coded in C#
  • and executed on Windows virtual machines

I hope this is a fairly common approach. If not, things can be switched around as you prefer and I’ll highlight alternatives as we go through.

WHY DO WE NEED AN ADF CUSTOM ACTIVITY?

Typically, in our data platform, data analytics or data science solutions, an element of data cleaning and preparation is required. Sadly, raw source data isn’t perfect. My go-to example here is a text qualified CSV file containing a free text attribute to which some lovely frontend user has added carriage returns while typing in notes or whatever. These line feeds, as we know, break our CSV structure, meaning schema-on-read transformation tools struggle to read the file.
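
To make the problem concrete, here is a minimal sketch of the kind of cleaning routine the console application might contain, assuming double quotes are the text qualifier. The class name, method name and the replace-with-a-space behaviour are illustrative choices, not the only way to do it:

using System.IO;

// Illustrative cleaner: strips line breaks that appear inside text qualified
// (double quoted) CSV fields so every record sits on one physical line.
public static class CsvCleaner
{
    public static void RemoveLineBreaksInQuotedFields(string inputPath, string outputPath)
    {
        bool insideQuotes = false;

        using (var reader = new StreamReader(inputPath))
        using (var writer = new StreamWriter(outputPath))
        {
            int next;
            while ((next = reader.Read()) != -1)
            {
                char c = (char)next;

                if (c == '"')
                {
                    insideQuotes = !insideQuotes; // toggle on every text qualifier
                }

                if (insideQuotes && (c == '\r' || c == '\n'))
                {
                    writer.Write(' '); // replace the embedded line break
                    continue;
                }

                writer.Write(c);
            }
        }
    }
}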

So, given the last paragraph, why do we need an ADF Custom Activity? To prepare our data using transformation tools that don’t require a predefined data structure.

Of course, we can use a Custom Activity for 9000 other reasons as well. We can write some C# to do anything! Or, to answer the question another way: we need an ADF Custom Activity to perform an operation within our orchestration pipeline that isn’t yet supported natively by the ADF Activities.

Next, you may wonder why we need a Custom Activity for the data preparation example, rather than handling this with some custom code wrapped up in an Azure Function. Well, an Azure Function can only run for 10 minutes and won’t scale out in the same way our Custom Activity will. Simple rules:

  • Azure Function – quick execution, no scale out abilities.
  • Custom Activity – long running (if needed), ability to scale out automatically and scale up compute nodes if needed.

Great, that’s the why done. Next.

HOW DOES AN ADFV2 CUSTOM ACTIVITY WORK?

Let’s start with the high level image I use in my community talks and add some narrative for all the arrows and services! Please do click and enlarge this image. Maybe even save it to view on a second monitor while we go through it.

In the image above, moving from left to right, this is what needs to happen and what we need to understand to create our own Custom Activity:

  1. We start with a Visual Studio solution containing a C# console application project. Or VB, or whatever. Just something that compiles to an executable.
  2. Within the project we write our custom code and classes for whatever we need. For the file cleaning example above, let’s say we have an Azure Data Lake file handler class, a file cleaner class with methods for different cleaning operations and an authentication wrapper class for Azure Key Vault (I’ll talk more about Key Vault later). Finally, we have a Program class where everything comes together.
  3. Deploy an Azure Blob Storage account and create a container where our compiled code can live in the form of a master copy.
  4. Next, we build our console application and from the Visual Studio project ‘bin’ folder we copy the .exe file and any NuGet/referenced .dll files into our Azure Blob Storage container. I recommend creating some PowerShell to handle this deployment from Visual Studio.
  5. Deploy an Azure Batch Service. As Azure Data Factory can’t execute the custom code directly with its own Integration Runtime, we need something else to do it.
  6. Within the Azure Batch Service create a Compute Pool. This gives you a pool of virtual machines that can be used in parallel to run our custom code. The pool can have one node or many. Windows or Linux. It can even be given a formula to auto scale out as requests for more tasks are made against the batch service.
  7. For this walk through let’s assume we have Azure Data Lake Storage already deployed with some raw poorly structured data in a CSV file.
  8. Deploy an Azure Data Factory if you haven’t already.
  9. Within your data factory you’ll need linked services to the blob storage, data lake storage, key vault and the batch service, as a minimum for this example.
  10. Create an ADF pipeline with a vanilla Custom Activity.
  11. In the Custom Activity add the batch linked service. Then in settings add the name of your exe file and the resource linked service, which is your Azure Blob Storage. AKA the master copy of the exe.
  12. Next, add Reference Objects from data factory that can be used at runtime by the Custom Activity console app.
  13. We now have everything we need in place for the Custom Activity to function. Let’s trigger the pipeline and think about the engineering that happens.
  14. At runtime, Data Factory logs a task request with our Azure Batch Service onto its own internal queue.
  15. The batch service copies the master version of the exe file from blob storage and sets up the task in a special working directory on an available compute node within the pool.
  16. In addition to the exe file, three JSON files are created and copied into the same working directory. These contain information about the Reference Objects set in the data factory Custom Activity.
  17. The batch service triggers the console app to run, which executes on the compute node virtual machine. If Windows, this can be seen in Task Manager.
  18. As the code is running independently on the batch compute node VM, it needs to handle its own authentication to other Azure services, in just the same way as if the code were running locally on your desktop. This is where we need Azure Key Vault. Service principal credentials in the ADF Reference Objects are no longer shared with the batch service in plain text at runtime. As of September 2018, data factory won’t do this.
  19. The exe calls and authenticates against Azure Key Vault using its own Azure Active Directory service principal and requests a key to authenticate against the Azure Data Lake Store account, in this example.
  20. The raw data is downloaded to the batch compute node. A temporary directory will need to be defined in the application where the file can safely be downloaded to.
  21. Cleaning processes are performed on the compute node using the local disk, RAM and CPU.
  22. The cleaned data is uploaded back to Azure Data Lake Storage. Depending on the size of the file this may need to be done in chunks and concatenated at the target.
  23. Once complete the batch node working directory is cleaned up. The exe and JSON reference object files get deleted. Your downloaded data won’t be automatically deleted. You’ll need to code that into your application.
  24. Standard out and standard error log files are copied to the same blob storage account as the master exe but under a separate container called ‘adfjobs’ with subfolders for the Custom Activity run ID from ADF.
  25. Data factory reports the state of the activity as successful. We hope.

With an understanding of the above, let’s now go even deeper into some of these steps, especially where things differ between ADFv1 and ADFv2.

THE CONSOLE APPLICATION

As mentioned above, we can use C# or whatever to create our console application. The other consideration here is whether you want to use Windows machines or Linux machines for the Batch Service compute pool. If Linux, then you’ll need to develop your application using .NET Core rather than the full framework. Yes, the Linux VMs will be cheaper than Windows, but .NET Core has a limited set of libraries compared to the full fat version. It’s a trade-off you’ll need to consider. For the rest of this post I’ll be referring to things in the Batch Service assuming we used Windows compute nodes.

DEPLOYING YOUR APPLICATION WITH POWERSHELL

In point 4 above, a little bit of PowerShell can be helpful to get your compiled application up into Azure Blob Storage. At least in your development environment, maybe not for production. This isn’t anything complex, just a copy of everything from your Visual Studio bin folder to a blob storage container.

I’ve put an example PowerShell script in the below GitHub repository. Please feel free to copy it. I’d normally have a PowerShell helper project in my Visual Studio solution, or just add it directly to the console application project. In both cases the PowerShell extension will be needed to execute the script.


Code in Blog Support GitHub repository. https://github.com/mrpaulandrew/BlobSupportingContent


AUTO SCALING THE COMPUTE POOL

We don’t want to be paying for Azure VMs that aren’t in use. Therefore, auto scaling the compute pool is a must in production. Microsoft have a great document here on creating the formula that decides how scaling occurs. The example formula is a good starting point:

startingNumberOfVMs = 1;
maxNumberofVMs = 25;
pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
pendingTaskSamples = pendingTaskSamplePercent < 70 ? startingNumberOfVMs : avg($PendingTasks.GetSample(180 * TimeInterval_Second));
$TargetDedicatedNodes=min(maxNumberofVMs, pendingTaskSamples);        

Code in Blog Support GitHub repository. https://github.com/mrpaulandrew/BlobSupportingContent


To apply the formula simply go to your Azure Batch Service > Pools > select the pool > Scale > Auto Scale > add your formula > Save.
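
If you would rather apply the formula from code than through the portal blades, the Batch .NET client library (Microsoft.Azure.Batch NuGet package) can do the same thing. A rough sketch, with placeholder account details and pool name, assuming the formula above has been saved to a text file next to the app:

using System;
using System.IO;
using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Auth;

class ApplyAutoScale
{
    static void Main()
    {
        // Illustrative account details only; in practice pull the key from Key Vault.
        var credentials = new BatchSharedKeyCredentials(
            "https://mybatchaccount.westeurope.batch.azure.com",   // batch account URL
            "mybatchaccount",                                      // batch account name
            "<batch account key>");

        // The scaling formula shown above, saved alongside the app.
        string formula = File.ReadAllText("autoscale-formula.txt");

        using (BatchClient batchClient = BatchClient.Open(credentials))
        {
            // Pool name is a placeholder; 5 minutes is the minimum evaluation interval.
            batchClient.PoolOperations.EnableAutoScale(
                "adf-custom-activity-pool", formula, TimeSpan.FromMinutes(5));
        }
    }
}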

The only downside to this option is that the formula can only be evaluated at a minimum frequency of 5 minutes. This means tasks won’t execute as quickly compared to a fixed pool size that is always available. This is of course a common pattern with any Azure service using a cluster of compute… If you want instant processing, pay for it to be on all the time.

COMPUTE NODES

In point 15 I mentioned a special working directory. If you RDP to one of the compute nodes, it can be found at the following location:

C:\user\tasks\workitems\adf-{guid}\job-0000000001\{guid}-{activityname}-\wd\

Bear in mind that the C drive is not where the operating system lives. These nodes have a non-standard Windows image and special environment variables, listed here. That said, custom images can be used if you want to have software preloaded onto a compute node. For example, if your data cleaning custom application is really large or has multiple versions, it could be loaded into a VM image used when the compute pool is created. Maybe a topic for another post.
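
Rather than hard coding that path, the console app can use those documented Batch environment variables to find out where it is running. A tiny sketch using AZ_BATCH_TASK_WORKING_DIR:

using System;

class WorkingDirectoryExample
{
    static void Main()
    {
        // Populated by the batch service on a compute node; null when run locally.
        string workingDir = Environment.GetEnvironmentVariable("AZ_BATCH_TASK_WORKING_DIR");
        Console.WriteLine($"Task working directory: {workingDir ?? "not running on a batch node"}");
    }
}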

Lastly, consider my comment in point 20 about the data. This is downloaded onto the batch service compute nodes for processing and, as mentioned, we can RDP to them. By default this will be done using dynamic public IP addresses. I therefore recommend adding the compute pool to an Azure Virtual Network (VNet), meaning access can only be gained via a private network and local address.

If you do decide to attach the compute nodes to an Azure VNet, please also consider the number of addresses available on your chosen subnet when using the auto scaling pool option mentioned above. It might sound obvious now, but I’ve seen pools fail to scale because there weren’t enough IP addresses available. #AzureFeatures

THE CUSTOM ACTIVITY

Now let’s think about Azure Data Factory briefly, as it’s the main reason for the post!

In version 1 we needed to reference a namespace, class and method to call at runtime. Now in ADF version 2 we can pass a command to the VM compute node; settings screenshot from the ADF developer portal below.

This command can be any Windows command. We can change directory and do something else on the VM or whatever, especially if using a custom image for the VMs. Very handy and much more flexible than version 1.

USING THE REFERENCE OBJECTS

Now addressing points 12 and 16 above, let’s focus on our data factory Reference Objects (seen in the above screenshot as well). This is probably the biggest difference to understand between versions of Azure Data Factory. In version 2, at runtime three JSON files are copied alongside our exe console application and put into the compute node working directory. These are named:

  • activity.json
  • datasets.json
  • linkedServices.json

There will only ever be three of these JSON files regardless of how many Reference Objects are added in the data factory settings. To further clarify: if we reference 4 linked services, they will all be added to the single linkedServices.json reference file as an array. The array will have a copy of the JSON used by Data Factory when it calls the linked service in a pipeline.

This gets worse with datasets, especially if there is dynamic content being passed from data factory variables. The dataset name within the reference file will get suffixed with a GUID at runtime. This is nice and safe for the JSON file created on the compute node, but becomes a little more tricky when we want to use the content in our custom application and it has a random name! For example we get this: AdventureWorksTable81e2671555fc4bc8b8b3955c4581e065

To access the reference objects we need to code this into our console app, parse the JSON and figure out which key value pairs we want to use. This can get very messy and my C# skills mean I don’t yet have a clean way of doing this. Here’s an example of how we might find details of our Data Lake Storage account from the linked service reference objects file.
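
Something along these lines, using Newtonsoft.Json to pick out the Data Lake Store linked service. The assumed file shape (an array of linked service definitions with name, properties.type and properties.typeProperties) and the 'AzureDataLakeStore' type filter are my reading of the reference file format, so treat it as a sketch rather than gospel:

using System;
using System.IO;
using System.Linq;
using Newtonsoft.Json.Linq;

class ReferenceObjectReader
{
    static void Main()
    {
        // linkedServices.json sits alongside the exe in the task working directory.
        JArray linkedServices = JArray.Parse(File.ReadAllText("linkedServices.json"));

        // Assumed shape: [{ "name": "...", "properties": { "type": "AzureDataLakeStore",
        //   "typeProperties": { "dataLakeStoreUri": "..." } } }, ...]
        JToken dataLake = linkedServices
            .FirstOrDefault(ls => (string)ls["properties"]?["type"] == "AzureDataLakeStore");

        if (dataLake == null)
        {
            throw new InvalidOperationException("No Data Lake Store linked service found in the reference objects.");
        }

        string dataLakeUri = (string)dataLake["properties"]["typeProperties"]["dataLakeStoreUri"];
        Console.WriteLine($"Found linked service '{dataLake["name"]}' pointing at {dataLakeUri}");
    }
}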


Code in Blog Support GitHub repository. https://github.com/mrpaulandrew/BlobSupportingContent


I’d be more than happy to accept some C# help on coming up with a better way to traverse these ‘Reference Object’ files.

AZURE KEY VAULT

If you haven’t used Azure Key Vault already, there is plenty of documentation and examples on working with the SDK, link below. Security is important, so Microsoft give us lots of help here.

https://docs.microsoft.com/en-us/azure/key-vault/key-vault-developers-guide

This extra layer of keys being passed around does initially seem a little overkill, but when keys get recycled having a central location is the best thing ever. Do also make the effort to create a comprehensive key vault utilities class that works for you within the console app.

To implement this in an Azure Data Factory Custom Activity you’ll need to have Key Vault added as its own linked service. Then add that linked service as one of the Reference Objects. When parsed in your app this will give you the URL to reach the Key Vault API. Or, feel free to hard code it in the application config; it depends whether you plan to change Key Vaults. With the Key Vault URL and the console app’s own service principal you’ll be able to hit the API and request your Data Lake Store secret, for example.
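
As a minimal sketch of that final step, using the Microsoft.Azure.KeyVault and ADAL (Microsoft.IdentityModel.Clients.ActiveDirectory) packages that were current at the time of writing. The vault URL, secret name and service principal values are placeholders:

using System;
using System.Threading.Tasks;
using Microsoft.Azure.KeyVault;
using Microsoft.IdentityModel.Clients.ActiveDirectory;

class KeyVaultExample
{
    static async Task Main()
    {
        const string vaultUrl = "https://my-key-vault.vault.azure.net/";     // from the Key Vault linked service
        const string clientId = "<console app service principal id>";
        const string clientSecret = "<console app service principal secret>";

        // Callback used by the Key Vault client to fetch an AAD token for each request.
        var keyVaultClient = new KeyVaultClient(
            new KeyVaultClient.AuthenticationCallback(async (authority, resource, scope) =>
            {
                var context = new AuthenticationContext(authority);
                var credential = new ClientCredential(clientId, clientSecret);
                AuthenticationResult result = await context.AcquireTokenAsync(resource, credential);
                return result.AccessToken;
            }));

        // Pull the Data Lake Store credential out of the vault; secret name is illustrative.
        var secret = await keyVaultClient.GetSecretAsync(vaultUrl, "DataLakeStoreKey");
        string dataLakeKey = secret.Value;   // use this to authenticate against Data Lake Store
        Console.WriteLine("Secret retrieved from Key Vault.");
    }
}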

LOGGING

If you have 30+ compute nodes in your Batch Service pool, creating an RDP connection to each node doesn’t really work when trying to check log files. Therefore, as mentioned in point 24 above, your console application standard error and standard out log files get copied into Azure Blob Storage. The container created for you, called ‘adfjobs’, will have a subfolder per data factory activity execution, named with the GUID from the activity run ID. Simply grab the GUID from your Data Factory monitoring and use something like Azure Storage Explorer to search the adfjobs container.

I also like to extend the normal C# Console.WriteLine output with a little more formatting and date/time information. For example:
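
Something like the sketch below; the exact format is personal preference:

using System;

// Simple console logger that prefixes every line with a UTC timestamp and a level,
// so the adfjobs stdout/stderr files are easier to read after the event.
public static class Log
{
    public static void Info(string message) =>
        Console.WriteLine($"{DateTime.UtcNow:yyyy-MM-dd HH:mm:ss.fff} [INFO]  {message}");

    public static void Error(string message) =>
        Console.Error.WriteLine($"{DateTime.UtcNow:yyyy-MM-dd HH:mm:ss.fff} [ERROR] {message}");
}

// Example usage:
// Log.Info("Starting file clean for Sales.csv");
// Log.Error("Download failed, retrying...");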


Code in Blog Support GitHub repository. https://github.com/mrpaulandrew/BlobSupportingContent


DEBUGGING

Debugging your custom activity just became super easy in ADFv2, mainly because we are using a console application that can be run locally in Visual Studio with all the usual local debugging perks. The only minor change I recommend you make is imitating the ADF Reference Objects. We of course don’t get a copy of these locally when running the application in Visual Studio. An easy way to handle this is by checking for the Batch Service compute node environment variables, which you won’t have locally unless you want to create them. When they are missing or have NULL values, use copies of the Reference Object JSON files saved into your Visual Studio console application project. You can get them by remoting onto a compute node and copying them out while the Custom Activity is running, if you don’t want to recreate them by hand.
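
A small sketch of that fallback logic; the local ‘ReferenceObjects’ folder name is just my suggestion for where to keep the copied JSON files in the project (set to copy to the output directory):

using System;
using System.IO;

public static class ReferenceObjectLocator
{
    // Returns the folder to read activity.json, datasets.json and linkedServices.json from:
    // the batch task working directory on a compute node, or a local copy when debugging.
    public static string Resolve()
    {
        string batchWorkingDir = Environment.GetEnvironmentVariable("AZ_BATCH_TASK_WORKING_DIR");

        if (!string.IsNullOrEmpty(batchWorkingDir))
        {
            return batchWorkingDir;                                         // running on an Azure Batch node
        }

        return Path.Combine(AppContext.BaseDirectory, "ReferenceObjects");  // local debug copies
    }
}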
