ADF CUSTOM ACTIVITY
Darshika Srivastava
Associate Project Manager @ HuQuo | MBA, Amity Business School
WHAT IS AN ADF CUSTOM ACTIVITY?
So very quickly, in case you don’t know, an Azure Data Factory Custom Activity is simply a bespoke command or application created by you, in your preferred language and wrapped up in an Azure platform compute service that ADF can call as part of an orchestration pipeline. If you want to draw a comparison to SSIS, you can think of an ADF Custom Activity as a Script Task within an SSIS Control Flow.
For the ADF example in this post I’m going to talk about creating a Custom Activity using:
I hope this is a fairly common approach. If not, things can be switched around as you prefer and I’ll highlight alternatives as we go through.
WHY DO WE NEED AN ADF CUSTOM ACTIVITY?
Typically, in a data platform, data analytics, or data science solution, an element of data cleaning and preparation is required. Sadly, raw source data isn’t perfect. My go-to example here is a text-qualified CSV file containing a free text attribute to which some lovely frontend user has added carriage returns when typing in notes or whatever. These line feeds, as we know, break our CSV structure, meaning schema-on-read transformation tools struggle to read the file.
So, given the last paragraph, why do we need an ADF Custom Activity? To prepare our data using transformation tools that don’t require a predefined data structure.
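To make the data preparation idea concrete, below is a minimal sketch of the kind of cleaning such a console application might do. It assumes double quotes as the text qualifier and simply replaces line breaks found inside quoted fields with spaces; it's illustrative only, not the code from my repository.

using System.IO;
using System.Text;

static class CsvLineFeedCleaner
{
    // Removes carriage returns and line feeds that appear inside double quoted
    // (text qualified) fields so that each CSV record sits on a single line.
    public static void CleanFile(string inputPath, string outputPath)
    {
        var cleaned = new StringBuilder();
        bool insideQuotes = false;

        foreach (char c in File.ReadAllText(inputPath))
        {
            if (c == '"')
            {
                insideQuotes = !insideQuotes; // escaped "" pairs toggle twice, so the state is preserved
                cleaned.Append(c);
            }
            else if ((c == '\r' || c == '\n') && insideQuotes)
            {
                cleaned.Append(' '); // swap the embedded line break for a space
            }
            else
            {
                cleaned.Append(c);
            }
        }

        File.WriteAllText(outputPath, cleaned.ToString());
    }
}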
Of course, we can use a Custom Activity for 9000 other reasons as well. We can write some C# to do anything! Or, to answer the question another way, we need an ADF Custom Activity to perform an operation within our orchestration pipeline that isn’t yet supported natively by our ADF Activities.
Next, you may think: why do we need a Custom Activity for the data preparation example, and not handle this with some custom code wrapped up in an Azure Function instead? Well, an Azure Function (on the consumption plan) can only run for a maximum of 10 minutes and won’t scale out in the same way our Custom Activity will. Simple rules: for small, quick operations use an Azure Function; for bigger, longer running operations that need to scale out across compute, use a Custom Activity.
Great, why, done. Next.
HOW DOES AN ADFV2 CUSTOM ACTIVITY WORK?
Let’s start with the high level image I use in my community talks and add some narrative for all the arrows and services! Please do click and enlarge this image. Maybe even save it to view on a second monitor while we go through it.
In the image above, moving from left to right, this is what needs to happen and what we need to understand to create our own Custom Activity:
With an understanding of the above, let’s now go even deeper into some of these steps, especially where things differ between ADFv1 and ADFv2.
THE CONSOLE APPLICATION
As mentioned above, we can use C# or whatever to create our console application. The other consideration here is whether you want to use Windows machines or Linux machines for the Batch Service compute pool. If Linux, then you’ll need to develop your application using .NET Core rather than the full framework. Yes, the Linux VMs will be cheaper than Windows, but .NET Core has a limited set of libraries compared to the full fat version. It’s a trade-off you’ll need to consider. For the rest of this post I’ll be referring to things in the Batch Service assuming we used Windows compute nodes.
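Regardless of framework choice, the shape of the console application is simple. A minimal skeleton is sketched below; the only firm requirement is the exit code, because a non-zero exit code fails the Batch task and, in turn, the ADF activity. Everything else (argument handling, method content) is up to you.

using System;

class Program
{
    // Minimal shape of a Custom Activity console application. ADF passes whatever
    // command line you configure, so args can carry file names, switches, etc.
    static int Main(string[] args)
    {
        try
        {
            Console.WriteLine($"Custom Activity started with {args.Length} argument(s).");

            // ... do the actual data cleaning / preparation work here ...

            return 0; // success: the Batch task and the ADF activity succeed
        }
        catch (Exception ex)
        {
            Console.Error.WriteLine(ex); // ends up in stderr.txt (see the logging section below)
            return 1; // non-zero exit code fails the Batch task and the ADF activity
        }
    }
}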
DEPLOYING YOUR APPLICATION WITH POWERSHELL
In point 4 above, a little bit of PowerShell can be helpful to get your compiled application up into Azure Blob Storage. At least in your development environment, maybe not for production. This isn’t anything complex, just a copy of everything from your Visual Studio bin folder to a blob storage container.
I’ve put an example PowerShell script in the below GitHub repository. Please feel free to copy it. I’d normally have a PowerShell helper project in my Visual Studio solution, or just add it directly to the console application project. In both cases the PowerShell extension will be needed to execute the script.
Code in Blog Support GitHub repository. https://github.com/mrpaulandrew/BlobSupportingContent
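If you’d rather stay in C# end to end, a rough equivalent of that copy script using the Azure.Storage.Blobs SDK might look like the sketch below. The connection string source, container name and bin path are all placeholders, so swap in your own.

using System;
using System.IO;
using Azure.Storage.Blobs;

class DeployToBlob
{
    // Pushes everything in the build output folder up to a blob container
    // for the Batch Service to download at task start up.
    static void Main()
    {
        string connectionString = Environment.GetEnvironmentVariable("STORAGE_CONNECTION_STRING");
        string containerName = "customactivity";
        string binPath = @"..\..\bin\Release";

        var container = new BlobContainerClient(connectionString, containerName);
        container.CreateIfNotExists();

        foreach (string filePath in Directory.GetFiles(binPath))
        {
            using FileStream stream = File.OpenRead(filePath);

            // Overwrite any existing blob of the same name with the latest build.
            container.GetBlobClient(Path.GetFileName(filePath)).Upload(stream, overwrite: true);
        }

        Console.WriteLine("Deployment files uploaded.");
    }
}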
AUTO SCALING THE COMPUTE POOL
We don’t want to be paying for Azure VMs that aren’t in use. Therefore, auto scaling the compute pool is a must in production. Microsoft have a great document here on creating the formula that decides how scaling occurs. The example formula is a good starting point: it samples the pending task count over the last three minutes and, once enough samples are available, scales the pool up to the average number of pending tasks, capped at 25 dedicated nodes.
startingNumberOfVMs = 1;
maxNumberofVMs = 25;
pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
pendingTaskSamples = pendingTaskSamplePercent < 70 ? startingNumberOfVMs : avg($PendingTasks.GetSample(180 * TimeInterval_Second));
$TargetDedicatedNodes=min(maxNumberofVMs, pendingTaskSamples);
Code in Blog Support GitHub repository. https://github.com/mrpaulandrew/BlobSupportingContent
To apply the formula simply go to your Azure Batch Service > Pools > select the pool > Scale > Auto Scale > add your formula > Save.
The only downside to this option is that the formula can only be evaluated on a minimum frequency of 5 minutes. This means tasks won’t execute as fast compared to a fixed pool size that is always available. This is of course a common pattern with any Azure service using a cluster of compute… If you want instant processing, pay for it to be on all the time.
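If you’d rather not click through the portal every time, the same formula can be applied programmatically with the Microsoft.Azure.Batch SDK. The sketch below assumes a shared key credential and uses placeholder account and pool names.

using System;
using Microsoft.Azure.Batch;
using Microsoft.Azure.Batch.Auth;

class ApplyAutoScale
{
    // Enables auto scaling on an existing pool using the formula shown above.
    static void Main()
    {
        var credentials = new BatchSharedKeyCredentials(
            "https://mybatchaccount.northeurope.batch.azure.com", // placeholder account URL
            "mybatchaccount",                                      // placeholder account name
            Environment.GetEnvironmentVariable("BATCH_ACCOUNT_KEY"));

        string formula = @"
            startingNumberOfVMs = 1;
            maxNumberofVMs = 25;
            pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(180 * TimeInterval_Second);
            pendingTaskSamples = pendingTaskSamplePercent < 70 ? startingNumberOfVMs : avg($PendingTasks.GetSample(180 * TimeInterval_Second));
            $TargetDedicatedNodes = min(maxNumberofVMs, pendingTaskSamples);";

        using (BatchClient batchClient = BatchClient.Open(credentials))
        {
            // 5 minutes is the minimum evaluation interval mentioned above.
            batchClient.PoolOperations.EnableAutoScale("adf-custom-activity-pool", formula, TimeSpan.FromMinutes(5));
        }
    }
}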
COMPUTE NODES
In point 15 I mentioned a special working directory. If you RDP to one of the compute nodes, it can be found at the following location:
C:\user\tasks\workitems\adf-{guid}\job-0000000001\{guid}-{activityname}-\wd\
Bear in mind that the C drive is not where the operating system lives. These nodes have a non-standard Windows image and special environment variables, listed here. That said, custom images can be used if you want to have software preloaded onto a compute node. For example, if your data cleaning custom application is really large or has multiple versions, it could be loaded into a VM image used when the compute pool is created. Maybe a topic for another post.
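As a small illustration of those environment variables, something like the snippet below can be dropped into the console application to record where a task actually ran; the variable names are the standard Azure Batch ones.

using System;

static class BatchNodeInfo
{
    // Writes out a few of the standard Azure Batch environment variables,
    // which is handy when working out where and how a task ran.
    public static void LogBatchEnvironment()
    {
        foreach (string name in new[]
        {
            "AZ_BATCH_NODE_ID",          // the compute node the task landed on
            "AZ_BATCH_JOB_ID",           // the job created for the ADF activity
            "AZ_BATCH_TASK_ID",          // the individual task
            "AZ_BATCH_TASK_WORKING_DIR"  // the special working directory mentioned above
        })
        {
            Console.WriteLine($"{name} = {Environment.GetEnvironmentVariable(name)}");
        }
    }
}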
Lastly, consider my comment in point 20 about the data. This is downloaded onto the Batch Service compute nodes for processing and, as mentioned, we can RDP to them. By default this will be done using a dynamic public IP address. I therefore recommend adding the compute pool to an Azure Virtual Network (VNet), meaning access can only be gained via a private network and local address.
If you do decide to attach the compute nodes to an Azure VNet, please also consider the number of addresses available on your chosen subnet when using the auto scaling pool option mentioned above. It might sound obvious now, but I’ve seen pools fail to scale because there weren’t enough IP addresses available. #AzureFeatures
THE CUSTOM ACTIVITY
Now let’s think about Azure Data Factory briefly, as it’s the main reason for the post.
In version 1 we needed to reference a namespace, class and method to call at runtime. Now, in ADF version 2, we can pass a command to the VM compute node; a screenshot of the settings in the ADF developer portal is below.
This command can be any Windows command. We can change directory and do something else on the VM or whatever, especially if using a custom image for the VMs. Very handy and much more flexible than version 1.
USING THE REFERENCE OBJECTS
Now addressing points 12 and 16 above, let’s focus on our Data Factory Reference Objects (seen in the above screenshot as well). This is probably the biggest difference to understand between versions of Azure Data Factory. In version 2, at runtime, three JSON files are copied alongside our exe console application and also put into the compute node working directory. These are named activity.json, linkedServices.json and datasets.json.
There will only ever be three of these JSON files, regardless of how many Reference Objects are added in the Data Factory settings. To further clarify: if we reference 4 linked services, they will all be added to the single linkedServices.json file as an array. The array will have a copy of the JSON used by Data Factory when it calls the linked service in a pipeline.
This gets worse with datasets, especially if there is dynamic content being passed from data factory variables. The dataset name within the reference file will get suffixed with a GUID at runtime. This is nice and safe for the JSON file created on the compute node, but becomes a little more tricky when we want to use the content in our custom application and it has a random name! For example we get this: AdventureWorksTable81e2671555fc4bc8b8b3955c4581e065
To access the reference objects we need to code this into our console app, parse the JSON and figure out which key value pairs we want to use. This can get very messy and my C# skills mean I don’t yet have a clean way of doing this. Here’s an example of how we might find details of our Data Lake Storage account from the linked service reference objects file.
Code in Blog Support GitHub repository. https://github.com/mrpaulandrew/BlobSupportingContent
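As the snippet itself isn’t reproduced here, below is a rough sketch of the idea using Newtonsoft.Json. It assumes linkedServices.json and datasets.json each hold a plain JSON array, as described above, and the Data Lake Store property names are taken from a typical AzureDataLakeStore linked service definition, so treat them as assumptions.

using System;
using System.IO;
using System.Linq;
using Newtonsoft.Json.Linq;

static class ReferenceObjects
{
    // Pulls the Data Lake Store URI out of linkedServices.json in the task working directory.
    public static string GetDataLakeStoreUri(string workingDirectory)
    {
        JArray linkedServices = JArray.Parse(
            File.ReadAllText(Path.Combine(workingDirectory, "linkedServices.json")));

        JToken dataLake = linkedServices.FirstOrDefault(
            ls => (string)ls["properties"]?["type"] == "AzureDataLakeStore");

        if (dataLake == null)
            throw new InvalidOperationException("No Data Lake Store linked service referenced by the activity.");

        return (string)dataLake["properties"]?["typeProperties"]?["dataLakeStoreUri"];
    }

    // Dataset names get the runtime GUID suffix described above,
    // so match on the name prefix rather than the full name.
    public static JToken GetDatasetByPrefix(string workingDirectory, string namePrefix)
    {
        JArray datasets = JArray.Parse(
            File.ReadAllText(Path.Combine(workingDirectory, "datasets.json")));

        return datasets.FirstOrDefault(ds => ((string)ds["name"])?.StartsWith(namePrefix) == true);
    }
}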
I’d be more than happy to accept some C# help on coming up with a better way to traverse these ‘Reference Object’ files.
AZURE KEY VAULT
If you haven’t used Azure Key Vault already, there is plenty of documentation and there are plenty of examples on working with the SDK, link below. Security is important, so Microsoft give us lots of help here.
This extra layer of keys being passed around does initially seem a little overkill, but when keys get recycled having a central location is the best thing ever. Do also make the effort to create a comprehensive key vault utilities class that works for you within the console app.
To implement this in an Azure Data Factory Custom Activity you’ll need to have Key Vault added as its own linked service. Then add that linked service as one of the Reference Objects. When parsed in your app this will give you the URL to reach the Key Vault API. Or, feel free to hard code it in the application config; it depends whether you plan to change Key Vaults. With the Key Vault URL and the console app’s own service principal you’ll be able to hit the API and request your Data Lake Store secret, for example.
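For completeness, a minimal sketch of that last step using the current Azure SDK packages (Azure.Identity and Azure.Security.KeyVault.Secrets, which post-date the original version of this post) might look like this:

using System;
using Azure.Identity;
using Azure.Security.KeyVault.Secrets;

static class KeyVaultHelper
{
    // Fetches a secret using the console application's own service principal.
    // The vault URL would come from the Key Vault linked service in
    // linkedServices.json, or from application config if you prefer.
    public static string GetSecret(string vaultUrl, string tenantId, string clientId, string clientSecret, string secretName)
    {
        var credential = new ClientSecretCredential(tenantId, clientId, clientSecret);
        var client = new SecretClient(new Uri(vaultUrl), credential);

        KeyVaultSecret secret = client.GetSecret(secretName);
        return secret.Value;
    }
}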
LOGGING
If you have 30+ compute nodes in your Batch Service pool, creating an RDP connection to each node doesn’t really work when trying to check log files. Therefore, as mentioned in point 24 above, your console application’s standard error and standard out log files get copied into Azure Blob Storage. The container created for you, called ‘adfjobs’, will have a subfolder per Data Factory activity execution, named with the GUID from the activity run ID. Simply grab the GUID from your Data Factory monitoring and use something like Azure Storage Explorer to search the adfjobs container.
I also like to extend the normal C# Console.WriteLine output with a little more formatting and date/time information. For example:
Code in Blog Support GitHub repository. https://github.com/mrpaulandrew/BlobSupportingContent
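The gist isn’t reproduced here, but a minimal version of that idea is easy to sketch; anything written to standard out or standard error ends up in the adfjobs container as described above.

using System;

static class Logger
{
    // Simple stand-in for a formatted Console.WriteLine wrapper.
    public static void Info(string message) =>
        Console.WriteLine($"{DateTime.UtcNow:yyyy-MM-dd HH:mm:ss} INFO  {message}");

    public static void Error(string message) =>
        Console.Error.WriteLine($"{DateTime.UtcNow:yyyy-MM-dd HH:mm:ss} ERROR {message}");
}

// Usage: Logger.Info("Starting data cleaning task...");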
DEBUGGING
Debugging your custom activity just became super easy in ADFv2, mainly because we are using a console application that can be run locally in Visual Studio with all the usual local debugging perks. The only minor change I recommend you make is imitating the ADF Reference Objects. We of course don’t get a copy of these locally when running the application in Visual Studio. An easy way to handle this is by checking for the Batch Service compute node environment variables, which you won’t have locally, unless you want to create them. When they are missing or have NULL values, use copies of the Reference Object JSON files saved into your Visual Studio console application project. You can get them by remoting onto a compute node and copying them out while the Custom Activity is running if you don’t want to recreate them by hand.
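A minimal sketch of that fallback is below, assuming the copied JSON files are saved into a LocalReferenceObjects folder within the console application project (the folder name is my own; call it whatever suits your solution).

using System;
using System.IO;

static class ReferenceObjectLocator
{
    // Decides where to read activity.json / linkedServices.json / datasets.json from:
    // the Batch task working directory when running on a compute node,
    // otherwise a local folder of copied files for debugging in Visual Studio.
    public static string GetReferenceObjectFolder()
    {
        string batchWorkingDir = Environment.GetEnvironmentVariable("AZ_BATCH_TASK_WORKING_DIR");

        if (!string.IsNullOrEmpty(batchWorkingDir) && Directory.Exists(batchWorkingDir))
        {
            return batchWorkingDir; // running inside the Batch Service pool
        }

        return Path.Combine(AppContext.BaseDirectory, "LocalReferenceObjects"); // local debug run
    }
}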