Governing Apache Ranger

We all know the great potential of Apache Ranger:

  • Authorization: controls ecosystem access to solutions such as Atlas, NiFi, HDFS, YARN, Hive, Impala, Kafka, Solr, HBase, Knox and many more ...
  • Auditing: logs all of those accesses.
  • Anonymisation: can anonymise the columns of a query on the fly, based on ABAC data attributes or RBAC user roles.

But we also know that, like many open source solutions, its interface is very much oriented towards operations and, logically, very little towards the business.

So we are going to use its REST API to build a governance dashboard that helps us easily govern who is accessing what, from where (IP and location) and how many times.


Accessing the Ranger API

Like the Atlas REST API that we already saw in another article, the Ranger REST API is also quite easy to use once you have the right examples.

The first thing we are going to do is to choose the AccessAudit function.


The REST API Swagger interface, reached from the Ranger web UI.

In code, we can create a function to call the API and retrieve the information. The first thing it returns is metadata about the data it will send and the resulting pages. The amount of data the audit can collect is astonishing.


The getAccessAudit function (Python).

As we are only interested in the data of non-system users, we will pass this parameter in the call: excludeServiceUser=true.

I separate the KOs and the OKs into two different calls, so I use AccessResult=0 and AccessResult=1 respectively.


Likewise, after the initial load we will only be interested in the data we don't yet have, so we have to use the StartDate parameter to take that into account, as you can see in the getAccessAudit function.
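Here is a minimal sketch, in Python, of what such a getAccessAudit call could look like. The endpoint path (/service/assets/accessAudit), the exact parameter names (excludeServiceUser, accessResult, startDate, pageSize, startIndex) and the vXAccessAudits response field are assumptions based on a typical Ranger Admin instance, so check them against your own Swagger UI and against the function in the screenshot above.

```python
import requests


def get_access_audit(base_url, user, password, access_result,
                     start_date=None, page_size=1000):
    """Fetch Ranger access-audit events, paging until the server runs out.

    access_result: 0 for denied (KO) events, 1 for allowed (OK) events.
    start_date:    only fetch events from this date on, for incremental loads.
    NOTE: endpoint and parameter names are assumptions; verify them in Swagger.
    """
    events, start_index = [], 0
    while True:
        params = {
            "excludeServiceUser": "true",   # skip system/service users
            "accessResult": access_result,  # 0 = KO, 1 = OK
            "pageSize": page_size,
            "startIndex": start_index,
        }
        if start_date:
            params["startDate"] = start_date
        resp = requests.get(f"{base_url}/service/assets/accessAudit",
                            params=params, auth=(user, password))
        resp.raise_for_status()
        batch = resp.json().get("vXAccessAudits", [])
        events.extend(batch)
        if len(batch) < page_size:          # last page reached
            break
        start_index += page_size
    return events


# e.g. all OK events since the last load
# ok_events = get_access_audit("https://ranger-host:6182", "admin", "***",
#                              access_result=1, start_date="01/31/2024")
```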


Transforming and loading the data into a table

This part is simpler, but interesting if we want to locate the requests on a map.

Ranger returns the client's IP, but if we want to know its latitude and longitude, we have to match it against an external database that associates this kind of data with IP ranges. In this example I have used the IP2Location LITE database table.

And when working with IPs, it is convenient to convert them to integers, so that it is easier to compare them against ranges without overloading the engines with text searches.


In Python, the ipaddress library helps here.
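A minimal sketch of that conversion; the DataFrame and column names (clientIP, client_ip_num) are just illustrative placeholders for whatever your audit extract uses.

```python
import ipaddress

import pandas as pd


def ip_to_int(ip_str):
    """Convert a dotted IPv4 (or IPv6) address string to its integer form."""
    try:
        return int(ipaddress.ip_address(ip_str))
    except ValueError:  # hostnames, empty strings, malformed values...
        return None


# assuming the audit events ended up in a pandas DataFrame with a 'clientIP' column
df = pd.DataFrame({"clientIP": ["192.168.1.10", "8.8.8.8"]})
df["client_ip_num"] = df["clientIP"].apply(ip_to_int)
print(df)
```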

We will now save the data we collect from the API into a Hive or Impala table, using Parquet or Iceberg as the file/table format. We can do this with a Spark session.


Converting the pandas DataFrame to a Spark DataFrame and writing it to the table.
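A sketch of that step, assuming the audit events sit in a pandas DataFrame called df (as in the previous snippet) and that the target table is governance.ranger_audit; both names are illustrative.

```python
from pyspark.sql import SparkSession

# a Spark session with Hive support, so saveAsTable registers the table in the metastore
spark = (SparkSession.builder
         .appName("ranger-audit-load")
         .enableHiveSupport()
         .getOrCreate())

# convert the pandas DataFrame into a Spark DataFrame
sdf = spark.createDataFrame(df)

# append the new audit events to the governance table as Parquet
(sdf.write
    .format("parquet")
    .mode("append")
    .saveAsTable("governance.ranger_audit"))
```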

The geolocated IPs table

I downloaded this database as a CSV and loaded it into a table, first converting the addresses to numbers as we have seen.

Now we can create views that join the tables together, so we can plot the location of each access on the map.
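For example, a view along these lines attaches the geo data to each audit record by joining on the IP ranges; the geo_ips column names follow the IP2Location LITE CSV layout and the table names are the illustrative ones used above.

```python
spark.sql("""
    CREATE VIEW IF NOT EXISTS governance.ranger_audit_geo AS
    SELECT a.*,
           g.country_name,
           g.city_name,
           g.latitude,
           g.longitude
    FROM governance.ranger_audit a
    JOIN governance.geo_ips g
      ON a.client_ip_num BETWEEN g.ip_from AND g.ip_to
""")
```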


Detail of the GeoIPs table.


The dashboard

I generate some widgets about access:


  • First audit access timestamp
  • Last audit access timestamp
  • OK records from today and yesterday, filtered with SQL, to easily keep an eye on query access (see the SQL sketch after this list).
  • KO records from today and yesterday, filtered the same way.
  • Geo Map to understand external internet IP access.



  • Most accessed tables to understand usage patterns.
  • OK and KO distribution by user, IP and service type.


You can easily see the user, IP and query or path that generated the OK or the KO.

  • List of the last KO and OK events, etc.
  • Whatever else you want!
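As an illustration of those SQL filters, the query behind the "OK records from today and yesterday" widget can be as simple as this; the table and column names (access_result, event_time) are the illustrative ones used earlier, with access_result = 1 meaning OK and 0 meaning KO.

```python
spark.sql("""
    SELECT COUNT(*) AS ok_events
    FROM governance.ranger_audit_geo
    WHERE access_result = 1
      AND to_date(event_time) >= date_sub(current_date(), 1)
""").show()
```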


The code

As always, here is the code on GitHub.

Thank you for your time!
