How To Manage A Guardium Infrastructure? Practical Advice from Practical Experience


For almost 3 years I have been managing, by myself, an environment of about 50 appliances and around 350 agents, interfaced with GBDI.

I had a few hard moments but overall I made it.

This is how (for the Guardium Infrastructure part).

By Guardium Infrastructure I mean: Central Manager, Collectors and Agents (GIM and STAPs). The obvious goal of managing a Guardium infrastructure is to keep it running smoothly and stably.

Point #1 : The GUI is not enough.

The GUI on Guardium, as on any other application, is very efficient when it comes to poking around, investigating and other one-time actions. But no GUI is good enough for repetitive activities. For those you MUST use the cli, grdapi and the REST API. You MUST automate. You MUST script. Guardium provides the server side of it but, unfortunately, not the client side. That part is on you.

Point #2 : Do NOT manage Guardium from your laptop

I knew this from previous customers, back when I was in Professional Services with IBM Guardium. I saw customers "managing" their environment from their laptops and it was really tough on them. Instead, you need an enterprise-class system from which to access and manage your appliances: a jump server, either Linux (my preference) or Windows. Jump servers stay online 24x7, get backed up by your company's standard backup process, can hold countless tools, and can be accessed by many people. Finally, they can keep logs of your actions for a very long time.

Guardium does not offer such a client, which is unfortunate, so it is up to you to develop one. That is a bit challenging and requires development skills.

I built such a Linux system which has the following characteristics:

  • A service account, sudo'ed into by authorized users and used for executing all scripts and commands.
  • A folder for each appliance. This keeps the logs related to each appliance grouped and separated.
  • A series of scripts I developed: shell scripts, Python scripts and REST API scripts. This is the heart of the system: the automation of repetitive actions.
  • The scripts generally live in a separate folder and are called from the scripts sitting in the appliance folders.
  • For Python I use a venv, but I plan to move to Docker at some point.
  • Most of the scripts read a pre-defined command file for the cli and write a log file. They are executed as follows from the jump server / service account: serviceacct@jumpserver> ssh cli@appliance < script.cmd > script.log. You have to enter the appliance password when prompted. If you don't want to, you should use the REST API instead, for which IBM doesn't provide any sample either (I do, in my open-source project "Context22"; see the reference at the bottom).
  • I cron some of those scripts. Cron is reliable.
  • I have scripts for the following main functions: appliance health report (about 15 cli commands); STAP configuration (especially for Linux, on KTAP and ATAP); GBDI datamart execution; initial configuration of new appliances, even though we now use a VMware template instead (see below). I published most of these scripts in my open-source project "Context22".
  • Each script's commands get executed at once, with the shell script "grepping" the most interesting parts of the output for rapid reading.
  • I also developed a shell script to search the log files for whatever I may be interested in.
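The runner pattern described in the list above can be sketched in a few lines of Python. Everything here (the collector name, the health.cmd file, the highlight keywords) is illustrative, not taken from a real setup:

```python
#!/usr/bin/env python3
"""Minimal sketch of a jump-server runner, assuming the layout described
above: one folder per appliance, a shared command file, a log file per run,
and a grep-style pass to surface the interesting lines."""

import shlex
import subprocess
from pathlib import Path

# Keywords worth surfacing in a health-report log (illustrative only).
HIGHLIGHTS = ("fail", "error", "down", "restart")

def build_ssh_command(appliance: str, cmd_file: str) -> str:
    """Compose the 'ssh cli@appliance < script.cmd' invocation."""
    return f"ssh cli@{shlex.quote(appliance)} < {shlex.quote(cmd_file)}"

def grep_highlights(log_text: str, keywords=HIGHLIGHTS) -> list[str]:
    """Return the log lines containing any keyword (case-insensitive)."""
    return [line for line in log_text.splitlines()
            if any(k in line.lower() for k in keywords)]

def run_health_report(appliance: str, base: Path) -> list[str]:
    """Run the command file against one appliance, save the log in the
    appliance's folder, and return the highlights.  The cli password is
    still entered interactively when ssh prompts for it."""
    cmd = build_ssh_command(appliance, str(base / "health.cmd"))
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    (base / appliance / "health.log").write_text(result.stdout)
    return grep_highlights(result.stdout)
```

The same two helpers (compose the ssh call, grep the log) are what the per-appliance shell scripts do in my setup; only the cron wiring differs.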

Point #3 : Building New Appliances vs. Upgrading

For the first year, honestly, the appliances gave me absolutely no worries, just here and there some traffic explosions that I would quickly take care of by redirecting the most outrageous agents to less loaded collectors. This gave me plenty of time to spend on GBDI.

But after 1 year, things started degrading: bugs became apparent and were sometimes difficult to fix; upgrades would sometimes just fail for no obvious reason; and other annoying issues appeared. My take on this is that it's better to rebuild every 18 months than to keep fighting entropy.

Since we use VMware, the best approach was to build a VMware template. It is very easy to build a new appliance from scratch (latest ISO file), apply all the configuration, and make it generic by giving it a dummy IP address and hostname before converting it to a template.

Once the template is built, you can spin up a clone next to an existing appliance you want to rebuild, stop the "old" one, start the new one, assign it the IP and hostname of the old one, and you are back in business. You may still have to perform some (minor) configuration, in particular re-registering the appliance with the CM, as this doesn't carry over well in the template. You may also have some specifics on DNS, NTP, etc. if you are in a segmented environment with different settings.

This procedure, of course, works best for Collectors, especially if you have GBDI: once you have stopped the sniffer and the reporting agents have failed over to their failover collectors, the collected data gets transferred to GBDI/SonarG within an hour. So you really only have to wait an hour or two for the replacement to take place. Just stop the old appliance completely, bring the new one online, let the agents fall back from their failover collector, and you can take care of the next appliance in a round-robin manner.

This does not work as well for Aggregators, but you should not be using aggregators anyway, right?

This kind of appliance rebuilding limits the patching of appliances to minor version upgrades, sniffer patches and emergency security patches. It also minimizes the upgrade workload for the Guardium admins and the VMware team.

Point #4 : Managing Agents

There are several issues to consider :

  • Initial deployment (point #4.1)
  • New agents (point #4.2)
  • Assigning Collectors to Agents (point #4.3)
  • Balancing Traffic (point #5)

Point #4.1 : Initial Deployment (GIM and STAPs)

Initial deployment is in general a mass deployment. The best tool, and honestly the only tool really appropriate for this, is the "consolidated" installer, which installs GIM and STAP at once. Our Professional Services people in charge of this initial deployment did a very good job using it, working around a few "bugs" and limitations, and we had a very successful initial deployment of the agents. I recently did a new large deployment using the newer version of this installer for Windows (11.2), and the process is even better, cleaner and simpler. Definitely a very good job on the IBM side.

This is certainly much better and simpler than "pushing" the STAP after installing GIM, as that file transfer is quite slow and gives many issues time to occur. With the consolidated installer, once the installer has been transferred to the server, the installation really takes less than a minute.

To avoid the issue of the initial Collector assignment, we decided to dedicate a Collector to new agents, making the initial assignment fixed. We then redirect each agent, either via the GUI or the REST API, to its final collector.
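As a sketch of that REST redirection step: the code below assumes the usual Guardium pattern of exposing grdapi functions under /restAPI/&lt;function&gt; on port 8443 with a bearer token. The function name (update_stap_config) and its parameters mirror the grdapi call in my environment and may differ in yours, so treat them as assumptions and check your release's API list (or the Context22 samples):

```python
"""Sketch: point a staged S-TAP at its final collector via the REST API.
Host names, the token, and the exact parameter names are assumptions."""

import json
import urllib.request

def build_redirect_request(appliance: str, token: str,
                           stap_host: str, new_collector: str):
    """Build (but do not send) the POST that redirects one STAP."""
    url = f"https://{appliance}:8443/restAPI/update_stap_config"
    payload = {
        "stapHost": stap_host,
        # grdapi update syntax is section.parameter:value
        "updateValue": f"TAP.SQLGUARD_IP:{new_collector}",
        "waitForResponse": "1",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# Example: move an agent from the staging collector to its final one.
req = build_redirect_request("cm.example.com", "TOKEN",
                             "dbserver01.example.com", "coll07.example.com")
# Sending would be: urllib.request.urlopen(req)  (with proper TLS setup)
```

Batching this over a list of (stap_host, collector) pairs is what turns the staging-collector approach from a chore into a one-liner.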

Point #4.2 : New Agents

For new agents we actually use the same consolidated installer approach with the same default/staging collector and then we redirect to the final destination.

Point #4.3 : Assigning Collectors to Agents

There are two or three points of view / approaches:

1 - Entirely rely on ELB (Enterprise Load Balancer)

2 - Partially rely on ELB

3 - Do not rely on ELB

Approach 1: you put ALL the STAPs in one big pool of agents "facing" one big pool of all the Collectors. Then you hope for the best but should expect the worst. It's not that ELB is bad, but you need to make sure you have enough collectors assigned, which cannot be guaranteed. ELB seems to handle things well when it has enough capacity, but it may start acting up when capacity is insufficient.

Approach 2 : you reserve the use of ELB for some specific parts of your environment. This is what we are going to do, but we have not implemented it yet. This can be the best option for well-defined and delimited environments.

But approaches 1 and 2 both require a procedure to be alerted quickly on insufficient capacity, plus the ability to bring in more collectors and/or remove STAPs before the system runs into resource issues.

Approach 3 : This is currently our approach, as we prefer to structure the agent assignment along two lines: environment and DB type. We prefer not to mix Dev servers and Prod servers, and we prefer not to mix MS SQL and MongoDB. This gives us a clearer picture, as the behaviors can be widely different, and lets us track the load in a more differentiated and structured way, by environment and database type.

Point #5 : Balancing Traffic Volumes

In a non-ELB environment, re-balancing the traffic load among collectors is mostly a manual process, based on appropriate tools to detect trends early on. In case of "bursts" of traffic, redirecting traffic to another appliance is crucial and requires knowing which appliance is the least loaded. The best tools for this are graphical ones, like histograms and bar charts, especially when they display information on all the appliances in the same graphic at once. We found that Kibana, fed from GBDI, provides the best results. Here are a few examples of our graphical views (we group them into dashboards).
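Once the dashboards give you the numbers, the "which collector is least loaded" decision itself is trivial. A toy Python sketch (collector names and BUM-style values are made up):

```python
"""Toy sketch of the manual re-balancing decision: given a recent load
metric per collector (e.g. a Buffer Usage Monitor value exported to GBDI),
pick the least loaded collector to redirect a bursting agent to."""

def least_loaded(loads: dict[str, float], exclude=()) -> str:
    """Return the collector with the lowest load, skipping excluded ones
    (e.g. the overloaded one, or collectors in another environment)."""
    candidates = {c: v for c, v in loads.items() if c not in exclude}
    return min(candidates, key=candidates.get)

# Hypothetical percent-buffer-usage readings per collector:
loads = {"coll01": 82.0, "coll02": 35.5, "coll03": 61.0}
target = least_loaded(loads, exclude={"coll01"})  # coll01 is the hot one
```

In approach 3 you would keep one such loads dict per environment/DB-type group, so a Prod MS SQL agent is only ever redirected within its own group.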

Below is a set of views for all Collectors, displaying the values of different Sniffer variables, which gives a global picture of the balancing in the environment.


Below is a time series from the beginning of the year, for one appliance, on one value in the BUM (Buffer Usage Monitor).



Point #6 : Philosophy of Guardium Management

Here are a few of my "directions" and "approaches" when it comes to managing a Guardium environment:

  • Do not treat Guardium as a black box. Understand its components, and treat each component in the way most appropriate to it.
  • Be proactive, NOT just reactive. Do not wait for issues to happen; poke around constantly to detect early signs of degradation.
  • Automate, automate, automate.
  • Generate and keep logs for later review and reference.
  • Do not second-guess Guardium support. They are not perfect, but second-guessing them is worse.

References:

  • Access to my OpenSource Project "Context22" :
  • https://github.com/Fredo68usa/Context_22_GuardiumRestAPI.git
  • https://github.com/Fredo68usa/Context22_Infra.git




