How To Manage A Guardium Infrastructure? Practical Advice from Practical Experience


For almost 3 years I have been managing, by myself, an environment of about 50 appliances and around 350 agents, interfaced with GBDI.

I had a few hard moments but overall I made it.

This is how (for the Guardium Infrastructure part).

By Guardium Infrastructure I mean: Central Manager, Collectors and Agents (GIM and STAPs). The obvious goal of managing a Guardium infrastructure is to keep it running smoothly and stably.

Point #1 : The GUI is not enough.

The GUI on Guardium, as on any other application, is very efficient when it comes to poking around, investigating and other one-time actions. But no GUI is good enough for repetitive activities. For those you MUST use the cli, grdapi and the REST API. You MUST automate. You MUST script. Guardium provides the server side of it but, unfortunately, not the client side. That part is on you.

Point #2 : Do NOT manage Guardium from your laptop

I knew this from previous customers, back when I was in Professional Services with IBM Guardium. I saw customers "managing" their environment from their laptops and it was really tough on them. Instead, you need an enterprise-class system from which to access and manage your appliances: a jump server, either Linux (my preference) or Windows. Jump servers stay online 24x7, get backed up by your company's standard backup process, can hold countless tools, and can be accessed by many people. Finally, they can keep logs of your actions for a very long time.

Guardium does not offer such a client, which is unfortunate, so it is up to you to develop one. That is a bit challenging and requires development skills.

I built such a Linux system which has the following characteristics:

  • A service account, sudo'ed into by authorized users and used for executing all scripts and commands.
  • A folder for each appliance. This keeps the logs related to each appliance grouped and separated.
  • A series of scripts I developed: shell scripts, Python scripts and REST API scripts. This is the heart of the system: the automation of repetitive actions.
  • The scripts generally live in a separate folder and are called from the scripts sitting in the appliance folders.
  • For Python I use a venv, but I plan to move to Docker at some point.
  • Most of the scripts read a pre-defined command file for the cli and write a log file. They are executed as follows from the jump server / service account: serviceacct@jumpserver> ssh cli@appliance < script.cmd > script.log. You have to enter the appliance password when prompted. If you don't want to, you should use the REST API instead, for which IBM doesn't provide any sample either (I do, in my open-source project "Context22"; see the reference at the bottom).
  • I cron some of those scripts. Cron is reliable.
  • I have scripts for the following main functions: appliance health report (about 15 cli commands); STAP configuration (especially for Linux, on KTAP and ATAP); GBDI datamart execution; initial configuration of new appliances, even though we now use a VMware template instead (see below). I published most of these scripts in my open-source project "Context22".
  • Each script's commands get executed at once, with the shell script "grepping" the most interesting parts of the output for rapid reading.
  • I also developed a shell script to search the log files for whatever I may be interested in.
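The runner pattern described in the list above can be sketched in a few lines of Python. Everything here (the collector name, the health.cmd file, the highlight keywords) is illustrative, not taken from a real setup:

```python
#!/usr/bin/env python3
"""Minimal sketch of a jump-server runner, assuming the layout described
above: one folder per appliance, a shared command file, a log file per run,
and a grep-style pass to surface the interesting lines."""

import shlex
import subprocess
from pathlib import Path

# Keywords worth surfacing in a health-report log (illustrative only).
HIGHLIGHTS = ("fail", "error", "down", "restart")

def build_ssh_command(appliance: str, cmd_file: str) -> str:
    """Compose the 'ssh cli@appliance < script.cmd' invocation."""
    return f"ssh cli@{shlex.quote(appliance)} < {shlex.quote(cmd_file)}"

def grep_highlights(log_text: str, keywords=HIGHLIGHTS) -> list[str]:
    """Return the log lines containing any keyword (case-insensitive)."""
    return [line for line in log_text.splitlines()
            if any(k in line.lower() for k in keywords)]

def run_health_report(appliance: str, base: Path) -> list[str]:
    """Run the command file against one appliance, save the log in the
    appliance's folder, and return the highlights.  The cli password is
    still entered interactively when ssh prompts for it."""
    cmd = build_ssh_command(appliance, str(base / "health.cmd"))
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    (base / appliance / "health.log").write_text(result.stdout)
    return grep_highlights(result.stdout)
```

The same two helpers (compose the ssh call, grep the log) are what the per-appliance shell scripts do in my setup; only the cron wiring differs.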

Point #3 : Building New Appliances vs. Upgrading

For the first year, honestly, the appliances gave me absolutely no worries, just here and there some traffic explosions that I would quickly take care of by redirecting the most outrageous agents to less loaded collectors. This gave me plenty of time to spend on GBDI.

But after 1 year, things started degrading: bugs became apparent and were sometimes difficult to fix; upgrades would sometimes just fail for no obvious reason; and other annoying issues appeared. My take on this is that it's better to rebuild every 18 months than to keep fighting entropy.

Since we use VMware, the best approach was to build a VMware template. It is very easy to build a new appliance from scratch (latest ISO file), apply all the configuration, and make it generic by giving it a dummy IP address and hostname before converting it to a template.

Once the template is built, you can spin up a clone next to an existing appliance you want to rebuild, stop the "old" one, start the new one, assign it the IP and hostname of the old one, and you are back in business. You may still have to perform some (minor) configuration, in particular re-registering the appliance with the CM, as this doesn't carry over well in the template. You may also have some specifics on DNS, NTP, etc. if you are in a segmented environment with different settings.

This procedure, of course, works best for Collectors, especially if you have GBDI: once you have stopped the sniffer and the reporting agents have failed over to their failover collectors, the collected data gets transferred to GBDI/SonarG within an hour. So you really only have to wait an hour or two for the replacement to take place. Just stop the old appliance completely, bring the new one online, let the agents fall back from their failover collector, and you can take care of the next appliance in a round-robin manner.

This does not work as well for Aggregators, but you should not be using aggregators anyway, right?

This kind of appliance rebuilding limits the patching of appliances to minor version upgrades, sniffer patches and emergency security patches. It also minimizes the upgrade workload for the Guardium admins and the VMware team.

Point #4 : Managing Agents

There are several issues to consider :

  • Initial deployment (point #4.1)
  • New agents (point #4.2)
  • Assigning Collectors to Agents (point #4.3)
  • Balancing Traffic (point #5)

Point #4.1 : Initial Deployment (GIM and STAPs)

Initial deployment is in general a mass deployment. The best tool, and honestly the only tool really appropriate for this, is the "consolidated" installer, which installs GIM and STAP at once. Our Professional Services people in charge of this initial deployment did a very good job using it, working around a few "bugs" and limitations, and we had a very successful initial deployment of the agents. I recently did a new large deployment using the newer version of this installer for Windows (11.2), and the process is even better, cleaner and simpler. Definitely a very good job on the IBM side.

This is certainly much better and simpler than "pushing" the STAP after installing GIM, as that file transfer is quite slow and gives many issues time to occur. With the consolidated installer, once the installer has been transferred to the server, the installation really takes less than a minute.

To avoid the issue of the initial Collector assignment, we decided to dedicate a Collector to new agents, making the initial assignment fixed. We then redirect each agent, either via the GUI or the REST API, to its final collector.
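As a sketch of that REST redirection step: the code below assumes the usual Guardium pattern of exposing grdapi functions under /restAPI/&lt;function&gt; on port 8443 with a bearer token. The function name (update_stap_config) and its parameters mirror the grdapi call in my environment and may differ in yours, so treat them as assumptions and check your release's API list (or the Context22 samples):

```python
"""Sketch: point a staged S-TAP at its final collector via the REST API.
Host names, the token, and the exact parameter names are assumptions."""

import json
import urllib.request

def build_redirect_request(appliance: str, token: str,
                           stap_host: str, new_collector: str):
    """Build (but do not send) the POST that redirects one STAP."""
    url = f"https://{appliance}:8443/restAPI/update_stap_config"
    payload = {
        "stapHost": stap_host,
        # grdapi update syntax is section.parameter:value
        "updateValue": f"TAP.SQLGUARD_IP:{new_collector}",
        "waitForResponse": "1",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )

# Example: move an agent from the staging collector to its final one.
req = build_redirect_request("cm.example.com", "TOKEN",
                             "dbserver01.example.com", "coll07.example.com")
# Sending would be: urllib.request.urlopen(req)  (with proper TLS setup)
```

Batching this over a list of (stap_host, collector) pairs is what turns the staging-collector approach from a chore into a one-liner.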

Point #4.2 : New Agents

For new agents we actually use the same consolidated installer approach with the same default/staging collector and then we redirect to the final destination.

Point #4.3 : Assigning Collectors to Agents

There are two or three points of view / approaches:

1 - Entirely rely on ELB (Enterprise Load Balancer)

2 - Partially rely on ELB

3 - Do not rely on ELB

Approach 1: you put ALL the STAPs in one big pool of agents "facing" one big pool of all the Collectors. Then you hope for the best but should expect the worst. It's not that ELB is bad, but you need to make sure you have enough collectors assigned, which cannot be guaranteed. ELB seems to handle things well when it has enough capacity, but it may start acting up when capacity is insufficient.

Approach 2 : you reserve the use of ELB for some specific parts of your environment. This is what we are going to do, but we have not implemented it yet. This can be the best option for well-defined and delimited environments.

But approaches 1 and 2 both require a procedure to be alerted quickly on insufficient capacity, plus the ability to bring in more collectors and/or remove STAPs before the system runs into resource issues.

Approach 3 : This is currently our approach, as we prefer to structure the agent assignment along two lines: environment and DB type. We prefer not to mix Dev servers and Prod servers, and we prefer not to mix MS SQL and MongoDB. This gives us a clearer picture, as the behaviors can be widely different, and lets us track the load in a more differentiated and structured way, by environment and database type.

Point #5 : Balancing Traffic Volumes

In a non-ELB environment, re-balancing the traffic load among collectors is mostly a manual process, based on appropriate tools to detect trends early on. In case of "bursts" of traffic, redirecting traffic to another appliance is crucial and requires knowing which appliance is the least loaded. The best tools for this are graphical ones, like histograms and bar charts, especially when they display information on all the appliances in the same graphic at once. We found that Kibana, fed from GBDI, provides the best results. Here are a few examples of our graphical views (we group them into dashboards).
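Once the dashboards give you the numbers, the "which collector is least loaded" decision itself is trivial. A toy Python sketch (collector names and BUM-style values are made up):

```python
"""Toy sketch of the manual re-balancing decision: given a recent load
metric per collector (e.g. a Buffer Usage Monitor value exported to GBDI),
pick the least loaded collector to redirect a bursting agent to."""

def least_loaded(loads: dict[str, float], exclude=()) -> str:
    """Return the collector with the lowest load, skipping excluded ones
    (e.g. the overloaded one, or collectors in another environment)."""
    candidates = {c: v for c, v in loads.items() if c not in exclude}
    return min(candidates, key=candidates.get)

# Hypothetical percent-buffer-usage readings per collector:
loads = {"coll01": 82.0, "coll02": 35.5, "coll03": 61.0}
target = least_loaded(loads, exclude={"coll01"})  # coll01 is the hot one
```

In approach 3 you would keep one such loads dict per environment/DB-type group, so a Prod MS SQL agent is only ever redirected within its own group.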

Below is a set of views for all Collectors, displaying the values of different Sniffer variables, which gives a global picture of the balancing in the environment.


Below is a time series from the beginning of the year, for one appliance, on one value in the BUM (Buffer Usage Monitor).



Point #6 : Philosophy of Guardium Management

Here are a few of my "directions" and "approaches" when it comes to managing a Guardium environment:

  • Do not treat Guardium as a black box. Understand its components, and treat each component in the way most appropriate to it.
  • Be proactive, NOT just reactive. Do not wait for issues to happen; poke around constantly to detect early signs of degradation.
  • Automate, automate, automate.
  • Generate and keep logs for later review and reference.
  • Do not second-guess Guardium support. They are not perfect, but second-guessing them is worse.

References:

  • Access to my OpenSource Project "Context22" :
  • https://github.com/Fredo68usa/Context_22_GuardiumRestAPI.git
  • https://github.com/Fredo68usa/Context22_Infra.git




