Rethinking the world of logging?
Maarten Ectors
Innovative Technologist, Business Strategist and Senior Executive | Bridging Technology & Business for Lasting Impact
Whenever you deploy lots of cloud servers, containers or smart devices, you need logs to find out when something has gone wrong. The traditional setup is to export logs from each instance, transform them and load them into a data lake or analytics platform. Not only is this slow and expensive, it often does not work. When a device or instance has an issue, what you really want is access to extremely detailed logs for that specific device. You don't want to export logs at this level of detail all the time for all devices and instances, because it would be too much data. Time-critical issues, e.g. a disk filling up quickly or an intruder, often get caught too late if you rely on an overnight batch process. The same logs can be used by IT teams to check whether an application is working, by operations and support teams to find information about a customer problem, by finance teams to understand the financial transactions that happened on a device, and by security teams to be warned when ransomware or intruders have tripped a wire. Building one data analytics platform that solves every department's problem all the time is impossible in any medium to large organisation. So why are we still trying?
What if we rethink logging? Divide it into two categories: proactive and reactive.
Proactive logging or events
If we know the device we are monitoring is a point of sale (PoS) in a supermarket, then we could push one or more log event generators to it. A log event generator is a small piece of highly optimised code that calls an external system every time an interesting event happens. When a customer pays for their groceries, the log event generator can see this in the logs and generate an event with whatever information HQ wants to know. If the disk of the machine is about to fill up due to some issue, an IT support event can be sent. The same applies if a virus is attacking the machine. In short, proactive logging pushes logic to the device or instance that defines which log entries should generate an event and where these events should be delivered. Each department might have its own system to which it wants these events to be sent. Proactive maintenance could prevent customers from ever seeing an outage. Security staff can take immediate action, even before users know they have a virus.
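To make the idea of a log event generator more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the log line formats, the rule names, the department labels and the `deliver` callback are hypothetical, and a real generator would tail the device's actual logs and call each department's real system.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class EventRule:
    """One pushed rule: what to look for and which department gets the event."""
    name: str
    pattern: re.Pattern
    destination: str  # hypothetical label for a department's system


def generate_events(
    lines: Iterable[str],
    rules: list[EventRule],
    deliver: Callable[[str, dict], None],
) -> int:
    """Scan log lines and deliver one event per matching rule. Returns the event count."""
    sent = 0
    for line in lines:
        for rule in rules:
            match = rule.pattern.search(line)
            if match:
                # Only the named groups are forwarded, not the full log line.
                deliver(rule.destination, {"event": rule.name, **match.groupdict()})
                sent += 1
    return sent


# Hypothetical rules for a PoS device; the log formats are invented for this sketch.
RULES = [
    EventRule(
        "payment_completed",
        re.compile(r"PAYMENT OK amount=(?P<amount>\d+\.\d{2})"),
        "finance",
    ),
    EventRule(
        "disk_almost_full",
        re.compile(r"DISK usage=(?P<percent>9\d|100)%"),
        "it-support",
    ),
]
```

The design choice worth noting is that rules are plain data, so a department could push a new rule to a device without redeploying the generator itself.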
Reactive logging
Whenever something is wrong, you likely want to dig deep through the logs of the device or instance that has the issue. Some IT problems only happen on one in a thousand machines. Forget about increasing the log level on the other 999. You want to go ad hoc and deep on the one with the problem. Reactive logging should allow logs to be generated ad hoc, based on what is needed. IT support staff should be able to request extra logs from a specific machine or group of machines. Business people thinking about running a new type of campaign might need information they never requested before. If each cloud instance or device has a reactive logging module that can run log requests in real time, provided they do not interfere with the core activities running on the same machine (e.g. reactive logging may only use 5% of resources), then ad-hoc queries should be possible. ChatGPT-style interfaces can be used to convert a human request for logs into specific technical tools or query languages (think SQL). For example: "ChatGPT, can you ask all PoS devices in the store on Liverpool Street if anybody has inserted a new USB device in the last 2 hours?"
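A reactive logging module of this kind could be sketched roughly as follows. This is only an illustration under stated assumptions: the `LogRecord` type, the `run_adhoc_query` function and its parameters are hypothetical, and the cap on scanned records is a simple stand-in for a real resource limit such as the 5% mentioned above, which would need operating-system support to enforce.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class LogRecord:
    timestamp: datetime
    message: str


def run_adhoc_query(
    records: list[LogRecord],
    contains: str,
    since: datetime,
    max_scanned: int,
) -> tuple[list[LogRecord], bool]:
    """Return (matches, truncated) for an ad-hoc request against local logs.

    The query stops once max_scanned records have been examined, so it
    cannot crowd out the core workload. truncated is True when the
    budget ran out before all records were examined.
    """
    matches: list[LogRecord] = []
    scanned = 0
    for record in records:
        if scanned >= max_scanned:
            return matches, True
        scanned += 1
        if record.timestamp >= since and contains in record.message:
            matches.append(record)
    return matches, False
```

The USB question above would then translate into something like `run_adhoc_query(records, "usb inserted", now - timedelta(hours=2), max_scanned=10_000)` on each PoS device, where the search string and budget are again assumptions.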
Any other logging requirements or problems
Are there any other logging requirements or problems you would like to see addressed? If you think more innovation is needed in the logging space, please use the comments.
Data Network & Security Advisor | Network Performance Analysis | Blogger | Entrepreneur | Maritime Archeology Volunteer
8 months ago: Did you just describe the original (Juniper) Mist AI? When a certain event occurs, debugging starts. This way an engineer can get logs without trying to reproduce the issue.