Filter the Noise
Syed Nadeem
DevOps Architect | Kubernetes Expert | System Design Innovator | Multi Cloud Expert | Transforming Ideas into Robust Architectures
Modern-day systems generate an overwhelming amount of noise in the form of logs, events, and notifications. This noise can drown out important alerts and critical information, making it challenging to identify and address significant issues promptly. IoT devices and microservices can produce millions of garbage log entries, and investigating something within that pile can make you feel as if you are trapped beneath a mountain. While the long-term solutions live at the application and service level, there are quite a few short-term measures that can help you filter the digital noise.
Log Parsing: This involves meticulously examining and extracting structured data from log entries. Start by identifying the key fields you are looking for, then define regular expressions or pattern-matching rules in your scripts or tools. The key here is not to do parsing manually: adopt a tool like Graylog, Fluentd, or a Grok-based pipeline, which should act as the log-centralization point for all applications and services. All of these tools come with built-in streams and filters that are excellent for parsing.
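As a minimal sketch of what "identify the key fields, then pattern-match" looks like in practice (the log format, field names, and regex here are hypothetical; a centralized tool's streams and filters do the same job at scale):

```python
import re

# Hypothetical log line format -- adapt the regex to your own logs.
LOG_PATTERN = re.compile(
    r'(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+'
    r'(?P<level>INFO|WARN|ERROR)\s+'
    r'(?P<service>[\w-]+)\s+'
    r'(?P<message>.*)'
)

def parse_line(line):
    """Extract structured fields from a raw log line; None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Example: keep only ERROR entries from a stream of raw lines.
lines = [
    "2024-05-01T12:00:00 INFO  checkout-api request served",
    "2024-05-01T12:00:01 ERROR payment-svc timeout calling gateway",
]
errors = [p for line in lines if (p := parse_line(line)) and p["level"] == "ERROR"]
print(errors[0]["service"])  # payment-svc
```

Once every entry is a dictionary of named fields instead of a raw string, filtering, counting, and routing become one-liners, which is exactly what the streams and filters in Graylog or Fluentd give you out of the box.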
Intelligent Alerting: Artificial intelligence has advanced alerting by leaps and bounds in recent times. Based on sample data collected over a time frame, we can get not only live alerts but also alerts predicting future events. Isn't that crazy :)
For example, based on your current usage, AWS can predict and alert on what you will spend in future months. Datadog can automatically identify abnormal behavior in metrics and generate intelligent alerts, and BigPanda can automatically correlate related alerts, filter out noise, and surface high-priority incidents.
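At its simplest, this kind of anomaly detection is statistical: flag any data point that strays too far from recent behavior. The sketch below uses a rolling z-score as a toy stand-in for the far more sophisticated models these managed services run; the latency numbers and thresholds are illustrative:

```python
from statistics import mean, stdev

def zscore_alerts(samples, window=10, threshold=3.0):
    """Flag points deviating more than `threshold` standard deviations
    from the trailing window's mean. A toy anomaly detector -- real
    services model seasonality, trend, and much more."""
    alerts = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(samples[i] - mu) / sigma > threshold:
            alerts.append((i, samples[i]))
    return alerts

# Steady latency around 100 ms, then a sudden spike at the end.
latencies = [100, 101, 99, 100, 102, 98, 100, 101, 99, 100, 100, 450]
print(zscore_alerts(latencies))  # [(11, 450)]
```

The point is that the alert fires on *deviation from learned behavior*, not on a hand-picked fixed number, which is what makes this class of alerting so much quieter than naive rules.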
Fewer Tools: Less is good. That's right: choose only the tools you need, and never try to find a need for a tool you have already chosen. There will be experts suggesting all the fancy new tools, and that's okay, but if you don't have a need that requires a new tool, take that advice as a side note and move on. Have strictly one solution for log management and one solution for alerting and reporting. In other words, you don't need two antivirus programs going crazy on one laptop.
Defining Thresholds and Aggregating Metrics: This is the most important part. To decide that something is serious and needs immediate attention, you need to define thresholds, and doing so also improves the security and reliability of the overall system. If you are using Prometheus, use static or dynamic thresholds and aggregate metrics with functions like sum(), avg(), rate(), and increase() to define what is acceptable. If you are using Fluentd, make use of record_transformer and kubernetes_metadata to transform logs before deciding what is acceptable. In Sysdig, use filters like contains and matches to get counts and, eventually, metrics.
Once a threshold or limit is reached, an event-driven procedure needs to kick in automatically. For example, a metrics system can tell you your system's current capacity, and when it reaches 60% you automatically want extra nodes to be added (what we call HPA or auto-scaling). Or if an IP keeps appearing in your logs and is not from your known CIDR ranges, you automatically want to block it. The examples are many but the logic is simple: define and make use of thresholds and aggregated metrics.
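Both examples, scaling on a capacity threshold and blocking an unknown source, boil down to one decision step: aggregate, compare against the threshold, act. The sketch below is illustrative; the function names, the 60% trigger, and the CIDR ranges are assumptions, and in production the actions would map to HPA and firewall automation rather than returned strings:

```python
import ipaddress

SCALE_OUT_THRESHOLD = 60.0  # percent; the article's example trigger point

def avg_cpu(samples):
    """Aggregate per-node CPU readings into one fleet-wide metric
    (the Python analogue of a PromQL avg())."""
    return sum(samples) / len(samples)

def plan_actions(cpu_samples, source_ip, known_cidrs):
    """Toy event-driven decision step: threshold breach -> scale out,
    traffic from outside known CIDRs -> block."""
    actions = []
    if avg_cpu(cpu_samples) >= SCALE_OUT_THRESHOLD:
        actions.append("scale-out")
    ip = ipaddress.ip_address(source_ip)
    if not any(ip in ipaddress.ip_network(cidr) for cidr in known_cidrs):
        actions.append(f"block {source_ip}")
    return actions

print(plan_actions([70, 65, 72], "203.0.113.9",
                   ["10.0.0.0/8", "192.168.0.0/16"]))
# ['scale-out', 'block 203.0.113.9']
```

The design choice worth noting: thresholds are applied to *aggregated* metrics (the fleet average), not to individual raw samples, which is what keeps a single noisy node from paging you at 3 a.m.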