Data Logging for Audit, Debugging, and Fraud Detection
As we divide large applications into smaller “microservices,” we run into common problems that go back to the Apollo project days. Every development team must solve these problems, so it makes sense for the system architects to specify solutions in advance. This post discusses logging, an area where shared, system-wide solutions matter even more than they do elsewhere.
Back when computer memory cost 10 cents per bit, memories were so small and network capacity so low that it wasn’t possible to do much logging in the sense we use the term today. That made debugging a lot harder than it is now. Legacy systems generally implemented logging pretty casually – developers added print statements as the spirit moved them. To be able to monitor and debug when thousands of different microservice instances are cooperating to solve problems, logging needs a lot more thought than in the old days of wooden ships and iron men.
Given that any user-initiated action can be processed by any number of microservice calls, the variety of bugs and logic flaws that can bedevil a service-based system is richer than in a monolithic application. Each user request should be assigned a user transaction ID which is included in all logging messages. Tracking user requests helps untangle errors related to flow and helps identify normal activity patterns that can be distinguished from fraudulent activity. Request IDs can also be used to support Once and Only Once semantics, which are discussed in a separate post.
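For illustration only, here is a minimal Python sketch of one way to attach a transaction ID to every log message, using a context variable and a logging filter. The names request_id_var, RequestIdFilter, and handle_request are invented for this example, and the convention for forwarding the ID to downstream services (for example, an HTTP header) is left out.

    import contextvars
    import logging
    import uuid

    # Holds the transaction ID for the request currently being handled.
    request_id_var = contextvars.ContextVar("request_id", default="-")

    class RequestIdFilter(logging.Filter):
        # Copy the current transaction ID into every log record so the
        # formatter can print it.
        def filter(self, record):
            record.request_id = request_id_var.get()
            return True

    logger = logging.getLogger("order-service")
    handler = logging.StreamHandler()
    handler.addFilter(RequestIdFilter())
    handler.setFormatter(logging.Formatter("%(asctime)s %(request_id)s %(name)s %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    def handle_request(payload):
        # A new transaction ID is assigned at the system edge; downstream
        # service calls would forward the same ID so every service logs it.
        request_id_var.set(uuid.uuid4().hex)
        logger.info("received order %s", payload["order_id"])

    handle_request({"order_id": "A-1001"})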
The architecture team should choose a common logging library and supply a standard logger configuration file so that messages are formatted consistently across all projects and services. Each message should start with a GMT date and time in yyyymmddhhmmss format. This supports merging logs from different systems spread all over the world.
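One possible shape for such a shared configuration, sketched here with Python’s standard logging module: the formatter is forced onto GMT and uses the yyyymmddhhmmss date format. A real shared library would add the extra fields described in the next paragraph.

    import logging
    import time

    formatter = logging.Formatter(fmt="%(asctime)s %(levelname)s %(message)s",
                                  datefmt="%Y%m%d%H%M%S")
    formatter.converter = time.gmtime   # always GMT, never the host's local zone

    root = logging.getLogger()
    handler = logging.StreamHandler()
    handler.setFormatter(formatter)
    root.addHandler(handler)
    root.setLevel(logging.INFO)

    root.info("service started")        # e.g. 20240301114205 INFO service started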
Many issues arise because of unexpected interactions between calls to microservices. Merging log files from different sources and keeping them in time sequence is helpful, particularly if each message also includes the user request ID. Log messages should also include a hostname, a service name, and a trigger field for the SCADA system, which is explained later.
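A time-sequenced merge is straightforward once every service uses the same fixed-width GMT timestamp prefix. The sketch below assumes exactly that; the file names are illustrative.

    import heapq

    def merge_logs(paths, out_path):
        # heapq.merge needs each input already sorted; individual log files
        # are written in time order, and the fixed-width leading timestamp
        # makes plain string order the same as time order.
        files = [open(p, "r", encoding="utf-8") for p in paths]
        try:
            with open(out_path, "w", encoding="utf-8") as out:
                for line in heapq.merge(*files, key=lambda line: line[:14]):
                    out.write(line)
        finally:
            for f in files:
                f.close()

    merge_logs(["gateway.log", "billing.log", "inventory.log"], "merged.log")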
All times should be stored as raw GMT milliseconds instead of using database-specific time formats. Airlines learned long ago that any other approach leads to chaos.
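A minimal sketch of the idea, assuming times are kept as integer GMT milliseconds (for example, in a BIGINT column) and converted to readable form only at the edges:

    import time
    from datetime import datetime, timezone

    def now_millis() -> int:
        # Raw GMT milliseconds: what gets stored, logged, and compared.
        return int(time.time() * 1000)

    def to_display(millis: int) -> str:
        # Convert to a readable GMT timestamp only for display.
        return datetime.fromtimestamp(millis / 1000, tz=timezone.utc).strftime("%Y%m%d%H%M%S")

    created_at = now_millis()       # e.g. 1700000000123, stored as an integer
    print(to_display(created_at))   # e.g. 20231114221320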
It is important to be able to associate log messages with individual users, either by including the user ID in the log messages or by being able to associate the transaction ID with a specific user as the need arises. The Target breach occurred because their IT systems let a hacker who had stolen HVAC vendor credentials access their point-of-sale systems. This was a permissions issue, not an authentication failure. Target should never have permitted the attacker who impersonated the vendor to escalate privileges at all, to say nothing of escalating to the point of accessing critical data. If all requests had identified the originating user, it would have been easier to protect the more sensitive systems.
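One simple way to keep that association is to record the transaction-ID-to-user mapping once, at the system edge, so any later log line can be traced back to a person. The sketch below uses SQLite purely for illustration; the table and column names are assumptions, not a prescription.

    import sqlite3

    conn = sqlite3.connect("audit.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS request_user (
                        request_id TEXT PRIMARY KEY,
                        user_id    TEXT NOT NULL,
                        start_ms   INTEGER NOT NULL)""")

    def record_request(request_id: str, user_id: str, start_ms: int) -> None:
        # One row per user request, written at the edge; any log line that
        # carries request_id can later be tied to the responsible user.
        conn.execute("INSERT OR REPLACE INTO request_user VALUES (?, ?, ?)",
                     (request_id, user_id, start_ms))
        conn.commit()

    def user_for_request(request_id: str):
        row = conn.execute("SELECT user_id FROM request_user WHERE request_id = ?",
                           (request_id,)).fetchone()
        return row[0] if row else None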
The only known way to detect such internal attacks is activity pattern analysis. Mr. Snowden had root access to the NSA computers, but there was no job-related need for him to actually read gigabytes of data. The fact that he was collecting, moving, and copying large volumes of data should have set off alarms, or at least spurred someone to ask questions. We saw the same “invade, collect, compress, export” sequence in the Capital One data breach.
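Pattern analysis does not have to be sophisticated to be useful. As a deliberately simple sketch, the following flags any user whose daily data volume jumps far above their own recent baseline; the record format and the 10x threshold are illustrative assumptions, not recommendations.

    from collections import defaultdict

    def flag_unusual_volume(daily_bytes_by_user, threshold=10.0):
        # daily_bytes_by_user: {user_id: [bytes_day_1, ..., bytes_today]}
        flagged = []
        for user, series in daily_bytes_by_user.items():
            if len(series) < 2:
                continue
            baseline = sum(series[:-1]) / len(series[:-1])
            if baseline > 0 and series[-1] > threshold * baseline:
                flagged.append((user, series[-1], baseline))
        return flagged

    history = defaultdict(list)
    history["analyst-17"].extend([120e6, 150e6, 110e6, 9.5e9])   # sudden multi-GB day
    history["analyst-42"].extend([90e6, 100e6, 95e6, 105e6])
    print(flag_unusual_volume(history))   # flags analyst-17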
Banks require that all actions with financial effect be associated with some responsible individual. Activity logs support both auditing and activity pattern analysis, but only if all log file formats are consistent.
Logging also feeds our Supervisory Control and Data Acquisition (SCADA) system, which makes sure that thousands of microservice instances are working as expected and that overall system performance meets customer expectations. SCADA is discussed in its own post.
AWS has services which merge logs and send alarms as needed. Is depending on AWS acceptable, or should we adopt and maintain some other log management system? If we already have a well-organized log management system in our legacy data center, using it for the cloud makes sense. If we don’t, getting into the cloud is a golden opportunity to configure one.
Musician at Be?+ · 5 years
Bill Taylor, what’s your input on this matter, pros and cons? Could universities or education systems help brainstorm ideas in classrooms?
IPT-Lead at Aitech Systems Ltd. · 5 years
I do this for a living. When I first started 15 years ago, we barely had much data coming in; I’m talking a few hundred MBs. Today we are talking tens of GBs of data. With new services coming online from our customers, they are capturing more and more data. We are playing catch-up with data servicing/processing and debugging. We are working on finding lost data and why it was missing, down to a second of lost data or less. It’s difficult when you process the data and there are a million lines or more you have to sift through to find the issue. So we have to create the test/processing tools.