Open-Source Contribution to Google's Timesketch
Section 1: About the Project
Introduction
Through a micro-internship run by CodeDay in partnership with the Computing Talent Initiative (CTI), I worked with a great project team: Brodan Whelan, Han Ngo, and our mentor Tyler Menezes. I was able to contribute to Timesketch, an open-source application maintained by some of Google’s finest, such as Johan Berggren.
What is Timesketch?
Timesketch is a collaborative forensic timeline analysis application maintained by Google. It makes sense that one of the biggest contributors to the cybersecurity field actively develops it. Professionals regularly demonstrate the tool in use, as shown by the community that has formed around it on social media platforms such as Twitter and YouTube. As of this year, the project has roughly 2.5 to 3 thousand users and contributors, with a healthy number of stars on GitHub to show for it [source].
The typical Timesketch user is a cybersecurity professional who needs to perform a forensic analysis of digital traffic on a cloud or private network. Timesketch lets these professionals track an intruder’s actions using network traffic data, analyze the patterns and what they imply about the impact of an incident, and make data-driven decisions to resolve it.
Moreover, Timesketch can be used collaboratively to visualize these data patterns with the Graph feature. This is pivotal to the effectiveness of a cybersecurity team tasked with incident response.
Section 2: Diagnosing The Issue
Initially, the group was tracking down issue #2279, "Error parsing datetime from CSV files". We came to understand that the issue occurred because read_and_validate_csv(), a function in the project’s timesketch/timesketch/lib/utils.py file, was unable to parse the datetime from the CSV file into a datetime type. This prevents the application from properly populating the timestamp field, so the labels shown while analyzing timelines of interest can no longer be trusted to be accurate. If the parsing function cannot perform these tasks properly, the application loses reliability and robustness.
With the datetime formatted as:
2022-08-09T05:29:39-05:00
The following error message was thrown.
To track down the problem in the source code, we looked to the read_and_validate_csv() function as our first clue. It turned out that the datetime field read from each chunk had a type that the .dt accessor did not expect.
chunk["datetime"] = pandas.to_datetime(
    chunk["datetime"], errors="coerce"
)
On line 292, the code attempts to convert chunk["datetime"] to a datetime type. Despite that conversion, the issue arose on line 306, where the .dt accessor is called on the result in order to assign a value to chunk["timestamp"].
chunk["timestamp"] = chunk["datetime"].dt.strftime(
    "%s%f").astype(int)
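To make the intent of that conversion concrete: on platforms that support it, the format string "%s%f" concatenates epoch seconds with a six-digit microsecond field, which amounts to microseconds since the Unix epoch. The following is a minimal sketch of the same value computed with the Python standard library instead of pandas. The helper name to_epoch_microseconds is hypothetical and is not part of Timesketch, and treating values without an offset as UTC is an assumption made only for this example. The sketch uses arithmetic rather than "%s" because that directive is platform-specific and is not among the format codes Python documents.

```python
from datetime import datetime, timedelta, timezone

# Reference point: the Unix epoch as a timezone-aware datetime.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_microseconds(value: str) -> int:
    """Hypothetical helper: ISO 8601 string -> microseconds since the epoch."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption for this sketch: values without an offset are UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    # Dividing one timedelta by another with // yields an exact integer,
    # which avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(microseconds=1)


print(to_epoch_microseconds("2022-08-09T05:29:39-05:00"))  # 1660040979000000
```

The sample value is the one from the issue, 05:29:39 at UTC-05:00, which corresponds to 10:29:39 UTC.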
Our initial approach to this problem was to convert the input value to a DataFrame or Series type so that the .dt accessor could be called on it.
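As background for that approach, the .dt accessor is only available on a pandas Series whose dtype is datetime-like. The sketch below illustrates this general behavior with made-up sample data. It is not a reproduction of issue #2279, and the helper can_use_dt is a hypothetical name introduced here. Passing utc=True is one way to end up with a single datetime dtype, shown as an assumption rather than as the fix the maintainers applied.

```python
import pandas as pd


def can_use_dt(series: pd.Series) -> bool:
    """Hypothetical helper: report whether .dt is usable on this Series."""
    try:
        series.dt
    except AttributeError:
        # pandas raises AttributeError for non-datetime-like values.
        return False
    return True


# Made-up sample data: one valid ISO 8601 value and one unparseable value.
raw = pd.Series(["2022-08-09T05:29:39-05:00", "not a date"])

# Plain strings are not datetime-like, so .dt is unavailable here.
print(can_use_dt(raw))

# errors="coerce" turns unparseable values into NaT, and utc=True
# normalizes every offset to UTC, giving one datetime dtype.
parsed = pd.to_datetime(raw, errors="coerce", utc=True)
print(can_use_dt(parsed))
print(parsed.isna().tolist())
```

Checking the dtype before calling .dt, or normalizing the column first, keeps a single malformed row from breaking the whole chunk.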
However, while running tests, the team and I discovered that the codebase had been updated without any explicit notification. The issue assigned to my team and me had already been resolved. So we had to switch to a preventative train of thought and ask ourselves: what problems could arise during real-time use, and what cases could cause a regression in the application that leads to issues like #2279?
Section 3: Codebase Overview
This train of thought required a better understanding of the codebase: how the project was composed and which technologies it used. Below is a breakdown of what the project consists of.
TECH STACK:
Frontend
Vue.js: Manages UI interactions. (24.5% of the project)
Bootstrap: Used for styling the frontend elements
Webpack: Bundles frontend assets
Backend
Flask: Serves as the application server
Python: The core programming language (60.1% of the project)
Database
Elasticsearch: Manages the timeline data
PostgreSQL: Handles relational data storage
Task Management
Celery: Manages background tasks
Redis: Serves as the queue and cache for tasks
Containers and Servers
Docker: Provides a scalable and robust deployment environment
Nginx: Serves as the reverse proxy server
My team and I used and interacted with the following:
Python
The main programming language for backend development
Docker
Containerization environment that keeps the deployment environment consistent and scalable
Pandas
Python library used for processing CSV data
Pytest
Python testing framework used for the project's backend
GitHub
Platform used for version control and collaboration on the project
Section 4: Challenges
The first challenge I faced was working with Docker. It took a while to familiarize myself with it through the documentation and some YouTube videos. Once I reached the point of running Docker commands, I ran into another wall: my machine took an eternity to run docker compose and start the application for the first time. The whole team saw this issue. Moreover, the others were on a different page, since they had Mac machines while I have a Windows machine. After a session with our mentor Tyler, we better understood Docker and were able to use it to deploy the application natively. This gave us a thorough understanding of what the user experiences when development issues occur and what those issues imply about the tool’s performance.
Another challenge I faced was understanding the codebase and what exactly the code in the assigned issue was doing. It took a significant amount of time to build a test suite that allowed me to fully understand what happens when a user creates a timestamp for a timeline in the application.
The last hurdle was one I shared with my team. While we were pair programming to replicate the issue, we found that the error message was gone. After further analysis of the project, we learned that we had been running our tests after the issue was already fixed, although this wasn’t made clear within the GitHub community. Eventually, we found a cryptic message on the issue implying that it was resolved, even though it had not been flagged as resolved. After reaching out to our CTI coordinators, we were guided towards building unit tests to prevent further issues down the road and reinforce the robustness of the tool.
Section 5: Solution
Our solution was to create a preventative unit test that protects the integrity of the codebase in case future modifications reintroduce problems related to #2279. It also catches unintended changes to the datetime conversion.
We wrote our unit test in the utils_test.py file and ran it with Pytest, the framework Timesketch uses for testing, to make sure that the datetime values read from the CSV file are always returned in ISO format. If the format is incorrect, the test fails, which signals that the conversion has changed.
The CSV test file we created, validate_timestamp_conversion.csv, contains three datetime values in varying formats.
The test loads the CSV file and verifies that the generated output matches the expected output. When it does, the datetime conversion has completed successfully.
# Checking that the converted datetime output format is not altered
results = iter(
    read_and_validate_csv(
        "test_tools/test_events/validate_timestamp_conversion.csv"
    )
)
for output in expected_outputs:
    self.assertDictEqual(next(results), output)
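For readers who want to try the pattern without a Timesketch checkout, here is a self-contained sketch of the same kind of regression test. Everything in it is a stand-in: read_events is a hypothetical reader written for this example (it is not read_and_validate_csv), SAMPLE_CSV is invented data rather than the contents of validate_timestamp_conversion.csv, and treating values without an offset as UTC is an assumption.

```python
import csv
import io
import unittest
from datetime import datetime, timezone

# Invented fixture: three datetime values in varying formats.
SAMPLE_CSV = """message,datetime
first event,2022-08-09T05:29:39-05:00
second event,2022-08-09T10:29:39+00:00
third event,2022-08-09 10:29:39
"""


def read_events(handle):
    """Hypothetical reader: yield rows with datetime as UTC ISO 8601."""
    for row in csv.DictReader(handle):
        parsed = datetime.fromisoformat(row["datetime"])
        if parsed.tzinfo is None:
            # Assumption for this sketch: values without an offset are UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        row["datetime"] = parsed.astimezone(timezone.utc).isoformat()
        yield row


class TimestampConversionTest(unittest.TestCase):
    def test_datetime_output_format_is_not_altered(self):
        # All three inputs describe the same instant, so the expected
        # datetime string is identical for every row.
        expected_outputs = [
            {"message": "first event", "datetime": "2022-08-09T10:29:39+00:00"},
            {"message": "second event", "datetime": "2022-08-09T10:29:39+00:00"},
            {"message": "third event", "datetime": "2022-08-09T10:29:39+00:00"},
        ]
        results = iter(read_events(io.StringIO(SAMPLE_CSV)))
        for output in expected_outputs:
            self.assertDictEqual(next(results), output)
```

Pytest collects unittest.TestCase classes, so a file containing this sketch can be run with python3 -m pytest in the same way as the project's own tests.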
Before testing the new changes, the testing framework had to be installed, along with the Flask testing module used within Timesketch.
This is done by running the following commands:
$ pip install pytest
$ pip install Flask-Testing
Then, run the command:
$ python3 -m pytest ./timesketch/lib/utils_test.py
Once my team and I finalized our contribution, we submitted the changes and successfully merged our pull request into the project.
Acknowledgments