How do we analyze 240 GB/Day of open-source code+data to detect leaked credentials?
Open-source activity on GitHub produces about 240 GB of code per day, 100K commits per hour, and 20 potential credentials per minute (ref. Dark Reading). Leaked credentials include passwords, API keys, cloud access keys (AWS, Google Cloud, Azure), SSH keys, env files, and many more.
As a security partner, we often ask our customers the following questions:
1. Does your security team know how many of your developers contribute to open-source repositories?
2. Does your security team know whether developers are unintentionally leaking sensitive credentials?
3. How fast can you detect a code leak incident, and fix it?
“Detect any credential leaked in a Public Open Source Code Repository within 15 Mins”
Solution
In this article, we briefly outline how you can build a code leak detection mechanism that identifies leaked credentials relevant to your organization in open-source code published on GitHub.
The following architecture is a pipeline that captures GitHub events containing code commits from open-source contributors and detects leaked credentials related to an organization within 15-30 minutes of the code being published.
Step 1: Capture and Process GitHub Events: GitHub provides APIs to fetch events related to code commits as a stream. Write a custom events processor or use an open-source tool such as commit-stream.
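As a rough illustration of the "custom events processor" option, here is a minimal Python sketch. It assumes the events have already been fetched as JSON objects from GitHub's public events endpoint, and that each PushEvent carries a payload.commits list with sha and url fields. That shape may vary by API version, so check the current GitHub documentation. The function name and the sample events are hypothetical, and polling, authentication, and rate limiting are left out.

```python
from typing import Iterable, Iterator


def commits_from_events(events: Iterable[dict]) -> Iterator[dict]:
    """Yield one record per commit found in PushEvent entries."""
    for event in events:
        if event.get("type") != "PushEvent":
            continue  # only push events carry code commits
        repo = event.get("repo", {}).get("name", "")
        for commit in event.get("payload", {}).get("commits", []):
            yield {
                "event_id": event.get("id"),
                "repo": repo,
                "sha": commit.get("sha"),
                "commit_url": commit.get("url"),
            }


# Hypothetical sample, shaped like the assumed API response.
sample_events = [
    {"id": "1", "type": "WatchEvent", "repo": {"name": "octo/demo"}, "payload": {}},
    {
        "id": "2",
        "type": "PushEvent",
        "repo": {"name": "octo/demo"},
        "payload": {
            "commits": [
                {"sha": "abc123",
                 "url": "https://api.github.com/repos/octo/demo/commits/abc123"}
            ]
        },
    },
]

records = list(commits_from_events(sample_events))
```

Each record keeps the commit URL so that the crawler in the next step knows what to fetch.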
Step 2: Crawl Code Commits: A GitHub event links to a set of code commits, including their diff/patch. A single patch can affect multiple files. The goal of the crawler is to fetch the diff so that further analysis can be done.
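Once a diff has been fetched, the lines worth analyzing are mostly the ones the commit adds. The sketch below assumes the patch text is in unified diff format and that the crawler has already downloaded it. The function name and the sample patch are hypothetical.

```python
def added_lines(patch: str) -> list[str]:
    """Return the lines a unified diff adds, without the leading '+'."""
    added = []
    in_hunk = False
    for line in patch.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False  # a new file section begins with header lines
        elif line.startswith("@@"):
            in_hunk = True   # hunk header; content lines follow
        elif in_hunk and line.startswith("+"):
            added.append(line[1:])
    return added


# Hypothetical patch touching a single file.
sample_patch = """diff --git a/config.py b/config.py
--- a/config.py
+++ b/config.py
@@ -1,2 +1,3 @@
 import os
-DEBUG = True
+DEBUG = False
+API_KEY = "not-a-real-key"
"""

lines = added_lines(sample_patch)
```

Tracking whether the parser is inside a hunk keeps the "+++ b/config.py" file header from being mistaken for added content.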
Step 3: Index into a Big Data Solution: Index the event and code diff into a warehouse or database. One of my favorites is Google Cloud BigQuery, as querying TBs of data takes only a few seconds.
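One simple way to prepare rows for indexing is newline-delimited JSON, a format that BigQuery load jobs accept. The sketch below only builds the payload. The column names are hypothetical, and the load step itself (for example via a client library or command-line tool) is not shown.

```python
import json


def to_ndjson(rows: list[dict]) -> str:
    """Serialize rows as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


# Hypothetical column names; adapt them to your own table schema.
rows = [
    {"repo": "octo/demo", "sha": "abc123", "filename": "config.py",
     "added_text": "DEBUG = False\nAPI_KEY = 'not-a-real-key'"},
    {"repo": "octo/demo", "sha": "def456", "filename": "README.md",
     "added_text": "Updated docs"},
]

ndjson = to_ndjson(rows)
```

Newlines inside a value are escaped by the JSON encoder, so each row still occupies exactly one line of the output.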
Step 4: Attribution to My Org: Was this code committed by a developer of the org? The goal of this step is to attribute committed code to a specific org. Matching keywords and regular expressions is a relatively simple but effective technique. BigQuery makes this step easy: you can create two tables, one for keywords and another for code commits, and the keyword search problem then becomes a join between the two tables.
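To make the "keyword search as a join" idea concrete, the sketch below runs the same kind of join on an in-memory SQLite database, used here purely as a local stand-in for BigQuery. In BigQuery the join condition would more likely use functions such as STRPOS or REGEXP_CONTAINS. The table names, column names, and sample data are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE keywords (keyword TEXT);
    CREATE TABLE commits (repo TEXT, sha TEXT, added_text TEXT);
""")
conn.executemany(
    "INSERT INTO keywords VALUES (?)",
    [("examplecorp.com",), ("examplecorp-internal",)],
)
conn.executemany(
    "INSERT INTO commits VALUES (?, ?, ?)",
    [
        ("alice/tools", "a1", "SMTP_HOST = 'mail.ExampleCorp.com'"),
        ("bob/blog", "b2", "title = 'hello world'"),
    ],
)

# Attribution is a join: a commit matches when its text contains a keyword
# (case-insensitive substring match).
matches = conn.execute("""
    SELECT c.repo, c.sha, k.keyword
    FROM commits AS c
    JOIN keywords AS k
      ON instr(lower(c.added_text), lower(k.keyword)) > 0
    ORDER BY c.repo, k.keyword
""").fetchall()
```

Only the commit mentioning the organization's domain is returned; the unrelated blog commit drops out of the join.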
Step 5: Detect Leaked Credentials: The keyword search approach from Step 4 can also be used to detect leaked credentials. In this step, you create a table of secret key patterns and match it against the code commits.
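A secret-pattern table can be prototyped locally before it is loaded into the warehouse. The sketch below uses two illustrative regular expressions: a widely used heuristic for AWS access key IDs ("AKIA" followed by 16 uppercase letters or digits) and a PEM private key header. These are assumptions for demonstration, not an official or complete pattern set, and the function name is hypothetical.

```python
import re

# Illustrative patterns only; a real deployment needs a broader, tuned set.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def find_secrets(text: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_text) pairs found in text."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group(0)))
    return hits


# AKIAIOSFODNN7EXAMPLE is the placeholder key used in AWS documentation.
sample = 'aws_key = "AKIAIOSFODNN7EXAMPLE"\nname = "demo"'
hits = find_secrets(sample)
```

The same pattern strings can be stored as rows in a patterns table and applied with a regex-based join, mirroring the attribution step.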
Key Challenges
False Positives: Regex and keyword-based matching of the repositories and secrets may lead to false positive alerts. Keywords and secret patterns must be fine-tuned to get optimal results.
False Negatives: Regex and keyword-based matching of repositories and secrets may also lead to missed alerts. For example, a code commit may contain an AWS key, but the repository may not get attributed to the organization.
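One possible way to tune against false positives without changing the patterns themselves is to suppress matches that look like documentation placeholders. The sketch below is a heuristic under stated assumptions: the allowlist, the marker words, and the function name are all hypothetical, and every suppression rule can itself hide a real leak, so it trades false positives for some risk of false negatives.

```python
# Hypothetical allowlist of values known to be placeholders, not real secrets.
KNOWN_PLACEHOLDERS = {"AKIAIOSFODNN7EXAMPLE"}
# Hypothetical marker words that often appear in filler values.
PLACEHOLDER_MARKERS = ("example", "dummy", "changeme")


def is_probable_placeholder(value: str) -> bool:
    """Heuristic: True when a matched value looks like documentation filler."""
    lowered = value.lower()
    return value in KNOWN_PLACEHOLDERS or any(
        marker in lowered for marker in PLACEHOLDER_MARKERS
    )


# Made-up candidate matches, as a detection step might produce them.
candidates = ["AKIAIOSFODNN7EXAMPLE", "AKIAABCDEFGHIJKLMNOP", "changeme-token"]
alerts = [c for c in candidates if not is_probable_placeholder(c)]
```

Here only the second candidate would raise an alert; the other two are filtered out as likely placeholders.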
I would love to run a hands-on webinar on using BigQuery to find secret keys in public code. If you would like one, leave a comment or press the Loved It button.