How do we analyze 240 GB/Day of open-source code+data to detect leaked credentials?

Jitendra Chauhan

CEO & Co-Founder at Detoxio, Detox your GenAI

Published: July 12, 2023

Open-source activity on GitHub adds up to roughly 240 GB of code per day, 100K commits per hour, and 20 potential credentials per minute (ref. Dark Reading). Leaked credentials include passwords, API keys, cloud access keys (AWS, Google Cloud, Azure), SSH keys, env files, and many more.

As a security partner, we often ask our customers the following questions:

1. Does your security team know how many of your developers contribute to open-source repositories?

2. Does your security team know whether developers are unintentionally leaking sensitive credentials?

3. How fast can you detect a code leak incident and fix it?

“Detect any credential leaked in a Public Open Source Code Repository within 15 Mins”

Solution

In this article, we outline how you can build a code leak detection mechanism that identifies leaked credentials relevant to your organization in open-source code published on GitHub.

The following architecture is a pipeline that captures GitHub events containing code commits from open-source contributors and detects leaked credentials related to an organization within 15-30 minutes of the code being published.

Step 1: Capture and Process GitHub Events: GitHub provides APIs to fetch events related to code commits as a stream. Write a custom events processor or use an open-source tool such as commit-stream.
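
As a rough illustration of a custom events processor, the sketch below reads one page of GitHub's public events feed and pulls out commit references from push events. The function names are hypothetical, and it assumes the push event payload carries a `commits` list with `sha` and `url` fields; check the current API documentation before relying on that shape.

```python
import json
import urllib.request

EVENTS_URL = "https://api.github.com/events"  # GitHub's public events feed


def fetch_public_events(url=EVENTS_URL):
    """Fetch one page of public events (unauthenticated calls are rate limited)."""
    req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def extract_commit_refs(events):
    """Return (repo, sha, url) for every commit listed in PushEvent payloads.

    Assumption: each PushEvent payload has a "commits" list; events
    without one simply contribute nothing.
    """
    refs = []
    for event in events:
        if event.get("type") != "PushEvent":
            continue
        repo = event.get("repo", {}).get("name", "")
        for commit in event.get("payload", {}).get("commits", []):
            refs.append((repo, commit.get("sha"), commit.get("url")))
    return refs
```

A real processor would call `fetch_public_events()` in a polling loop, stay within the API rate limits, and hand the output of `extract_commit_refs` to the crawler in Step 2.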

Step 2: Crawl Code Commits: A GitHub event links to a set of code commits, each with a diff/patch. A single patch can affect multiple files. The goal of the crawler is to fetch the diff so that it can be analyzed further.
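
Once the crawler has the patch text, only the added lines usually matter for leak detection. The following is a minimal sketch, assuming the patch is in unified diff format and has already been downloaded; `added_lines` is a hypothetical helper, not part of any tool named above.

```python
def added_lines(patch):
    """Map each file path in a unified diff to the lines the commit added."""
    result = {}
    current = None
    for line in patch.splitlines():
        if line.startswith("+++ "):
            # "+++ b/path/to/file" names the new version of the file.
            path = line[4:].strip()
            if path == "/dev/null":
                current = None  # the file was deleted, nothing was added
                continue
            current = path[2:] if path.startswith("b/") else path
            result.setdefault(current, [])
        elif line.startswith("+") and current is not None:
            result[current].append(line[1:])
    return result
```

This simple parser treats any line starting with `+++ ` as a file header, so an added line whose content itself begins with `++ ` would be misread; a production crawler would track hunk headers to avoid that.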

Step 3: Index into a Big Data Solution: Index the events and code diffs into a warehouse or database. One of my favorites is Google BigQuery, as querying terabytes of data takes only a few seconds.
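
One way to prepare the data for indexing is to flatten each commit into one row per added line and serialize the rows as newline-delimited JSON, a format BigQuery can load. The sketch below covers only that serialization step; the column names (`repo`, `sha`, `path`, `line`) are illustrative assumptions, not a schema from this article.

```python
import json


def to_ndjson(repo, sha, files):
    """Serialize added lines as newline-delimited JSON, one row per line.

    `files` maps a file path to the list of lines added in that file.
    """
    rows = []
    for path, lines in files.items():
        for line in lines:
            row = {"repo": repo, "sha": sha, "path": path, "line": line}
            rows.append(json.dumps(row, sort_keys=True))
    return "\n".join(rows)
```

The resulting text can be written to a file and loaded with whatever batch or streaming mechanism the chosen warehouse offers.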

Step 4: Attribution to My Org: Was this code committed by a developer of the org? The goal of this step is to attribute committed code to a specific org. Matching keywords and regular expressions is a relatively simple but effective technique, and BigQuery makes this step easy. Create two tables, one for keywords and another for code commits; the keyword search then becomes a join between the two tables.
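
To make the join idea concrete, here is a small sketch that uses an in-memory SQLite database as a local stand-in for BigQuery. The table and column names are assumptions for illustration; in BigQuery, a function such as `STRPOS` or `REGEXP_CONTAINS` would play the role of SQLite's `instr`.

```python
import sqlite3


def attribute_commits(commit_rows, keywords):
    """Join commit lines against org keywords; return (repo, sha, keyword) hits.

    `commit_rows` is a list of (repo, sha, line) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE commits (repo TEXT, sha TEXT, line TEXT)")
    conn.execute("CREATE TABLE keywords (keyword TEXT)")
    conn.executemany("INSERT INTO commits VALUES (?, ?, ?)", commit_rows)
    conn.executemany("INSERT INTO keywords VALUES (?)", [(k,) for k in keywords])
    rows = conn.execute(
        """
        SELECT DISTINCT c.repo, c.sha, k.keyword
        FROM commits AS c
        JOIN keywords AS k
          ON instr(lower(c.line), lower(k.keyword)) > 0
        ORDER BY c.repo, c.sha, k.keyword
        """
    ).fetchall()
    conn.close()
    return rows
```

For example, with a keyword such as an internal domain name, only commits whose added lines mention that domain come back from the join.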

Step 5: Detect Leaked Credentials: The keyword search technique from Step 4 can also be used to detect leaked credentials. In this step, create a table of secret key patterns and match it against the code commits.
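
A pattern table can be as simple as a mapping from a label to a regular expression. The sketch below shows two commonly used patterns, an AWS access key ID prefix and a PEM private key header; the pattern set and the `find_secrets` helper are illustrative assumptions, and a real deployment would need many more patterns.

```python
import re

# Illustrative pattern table: label -> compiled regular expression.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(
        r"-----BEGIN (?:RSA |EC |DSA |OPENSSH )?PRIVATE KEY-----"
    ),
}


def find_secrets(line):
    """Return the labels of all patterns that match the given line."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(line)]
```

Running `find_secrets` over the added lines of each attributed commit gives the candidate alerts for review.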

Key Challenges

False Positives: Regex and keyword-based matching of repositories and secrets may produce false positive alerts. Keywords and secret patterns must be fine-tuned to get optimal results.

False Negatives: Regex and keyword-based matching of repositories and secrets may also miss real leaks. For example, a code commit may contain an AWS key, but the repository may not get attributed to the organization.

I would love to run a hands-on webinar on using BigQuery to find secret keys in public code. If you would like one, leave a comment or press the Loved It button.
