CyberSecurity Feed Summarisation with Context using AI
XexonStack


One of the challenges security professionals face is staying abreast of current security trends. The constant flood of digital content throughout the day is a major hurdle to achieving this.

The President's Daily Brief (PDB), sometimes referred to as the President's Daily Briefing or the President's Daily Bulletin, is a top-secret document produced and given each morning to the president of the United States; it is also distributed to a small number of top-level US officials who are approved by the president. It includes highly classified intelligence analysis, information about covert operations, and reports from the most sensitive US sources or those shared by allied intelligence agencies.


Motivation - Imagine the incredible boost you'll get from having your very own personalised PDB!

This article presents an overview of one of the AI models I developed for summarising cybersecurity news. It is hosted on Hugging Face, has just over 400 million parameters, and achieves a ROUGE-1 score above 49, which puts it ahead of the top models in the comparison linked below. Here is a curated list of state-of-the-art models and their metric scores:

https://medium.com/besedo-engineering/text-summarization-part-2-state-of-the-art-ae900e2ac55f


Because I work in the cybersecurity domain and assume the audience leans the same way, I will keep the description of how I built it high-level.

I used news emails sent out by service providers such as Mandiant, RiskIQ, Microsoft, and others, extracting each summary and the link to the full article on the web. This produced a dataset with just two fields: the summary and the full news text from the web page. I then added some scraped content from selected cybersecurity websites.
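As a rough illustration of the two-field layout described above, the sketch below stores summary/article pairs in a JSON Lines file. The field names `summary` and `article`, the `NewsRecord` class, and the helper functions are assumptions made for illustration; they are not the actual schema of the dataset.

```python
import json
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass
class NewsRecord:
    """One training pair (hypothetical field names)."""
    summary: str  # short blurb taken from a newsletter email
    article: str  # full text scraped from the linked web page


def write_jsonl(records, path):
    """Write records as JSON Lines, skipping pairs with an empty field.

    Returns the number of records actually written.
    """
    kept = 0
    with Path(path).open("w", encoding="utf-8") as fh:
        for rec in records:
            if not rec.summary.strip() or not rec.article.strip():
                continue
            fh.write(json.dumps(asdict(rec), ensure_ascii=False) + "\n")
            kept += 1
    return kept


def read_jsonl(path):
    """Load the records back, one JSON object per non-empty line."""
    with Path(path).open(encoding="utf-8") as fh:
        return [NewsRecord(**json.loads(line)) for line in fh if line.strip()]
```

Keeping one pair per line makes it easy to append new newsletter items over time and to drop incomplete pairs before training.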

This task does not require an LLM, and the model should be able to run on a personal computer with modest resources. I chose BART since it has been shown to perform well on similar tasks. The dataset has over 6K samples, and training on an RTX 4090 took about 14 hours.

The results are much better than I expected and, on the ROUGE-1 metric, better than the state-of-the-art models listed above.
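For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between a generated summary and a reference summary. The sketch below is a simplified version that uses plain lowercase tokenisation and no stemming, so its numbers will not exactly match those of official ROUGE tooling. Scores are commonly reported multiplied by 100, so a score of 49 would correspond to an F1 of roughly 0.49 here. The example sentences are invented.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1(candidate, reference):
    """Return (precision, recall, f1) from clipped unigram overlap."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    p, r, f = rouge1(
        "attackers exploited a vpn flaw",
        "attackers exploited a flaw in the vpn gateway",
    )
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```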


from HuggingFace Space



https://huggingface.co/venkycs/securityShots

How to use?

You can scrape a website and send the body HTML to the model after only light cleaning; the model will intelligently ignore the remaining tags.

Here is the preprocessing script: https://github.com/venkycs/urlShots/blob/main/urlShots.py

The model uses no more than 2GB of RAM and can run on a CPU, with an inference time of 2-3 seconds. You can use news aggregation services and preprocess their content to generate your own version of the PDB. Please contact me if you need the dataset, which I cannot share publicly because it contains scraped content.
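The light cleaning step can be sketched with the Python standard library alone. This is not the linked `urlShots.py` script; it is a minimal stand-in that drops `script`, `style`, and `noscript` content and collapses whitespace. The names `clean_html`, `summarise_page`, and the `max_chars` limit are hypothetical, and the summariser is passed in as a plain callable so that no particular model-loading API is assumed.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring script/style/noscript contents."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def clean_html(html):
    """Return the visible text of an HTML document as one line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def summarise_page(html, summariser, max_chars=4000):
    """Clean the page, cap its length, and pass it to `summariser`.

    `summariser` is any callable mapping text to a summary string;
    `max_chars` is an arbitrary illustrative limit, not a model constant.
    """
    return summariser(clean_html(html)[:max_chars])
```

With the `transformers` library installed, `summariser` could wrap a summarisation model loaded from `venkycs/securityShots`; how that repository is best loaded is an assumption to verify against its model card.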

Towards more complex applications in CyberSec

Some of the use cases for GenAI are listed below:

  1. Generate recommendations for a particular APT group.
  2. Identify possible APT groups exhibiting behaviour X.
  3. Anonymise logs.
  4. Detect PII or PHI in logs.
  5. Extract IP addresses, usernames, or context from logs.
  6. Build LangChain chains to implement SOAR functionality; SOAR could arguably be replaced for good.
  7. Automate security ticket communication based on the incident.
  8. Detect vulnerabilities in source code.
  9. Detect malicious packages in repositories such as PyPI, npm, etc.
  10. Explain CVEs, MITRE techniques, etc.
  11. Generate policy documents, for example an AUP or InfoSec policy aligned with standards X, Y, Z.
  12. Interactive SecOps bot that answers questions such as: who is on shift with me, which SLAs were violated, and what mistakes does user X often make?
  13. Generate and automate threat hunting plans using LangChain and a fine-tuned LLM.
  14. Generate design and implementation steps for a new security device using an LLM.
  15. Explain or analyse logs with a detailed description.
  16. Classify documents.
  17. Many more in threat hunting ...
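For items 3 and 5, it helps to have a deterministic baseline to compare model output against. The sketch below is deliberately not GenAI: it uses a regular expression plus the standard `ipaddress` module to extract IPv4 addresses and replace them with stable placeholders. The function names, the placeholder format `<IP_n>`, and the sample log line are all hypothetical; IPv6, usernames, and other identifiers are out of scope here.

```python
import ipaddress
import re

# Four dot-separated groups of 1-3 digits, not embedded in a longer
# dotted number. Validity (0-255 per octet) is checked separately.
_IPV4_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?!\d|\.\d)")


def _is_ipv4(text):
    try:
        ipaddress.IPv4Address(text)
    except ValueError:
        return False
    return True


def extract_ipv4(line):
    """Return valid IPv4 addresses in order of appearance."""
    return [m for m in _IPV4_CANDIDATE.findall(line) if _is_ipv4(m)]


def anonymise_ipv4(line):
    """Replace each distinct valid IPv4 address with a stable placeholder.

    Returns the rewritten line and the address-to-placeholder mapping.
    """
    mapping = {}

    def _sub(match):
        ip = match.group(0)
        if not _is_ipv4(ip):
            return ip  # leave invalid candidates such as 1.2.3.300 untouched
        return mapping.setdefault(ip, f"<IP_{len(mapping) + 1}>")

    return _IPV4_CANDIDATE.sub(_sub, line), mapping


if __name__ == "__main__":
    log = "Failed login for admin from 10.0.0.5 port 22; retry from 10.0.0.5, then 172.16.4.9."
    print(extract_ipv4(log))
    print(anonymise_ipv4(log)[0])
```

Comparing a model's extraction or anonymisation output against a baseline like this is one way to spot missed or hallucinated addresses.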


To get started on building an LLM for security professionals, I used Stanford Alpaca techniques to create a dataset, generating examples for a few of the tasks listed above. The resulting model shows zero-shot capability on a few cases such as IP extraction, log anonymisation, and MITRE explanation.
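Alpaca-style datasets consist of instruction, input, and output triples that are rendered into a fixed prompt before training. The sketch below uses the prompt wording published in the Stanford Alpaca repository; whether the security dataset uses exactly this template is an assumption, and the sample record is invented for illustration.

```python
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_prompt(record):
    """Render an instruction record into an Alpaca-style prompt."""
    template = PROMPT_WITH_INPUT if record.get("input") else PROMPT_NO_INPUT
    return template.format(
        instruction=record["instruction"], input=record.get("input", "")
    )


def build_training_text(record):
    """Prompt followed by the expected output, as used for fine-tuning."""
    return build_prompt(record) + "\n" + record["output"]


if __name__ == "__main__":
    # Hypothetical record, not taken from the actual dataset.
    sample = {
        "instruction": "Extract all IP addresses from the log line.",
        "input": "Denied 10.1.2.3 -> 8.8.8.8",
        "output": "10.1.2.3, 8.8.8.8",
    }
    print(build_training_text(sample))
```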

Model - https://huggingface.co/venkycs/llama-v2-7b-32kC-Security

Dataset - https://huggingface.co/datasets/venkycs/llm4security

The model was fine-tuned from LLaMA 7B with a 32K context length. I had a few reasons for choosing this more complex model, chiefly that my idea is to support semantic search in threat-based applications. It uses PEFT for optimisation, so the adapters need to be loaded. The code is complicated, so if you have a background in AI and want to know more, feel free to get in touch with me. The model may need more training on new information to work better. That might be a separate topic, depending on how much time I have.

Conclusion

Researchers are publishing many interesting solutions to long-standing problems using new AI models, particularly in NLP and NLU. However, in my opinion, LLMs are similar to knowledge bases: they require specific domain experience to apply correctly and to meet expectations. In other words, knowledge needs expertise before it can be turned into a domain-specific skill set. It is important for domain experts to understand how AI can be applied in their own domains. We are limited only by domain-specific implementation ideas and the right datasets; things are not as far off as they seem.

Running such a service takes effort and can be tedious. To address the same problem of producing news and summaries, I created a mobile app (AttackIO) for cybersecurity professionals; if you are one, you should try it.







