CyberSecurity Feed Summarisation with Context using AI
One of the challenges security professionals face is the need to stay abreast of current security trends. However, the constant flood of digital content throughout the day is a major hurdle in achieving this.
The President's Daily Brief, sometimes referred to as the President's Daily Briefing or the President's Daily Bulletin, is a top-secret document produced and given each morning to the president of the United States; it is also distributed to a small number of top-level US officials who are approved by the president. It includes highly classified intelligence analysis, information about covert operations, and reports from the most sensitive US sources or those shared by allied intelligence agencies.
Motivation: imagine the boost you would get from having your very own personalised President's Daily Brief (PDB)!
This article presents an overview of one of the AI models I developed for summarising cybersecurity news. It is hosted on Hugging Face and has just over 400 million parameters. It delivers state-of-the-art performance, with a ROUGE-1 score above 49 that keeps it ahead of today's top models on a curated list of state-of-the-art models and their metric scores.
Because I work in the cybersecurity domain and assume the audience leans the same way, I will keep the description of how I built it high-level.
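For readers less familiar with the metric, ROUGE-1 measures unigram overlap between a generated summary and a reference summary. The sketch below is a simplified illustration only: it uses plain lowercase word tokens with no stemming, so its numbers will not match official ROUGE tooling exactly, and the two example sentences are made up. Reported scores such as 49 are normally this kind of F1 value on a 0-100 scale.

```python
import re
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """Simplified ROUGE-1: unigram precision, recall and F1 (no stemming)."""
    ref = Counter(re.findall(r"\w+", reference.lower()))
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    # Counter intersection keeps the minimum count per word (clipped matches).
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example sentences, for illustration only.
scores = rouge1(
    "attackers exploited a vpn flaw to deploy ransomware",
    "ransomware deployed through a vpn flaw",
)
print({name: round(value, 3) for name, value in scores.items()})
```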
I used news emails sent out by service providers such as Mandiant, RiskIQ, Microsoft, and others, extracting each summary and the link to the full story. From these I created a dataset containing just two fields: the summary and the full news text from the web page. I also added some scraped information from selected cybersecurity websites.
The task does not require an LLM, and the model should be able to run on a personal computer with modest resources. I chose BART since it has been shown to perform well in similar tasks. The dataset has over 6K examples, and the model was trained on an RTX 4090 for several hours (~14).
The results are much better than I expected, with scores better than those of state-of-the-art models.
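A two-field dataset like this could be stored as JSON Lines, one example per line. The sketch below is only an illustration: the field names `summary` and `article`, the class name, and the placeholder text are assumptions, not the actual schema or data.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class NewsExample:
    summary: str  # short summary taken from the newsletter email
    article: str  # full text scraped from the linked web page


def to_jsonl(examples) -> str:
    """Serialise examples as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(asdict(e), ensure_ascii=False) for e in examples)


# Placeholder content, not real data.
examples = [
    NewsExample(summary="Placeholder summary.", article="Placeholder full article text."),
]
jsonl = to_jsonl(examples)
print(jsonl)
```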
How to use?
You can scrape a website and send the body HTML with only a little preprocessing and cleaning; the model will intelligently remove tags.
Here is the preprocessing script https://github.com/venkycs/urlShots/blob/main/urlShots.py
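If you prefer not to depend on that script, a light cleaning pass can be sketched with the Python standard library alone. This is not the urlShots.py script linked above; the class and function names are made up for illustration. It goes a little further than strictly needed by dropping all tags and the contents of script and style blocks.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style/noscript contents."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def clean_html(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><body><script>var x=1;</script><h1>Patch released</h1>"
    "<p>Vendor fixes <b>critical</b> flaw.</p></body></html>"
)
cleaned = clean_html(sample)
print(cleaned)
```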
The model uses no more than 2 GB of RAM and can run on a CPU, with an inference time of 2-3 seconds. You can use news aggregation services and preprocess their content to generate your own version of the PDB. Please contact me if you need the dataset, which I cannot share publicly because it contains scraped content.
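Putting the pieces together, a personal brief could be assembled roughly as follows. This is a sketch under assumptions: `MODEL_ID` is a hypothetical placeholder for the Hugging Face checkpoint, the 1024-token input limit and the generation settings are guesses for a BART-style model, and `build_brief` is an illustrative helper, not part of any library. The model is loaded lazily, so the brief-building logic can be tried with any text-to-summary function.

```python
from typing import Callable, Iterable, Tuple

MODEL_ID = "your-username/your-summarisation-model"  # hypothetical placeholder


def load_summarizer(model_id: str = MODEL_ID) -> Callable[[str], str]:
    """Load a seq2seq checkpoint on CPU and return a text -> summary function."""
    # Imported lazily so the rest of the module works without transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)  # CPU by default
    model.eval()

    def summarize(text: str) -> str:
        # Assumed input limit of 1024 tokens; longer articles are truncated.
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
        output_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
        return tokenizer.decode(output_ids[0], skip_special_tokens=True)

    return summarize


def build_brief(
    articles: Iterable[Tuple[str, str]], summarize: Callable[[str], str]
) -> str:
    """Turn (title, cleaned_body) pairs into a numbered daily brief."""
    lines = []
    for number, (title, body) in enumerate(articles, start=1):
        lines.append(f"{number}. {title}: {summarize(body)}")
    return "\n".join(lines)
```

Usage would look like `print(build_brief(articles, load_summarizer()))`, where `articles` comes from your news aggregation source after cleaning.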
Towards more complex applications in CyberSec
Some of the use cases for GenAI are listed below:
To get started towards building an LLM for security pros, I have already used Stanford Alpaca techniques to create a dataset and generated a few of the specific tasks listed above. Overall, zero-shot performance was achieved on a few cases such as IP extraction, log anonymization, and MITRE explanation.
The model was tuned from LLaMA 7B with a 32K context length. I have a few reasons for choosing this more complex model: my idea is to have semantic search in threat-based applications. It uses PEFT for optimisation and needs adapters to be loaded. The code is complicated, so if you have a background in AI and want to know more, feel free to get in touch with me. It might need more training with new information to work better. That may be a separate topic, depending on how much time I have.
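Alpaca-style datasets store each example as an instruction, an optional input, and an expected output. The sketch below shows what one IP-extraction record could look like. It is an illustration rather than the actual pipeline: Alpaca generated its data with a language model, whereas here the output field is filled in by a regular expression, and the instruction wording and log line are invented.

```python
import ipaddress
import json
import re

# Loose pattern for IPv4-looking strings; validity is checked afterwards.
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_ips(text: str) -> list:
    """Return valid IPv4 addresses in order of first appearance."""
    found = []
    for match in IPV4_CANDIDATE.findall(text):
        try:
            ipaddress.IPv4Address(match)  # rejects octets above 255
        except ValueError:
            continue
        if match not in found:
            found.append(match)
    return found


def make_ip_extraction_record(log_line: str) -> dict:
    """Build one instruction/input/output record in the Alpaca layout."""
    return {
        "instruction": "Extract all IPv4 addresses from the following log line.",
        "input": log_line,
        "output": ", ".join(extract_ips(log_line)) or "None",
    }


record = make_ip_extraction_record(
    "Failed login from 203.0.113.7 to 10.0.0.5; ignored 999.1.1.1"
)
print(json.dumps(record, indent=2))
```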
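For readers who have not used PEFT, loading adapters on top of a base model generally looks like the sketch below. This is not the actual code for this model: the function name is illustrative, both arguments are placeholders you would supply yourself, and it assumes the `transformers` and `peft` packages are installed.

```python
def load_adapted_model(base_model_id: str, adapter_path: str):
    """Load a causal LM and attach PEFT adapter weights on top of it.

    base_model_id and adapter_path are placeholders supplied by the caller.
    """
    # Imported lazily so that defining the function has no heavy dependencies.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    base_model = AutoModelForCausalLM.from_pretrained(base_model_id)
    # The adapter weights are small; the base model stays unchanged underneath.
    model = PeftModel.from_pretrained(base_model, adapter_path)
    model.eval()
    return tokenizer, model
```

Keeping the adapters separate from the base weights means different task-specific adapters can be swapped in without copying the 7B base model each time.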
Conclusion
Researchers are publishing a lot of interesting solutions to long-standing problems through new AI models, particularly in NLP and NLU. However, in my opinion, LLMs are similar to knowledge bases that require specific domain experience to apply correctly in order to meet expectations. In other words, knowledge needs expertise before it can be turned into a domain-specific skill set. It is important for domain experts to understand how AI can be applied in their own domains. We are limited only by domain-specific implementation ideas and the right dataset; things are not as far off as they seem.
Running such a service takes effort and can be tedious. To address the same problem of producing news and summaries, I created a mobile app (AttackIO) for cybersecurity pros; if you are one, you should try it.