Data Science Milan #009

Data Science Milan

The Community of Data Scientists and Machine Learning Practitioners based in the Greater Milan area.

Published: April 30, 2024

Dear Data Science Milan Community,

Welcome back to our newsletter, bringing you another edition packed with the latest developments, inspiring projects, and invaluable insights from the world of data science!

In previous editions of our newsletter, we have explored how transformers can be utilized for classification, summarization, and even time-series forecasting.

But have you ever wondered how transformers, in their most renowned role as generative models, decide which word to insert next? I mean... how?

Generative models adopt decoding methods to determine which word (aka token) to use. Choosing the right decoding method is crucial and comes with significant trade-offs:

Greedy Search Decoding operates on a simple principle: at each step, it selects the single word with the highest probability of coming next in the sequence. This method is fast and computationally inexpensive, making it appealing for tasks that require quick responses. However, its major drawback is that it often leads to suboptimal solutions, since it never revisits or corrects its choices, and the generated text goes BLA BLA (yeah, it's repetitive). Anyway, greedy search is ideal for real-time applications, such as speech recognition, where speed is more critical than absolute accuracy.
Beam Search Decoding is where things get serious: it offers a more sophisticated approach. Instead of choosing the single best word at each step, it considers multiple possibilities (or parallel universes), keeping the top 'k' most probable sequences at each point in the text generation process. This parameter 'k' is known as the beam width. Yes, beam search is more computationally intensive than greedy search, but it typically results in higher-quality outputs because it balances breadth and depth of search. It is particularly beneficial in applications like machine translation or complex question-answering systems, where output quality is paramount.
Sampling Methods provide another approach, where the next word is drawn at random from a probability distribution rather than always being the most probable one. Top-k sampling limits the choice to the 'k' most probable candidates, which rules out the least probable words. Meanwhile, top-p sampling (or nucleus sampling) draws from the smallest set of words whose cumulative probability reaches a threshold 'p', focusing on a more likely subset while allowing for more diversity than greedy or beam search. Sampling methods can generate more varied and human-like text, making them suitable for creative writing and chatbots, where unpredictability enhances the conversational quality.
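To make the three families above more concrete, here is a minimal sketch in plain Python. It does not use a real language model: `TOY_MODEL` is a hypothetical hand-written table of next-word probabilities, and all function names (`greedy_decode`, `beam_search`, `top_k_filter`, `top_p_filter`, `sample_decode`) are our own illustrative choices, not part of any library.

```python
import math
import random

# Hypothetical toy "language model": last token -> next-token probabilities.
TOY_MODEL = {
    "<s>": {"the": 0.5, "a": 0.4, "cat": 0.1},
    "the": {"dog": 0.4, "cat": 0.35, "bird": 0.25},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
    "bird": {"</s>": 1.0},
}


def next_probs(tokens):
    """Return the next-token distribution given the tokens so far."""
    return TOY_MODEL[tokens[-1]]


def greedy_decode(max_len=5):
    """Always append the single most probable next token."""
    tokens = ["<s>"]
    while tokens[-1] != "</s>" and len(tokens) < max_len:
        probs = next_probs(tokens)
        tokens.append(max(probs, key=probs.get))
    return tokens


def beam_search(beam_width=2, max_len=5):
    """Keep the `beam_width` best partial sequences at every step."""
    beams = [(0.0, ["<s>"])]  # (cumulative log-probability, tokens)
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":  # finished sequences are carried over
                candidates.append((score, tokens))
                continue
            for tok, p in next_probs(tokens).items():
                candidates.append((score + math.log(p), tokens + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(tokens[-1] == "</s>" for _, tokens in beams):
            break
    return beams[0][1]


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest top set whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


def sample_decode(filter_fn, rng, max_len=5):
    """Sample each next token from the filtered distribution."""
    tokens = ["<s>"]
    while tokens[-1] != "</s>" and len(tokens) < max_len:
        probs = filter_fn(next_probs(tokens))
        choices, weights = zip(*probs.items())
        tokens.append(rng.choices(choices, weights=weights)[0])
    return tokens


print("greedy:", greedy_decode())
print("beam  :", beam_search(beam_width=2))
print("top-k :", sample_decode(lambda d: top_k_filter(d, 2), random.Random(0)))
print("top-p :", sample_decode(lambda d: top_p_filter(d, 0.8), random.Random(0)))
```

On this toy table, greedy search commits to "the" (0.5) and ends with "the dog", a sequence of probability 0.5 × 0.4 = 0.20, while beam search with width 2 keeps "a" alive and finds "a cat" with probability 0.4 × 0.9 = 0.36. The sampling variants return different sequences depending on the random seed, which is exactly the diversity described above.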

All decoding methods have their merits and are chosen based on the requirements and constraints of the NLP task. Indeed, to choose between one method and another, it is important to find the right evaluation metric, as suggested in our article on text summarization evaluation.

Data Science Milan events

Data Science applications in Cybersecurity

Application of Graph Theory To Anomaly Detection in Cybersecurity: an Example - Alberto Mazzetto, Artificial Intelligence Modelling Engineer at Ferrari Racing

The scale and complexity of cyber-attacks have been increasing dramatically in recent years, making it necessary to accompany rule-based detections with statistically principled anomaly detection. Alberto explained how graph theory applies to this problem and reviewed global and local modelling approaches. He demonstrated one possible local approach based on a Bayesian conjugate model, the Dirichlet process, that allows for fast, scalable, explainable computations. He then explored a global-flavoured methodology, based on graph variational auto-encoders, aimed at reducing the number of false positives.

A Data-Driven Approach to Cybersecurity - Luigi De Luca, Data Scientist at Data Reply

In today’s data-driven world, Big Data and Data Science have become indispensable tools for handling large volumes of data and deriving actionable insights from them. As cyber threats continue to evolve, traditional cybersecurity methods have proven insufficient to defend effectively against modern attacks, so Data Analytics plays a crucial role in the field of cybersecurity. Luigi explored the benefits that a data-driven approach brings to cybersecurity, with a focus on three use cases that are subcases of anomaly detection: "UEBA", "malware detection" and "DGA detection". For each of these three use cases, he explained the improvements compared to traditional methods and how to implement the solution.

Watch the video

Alkemy’s GenAI ecosystem

On February 20th, 2024, Marcello Villa presented Alkemy’s GenAI ecosystem and some of the use cases they are working on. Shifting perspective from the clients to the developers, in the second part Davide Posillipo reflected on how the latest Generative AI applications are impacting our field, Data Science, and what we can expect to happen to our profession in the future. As an example of new ways of working, in the final part Milica Cvjeticanin talked about an unconventional Transformer model. Modern LLM architectures based on Transformers are an extremely powerful tool for solving a variety of problems, yet they are mostly cited in the context of natural language processing. By combining meta-learning, a Bayesian Neural Network (BNN) prior and the Transformer architecture, the application field of transformer-based models can be expanded to classification problems with tabular data. Milica showed an example of these models, named TabPFN, which could compete with the best-known Machine Learning algorithms on these classical ML tasks, and pointed out why this model is worth keeping an eye on.

Watch the video

"BRIOxAlkemy: A bias detecting tool"

On December 13th, 2023, Greta Coraglia and Davide Posillipo spoke about a bias-detecting tool.

The aim of the collaboration between BRIO and Alkemy is to develop software applications for analyzing bias, risk, and opacity in AI technologies, which often rely on non-deterministic computations and are inherently opaque. The first tool produced by the BRIOxAlkemy collaboration is designed for the detection and analysis of biased behaviours in AI systems, and comes complete with its theoretical foundation. This tool targets developers and data scientists who wish to evaluate their algorithms (those that depend on probabilistic and learning mechanisms) to identify and document any biases or related misbehaviours. A live demo showcased the tool, highlighting the commitment to open-source and collaborative development.

Watch the video

Knowledge section

Here are some selected resources about decoding methods:

A great tutorial from Hugging Face with examples and code snippets to become familiar with decoding methods for LLMs - How to generate text
This paper gives a survey on both deterministic and stochastic methods used to generate text from large language models (LLMs) in natural language processing tasks - A Thorough Examination of Decoding Methods in the Era of LLMs

Be involved!

We also want to remind you that if you enjoy our events, you can get in touch with us at [email protected] to get involved in organizing great new online activities.

We would also be very happy to hear from you if you are interested in being a speaker or if you want to share your expertise or experience with the Data Science Milan community!

Wallboard

Would you like to become one of our sponsors and increase your popularity among the Data Science community? Write here

If, instead, you would like to post a message on the wallboard, please contact us and send us your relevant announcements. We will publish them here.
