ChatGPT Over Time; LLMs on Graphs; Why Llama 2 Is the New ChatGPT Rival; OpenAI Playground For Beginners; and More
Danny Butvinik
Chief Data Scientist | 100K+ Followers | FinCrime | Writer | Author of AI Vanguard Newsletter
Editor's Paper Recommendations
How is ChatGPT's behavior changing over time? GPT-3.5 and GPT-4 are the most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on four diverse tasks: 1) solving math problems, 2) answering sensitive/dangerous questions, 3) generating code, and 4) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was very good at identifying prime numbers (accuracy 97.6%), but GPT-4 (June 2023) performed very poorly on these same questions (accuracy 2.4%). Interestingly, GPT-3.5 (June 2023) was much better than GPT-3.5 (March 2023) at this task. GPT-4 was less willing to answer sensitive questions in June than in March, and both GPT-4 and GPT-3.5 made more formatting mistakes in code generation in June than in March. Overall, our findings show that the behavior of the same LLM service can change substantially in a relatively short time, highlighting the need for continuous monitoring of LLM quality.
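The paper's prime-identification comparison comes down to scoring the same fixed question set against ground truth at two points in time and comparing accuracies. A minimal sketch of that monitoring loop, using hypothetical cached yes/no answers from two dated snapshots of the same service (the numbers below are illustrative, not the paper's data):

```python
def is_prime(n: int) -> bool:
    """Ground-truth primality check used to score model answers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def accuracy(answers: dict[int, bool]) -> float:
    """Fraction of model answers (number -> 'is it prime?') matching ground truth."""
    correct = sum(1 for n, says_prime in answers.items() if says_prime == is_prime(n))
    return correct / len(answers)


# Hypothetical answers from two snapshots of the same service on the same
# question set -- the point is comparing the two accuracy numbers over time.
march_answers = {7: True, 12: False, 13: True, 15: False}
june_answers = {7: False, 12: False, 13: False, 15: False}

print(accuracy(march_answers))  # 1.0
print(accuracy(june_answers))   # 0.5
```

Running the same fixed probe set on a schedule and alerting on accuracy drops is the "continuous monitoring" the authors argue for.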
FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets: Evaluation of Large Language Models (LLMs) is challenging because aligning to human values requires the composition of multiple skills, and the required set of skills varies depending on the instruction. Recent studies have evaluated the performance of LLMs in two ways: (1) automatic evaluation on several independent benchmarks, and (2) human- or model-based evaluation that assigns an overall score to the response. However, both settings are coarse-grained evaluations that do not consider the nature of user instructions requiring instance-wise skill composition, which limits the interpretation of the true capabilities of LLMs. This paper introduces FLASK (Fine-grained Language Model Evaluation based on Alignment SKill Sets), a fine-grained evaluation protocol usable for both model-based and human-based evaluation, which decomposes coarse-level scoring into instance-level skill-set scoring. Specifically, we define 12 fine-grained skills needed for LLMs to follow open-ended user instructions and construct an evaluation set by allocating a set of skills to each instance. Additionally, by annotating the target domain and difficulty level of each instance, FLASK provides a holistic view with a comprehensive analysis of a model's performance depending on skill, domain, and difficulty. Using FLASK, we compare open-source and proprietary LLMs and observe highly correlated findings between model-based and human-based evaluations. FLASK enables developers to measure model performance more accurately and to see how it can be improved by analyzing the factors that make LLMs proficient in particular skills. For practitioners, FLASK can be used to recommend suitable models for particular situations through comprehensive comparison among various LLMs. We release the evaluation data and code implementation at this https URL.
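The instance-level decomposition FLASK describes can be sketched as scoring each evaluated instance on its allocated skills, then aggregating per skill (and analogously per domain or difficulty). A minimal sketch with illustrative skill names, domains, and scores — not FLASK's actual rubric or data:

```python
from collections import defaultdict

# Each instance carries its allocated skill set with a per-skill score
# (1-5 scale assumed here), plus domain and difficulty annotations.
instances = [
    {"domain": "math", "difficulty": 3,
     "skills": {"logical_correctness": 4, "conciseness": 5}},
    {"domain": "coding", "difficulty": 2,
     "skills": {"logical_correctness": 2, "factuality": 3}},
]


def per_skill_mean(instances: list[dict]) -> dict[str, float]:
    """Aggregate instance-level skill scores into a per-skill average."""
    totals, counts = defaultdict(float), defaultdict(int)
    for inst in instances:
        for skill, score in inst["skills"].items():
            totals[skill] += score
            counts[skill] += 1
    return {skill: totals[skill] / counts[skill] for skill in totals}


print(per_skill_mean(instances))
# {'logical_correctness': 3.0, 'conciseness': 5.0, 'factuality': 3.0}
```

The same grouping over `domain` or `difficulty` yields the holistic breakdown the protocol reports.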
Exploring the Potential of Large Language Models (LLMs) in Learning on Graphs: Learning on graphs has attracted immense attention due to its wide real-world applications. The most popular pipeline for learning on graphs with textual node attributes relies primarily on Graph Neural Networks (GNNs) and utilizes shallow text embeddings as initial node representations, which limits general knowledge and deep semantic understanding. In recent years, Large Language Models (LLMs) have been shown to possess extensive common knowledge and powerful semantic comprehension abilities that have revolutionized existing workflows for handling text data. In this paper, we aim to explore the potential of LLMs in graph machine learning, especially the node classification task, and investigate two possible pipelines: LLMs-as-Enhancers and LLMs-as-Predictors. The former leverages LLMs to enhance nodes' text attributes with their massive knowledge and then generates predictions through GNNs. The latter attempts to employ LLMs directly as standalone predictors. We conduct comprehensive and systematic studies of these two pipelines under various settings. From comprehensive empirical results, we make original observations, find new insights that open new possibilities, and suggest promising directions for leveraging LLMs for learning on graphs.
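The two pipelines differ in where the LLM sits: as a feature extractor upstream of a GNN, or as the classifier itself. A minimal sketch of that split, with the LLM calls mocked by trivial stand-ins (a real setup would query an actual model and train a real GNN):

```python
def llm_enhance(text: str) -> list[float]:
    """LLMs-as-Enhancers: turn a node's text attribute into a richer
    feature vector for a downstream GNN (mocked as character statistics)."""
    return [float(len(text)), float(sum(c.isupper() for c in text))]


def llm_predict(text: str, labels: list[str]) -> str:
    """LLMs-as-Predictors: ask the LLM to classify the node directly
    (mocked as a keyword match, falling back to the first label)."""
    lowered = text.lower()
    return next((label for label in labels if label in lowered), labels[0])


node_text = "A survey of graph neural networks"
labels = ["graph", "vision", "nlp"]

features = llm_enhance(node_text)          # would feed a GNN downstream
prediction = llm_predict(node_text, labels)
print(prediction)  # graph
```

The enhancer route keeps the GNN's structural inductive bias and only upgrades node features; the predictor route discards graph structure unless it is serialized into the prompt — the trade-off the paper's experiments probe.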
A Survey on Evaluation of Large Language Models: Large language models (LLMs) are gaining popularity in academia and industry due to their unprecedented performance in various applications. As LLMs continue to play a vital role in research and daily use, their evaluation becomes increasingly critical, both at the task level and at the societal level, for a better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. First, we provide an overview of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Second, we answer the `where' and `how' questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize LLMs' success and failure cases on different tasks. Finally, we shed light on several future challenges in LLM evaluation. We aim to offer invaluable insights to researchers evaluating LLMs, thereby aiding the development of more proficient ones. Our key point is that evaluation should be treated as an essential discipline to better support the development of LLMs. We consistently maintain the related open-source materials at: this https URL.
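The survey's three-axis framing — what to evaluate (task), where (benchmark), and how (method) — maps naturally onto a small record type that evaluation results can be filed under and grouped by. A minimal sketch with illustrative entries (the benchmark names and scores are placeholders, not the survey's data):

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    what: str     # task dimension, e.g. "reasoning", "ethics"
    where: str    # benchmark or evaluation venue
    how: str      # evaluation method: "automatic" or "human"
    score: float  # result on this benchmark


records = [
    EvalRecord(what="reasoning", where="GSM8K", how="automatic", score=0.74),
    EvalRecord(what="ethics", where="expert-panel", how="human", score=0.61),
]

# Group scores along the "how" axis.
by_how: dict[str, list[float]] = {}
for record in records:
    by_how.setdefault(record.how, []).append(record.score)
print(by_how)  # {'automatic': [0.74], 'human': [0.61]}
```

Grouping along `what` or `where` instead gives the other two views the survey organizes its review around.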
--
Are you looking to advertise a product, job opening, or event to an audience of over 35,000 AI researchers and engineers? Get in touch with us at [email protected] to explore your options.
Enjoy the newsletter? Help us make it bigger and better by sharing it with colleagues and friends.
--