ChatGPT Over Time; LLMs on Graphs; Why Llama 2 Is the New ChatGPT Rival; OpenAI Playground For Beginners; and More
Danny Butvinik
Chief Data Scientist | 100K+ Followers | FinCrime | Writer | Author of AI Vanguard Newsletter
Editor's Paper Recommendations
How is ChatGPT's behavior changing over time? GPT-3.5 and GPT-4 are the most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on four diverse tasks: 1) solving math problems, 2) answering sensitive/dangerous questions, 3) generating code, and 4) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was very good at identifying prime numbers (accuracy 97.6%), but GPT-4 (June 2023) performed very poorly on these same questions (accuracy 2.4%). Interestingly, GPT-3.5 (June 2023) was much better than GPT-3.5 (March 2023) at this task. GPT-4 was less willing to answer sensitive questions in June than in March, and both GPT-4 and GPT-3.5 made more formatting mistakes in code generation in June than in March. Overall, our findings show that the behavior of the same LLM service can change substantially in a relatively short time, highlighting the need for continuous monitoring of LLM quality.
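The paper's prime-identification comparison comes down to scoring the same fixed question set against ground truth at two points in time and comparing accuracies. A minimal sketch of that monitoring loop, using hypothetical cached yes/no answers from two dated snapshots of the same service (the numbers below are illustrative, not the paper's data):

```python
def is_prime(n: int) -> bool:
    """Ground-truth primality check used to score model answers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def accuracy(answers: dict[int, bool]) -> float:
    """Fraction of model answers (number -> 'is it prime?') matching ground truth."""
    correct = sum(1 for n, says_prime in answers.items() if says_prime == is_prime(n))
    return correct / len(answers)


# Hypothetical answers from two snapshots of the same service on the same
# question set -- the point is comparing the two accuracy numbers over time.
march_answers = {7: True, 12: False, 13: True, 15: False}
june_answers = {7: False, 12: False, 13: False, 15: False}

print(accuracy(march_answers))  # 1.0
print(accuracy(june_answers))   # 0.5
```

Running the same fixed probe set on a schedule and alerting on accuracy drops is the "continuous monitoring" the authors argue for.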
FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets: Evaluation of Large Language Models (LLMs) is challenging because aligning to human values requires the composition of multiple skills, and the required set of skills varies depending on the instruction. Recent studies have evaluated the performance of LLMs in two ways: (1) automatic evaluation on several independent benchmarks, and (2) human- or model-based evaluation that assigns an overall score to the response. However, both settings are coarse-grained evaluations that do not consider the nature of user instructions requiring instance-wise skill composition, which limits the interpretation of the true capabilities of LLMs. This paper introduces FLASK (Fine-grained Language Model Evaluation based on Alignment SKill Sets), a fine-grained evaluation protocol usable for both model-based and human-based evaluation, which decomposes coarse-level scoring into instance-level skill-set scoring. Specifically, we define 12 fine-grained skills needed for LLMs to follow open-ended user instructions and construct an evaluation set by allocating a set of skills to each instance. Additionally, by annotating the target domain and difficulty level of each instance, FLASK provides a holistic view with a comprehensive analysis of a model's performance depending on skill, domain, and difficulty. Using FLASK, we compare open-source and proprietary LLMs and observe highly correlated findings between model-based and human-based evaluations. FLASK enables developers to measure model performance more accurately and to see how it can be improved by analyzing the factors that make LLMs proficient in particular skills. For practitioners, FLASK can be used to recommend suitable models for particular situations through comprehensive comparison among various LLMs. We release the evaluation data and code implementation at this https URL.
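The instance-level decomposition FLASK describes can be sketched as scoring each evaluated instance on its allocated skills, then aggregating per skill (and analogously per domain or difficulty). A minimal sketch with illustrative skill names, domains, and scores — not FLASK's actual rubric or data:

```python
from collections import defaultdict

# Each instance carries its allocated skill set with a per-skill score
# (1-5 scale assumed here), plus domain and difficulty annotations.
instances = [
    {"domain": "math", "difficulty": 3,
     "skills": {"logical_correctness": 4, "conciseness": 5}},
    {"domain": "coding", "difficulty": 2,
     "skills": {"logical_correctness": 2, "factuality": 3}},
]


def per_skill_mean(instances: list[dict]) -> dict[str, float]:
    """Aggregate instance-level skill scores into a per-skill average."""
    totals, counts = defaultdict(float), defaultdict(int)
    for inst in instances:
        for skill, score in inst["skills"].items():
            totals[skill] += score
            counts[skill] += 1
    return {skill: totals[skill] / counts[skill] for skill in totals}


print(per_skill_mean(instances))
# {'logical_correctness': 3.0, 'conciseness': 5.0, 'factuality': 3.0}
```

The same grouping over `domain` or `difficulty` yields the holistic breakdown the protocol reports.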
Exploring the Potential of Large Language Models (LLMs) in Learning on Graphs: Learning on graphs has attracted immense attention due to its wide real-world applications. The most popular pipeline for learning on graphs with textual node attributes relies primarily on Graph Neural Networks (GNNs) and utilizes shallow text embeddings as initial node representations, which limits general knowledge and deep semantic understanding. In recent years, Large Language Models (LLMs) have been shown to possess extensive common knowledge and powerful semantic comprehension abilities that have revolutionized existing workflows for handling text data. In this paper, we aim to explore the potential of LLMs in graph machine learning, especially the node classification task, and investigate two possible pipelines: LLMs-as-Enhancers and LLMs-as-Predictors. The former leverages LLMs to enhance nodes' text attributes with their massive knowledge and then generates predictions through GNNs. The latter attempts to employ LLMs directly as standalone predictors. We conduct comprehensive and systematic studies of these two pipelines under various settings. From comprehensive empirical results, we make original observations, find new insights that open new possibilities, and suggest promising directions for leveraging LLMs for learning on graphs.
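The two pipelines differ in where the LLM sits: as a feature extractor upstream of a GNN, or as the classifier itself. A minimal sketch of that split, with the LLM calls mocked by trivial stand-ins (a real setup would query an actual model and train a real GNN):

```python
def llm_enhance(text: str) -> list[float]:
    """LLMs-as-Enhancers: turn a node's text attribute into a richer
    feature vector for a downstream GNN (mocked as character statistics)."""
    return [float(len(text)), float(sum(c.isupper() for c in text))]


def llm_predict(text: str, labels: list[str]) -> str:
    """LLMs-as-Predictors: ask the LLM to classify the node directly
    (mocked as a keyword match, falling back to the first label)."""
    lowered = text.lower()
    return next((label for label in labels if label in lowered), labels[0])


node_text = "A survey of graph neural networks"
labels = ["graph", "vision", "nlp"]

features = llm_enhance(node_text)          # would feed a GNN downstream
prediction = llm_predict(node_text, labels)
print(prediction)  # graph
```

The enhancer route keeps the GNN's structural inductive bias and only upgrades node features; the predictor route discards graph structure unless it is serialized into the prompt — the trade-off the paper's experiments probe.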
A Survey on Evaluation of Large Language Models: Large language models (LLMs) are gaining popularity in academia and industry due to their unprecedented performance in various applications. As LLMs continue to play a vital role in research and daily use, their evaluation becomes increasingly critical, both at the task level and at the societal level, for a better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. First, we provide an overview of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Second, we answer the `where' and `how' questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize LLMs' success and failure cases on different tasks. Finally, we shed light on several future challenges in LLM evaluation. We aim to offer invaluable insights to researchers evaluating LLMs, thereby aiding the development of more proficient ones. Our key point is that evaluation should be treated as an essential discipline to better support the development of LLMs. We consistently maintain the related open-source materials at: this https URL.
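The survey's three-axis framing — what to evaluate (task), where (benchmark), and how (method) — maps naturally onto a small record type that evaluation results can be filed under and grouped by. A minimal sketch with illustrative entries (the benchmark names and scores are placeholders, not the survey's data):

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    what: str     # task dimension, e.g. "reasoning", "ethics"
    where: str    # benchmark or evaluation venue
    how: str      # evaluation method: "automatic" or "human"
    score: float  # result on this benchmark


records = [
    EvalRecord(what="reasoning", where="GSM8K", how="automatic", score=0.74),
    EvalRecord(what="ethics", where="expert-panel", how="human", score=0.61),
]

# Group scores along the "how" axis.
by_how: dict[str, list[float]] = {}
for record in records:
    by_how.setdefault(record.how, []).append(record.score)
print(by_how)  # {'automatic': [0.74], 'human': [0.61]}
```

Grouping along `what` or `where` instead gives the other two views the survey organizes its review around.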
--
Are you looking to advertise a product, job opening, or event to an audience of over 35,000 AI researchers and engineers? Get in touch with us at [email protected] to explore your options.
Enjoy the newsletter? Help us make it bigger and better by sharing it with colleagues and friends.
--