Give Us the Facts: Large Language Models vs. Knowledge Graphs

In this age of LLMs and generative AI, do we still need knowledge graphs (KGs) as a way to collect and organize domain and world knowledge, or should we just switch to language models and rely on their abilities to absorb knowledge from massive training datasets?

An early paper from 2019 [1] posited that, compared to KGs, language models adapt to new data more easily without human supervision, and they allow users to query an open class of relations without much restriction. To measure this knowledge-encoding capability, the authors constructed the LAMA (Language Model Analysis) probe, where facts are turned into cloze statements and language models are asked to predict the masked words (screenshot 1). The results show that even without specialized training, language models such as BERT-large can already retrieve a decent amount of facts from their weights (screenshot 2).
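The cloze construction at the heart of LAMA can be sketched in a few lines. This is a toy illustration, not the paper's actual code: the relation names and templates below are made up for the example.

```python
# Minimal sketch of LAMA-style cloze construction: a (subject, relation,
# object) triple becomes a sentence with the object replaced by [MASK],
# and the model is asked to fill in the mask. Templates are illustrative.

TEMPLATES = {
    "capital_of": "{obj} is the capital of {subj}.",
    "born_in": "{subj} was born in {obj}.",
}

def to_cloze(subj: str, rel: str, obj: str) -> tuple[str, str]:
    """Turn a fact triple into a masked cloze statement.

    Returns the masked sentence and the gold answer the model must predict.
    """
    sentence = TEMPLATES[rel].format(subj=subj, obj="[MASK]")
    return sentence, obj

cloze, gold = to_cloze("France", "capital_of", "Paris")
print(cloze)  # [MASK] is the capital of France.
print(gold)   # Paris
```

In practice the masked sentence is fed to a masked language model (e.g. via a fill-mask head on BERT), and the model's top predictions for `[MASK]` are compared against the gold object.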

But is that all? A recent paper revisits this question and offers a different take [2]. The authors believe just testing isolated fact retrieval is not sufficient to demonstrate the power of KGs. Instead, they focus on more intricate topological and semantic attributes of facts, and propose 9 benchmarks testing modern LLMs’ capability in retrieving facts with the following attributes: symmetry, asymmetry, hierarchy, bidirectionality, compositionality, paths, entity-centricity, bias and ambiguity (screenshot 3 & 4).

Each benchmark, instead of asking LLMs only to fill in masked words from a cloze statement, also asks them to retrieve all of the implied facts, and computes scores accordingly (screenshot 5). Their results show that even #GPT4 achieves only 23.7% hit@1 on average, even though it scores up to 50% precision@1 on the earlier LAMA benchmark (screenshot 6). Interestingly, smaller models like #BERT can outperform GPT4 on the bidirectionality, compositionality, and ambiguity benchmarks, indicating that bigger is not necessarily better.
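The stricter evaluation idea can be sketched with a toy example. For a symmetric relation, a fact (a, r, b) also implies (b, r, a), and the model is only credited when its top prediction is correct for every implied fact. All names and data below are illustrative, not from the paper.

```python
# Toy sketch of evaluating a model on implied facts: credit a triple only
# if the model's top-1 answer is right for the fact AND its symmetric twin.

def implied_facts(triple, symmetric=True):
    """Expand a triple into the full set of facts it implies."""
    s, r, o = triple
    facts = [(s, r, o)]
    if symmetric:
        facts.append((o, r, s))  # symmetric closure: (a, r, b) -> (b, r, a)
    return facts

def hit_at_1(triples, predict, symmetric=True):
    """Fraction of triples where the model's top-1 prediction is correct
    for the original fact and every implied fact."""
    hits = 0
    for t in triples:
        if all(predict(s, r) == o for s, r, o in implied_facts(t, symmetric)):
            hits += 1
    return hits / len(triples)

# A stand-in "model" that only knows one direction of the relation:
kb = {("Alice", "spouse_of"): "Bob"}
model = lambda s, r: kb.get((s, r), "")

triples = [("Alice", "spouse_of", "Bob")]
print(hit_at_1(triples, model, symmetric=False))  # 1.0: one direction suffices
print(hit_at_1(triples, model, symmetric=True))   # 0.0: reverse direction missed
```

The same pattern generalizes to the other attributes (hierarchy, paths, compositionality): expand each seed fact into its implied set, then score the model against the whole set rather than a single cloze.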

There are surely other benefits of using KGs to collect and organize knowledge. They do not require costly retraining to update, and can therefore be updated more frequently to remove obsolete or incorrect facts. They allow more traceable reasoning and can offer better explanations. They make fact editing more straightforward and accountable (think of GDPR) compared to model editing [3]. But LLMs can certainly help bring in domain-specific or commonsense knowledge in a data-driven way. In conclusion: why not both [4]?? :-)


REFERENCES

[1] Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2019. Language Models as Knowledge Bases? https://arxiv.org/abs/1909.01066

[2] Vishwas Mruthyunjaya, Pouya Pezeshkpour, Estevam Hruschka, and Nikita Bhutani. 2023. Rethinking Language Models as Symbolic Knowledge Graphs. https://arxiv.org/abs/2308.13676

[3] Previously: “Model Editing: Performing Digital Brain Surgery”. https://www.dhirubhai.net/posts/benjaminhan_llms-causal-papers-activity-7101756262576525313-bIge

[4] Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. 2023. Unifying Large Language Models and Knowledge Graphs: A Roadmap. https://arxiv.org/abs/2306.08302


#KnowledgeGraphs #GenerativeAI #LLMs #NLP #NLProc #Paper

Sebastian Riedel

Researcher at DeepMind, Honorary Professor at UCL, Co-Founder Bloomsbury AI & FactMata

1y

Worth pointing out that the LAMA paper [1] did not try to contrast LMs with KBs per se, but with KBs _extracted automatically from text_. The context of that comparison is a long line of work on "automated knowledge base construction" within NLP, trying to do just that but with limited robustness, at least at the time. It might be interesting to revisit this question, e.g. by spinning up a more modern LLM-based knowledge base construction pipeline, running it on the LLM training corpus, and then comparing the constructed KB against the "knowledge within the LLM". That said, my money would actually (primarily) be on semi-parametric models (retrieve-and-read) such as RAG or ATLAS that leave text in its unstructured form wherever possible. CC Fabio Petroni

Samantha Jane Waters

Human-Focused Engineer and Data Scientist

1y

Also - couldn't a knowledge graph make up for cases where the broad public consensus is biased/incorrect? E.g. the issue years ago where Google Maps would direct a specific racial slur to the White House? I imagine an LLM using wild data could end up suggesting dangerous pseudoscientific remedies.

Juan Sequeda

Principal Scientist & Head of AI Lab at data.world; co-host of Catalog & Cocktails, the honest, no-bs, non-salesy data podcast. Scientist. Interests: Knowledge Graphs, AI, LLMs, Data Integration & Data Catalogs

1y

Fantastic distillation of why KG + LLM are made for each other

Michele Filannino, Ph.D.

EMBA Candidate @ SDA Bocconi | AI Scientist @ Prometeia [AI, NLP, GenAI, LLMs]

1y

Hardly any artifact has ever fully replaced a previous one, especially when the two have very different peculiarities, inner workings, and functionalities. Maybe the key question is not whether KGs will be replaced, but how the downsides of such a tool can be mitigated with new technologies.

Jun Xu

Executive Director, Machine Learning Engineering at Standard Chartered Bank; Guest Professor at South China University of Technology

1y

A KG requires explicit extraction from the content. So when comparing with an LLM, in particular an instruction-tuned LLM, the prompts should be carefully designed with a similar level of detail, given the large impact prompts have on final performance.
