
Enhancing Discovery in Scientific Research through Object-Oriented Approach for Large Language Models

Jong Hang Siong

I founded OTONOCO in Singapore to design and build SaaS and mobile apps that incorporate generative and agentic AI to solve complex problems in the industry

Published: October 28, 2023

The Motivation

This article presents the idea of combining the object-oriented programming (OOP) paradigm with Large Language Models (LLMs), particularly GPT, to enhance scientific discovery by having GPT learn from scientific papers. The idea came about following the release of several new features in the Python LangChain package for LLMs in late 2023.

The Pydantic package was used to facilitate the use of OOP. Another reason for using Pydantic was to address 2 of the challenges facing LLMs: malicious attempts to hijack LLM applications, and hallucinations. Pydantic is primarily a data validation and parsing library. It does not inherently deter malicious intent; its main purpose is to ensure that the data received or sent follows a predefined structure and has the correct data types. Nevertheless, it helps indirectly by enforcing strict data validation, serving as a contract for the expected data schema, and reducing the risk of unexpected type-related issues that could be exploited maliciously.
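
The "contract" idea can be sketched in a few lines, assuming Pydantic is installed. The CitationStub model, its fields and the try_parse helper are hypothetical names chosen for illustration; the point is only that data which does not match the declared types is rejected before it reaches the rest of an application.

```python
from pydantic import BaseModel, ValidationError


class CitationStub(BaseModel):
    """Hypothetical schema used only for illustration."""
    title: str
    year: int


def try_parse(payload: dict) -> str:
    """Return 'accepted' when the payload matches the schema, else 'rejected'."""
    try:
        CitationStub(**payload)
        return "accepted"
    except ValidationError:
        return "rejected"


# A well-formed payload passes validation.
print(try_parse({"title": "Example title", "year": 2018}))
# Free text where an integer is expected fails validation.
print(try_parse({"title": "Example title", "year": "ignore previous instructions"}))
```

Validation of this kind constrains the shape of the data, not its truthfulness, which is why it helps only indirectly.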

Text summarization, entity recognition, question answering and chatbots are some of the most common applications of LLMs. In these applications, both input and output are UNSTRUCTURED data. In this experiment, a combination of OOP, Pydantic, LangChain and GPT was used to facilitate the generation of STRUCTURED output. LangChain has also offered a short course on function calling via DeepLearning.AI. Some of the code here was inspired by that course.

There are 2 LLM tasks in this discovery:

Tag and extract important information from a scientific paper
Extract citations into structured data for further analysis

The Scientific Paper

The scientific paper selected for this experiment was authored by Professor David C. Page and his co-workers and published in the American Journal of Human Genetics in 2018 under the title “Selection Has Countered High Mutability to Preserve the Ancestral Copy Number of Y Chromosome Amplicons in Diverse Human Lineages”. Professor Page was the director of the Whitehead Institute for Biomedical Research in Cambridge, Massachusetts. He has spent nearly his entire scientific career studying and defending the honor of the Y chromosome, which was widely believed to be degrading, leading to the extinction of human males. His groundbreaking research led to discoveries of the pivotal role of Y genes beyond mere sex determination, suggesting that the well-being of the human race could hinge on this diminutive Y.

I have also previously authored an article on the Y chromosome based on Page's research here.

Required Python Packages for the Experiment

The required packages consist of standard Python, LangChain and Pydantic libraries. Access to GPT was brokered by LangChain's ChatOpenAI in the langchain.chat_models package. An OpenAI API key is required and must be loaded into the environment.

Tagging Instantiation

The selected paper was first subjected to tagging through the creation of a Tagging class whose fields are declared with variable annotations, a C++-like syntax that places a colon between a name and its type. This syntax was introduced in Python 3.6. The Tagging class was passed to LangChain’s convert_pydantic_to_openai_function, and the resulting function description was placed in a list object.

Notice also that the Tagging class inherits from Pydantic’s BaseModel so that it has access to the attributes and methods of that class.
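
As a rough sketch of what this step produces, the snippet below declares a Tagging class with variable annotations and turns it into a function description held in a list. The two fields (topic, language) are hypothetical, since the actual fields are not listed here, and to_function_description is a hand-written stand-in for illustration, not LangChain's own converter.

```python
from typing import Any, Dict, List

from pydantic import BaseModel, Field


class Tagging(BaseModel):
    """Tag a piece of text with particular information."""
    # Hypothetical fields, declared with variable annotations (name: type).
    topic: str = Field(description="main topic of the text")
    language: str = Field(description="language of the text, e.g. 'en'")


def to_function_description(model: type) -> Dict[str, Any]:
    """Build an OpenAI-style function description from a Pydantic model."""
    # Pydantic v2 exposes model_json_schema(); v1 exposes schema().
    if hasattr(model, "model_json_schema"):
        schema = model.model_json_schema()
    else:
        schema = model.schema()
    return {
        "name": model.__name__,
        "description": (model.__doc__ or "").strip(),
        "parameters": schema,
    }


# The function description is wrapped in a list, one entry per callable function.
tagging_functions: List[Dict[str, Any]] = [to_function_description(Tagging)]
print(tagging_functions[0]["name"])
print(sorted(tagging_functions[0]["parameters"]["properties"]))
```

The class docstring and the field descriptions end up in the schema, so they act as instructions to the model as well as documentation.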

Tagging instructions were provided in the prompt template to tell GPT how to behave when carrying out the tagging work. The chaining process was made up of 3 components. The first was the collection of functions that enables automated function calling. The second was the tagging model, which contains the API call to ChatGPT, and the last was the chain itself.

Scientific Paper Content

The URL of the paper was passed to LangChain’s WebBaseLoader to load the content. After loading, a small chunk of the content was sampled to confirm that it was there.

Class Object for Paper Overview

The purpose of the PaperOverview class is to extract information such as summary, statistics, discipline and keywords from the paper. The chaining followed the same steps: collecting the functions for automated function calls, creating the template, and chaining. Notice that the class name was passed to both LangChain's Pydantic conversion function and the tagging model.
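
A minimal sketch of such a class, assuming Pydantic is installed. The field names follow the four items named above; the types, the descriptions and the raw_arguments payload are assumptions made for illustration. The payload stands in for the JSON arguments a function-calling model might return.

```python
import json
from typing import List

from pydantic import BaseModel, Field


class PaperOverview(BaseModel):
    """Overview of a scientific paper."""
    # Field names follow the article; types and descriptions are assumptions.
    summary: str = Field(description="concise summary of the paper")
    statistics: str = Field(description="key statistics reported in the paper")
    discipline: str = Field(description="scientific discipline(s)")
    keywords: List[str] = Field(description="keywords describing the paper")


# Stand-in for the JSON arguments returned by a function call.
raw_arguments = json.dumps({
    "summary": "Investigates forces shaping Y chromosome amplicons.",
    "statistics": "1,216 males analyzed.",
    "discipline": "Genetics, Evolutionary Biology",
    "keywords": ["Y chromosome", "amplicons"],
})

# Validation turns the unstructured reply into a typed Python object.
overview = PaperOverview(**json.loads(raw_arguments))
print(overview.discipline)
print(overview.keywords)
```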

The output obtained is as follows:

Summary: This paper investigates the evolutionary forces that govern the formation, maintenance, and diversification of Y chromosome amplicons in humans. The authors develop computational tools to detect amplicon copy number with unprecedented accuracy from high-throughput sequencing data. They find that amplicon copy number is maintained among divergent branches of the Y chromosome phylogeny, indicating that the reference copy number is ancestral to all modern human Y chromosomes. The distribution of males with copy number variants within the phylogenetic tree is incompatible with neutral evolution and instead displays hallmarks of mutation-selection balance. The authors also observe cases of amplicon rescue, in which deleted amplicons are restored through subsequent duplications.

Statistics: 16.9% of males in the dataset have an amplicon copy number variant. The study analyzed whole-genome sequencing data of 1,216 males from the 1000 Genomes Project.

Discipline: Genetics, Evolutionary Biology

Keywords: Y chromosome, amplicons, copy number variation, mutation-selection balance, human evolution

Extract Citations

Citation Classes

The next task of discovery was to extract and analyze the citations that came with Professor Page’s paper. To accomplish this, 2 Python classes were created that contained the attributes for the extraction. The Citations class simply referred to Page’s paper itself. This class was passed to the CitationInfo class to ensure that only citations from Page’s paper were extracted, and not other papers in GPT’s memory. Notice that the citations annotation is in the CitationInfo class, which was passed to the chain.
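
The relationship between the two classes can be sketched as follows, assuming Pydantic is installed. The class names and the citations annotation come from the description above; the attribute names, their types, the optional defaults and the payload are assumptions made for illustration.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class Citations(BaseModel):
    """A single citation found in the paper."""
    # Attribute names and types are assumptions for illustration.
    title: str
    author: Optional[str] = None
    journal: Optional[str] = None
    year: Optional[int] = None


class CitationInfo(BaseModel):
    """Container for all citations extracted from one piece of text."""
    # The nested annotation makes every list element validate as a Citations object.
    citations: List[Citations] = Field(default_factory=list)


# Hypothetical payload standing in for the model's structured reply.
payload = {
    "citations": [
        {"title": "Example title A", "author": "A. Author",
         "journal": "Example Journal", "year": 2003},
        {"title": "Example title B"},
    ]
}

info = CitationInfo(**payload)
print(len(info.citations))
print(info.citations[1].journal)
```

Only the outer class needs to be handed to the chain, because its schema already embeds the schema of the inner class.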

Prompt Template

The classes in this task worked differently from those in the previous one. Here, a separate instruction template was created so that a set of rules could be defined, allowing GPT to extract information more precisely. The subsequent chaining process worked similarly. The instructions told GPT to extract only the authors, title, journal name and year from each citation, and only from citations found in Page’s paper.

Check for Hallucinations

Before starting the extraction process, prompts irrelevant to the paper were used to check whether the guardrail worked satisfactorily to deter hallucinations. The prompts were ‘How are you?’ and ‘Have a nice weekend’. As illustrated in the following, GPT returned nothing.

Split Text

For LangChain and GPT to work more efficiently, RecursiveCharacterTextSplitter was used to split the content into manageable chunks. The following illustration shows that Page’s paper was split into 22 chunks.
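
The idea of chunking with overlap can be illustrated with a simplified fixed-window splitter written in plain Python. This is a stand-in, not the logic of RecursiveCharacterTextSplitter, which prefers to break on separators such as paragraphs and sentences. The function name and parameters are assumptions, and the sizes in the example are toy values.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


chunks = split_into_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

The overlap keeps a citation that straddles a chunk boundary visible in at least one of the two neighbouring chunks.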

To carry out the extraction, the split text was passed through LangChain's RunnableLambda, followed by the chaining process. The flatten function ensured that the output was 1-dimensional; otherwise, errors would be thrown. Invoking the chain returned JSON-formatted output containing title, author, journal and year, as instructed earlier.
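
A flatten helper of the kind described might look like the sketch below. The exact signature is an assumption; the per_chunk data is a placeholder for the list of citation lists returned for each chunk.

```python
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def flatten(matrix: Iterable[Iterable[T]]) -> List[T]:
    """Concatenate per-chunk result lists into one flat list."""
    flat_list: List[T] = []
    for row in matrix:
        flat_list += row
    return flat_list


# One inner list per chunk; a chunk without citations yields an empty list.
per_chunk = [[{"title": "A"}], [], [{"title": "B"}, {"title": "C"}]]
print(flatten(per_chunk))  # [{'title': 'A'}, {'title': 'B'}, {'title': 'C'}]
```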


Citations Dataframe

To better analyze the citations, the JSON output was converted into a pandas dataframe as follows. Notice that the outcome was imperfect. Data cleaning was done to remove citations that did not have journal names.
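
The conversion and cleaning step can be sketched as below, assuming pandas is installed. The records are hypothetical placeholders for the extracted output, and treating whitespace-only journal names as missing is an assumption.

```python
import pandas as pd

# Hypothetical records standing in for the extracted JSON output.
records = [
    {"title": "Example title A", "author": "A. Author",
     "journal": "Example Journal", "year": 2003},
    {"title": "Example title B", "author": "B. Author",
     "journal": None, "year": 2010},
    {"title": "Example title C", "author": "C. Author",
     "journal": "  ", "year": 2015},
]

df = pd.DataFrame(records)

# Treat missing and whitespace-only journal names alike, then keep the rest.
journal = df["journal"].fillna("").str.strip()
cleaned = df[journal != ""].reset_index(drop=True)
print(cleaned[["title", "journal"]])
```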

The following illustration shows the cleaned data.

Top-Rated and Regular Journals

It is every scientist's aspiration to have his or her work published in top-rated journals such as the AAAS Science publications, the Nature publications, Cell, and the Proceedings of the National Academy of Sciences (PNAS). For this reason, an investigation was carried out to see whether an LLM could help.

The following illustration shows the list of journal names used to identify top-rated journals. This list was used to separate top-rated from regular journals. Journals whose names were not on the list were automatically classified as regular.
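
The separation rule can be sketched in plain Python. The journal list below is a short hypothetical one, not the list used in the experiment, and exact matching after lower-casing is an assumption; a real list would need every variant of each journal's name.

```python
from typing import Dict, List, Optional, Tuple

Citation = Dict[str, Optional[str]]

# Hypothetical list, stored in lower case for case-insensitive matching.
TOP_RATED_JOURNALS = {"science", "nature", "cell", "pnas"}


def split_by_journal(citations: List[Citation]) -> Tuple[List[Citation], List[Citation]]:
    """Return (top_rated, regular); anything not on the list counts as regular."""
    top_rated: List[Citation] = []
    regular: List[Citation] = []
    for citation in citations:
        name = (citation.get("journal") or "").strip().lower()
        if name in TOP_RATED_JOURNALS:
            top_rated.append(citation)
        else:
            regular.append(citation)
    return top_rated, regular


sample = [
    {"title": "Example title A", "journal": "Nature"},
    {"title": "Example title B", "journal": "Example Journal"},
    {"title": "Example title C", "journal": "CELL"},
]
top_rated, regular = split_by_journal(sample)
print([c["title"] for c in top_rated])
print([c["title"] for c in regular])
```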

The resulting top-rated journal articles are as follows:

The resulting regular journal articles are as follows:

Retrieve Abstracts from PUBMED using Titles

To better understand the characteristics of top-rated and regular journals, much more than titles was needed. Titles from both tables were extracted and formatted into Markdown bullet points. These titles were then used by GPT to retrieve the corresponding abstracts from PUBMED.

The following illustration shows several functions that were created to orchestrate the retrieval of abstracts from PUBMED. get_gpt_response is a generic GPT completion function. The functions get_abstract_prompt and get_abstract_prompt_json were created to construct prompts that retrieve completions in different output formats.

A prompt constructed using get_abstract_prompt returns standard text output, while one constructed using get_abstract_prompt_json returns JSON-formatted output.
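
The two prompt builders might be sketched as follows. Only the function names get_abstract_prompt and get_abstract_prompt_json come from the text above; their signatures, the prompt wording and the to_markdown_bullets helper are assumptions made for illustration. The completion call itself is left out because it needs an API key.

```python
from typing import List


def to_markdown_bullets(titles: List[str]) -> str:
    """Render titles as a Markdown bullet list, skipping empty entries."""
    return "\n".join(f"- {title.strip()}" for title in titles if title.strip())


def get_abstract_prompt(titles: List[str]) -> str:
    """Prompt asking for each abstract as plain text (hypothetical wording)."""
    return (
        "For each article title below, provide its abstract.\n"
        "If you do not know the abstract, say so instead of guessing.\n\n"
        f"{to_markdown_bullets(titles)}"
    )


def get_abstract_prompt_json(titles: List[str]) -> str:
    """Same request, but asking for JSON output (hypothetical wording)."""
    return (
        get_abstract_prompt(titles)
        + '\n\nReturn a JSON list of objects with the keys "title" and "abstract".'
    )


print(get_abstract_prompt_json(["Example title A", "Example title B"]))
```

Building the JSON variant on top of the plain one keeps the two prompts identical apart from the output-format instruction.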

Top-Rated Title-Abstract

Similarities and Differences

With the abstracts from top-rated and regular journals properly separated, the similarities and differences between the two groups could be examined. GPT was again enlisted to sort this out. The lists for both groups were used to construct a prompt instructing GPT to find similarities and differences with regard to the study of the Y chromosome in human health.

Similarities between Top-Rated and Regular Journals with Respect to the Study of Y-Chromosome Associated with Human Health

Differences between Top-Rated and Regular Journals with Respect to the Study of Y-Chromosome Associated with Human Health

Refining Discovery Through the Retrieval of Relevant Abstracts

Research discovery can be improved and refined by being able to retrieve, quickly and accurately, the specific abstracts relevant to a question. This can be achieved with a technique called vectorstore embedding.

The following function uses a vectorstore to hold the abstracts built earlier. The parameter in_question takes a query and passes it to the vectorstore to retrieve the relevant abstracts.
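
The retrieval idea can be illustrated with a toy example in plain Python. Word counts and cosine similarity are used here as a stand-in for a real embedding model and vectorstore, so results will be far cruder than with learned embeddings. Only the parameter name in_question comes from the text above; the function names and the placeholder abstracts are assumptions.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy 'embedding': lower-cased word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve_abstracts(in_question: str, abstracts: List[str], k: int = 2) -> List[str]:
    """Return the k abstracts most similar to the question."""
    query = embed(in_question)
    ranked = sorted(abstracts, key=lambda text: cosine(query, embed(text)), reverse=True)
    return ranked[:k]


# Placeholder texts, not abstracts retrieved in the experiment.
abstracts = [
    "Deletions on the Y chromosome can impair sperm production.",
    "Sequencing repetitive regions remains a technical challenge.",
    "Dataframes store tabular data.",
]
print(retrieve_abstracts("sperm production and male fertility", abstracts, k=1))
```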

It's now time to ask some questions to see if we can get the relevant abstracts.

Question 1: Genes involved in male fertility such as sperm production and function

Question 2: What are the challenges in the analysis of Y chromosome genome sequences?

Question 3: Implications for human evolution and health associated with human Y chromosome

