From Perplexity to ScholarAI GPT: Assessing the performance of AI tools for serious research

Research can be understood at three levels, similar to preparing different types of meals with varying complexity. Everyday research is like grabbing a quick snack—it's casual, quick, and practical, suitable for answering simple questions. Workplace research resembles cooking a simple dinner, requiring more thoroughness and reliability to solve problems or make informed decisions. Finally, academic research is like preparing a gourmet meal, demanding rigorous verification and depth to achieve precise and well-supported outcomes. With so many AI tools available to help us prepare these research "meals," it is useful to know which kind of tool is most useful for which type of research.

The rise of AI tools, particularly generative AI (genAI), has made information more accessible than ever and benefits not only researchers but also the general public. These tools can inspire curiosity, foster learning, and even provide cited sources that can, in theory, turn mere information into reliable knowledge.

Large language models (LLMs) like ChatGPT are adept at explaining concepts and providing general knowledge, but their limitations become apparent when tasked with detailed fact-finding for academic research. Newer AI-powered platforms like Perplexity AI have emerged as promising alternatives by combining internet search functionality with LLM capabilities. Unlike traditional Google searches, Perplexity provides linked sources for its concise responses, making it an efficient tool for obtaining information. However, its reliance on open-source materials such as Wikipedia, Reddit, and blogs—rather than scholarly databases like Google Scholar or PubMed—limits its suitability for in-depth academic research and validated knowledge.

The rapid evolution of other AI research tools has led to the creation of specialized products to fill the gap for serious researchers. Many of these services offer limited free features, but full access typically requires a subscription. With the growing popularity of platforms like ChatGPT and its GPT store, several research-focused AI platforms have launched limited versions as GPTs for free use. These GPTs are customized versions of ChatGPT with a specific functionality, like creating images, doing finance reports, or conducting academic research. As a marketing strategy for research-oriented AI platforms, the free, research-customized GPTs give users an opportunity to explore their capabilities before committing to paid plans on the platforms outside of ChatGPT.

This brief study used the same prompt to compare and evaluate seven genAI tools on their ability to conduct a literature review for a specific research topic: five free research-dedicated GPTs from ChatGPT’s GPT store, plus the AI search tool Perplexity and the general genAI tool Google Gemini. I included the last two because Perplexity has become the most representative genAI internet search tool, and Google Gemini can also search the internet in real time and, as a Google product, might have access to Google Scholar. Or so I thought.

Brief overview of top ranked research GPTs

GPTs have become a free value-added strategy for paid service providers to access the millions of people who use ChatGPT and its specific-function GPTs. The websites behind the most popular GPT research tools offer a range of services tailored to assist students, researchers, and educators in the research process (literature review searches, summarizing, planning, writing, etc.). Within the top 10 most used research tools on ChatGPT’s GPT store are the following: Scholar GPT (#1), Consensus (#2), SciSpace (#3), Scholar AI (#5), and Ask Your PDF Research Assistant (#9).


GPT Store rankings for Research & Analysis GPTs (Oct. 8, 2024)

Although these GPTs are specifically designed to access research articles from sources like Google Scholar, PubMed, and arXiv, their parent companies’ websites usually offer a number of different services to researchers:

• Scholar GPT is available through platforms like YesChat.ai, which specializes in real-time access to scholarly information and problem-solving assistance, including interpreting academic code and handling PDF documents.

• Consensus is from consensus.app, which acts as an AI-powered search engine for research, citation management, and finding gaps in research.

• SciSpace is from typeset.io, which offers tools for research, writing, data extraction, and collaboration among researchers.

• ScholarAI is from scholarai.io, which provides functionality for research paper and patent search, summarizing academic content, and interpreting visual data.

• Ask Your PDF Research Assistant is from askyourpdf.com, which focuses on efficiently extracting information from PDF documents, streamlining the research process, and creating research libraries for users.

Each tool is designed to improve the efficiency and effectiveness of academic research. Here is an overview of each tool’s statistics from the GPT store.

GPT Store rankings for Research & Analysis GPTs (Oct. 8, 2024)

The Prompt that was used

The 5 GPTs, Perplexity, and Google Gemini were compared using a prompt that contained a 3-part framework of task, context, and content:

1. Task-Based Component - What the AI should do

2. Context-Based Component - Why the task is important, and who it is for

3. Content-Based Component - What specific details or data the AI should use

The prompt was very detailed and focused on new generative AI technology for a very specific educational use (academic writing). Here is the actual prompt, followed by a short sketch of how the three parts can be assembled:

• [TASK] "Do a literature review for the research question: 'How can generative AI tools, like ChatGPT, enhance the quality of research writing in academic settings?'

• [CONTEXT] The literature review will serve as the beginnings of research for a research project focusing on the impact of AI tools on writing quality in the final revising stages of the writing process.

• [CONTENT] For the literature review, find the top 5 most relevant papers published in 2024 and separate the findings in terms of peer reviewed papers and other papers, like conference papers, published notes, etc. Priority should be given to quantitative research designs (but qualitative designs are ok too) and significant findings, and if the focus was on writing process or final output. Also mention use of different free AI tools for improving writing (like ChatGPT, Google Gemini, Claude, and also Grammarly and Quillbot), and how they affect writing style, clarity, coherence and organization, vocabulary complexity, and also citation management, and ethical considerations such as avoiding plagiarism. Use plain English and bullets where possible, and cite the findings in APA format and with a link, including significance values and effect sizes if any."
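For readers who want to reuse this Task/Context/Content structure, here is a minimal sketch (my own illustration, not part of the original study; the function and variable names are hypothetical) of how the three components could be assembled into one prompt string:

```python
# Minimal sketch: assembling a prompt from the Task/Context/Content framework.
# The component texts below are truncated placeholders; substitute your own.

def build_prompt(task: str, context: str, content: str) -> str:
    """Join the three framework components into a single labelled prompt."""
    return "\n\n".join([
        f"[TASK] {task}",
        f"[CONTEXT] {context}",
        f"[CONTENT] {content}",
    ])

prompt = build_prompt(
    task="Do a literature review for the research question: ...",
    context="The literature review will serve as the beginnings of research for ...",
    content="Find the top 5 most relevant papers published in 2024 and ...",
)
print(prompt)
```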

Prompt evaluation criteria with 5 weighted factors

This evaluation checklist for a literature review has a maximum possible score of 50 points. Accuracy (relevance of the research papers to the prompt) and reliability (real, not fake, papers from credible journals) are treated as the most important criteria and are given the most weight at 15 points each. These are followed by meeting prompt requirements (10 points for how much of the prompt was followed), while usefulness and clarity carry smaller weights for a balanced evaluation. A short scoring sketch after the list makes the arithmetic explicit.

1. Accuracy (15 points)

  • Weight: Most important criterion, ensuring the information is correct, relevant, and up-to-date.
  • 0: Irrelevant results --> 15: Completely relevant and up-to-date

2. Reliability of Citations & Sources (15 points)

  • Weight: Crucial for the credibility and verifiability of sources used
  • 0: Fake or no citations provided --> 15: Fully credible, peer-reviewed sources with functional links

3. Meeting Prompt Requirements (10 points)

  • Weight: Ensures the response addresses the key tasks outlined in the prompt.
  • 0: Fails to meet key prompt requirements --> 10: Fully meets all aspects of the prompt (peer-reviewed papers, quantitative focus, writing improvement)

4. Usefulness & Insight (5 points)

  • Weight: Assesses the depth and value of insights provided.
  • 0: Lacks any practical insight --> 5: Highly useful, thoughtful and builds on real and relevant cited research results

5. Clarity & Structure (5 points)

  • Weight: Evaluates how clearly and logically the information is presented.
  • 0: No organized information, hard to understand --> 5: Well-structured, clear, and easy to follow
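As a quick illustration of the weighting, here is a minimal sketch (a hypothetical helper, not part of the evaluation tooling) that totals a tool's rubric scores out of the 50-point maximum and converts the result to a percentage:

```python
# Minimal sketch: totalling rubric scores (max 50 points) as a percentage.
# The example scores below are illustrative, not the actual study data.

MAX_POINTS = {
    "accuracy": 15,
    "reliability": 15,
    "prompt_requirements": 10,
    "usefulness": 5,
    "clarity": 5,
}

def total_percentage(scores):
    """Sum the awarded points and express them as a percentage of the 50-point maximum."""
    for criterion, points in scores.items():
        assert 0 <= points <= MAX_POINTS[criterion], f"{criterion} score out of range"
    return 100 * sum(scores.values()) / sum(MAX_POINTS.values())

example = {"accuracy": 13, "reliability": 12, "prompt_requirements": 8,
           "usefulness": 4, "clarity": 3}
print(f"{total_percentage(example):.0f}%")  # prints 80%
```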

Results

The evaluation of major LLM search tools reveals a range of performance levels in terms of accuracy, reliability of citations, meeting prompt requirements, usefulness, and clarity. Scholar AI ranked highest with an 80% total score and excelled particularly in accuracy and reliability of sources. Ask Your PDF followed with a 66% score, though it lagged behind in citation reliability. Scholar GPT and Consensus achieved moderate scores of 55% and 52%, respectively, with some strengths in meeting prompt requirements. SciSpace, Gemini, and Perplexity scored very low (8% or below), indicating significant limitations across all evaluated categories, particularly in accuracy and source reliability.

Ranking overview – Table and Figure


GPT rankings according to evaluation criteria


GPT rankings in a stacked bar chart

Salient comments on each tool - from worst to best

The AI search services of Perplexity and Google Gemini were useless. Perplexity suffered from hallucinations: of its six suggested “papers,” one had a fake link and the other five were not research papers but blogs superficially comparing AI tools; Perplexity even added fabricated statistical findings for two citations to meet the prompt requirements. As for Gemini, I had assumed it would be integrated with Google services like Scholar, but after several tries, it still refused to give me any research paper recommendations—it only gave me search strategies that could be used on Google Scholar. I gave it a slightly higher score than Perplexity because it at least didn’t hallucinate and actually gave clear and structured advice on how to conduct a research literature review.

SciSpace offered no papers, even after I expanded the prompt to include 2023 papers. It did provide a well-structured summary overview of the topic I suggested and even cited statistical research findings, but not the studies they came from, which made me think they were hallucinated.

Consensus provided 5 peer-reviewed papers and 2 non-peer-reviewed ones, but only 2 were accurate and relevant papers about AI tools and writing. All of the papers were from 2024 as specified in the prompt, but all dated January 2024 and all drawn from its own platform. It did provide a useful “Common themes and tools” section at the end and a general conclusion.

Scholar GPT found relevant topics from five 2024 sources, but only 2 were very relevant. Some references were from non-academic blogs; 3 out of 5 had wrong information, one had a broken link, and two had fabricated statistical findings. It did give an interesting “Key findings” section at the end, but it was not obvious how much of it was true and how much was hallucinated.

Ask Your PDF Research Assistant was generally quite good and useful. All of its suggested papers had working links, and 3 out of 5 were directly relevant. It gave a combination of peer-reviewed and non-peer-reviewed papers in line with the prompt, but the author citations were wrong (listing either a middle or final author), which shows it struggles to identify the first author for citation purposes.

Scholar AI was by far the most useful GPT and recommended a total of 10 papers (5 peer-reviewed and 5 not), which were all relevant. The links worked and primary authors were identified for citations. The overview was basic, but the paper descriptions had title, link, findings, focus, and which AI tools were used in the study. The only real problem was that the papers were from 2023, not 2024 as specified by the prompt. Unlike some of the other GPTs, this one did not give a summary at the end, which would have been useful.

Caveats and conclusions

This was not an exhaustive study with different prompts or different evaluators, but I think it can shed light on the performance of the research GPTs on ChatGPT’s GPT store. Also, given the specificity of the research topic, the recency of the ChatGPT technology, and the publication scope of 2024, it is possible that there have not been many published research articles on the impact of AI tools, like ChatGPT, on writing, and likely that there were few relevant examples to find. I also did not evaluate the journals that published the articles. After all, there are many peer-reviewed journals, but the best research finds its way into the top, most prestigious journals, and it is not clear (and probably unlikely) that these research GPTs can make these distinctions or even access the top journals.

These limitations may be partly responsible for SciSpace’s poor performance, especially in light of its number 3 ranking in the GPT store’s Research & Analysis category, its 700k uses, and the highest rating of 4.3 from over 25K votes. It is more difficult to determine whether the GPT Store’s number 1 ranked Scholar GPT (over 2 million uses) and number 2 ranked Consensus deserve their high rankings, given that neither, strangely, has any rating score, only the explanatory comment “Not enough ratings yet.”

A word of caution is worth mentioning here. Several of these research GPTs completely fabricated studies, links, and statistical findings, and mistook superficial blog articles for research published in academic journals. Clearly, results provided by these GPTs need to be double-checked and verified.
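One practical way to double-check a suggested citation is to look the title up in an open bibliographic index. The sketch below (my own illustration; the example title and query parameters are assumptions) uses the public Crossref REST API to retrieve the closest real indexed record for a GPT-suggested title, which can then be compared by eye:

```python
# Minimal sketch: checking a GPT-suggested paper title against Crossref.
# Requires the `requests` package; the example title is hypothetical.

import requests

def crossref_lookup(title):
    """Return the closest Crossref record for a title, or None if nothing is found."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": 1},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return items[0] if items else None

record = crossref_lookup("Generative AI tools and the quality of academic research writing")
if record:
    print(record.get("title", ["<no title>"])[0], "DOI:", record.get("DOI"))
else:
    print("No matching record found; treat the citation as suspect.")
```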

The abovementioned limitations aside, Scholar AI was still able to come up with 10 resources and Ask Your PDF Research Assistant came up with 3. At this point, for this prompt, Scholar AI completely outperformed all others.

Research is a broad concept and has many levels. It is useful to think of research using the metaphor of meal preparation. Different meals have varying levels of complexity and specificity depending on their purpose, as the following three types of meals show.

1. Everyday research is like grabbing a quick snack and involves casual inquiries like checking the weather or resetting a router; it requires some accuracy but is generally informal and practical. Here, a Google or YouTube search, or Perplexity and Google Gemini, may be enough.

2. In contrast, workplace research resembles cooking a simple dinner and demands more thoroughness and reliable sources to complete tasks or solve problems effectively, such as analyzing market trends. For this, Perplexity, Google Gemini, or ChatGPT research GPTs will be more useful.

3. Finally, academic or serious research is like preparing a gourmet meal, where every detail matters and rigorous vetting of sources is essential for precise outcomes. Here, ChatGPT GPTs might be a useful start for unfamiliar fields, but for the researcher’s own specialized field there is no shortcut: scholarly databases and specific journal archives will be necessary.

Across all levels, the fundamental research process—finding information, organizing it, and presenting results—remains consistent; however, as the stakes increase, so too does the necessity for quality and precision in both ingredients and preparation methods.
