Comparing Copilot Studio and Azure AI Studio for No-code RAG Chatbots
Me and my copilot


In my previous article I covered the basics I think you should understand to get the most out of this one. This article is the second excerpt from a piece I started on my blog last week, currently titled Building Good Chatbots Part One, No-Code with Microsoft Copilot Studio and Azure AI Studio. In it I will explain more about building a chatbot that uses retrieval augmented generation (RAG) to ground the bot and control hallucinations, using two no-code alternatives from Microsoft: Azure AI Studio and Copilot Studio.

This article is not a step-by-step tutorial and can be read on its own. The comparison is meant to demonstrate concepts; statements about capability may not be accurate by the time you read this, as the space and products are changing very quickly!

A note on Copilot Studio models versus Azure AI Studio and GPT-35-Turbo-16k

One way Copilot Studio provides value is by providing a model that it uses internally by default at no extra charge to end-users. What is the specific model, how big is its context window, and what are its limits? The answer appears to be that it’s a secret! As we get into the meat of this post, I am going to start with this mystery model before connecting it to a GPT-35-Turbo-16k deployment so that I can show apples-to-apples comparisons between Copilot-style chatbots and alternatives.


Part Two – Examples: House of straw, House of sticks, House of bricks, Castle on a hill

"Wise King Solomon" - Stable Diffusion


The example I will use throughout the remainder of the article is a chatbot grounded in The Book of Proverbs, the text of which I got from BibleHub.com. I like using this as an example for several reasons:

  1. The text is long. At around fifteen thousand words and over seventeen thousand tokens, it is too long to fit into most context windows.
  2. The language is stylized and formal, just like many business and technical documents.
  3. Unlike many business and technical documents, it is widely recognizable which makes it easier for many people to evaluate the quality of the responses.
  4. Many models were trained on data sets that contain the text which makes certain demonstrations of grounding possible.
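To get a feel for why the text overflows a context window, here is a rough back-of-the-envelope token estimate. The 0.75 words-per-token rule of thumb for English is an approximation, not real tokenizer output, and the word count is the article's approximate figure:

```python
# Rough token estimate for a long English document.
# Rule of thumb: 1 token ~= 0.75 words, so tokens ~= words / 0.75.
def estimate_tokens(word_count: int) -> int:
    return round(word_count / 0.75)

proverbs_words = 15_000  # approximate word count of the Book of Proverbs
tokens = estimate_tokens(proverbs_words)
print(tokens)  # ~20,000 by this rule of thumb; a real tokenizer counts ~17,000

# Either way, a 16k-token window must also hold the prompt and the answer,
# so only a fraction of the book can be passed along as grounding.
```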

The tests

Remember that the purpose of the Solomon chatbot is to give advice grounded in Proverbs, not to answer questions specifically about the text of the book. You will notice that some of the questions do mention Proverbs or Solomon. They should be easier for a chatbot using keyword queries to answer.

  1. How do I become wise?
  2. I am going to have a meeting with my boss. I want to tell him about the amazing work I have done. Did Solomon have any advice?
  3. I think the mayor of our town is a liar and a fool. I told my friend and she said it isn't a good idea to say things like that. Which of us is right? What does proverbs say?
  4. I have a lot of money and I am very proud of that. In fact, I am sure I am better than most people. What do you think about that?
  5. How do I do the right things and avoid doing the wrong things?

House of straw: the system prompt

Recall from Part One that we can provide guidance to the model via the prompt to encourage desired behavior and reduce hallucinations. Also recall that we don’t know what model chatbots made in Copilot Studio use by default. I don’t know if what I am about to show you is a bug, but the results I got from trying to set the system prompt in Copilot Studio were very unsatisfactory. Step one is to create a new Copilot with no additional configuration as a baseline and ask it a question.
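As a reminder of what a system prompt is outside of a no-code tool, this is roughly the chat-completions message structure it maps to when you call a chat model directly. The instruction wording below is illustrative, not the exact prompt I used:

```python
# A chat request is a list of messages; the "system" message carries the
# behavioral instructions that no-code tools expose as a settings field.
system_prompt = (
    "You are Solomon. Answer only with advice grounded in the Book of "
    "Proverbs. If the provided context does not contain an answer, say so."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "How do I become wise?"},
]

# The model sees the system message first, which is why an ignored or
# mishandled system prompt changes a bot's behavior so dramatically.
print(messages[0]["role"])  # system
```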

A new bot in Copilot Studio does not answer "How do I become wise?"


Unsurprisingly, I get no answer. The generative AI capabilities in Copilot Studio are new, bolted on to the earlier Power Virtual Agents system that required you to define all of the topics and things the bot could say and do. Generative AI makes using the tool much easier, but you must turn it on.

Enable "Generative Answers" to use generative AI


I also set the content moderation to ‘Low’ in hopes of getting an answer. Finally, I set the Custom instructions according to the documentation.

"Custom instructions" to set the prompt


I was very surprised when it failed to answer the first test question. As this is all a black box, it is possible this behavior (which, as you will recall, is a type of hallucination) is a failure of the model or of the chatbot.

... did not work either


As you can see below, I could build a chatbot using Azure AI Studio instead and get acceptable results.

This is not a problem in Azure AI Studio


Furthermore, because you can start in the Azure AI Studio playground and deploy a Power Virtual Agent to Copilot Studio, it is reasonable to ask ‘why would I choose to start in Copilot Studio instead of Azure AI Studio?’ The only answers I have are:

  • A Copilot connected to Azure OpenAI costs more because you pay for the service separately
  • You have access to Microsoft 365, but not to Azure OpenAI

-1 for Copilot Studio

On the other hand, this will make the ‘House of sticks’ section much easier to understand as we introduce retrieval augmented generation into the mix.

House of sticks: keyword queries against documents and websites

Once Generative AI is enabled in the Copilot, you can connect it to data for retrieval, within limits, at no additional charge and without needing to set anything else up. Two options are websites and document upload.

Copilot Studio lets you make a bot to chat against your documents with RAG

Let’s start with Upload a document. For this and most of the other scenarios I created a text file containing the book’s text.

A snippet from Proverbs

Test 1 – How do I become wise?

Failure

You might be surprised that it didn’t have an answer for this considering the opening lines of the book. The third line includes the word wisdom, but the keyword query created for this expression doesn’t have enough words that match the text, and wisdom is not the same word as wise. And so, it fails to answer correctly given the grounding.
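The failure is easy to reproduce with a toy keyword match. In this sketch (illustrative only, with a made-up stopword list), exact word overlap between the question and an opening verse is empty because "wise" and "wisdom" are different tokens:

```python
import re

def keywords(text: str) -> set[str]:
    """Lowercase word tokens, minus a few trivial stopwords."""
    stopwords = {"how", "do", "i", "to", "the", "and", "of", "a", "for"}
    return set(re.findall(r"[a-z]+", text.lower())) - stopwords

question = "How do I become wise?"
verse = "for gaining wisdom and instruction; for understanding words of insight"

# Exact-match overlap between question words and verse words.
overlap = keywords(question) & keywords(verse)
print(overlap)  # set() - "wise" does not literally match "wisdom"
```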

Test 2 – I am going to have a meeting with my boss. I want to tell him about the amazing work I have done. Did Solomon have any advice?

Success


This time it gives an answer, and a pretty good one at that! If we have a look at the citation text, we can see that it contains the literal text Solomon which satisfies the keyword query.

The keyword search found this content

Notice that the citation consists of a snippet of text only a few hundred words long. The chatbot chose a snippet from the file to add to the prompt. As I said before, the text is too long to fit into the context window, so the bot can only use a chunk of the document. The text file contains the word Solomon in several places. Was this the best chunk? Maybe, maybe not. The keyword query only matches on the text, and it may have chosen this particular chunk simply because the word appears twice.

Test 3 – I think the mayor of our town is a liar and a fool. I told my friend and she said it isn't a good idea to say things like that. Which of us is right? What does proverbs say?

Another success? Yes!

Again we get a good answer! Maybe this Solomon bot isn’t too bad! Notice that the chatbot offered multiple citations from the single document I uploaded for this and the previous answers. When I uploaded the document into Dataverse via Copilot Studio, it helpfully split the long document into individual chunks that are small enough to fit into the context window. If you were building this system from scratch, you’d need to do that yourself! A downside to this and to the Azure AI Studio services we will look at later is that there are no good end user options at the moment for maintaining and updating this content!
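Copilot Studio did the chunking for me; building from scratch, a naive splitter is only a few lines. This sketch uses fixed-size word chunks with overlap (the sizes are illustrative, not what Dataverse actually uses):

```python
def chunk_words(text: str, size: int = 300, overlap: int = 30) -> list[str]:
    """Split text into chunks of ~size words, overlapping so that
    sentences cut at a boundary still appear whole in one chunk."""
    words = text.split()
    chunks = []
    step = size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
    return chunks

doc = ("word " * 1000).strip()  # stand-in for the ~15,000-word book
chunks = chunk_words(doc)
print(len(chunks))  # 4 chunks, each small enough to fit in a prompt
```

A retriever then selects one or more of these chunks per question, which is why the choice of chunk matters so much to answer quality.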

Test 4 – I have a lot of money and I am very proud of that. In fact, I am sure I am better than most people. What do you think about that?

Encouraging signs

Ok! We’re on a roll now. This Solomon Copilot is looking good. Drumroll please!

Test 5 – How do I do the right things and avoid doing the wrong things?

Oh well...

…sad trombone. Maybe it isn’t ready to share! The final tally is three answers and two hallucinations. You might be surprised by the two that failed because they should have been the easiest ones to answer given the document. You shouldn’t be, because this is a simple demonstration of the weakness of RAG based on keyword query searches: it only works well when the users of the chatbot use the right vocabulary.

That may be acceptable. In a workgroup where everyone speaks the same language, or in a formal domain, the chatbot will usually be able to find some information if the questions are phrased properly. On the other hand, if the same words are common in the documents and phrases often repeat, the more documents you add, the less likely it is to pick relevant sections. Furthermore, the resulting system will work best for the people who understand the subject matter best, because they use the right language. If you are among the experts (perhaps a product owner) and you evaluate the chatbot, you might think it works well and then be very surprised when non-experts (perhaps confused customers looking for support) tell you it doesn’t.
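The vocabulary sensitivity is easy to see if you score chunks the way a lexical retriever does. In this sketch (scoring by count of shared distinct words, purely illustrative), a question phrased in the document's vocabulary retrieves well while a paraphrase retrieves nothing:

```python
def score(query: str, chunk: str) -> int:
    """Lexical relevance: count of distinct query words present in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c)

chunks = [
    "the fear of the lord is the beginning of wisdom",
    "a gentle answer turns away wrath",
]

expert_query = "what is the beginning of wisdom"   # document vocabulary
novice_query = "how can i get smarter"             # same intent, other words

print([score(expert_query, c) for c in chunks])  # [5, 0] - strong match
print([score(novice_query, c) for c in chunks])  # [0, 0] - nothing retrieved
```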

Documents versus websites, SharePoint and others

Each of the other Copilot Studio no-additional-cost options uses keyword search. The size of the snippet or chunk retrieved varies. I have not done serious testing, but it appears that the web search has the smallest snippet size and provides one snippet for each page found. This leads directly to another cause of hallucinations: when a low-quality or short result is passed to the model, it imagines other convincing details to make up an answer. I reconfigured the connection to use the Bible Hub website. This gives the chatbot access to much more than the single book I uploaded previously, including summaries. However, the snippets returned by the Bing Search API are very short, a few hundred characters (as opposed to words) long. Consider this result:

An answer with a citation, but a hallucination that is not well grounded. Don't be fooled by long answers!

That isn’t necessarily bad advice, but it isn’t grounded in the source content, and almost none of it is supported by the specific citation. It’s a hallucination in the context of the failed attempt to ground the chatbot in specific content. You should be aware of this behavior with the Bing Search API: it hallucinates badly when used for complex questions that can’t be answered from a short snippet of text from a long web page! Understanding this, the flaw becomes easy to demonstrate. In fact, I’d argue that if you get a long answer from it about a complicated question, you should assume the answer is wrong and read the pages which contain the alleged ‘information’.

House of bricks: semantic search with Azure OpenAI

At this point, Copilot Studio can’t do any better than what you’ve seen in ‘House of sticks’ without adding in other services which are not free. In this next scenario I am using Azure OpenAI and Azure AI Search via Azure AI Studio. You can connect your Copilot to it as a data source instead of using the free model. Azure AI Studio is also a no-code tool, but instead of being surfaced through M365, it is surfaced through Azure. I assume far more people have access to Copilot Studio than to Azure AI Studio, but if you have both and need the capabilities demonstrated in this section, it gets harder to see Copilot Studio as a good value unless you have additional requirements that justify it independently of the need for a good chatbot experience.

Once you have your chatbot working in Azure AI Studio, you can deploy it (and pay for it as an additional charge) either as a new web app in Azure based on Python and React, or as a Copilot (previously known as a Power Virtual Agent bot).

This picture is out of date and the option is gone this week.

Note: since writing this, the option to deploy to a Power Virtual Agent bot has disappeared... fast changing indeed!

This was not the option I would have recommended in many cases prior to the announcements Microsoft made on November 15, 2023 at the Ignite conference. The same day they revealed Copilot Studio, they also significantly lowered the cost of the semantic search features in Azure Cognitive Search, renaming the service Azure AI Search. Things change fast in the AI space, and this is now a good option for many scenarios.

Semantic search denotes search with meaning, as distinguished from lexical search, where the search engine looks for literal matches of the query words or variants of them without understanding the overall meaning of the query. If you made it this far, you should immediately grasp why this is a better approach for RAG than keyword query searches. I will write about vector databases and vector search in a subsequent article; for now I will tell you that vector search is a specific type of semantic search, and that Microsoft’s semantic search uses it under the covers and hides the complexity from you in exchange for money. Check out this article about Semantic Index for Copilot. I’ll be speaking on this topic, Getting Started Making Copilots for MS Teams with Graph and the Semantic Index, in April at the North American Cloud & Collaboration Summit along with my friend Fabian Williams from Microsoft.

This time I am using the system prompt from the ‘house of straw’ test along with the same content, but with semantic search via Azure AI Search instead of keyword search, along with GPT-3.5-turbo-16k. Let’s see the results!
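Under the covers, semantic search compares embedding vectors rather than literal words. The embeddings below are tiny made-up three-dimensional vectors (real ones have hundreds or thousands of dimensions and come from an embedding model); the point is only that cosine similarity can rank "wise" close to "wisdom" even though the strings differ:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; a real model maps related meanings to
# nearby directions in a high-dimensional space.
emb = {
    "wise":   [0.9, 0.1, 0.0],
    "wisdom": [0.8, 0.2, 0.1],
    "wrath":  [0.0, 0.1, 0.9],
}

print(round(cosine(emb["wise"], emb["wisdom"]), 2))  # high similarity
print(round(cosine(emb["wise"], emb["wrath"]), 2))   # low similarity
```

This is why the same five test questions that tripped up keyword retrieval work with semantic retrieval: the query no longer has to share vocabulary with the text, only meaning.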

Test 1 – How do I become wise?

This is a good answer

Test 2 – I am going to have a meeting with my boss. I want to tell him about the amazing work I have done. Did Solomon have any advice?

Another good answer

Test 3 – I think the mayor of our town is a liar and a fool. I told my friend and she said it isn't a good idea to say things like that. Which of us is right? What does proverbs say?

3 for 3!

Test 4 – I have a lot of money and I am very proud of that. In fact, I am sure I am better than most people. What do you think about that?

4 for 4!

Test 5 – How do I do the right things and avoid doing the wrong things?

We have a clear winner!

Results!

The bot based on GPT-3.5-turbo-16k is the clear winner. It gave a good answer for each of the questions. If you are looking for a no-code solution and you have access to Azure, Azure AI Studio with Azure AI Search is clearly superior to Microsoft Copilot Studio in terms of ease of use and the quality of the result. But remember, you can start in Azure AI Studio, publish to Copilot Studio and extend from there which opens the door to a wide range of options with Power Platform. Alternatively, or in addition to deploying to Power Platform, you can deploy to an app service in Azure and extend from there. Because of the many recent changes, not the least of which is the aggressive new pricing in Azure AI Search, I expect the advice I give clients in December to be very different from the advice I gave in early November. Shameless pitch: you need an adviser in this insanely fast changing space!

Castle on a hill

The next part will take us into much deeper territory technically and will go into solutions using embeddings, vector databases, and the Semantic Kernel and Kernel Memory libraries before bringing us back to Copilot Studio with AI Plugins that work with ChatGPT, Microsoft Copilot, and more. I had no idea when I started to write this article that I was writing a book chapter!

Paul Conlin

Unlocking the power of technology to deliver exceptional customer experiences

8 months ago

Very interesting article, but it dates back to December 2023. Do we know what improvements have been made to Copilot Studio in the last 7 months?


Well done Doug. Great research content.

Charlie Wheeler

Adtech & data

10 months ago

Been thinking about the implications of private LLMs on interfaces like Microsoft's Azure platform. Does anyone have any experiences of unifying data and getting these things set up to democratise SQL type queries? What kind of challenges are there? Obviously allowing management to put questions on data to a private chatbot would be huge for on the fly insights. Also, is there now more possibility of corporations choosing to share data as a way to train their LLMs in more depth alongside benchmarking??

Zachary Conner

Principal Software Engineer / Tech Lead at CarMax

1 year ago

This is great Doug Ware, thanks for writing! From my perspective, CoPilot Studio's main advantage is the ability to build out static triggers and topics that should be fulfilled in a specific way, while also layering in the capabilities of GenAI and RAG to catch the leftover topics that are not statically defined. In this way, the two platforms seem fundamentally different: CoPilot Studio brings the no-code static topic development environment of something like Bot Composer and mixes it with the latest LLM-based technologies, while Azure AI Studio is specifically built to enable no-code development of a fully LLM based chatbot. I'm still in the process of wading through all of these new Microsoft AI offerings, so I'd love a knowledge check on that.

Craig Stanley

Microsoft 365 | Azure | Automation | Integration | AI | AGENTS | Compliance | Copilot | Cloud Architecture | Data & Analytics | Digital Transformation | Liferay DXP | Automation | Adoption & Change | HeyGen | SLACK

1 year ago

Great post Doug Ware - do you have any other recommendations on test datasets?
