Flow XO Evaluations
Measuring the effectiveness of your AI-powered chatbots
Flow XO makes it easy to create powerful AI-driven conversational experiences for your business: analyzing sentiment, translating messages to and from various languages, quickly creating automated customer service agents from your own knowledge bases, and much more. But as exciting and powerful as AI can be, especially when coupled with your own business data, it is still not perfect. While some AI models (e.g. GPT-4) are better than others (e.g. ChatGPT 3.5), no model will respond perfectly 100% of the time. There are many reasons for this, from inherent limitations of generative AI models in general (hallucinations) to missing, inaccurate or confusing data in the reference material from your websites that the AI has to work with. Whatever the cause, once you put your AI bot out into the world, it will make mistakes.
Because your AI bot is, at the end of the day, a computer program, it will tend to make the same mistakes over and over again as long as everything else stays the same.
These mistakes can take many forms, some totally harmless, and others more significant:
- Incorrectly telling the user that it doesn't have enough information to answer a question even though the data is available in some form in one of your knowledge bases
- Correctly telling a user that it doesn't have enough information to answer a question because the data really isn't available, even though your bot really should be able to answer the question
- Inventing (or hallucinating) incorrect responses based on partial or missing reference data
- Answering a different question than the user asked
- Partially answering a question in a way that isn't helpful to your users
- Use your imagination...
So while it would be great if we could treat AI as a magical being that could be turned loose with your customers and perform perfectly all the time, we must treat it more like a human agent. That means measuring and monitoring how it performs and how our users react to its responses, so that over time our conversational assistants become more and more helpful and remove as much work from ourselves and our teams as possible.
Fortunately, Flow XO makes monitoring and evolving your AI agent's performance over time as easy as possible with our Evaluation system.
The Flow XO evaluation feature does several things for you:
- Collects all (or whatever percentage you configure) of your AI generated Q&A responses into a convenient place where you can review them for accuracy and relevancy
- Lets you record your evaluations so you can monitor trends in the performance of your AI
- Provides a simple framework to make it easy to sort evaluations into "Good" and "Needs Fixing" categories, so you can easily focus on just the responses that need attention
- Gives you an easy way to prompt your end users to provide feedback on how helpful the AI was in answering their questions, and calls attention to responses that did not satisfy your users so you can make improvements
- Allows you to easily craft custom flows that can be triggered when your users provide feedback, so you can ask for more information, transfer a user to your helpdesk, segment them for later contact, or anything else you can do with a Flow XO flow
- Provides you with easy-to-understand analytics (*coming soon*) so that you can monitor trends in response quality to ensure your bot is always improving and never getting worse over time.
Getting started with Evaluations
To get started using the Flow XO evaluation feature, you need to start collecting some evaluations. First, of course, you will need to have a Knowledge Base set up for your bot, or be using our AI-powered "Answer Question" task. Next, you can enable evaluations in your account profile:
Here you will find an AI Evaluation section, and can choose how frequently you want to sample AI Q&A responses. There are two main settings you can configure, and for each one you can decide if you want a human to evaluate All, None, or some percentage of AI responses.
1. When the AI provides an answer - this setting determines how often to flag an AI Response for evaluation when the AI is able to provide a response other than "I don't know".
2. When the AI indicates it does not know the answer - this setting determines how often to flag an AI Response when the AI is unable to provide an answer based on the knowledge available to it
Why do we have two different settings? The main reason is that an "unknown" answer is very often an opportunity to add more content to your knowledge base. If the question was a legitimate question (and not just spam or smalltalk), then your bot should be able to provide a useful response, so you may want to flag these responses always, or at least very frequently. If you find that over time you are generally satisfied with the answers the AI gives when it IS able to provide an answer (which, by the way, will be much more often with GPT-4), then you can set the "When the AI provides an answer" setting to Never or to a less frequent value.
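To make the two sampling settings concrete, here is a minimal Python sketch of how a per-response sampling decision could work. It is an illustration only, not Flow XO's implementation: the names `SamplingSettings`, `answered_rate`, `unknown_rate` and `should_flag` are hypothetical, and the All / None / percentage choices are modeled as a rate between 0.0 and 1.0.

```python
import random
from dataclasses import dataclass


@dataclass
class SamplingSettings:
    """Hypothetical model of the two AI Evaluation settings (not a Flow XO API)."""
    answered_rate: float  # share of answered responses to flag: 0.0 = none, 1.0 = all
    unknown_rate: float   # share of "I don't know" responses to flag


def should_flag(ai_knew_answer: bool, settings: SamplingSettings,
                rng: random.Random) -> bool:
    """Decide whether one AI response is flagged for human evaluation."""
    rate = settings.answered_rate if ai_knew_answer else settings.unknown_rate
    # rng.random() returns a float in [0.0, 1.0), so a rate of 0.0 never
    # flags and a rate of 1.0 always flags.
    return rng.random() < rate


# Review every "unknown" answer, but only about 10% of answered questions.
settings = SamplingSettings(answered_rate=0.10, unknown_rate=1.0)
rng = random.Random(42)  # seeded only so the example is repeatable
flagged = sum(should_flag(True, settings, rng) for _ in range(1000))
print(f"Flagged {flagged} of 1000 answered responses")
```

With settings like these, every unanswered question reaches a reviewer, while the review load for answered questions stays at roughly a tenth of the traffic.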
Once you have configured one or both of these settings to a value other than "Never", Flow XO will start flagging AI generated Q&A responses for you to evaluate.
To find them, go to the 'Evaluation' tab on the main site navigation:
This link takes you to the Evaluation Dashboard. The Evaluation Dashboard has three main sections:
* Awaiting Review
Here you will find evaluations that have been flagged for human review and are waiting on you or a team member to perform an evaluation
* Requires Correction
These are the evaluations that received a poor score, or that indicate a question the AI was unable to answer, and you need to make changes to your system to improve the AI's performance
* Recent Evaluations
This section lists all recently completed evaluations, both evaluations performed internally by you or your team and evaluations (feedback) submitted by your users.
The first step in the lifetime of an evaluation is actually performing the evaluation to record a quantitative, numerical score. This is important because it will allow you to track the quality of your AI chatbot's responses over time using the analytics (coming soon). To get started, once an evaluation has been created, click 'View' next to the first evaluation in the Awaiting Review list. This will take you to the Evaluation page:
The evaluation page should have everything you need to assess the quality of the AI response. You can see the original question, the AI generated response, any sources that were used to construct the answer, as well as the parameters applied to the AI at the time the question was asked. Additionally, on the right hand side of the screen, you can see any user generated evaluation (feedback), and have the opportunity to provide your own evaluation.
For internal evaluations, you can rate the quality of the answer from 0 to 4 across two different dimensions:
Relevancy - was the answer provided by the AI relevant to the question that the user asked? Answer 0 for not relevant at all, 4 for perfectly relevant, and 1-3 if the answer was partially relevant.
Accuracy - was the answer provided factually accurate? Answer 0 for not at all accurate (the answer was completely made up by the AI and is not at all true), 4 for a perfectly correct answer, and 1-3 for a partially correct answer where the AI added some incorrect information along with some correct information.
Once you have completed the evaluation, click 'Save', and a final score will be calculated:
If the overall score is 75% or higher, then the default next action will be "No action needed". If you complete the evaluation with "No action needed", then the evaluation will be closed, and you won't need to look at it again. If the score is below 75%, the default action will be "Requires correction":
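The article does not say how the final score is calculated from the two ratings. As a rough sketch, assuming the overall score is an equally weighted average of Relevancy and Accuracy expressed as a percentage of the maximum rating, the scoring and the 75% threshold could look like the Python below. The function names and the averaging formula are assumptions for illustration; only the 0 to 4 scale, the 75% threshold and the two default actions come from the text above.

```python
MAX_RATING = 4            # each dimension is rated from 0 to 4
THRESHOLD_PERCENT = 75.0  # at or above this, the default is "No action needed"


def overall_score(relevancy: int, accuracy: int) -> float:
    """Combine two 0-4 ratings into a percentage.

    ASSUMPTION: an equally weighted average; the real formula may differ.
    """
    for rating in (relevancy, accuracy):
        if not 0 <= rating <= MAX_RATING:
            raise ValueError("ratings must be between 0 and 4")
    return 100.0 * (relevancy + accuracy) / (2 * MAX_RATING)


def default_action(score_percent: float) -> str:
    """Return the default next action for a given overall score."""
    if score_percent >= THRESHOLD_PERCENT:
        return "No action needed"
    return "Requires correction"


print(overall_score(4, 2), default_action(overall_score(4, 2)))  # 75.0 No action needed
print(overall_score(3, 2), default_action(overall_score(3, 2)))  # 62.5 Requires correction
```

Under this assumed formula, a perfectly relevant answer that is only half accurate lands exactly on the threshold, which is a useful reminder to check the recommended action rather than accept it blindly.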
When you close an evaluation with "Requires correction", the evaluation will be kept on the dashboard in a list along with all other evaluations that need attention. The reason there are two different lists (Awaiting Review and Requires Correction) instead of a single list of evaluations is that it is often much more efficient to evaluate a batch of AI responses in one session, and then address any issues for a batch of responses as a separate process. You may even have some team members focus on performing evaluations and others on making corrections.
Keep in mind that you are not required to accept the recommended action. Clicking on the "down arrow" on the action button will always show you both choices, along with a third choice to indicate that the question the evaluation is based on is garbage: spam, smalltalk, or some other irrelevant input. These items will be excluded from analytics.
To support an efficient and seamless workflow, once you resolve an evaluation with either "No action needed", "Requires correction" or "Send to trash", the next evaluation in the list will automatically be loaded. This enables a fast and effective way to move through a set of evaluations quickly.
NOTE: You can also move evaluations directly to the trash from the evaluation dashboard:
Once you have completed all (or a batch) of pending evaluations, it will be time to make improvements to your bot. For this, very similarly to the process of performing evaluations, you can click on "View" on the "Requires correction" section of the Evaluation Dashboard to begin your corrections.
This will open up a screen very similar to the evaluation screen, where you can review the question, the response, the result of any evaluations as well as evaluator comments:
In this phase, you will make changes to your AI bot to improve the answer:
* Choose a different AI model, such as GPT-4
* Add, remove or modify documents in your website
* Add, remove or modify documents in a fine tuning knowledge base
* Create a FAQ or other intent in your intent detector
Which steps to take, and when, is beyond the scope of this article. In the future, this screen will contain some hints based on the evaluation, as well as some shortcuts to easily make necessary changes. For now, however, you will need to make whatever changes you need and test the results. When you are satisfied, click "Mark resolved". This will clear the evaluation from the list and take you to the next evaluation requiring corrections.
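The review and correction phases described so far can be summarized as a small state machine. The Python sketch below is a simplified illustration, not a Flow XO API: the `Status` names, the `TRANSITIONS` table and `apply_action` are hypothetical, and only the actions named in this article are modeled.

```python
from enum import Enum


class Status(Enum):
    AWAITING_REVIEW = "Awaiting Review"
    REQUIRES_CORRECTION = "Requires Correction"
    CLOSED = "Closed"
    TRASHED = "Trashed"


# (current status, action) -> next status. Simplified: only the actions
# named in this article are modeled.
TRANSITIONS = {
    (Status.AWAITING_REVIEW, "No action needed"): Status.CLOSED,
    (Status.AWAITING_REVIEW, "Requires correction"): Status.REQUIRES_CORRECTION,
    (Status.AWAITING_REVIEW, "Send to trash"): Status.TRASHED,
    (Status.REQUIRES_CORRECTION, "Mark resolved"): Status.CLOSED,
}


def apply_action(status: Status, action: str) -> Status:
    """Return the next status, or raise if the action is not modeled here."""
    try:
        return TRANSITIONS[(status, action)]
    except KeyError:
        raise ValueError(f"{action!r} is not available from {status.value}") from None


# One evaluation that is reviewed, corrected, and then resolved.
status = apply_action(Status.AWAITING_REVIEW, "Requires correction")
status = apply_action(status, "Mark resolved")
print(status.value)  # Closed
```

Keeping review and correction as separate states mirrors the two dashboard lists: one pass to score a batch, and a later pass to fix whatever scored poorly.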
Getting Feedback from your Users
While performing internal evaluations on a random sample (or all) of your AI generated responses is an important and helpful practice, you may also want your users to have the ability to tell you directly if an AI response helped them or not. For this, we have added a "Feedback" setting to the Knowledge Base "Answer Question" task:
When you set this setting to "Yes", then every time a Q&A answer is sent to your user, they will also receive "Thumbs Up" and "Thumbs Down" buttons to indicate if they were happy with the answer they received:
These buttons will not pause the flow, and can be clicked at any time (even when the conversation has already moved on). Once a user provides the feedback, then the following will occur:
* If you have any flows configured with a "Feedback Received" trigger, those flows will be triggered
* If no "Feedback Received" flows are triggered, the user will receive the message "Thanks for your feedback!" (this will be translated into the language set on their profile or the language configured on the bot)
* Their response will automatically generate an evaluation with their rating, which will show up in your analytics (*coming soon*)
* If they give the response a "Thumbs Up" - no further action is required. The evaluation they generated will automatically be closed with "No action needed"
* If they give the response a "Thumbs Down" the evaluation will be flagged for internal follow up, since you definitely want to review any questions that are not helping your users
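The rules in the list above can be sketched as a single decision function. This Python example is illustrative only and assumes nothing about how Flow XO is built internally: `FeedbackOutcome`, `handle_feedback` and the `feedback_flows` list of flow names are hypothetical, and the status strings are informal labels taken from the wording above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FeedbackOutcome:
    """Hypothetical summary of what happens after a thumbs up/down click."""
    triggered_flows: List[str] = field(default_factory=list)
    default_reply: Optional[str] = None
    evaluation_status: str = ""


def handle_feedback(thumbs_up: bool, feedback_flows: List[str]) -> FeedbackOutcome:
    outcome = FeedbackOutcome()
    if feedback_flows:
        # Flows with a "Feedback Received" trigger take over the response.
        outcome.triggered_flows = list(feedback_flows)
    else:
        # With no such flow, the user gets the default thank-you message.
        outcome.default_reply = "Thanks for your feedback!"
    # Every click also creates an evaluation carrying the user's rating.
    outcome.evaluation_status = (
        "No action needed" if thumbs_up else "Flagged for internal follow up"
    )
    return outcome


print(handle_feedback(thumbs_up=False, feedback_flows=[]))
```

Note that the evaluation is recorded in both branches: a custom flow changes what the user sees, not whether the feedback is tracked.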
The user's response will show up in the evaluation screens on the Evaluation Dashboard as well:
Reacting to User Feedback
If you just want to track user feedback, and review negative feedback in your evaluation dashboard, you don't need to do anything else. Just setting "Request feedback" to "Yes" on your Answer Question task is enough. However, you may want to take some action when a user provides feedback. For example, you might want to:
* Provide a custom response to the user based on the actual value of their feedback, such as "Glad to hear we helped!" for positive feedback or "Sorry about that, I'm always learning!" for negative feedback
* Ask the user for a follow up if they gave negative feedback, such as asking what was wrong with the answer, and writing their response to a Google Sheet
* Set a custom attribute on a user to segment them for later follow up using a broadcast
* Send them a user satisfaction survey
* Or anything else you can imagine
For these scenarios, we provide a "Feedback Received" flow trigger that lets you trigger a custom flow when a user provides feedback via this mechanism. This trigger can be found in the "Other Services" section of available triggers:
This flow can contain anything you want it to.
That's it for now!
We hope these evaluation features will be useful to you in monitoring and improving your AI chatbots. Keep on the lookout for powerful new features coming in the future, such as Evaluation Analytics, AI-assisted evaluations, general-purpose CSAT and NPS surveys that aren't tied to Q&A responses, and more.
As always, please reach out to us at email@example.com with any questions or feedback. Happy flowing!