Loading your Knowledge Base from a Website

One of the best features of Flow XO's Knowledge Base is the ability to automatically fill your knowledge base from the information you have published on a website. This website can be your homepage, your helpdesk support site, or even a client's website. If you aren't familiar with the Knowledge Base feature yet, or need a refresher, please see this article.

Getting Started

To begin loading your desired website into your knowledge base, you simply need to click "Add Documents" from the main page of your Knowledge Base:

Next, make sure that "Website" is the selected data source type, and enter the URL of the website you want to synchronize with your knowledge base.

Click "Complete", and that's it! Your documents will start loading, and after a few moments will begin to appear in the list below.

You will notice the status of "RUNNING" in the "Last Run" column of your data source. Once this changes to "Complete", all of your documents have been loaded. Depending on the size of your website, this can take a few minutes to a few hours.

(NOTE: You may load a maximum of 2000 documents at a time)

Once your documents are loaded, your knowledge base is ready to be queried directly from a Knowledge Base task, or from an AI Assistant.

What about when I update my website, or add new pages?

By default, the Knowledge Base will check your website once a week for new content. If you make some changes and want to make sure they are applied right away, you can manually re-sync your website at any time from the action menu next to your Data Source:

How does it work? And what if I'm not getting the documents I would expect?

By default, our website crawler will search the domain of your website for what's called a "Sitemap". A Sitemap is a list of every webpage on a site, as well as a timestamp of the last time it was updated. When you put in a URL for us to load into your Knowledge Base, we try to find a Sitemap (most popular content management systems provide this automatically), then go through each document in the sitemap and check if it matches the URL you entered.

Here's an example:

You enter https://www.mysite.com

We check for a sitemap, and find one, which lists three URLs:

https://www.mysite.com
https://www.mysite.com/aboutus
https://blog.mysite.com

We compare each URL against the URL you entered (https://www.mysite.com) and load any webpages that START WITH that URL. In the example above, we would load:

https://www.mysite.com
https://www.mysite.com/aboutus

but not

https://blog.mysite.com

This is because https://blog.mysite.com does not start with https://www.mysite.com.
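The comparison in this example can be sketched in a few lines of Python. This is only an illustration of the "starts with" rule, not the actual crawler code, and the function name select_urls is made up for this sketch.

```python
def select_urls(start_url, sitemap_urls):
    """Keep only the sitemap URLs that start with the URL you entered."""
    return [url for url in sitemap_urls if url.startswith(start_url)]


# The three URLs from the example sitemap above.
sitemap_urls = [
    "https://www.mysite.com",
    "https://www.mysite.com/aboutus",
    "https://blog.mysite.com",
]

loaded = select_urls("https://www.mysite.com", sitemap_urls)
print(loaded)
```

The blog URL is dropped because the comparison looks at the full text of the URL, including the "www." part, rather than at the domain name alone.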

Why would we do this, and not just load every URL in the sitemap? The main reason is that users don't always want to load an entire website into their Knowledge Base, but just a portion of it, such as https://www.mysite.com/documentation.

If we loaded every URL in the website, in cases like this, it would fill your Knowledge Base up with irrelevant data you don't want your bot to have to sift through when answering questions.

Advanced Settings & Troubleshooting Issues

However, sometimes this can have unexpected results. Here are some common issues:

1. Only my homepage was loaded, but none of the other pages on my site were

Sometimes the sitemap URLs don't match the URLs you usually use. For example, maybe you asked the Knowledge Base to crawl www.mysite.com, but your sitemap has all of the URLs listed as just mysite.com/whatever. Because those URLs don't start with the URL you entered, they are skipped.

2. I didn't get any results

Similar to above, there are times when the sitemap hasn't been updated, is empty, or points to incorrect URLs. In this case, your web pages won't be discovered by the crawler.

Another reason for this can be that your site is protected by aggressive bot prevention. We use a variety of sophisticated measures to bypass being blocked, but they don't always work.

It may also be that your website is too slow, or has errors when we try to load some of the pages - when we get errors, we will retry a few times, but eventually have to give up.

3. I get some pages loaded, but not all. It seems random

There are actually several reasons this can be the case. Often, it's a mismatch between the URL you entered and the URL format found in your sitemap. To check this, you can usually just navigate to https://mysite.com/sitemap.xml and inspect the sitemap yourself. Compare the URLs you find to the one you entered, and make sure that the pages you care about all START WITH your main URL.

Another reason is that you are trying to load a subset of your website, but the pages you are looking for aren't in that subset. This can be particularly true when trying to crawl, say, a portion of a discussion forum. You may ask us to load https://myforum.com/mycategory, but each of the actual pages is under https://myforum.com/mytopic?topic=1, etc.

Because https://myforum.com/mytopic?topic=1 DOES NOT START WITH https://myforum.com/mycategory, those pages aren't loaded.
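If you would rather check your sitemap with a script than by eye, here is a minimal sketch in Python. It assumes a standard sitemap in the sitemaps.org XML format and splits the listed URLs into those that start with the URL you entered and those that don't. The function names, the sample sitemap, and the /pricing page are all made up for illustration; this is not the code our crawler runs.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locations(sitemap_xml):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def check_against_start_url(start_url, sitemap_xml):
    """Split sitemap URLs into those that start with start_url and those that don't."""
    matching, skipped = [], []
    for url in sitemap_locations(sitemap_xml):
        if url.startswith(start_url):
            matching.append(url)
        else:
            skipped.append(url)
    return matching, skipped


# Hypothetical sitemap: most URLs are listed without "www."
example_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.mysite.com</loc></url>
  <url><loc>https://mysite.com/aboutus</loc></url>
  <url><loc>https://mysite.com/pricing</loc></url>
</urlset>"""

matching, skipped = check_against_start_url("https://www.mysite.com", example_sitemap)
print("Would be loaded:", matching)
print("Would be skipped:", skipped)
```

If most of your pages end up in the skipped list, the URL you entered probably uses a different format than your sitemap does.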

Fixing These Issues

To fix these issues, we have several settings in the "Advanced" section of the "Add Documents" dialog that can help you out.

The first two, URL Prefixes to Exclude and URL Prefixes to Include, help when pages you don't want are being loaded, or pages you do want are not being loaded.

URL Prefixes to Exclude

Any URL that starts with a prefix you add to "URL Prefixes to Exclude" will NOT be imported. For example, you may want to load your whole website EXCEPT the "Open Positions" section. In that case, given that your website is https://mysite.com and all the listed open positions are under https://mysite.com/workwithus, you could add https://mysite.com/workwithus to URL Prefixes to Exclude, and any pages under that will be ignored, such as https://mysite.com/workwithus/front_end_developer and https://mysite.com/workwithus/index.
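As a rough sketch of the idea (in Python, with a made-up function name, and not the crawler's real code), excluding by prefix amounts to dropping every URL that starts with one of the configured prefixes:

```python
def apply_excludes(urls, exclude_prefixes):
    """Drop any URL that starts with one of the excluded prefixes."""
    return [
        url
        for url in urls
        if not any(url.startswith(prefix) for prefix in exclude_prefixes)
    ]


# Illustrative list of pages discovered on the site.
discovered = [
    "https://mysite.com",
    "https://mysite.com/aboutus",
    "https://mysite.com/workwithus/index",
    "https://mysite.com/workwithus/front_end_developer",
]

kept = apply_excludes(discovered, ["https://mysite.com/workwithus"])
print(kept)
```

Both pages under /workwithus are dropped, while the rest of the site is kept.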

URL Prefixes to Include

This setting works just the opposite of Prefixes to Exclude. Whatever URL prefixes you add, any page that is discovered that starts with one of these prefixes will be imported, even if it does not start with the main URL. For example, if you have a forum and you only want to load one category, say https://myforum.com/categories?category=1, but the topics are under https://myforum.com/topics?category=1, you can add https://myforum.com/topics?category=1 to your URL Prefixes to Include and those topics will get imported.
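Here is a minimal Python sketch of that rule, using a made-up should_import function and hypothetical topic URLs. It only covers the include side: how includes and excludes interact when both are configured is not described here, so the sketch leaves excludes out.

```python
def should_import(url, start_url, include_prefixes):
    """A page is imported if it starts with the main URL or with any include prefix."""
    if url.startswith(start_url):
        return True
    return any(url.startswith(prefix) for prefix in include_prefixes)


start_url = "https://myforum.com/categories?category=1"
include_prefixes = ["https://myforum.com/topics?category=1"]

# Hypothetical pages discovered on the forum.
discovered = [
    "https://myforum.com/categories?category=1",
    "https://myforum.com/topics?category=1&topic=7",
    "https://myforum.com/topics?category=2&topic=9",
]

imported = [url for url in discovered if should_import(url, start_url, include_prefixes)]
print(imported)
```

The topic in category 1 is imported because of the include prefix, even though it does not start with the main URL. The topic in category 2 matches neither, so it is skipped.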

Use Sitemaps

There are many benefits to using Sitemaps for scraping your website, including much faster and more efficient updates of changed pages. However, some websites don't publish them, or publish corrupted versions, and so you don't get the data you are hoping for.

If you aren't getting any pages in your import, you can disable Use Sitemaps in the advanced tab, and the crawl will work differently. Instead of looking for a sitemap, we will load the page you entered as your Start URL, then check that page for any other pages that it links to. We will then load each of those pages, and check for more links, etc, up to 3 levels deep.
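To give a feel for how a link-following crawl like this behaves, here is a simplified Python sketch. It is not our crawler: it walks a small made-up site held in a dictionary instead of fetching real pages, it ignores the prefix settings, and it assumes the Start URL counts as level 0, so links are followed up to 3 levels away from it. The names crawl, get_links, and MAX_DEPTH are invented for this example.

```python
from collections import deque

MAX_DEPTH = 3  # assumption: the Start URL is level 0


def crawl(start_url, get_links, max_depth=MAX_DEPTH):
    """Breadth-first link crawl, limited to max_depth levels from the start URL."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    loaded = []
    while queue:
        url, depth = queue.popleft()
        loaded.append(url)
        if depth == max_depth:
            continue  # load this page, but do not follow its links any deeper
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return loaded


# Hypothetical site where each page links only to the next one in a chain.
site = {
    "https://mysite.com": ["https://mysite.com/a"],
    "https://mysite.com/a": ["https://mysite.com/a/b"],
    "https://mysite.com/a/b": ["https://mysite.com/a/b/c"],
    "https://mysite.com/a/b/c": ["https://mysite.com/a/b/c/d"],
    "https://mysite.com/a/b/c/d": [],
}

pages = crawl("https://mysite.com", lambda url: site.get(url, []))
print(pages)
```

In this sketch the page at /a/b/c/d is never reached, because it sits four links away from the Start URL. Pages that are only linked from deep inside your site can be missed in the same way, which is one reason a sitemap is usually the better option.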

Turning this off should be a method of last resort, but it will solve some cases where your website import is returning nothing, or very few, documents.

I tried all that - nothing is working

Please reach out to us at support@flowxo.com and let us know what problem you're having. We can look at the logs of your import and tell you what exactly is going wrong. Sometimes, there's not much we can do - your site is being aggressively blocked by bot protection and our evasion mechanisms aren't working. In that case, you may need to find another method to populate your knowledge base, such as uploading PDFs manually or copying/pasting into manual documents.

In many other cases, there is a combination of the "Advanced Settings" that will solve the issue.

It is also possible that your website is too large, or has an unusual structure that our built-in web crawler can't figure out. In cases like this, we can create a custom import for you using our Professional Services team. Just reach out to us and we can provide you with pricing for building a custom import.

That's all there is to it! Happy flowing!