How to troubleshoot the Elastic App Search web crawler

In case you hadn’t heard, we recently released a brand new web crawler for Elastic App Search. The web crawler provides a simple way to ingest publicly available web content and make that content instantly searchable on your website. Configuring the web crawler to start ingesting data from your website takes only a few clicks. But what if, after crawling, no pages are indexed or some pages seem to be missing? How can you narrow down the issue and fix it?

In this blog post, we’ll first demonstrate how to configure the web crawler with App Search. Then we’ll troubleshoot its execution to identify and fix any errors we might have made during the configuration.

Manage the web crawler

Deploy App Search

First, you need to deploy App Search, which can be done in a few clicks with a free 14-day trial on Elastic Cloud, or you can download and install it on-prem. You can then create your first App Search engine. In our case, we’re going to use the Elastic blog website as a data source, so we name the engine elastic-blogs.

Once the engine is set up, go to Web Crawler in the App Search Admin UI to start ingesting documents.

Configure the web crawler

Note: The following steps deliberately contain a misconfiguration error in order for us to troubleshoot it later in the post.

When configuring the web crawler, the first piece of information to provide is the entry point of the website we want to crawl. In our case, we type https://elastic.co for the Domain URL and then click Add Domain.

In the next screen, the entry point can be edited to specify where the web crawler should start crawling. In our case, we can add the full path to https://elastic.co/blog/. This will speed up the process of crawling as it starts directly from the place we’re interested in.

The second part is the crawling rules, which enable us to target only the content we’re interested in. For each page the web crawler discovers, the crawling rules are applied in sequential order against the URL. The first rule that matches the URL is applied. In our case, we’re interested only in the blogs, so we want to exclude the rest of the Elastic website. Let’s add the following rules:

Allow | Begins with | /blog/ 
Disallow | Regex | .*

Each page of the Elastic website will be evaluated against these rules, which allow only URLs that start with /blog/. Using the same logic, we can implement more complex rules. Let’s say we want to exclude pages under the path /blog/ that summarize blogs per category or author. We can add two more rules, one using a regular expression and one using a prefix match, as follows:

Disallow | Regex | /blog/author
Disallow | Begins with | /blog/category

As a result, we have the following rules configured:

The last rule cannot be moved. This is a default rule that allows the page to be indexed if no other rules have matched.

Run the web crawler

Now that we’ve configured the web crawler, we’re ready to start crawling the blogs from the Elastic website. Let’s hit the Start Crawling button to begin ingesting documents. Shortly after, you should see documents being indexed. You can check by navigating to the Documents view.

Well, something must be wrong: we have no documents indexed. Let’s investigate.

Monitor the web crawler execution

Check the execution outcome

First, we can check the status of the web crawler execution in the Web Crawler view.

There we can see that two executions have completed and finished with no errors. This tells us that the web crawler executed properly, but it doesn’t explain why we don’t see any documents. We need to investigate further and look at the web crawler events logs.

Access events logs

App Search runs on the Elastic Stack, which is also used for other use cases, such as observability and security. This means that App Search can benefit from features from these solutions, which in our case is very handy.

By default, the events logs from the web crawler execution are automatically indexed in the underlying Elasticsearch instance that powers App Search. The easiest way to access these logs is using Kibana. If you’re using Elastic Cloud, then a Kibana instance is already deployed for you. If you have deployed on-prem, you can download and install Kibana. Open Kibana and go to the Logs app in the main menu.

The Logs app can be configured to specify where to find the logs. Navigate to Settings and set Log indices to .ent-search-crawler-*-logs*.

We can also specify which fields from the logs we want to display. To investigate our issue, let’s use the following fields:

Name | Description
crawler.crawl.id | Unique ID for the crawl execution
url.path | URL path of the page that is being processed
url.domain | Domain of the page that is being processed
http.response.status_code | HTTP response code received by the crawler while accessing a page
event.action | Event types for URLs, crawl request, and ingestion process
crawler.url.deny_reason | Reason for a page not being indexed
Message | Provides additional details on certain events

Other fields are available and can be added easily from the Log Settings UI.

Troubleshoot web crawler errors

Find the error

We can now go back to the Stream view to read the logs, where you’ll see the logs from the last execution.

We can quickly spot the main issue with our crawl. Although everything executed, the /blog/ path was denied, with the message policy=deny rule=regex pattern=.*. Because this first URL was denied, the crawler had no other pages to discover, so the crawl stopped there.

In our example the deny reason was rule_engine_denied, but the following are some of the most common reasons that will block a page from being processed:

Deny reason | Description
already_seen | Skipping because we have seen this URL before
domain_filter_denied | Not in list of allowed domains configured for the crawl request
rule_engine_denied | Denied by a configured crawl rule
nofollow | Link contains a rel=nofollow attribute or a nofollow robots meta tag is found on the page
too_many_unique_links | Reached the limit on number of unique URLs
link_too_deep | Reached crawl depth limit
link_too_long | Exceeded URL length limit
link_with_too_many_params | Exceeded number of parameters in URL
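
When a crawl produces many events, a quick tally of deny reasons can show which one dominates. The following is a minimal Python sketch, not part of App Search itself. It assumes the log events have already been exported as nested dictionaries (as in an Elasticsearch document source), and the sample events and the helper names get_dotted and count_deny_reasons are made up for illustration.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Optional


def get_dotted(doc: Dict[str, Any], dotted: str) -> Optional[Any]:
    """Read a dotted field name such as 'crawler.url.deny_reason' from a nested dict."""
    current: Any = doc
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def count_deny_reasons(events: Iterable[Dict[str, Any]]) -> Counter:
    """Count how often each deny reason appears; events without one are skipped."""
    counts: Counter = Counter()
    for event in events:
        reason = get_dotted(event, "crawler.url.deny_reason")
        if reason is not None:
            counts[reason] += 1
    return counts


# Hypothetical sample events (assumption: nested layout, made-up paths).
sample_events = [
    {"url": {"path": "/blog/"}, "crawler": {"url": {"deny_reason": "rule_engine_denied"}}},
    {"url": {"path": "/blog/a"}, "crawler": {"url": {"deny_reason": "already_seen"}}},
    {"url": {"path": "/about"}, "crawler": {"url": {"deny_reason": "rule_engine_denied"}}},
    {"url": {"path": "/blog/b"}},  # processed normally, no deny reason
]

for reason, count in count_deny_reasons(sample_events).most_common():
    print(f"{reason}: {count}")
```

With the sample data above, rule_engine_denied is counted twice and already_seen once, which would point at the crawl rules as the first thing to review.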

Fix the configuration

We can now go back to the configured web crawler rules and check them. Notice that the rule that blocks everything (Disallow | Regex | .*) is before the rule that allows the pages from the /blog/ path (Allow | Begins with | /blog/). We can simply change the order of the rules and launch the web crawler again.
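
To see why the order matters, here is a small Python sketch of first-match rule evaluation. It is a simplified model for illustration, not the crawler's actual implementation: the evaluate and matches functions are hypothetical, and it assumes a regex rule is matched from the start of the URL path.

```python
import re
from typing import List, Tuple

# (policy, rule type, pattern), mirroring the "Policy | Rule | Pattern" rows in this post.
Rule = Tuple[str, str, str]


def matches(rule_type: str, pattern: str, path: str) -> bool:
    """Return True if the URL path matches the rule (simplified model)."""
    if rule_type == "begins":
        return path.startswith(pattern)
    if rule_type == "regex":
        # Assumption: the pattern is matched from the start of the path.
        return re.match(pattern, path) is not None
    raise ValueError(f"unknown rule type: {rule_type}")


def evaluate(rules: List[Rule], path: str) -> str:
    """Apply rules in order; the first rule that matches decides the outcome."""
    for policy, rule_type, pattern in rules:
        if matches(rule_type, pattern, path):
            return policy
    # Default rule: allow the page if no other rule matched.
    return "allow"


# The deny-everything rule comes first, so it matches every path.
misconfigured = [("deny", "regex", ".*"), ("allow", "begins", "/blog/")]
# After reordering, /blog/ pages are allowed before the catch-all is reached.
fixed = [("allow", "begins", "/blog/"), ("deny", "regex", ".*")]

print(evaluate(misconfigured, "/blog/"))  # deny
print(evaluate(fixed, "/blog/"))          # allow
print(evaluate(fixed, "/products"))       # deny
```

In this model the catch-all rule shadows every rule placed after it, which is consistent with the policy=deny rule=regex pattern=.* message seen in the logs.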

Run the web crawler

If you navigate back to the Logs app, you can stream the logs as the crawler goes through the website and indexes documents. The Logs app also benefits from the search capabilities of Elasticsearch, which makes it easy to search through the events logs generated by the web crawler execution. A typical investigation workflow is to search all the events logs for a specific crawl ID (which can be found in the App Search Admin UI) and a URL path.
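
As an illustration, a search like that can be written as a KQL expression combining the two fields from the table earlier in this post. The Python sketch below only builds the query string to paste into the search bar; the helper names and the crawl ID are hypothetical placeholders.

```python
def kql_term(field: str, value: str) -> str:
    """Build a quoted KQL 'field : "value"' clause, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field} : "{escaped}"'


def crawl_log_query(crawl_id: str, url_path: str) -> str:
    """Combine the crawl ID and URL path clauses with 'and'."""
    return " and ".join(
        [
            kql_term("crawler.crawl.id", crawl_id),
            kql_term("url.path", url_path),
        ]
    )


# Hypothetical crawl ID; copy the real one from the App Search Admin UI.
print(crawl_log_query("601b21adbe5f6c8a3b3b0a7e", "/blog/"))
# crawler.crawl.id : "601b21adbe5f6c8a3b3b0a7e" and url.path : "/blog/"
```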

You can also navigate to the Documents view in the App Search admin UI to see new documents being indexed in real time.

Conclusion

In this blog post, we learned how to manage the web crawler to quickly crawl content from a website. We then showed how to monitor the web crawler execution by using the Elastic Stack to view the necessary logs. The final step was to troubleshoot and fix the misconfiguration issue we introduced.

If you haven’t already, remember that you can always start up a free 14-day trial of Elastic App Search on Elastic Cloud to try out the web crawler for yourself. Happy crawling!

https://www.elastic.co/blog/how-to-troubleshoot-the-elastic-app-search-web-crawler

Created 4y ago | 7 Apr 2021, 21:20:34

