SEO: Analyzing Googlebot Crawls for Problems, Inefficiencies

Crawl budget inefficiencies can affect organic search performance if new or updated content is not getting crawled and indexed.

In “What Crawl Budget Means for Googlebot,” Google explains in its Webmaster Central Blog that there are two factors that control crawl budget: crawl rate and crawl demand. “Taking crawl rate and crawl demand together we define crawl budget as the number of URLs Googlebot can and wants to crawl.”

Google asserts that crawl budget is not a concern for sites with fewer than a few thousand URLs. But ecommerce sites often have many more pages, creating a potential problem.

In this post, I will explain how to generate reports to help determine if your site has a Googlebot crawl budget problem. The goal is to list new or updated web pages that have not been crawled (and, thus, indexed). I’ll do this by generating a list of all URLs on a site’s XML sitemaps, with the creation or modification dates.

Then, I’ll compare that list to Googlebot crawl activity in web server logs. Log files are the best source of information for analyzing crawl budget. I addressed log analysis in “Using Server Logs to Uncover SEO Problems.”

I’ll use Screaming Frog’s Log File Analyser to start.

Log Files

First, insert your log file into Log File Analyser at “Drag and Drop Log Files Here.” This will open the “Project” tab, to configure a new analysis.

Screaming Frog’s Log File Analyser can show when Googlebot crawls pages.

Next, trim the log files to isolate Googlebot entries. Most sites are crawled by potentially dozens of bots: Googlebot, Bingbot, other search engines, and SEO tools. We also need to remove “fake” Googlebot requests, which commonly come from tools that emulate Googlebot, mostly for legitimate analysis.

To do this, in the Project tab go to New > User Agents and tick the box “Verify Bots When Importing Logs (Slows down Import).” This verifies Googlebot IPs are real by performing a double DNS verification, as Google has explained in “Verifying Googlebot,” in the Search Console help portal.
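The double DNS check can be sketched in Python: a reverse lookup on the requesting IP, a check that the hostname belongs to Google, and a forward lookup to confirm the hostname resolves back to the same IP. This is a minimal illustration, not Log File Analyser's implementation. The function name, the injectable lookup parameters, and the suffix list are assumptions for this example, and the default forward lookup handles IPv4 only.

```python
import socket

# Hostname suffixes treated as Google-owned in this sketch (an assumption;
# check Google's current documentation for the authoritative list).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` passes a reverse-then-forward DNS check.

    The lookup functions are injectable so the logic can be exercised
    without network access; by default they use the socket module.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        # Step 1: reverse DNS, IP address to hostname.
        host = reverse_lookup(ip).lower().rstrip(".")
        # Step 2: the hostname must sit under a Google-owned domain.
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # Step 3: forward DNS, hostname back to IPs, must include the original.
        return ip in forward_lookup(host)
    except OSError:
        # A failed lookup (socket.herror, socket.gaierror) means "not verified".
        return False
```

Calling `is_verified_googlebot(ip)` with an IP taken from a log entry performs live DNS queries, so results for large logs are worth caching per IP.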

Using Practical Ecommerce’s log files as an example, eliminating fake Googlebot crawls reduced individual requests from 306,960 to 112,308. In other words, nearly two-thirds (about 63 percent) of the supposed Googlebot requests were fake.

Next, after Log File Analyser processes the log, I’ll export it into a cleaned, structured CSV file. I’ll select the option “Verification Status Show Verified,” which removes the fake Googlebot entries. Because I picked our time zone when I created the project, the dates in the log are already properly formatted. Now we just need to export the CSV file.

XML Sitemaps

I’ll use, again, Practical Ecommerce’s XML sitemaps as examples. I’ll assume they are comprehensive and include only unique URLs that we want crawled and indexed. I will also assume that the last modification dates in the XML sitemaps are accurate.

The XML sitemaps hold the keys to our crawl budget analysis:

  • Are there pages or updates not crawled? We can answer this by comparing pages in the XML sitemaps against pages crawled from the logs.
  • How fast are changes picked up? We can answer this by comparing the modification times to crawl times.

I first need to convert the XML sitemaps to CSV files. I tried Screaming Frog’s SEO Spider to download the XML sitemap and export it to a CSV, but it drops the critical modification time. 

I will use Python, instead.

Comparing URLs in sitemaps to CSV files. First, I will expand the XML index sitemap and parse the individual sitemaps into a Pandas DataFrame. (“Pandas” is a Python library for data analysis. A “DataFrame” is the equivalent of a Google Sheet but with the ability to perform more powerful data transformations.) Here is the code:

https://gist.github.com/hamletbatista/5d0d996872239ddbfe8744da049124a9.
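For readers who cannot open the gist, the parsing step can be sketched with only the Python standard library. This is an illustration under stated assumptions, not the gist's code: the function names and the "url" and "lastmod" keys are hypothetical, and downloading the XML is left out, so both functions take the XML text as input.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def child_sitemaps(index_xml):
    """Return the <loc> of every child sitemap listed in a sitemap index."""
    root = ET.fromstring(index_xml)
    return [loc.text.strip() for loc in root.findall("sm:sitemap/sm:loc", NS)]


def sitemap_rows(sitemap_xml):
    """Return one dict per <url> entry, keeping the modification date."""
    root = ET.fromstring(sitemap_xml)
    rows = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS).strip()
        # <lastmod> is optional in the protocol, so record None when absent.
        rows.append({"url": loc, "lastmod": lastmod or None})
    return rows
```

The list of dicts returned by `sitemap_rows` can be passed directly to `pandas.DataFrame` to get one row per URL, with the rows from every child sitemap concatenated first.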

I then exported the DataFrame to a CSV file and imported it into Log File Analyser. When I select the URLs tab, and “Not in Log File” from the pulldown menu, I get the answer to the first question: a list of URLs that haven’t been crawled but should be.

Comparing the modification times to crawl times. To answer the second question — How fast are changes picked up? — we need to compare the last modification dates in the sitemaps to the crawl dates in the log files. Unfortunately, Log File Analyser does not offer this feature.

It’s back to Python.

I already have the XML sitemaps in a Pandas DataFrame. I will now load the CSV export from Log File Analyser into another DataFrame. I can then combine the two DataFrames using the Pandas merge function.

There are options for the merge depending on the data we want to keep. In this case, I’ll use “Left Join” to retain the XML sitemap URLs and capture the intersection between the sitemaps and the log file. Here is the code to do that:

https://gist.github.com/hamletbatista/b8801049ae464398404a8f9bc755ad26

Combine the two DataFrames using the Pandas merge function. Use the “Left Join” to keep the XML sitemap URLs and capture the intersection between the sitemap and the log file.
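A minimal sketch of the left join on toy data follows. The column names "url", "lastmod", and "last_crawled" and the example.com URLs are placeholders, not the actual headers of a Log File Analyser export. The `indicator=True` argument is optional; it adds a `_merge` column showing whether each sitemap URL was matched in the log.

```python
import pandas as pd

# Toy stand-in for the sitemap DataFrame (hypothetical column names).
sitemap_df = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    "lastmod": ["2019-01-02", "2019-01-05", "2019-01-07"],
})

# Toy stand-in for the verified-Googlebot log export.
log_df = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/b", "https://example.com/old"],
    "last_crawled": ["2019-01-03", "2019-01-04", "2019-01-06"],
})

# Left join: keep every sitemap URL and attach crawl data where it exists.
# URLs that appear only in the log (here /old) are dropped.
merged = sitemap_df.merge(log_df, on="url", how="left", indicator=True)
print(merged)
```

In this toy data, /c stays in the result with an empty crawl date because it is in the sitemap but not in the log, which is exactly the case the analysis is looking for.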

Once we merge the DataFrames, we can determine which pages are not crawled (because crawl dates are missing), using this code:

https://gist.github.com/hamletbatista/5e3a65bc19427d8c5570482b572d04b2.
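As a rough illustration of the idea, missing crawl dates can be selected with `isna()`. The DataFrame below is toy data standing in for the merged result, and the column names are assumptions carried over for the example rather than the gist's actual names.

```python
import pandas as pd

# Toy merged DataFrame: sitemap URLs with crawl dates where Googlebot visited.
merged = pd.DataFrame({
    "url": ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    "lastmod": ["2019-01-02", "2019-01-05", "2019-01-07"],
    "last_crawled": ["2019-01-03", None, None],
})

# After a left join, sitemap URLs missing from the log have no crawl date.
not_crawled = merged.loc[merged["last_crawled"].isna(), ["url", "lastmod"]]
print(not_crawled)
```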

The output is a list of pages that have not been crawled by Googlebot during the period covered by the log files.

The most interesting question is whether recently changed URLs are getting crawled. To answer it, I can compare the crawl date in the log files with the last modification date in the XML sitemap:

https://gist.github.com/hamletbatista/02fb89d885825398c611cba57fbdb3ec.
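One way to sketch this comparison is to convert both columns to datetimes and subtract them. The data, the column names, and the three-day threshold for "quickly" are all assumptions made for illustration; the right threshold depends on the site.

```python
import pandas as pd

# Toy merged DataFrame (hypothetical column names and dates).
merged = pd.DataFrame({
    "url": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
        "https://example.com/d",
    ],
    "lastmod": ["2019-01-02", "2019-01-05", "2019-01-07", "2019-01-08"],
    "last_crawled": ["2019-01-03", "2019-01-04", "2019-01-20", None],
})

merged["lastmod"] = pd.to_datetime(merged["lastmod"])
merged["last_crawled"] = pd.to_datetime(merged["last_crawled"])

# Positive lag: crawled after the update. Negative lag: the last crawl
# predates the update, so the change has not been picked up yet.
merged["crawl_lag"] = merged["last_crawled"] - merged["lastmod"]

threshold = pd.Timedelta(days=3)  # assumed cutoff for a "quick" crawl
crawled = merged["last_crawled"].notna()
picked_up = crawled & (merged["crawl_lag"] >= pd.Timedelta(0))

quick = merged[picked_up & (merged["crawl_lag"] <= threshold)]
slow = merged[picked_up & (merged["crawl_lag"] > threshold)]
stale = merged[crawled & (merged["crawl_lag"] < pd.Timedelta(0))]
```

URLs with no crawl date at all (here /d) fall into none of the three groups; they belong to the never-crawled list from the previous step.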

The result is a list of updated pages that were crawled quickly and another list of those that weren’t.

Now we can determine if Practical Ecommerce has a crawl budget problem. We can review the URLs that haven’t been crawled and enter the important ones in Search Console’s “URL inspection” tool, which should provide details as to why they were not crawled or indexed. We can also request indexing from the inspection tool.

Hamlet Batista
