Every search marketing professional should have a crawler in her arsenal of tools.
Organic search’s first and most important rule is that search engines must be able to crawl to a page for that page to rank and drive any traffic or sales. If the search engine can’t crawl to discover the pages on your site, then in the eyes of the search engine, those pages do not exist. And, naturally, only pages that a search engine knows exist can show up in rankings.
Yes, you can create an XML sitemap to tell the search engines which pages really exist. But an XML sitemap alone will only get your pages indexed. Unless you have zero competition in ranking with those pages, an XML sitemap alone will not help you rank.
Your SEO performance depends on the depth of your site’s crawl. As a result, you must analyze that crawl in order to optimize your site.
My crawler recommendations are at the end of this article. First, I’ll focus on the specific reasons to crawl your site.
Discover What’s on Your Site
Find out exactly which pages are and are not on your site, according to a crawler that behaves much like Google’s traditional web crawlers. Are the products you thought were on your site really there? Are they in the category you thought they were? Has your platform created pages you didn’t know about? Or maybe merchandising or another branch of marketing has created some new or duplicate pages?
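To make the idea of "crawling to" a page concrete, here is a minimal sketch of how a crawler discovers pages by following links. The link map, the URLs, and the function name are all made up for illustration; a real crawler fetches and parses live pages rather than reading a dictionary.

```python
from collections import deque

def crawl(start, links):
    """Breadth-first discovery: return {url: click depth} for every reachable page."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth

# Hypothetical site: each page maps to the pages it links to.
SITE_LINKS = {
    "/": ["/shoes/", "/hats/"],
    "/shoes/": ["/shoes/red/", "/"],
    "/hats/": [],
    "/shoes/red/": ["/shoes/"],
    "/landing/spring-promo/": ["/shoes/"],  # exists, but nothing links to it
}

found = crawl("/", SITE_LINKS)
unreachable = sorted(set(SITE_LINKS) - set(found))
print(found)
print(unreachable)
```

The promo page links out but nothing links in, so a crawler that starts at the home page never finds it.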
Find Crawl Blocks
If a page doesn’t show up in the report at the end of the crawl, it means that the crawler could not access it.
When you scan the output file, pay special attention to what’s not there. If pages are missing, either the crawl did not complete (you’ll know from any error messages that were displayed) or the crawler could not access those pages.
Once you know that you have a crawl block, you can determine the nature of that block based on which pages are missing.
Are all of your color, style, and size filter pages missing? You probably have a very common but very damaging SEO issue: AJAX filters that refresh and narrow the products visible on the screen without changing the URL.
Are pages that have a certain combination of letters in their URL missing? One of your robots.txt disallows is probably disallowing more than intended. Is the whole darn site missing? Check for a global disallow in the robots.txt or a meta robots NOINDEX command.
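One way to spot what’s missing is to compare the crawl output against a list of URLs you expect to exist, such as the URLs in your XML sitemap. The sketch below assumes you already have both lists; the URLs and function names are hypothetical. Grouping the missing URLs by their first path segment is one simple way to see whether a whole section or filter type has dropped out.

```python
from collections import Counter
from urllib.parse import urlsplit

def missing_pages(expected_urls, crawled_urls):
    """Return expected URLs that never appeared in the crawl output, sorted."""
    return sorted(set(expected_urls) - set(crawled_urls))

def missing_by_section(missing):
    """Count missing URLs by first path segment to expose patterns."""
    counts = Counter()
    for url in missing:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        counts[segments[0] if segments else "/"] += 1
    return counts

# Hypothetical data: URLs you expect versus URLs the crawler reported.
expected = [
    "https://www.example.com/",
    "https://www.example.com/shoes/",
    "https://www.example.com/shoes/red/",
    "https://www.example.com/shoes/blue/",
    "https://www.example.com/hats/",
]
crawled = [
    "https://www.example.com/",
    "https://www.example.com/shoes/",
    "https://www.example.com/hats/",
]

missing = missing_pages(expected, crawled)
print(missing)
print(missing_by_section(missing))
```

Here both missing pages are color filter pages under the shoes section, which points to the filter problem described above.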
Learn Which URLs Are Disallowed
Some crawlers will tell you specifically which pages can be crawled to but are blocked by a robots.txt disallow. This feature makes it very easy to find and fix the file to allow any pages that were accidentally disallowed.
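If your crawler lacks this feature, you can approximate it with Python’s standard `urllib.robotparser` module. The robots.txt content and URLs below are invented for illustration. Note that this module does simple prefix matching and does not implement the wildcard extensions that Google supports, so treat its answers as a first pass rather than a definitive verdict.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the intent is to block only the shopping cart.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart
Disallow: /search
"""

def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given robots.txt text disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]

urls = [
    "https://www.example.com/cart",
    "https://www.example.com/cartoon-tshirts/",
    "https://www.example.com/shoes/",
]

blocked = blocked_urls(ROBOTS_TXT, urls)
print(blocked)
```

The rule `Disallow: /cart` matches any path that starts with `/cart`, so the cartoon T-shirt category is blocked along with the cart. That is a typical case of a disallow catching more than intended.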
Find 404 Errors
Almost every ecommerce site has 404 errors. Many show a 404-error page for each discontinued product. But those error pages tend to be useful to customers and tend not to be crawlable in the site’s navigation. In other words, when a product is discontinued, you don’t continue to link to it. The search engines know it was there because they have it indexed, and so they will see the 404 error and eventually de-index the page.
But search engines consider 404 error pages that are linked to within the site navigation a sign of poor customer experience. Combined with other signals, or in large enough quantities, 404 errors can begin to dampen search rankings.
There are other ways to get 404 reports, but they only show the URLs that are returning a 404 error. A crawler will specifically show which error pages are linked to in such a way that search engines can crawl to them. The tool also identifies how many and which pages link to each error page, which helps ferret out the underlying reason for the error so it can be resolved.
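As a rough sketch of that report, suppose the crawler exports one row per link it found, with the linking page, the linked page, and the status code the linked page returned. That export format is an assumption for this example, and the rows are made up. Grouping by the error page gives the list of pages that need their links fixed.

```python
from collections import defaultdict

# Hypothetical crawl export: (source page, linked page, status of linked page).
links = [
    ("/", "/shoes/", 200),
    ("/", "/old-sale/", 404),
    ("/shoes/", "/old-sale/", 404),
    ("/shoes/", "/shoes/red/", 200),
    ("/hats/", "/hats/fedora-2014/", 404),
]

def inlinks_to_errors(links, error_status=404):
    """Map each error URL to the sorted list of pages that link to it."""
    report = defaultdict(set)
    for source, target, status in links:
        if status == error_status:
            report[target].add(source)
    return {target: sorted(sources) for target, sources in report.items()}

report = inlinks_to_errors(links)
# Show the most-linked error pages first.
for target, sources in sorted(report.items(), key=lambda item: -len(item[1])):
    print(target, len(sources), sources)
```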
In addition to 404 errors, crawlers identify redirects. Any 302 redirects should be examined for opportunities to convert them to 301 redirects. All redirects should be reviewed to determine how many redirects happen before the crawler lands on a “real” page that returns a 200 OK, and to determine if that final destination page is actually the correct page on which to land.
The commonly cited estimate is that every 301 redirect “leaks” about 15 percent of the authority it transfers to the receiving page, although Google’s own statements on this have varied over the years. Either way, limit the number of times that a page redirects to another redirect if at all possible.
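To review redirect chains, you can walk each redirect to its final destination and count the hops. This sketch assumes a lookup table of status code and redirect target per URL, as you might build from a crawler export; the table, the URLs, and the function names are hypothetical.

```python
REDIRECT_CODES = (301, 302, 307, 308)

def resolve_chain(url, responses, max_hops=10):
    """Follow redirects through a response table; return (chain, final status)."""
    chain = [url]
    seen = {url}
    while len(chain) <= max_hops:
        status, location = responses.get(chain[-1], (None, None))
        if status not in REDIRECT_CODES or location is None:
            return chain, status
        chain.append(location)
        if location in seen:
            return chain, "loop"
        seen.add(location)
    return chain, "too many hops"

def temporary_hops(chain, responses):
    """List the URLs in a chain that answer with a 302 (candidates for a 301)."""
    return [u for u in chain if responses.get(u, (None, None))[0] == 302]

# Hypothetical response table: url -> (status code, redirect target or None).
responses = {
    "/a": (302, "/b"),
    "/b": (301, "/c"),
    "/c": (200, None),
    "/x": (301, "/y"),
    "/y": (301, "/x"),
}

chain, final_status = resolve_chain("/a", responses)
print(chain, final_status, len(chain) - 1)
print(temporary_hops(chain, responses))
print(resolve_chain("/x", responses))
```

In this example `/a` takes two hops to reach a 200 OK, and the first hop is a 302. Linking straight to `/c`, or redirecting `/a` to `/c` with a single 301, would remove the chain.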
Find Poor Meta Data
Once the data is in Excel, a simple alphabetical sort identifies which title tags are duplicates of each other or poorly written. A crawler is excellent for collecting that data. It will also collect the meta description and meta keywords fields for review. Optimization is much easier when you can quickly prioritize which areas need the most help.
Without a crawler, reviewing meta data is hit or miss. It’s tedious to sample enough pages on a site to feel comfortable that the pages have the correct meta data, and it’s always possible that the pages you don’t review are the pages that will have incorrect tags on them. For meta tags like the robots noindex, which instruct search engines not to index a page, that handful of pages that you don’t sample could cost you dearly.
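If you prefer a script to a spreadsheet sort, the same check takes a few lines of Python. The CSV below stands in for a crawler export; its column names and rows are assumptions for this example, and real exports will differ by tool.

```python
import csv
import io
from collections import defaultdict

# Hypothetical crawl export with one row per page.
CRAWL_CSV = """\
url,title
/shoes/,Shoes | Example Store
/shoes/?sort=price,Shoes | Example Store
/hats/,Hats | Example Store
/about/,
"""

def audit_titles(rows):
    """Return (duplicate titles mapped to their URLs, URLs with no title)."""
    by_title = defaultdict(list)
    missing_titles = []
    for row in rows:
        title = row["title"].strip()
        if not title:
            missing_titles.append(row["url"])
        else:
            # Compare case-insensitively so trivial variations still match.
            by_title[title.lower()].append(row["url"])
    duplicates = {t: urls for t, urls in by_title.items() if len(urls) > 1}
    return duplicates, missing_titles

rows = list(csv.DictReader(io.StringIO(CRAWL_CSV)))
duplicates, missing_titles = audit_titles(rows)
print(duplicates)
print(missing_titles)
```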
Analyze Canonical Tags
Canonical tags are still relatively new to a lot of companies and are easily done incorrectly. Many sites put a canonical tag on every page, duplicates included, that simply references that page’s own URL. On duplicate pages, this not only defeats the purpose of having a canonical tag, but it reinforces the duplicate content that the tags are meant to remove.
Review the canonical tags for pages with duplicate content to ensure that every duplicate version of that content references a single canonical page.
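Here is a minimal sketch of that review using Python’s built-in HTML parser. It assumes you already know which URLs carry the same content and have their HTML on hand; the pages, URLs, and names below are hypothetical.

```python
from html.parser import HTMLParser

class CanonicalParser(HTMLParser):
    """Record the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr.get("href")

def extract_canonical(html):
    parser = CanonicalParser()
    parser.feed(html)
    return parser.canonical

def check_duplicate_group(pages):
    """pages maps URL -> HTML for pages that share the same content."""
    canonicals = {url: extract_canonical(html) for url, html in pages.items()}
    targets = set(canonicals.values())
    consistent = len(targets) == 1 and None not in targets
    return canonicals, consistent

# Hypothetical duplicates: each page names itself as the canonical version.
pages = {
    "/shoes/": '<head><link rel="canonical" href="https://www.example.com/shoes/"></head>',
    "/shoes/?sort=price": '<head><link rel="canonical" href="https://www.example.com/shoes/?sort=price"></head>',
}

canonicals, consistent = check_duplicate_group(pages)
print(canonicals)
print(consistent)
```

The check fails here because the two versions disagree about which URL is canonical. It passes only when every duplicate references the same single page.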
Gather Custom Data
For those who want to go beyond the standard data that a crawler pulls, custom fields let you check whether certain fields exist, whether they are populated, and what they contain. It takes a bit of experience with regular expressions (“RegEx,” which match patterns of characters) or XPath (which selects parts of an XML or HTML document), but you can tell a crawler to grab the price of products, the analytics code on each page, the structured data or Open Graph tags on each page, and more.
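To give a feel for what such custom extraction looks like, here is a small regular-expression sketch. The HTML snippet, the class name, the analytics ID, and the patterns are all invented for illustration; your templates will need their own patterns, and regular expressions like these are brittle if the markup changes.

```python
import re

# Hypothetical product page markup.
PAGE_HTML = """
<html><head>
<meta property="og:title" content="Red Canvas Sneaker">
<script>ga('create', 'UA-12345-1', 'auto');</script>
</head><body>
<span class="price">$49.99</span>
</body></html>
"""

# One pattern per custom field; a capture group marks the value to keep.
PATTERNS = {
    "price": re.compile(r'class="price">\s*\$([0-9]+(?:\.[0-9]{2})?)'),
    "og_title": re.compile(r'<meta\s+property="og:title"\s+content="([^"]*)"'),
    "analytics_id": re.compile(r"\bUA-\d+-\d+\b"),
}

def extract_fields(html, patterns):
    """Return {field: extracted value, or None when the field is absent}."""
    results = {}
    for name, pattern in patterns.items():
        match = pattern.search(html)
        if match is None:
            results[name] = None
        else:
            # Use the capture group when the pattern has one, else the whole match.
            results[name] = match.group(1) if pattern.groups else match.group(0)
    return results

fields = extract_fields(PAGE_HTML, PATTERNS)
print(fields)
```

A value of `None` for a field tells you the field is missing from that page, which is often the most useful finding.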
Pull in Analytics
Some crawlers will grab analytics data from tools like Google Analytics and Google Search Console, and report it for each page crawled. This is an incredible timesaver in determining the relative value of optimizing a page. Should a page be driving much more traffic? You can make that determination and see much of the data needed to optimize the page all in one place by running one report.
Find your favorite crawler and use it often. My favorite crawler is Screaming Frog’s SEO Spider, because it can do everything listed above.
I have no affiliation with Screaming Frog. In fact, the company that produces it is a competitor of sorts, in that it’s an SEO agency in the U.K. But they have created an amazing crawler with an excellent suite of features. SEO Spider can do all of the above, and easily creates reports for export to Excel. Plus I enjoy people’s reactions when I recommend the outlandish-sounding “Screaming Frog.”
SEO Spider will set you back £99. That’s a small price to pay for the value the tool brings. In addition, Screaming Frog regularly updates SEO Spider and adds new features to it.
If you require a free solution and have a small site, Screaming Frog will let you demo its software with a limited set of features and the ability to crawl up to 500 pages.
Free tools with unlimited usage include Xenu Link Sleuth and GSite Crawler. I’m sure there are others, but these are the two that I have used and can recommend.
Xenu Link Sleuth was created by a single developer, who uses Link Sleuth to bring attention to his religious views. While I don’t endorse those views, he has made an excellent free tool that I recommend. It has been around for over ten years and isn’t supported or updated anymore — your results may vary.
I find that Link Sleuth crawls deeper than Screaming Frog’s tool without running out of system memory. Link Sleuth allows export to CSV, but the data exported is only useful to (a) analyze which pages exist on the site, (b) look for crawl blocks, and (c) find redirects and 404 errors.
GSite Crawler was created by an ex-Google employee and is geared more toward creating XML sitemaps. You can still use it to analyze which pages exist on the site and look for crawl blocks, but it lacks many of the other features above.