Monitor Accessibility Errors on Your Ecommerce Site with Scrapy, WAVE

In just a few lines of code, ecommerce businesses can monitor individual store pages, searching for potential web accessibility problems that might make it difficult for shoppers with disabilities to browse for products or make purchases.

Web accessibility is important to online retailers for at least two reasons. First, online stores are built to make sales, and any time a customer cannot easily find or buy products, store owners should be concerned. Second, web accessibility is likely to generate a number of lawsuits and legal settlements this year, in part because U.S. web accessibility regulations are less than clear. So to avoid litigation, make certain your store is accessible.

Ecommerce business owners and managers have a number of possible ways to monitor site pages and find potential accessibility problems. As one example, in this article I am going to describe how to use a web crawling framework, Scrapy, and an accessibility testing application programming interface (API) to crawl your own ecommerce website and look for accessibility errors.

Before you can start on the spider, you will need to register for the Web Accessibility in Mind (WebAIM) WAVE API. This API considers the World Wide Web Consortium (W3C) Web Content Accessibility Guidelines (WCAG) and Section 508 of the U.S. Workforce Rehabilitation Act, evaluating the pages you submit for potential errors.

The WAVE API is not free. You’ll pay 2.5 cents to 4 cents per credit — an API request might cost two or three credits — depending on how many you buy at a time. Once you have registered, you can log in and get your API key.
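
As a rough illustration, crawling a 1,000-page site at two credits per page would consume 2,000 credits, or roughly $50 to $80 depending on the per-credit rate you pay.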

The Spider’s Mission

The web spider we will write should return a report like the one below. The report will show each page of the target website — i.e., your ecommerce website — and how many potential accessibility errors are on each page.

url                                            alerts   errors   contrast
https://businessideadaily.com/auth/login            3        3          1
https://businessideadaily.com/password/email        2        2          0
https://businessideadaily.com/                      9        8          3

To do this, the spider will need to accomplish three tasks.

  • The spider must be able to find essentially all of the pages on the target website.
  • For each page, the spider will pass the page URL to the WAVE API.
  • Results from the accessibility evaluation will be captured as fields in the report.
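
Before we get to the code, here is a quick sketch (my summary, not code from the article) of how those three tasks map onto the spider's parts.

# Task 1: CrawlSpider rules with a LinkExtractor follow links until
#         essentially every page on the site has been visited.
# Task 2: parse_item(response) turns each page URL into a WAVE API request.
# Task 3: parse_test_json(response) reads the JSON report into item fields.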

Building on a Basic Page-finding, Link-crawling Spider

In “Crawl Your Ecommerce Site with Python, Scrapy,” an article published earlier this month, I explained how to build a functional Scrapy-powered web spider that will find essentially all of the pages on your website.

That spider completes the first of the three tasks, and it is the foundation I will build on in this article.

Prepare the Spider

If you have created the spider from the aforementioned “Crawl Your Ecommerce Site with Python, Scrapy” article, you can start with it. Or you can generate a new, very similar spider with the command below.

scrapy genspider -t crawl testaccess businessideadaily.com
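
Breaking that command down (this annotation is mine, not output from Scrapy):

# scrapy genspider -t crawl  --> generate a spider from the "crawl" template
# testaccess                 --> the name of the new spider
# businessideadaily.com      --> the domain the spider is allowed to crawl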

This command will generate a new file, called testaccess.py. Here is what that file will look like when you first generate it.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from bid.items import BidItem


class TestaccessSpider(CrawlSpider):
    name = 'testaccess'
    allowed_domains = ['businessideadaily.com']
    start_urls = ['http://www.businessideadaily.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = BidItem()
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

Starting at the top of the file, the testaccess spider imports or makes use of some outside classes and libraries.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from bid.items import BidItem

To these, we are going to add a library for parsing JSON. JSON (JavaScript Object Notation) is a common format for organizing information as it is shared between systems. The WAVE API returns its results in JSON by default, thus the need for this library.

import json
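
As a quick illustration (the sample data is mine, not a WAVE response), json.loads turns a JSON string into a Python dictionary that we can index by key.

import json

# Parse a small JSON string into a nested Python dictionary.
sample = '{"categories": {"error": {"count": 8}}}'
parsed = json.loads(sample)
print(parsed['categories']['error']['count'])  # prints 8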

Now the top of the testaccess.py file will look like this.

# -*- coding: utf-8 -*-
import json

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from bid.items import BidItem

Moving a little further down the file, we need to add a domain to the list of allowed domains. This second domain — wave.webaim.org — is for the API. Without specifying it here, Scrapy would not let us get the accessibility report.

allowed_domains = ['businessideadaily.com', 'wave.webaim.org']

We also need to remove the allow parameter from the spider’s rule set, since we want to find all of the pages on the site. If you are using the spider from my earlier article, this will already be done.

Rule(LinkExtractor(), callback='parse_item', follow=True),

Collect and Work with Pages

The spider has a parse_item method.

def parse_item(self, response):
    i = BidItem()
    #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
    #i['name'] = response.xpath('//div[@id="name"]').extract()
    #i['description'] = response.xpath('//div[@id="description"]').extract()
    return i

When the web spider finds a page on your site, it will pass a response — the information it found about the page — to this method. We are going to use parse_item to collect the page URL, prepare that URL for the API request, and pass the modified URL to another method. In truth, we could change the name of this method, since parse_item is not as descriptive as we might like. Feel free to rename it; just be certain to change the callback name in the Rule, too (see the sketch after the next code block).

def parse_item(self, response):
    test_url = "http://wave.webaim.org/api/request?key={YOURKEYHERE}&url=" + response.url
    yield scrapy.Request(test_url, callback=self.parse_test_json)
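
If you do rename the method, the Rule and the method definition must agree. Here is a sketch, using the hypothetical name request_accessibility_test.

rules = (
    Rule(LinkExtractor(), callback='request_accessibility_test', follow=True),
)

def request_accessibility_test(self, response):
    # Identical to parse_item above; only the name has changed.
    test_url = "http://wave.webaim.org/api/request?key={YOURKEYHERE}&url=" + response.url
    yield scrapy.Request(test_url, callback=self.parse_test_json)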

The first thing the parse_item method does is to prepare the request URL for the API. This request URL has three parts.

The beginning of the request URL is simply the address for accessing the WAVE API.

http://wave.webaim.org/api/request?

Next, you will need to pass the API a key. You can find it on the WAVE API dashboard. Don’t forget that you will need to register and buy credits.

http://wave.webaim.org/api/request?key={YOURKEYHERE}

Finally, the WAVE API needs to know which page it should evaluate. Each page's URL is stored in response.url, so we concatenate response.url onto the end of the request URL we have built so far.

"http://wave.webaim.org/api/request?key={YOURKEYHERE}&url=" + response.url

When this test_url is output, it will look like a “regular” web address. For example:

http://wave.webaim.org/api/request?key=kdh76h83&url=http://businessideadaily.com
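
One caution: if a page URL contains its own query string, the ampersands in it could garble the API request. The article's spider concatenates the raw URL, which works for simple addresses; a safer variant (my suggestion, not from the original) is to URL-encode the page address first.

# A minimal sketch: URL-encode the page address before appending it.
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote        # Python 2

def build_wave_url(api_key, page_url):
    # quote() escapes characters such as ?, &, and / inside page_url.
    return ("http://wave.webaim.org/api/request?key=" + api_key +
            "&url=" + quote(page_url, safe=''))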

This request URL is passed to a Scrapy Request object, which represents the API call for this particular page.

yield scrapy.Request(test_url, callback=self.parse_test_json)

As a second parameter, we pass a callback function. This method will receive the accessibility report.

Process the Accessibility Report

The parse_test_json method organizes the WAVE API’s response for each page tested and makes it easier for us to get our accessibility report.

def parse_test_json(self, response):
    test_results = json.loads(response.body_as_unicode())

    href = BidItem()
    href['url'] = test_results['statistics']['pageurl']
    href['errors'] = test_results['categories']['error']['count']
    href['contrast'] = test_results['categories']['contrast']['count']
    href['alerts'] = test_results['categories']['alert']['count']
    href['report'] = test_results
    return href

The results from the WAVE API are captured in the test_results variable. The json.loads function (from the JSON library we imported at the beginning of this article) parses the Unicode body of the response into a Python dictionary.

test_results = json.loads(response.body_as_unicode())

Here is an example of what those results look like.

"status": {"success": true},
 "statistics": {
    "waveurl": "http://wave.webaim.org/report?url=https://businessideadaily.com/",
    "pagetitle": "Business Idea Daily - delivering business ideas each day",
    "pageurl": "https://businessideadaily.com/",
    "allitemcount": 51,
    "creditsremaining": 90,
    "time": "1.38",
    "totalelements": 142},
 "categories": {
    "feature": "count": 1, "description": "Features"},
    "contrast": {"count": 3, "description": "Contrast Errors"},
    "html5": {"count": 15, "description": "HTML5 and ARIA"},
    "error": {"count": 8, "description": "Errors"},
    "alert": {"count": 9, "description": "Alerts"},
    "structure": {"count": 15, "description": "Structural Elements"}
 }
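
Notice the creditsremaining field under statistics. The spider in this article does not use it, but you could log it on each response to keep an eye on your API spend. A one-line sketch, assuming Scrapy's standard spider logger:

self.logger.info("WAVE credits remaining: %s",
                 test_results['statistics']['creditsremaining'])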

From this we are going to capture five things:

  • The complete result set as JSON;
  • The number of errors found;
  • The number of contrast errors found;
  • The number of alerts discovered;
  • The URL for the page tested.

Each of these will be stored as part of an item. Our spider's items are defined in the items.py file, which is part of a standard Scrapy project. The items.py file from the earlier article had just one field.

class BidItem(scrapy.Item):
    url = scrapy.Field()

For this more-specific, web-accessibility-testing spider, we are going to add a few fields to the item class.

class BidItem(scrapy.Item):
    url = scrapy.Field()
    errors = scrapy.Field()
    contrast = scrapy.Field()
    alerts = scrapy.Field()
    report = scrapy.Field()

Turning back to the testaccess.py file, we can now specify values for each of the fields.

href = BidItem()
href['url'] = test_results['statistics']['pageurl']
href['errors'] = test_results['categories']['error']['count']
href['contrast'] = test_results['categories']['contrast']['count']
href['alerts'] = test_results['categories']['alert']['count']
href['report'] = test_results

Notice that to get the number of errors, we look in the categories object of the JSON response, then in error, and finally read the count value.

href['errors'] = test_results['categories']['error']['count']

Seeing the JSON again might help this to make more sense.

categories": {
 "error": {"count": 8, "description": "Errors"},
 }

Finally, we tell our spider to return the item.

return href
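
One defensive refinement, which is my suggestion rather than part of the original spider: the API response includes a status object with a success flag, and you could check it before reading the counts so the spider quietly skips any page the WAVE API could not evaluate.

def parse_test_json(self, response):
    test_results = json.loads(response.body_as_unicode())
    # Skip pages the WAVE API could not evaluate successfully.
    if not test_results.get('status', {}).get('success'):
        return
    href = BidItem()
    href['url'] = test_results['statistics']['pageurl']
    href['errors'] = test_results['categories']['error']['count']
    href['contrast'] = test_results['categories']['contrast']['count']
    href['alerts'] = test_results['categories']['alert']['count']
    href['report'] = test_results
    return href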

Crawl Your Site

Moving to a terminal and the directory for our testaccess spider, we can send the spider out to find the pages on our site and test each one for possible web accessibility problems.

scrapy crawl testaccess -o webaccess.csv

This command will generate a webaccess.csv file that you can open in your favorite spreadsheet application, Google Sheets, for example. The file will show how many errors, contrast errors, and alerts were found on each page of your website. The spreadsheet will look something like the one below.

report   url   alerts   errors   contrast
{u'status': {u'success': True}, u'statistics': {u'waveurl': u'http://wave.webaim.org/report?url=https://businessideadaily.com/password/email', u'pagetitle': u'', u'pageurl': u'https://businessideadaily.com/password/email', u'allitemcount': 6, u'creditsremaining': 21, u'time': u'1.89', u'totalelements': 36}, u'categories': {u'feature': {u'count': 0, u'description': u'Features'}, u'contrast': {u'count': 0, u'description': u'Contrast Errors'}, u'html5': {u'count': 2, u'description': u'HTML5 and ARIA'}, u'error': {u'count': 2, u'description': u'Errors'}, u'alert': {u'count': 2, u'description': u'Alerts'}, u'structure': {u'count': 0, u'description': u'Structural Elements'}}}   https://businessideadaily.com/password/email   2   2   0
{u'status': {u'success': True}, u'statistics': {u'waveurl': u'http://wave.webaim.org/report?url=https://businessideadaily.com/', u'pagetitle': u'Business Idea Daily - delivering business ideas each day', u'pageurl': u'https://businessideadaily.com/', u'allitemcount': 51, u'creditsremaining': 21, u'time': u'2.14', u'totalelements': 142}, u'categories': {u'feature': {u'count': 1, u'description': u'Features'}, u'contrast': {u'count': 3, u'description': u'Contrast Errors'}, u'html5': {u'count': 15, u'description': u'HTML5 and ARIA'}, u'error': {u'count': 8, u'description': u'Errors'}, u'alert': {u'count': 9, u'description': u'Alerts'}, u'structure': {u'count': 15, u'description': u'Structural Elements'}}}   https://businessideadaily.com/   9   8   3
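
If you would rather post-process the results in code than in a spreadsheet, Scrapy's feed exporter chooses the output format from the file extension, so the same crawl can write JSON instead.

scrapy crawl testaccess -o webaccess.json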