Parsing a Wikipedia page’s content with Python

A while back I was asking on Twitter and Stack Overflow about how to parse a Wikipedia page’s content using Python. It seemed harder than I expected, given the number of Wikimedia-related tools available. Here’s what I ended up doing.

What I wanted to do:

  • Fetch the content of a particular Wikipedia page.
  • Tweak that content (e.g., hide certain elements).
  • Save the resulting HTML.

Given that MediaWiki has an API, I initially thought the best approach would be to grab structured content using that, remove the elements I didn’t want, and then render the rest into nice, clean HTML. This seemed more robust than scraping a rendered page’s HTML, parsing it, removing bits, then saving the remainder. Scraping always feels like a last resort. And there was an API!

But MediaWiki content is much more complicated than I first thought and, following the discussion on my Stack Overflow question, it seemed like turning Wikipedia’s raw wikitext into HTML was going to be more trouble than it was worth.
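
For the record, just fetching a page’s raw wikitext via the API is easy enough; it’s turning that wikitext into HTML yourself that’s the hard part. A minimal sketch of the fetching, assuming the standard action=parse endpoint (Samuel_Pepys is just an example page):

import requests

response = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={
        'action': 'parse',
        'page': 'Samuel_Pepys',
        'prop': 'wikitext',
        'format': 'json',
    },
    timeout=5,
)
# The wikitext itself is nested a couple of levels deep in the JSON:
wikitext = response.json()['parse']['wikitext']['*']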

A small step up from scraping standard Wikipedia pages would be to omit all the stuff surrounding the content, which can be done by appending ?action=render to the URL, e.g. https://en.wikipedia.org/wiki/Samuel_Pepys?action=render. Then it would be a case of parsing the HTML, ensuring it’s sane, and stripping out anything I didn’t want.

The resulting Python script (on GitHub, with tests) is part of my Pepys’ Diary code, written in Django, but is fairly standalone.

The process is:

  1. Fetch the HTML page using requests.

  2. Use bleach to ensure the HTML is valid and, by whitelisting only the HTML tags and attributes we want, strip out unwanted elements (there’s a short example after this list).

  3. Use BeautifulSoup to further strip out HTML elements based on their CSS class names, and to add extra classes to elements with certain existing classes.

  4. Return the new, improved HTML.
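
As a quick illustration of the bleach step, here’s a minimal sketch (the exact output may vary a little between bleach versions): disallowed tags are removed, but their text content is kept.

import bleach

# <span> isn't in the whitelist so the tag is stripped,
# but its contents survive:
bleach.clean(
    '<p>Hello <span class="x">world</span></p>',
    tags=['p'], attributes={}, strip=True)
# Returns: '<p>Hello world</p>'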

It seems to work alright, resulting in some decent-looking copies of Wikipedia pages.

For completeness, here’s the code at the time of writing, but the GitHub version may be newer:

from bs4 import BeautifulSoup
import bleach
import requests


class WikipediaFetcher(object):

    def fetch(self, page_name):
        """
        Passed a Wikipedia page's URL fragment, like
        'Edward_Montagu,_1st_Earl_of_Sandwich', this will fetch the page's
        main contents, tidy the HTML, strip out any elements we don't want
        and return the final HTML string.

        Returns a dict with two elements:
            'success' is either True or, if we couldn't fetch the page, False.
            'content' is the HTML if success==True, or else an error message.
        """
        result = self._get_html(page_name)

        if result['success']:
            result['content'] = self._tidy_html(result['content'])

        return result

    def _get_html(self, page_name):
        """
        Passed the name of a Wikipedia page (eg, 'Samuel_Pepys'), it fetches
        the HTML content (not the entire HTML page) and returns it.

        Returns a dict with two elements:
            'success' is either True or, if we couldn't fetch the page, False.
            'content' is the HTML if success==True, or else an error message.
        """
        error_message = ''

        url = 'https://en.wikipedia.org/wiki/%s' % page_name

        try:
            response = requests.get(url, params={'action': 'render'},
                                    timeout=5)
            # Raises an HTTPError for 4xx or 5xx responses:
            response.raise_for_status()
        except requests.exceptions.ConnectionError:
            error_message = "Can't connect to domain."
        except requests.exceptions.Timeout:
            error_message = "Connection timed out."
        except requests.exceptions.TooManyRedirects:
            error_message = "Too many redirects."
        except requests.exceptions.HTTPError:
            # 4xx or 5xx errors:
            error_message = "HTTP Error: %s" % response.status_code
        except requests.exceptions.RequestException:
            # Catch-all for anything else requests might raise:
            error_message = "Something unusual went wrong."
        if error_message:
            return {'success': False, 'content': error_message} 
        else:
            return {'success': True, 'content': response.text}

    def _tidy_html(self, html):
        """
        Passed the raw Wikipedia HTML, this returns valid HTML, with all
        disallowed elements stripped out.
        """
        html = self._bleach_html(html)
        html = self._strip_html(html)
        return html

    def _bleach_html(self, html):
        """
        Ensures we have valid HTML; no unclosed or mis-nested tags.
        Removes any tags and attributes we don't want to let through.
        Doesn't remove the contents of any disallowed tags.

        Pass it an HTML string, it'll return the bleached HTML string.
        """

        # Most elements, but no forms or audio/video.
        allowed_tags = [
            'a', 'abbr', 'acronym', 'address', 'area', 'article',
            'b', 'blockquote', 'br',
            'caption', 'cite', 'code', 'col', 'colgroup',
            'dd', 'del', 'dfn', 'div', 'dl', 'dt',
            'em',
            'figcaption', 'figure', 'footer',
            'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header', 'hgroup', 'hr',
            'i', 'img', 'ins',
            'kbd',
            'li',
            'map',
            'nav',
            'ol',
            'p', 'pre',
            'q',
            's', 'samp', 'section', 'small', 'span', 'strong', 'sub', 'sup',
            'table', 'tbody', 'td', 'tfoot', 'th', 'thead', 'time', 'tr',
            'ul',
            'var',
        ]

        # Only these attributes will be kept on the allowed tags;
        # all others are removed.
        allowed_attributes = {
            '*':        ['class', 'id'],
            'a':        ['href', 'title'],
            'abbr':     ['title'],
            'acronym':  ['title'],
            'img':      ['alt', 'src', 'srcset'],
            # Ugh. Don't know why this page doesn't use .tright like others
            # http://127.0.0.1:8000/encyclopedia/5040/
            'table':    ['align'],
            'td':       ['colspan', 'rowspan'],
            'th':       ['colspan', 'rowspan', 'scope'],
        }

        return bleach.clean(html, tags=allowed_tags,
                                    attributes=allowed_attributes, strip=True)

    def _strip_html(self, html):
        """
        Takes out any tags, and their contents, that we don't want at all.
        And adds custom classes to existing tags (so we can apply CSS styles
        without having to multiply our CSS).

        Pass it an HTML string, it returns the stripped HTML string.
        """

        # CSS selectors. Strip these and their contents.
        selectors = [
            'div.hatnote',
            'div.navbar.mini', # Will also match div.mini.navbar
            # Bottom of https://en.wikipedia.org/wiki/Charles_II_of_England :
            'div.topicon',
            'a.mw-headline-anchor',
        ]

        # Strip any element that has one of these classes.
        classes = [
            # "This article may be expanded with text translated from..."
            # https://en.wikipedia.org/wiki/Afonso_VI_of_Portugal
            'ambox-notice',
            'magnify',
            # eg audio on https://en.wikipedia.org/wiki/Bagpipes
            'mediaContainer',
            'navbox',
            'noprint',
        ]

        # If an element has a class matching one of these keys, the classes
        # in the corresponding value will be added to it.
        add_classes = {
            # Give these tables standard Bootstrap styles.
            'infobox':   ['table', 'table-bordered'],
            'ambox':     ['table', 'table-bordered'],
            'wikitable': ['table', 'table-bordered'],
        } 

        soup = BeautifulSoup(html, 'html.parser')

        for selector in selectors:
            for tag in soup.select(selector):
                tag.decompose()

        for clss in classes:
            for tag in soup.find_all(attrs={'class': clss}):
                tag.decompose()

        for clss, new_classes in add_classes.items():
            for tag in soup.find_all(attrs={'class':clss}):
                tag['class'] = tag.get('class', []) + new_classes

        # Depending on the HTML parser BeautifulSoup used, soup may have
        # surrounding <html><body></body></html> or just <body></body> tags.
        if soup.body:
            soup = soup.body
        elif soup.html:
            soup = soup.html.body

        # Put the content back into a string.
        html = ''.join(str(tag) for tag in soup.contents)

        return html
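
Using the class is then just a case of calling fetch() and checking the result, e.g.:

fetcher = WikipediaFetcher()
result = fetcher.fetch('Samuel_Pepys')

if result['success']:
    html = result['content']
else:
    print("Error: %s" % result['content'])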
