Mailman Archive Scraper

While I’m banging on about Pretend Office I thought I’d point out the Python script I wrote to re-publish the mailing list’s private archive into a semi-anonymised public location in case it’s of use to anyone else sometime.

The Pretend Office mailing list is run using Mailman and I wasn’t sure how to make the archives public for all to read while also maintaining anonymity for its participants. I didn’t want our full names, email addresses, phone numbers, etc. to be Googleable.

So I wrote a script which scrapes the private Mailman archive pages every hour and makes copies of them on a publicly viewable webserver. (In some cases I believe it’s possible to directly query the Mailman database but I don’t have access to that, hence this slightly more clunky scraping solution.)
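The copying step could look roughly like the sketch below. It is not the script's actual code: the function names, the example URLs and the month-style directory name are all made up for illustration, and it only covers mapping a fetched page to a file under a publish directory, not the fetching or the hourly scheduling.

```python
import os
from urllib.parse import urlparse


def local_path_for(page_url, list_url, publish_dir):
    """Map a page URL under list_url to a file path under publish_dir.

    Hypothetical helper: the archive's directory layout is simply mirrored.
    """
    base = urlparse(list_url).path.rstrip("/")
    path = urlparse(page_url).path
    if path != base and not path.startswith(base + "/"):
        raise ValueError("%s is not under %s" % (page_url, list_url))
    relative = path[len(base):].lstrip("/")
    # Directory-style URLs get an index.html file.
    if relative == "" or relative.endswith("/"):
        relative += "index.html"
    return os.path.join(publish_dir, *relative.split("/"))


def save_page(html, destination):
    """Write the page, creating any missing parent directories first."""
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    with open(destination, "w", encoding="utf-8") as handle:
        handle.write(html)
```

Creating the parent directories before writing is a deliberate choice here, since a missing directory is an easy way to end up with a "No such file or directory" error.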

In addition to simply copying the files, the script can optionally perform any or all of these additional tasks:

  • Create an RSS feed of recent emails sent to the list.

  • Remove all email addresses from the pages, including those in Mailman’s obscured “phil at gyford.com” format.

  • Replace the “More info on this list” URL with an alternative.

  • Remove some or all of the levels of quoted emails included on the page. (Pretend Office’s public archive includes only a single level of quoting.)

  • Search and replace any custom strings you like. (For Pretend Office I stripped out everyone’s surnames.)

  • Add custom HTML into the <head> of every page (e.g., for adding Google Analytics JavaScript).
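To give a flavour of the clean-up tasks in the list above, here is a minimal sketch of four of them as plain string functions. These are illustrative only, not the script's real implementation: the function names, the regular expressions and the assumption that quoted lines start with ">" or "&gt;" are all mine.

```python
import re

# Ordinary addresses, plus the obscured "name at example.com" form.
# Assumption: this simple pattern is good enough; it can also catch
# innocent prose such as "look at example.com".
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+(?:@| at )[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
)

# One or more quote markers at the start of a line, raw or HTML-escaped.
QUOTE_PREFIX_RE = re.compile(r"^(?:\s*(?:&gt;|>))+")


def remove_emails(html, placeholder=""):
    """Replace every address-like string with the placeholder."""
    return EMAIL_RE.sub(placeholder, html)


def limit_quote_depth(text, max_depth=1):
    """Drop lines quoted more deeply than max_depth levels."""
    kept = []
    for line in text.splitlines():
        prefix = QUOTE_PREFIX_RE.match(line)
        depth = len(re.findall(r"&gt;|>", prefix.group(0))) if prefix else 0
        if depth <= max_depth:
            kept.append(line)
    return "\n".join(kept)


def replace_strings(html, replacements):
    """Apply each custom search-and-replace pair in turn."""
    for old, new in replacements.items():
        html = html.replace(old, new)
    return html


def add_to_head(html, extra_html):
    """Insert extra_html just before the first closing head tag (any case)."""
    return re.sub(r"(?i)</head>", lambda m: extra_html + m.group(0), html, count=1)
```

For example, `limit_quote_depth(body, max_depth=1)` keeps unquoted lines and a single level of quoting, and `replace_strings(page, {" Gyford": ""})` would strip one surname. The quote filter is meant for the message body text, not for a whole HTML page.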

This is all pretty flexible and does the job for me. It can scrape both public and private Mailman archives, assuming you have an email address and password that allows you access to the latter.

There are some less-than-ideal issues:

  • The script doesn’t save state between runs, so it has to fetch at least as many pages as you want entries in the RSS feed every time. But this isn’t the end of the world.

  • The RSS feed doesn’t include the full text of emails. I think PyRSS2Gen can be extended to allow this but my Python isn’t good enough to work out how — if you know how please do let me know so I can add it.

  • The scraping of the source pages may break. I’ve tried it on a couple of Mailman archives and it was fine but scraping is always fragile.
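On the full-text RSS point above: one possible route, sketched here, sidesteps PyRSS2Gen entirely and builds the feed with Python's standard library, putting each email's HTML into a content:encoded element. This is an assumption about what "full text" would mean in practice, not something the script does, and the function name and dictionary keys are invented for the example.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS "content" module, which defines content:encoded.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
ET.register_namespace("content", CONTENT_NS)


def build_feed(title, link, description, emails):
    """Return an RSS 2.0 document as a string.

    emails is a list of dicts with hypothetical keys
    "subject", "url" and "body_html".
    """
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for email in emails:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = email["subject"]
        ET.SubElement(item, "link").text = email["url"]
        # The full text goes here; ElementTree escapes the markup for us.
        ET.SubElement(item, "{%s}encoded" % CONTENT_NS).text = email["body_html"]
    return ET.tostring(rss, encoding="unicode")
```

The trade-off is losing PyRSS2Gen's conveniences (dates, GUIDs and so on), which would need adding by hand in the same way.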

Maybe that’s all useful to someone, somewhere, someday.

It is my very first attempt at Python, so be gentle with me and, if you’re so inclined, tell me all the things I could have done better. I’ve also never used Git or GitHub before, both of which I found as baffling as most version control systems. Again, let me know if I’ve goofed somewhere. Thanks.

Mailman Archive Scraper at GitHub.

Comments

  • Hello,

    Thanks for this information. I have a mailman on a hosting provider and I have another asp.net site on a different server. Can I use this to get the archives of the mailman and save it on a local path on a different server? I hope I am making sense.

    I have your script running on the server and in place of publish_dir I have typed in C:/inetpub/wwwroot/Mailman

    I see an error message of "No such file or directory"

    Any help would be greatly appreciated.

    Thank you.

  • It sounds like it will do what you want, but I don't know anything about Windows I'm afraid, so can't help with why it doesn't like the file path you've tried, sorry.

  • Thank you for your reply. I was able to set it up fine. The public archive works fine. However, when I try to get a private archive with correct email and password, I get the following error:

      File "C:\Python26\MailmanArchiveScraper\MailmanArchiveScraper.…", line 481, in <module>
        main()
      File "C:\Python26\MailmanArchiveScraper\MailmanArchiveScraper.…", line 477, in main
        scraper.scrape()
      File "C:\Python26\MailmanArchiveScraper\MailmanArchiveScraper.…", line 146, in scrape
        self.logIn()
      File "C:\Python26\MailmanArchiveScraper\MailmanArchiveScraper.…", line 244, in logIn
        fp = mechanize.urlopen(form.click())
      File "build\bdist.win32\egg\mechanize\_opener.py", line 420, in urlopen
      File "build\bdist.win32\egg\mechanize\_opener.py", line 202, in open
      File "build\bdist.win32\egg\mechanize\_http.py", line 612, in http_response
      File "build\bdist.win32\egg\mechanize\_opener.py", line 225, in error
      File "C:\Python26\lib\urllib2.py", line 367, in _call_chain
        result = func(*args)
      File "build\bdist.win32\egg\mechanize\_http.py", line 633, in http_error_default
    urllib2.HTTPError: HTTP Error 404: Not Found

  • Judging by the "404: Not Found", my only guess is that the path I've set in the script for private Mailman archives isn't the same as what your Mailman site uses. Look for the line

    self.list_url = 'http://' + self.domain + '/mailman/private/' + self.list_name

    and see if it matches the site you're trying to access.

  • I appreciate your quick response. The path seems to be correct. The public lists are returned fine with the address of domain/pipermail/listname

    If I browse directly to my private list, the address in the browser is domain/mailman/private/listname

    Thanks

  • I would try and work out what URL it's requesting that is generating that 404 error.
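One hedged way to narrow down a 404 like the one in this thread is to build the two archive URLs mentioned above and check what each returns. The sketch below uses Python 3's standard library, not the mechanize calls the script itself uses, and the function names are made up. It also does not log in, so for a private archive you would need to pass in a `fetch` that carries the logged-in session, and it will not catch a bad URL in the login form itself.

```python
import urllib.error
import urllib.request


def candidate_urls(domain, list_name):
    """Build the public and private archive URLs discussed in the comments."""
    return {
        "public": "http://" + domain + "/pipermail/" + list_name,
        "private": "http://" + domain + "/mailman/private/" + list_name,
    }


def check_urls(urls, fetch=urllib.request.urlopen):
    """Return {label: result}: an HTTP status code, or an error description."""
    results = {}
    for label, url in urls.items():
        try:
            response = fetch(url)
            results[label] = response.status
            response.close()
        except urllib.error.HTTPError as error:
            # The server answered, but with an error status such as 404.
            results[label] = error.code
        except urllib.error.URLError as error:
            # No usable answer at all (DNS failure, refused connection...).
            results[label] = str(error.reason)
    return results
```

Printing `check_urls(candidate_urls(domain, list_name))` for your own domain and list name shows at a glance which of the two addresses the server rejects.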

13 May 2009 in Links