While I’m banging on about Pretend Office I thought I’d point out the Python script I wrote to re-publish the mailing list’s private archive into a semi-anonymised public location in case it’s of use to anyone else sometime.
The Pretend Office mailing list is run using Mailman and I wasn’t sure how to make the archives public for all to read while also maintaining anonymity for its participants. I didn’t want our full names, email addresses, phone numbers, etc. to be Googleable.
So I wrote a script which scrapes the private Mailman archive pages every hour and makes copies of them on a publicly viewable web server. (In some cases I believe it’s possible to query the Mailman database directly, but I don’t have access to that, hence this slightly more clunky scraping solution.)
As well as simply copying the files, the script can optionally perform any or all of these tasks:
Create an RSS feed of recent emails sent to the list.
Remove all email addresses from the pages, including those in Mailman’s obscured “phil at gyford.com” format.
Replace the “More info on this list” URL with an alternative.
Remove some or all of the levels of quoted emails included on the page. (Pretend Office’s public archive includes only a single level of quoting.)
Search and replace any custom strings you like. (For Pretend Office I stripped out everyone’s surnames.)
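To give a flavour of the clean-up steps listed above, here is a minimal sketch of how they could be done in Python. It is not the script’s actual code: the function names, the regular expressions and the placeholder text are all hypothetical, and it works on plain message text (in a real archive page the “>” quote markers would usually appear HTML-escaped as “&gt;”). The obscured-address pattern is deliberately loose, so it may also catch innocent phrases such as “look at example.com”.

```python
import re

# Ordinary addresses, e.g. "phil@gyford.com" (a loose pattern, not RFC-complete).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Mailman's obscured form, e.g. "phil at gyford.com".
OBSCURED_RE = re.compile(r"[\w.+-]+ at [\w-]+(?:\.[\w-]+)+")
# One or more leading ">" quote markers, optionally separated by whitespace.
QUOTE_RE = re.compile(r"(?:\s*>)+")


def remove_emails(text, placeholder=""):
    """Replace both plain and obscured email addresses with a placeholder."""
    text = EMAIL_RE.sub(placeholder, text)
    return OBSCURED_RE.sub(placeholder, text)


def strip_quotes(text, max_level=1):
    """Drop lines that are quoted more deeply than max_level."""
    kept = []
    for line in text.splitlines():
        match = QUOTE_RE.match(line)
        level = match.group(0).count(">") if match else 0
        if level <= max_level:
            kept.append(line)
    return "\n".join(kept)


def replace_strings(text, replacements):
    """Apply each (old -> new) replacement in turn, e.g. to strip surnames."""
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text
```

Calling strip_quotes(body, max_level=1) keeps unquoted lines and a single level of quoting, which matches what Pretend Office’s public archive shows; passing max_level=0 would remove quoted text altogether.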
This is all pretty flexible and does the job for me. It can scrape both public and private Mailman archives, assuming you have an email address and password that allow you access to the latter.
There are some less-than-ideal issues:
The script doesn’t save state between runs, so it has to fetch at least as many pages as you want entries in the RSS feed every time. But this isn’t the end of the world.
The RSS feed doesn’t include the full text of emails. I think PyRSS2Gen can be extended to allow this but my Python isn’t good enough to work out how. If you know how, please do let me know so I can add it.
The scraping of the source pages may break. I’ve tried it on a couple of Mailman archives and it was fine but scraping is always fragile.
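On the first of those issues, one possible way to avoid re-fetching pages on every run would be to keep a small JSON file of what has already been seen. The following is only a sketch under my own assumptions, not something the script does: the function names, the state file layout (a dictionary keyed by page URL) and the fetch callable you pass in are all hypothetical.

```python
import json
import os


def load_state(path):
    """Return the saved {url: details} dictionary, or an empty one on first run."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_state(path, state):
    """Write the state to a temporary file, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def fetch_new(urls, state, fetch):
    """Call fetch(url) only for URLs not already in state.

    fetch is any callable returning JSON-serialisable details for a page.
    Returns the list of URLs that were newly fetched.
    """
    new_urls = []
    for url in urls:
        if url not in state:
            state[url] = fetch(url)
            new_urls.append(url)
    return new_urls
```

Each hourly run would then call load_state, pass the list of message URLs from the archive index to fetch_new, and finish with save_state, so only messages that have appeared since the last run get downloaded.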
Maybe that’s all useful to someone, somewhere, someday.
It is my very first attempt at Python, so be gentle with me and, if you’re so inclined, tell me all the things I could have done better. I’ve also never used Git or GitHub before, both of which I found as baffling as most version control systems. Again, let me know if I’ve goofed somewhere. Thanks.