Mailman Archive Scraper (Phil Gyford’s website)

While I’m banging on about Pretend Office I thought I’d point out the Python script I wrote to re-publish the mailing list’s private archive into a semi-anonymised public location in case it’s of use to anyone else sometime.

The Pretend Office mailing list is run using Mailman and I wasn’t sure how to make the archives public for all to read while also maintaining anonymity for its participants. I didn’t want our full names, email addresses, phone numbers, etc. to be Googleable.

So I wrote a script which scrapes the private Mailman archive pages every hour and makes copies of them on a publicly viewable webserver. (In some cases I believe it’s possible to query the Mailman database directly, but I don’t have access to that, hence this slightly more clunky scraping solution.)

In addition to simply copying the files, the script can optionally perform any or all of these additional tasks:

Create an RSS feed of recent emails sent to the list.
Remove all email addresses from the pages, including those in Mailman’s obscured “phil at gyford.com” format.
Replace the “More info on this list” URL with an alternative.
Remove some or all of the levels of quoted emails included on the page. (Pretend Office’s public archive includes only a single level of quoting.)
Search and replace any custom strings you like. (For Pretend Office I stripped out everyone’s surnames.)
Add custom HTML into the <head> of every page (eg, for adding Google Analytics javascript).

This is all pretty flexible and does the job for me. It can scrape both public and private Mailman archives, assuming you have an email address and password that allows you access to the latter.
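
To make a few of those options concrete, here is a minimal Python 3 sketch of the address removal, quote trimming and search-and-replace steps. It is not the script's actual code: the function names, the regular expressions and the assumption that the email body has already been pulled out of the page as plain text (with `&gt;` unescaped to `>`) are all mine, for illustration only.

```python
import re

# Ordinary addresses such as phil@gyford.com (a deliberately simple pattern).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Mailman's obscured form, such as "phil at gyford.com".
OBSCURED_RE = re.compile(r"[\w.+-]+ at [\w-]+(?:\.[\w-]+)+")


def strip_emails(text, placeholder=""):
    """Replace plain and Mailman-obscured email addresses with placeholder."""
    text = EMAIL_RE.sub(placeholder, text)
    return OBSCURED_RE.sub(placeholder, text)


def quote_depth(line):
    """Count the leading '>' markers on one line of an email body."""
    depth = 0
    for char in line:
        if char == ">":
            depth += 1
        elif char not in " \t":
            break
    return depth


def trim_quotes(body, max_depth=1):
    """Drop every line that is quoted more deeply than max_depth."""
    kept = [line for line in body.splitlines() if quote_depth(line) <= max_depth]
    return "\n".join(kept)


def apply_replacements(text, replacements):
    """Apply each old -> new string replacement in turn."""
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text
```

One caveat with this approach: the obscured-address pattern will also swallow innocent phrases such as "look at example.com", so a real version would probably want to be stricter about where it looks.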

There are some less-than-ideal issues:

The script doesn’t save state between runs, so it has to fetch at least as many pages as you want entries in the RSS feed every time. But this isn’t the end of the world.
The RSS feed doesn’t include the full text of emails. I think PyRSS2Gen can be extended to allow this but my Python isn’t good enough to work out how — if you know how please do let me know so I can add it.
The scraping of the source pages may break. I’ve tried it on a couple of Mailman archives and it was fine but scraping is always fragile.

Maybe that’s all useful to someone, somewhere, someday.

It is my very first attempt at Python, so be gentle with me and, if you’re so inclined, tell me all the things I could have done better. I’ve also never used Git or GitHub before, both of which I found as baffling as most version control systems. Again, let me know if I’ve goofed somewhere. Thanks.

Mailman Archive Scraper at GitHub.

Comments

Mahesh at 30 Oct 2009, 5:11pm. Permalink

Hello,

Thanks for this information. I have a mailman on a hosting provider and I have another asp.net site on a different server. Can I use this to get the archives of the mailman and save it on a local path on a different server? I hope I am making sense.

I have your script running on the server and in place of publish_dir I have typed in C:/inetpub/wwwroot/Mailman

I see an error message of "No such file or directory"

Any help would be greatly appreciated.

Thank you.

Phil Gyford at 30 Oct 2009, 5:54pm. Permalink

It sounds like it will do what you want, but I don't know anything about Windows I'm afraid, so can't help with why it doesn't like the file path you've tried, sorry.

Mahesh at 4 Nov 2009, 4:23pm. Permalink

Thank you for your reply. I was able to set it up fine. The public archive works fine. However, when I try to get a private archive with the correct email and password, I get the following error:

File "C:\Python26\MailmanArchiveScraper\MailmanArchiveScraper.…", line 481, in <module>
main()
File "C:\Python26\MailmanArchiveScraper\MailmanArchiveScraper.…", line 477, in main
scraper.scrape()
File "C:\Python26\MailmanArchiveScraper\MailmanArchiveScraper.…", line 146, in scrape
self.logIn()
File "C:\Python26\MailmanArchiveScraper\MailmanArchiveScraper.…", line 244, in logIn
fp = mechanize.urlopen(form.click())
File "build\bdist.win32\egg\mechanize\_opener.py", line 420, in urlopen
File "build\bdist.win32\egg\mechanize\_opener.py", line 202, in open
File "build\bdist.win32\egg\mechanize\_http.py", line 612, in http_response
File "build\bdist.win32\egg\mechanize\_opener.py", line 225, in error
File "C:\Python26\lib\urllib2.py", line 367, in _call_chain
result = func(*args)
File "build\bdist.win32\egg\mechanize\_http.py", line 633, in http_error_default
urllib2.HTTPError: HTTP Error 404: Not Found

Phil Gyford at 4 Nov 2009, 4:34pm. Permalink

Judging by the "404: Not Found", my only guess is that the path I've set in the script for private Mailman archives isn't the same as what your Mailman site uses. Look for the line
```
self.list_url = 'http://' + self.domain + '/mailman/private/' + self.list_name
```
and see if it matches the site you're trying to access.

Mahesh at 4 Nov 2009, 4:44pm. Permalink

I appreciate your quick response. The path seems to be correct. The public lists are returned fine with the address of domain/pipermail/listname

If I browse directly to my private list, the address in the browser is domain/mailman/private/listname

Thanks

Phil Gyford at 4 Nov 2009, 4:47pm. Permalink

I would try and work out what URL it's requesting that is generating that 404 error.
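
As a rough way of doing that outside the script, the Python 3 sketch below builds the two archive URLs mentioned in this thread and reports the HTTP status each one returns. It uses the standard library rather than the mechanize and urllib2 calls in the traceback, and the function names are made up for illustration. Note that the 404 above came from submitting the login form, so the form's target address may also be worth checking.

```python
import urllib.error
import urllib.request


def archive_urls(domain, list_name):
    """Build the public and private archive URLs for a list."""
    return {
        "public": "http://" + domain + "/pipermail/" + list_name,
        "private": "http://" + domain + "/mailman/private/" + list_name,
    }


def check_url(url, timeout=10):
    """Return the HTTP status code for url, including errors such as 404."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises on 4xx and 5xx responses; the code is what we want.
        return error.code
```

Calling `check_url()` on each value from `archive_urls()`, with your own domain and list name filled in, should show which address is the one returning 404.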
