Fun with screen-scraping
Today for a client I had to grab last year's newsletters from Intellicontact and post them as static HTML files. The specific mailings to archive were given to me as row numbers on an Intellicontact list view, so the first thing to do was translate those row numbers to Intellicontact message IDs. BeautifulSoup to the rescue! I manually saved the listing page (actually there were two), and then used this basic recipe to generate a CSV with message IDs and post dates.
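The recipe itself isn't reproduced here, so what follows is only a minimal sketch of the row-number-to-message-ID step. To keep it runnable without installing anything, it swaps BeautifulSoup for the standard library's html.parser. The table markup, the msgid query parameter, and the names SAMPLE_LISTING, ListingParser, and listing_to_csv are all made up for illustration; the real Intellicontact listing page will look different.

```python
import csv
import io
import re
from html.parser import HTMLParser

# Hypothetical markup: a stand-in for the manually saved listing page.
SAMPLE_LISTING = """
<table>
  <tr><td><a href="edit.php?msgid=101">January news</a></td><td>2007-01-15</td></tr>
  <tr><td><a href="edit.php?msgid=102">February news</a></td><td>2007-02-12</td></tr>
  <tr><td><a href="edit.php?msgid=105">March news</a></td><td>2007-03-14</td></tr>
</table>
"""

MSGID_RE = re.compile(r"msgid=(\d+)")          # assumed link format
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")   # assumed date format


class ListingParser(HTMLParser):
    """Collect one (message_id, post_date) pair per table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._msgid = None
        self._date = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row starts: forget whatever the previous row held.
            self._msgid = self._date = None
        elif tag == "a":
            match = MSGID_RE.search(dict(attrs).get("href") or "")
            if match:
                self._msgid = match.group(1)

    def handle_data(self, data):
        text = data.strip()
        if DATE_RE.match(text):
            self._date = text

    def handle_endtag(self, tag):
        # Only keep rows where both an ID and a date were found.
        if tag == "tr" and self._msgid and self._date:
            self.rows.append((self._msgid, self._date))


def listing_to_csv(html, wanted_rows):
    """Return CSV text for the 1-based row numbers in wanted_rows."""
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["row", "message_id", "post_date"])
    for number, (msgid, date) in enumerate(parser.rows, start=1):
        if number in wanted_rows:
            writer.writerow([number, msgid, date])
    return out.getvalue()


CSV_TEXT = listing_to_csv(SAMPLE_LISTING, {1, 3})
print(CSV_TEXT)
```

With BeautifulSoup the parser class would shrink to a few lines of tag lookups, but the shape of the job is the same: walk the rows in order, pull the ID out of each link, and keep only the row numbers the client asked for.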
The next step was to download the edit view for each message and extract the newsletter content from it. I started with urllib, but I needed cookies to stay authenticated with Intellicontact. So then I tried twill. It looked promising at first, but I ended up getting a 406 Not Acceptable response from the edit page. Maybe this was because twill doesn't support JavaScript? Third try was the charm: pywinauto! The distribution includes an example script for saving a web page with Firefox, and that's what I used as my starting point for this script.
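For the record, here is a rough sketch of what the cookie handling might look like with today's standard library (urllib.request plus http.cookiejar in Python 3, not the urllib I started with). It is not what I ended up running. The helper names, the msgid parameter, and the header values are assumptions, and the login step is left out because it depends entirely on the site. The explicit Accept header is only something to try, since a 406 is tied to content negotiation; I have no idea whether it would have helped here.

```python
import http.cookiejar
import urllib.request


def make_session_opener():
    """Build an opener that stores cookies and sends them back on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # Assumed header values; replaces urllib's default User-Agent.
    opener.addheaders = [
        ("User-Agent", "Mozilla/5.0"),
        ("Accept", "text/html,*/*;q=0.8"),
    ]
    return opener, jar


def fetch_edit_view(opener, base_url, message_id):
    """Fetch one edit page; base_url and the msgid parameter are hypothetical."""
    url = f"{base_url}?msgid={message_id}"
    with opener.open(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Usage (not run here, since it needs a real site and a logged-in session):
#   opener, jar = make_session_opener()
#   ... log in through opener.open(...) so the session cookie lands in jar ...
#   html = fetch_edit_view(opener, "https://example.com/edit.php", "101")
```

Because every request goes through the same opener, whatever session cookie the login response sets is replayed automatically on the edit-view requests.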
This was my first time with twill as well as pywinauto. Both look like useful tools. I also learned from the BeautifulSoup docs about a practical Unicode how-to that I'm looking forward to reading (I still don't have that zen). And after all the recent work on Aspen and Dewey--where the tests pass and the APIs are clean--I was reminded today that what people pay you for is to get dirty!
1 comment:
I use httplib2 for screen scraping. It is lower level than urllib, so you have access to headers and cookies and all that fun stuff.
-Corey
www.goldb.org