RSS-Enabling Your Buddy’s Non-RSS Enabled Blog
So as it turns out, I’m fully enamored with Google Reader, which allows me to read all sorts of RSS feeds. I waste time much more efficiently now. They’ve even got a version that’s optimized for your mobile device, so you can piss off your wife by reading blogs on your BlackBerry.
The problem with feed readers, though, is that once you start using them, you stop reading blogs that don’t provide feeds. After noticing that I was falling behind on Heck’s Kitchen (which is hand coded, and therefore does not offer a feed), I decided that the best solution available was to write a screen scraper that would convert the static HTML page to RSS. Thankfully, Jenny writes good HTML and uses CSS classes and ids, so scraping the page was easy.
The fruits of my labor are here: Heck’s Kitchen RSS Feed.
Full script after the jump.
As per usual, the hard part of writing a Perl script is finding and installing the correct modules via CPAN. Here, I’m using XML::RSS and HTML::TreeBuilder, both of which provide nice interfaces into structured documents.
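If you want to see which of the script’s modules are missing before you run it, something like this works as a minimal sketch. The helper name missing_modules is my own (hypothetical), it uses only core Perl, and the cpan command it prints assumes the standard cpan client is on your path.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# missing_modules is a hypothetical helper: it returns the names of the
# given modules that cannot be loaded on this machine.
sub missing_modules {
    my @missing;
    for my $module (@_) {
        # Turn Foo::Bar into Foo/Bar.pm so require can search @INC for it
        (my $file = "$module.pm") =~ s{::}{/}g;
        push @missing, $module unless eval { require $file; 1 };
    }
    return @missing;
}

# The four modules the scraper script loads
my @missing = missing_modules(qw(LWP XML::RSS Date::Format HTML::TreeBuilder));
if (@missing) {
    print "Missing modules; try: cpan @missing\n";
}
else {
    print "All modules are installed.\n";
}
```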
Finally, when pushing the HTML into XML::RSS, you’ve got to add your own CDATA tags; otherwise the markup gets entity-escaped and you end up with a lot of &lt;s in the feed.
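To show the CDATA wrapping on its own, here is a small sketch. The helper cdata_wrap is hypothetical (it is not part of XML::RSS), and it goes one step further than the script by splitting any literal "]]>" in the HTML, which would otherwise end the CDATA section early.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# cdata_wrap is a hypothetical helper, not part of XML::RSS.
sub cdata_wrap {
    my ($html) = @_;
    # A literal "]]>" would close the section too soon, so split it
    # across two adjacent CDATA sections.
    $html =~ s/\]\]>/]]]]><![CDATA[>/g;
    return '<![CDATA[' . $html . ']]>';
}

print cdata_wrap('<p>Soup &amp; salad</p>'), "\n";
```

The result is what you would pass as the description when adding an item to the feed.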
#!/usr/bin/perl
## (C) 2007 Dan Check
##
## This script is free software; you can redistribute it and/or modify
## it under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 2, or (at your option)
## any later version.
##
## This script is distributed in the hope that it will be useful,
## but WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
## GNU General Public License for more details.
##
## You may have received a copy of the GNU General Public License
## along with this script; see the file COPYING. If not, write to the
## Free Software Foundation, Inc., 59 Temple Place - Suite 330,
## Boston, MA 02111-1307, USA.
use strict;
use warnings;
use LWP;
use XML::RSS;
use Date::Format;
use HTML::TreeBuilder;
# Create an XML RSS object
my $rss = XML::RSS->new(version => '2.0');
$rss->channel(
    title          => "Heck's Kitchen",
    link           => 'http://jennymiller.com/',
    language       => 'en',
    description    => 'The dramatic lives of trapeze artists, a clown, and an elephant trainer against a background of circus spectacle.',
    copyright      => 'Copyright 2007, jennymiller.com',
    managingEditor => 'katspank@gmail.com',
    webMaster      => 'katspank@gmail.com',
);
$rss->add_module(prefix => 'content', uri => 'http://purl.org/my/rss/module/');
# pubDate => ‘Thu, 23 Aug 1999 07:00:00 GMT’,
# lastBuildDate => ‘Thu, 23 Aug 1999 16:20:26 GMT’,
# create a new browser
my $browser = LWP::UserAgent->new;
# Set browser headers
$browser->default_headers->push_header('User-Agent' => 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6');
$browser->default_headers->push_header('Accept' => 'text/xml,application/xml,application/xhtml+xml,text/html,text/plain,image/png,*/*');
$browser->default_headers->push_header('Accept-Language' => 'en-us,en');
$browser->default_headers->push_header('Accept-Charset' => 'ISO-8859-1,utf-8');
my $url = "http://www.jennymiller.com/";
# Create a cookie jar
$browser->cookie_jar({});
# Fetch the page, and bail out if the request fails
my $response = $browser->get($url);
unless ($response->is_success) {
    die "Couldn't get $url: ", $response->status_line, "\n";
}
# create a tree
my $tree = HTML::TreeBuilder->new;
$tree->parse($response->content);
$tree->eof;
my @out;
# We want divs with class "entry" or id "menu"
foreach my $item (
    $tree->look_down('_tag', 'div',
        sub {
            my $class = $_[0]->attr('class') || '';
            my $id    = $_[0]->attr('id')    || '';
            return $class eq 'entry' || $id eq 'menu';
        }
    )) {
    # Re-parse the matched div so lookups stay within this entry
    my $item_tree = HTML::TreeBuilder->new;
    $item_tree->parse($item->as_HTML);
    $item_tree->eof;
    # The first h3 holds the date; the first h1 holds the title
    my $date_tree  = $item_tree->look_down('_tag', 'h3');
    my $title_tree = $item_tree->look_down('_tag', 'h1');
    # Default the date to today
    my $date = time2str("%b %d, %Y", time);
    if ($date_tree) {
        $date = $date_tree->as_text;
    }
    # Default the title to LOOK, for the links section
    my $title = "LOOK!";
    if ($title_tree) {
        $title = $title_tree->as_text;
    }
    $rss->add_item(
        title       => $title,
        link        => "http://jennymiller.com/",
        guid        => $date,
        description => "<![CDATA[" . $item->as_HTML . "]]>",
    );
    $item_tree->delete;
}
$tree->delete;
$rss->{output} = '2.0';
$rss->save('./jloreview.com/hk/rss.xml');
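The commented-out pubDate and lastBuildDate lines in the script use the RFC 822 style date that RSS 2.0 expects. If you wanted to fill them in, a sketch along these lines would build one from an epoch time using only core Perl. The helper name rfc822_date is hypothetical, and the weekday and month names are hardcoded so the output does not depend on the machine's locale.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @DAYS   = qw(Sun Mon Tue Wed Thu Fri Sat);
my @MONTHS = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# rfc822_date is a hypothetical helper: it formats an epoch time as a
# GMT date string such as "Thu, 01 Jan 1970 00:00:00 GMT".
sub rfc822_date {
    my ($epoch) = @_;
    my ($sec, $min, $hour, $mday, $mon, $year, $wday) = gmtime($epoch);
    return sprintf('%s, %02d %s %04d %02d:%02d:%02d GMT',
        $DAYS[$wday], $mday, $MONTHS[$mon], $year + 1900,
        $hour, $min, $sec);
}

print rfc822_date(time), "\n";
```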
2 Comments so far
Damn it, you just made me realize that I’ve never released _anything_ under the GPL. Wait a minute, maybe I can license this _comment_ under the GPL.
In any event, you, sir, are a gentleman and a scholar.
You’re the best.