Jim Lawless' Blog


RSS feed processing with AWK

Originally published on: Fri, 01 May 2009 02:33:14 +0000

This post contains obsolete references.

Since the original publication of this post, I have moved to a different blog host. I have left some of the original post intact, but have corrected links to other posts to reflect their new addresses.


In an early attempt to process the syndicated data from this blog, I wrote a short AWK script using GNU AWK (gawk) to process the RSS (Really Simple Syndication) feed provided at the links at the bottom of this page.

The approach was a deviation from the more common-sense route of using a programming language with an XML parser (or better yet, an RSS parser) available, but my curiosity got the better of me; I briefly pondered how concisely I could write an AWK script to sift out the simple data that I needed.


BEGIN {FS="[<>]"}
$2=="title"||$2=="link"{print $2 " " $3}

Line one sets the AWK field separator to the regular expression [<>], which causes the angle-bracket characters prevalent in XML tags to act as delimiters.

The rule on the next line states that if the second field ($2) is either "title" or "link", AWK prints the second and third fields ($2 and $3 respectively) separated by a space.
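To see why the tag name lands in $2, it can help to print every field AWK produces for one typical feed line. The following is only a sketch: the sample line and the helper name show_fields are made up for illustration, and it calls plain awk on the assumption that it splits fields the same way gawk does here.

```shell
# Hypothetical helper: print each field awk sees when "<" and ">" are separators.
show_fields() {
    awk 'BEGIN { FS = "[<>]" }
         { for (i = 1; i <= NF; i++) printf("$%d=[%s]\n", i, $i) }'
}

# Made-up sample line in the shape of an RSS title element.
printf '%s\n' '<title>Envy</title>' | show_fields
```

Because the line begins with a separator, $1 comes out empty, $2 holds "title", and $3 holds "Envy". On an indented feed line the leading whitespace would occupy $1 instead, so the tag name would still arrive in $2.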

The output of the above script using my current RSS feed is as follows:


title Direct Threaded Daydreams
link http://www.mailsend-online.com/blog
title Direct Threaded Daydreams
link http://www.mailsend-online.com/blog
title PHP, Transparent GIF&#8217;s, and Web Tracking
link http://www.mailsend-online.com/blog?p=9
title Envy
link http://www.mailsend-online.com/blog?p=8
title A Quine in C
link http://www.mailsend-online.com/blog?p=7
title Stacking Images with PerlMagick
link http://www.mailsend-online.com/blog?p=6
title WSH2EXE part 2
link http://www.mailsend-online.com/blog?p=5
title WSH2EXE part 1
link http://www.mailsend-online.com/blog?p=4
title Cheating the LZW
link http://www.mailsend-online.com/blog?p=3
title E-mail cleansing
link http://www.mailsend-online.com/blog?p=2
title Obfuscated C
link http://www.mailsend-online.com/blog?p=1

Note that this only works because the XML markup in my RSS file doesn't contain CDATA sections and because I avoid XML entities. Under those conditions, I can get away with using a script similar to the above to extract the title and link of each post I've submitted, in order from the newest post to the oldest.
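To make the CDATA caveat concrete, here is a small sketch that feeds the same two-line rule a plain title and a CDATA-wrapped one. The sample lines and the helper name extract are made up for illustration, and plain awk is assumed to behave like gawk for this purpose.

```shell
# The two-line script from above, wrapped in a hypothetical helper function.
extract() {
    awk 'BEGIN { FS = "[<>]" }
         $2 == "title" || $2 == "link" { print $2 " " $3 }'
}

# Made-up sample lines: a plain title, then the same title in a CDATA section.
printf '%s\n' '<title>Envy</title>' '<title><![CDATA[Envy]]></title>' | extract
```

The first line yields "title Envy", but the second yields "title" followed by only a space. The "<" that opens the CDATA section comes right after the ">" of the opening tag, so $3 is empty and the actual text ends up in $4. An element whose opening and closing tags sit on different lines would trip up the script in a similar way.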

In a future post, we'll build a more sensible RSS feed processor that isn't dependent on the coincidental purity of the data in the title and link tags.

My immediate need was to build a script that would generate a web page that contained the list of posts and the links to each. We're just about there with the short AWK script.

The title and link to the blog itself end up appearing twice at the beginning of the markup, so my second script (rssparse.awk) uses a first-use flag variable named first to limit the output of that title and link to one.

The new script also stores the title when encountered and then outputs both the title and link when the link line is encountered.

The BEGIN and END special AWK rules set the initial flag and generate the HTML header and footer data.

To automate the process of downloading the RSS data, I use wget, the command-line HTTP retrieval utility.

Here are the two files that make up my brief little system to download my RSS feed and turn it into simple HTML:

my_feed.bat


@echo off
wget -O tmp.rss http://....address.no.longer.valid.../feed/
gawk -f rssparse.awk tmp.rss > bloglist.htm

rssparse.awk


# License: MIT / X11
# Copyright (c) 2009 by James K. Lawless
#
# Permission is hereby granted, free of charge, to any person
# obtaining a copy of this software and associated documentation
# files (the "Software"), to deal in the Software without
# restriction, including without limitation the rights to use,
# copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following
# conditions:
#
# The above copyright notice and this permission notice shall be
# included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
# OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
# HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
# WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
# OTHER DEALINGS IN THE SOFTWARE.
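BEGIN {
   # Split each line on angle brackets so the tag name lands in $2.
   FS="[<>]";
   # first stays 1 until the blog's own duplicated link has been skipped once.
   first=1;
   printf("<html><head><title>Blog Posts by Jim Lawless\n");
   printf("</title></head><body>\n");
}
# Remember the most recent title until its link line shows up.
$2=="title" {
   title=$3;
}
$2=="link"{
   # Skip the first link so the blog itself is listed only once.
   if(first==1) {
      first=0;
      next;
   }
   # Emit the title as a link, then the bare URL as a link.
   printf("<a href=\"%s\">%s</a><br>\n<a href=\"%s\">%s</a><p>\n",
      $3,title,$3,$3);
}
# Close the HTML document.
END {
   printf("</body></html>");
}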

The resulting simple HTML file, bloglist.htm, is as follows:


<html><head><title>Blog Posts by Jim Lawless
</title></head><body>
<a href="http://www.mailsend-online.com/blog">Direct Threaded Daydreams</a><br>
<a href="http://www.mailsend-online.com/blog">http://www.mailsend-online.com/blog</a><p>
<a href="http://www.mailsend-online.com/blog?p=9">PHP, Transparent GIF&#8217;s, and Web Tracking</a><br>
<a href="http://www.mailsend-online.com/blog?p=9">http://www.mailsend-online.com/blog?p=9</a><p>
<a href="http://www.mailsend-online.com/blog?p=8">Envy</a><br>
<a href="http://www.mailsend-online.com/blog?p=8">http://www.mailsend-online.com/blog?p=8</a><p>
<a href="http://www.mailsend-online.com/blog?p=7">A Quine in C</a><br>
<a href="http://www.mailsend-online.com/blog?p=7">http://www.mailsend-online.com/blog?p=7</a><p>
<a href="http://www.mailsend-online.com/blog?p=6">Stacking Images with PerlMagick</a><br>
<a href="http://www.mailsend-online.com/blog?p=6">http://www.mailsend-online.com/blog?p=6</a><p>
<a href="http://www.mailsend-online.com/blog?p=5">WSH2EXE part 2</a><br>
<a href="http://www.mailsend-online.com/blog?p=5">http://www.mailsend-online.com/blog?p=5</a><p>
<a href="http://www.mailsend-online.com/blog?p=4">WSH2EXE part 1</a><br>
<a href="http://www.mailsend-online.com/blog?p=4">http://www.mailsend-online.com/blog?p=4</a><p>
<a href="http://www.mailsend-online.com/blog?p=3">Cheating the LZW</a><br>
<a href="http://www.mailsend-online.com/blog?p=3">http://www.mailsend-online.com/blog?p=3</a><p>
<a href="http://www.mailsend-online.com/blog?p=2">E-mail cleansing</a><br>
<a href="http://www.mailsend-online.com/blog?p=2">http://www.mailsend-online.com/blog?p=2</a><p>
<a href="http://www.mailsend-online.com/blog?p=1">Obfuscated C</a><br>
<a href="http://www.mailsend-online.com/blog?p=1">http://www.mailsend-online.com/blog?p=1</a><p>
</body></html>

In the coming weeks, I will probably write another AWK script based on this one to build a small block of three or four random links from the blog to include on every new blog post.
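As a rough idea of how such a random-link block could be produced, here is one possible sketch; it is not the script the post promises. It assumes an RSS 2.0 layout in which every post sits between <item> and </item> lines of their own, and the helper name random_links, the sample feed, and the example.com addresses are all made up for illustration.

```shell
# Hypothetical helper: print up to N randomly chosen post links from an RSS file.
# Usage: random_links N FILE
random_links() {
    awk -v want="$1" '
        BEGIN { FS = "[<>]"; srand(); want += 0 }
        # Assumption: <item> and </item> each sit on their own line.
        $2 == "item"  { in_item = 1 }
        $2 == "/item" { in_item = 0 }
        in_item && $2 == "title" { title = $3 }
        in_item && $2 == "link"  { n++; titles[n] = title; links[n] = $3 }
        END {
            if (want > n) want = n
            # Partial Fisher-Yates shuffle: each pass fixes one random pick.
            for (i = 1; i <= want; i++) {
                j = i + int(rand() * (n - i + 1))
                t = titles[i]; titles[i] = titles[j]; titles[j] = t
                t = links[i];  links[i]  = links[j];  links[j]  = t
                printf("<a href=\"%s\">%s</a><br>\n", links[i], titles[i])
            }
        }' "$2"
}

# Made-up sample feed with four posts.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<channel>
<title>Example Blog</title>
<link>http://example.com/blog</link>
<item>
<title>Post One</title>
<link>http://example.com/blog?p=1</link>
</item>
<item>
<title>Post Two</title>
<link>http://example.com/blog?p=2</link>
</item>
<item>
<title>Post Three</title>
<link>http://example.com/blog?p=3</link>
</item>
<item>
<title>Post Four</title>
<link>http://example.com/blog?p=4</link>
</item>
</channel>
EOF

random_links 3 "$sample"
```

Restricting the rules to lines inside an item sidesteps the duplicated blog title and link, so no first-use flag is needed here. Since srand() seeds from the clock, two runs within the same second would likely pick the same posts.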

Unless otherwise noted, all code and text entries are Copyright ©2009 by James K. Lawless
