Jim Lawless' Blog


RSS feed processing with AWK

Originally published on: Fri, 01 May 2009 02:33:14 +0000

This post contains obsolete references.

Since the original publication of this post, I have moved to a different blog host. I have left some of the original post intact, but have corrected links to other posts to reflect their new addresses.


In an early attempt to process the syndicated data from this blog, I wrote a short AWK script using GNU AWK (gawk) to process the RSS (Really Simple Syndication) feed provided at the links at the bottom of this page.

The approach was a deviation from the more common-sense route of using a programming language with an XML parser (or better yet, an RSS parser) available, but my curiosity got the better of me. I briefly pondered how concisely I could write an AWK script to sift out the simple data that I needed.


BEGIN {FS="[<>]"}
$2=="title"||$2=="link"{print $2 " " $3}

Line one sets the AWK field-separator to the regular expression [<>] which causes the angle-bracket characters prevalent in XML tags to act as delimiter characters.

The rule on the next line states that if the second field ($2) is either "title" or "link", AWK prints the second and third fields ($2 and $3 respectively) separated by a space.
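
To see why $2 and $3 hold the tag name and its text, it can help to print every field AWK produces for a single line. The following is only a small sketch, wrapped in a shell function so it can be run from a Unix-like command line; the function name show_fields and the sample input line are made up for illustration and are not part of the original scripts.

```sh
#!/bin/sh
# show_fields: hypothetical helper (not from the original post) that
# prints each field AWK sees when angle brackets are the separator.
show_fields() {
    awk 'BEGIN { FS = "[<>]" }
         { for (i = 1; i <= NF; i++) printf("$%d=[%s]\n", i, $i) }'
}

# Illustrative input line, modeled on the title tags in the feed.
printf '%s\n' '<title>Envy</title>' | show_fields
```

For the sample line, the text before the first angle bracket becomes an empty $1, so the tag name lands in $2, the text Envy in $3, and the closing tag name /title in $4.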

The output of the above script using my current RSS feed is as follows:


title Direct Threaded Daydreams
link http://www.mailsend-online.com/blog
title Direct Threaded Daydreams
link http://www.mailsend-online.com/blog
title PHP, Transparent GIF&#8217;s, and Web Tracking
link http://www.mailsend-online.com/blog?p=9
title Envy
link http://www.mailsend-online.com/blog?p=8
title A Quine in C
link http://www.mailsend-online.com/blog?p=7
title Stacking Images with PerlMagick
link http://www.mailsend-online.com/blog?p=6
title WSH2EXE part 2
link http://www.mailsend-online.com/blog?p=5
title WSH2EXE part 1
link http://www.mailsend-online.com/blog?p=4
title Cheating the LZW
link http://www.mailsend-online.com/blog?p=3
title E-mail cleansing
link http://www.mailsend-online.com/blog?p=2
title Obfuscated C
link http://www.mailsend-online.com/blog?p=1

Note that the XML markup in my RSS file doesn't contain CDATA sections, and I avoid XML entities in my titles. Because of that, I can get away with a script like the one above to extract the title and link of each post I've submitted, in order from the newest post to the oldest post.
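
To illustrate the CDATA caveat: if a feed did wrap its titles in CDATA, the extra angle brackets would shift the text out of $3. One possible workaround, sketched below under that assumption, is to strip the CDATA wrapper before the rule runs. The function name strip_cdata_titles and the sample line are hypothetical, and this remains a line-oriented hack rather than real XML parsing; a title that itself contains angle brackets would still be split incorrectly.

```sh
#!/bin/sh
# strip_cdata_titles: hypothetical variant of the two-line script that
# removes CDATA wrappers before the title/link rule is evaluated.
strip_cdata_titles() {
    awk 'BEGIN { FS = "[<>]" }
         {
             # Modifying $0 makes AWK re-split the line into fields.
             gsub(/<!\[CDATA\[/, "")
             gsub(/\]\]>/, "")
         }
         $2 == "title" || $2 == "link" { print $2 " " $3 }'
}

# Illustrative input line; my own feed does not use CDATA.
printf '%s\n' '<title><![CDATA[Envy]]></title>' | strip_cdata_titles
```

Lines without a CDATA wrapper pass through the two gsub calls unchanged, so the sketch behaves like the original script on plain title and link lines.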

In a future post, we'll build a more sensible RSS feed processor that isn't dependent on the coincidental purity of the data in the title and link tags.

My immediate need was to build a script that would generate a web page that contained the list of posts and the links to each. We're just about there with the short AWK script.

The title and link to the blog itself end up appearing twice at the beginning of the markup, so my second script (rssparse.awk) sets a first-use flag variable named first to limit the output of that title and link to one.

The new script also stores the title when encountered and then outputs both the title and link when the link line is encountered.

The special BEGIN rule sets the initial flag and generates the HTML header; the END rule generates the HTML footer.

To automate the process of downloading the RSS data, I use wget, the command-line HTTP retrieval utility.

Here are the two files that make up my brief little system to download my RSS feed and turn it into simple HTML:

my_feed.bat


@echo off
wget -O tmp.rss http://....address.no.longer.valid.../feed/
gawk -f rssparse.awk tmp.rss > bloglist.htm

rssparse.awk


# License: MIT / X11
# Copyright (c) 2009 by James K. Lawless
#
# Permission is hereby granted, free of charge, to any person
# obtaining a copy of this software and associated documentation
# files (the "Software"), to deal in the Software without
# restriction, including without limitation the rights to use,
# copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following
# conditions:
#
# The above copyright notice and this permission notice shall be
# included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
# OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
# HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
# WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
# OTHER DEALINGS IN THE SOFTWARE.
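BEGIN {
   # Split each line on angle brackets so that $2 is the tag name
   # and $3 is the text that follows it.
   FS="[<>]";
   first=1;
   printf("<html><head><title>Blog Posts by Jim Lawless\n");
   printf("</title></head><body>\n");
}
# Remember the most recent title until its link line arrives.
$2=="title" {
   title=$3;
}
$2=="link"{
   # The blog's own link appears twice; skip the first occurrence.
   if(first==1) {
      first=0;
      next;
   }
   # Emit the title as a link, then the bare URL as a link.
   printf("<a href=\"%s\">%s</a><br>\n<a href=\"%s\">%s</a><p>\n",
      $3,title,$3,$3);
}
END {
   printf("</body></html>");
}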

The resulting simple HTML bloglist.htm is as follows:


<html><head><title>Blog Posts by Jim Lawless
</title></head><body>
<a href="http://www.mailsend-online.com/blog">Direct Threaded Daydreams</a><br>
<a href="http://www.mailsend-online.com/blog">http://www.mailsend-online.com/blog</a><p>
<a href="http://www.mailsend-online.com/blog?p=9">PHP, Transparent GIF&#8217;s, and Web Tracking</a><br>
<a href="http://www.mailsend-online.com/blog?p=9">http://www.mailsend-online.com/blog?p=9</a><p>
<a href="http://www.mailsend-online.com/blog?p=8">Envy</a><br>
<a href="http://www.mailsend-online.com/blog?p=8">http://www.mailsend-online.com/blog?p=8</a><p>
<a href="http://www.mailsend-online.com/blog?p=7">A Quine in C</a><br>
<a href="http://www.mailsend-online.com/blog?p=7">http://www.mailsend-online.com/blog?p=7</a><p>
<a href="http://www.mailsend-online.com/blog?p=6">Stacking Images with PerlMagick</a><br>
<a href="http://www.mailsend-online.com/blog?p=6">http://www.mailsend-online.com/blog?p=6</a><p>
<a href="http://www.mailsend-online.com/blog?p=5">WSH2EXE part 2</a><br>
<a href="http://www.mailsend-online.com/blog?p=5">http://www.mailsend-online.com/blog?p=5</a><p>
<a href="http://www.mailsend-online.com/blog?p=4">WSH2EXE part 1</a><br>
<a href="http://www.mailsend-online.com/blog?p=4">http://www.mailsend-online.com/blog?p=4</a><p>
<a href="http://www.mailsend-online.com/blog?p=3">Cheating the LZW</a><br>
<a href="http://www.mailsend-online.com/blog?p=3">http://www.mailsend-online.com/blog?p=3</a><p>
<a href="http://www.mailsend-online.com/blog?p=2">E-mail cleansing</a><br>
<a href="http://www.mailsend-online.com/blog?p=2">http://www.mailsend-online.com/blog?p=2</a><p>
<a href="http://www.mailsend-online.com/blog?p=1">Obfuscated C</a><br>
<a href="http://www.mailsend-online.com/blog?p=1">http://www.mailsend-online.com/blog?p=1</a><p>
</body></html>

I will probably make a similar AWK script based on this one in the coming weeks that will build a small block of three or four random links from the blog to include on every new blog post.

Unless otherwise noted, all code and text entries are Copyright ©2009 by James K. Lawless



Views expressed in this blog are those of the author and do not necessarily reflect those of the author's employer. Views expressed in the comments are those of the responding individual.
