Jim Lawless' Blog


Converting Data to XML with AWK

Originally published on: Mon, 10 Jan 2011.

From time to time, friends of mine will need to convert some data they've retrieved from a SQL query into XML. I usually use the need as an opportunity to teach them a little about the AWK programming language.

For a little about my background with AWK, see the post Along Came AWK.

Generally, the data in question has been captured in some sort of a file where each field is delimited by a single character that cannot appear in the field data. Often, it's the TAB character ( ASCII 9 ). I like to use the pipe-symbol (|) as a delimiter. Sometimes, TABs are converted to spaces if multiple people have handled the file.

Suppose that we have a set of data in a file called sample.txt that contains a musical artist's name, an album/CD name, and the genre for that type of music:

sample.txt

Parsing this data with AWK is a relatively simple task. Most AWK aficionados would probably take care of it with a one-liner. For the sake of discussion, let's actually write out a formal script to handle the task.
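For comparison, the one-liner form might look something like the sketch below. It assumes a POSIX shell with an awk on the PATH; the two records and the output format are made up for illustration.

```shell
cd "$(mktemp -d)" || exit 1   # scratch directory so nothing is overwritten

# Hypothetical pipe-delimited sample data: artist|album|genre
printf '%s\n' 'Miles Davis|Kind of Blue|Jazz' \
              'Johnny Cash|At Folsom Prison|Country' > sample.txt

# -F sets the field separator on the command line; the action has no
# pattern in front of it, so it runs for every input line.
oneliner_out=$(awk -F'|' '{ printf("%s / %s / %s\n", $1, $2, $3) }' sample.txt)
printf '%s\n' "$oneliner_out"
```

The quoting shown here is for a Unix-style shell; a Windows command-prompt quotes the program text differently.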

try1.awk

In order to execute the above script, I'm going to use GAWK ( GNU AWK ) from a Windows command-prompt. Most Linux distributions already have an AWK interpreter readily available. If you're running Linux or another Unix, you might experiment by replacing gawk with awk or nawk.

We can execute the try1.awk script by using the following command-line:

The output of the above command should look like the following:
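A minimal end-to-end sketch of the steps above, assuming a POSIX shell with an awk on the PATH. The records in sample.txt are made up for illustration, and the script body is an approximation built from the description in this post rather than a verbatim listing.

```shell
cd "$(mktemp -d)" || exit 1   # scratch directory so nothing is overwritten

# Hypothetical sample data; the blank second line is deliberate.
cat > sample.txt <<'EOF'
Miles Davis|Kind of Blue|Jazz

Johnny Cash|At Folsom Prison|Country
Pink Floyd|The Wall|Rock
EOF

# Approximation of try1.awk: a BEGIN action plus one empty-pattern action.
cat > try1.awk <<'EOF'
BEGIN {
    FS = "|"    # split each input line on the pipe symbol
}
{
    printf("Artist: %s\n", $1)
    printf("Album: %s\n", $2)
    printf("Genre: %s\n", $3)
}
EOF

# Substitute gawk or nawk for awk if that is what is installed.
try1_out=$(awk -f try1.awk sample.txt)
printf '%s\n' "$try1_out"
```

With this data the script prints twelve lines: three per input line, including three label-only lines for the blank record.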

How AWK Processes Files

AWK scripts are specified by writing pattern/action combinations. In the script try1.awk we have specified two patterns: the BEGIN pattern and the empty pattern.

When AWK initially executes a script against a file or series of files, it first raises a BEGIN event. If the script has specified BEGIN with a body of script code bounded by curly-braces, the code within those curly-braces will execute before anything else occurs.

An END pattern can be specified to handle the END event. The END event condition is raised when all files have been processed.
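The order of the BEGIN action, the per-line action, and the END action can be seen in a small sketch like this one. It assumes a POSIX shell; the input lines and the count variable are made up for illustration.

```shell
# Three input lines arrive on standard input.
events_out=$(printf 'one\ntwo\nthree\n' | awk '
BEGIN { print "BEGIN: no input read yet" }
      { count++ }                          # no pattern: runs once per line
END   { print "END: saw " count " lines" }
')
printf '%s\n' "$events_out"
```

The BEGIN message appears first and the END message appears only after all three lines have been counted.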

The empty pattern simply appears to be a block of code bounded by curly-braces. Each line from every specified input text file will then be processed by the code in that set of curly-braces. When any event other than BEGIN or END is processed, the current input line and state of the current input file are broken into a series of special variables. Some special AWK variables can be set to alter the way that AWK processes input. A sample of the more common variables appears below:

FS  The field-separator character or regular expression that AWK uses to split the current input line into fields. This variable can be set in the action paired with the BEGIN pattern.
RS  The record-separator character that AWK uses to split the input data into records; GAWK and some other implementations also accept a regular expression here. This variable can be set in the action paired with the BEGIN pattern.
FILENAME  The name of the file that AWK is currently processing. Because multiple files can be specified after the name of the script on an AWK command-line, a script can check this variable to handle special, per-file processing.
$0  The entire unparsed input line of text.
$1,$2,$3,$n ...etc.  $1 is the first field parsed from the data based on the delimiter held in FS. $2 is the second field, $3 is the third, ...etc. The dollar-sign may be thought of as an operator, because an expression can appear to its right that selects a field by number: $(i+1), where the variable "i" has a value of four, is equivalent to $5.
NF  The number of fields parsed out of the current line.
NR  The number of records read so far by the AWK processor. Note that this count does not reset when additional files are processed, for example when more files are specified after sample.txt on our command-line.
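Several of these variables can be seen working together in a short sketch. It assumes a POSIX shell; the file names a.txt and b.txt and their contents are hypothetical.

```shell
cd "$(mktemp -d)" || exit 1   # scratch directory so nothing is overwritten
printf '%s\n' 'Miles Davis|Kind of Blue|Jazz' > a.txt
printf '%s\n' 'Pink Floyd|The Wall|Rock'      > b.txt

vars_out=$(awk '
BEGIN { FS = "|" }
{
    i = 1
    # $(i+1) selects the same field as $2 here because i is 1.
    printf("%s record %d has %d fields; second field: %s\n",
           FILENAME, NR, NF, $(i+1))
}
' a.txt b.txt)
printf '%s\n' "$vars_out"
```

NR reaches 2 on the line from b.txt, which shows that the record count carries over from one file to the next.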
  

How Our try1.awk Script Works

Before processing any files, the action paired with the BEGIN pattern is invoked, setting the field-separator variable FS to the pipe-symbol.

Then, any line of input causes three lines of output. One line will appear with a label and a value for each type of field we're trying to parse ... artist, album name, and genre.

The try1.awk script uses the AWK printf() function borrowed from the C programming-language for output. Windows users should note that they might want to end the format-string with \r\n instead of just \n to ensure that a carriage-return/linefeed combination is generated.

Many AWK examples simply use the built-in print command for output. Usage of printf() is a personal preference. You're welcome to change the code to use print.
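As a rough comparison of the two styles, the sketch below prints the same line once with printf() and once with print. It assumes a POSIX shell, and the input record is made up.

```shell
print_out=$(printf 'Miles Davis|Kind of Blue|Jazz\n' | awk -F'|' '
{
    printf("Artist: %s\n", $1)   # printf: explicit format string and newline
    print "Artist: " $1          # print: string concatenation, newline added automatically
}')
printf '%s\n' "$print_out"
```

Both statements produce the same text, so the choice between them is mostly a matter of taste and of how much formatting control is needed.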

Please note that after the first line of data in the sample text file, I have included a blank line.

If you examine the output of the script, you'll note that the blank line was processed as though the fields were blank:

The variables $1,$2, and $3 are expected to be present in the try1.awk script. If there's no data, the script does not abnormally terminate; it generates a set of output that should probably be cleaned up. Let's change the AWK script to ensure that we handle only the appropriate lines.

try2.awk

After invoking the above AWK script, our output should look like the following:
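A sketch of what such a script could look like, assuming a POSIX shell. The sample records and the wording of the error message are made up, and the script is an approximation based on the description in this post rather than a verbatim listing.

```shell
cd "$(mktemp -d)" || exit 1   # scratch directory so nothing is overwritten

# Hypothetical sample data with a blank second line.
cat > sample.txt <<'EOF'
Miles Davis|Kind of Blue|Jazz

Johnny Cash|At Folsom Prison|Country
Pink Floyd|The Wall|Rock
EOF

cat > try2.awk <<'EOF'
BEGIN { FS = "|" }

# Exactly three fields: print the record.
NF == 3 {
    printf("Artist: %s\n", $1)
    printf("Album: %s\n", $2)
    printf("Genre: %s\n", $3)
}

# Anything else: report the line number instead.
NF != 3 {
    printf("Error on line %d: expected 3 fields, found %d\n", NR, NF)
}
EOF

try2_out=$(awk -f try2.awk sample.txt)
printf '%s\n' "$try2_out"
```

With this data the blank line produces a single error line in place of the three label-only lines.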

Some new concepts are introduced in try2.awk. Instead of using an empty pattern, we have specified two expression patterns: NF==3 and NF!=3.

In addition to the empty pattern and the special patterns BEGIN and END, one can specify any AWK expression, which is evaluated against the current input record. In the script above, we display the artist, album, and genre only if there are exactly three fields ( where NF is equal to 3 ). Otherwise, we issue an error message if NF is not exactly three. Note that we now have an error message in the output. Ideally, we should write the error message to the stderr device, but this is done differently on different operating-systems, so I'd rather leave that as an exercise for the reader at this time.
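For readers on Linux or another Unix, one possible answer to the stderr exercise is to pipe the message through cat with its output pointed at stderr. This is only a sketch under that assumption; it does not apply as-is to a Windows command-prompt, and the sample data and message text are made up.

```shell
cd "$(mktemp -d)" || exit 1   # scratch directory so nothing is overwritten
printf '%s\n' 'Miles Davis|Kind of Blue|Jazz' '' > sample.txt   # second line is blank

awk '
BEGIN { FS = "|" }
NF == 3 { printf("Artist: %s\n", $1) }
# Send the message to a cat process whose stdout is redirected to stderr.
NF != 3 { printf("Error on line %d\n", NR) | "cat 1>&2" }
' sample.txt > out.txt 2> err.txt

cat out.txt   # normal output only
cat err.txt   # the error message
```

GAWK also accepts the special file name "/dev/stderr" as a redirection target, which avoids the extra cat process.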

Skipping Blank Lines

We're going to make one more change to the series of scripts so that blank lines will simply be ignored.

try3.awk

After invoking the above AWK script, our output should look like the following:

In try3.awk we introduce a new expression pattern NF==0. If there are no fields on the line, we invoke an action that uses the single AWK verb next. The next verb halts the processing of the current line and causes AWK to read the next one and start processing at the top of the script again. Note that we could change the NF!=3 pattern in this script to be the empty pattern. If the first two patterns aren't matched, the number of fields will definitely be something other than three. However, NF!=3 is more explicit, so we'll leave it.
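A sketch of the three-rule arrangement described above, assuming a POSIX shell. The sample data is made up, and it includes one hypothetical two-field line so that the NF!=3 rule still has something to report.

```shell
cd "$(mktemp -d)" || exit 1   # scratch directory so nothing is overwritten

# Hypothetical data: line 2 is blank, line 3 is missing its genre.
cat > sample.txt <<'EOF'
Miles Davis|Kind of Blue|Jazz

Johnny Cash|At Folsom Prison
Pink Floyd|The Wall|Rock
EOF

cat > try3.awk <<'EOF'
BEGIN { FS = "|" }

# Blank line: stop here and move on to the next input line.
NF == 0 { next }

NF == 3 {
    printf("Artist: %s\n", $1)
    printf("Album: %s\n", $2)
    printf("Genre: %s\n", $3)
}

# Only reached for non-blank lines with the wrong number of fields.
NF != 3 {
    printf("Error on line %d: expected 3 fields, found %d\n", NR, NF)
}
EOF

try3_out=$(awk -f try3.awk sample.txt)
printf '%s\n' "$try3_out"
```

The blank line no longer produces any output, while NR still counts it, so the malformed record is reported as line 3.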

Turning the Data into XML

try4.awk

To save the contents of the above script's output to a file, we can redirect the output from the command-prompt with the following command-line:

After invoking the above AWK script, the file tmp.xml should look like the following:

Notice that in the new try4.awk script we expand on the number of lines in the action paired with the BEGIN pattern and we add an END pattern/action pair so that we can create the header and footer tags and such for our XML file.

Changes were then made to the printf() calls so that each field would be bounded by XML tags.
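A sketch of how the BEGIN, per-record, and END actions could fit together to emit XML, assuming a POSIX shell. The element names (albums, album, artist, name, genre) and the sample data are hypothetical, the error rule is left out for brevity, and the sketch assumes the data contains no characters such as & or < that XML would require escaping.

```shell
cd "$(mktemp -d)" || exit 1   # scratch directory so nothing is overwritten

cat > sample.txt <<'EOF'
Miles Davis|Kind of Blue|Jazz

Johnny Cash|At Folsom Prison|Country
Pink Floyd|The Wall|Rock
EOF

cat > try4.awk <<'EOF'
BEGIN {
    FS = "|"
    printf("<?xml version=\"1.0\"?>\n")   # header, written once
    printf("<albums>\n")
}

NF == 0 { next }                          # skip blank lines

NF == 3 {
    printf("  <album>\n")
    printf("    <artist>%s</artist>\n", $1)
    printf("    <name>%s</name>\n", $2)
    printf("    <genre>%s</genre>\n", $3)
    printf("  </album>\n")
}

END {
    printf("</albums>\n")                 # footer, written once
}
EOF

# Shell redirection saves the output to tmp.xml.
awk -f try4.awk sample.txt > tmp.xml
cat tmp.xml
```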

AWK does provide a facility for writing to files without redirecting from the command-shell, but the syntax is similar to shell redirection. We could have altered the first call to printf() to create a new text file called tmp.xml:

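We then would have changed all of the subsequent calls to printf() to use a redirection symbol with two greater-than symbols. (Strictly, AWK truncates a file only when a single greater-than redirection first opens it during a run, so later single greater-than writes to the same name also add to the end.) The double symbol makes the intent to append explicit: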
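A small sketch of both redirection forms inside an AWK program, assuming a POSIX shell. The three lines written here are placeholders rather than the real script output.

```shell
cd "$(mktemp -d)" || exit 1   # scratch directory so nothing is overwritten

# A program with only a BEGIN action reads no input files.
awk '
BEGIN {
    printf("<albums>\n")   >  "tmp.xml"   # first write: creates or truncates tmp.xml
    printf("  <album/>\n") >> "tmp.xml"   # later writes: append to the same file
    printf("</albums>\n")  >> "tmp.xml"
}'

cat tmp.xml
```

Keeping the parentheses around the printf() arguments avoids any ambiguity between the redirection symbol and a greater-than comparison.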

For more handy info on using AWK as a utility language, please see Greg Grothaus' post Why you should know just a little Awk.

Unless otherwise noted, all code and text entries are Copyright 2011 by James K. Lawless. The AWK code in this post is covered by the MIT/X11 open source license.



Views expressed in this blog are those of the author and do not necessarily reflect those of the author's employer. Views expressed in the comments are those of the responding individual.




