Originally published on: Mon, 10 Jan 2011.
From time to time, friends of mine will need to convert some data they've retrieved from a SQL query into XML. I usually use the need as an opportunity to teach them a little about the AWK programming language.
For a little about my background with AWK, see the post Along Came AWK.
Generally, the data in question has been captured in a file where each field is delimited by a single character that cannot appear in the field data. Often, it's the TAB character (ASCII 9). I like to use the pipe symbol (|) as a delimiter, because TABs are sometimes converted to spaces when multiple people have handled the file.
Suppose that we have a set of data in a file called sample.txt in which each line contains a musical artist's name, an album/CD name, and the genre for that type of music, with the fields separated by the pipe symbol.
Parsing this data with AWK is a relatively simple task. Most AWK aficionados would probably take care of it with a one-liner. For the sake of discussion, let's actually write out a formal script to handle the task.
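As a rough sketch of what such a script might look like, the following creates a small pipe-delimited data file and a three-line-per-record AWK script, then runs it. The file names sketch-data.txt and sketch1.awk and the placeholder records are hypothetical, chosen only for illustration, and the example assumes a POSIX shell with an awk on the PATH. On Windows, the same AWK program could be saved to a file and run with gawk -f.

```sh
# Hypothetical sample data: a record, a blank line, then another record.
printf '%s\n' 'Artist A|Album A|Jazz' '' 'Artist B|Album B|Country' > sketch-data.txt

# A formal script: BEGIN sets the delimiter, the empty pattern handles every line.
cat > sketch1.awk <<'EOF'
BEGIN { FS = "|" }
{
    printf("Artist: %s\n", $1)
    printf("Album: %s\n", $2)
    printf("Genre: %s\n", $3)
}
EOF

awk -f sketch1.awk sketch-data.txt
```

Every input line, including the blank one, produces three lines of output in this sketch.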
In order to execute the above script, I'm going to use GAWK (GNU AWK) from a Windows command-prompt. Most Linux distributions already have an AWK interpreter readily available. If you're running Linux or another Unix, you might experiment by replacing gawk with awk or nawk.
We can execute the try1.awk script by using the following command-line:
The output of the above command should look like the following:
AWK scripts are specified by writing pattern/action combinations. In the script try1.awk we have specified two patterns: the BEGIN pattern and the empty pattern.
When AWK initially executes a script against a file or series of files, it first raises a BEGIN event. If the script has specified BEGIN with a body of script code bounded by curly-braces, the code within those curly-braces will execute before anything else occurs.
An END pattern can be specified to handle the END event. The END event is raised when all input files have been processed.
The empty pattern appears simply as a block of code bounded by curly-braces with no pattern in front of it. Each line from every specified input file is processed by the code in that set of curly-braces. Whenever an input line is processed (that is, for any pattern other than BEGIN or END), the current input line and the state of the current input file are exposed through a series of special variables. Some special AWK variables can also be set to alter the way that AWK processes input. A sample of the more common variables appears below:
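The order of the BEGIN, per-line, and END events can be seen in a tiny sketch like the one below. The input lines and the variable name count are made up for illustration, and a POSIX shell with awk is assumed.

```sh
result=$(printf '%s\n' 'first' 'second' 'third' | awk '
BEGIN { print "start" }               # runs once, before any input is read
      { count++ }                     # empty pattern: runs for every input line
END   { print "lines seen: " count }  # runs once, after all input is consumed
')
printf '%s\n' "$result"
```

This prints "start" followed by "lines seen: 3".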
|FS||The field-separator character or regular-expression that AWK is to use to parse fields out of the current input line. This variable can be set in the action paired with the BEGIN pattern.|
|RS||The record-separator character that AWK is to use to parse records out of the input data (GAWK and some other implementations also accept a regular-expression here). This variable can be set in the action paired with the BEGIN pattern.|
|FILENAME||The current filename that AWK is processing. Because multiple files can be specified after the name of the script in an AWK command-line, the name of the file can be determined by this field in order to handle special, per-file processing.|
|$0||The entire unparsed input line of text.|
|$1, $2, $3, ... $n||$1 is the first field parsed from the input line using the delimiter held in FS, $2 is the second field, $3 is the third, and so on. The dollar-sign may be thought of as an operator: any expression can appear to its right, and its value selects the field by number. If the variable "i" has a value of four, $(i+1) is equivalent to using $5 in the same code.|
|NF||The number of fields parsed out of the current line.|
|NR||The number of records read in by the AWK processor. Note that this count does not reset if additional files are processed by specifying them after sample.txt on our command-line.|
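To make the variables above a little more concrete, here is a small sketch that prints several of them for each record. The file name sketch-fields.txt and its two records are hypothetical, and a POSIX shell with awk is assumed.

```sh
# Hypothetical input: one record with five fields, one with two.
printf '%s\n' 'a|b|c|d|e' 'x|y' > sketch-fields.txt

result=$(awk '
BEGIN { FS = "|"; i = 4 }   # with i set to 4, $(i+1) selects the fifth field
{
    printf("%s record %d: NF=%d, $(i+1)=[%s], $NF=[%s]\n",
           FILENAME, NR, NF, $(i+1), $NF)
}' sketch-fields.txt)
printf '%s\n' "$result"
```

The first record reports NF=5 with "e" as both the fifth and the last field. The second reports NF=2, an empty fifth field, and "y" as the last field, since referring to a field beyond NF simply yields an empty string.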
Before processing any files, the action paired with the BEGIN pattern is invoked, setting the field-separator variable FS to the pipe-symbol.
Then, any line of input causes three lines of output. One line will appear with a label and a value for each type of field we're trying to parse ... artist, album name, and genre.
The try1.awk script uses the AWK printf() function borrowed from the C programming-language for output. Windows users should note that they might want to end the format-string with \r\n instead of just \n to ensure that a carriage-return/linefeed combination is generated.
Many AWK examples simply use the built-in print command for output. Usage of printf() is a personal preference. You're welcome to change the code to use print.
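For comparison, this sketch writes the same line three ways: with printf(), with print and string concatenation, and with print and a comma (which joins its arguments with the output field separator, a single space by default). The sample record is a placeholder, and a POSIX shell with awk is assumed.

```sh
result=$(printf '%s\n' 'Artist A|Album A|Jazz' | awk '
BEGIN { FS = "|" }
{
    printf("Artist: %s\n", $1)   # printf: explicit format, explicit newline
    print "Artist: " $1          # print: concatenation, newline added for you
    print "Artist:", $1          # print: comma inserts OFS (a space by default)
}')
printf '%s\n' "$result"
```

All three statements produce the line "Artist: Artist A".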
Please note that after the first line of data in the sample text file, I have included a blank line.
If you examine the output of the script, you'll note that the blank line was processed as though the fields were blank:
The try1.awk script expects the variables $1, $2, and $3 to be present. If there's no data, the script does not abnormally terminate; it simply generates a set of empty output that should probably be cleaned up. Let's change the AWK script to ensure that we handle only the appropriate lines.
After invoking the above AWK script, our output should look like the following:
Some new concepts are introduced in try2.awk. Instead of using an empty pattern, we have specified two expression patterns: NF==3 and NF!=3.
In addition to the empty pattern and the special patterns BEGIN and END, one can specify any AWK expression as a pattern; it is evaluated against the current input record, and the paired action runs when the expression is true. In the script above, we display the artist, album, and genre only if there are exactly three fields (where NF is equal to 3). Otherwise, when NF is not exactly three, we issue an error message. Note that we now have an error message in the output. Ideally, we should write the error message to the stderr device, but this is done differently on different operating-systems, so I'd rather leave that as an exercise for the reader at this time.
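A minimal sketch of the two expression patterns might look like the following. The data and the wording of the error message are hypothetical, and a POSIX shell with awk is assumed.

```sh
result=$(printf '%s\n' 'Artist A|Album A|Jazz' '' 'Artist B|Album B|Country' | awk '
BEGIN { FS = "|" }
NF == 3 {                        # well-formed record: exactly three fields
    printf("Artist: %s\n", $1)
    printf("Album: %s\n", $2)
    printf("Genre: %s\n", $3)
}
NF != 3 {                        # anything else, including the blank line
    printf("Error: line %d has %d field(s), expected 3\n", NR, NF)
}')
printf '%s\n' "$result"
```

Here the blank second line yields "Error: line 2 has 0 field(s), expected 3" instead of three empty labels.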
We're going to make one more change to the series of scripts so that blank lines will simply be ignored.
After invoking the above AWK script, our output should look like the following:
In try3.awk we introduce a new expression pattern NF==0. If there are no fields on the line, we invoke an action that uses the single AWK verb next. The next verb halts the processing of the current line and causes AWK to read the next one and start processing at the top of the script again. Note that we could change the NF!=3 pattern in this script to be the empty pattern. If the first two patterns aren't matched, the number of fields will definitely be something other than three. However, NF!=3 is more explicit, so we'll leave it.
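The effect of next can be sketched as follows. The data is hypothetical and includes both a blank line and a malformed two-field line, so that the difference between the two cases is visible. A POSIX shell with awk is assumed.

```sh
result=$(printf '%s\n' 'Artist A|Album A|Jazz' '' 'Artist C|Album C' 'Artist B|Album B|Country' | awk '
BEGIN { FS = "|" }
NF == 0 { next }                 # blank line: skip the remaining patterns entirely
NF == 3 {
    printf("Artist: %s\n", $1)
    printf("Album: %s\n", $2)
    printf("Genre: %s\n", $3)
}
NF != 3 {                        # only reached for non-blank, malformed lines
    printf("Error: line %d has %d field(s), expected 3\n", NR, NF)
}')
printf '%s\n' "$result"
```

The blank line now produces no output at all, while the two-field line still reports "Error: line 3 has 2 field(s), expected 3".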
To save the contents of the above script's output to a file, we can redirect the output from the command-prompt with the following command-line:
After invoking the above AWK script, the file tmp.xml should look like the following:
Notice that in the new try4.awk script we expand on the number of lines in the action paired with the BEGIN pattern and we add an END pattern/action pair so that we can create the header and footer tags and such for our XML file.
Changes were then made to the printf() calls so that each field would be bounded by XML tags.
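One way the header, footer, and tagged fields could fit together is sketched below. The element names (albums, album, artist, title, genre), the file names, and the esc() helper are assumptions made for illustration, not taken from try4.awk. The helper is an addition that escapes the characters that would otherwise break the XML. A POSIX shell with awk is assumed.

```sh
printf '%s\n' 'Artist A & Friends|Album A|Jazz' '' 'Artist B|Album B|Country' > sketch-data.txt

cat > sketch-xml.awk <<'EOF'
# Hypothetical helper: escape characters that are special in XML text.
function esc(s) {
    gsub(/&/, "\\&amp;", s)   # must run first so later entities are not re-escaped
    gsub(/</, "\\&lt;", s)
    gsub(/>/, "\\&gt;", s)
    return s
}
BEGIN {                       # header: XML declaration and opening root tag
    FS = "|"
    printf("<?xml version=\"1.0\"?>\n")
    printf("<albums>\n")
}
NF == 0 { next }              # ignore blank lines
NF == 3 {
    printf("  <album>\n")
    printf("    <artist>%s</artist>\n", esc($1))
    printf("    <title>%s</title>\n", esc($2))
    printf("    <genre>%s</genre>\n", esc($3))
    printf("  </album>\n")
}
END { printf("</albums>\n") } # footer: closing root tag
EOF

awk -f sketch-xml.awk sketch-data.txt > sketch.xml
cat sketch.xml
```

The BEGIN action emits the header once, each three-field record becomes one album element, and the END action closes the root element.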
AWK does provide a facility for writing to files without redirection from the shell, and the approach is actually similar to redirection from the command-shell. We could have altered the first call to printf() to create a new text file called tmp.xml.
We then would have changed all of the subsequent calls to printf() to use a redirection symbol with two greater-than symbols. This would cause the output to append to the specified file:
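The in-script redirection described above might be sketched like this. The file name sketch-out.txt, the variable name out, and the header and footer strings are placeholders, and a POSIX shell with awk is assumed.

```sh
rm -f sketch-out.txt
printf '%s\n' 'one' 'two' 'three' | awk '
BEGIN { out = "sketch-out.txt"
        printf("header\n") > out }       # first write creates (or truncates) the file
      { printf("%s\n", $0) >> out }      # later writes append to it
END   { printf("footer\n") >> out
        close(out) }                     # flush and close the file explicitly
'
cat sketch-out.txt
```

Note one difference from the shell: within a single AWK run, the > operator truncates the file only when it is first opened, and later writes to the same file name are added after it, so using > on every call would also work here.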
For more handy info on using AWK as a utility language, please see Greg Grothaus' post Why you should know just a little Awk.
Unless otherwise noted, all code and text entries are Copyright ©2011 by James K. Lawless. The AWK code in this post is covered by the MIT/X11 open source license.
Views expressed in this blog are those of the author and do not necessarily reflect those of the author's employer.