Jim Lawless' Blog


Along Came AWK

Originally published on: Fri, 08 Jan 2010 03:12:03 +0000

Early in my career as a software developer in the late 1980s, I wrote code primarily in C for MS-DOS machines. I had been interested in Unix but had no access to a machine with that operating system. Nonetheless, I read a lot of Unix books and articles. I had stumbled across some information on a programming language called AWK that piqued my interest.

I soon found a tiny AWK implementation called BAWK ( Bob Brodt's AWK ). It lacked associative-arrays and some other AWK niceties, but it introduced me to the simplicity that AWK affords for writing filter programs.

I later found an AWK interpreter that I simply remember as AWK210 because of the archive name. This version of AWK ported to MS-DOS had support for these new-to-me constructs called associative-arrays.

I bought the book The AWK Programming Language and was hooked.

The concept of associative-arrays led me to approach things differently in my AWK code than I would have in C. I envisioned associative-arrays as small, in-memory, single-key databases. Of course, at the time, I didn't use real databases in my code. I was still using b+tree-indexed data files, but I think you understand where I'm going.

I was able to write scripts that maintained in-memory associations that proved to shorten a lot of my code. I was familiar with the concept of a symbol-table as that concept was introduced early in the book Compilers: Principles, Techniques, and Tools. The book contained a sample C function that used a linear-feedback hash to map string keys to string values. AWK just made all of that seem very simple. I didn't really know what algorithm it was using to map the keys to values and I didn't care.
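To make the "single-key database" idea concrete, here is a minimal sketch rather than one of the original scripts. It assumes input lines of the form "key amount"; the function name sum_by_key and the array name total are made up for illustration. The trailing sort is only there because AWK does not promise any particular order when looping over an array.

```shell
# Hypothetical helper: total the second field for each distinct first field.
sum_by_key() {
    awk '
        # The first field is the key. The array entry springs into
        # existence on first use; no declaration or hashing code needed.
        { total[$1] += $2 }

        END {
            for (key in total)
                print key, total[key]
        }
    ' | sort
}
```

Feeding it the three lines "apples 3", "pears 2", and "apples 4" on standard input prints "apples 7" and "pears 2".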

I wrote numerous short utilities in AWK at work. At home, I'd written a couple of utilities that leveraged the MS-DOS DEBUG.COM utility. The first was a simple assembler that provided the ability to use labels in scripts that would be piped through DEBUG to generate an executable .COM file. What better use for an associative-array than as a symbol-table for an assembler?
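The label idea can be sketched as a classic two-pass scheme. This is a toy under loud assumptions, not the assembler described above: it pretends every instruction occupies exactly one address unit (real x86 instructions vary in length), it treats any first word ending in a colon as a label definition, and it only looks for a label in the second word of a line. The names resolve_labels, symtab, and addr are hypothetical.

```shell
# Hypothetical helper: resolve_labels FILE
# Reads FILE twice: once to collect labels, once to substitute them.
resolve_labels() {
    awk '
        BEGIN { addr = 0 }

        # Pass 1 (first read of the file): record the address of each label.
        NR == FNR {
            if ($1 ~ /:$/) {
                name = substr($1, 1, length($1) - 1)
                symtab[name] = addr
            } else if (NF > 0) {
                addr++      # ASSUMPTION: one address unit per instruction
            }
            next
        }

        # Pass 2: skip label definitions and blank lines.
        $1 ~ /:$/ || NF == 0 { next }

        # Replace a label operand with the address recorded in pass 1.
        {
            if ($2 in symtab)
                $2 = symtab[$2]
            print
        }
    ' "$1" "$1"
}
```

Given a file containing "start:", "mov ax,1", "loop:", "dec ax", "jnz loop", and "jmp start" on separate lines, the sketch emits the four instructions with "jnz 1" and "jmp 0" in place of the label references.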

The second utility would generate an input file for DEBUG that would read in raw sectors from floppies using the L(oad) command and would write chunks of them out as binary data files. At the end of the process, all of these files would be collected into a single archive. I wrote a counterpart script to extract these binaries, read them in to DEBUG and write them out as sectors to the specified floppy. These two sets of scripts allowed me to make sector-for-sector floppy-disk images without much hassle.

The desktop calculator example in The AWK Programming Language was so simple that it helped me to understand the mechanics of recursive-descent parsing. I was reading Jack Crenshaw's Let's Build a Compiler series in the pages of Computer Language magazine and was having trouble grasping this concept when trying to read it in the presented C code.

I later wrote a similar parser in C for a desktop calculator using those techniques. Later yet, I applied those concepts to the development of a report-generation language that unfortunately didn't end up seeing the light of day.
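For readers who have not seen recursive descent, the following is a small sketch of the technique in AWK. It is not the listing from the book and not my C parser. It assumes tokens are separated by whitespace so that AWK's field splitting does the lexing, and the names calc, expr, term, factor, and pos are chosen only for illustration. Error handling is deliberately thin: a missing closing parenthesis is silently accepted and division by zero is left to AWK to complain about.

```shell
# Hypothetical helper: evaluate one arithmetic expression per input line.
calc() {
    awk '
        NF > 0 {
            pos = 1                     # index of the next unread token
            result = expr()
            if (pos <= NF)
                print "syntax error at: " $pos
            else
                print result
        }

        # expr -> term { (+|-) term }
        function expr(   value, op) {
            value = term()
            while ($pos == "+" || $pos == "-") {
                op = $pos; pos++
                if (op == "+") value += term()
                else           value -= term()
            }
            return value
        }

        # term -> factor { (*|/) factor }
        function term(   value, op) {
            value = factor()
            while ($pos == "*" || $pos == "/") {
                op = $pos; pos++
                if (op == "*") value *= factor()
                else           value /= factor()
            }
            return value
        }

        # factor -> number | ( expr )
        function factor(   value) {
            if ($pos == "(") {
                pos++
                value = expr()
                if ($pos == ")") pos++
                return value
            }
            return $(pos++) + 0         # force the token to a number
        }
    '
}
```

Each grammar rule becomes one function, and operator precedence falls out of which function calls which. Piping the line "2 + 3 * ( 4 - 1 )" into calc prints 11.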

I often used ( and still use ) AWK to create test data. In the old MS-DOS days, I would generate data in flat-files that my programs could read. More recently, I taught a couple of people at work how to extract data that they'd built in Excel and ( by way of an AWK script ) generate SQL code to load the data into a database table. This allowed others to tweak the data in a static form in the spreadsheet. We could then load the data by means of the AWK script.
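A stripped-down sketch of that spreadsheet-to-SQL idea follows. It assumes the sheet was saved as a simple comma-separated file whose first row holds the column names and whose cells contain no embedded commas or quoted fields. The function name csv_to_sql and the table name passed to it are hypothetical. Every value is emitted as a quoted string, which many databases will coerce for numeric columns, but that is an assumption to check against your own database.

```shell
# Hypothetical helper: csv_to_sql TABLE < file.csv
csv_to_sql() {
    awk -F',' -v table="$1" -v q="'" '
        # Strip the carriage return that Windows tools leave on each line.
        { sub(/\r$/, "") }

        # The header row supplies the column list.
        NR == 1 { cols = $0; next }

        {
            values = ""
            sep = ""
            for (i = 1; i <= NF; i++) {
                field = $i
                gsub(q, q q, field)     # double single quotes for SQL
                values = values sep q field q
                sep = ", "
            }
            printf "INSERT INTO %s (%s) VALUES (%s);\n", table, cols, values
        }
    '
}
```

With a header row of "id,name" and a data row of "1,O'Brien", calling csv_to_sql people produces: INSERT INTO people (id,name) VALUES ('1', 'O''Brien');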

I have also used AWK to generate code in C and Java based on input data.

I have used AWK to analyze source-code as well. I had written a series of scripts that ultimately parsed some C code written by a team that I was leading. The scripts looked for conformity to the coding standards that had been set in place. I would run the reports each night after everyone had committed their updates. In those days, that meant that they copied their most stable source files into a common directory tree. The scripts would produce a report indicating source files that appeared to be violating the standards, citing the suspicious code. That script took a while to run, but I would kick it off in the evening and would forget about it until morning.

In the mid-90s, I happened upon the Thompson Automation AWK compiler (TAWK) for MS-DOS. I reviewed it for an electronic magazine at the time. As part of my agreement with the publisher, I was able to keep the copy of TAWK for DOS, but I was not able to use it for commercial purposes.

The TAWK for DOS compiler was just great. It created stand-alone EXE's. It had its own virtual-memory system that would intelligently use Extended Memory, Expanded Memory, and disk-files as necessary. This meant that I could write DOS programs that handled huge amounts of data in memory and would not be bound by the 640K ( really 1Meg ) memory barrier.

A little later, I had been in touch with Pat Thompson about trying out the version of TAWK I'd heard about for MS Windows 95. I received a review copy of the compiler. At the time, I had begun to write an AWK book for a publisher and wondered if I could write a book specifically on TAWK.

The publisher was a little stand-offish about writing for such a specific product and Thompson Automation was a little concerned about piracy. Their manual was their only copy-protection. I understood completely. The compiler was absolutely wonderful.

TAWK supported interfaces to Windows API calls and callbacks. It featured functions that allowed me to allocate and manipulate more C-ish binary data-structures. It also featured easy access to TCP/IP routines.

Instead of writing a book about TAWK, I wrote a review of the compiler for Dr. Dobb's Journal of Software Tools.

You can read the full copy of the review, Examining the TAWK Compiler - Dr. Dobb's Journal, May 1997, here:

http://www.ddj.com/dept/architect/184410193

I liked the compiler so much that I wrote two of my early commercial Internet e-mail programs with it. I wrote MailSend first ( as mentioned here: http://www.mailsend-online.com/blog?p=53 ). Some of my customers were asking for a counterpart utility to read their POP3 mail, so I wrote MailGrab in TAWK as well.

Neither of those programs uses AWK's input capabilities ... all processing is handled in the BEGIN pattern.
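For anyone unfamiliar with the idiom: an AWK program that consists only of a BEGIN rule runs that block and exits without reading any input, which lets AWK behave like a general-purpose scripting language. The sketch below is a trivial stand-in using standard AWK, not code from MailSend or MailGrab; the function name squares and the variable n are hypothetical.

```shell
# Hypothetical helper: squares N prints the first N squares.
squares() {
    awk -v n="$1" 'BEGIN {
        # No pattern-action rules follow, so awk never reads input;
        # the whole program is this block.
        for (i = 1; i <= n; i++)
            print i, i * i
    }'
}
```

Running squares 3 prints "1 1", "2 4", and "3 9" on separate lines, and it does so without waiting on standard input.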

Many may find it odd that I wrote the utilities in an AWK dialect, but I was quite productive using TAWK. I didn't have to worry about pointers ( for the most part ... I did have to pack and unpack some data-structures to provide some of the features ). I didn't have to worry about a lot of stringently-defined data-structures; I was able to pass associative-arrays around where I needed to keep data. Some of the run-time support routines issued error messages with the original source file line-numbers in them when I'd introduced a bug in trying to read a file. I was able to debug my software quickly.

The compiler and source code fit neatly onto a single 1.44Meg floppy disk. In these days of multi-gigabyte USB flash-ROM drives, that may not seem like much, but I was rather happy to be able to carry the entire development environment for both products and their source around with me so that I could alter and compile them on any 32-bit Windows computer.

I have been selling those products now for over 10 years. Their stability and ability to run on continually evolving versions of Microsoft Windows are amazing.

Thompson Automation ceased doing business several years ago, so the TAWK compiler has not been updated nor is it sold. I am saddened each time I think about the company because TAWK is one of the most solid compilers I have ever used.

These days, I use the GNU AWK interpreter frequently. I use it at work to interrogate data and source files. On several occasions, we've needed to update some code or upgrade to a new JEE container and had to ensure that we had removed all code that might cause issues. A few short AWK scripts always make that whole process quite painless.

When some friends were having trouble finding work, I wrote an AWK script that processes several files downloaded from a job-posting web-site into a single HTML file with the newest links at the top with all duplicate job-postings removed.
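The heart of a script like that is AWK's well-known duplicate-removal idiom, !seen[key]++. The sketch below is not the original script. It assumes each downloaded file holds tab-separated lines of the form "date, URL, title" with dates written as YYYY-MM-DD so that a reverse text sort puts the newest first, and it treats two postings as duplicates when their URLs match. The function name jobs_to_html is hypothetical, and no HTML escaping is attempted.

```shell
# Hypothetical helper: jobs_to_html FILE...
jobs_to_html() {
    # Reverse sort puts the newest dates first (assumes YYYY-MM-DD).
    LC_ALL=C sort -r "$@" | awk -F'\t' '
        BEGIN { print "<html><body><ul>" }

        # seen[$2] is 0 the first time a URL appears, so the rule fires;
        # the ++ then marks it so later copies are skipped.
        !seen[$2]++ {
            printf "<li>%s <a href=\"%s\">%s</a></li>\n", $1, $2, $3
        }

        END { print "</ul></body></html>" }
    '
}
```

Because the sort happens before the duplicate check, the copy that survives is always the most recent one.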

In the coming months, I hope to provide examples of each type of AWK script I've mentioned here as I think these kinds of tools can be very handy.

Posts I've written on this blog that contain AWK scripts are:

http://www.mailsend-online.com/blog?p=65

I use the above script to generate an HTML page from a text file so that my favorite links from the HackerNews site are updated and available to the public.

http://www.mailsend-online.com/blog?p=50

The above is a source-code obfuscation utility for C# and Java using AWK. It requires a little coding discipline as a specific naming-convention has to be observed. It's a prototype at best, but it's the second such AWK program I've written to obfuscate the source for these two languages.

http://www.mailsend-online.com/blog?p=10

When I began to write for this blog on wordpress.com, I wanted a way to turn the RSS syndication feed into an HTML file so that I could preserve a simple set of links to all of my posts. I use a variant of the script in the post above to translate the RSS XML into this HTML file: http://www.mailsend-online.com/bloglist.htm

I have another script I use for this site that generates the social media bookmark tags you'll find at the end of each post. Right now, I'd prefer to keep that script to myself.

I love AWK not only because I can write useful things quickly, but because using it has helped me simplify the way I approach various kinds of problems.

You might check back here every once in a while. I may finally write that AWK book I've been thinking about. ;-)

Unless otherwise noted, all code and text entries are Copyright 2010 by James K. Lawless


Views expressed in this blog are those of the author and do not necessarily reflect those of the author's employer.

