Originally published on: Tue, 17 Nov 2009 01:17:18 +0000
Please note! If you're having difficulties compiling the C source code presented below, please see my post: Compiling C from the Command Line with Pelles C
A question arose in a forum that I recently read asking how one might most efficiently (in terms of both processing time and memory use) extract URLs from text.
I wrote the following program as an example of the approach I'd probably take.
extract_url.c
When building hand-coded lexers, I generally create a map of states for each token and keep incrementing the state to the next legal token-character state until the input character is no longer valid. At that point, my lexer will do something with the characters that had been collected to that point.
In the case of an HTTP(S) URL processor, I defined seven possible states:
Once the state has reached six, the code looks in the string legal_chars to decide whether it should keep collecting URL characters. (I may have missed a few. You might need to add characters to this string, or remove some, for the code to work properly.)
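Searching legal_chars costs one pass over that string per input character. If that ever matters, one possible alternative (not what the program above does) is to build a 256-entry lookup table from the same string once, so each test becomes a single array access. The sketch below assumes a character set based on the reserved and unreserved characters of RFC 3986; the contents of legal_chars and the names url_char_table, init_url_char_table, and is_url_char are illustrative, not taken from extract_url.c.

```c
#include <string.h>

/* Assumed set of characters allowed in a URL; adjust to taste. */
static const char legal_chars[] =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "-._~:/?#[]@!$&'()*+,;=%";

/* One flag per possible byte value: 1 means "legal URL character". */
static unsigned char url_char_table[256];

/* Fill the table once from legal_chars before scanning any input. */
void init_url_char_table(void)
{
    const char *p;

    memset(url_char_table, 0, sizeof url_char_table);
    for (p = legal_chars; *p != '\0'; p++)
        url_char_table[(unsigned char)*p] = 1;
}

/* c is expected to be an unsigned char value or EOF, as returned by getc().
   EOF is negative, so it is rejected by the range check. */
int is_url_char(int c)
{
    return c >= 0 && c < 256 && url_char_table[c];
}
```

The table costs 256 bytes of memory, so whether it is worth it over a plain string search depends on how much text is being scanned.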
Once an invalid character or EOF is reached, the code outputs the URL, starting at the position saved in the variable mark and running up to (but not including) the current input character. The state is then reset to zero and the current character is returned to the input loop to be processed again from the initial state.
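The mark, emit, and reprocess steps described above can be sketched roughly as follows. This is a minimal illustration and not the original extract_url.c: it scans a string in memory rather than a file, its state numbering (0 through 6 count matched characters of "http://", 7 means "inside the URL") is an assumption that may differ from the seven states used in the original, and the names extract_urls, prefix, BODY, is_legal, and seen_s are hypothetical. The contents of legal_chars are also assumed. The scheme is matched in lowercase only.

```c
#include <stddef.h>
#include <string.h>

/* Assumed set of characters allowed in a URL; adjust to taste. */
static const char legal_chars[] =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "-._~:/?#[]@!$&'()*+,;=%";

static const char prefix[] = "http://";
enum { BODY = 7 };  /* states 0..6 count matched prefix characters */

static int is_legal(char c)
{
    /* strchr also matches the terminating NUL, so exclude it explicitly. */
    return c != '\0' && strchr(legal_chars, c) != NULL;
}

/* Appends each URL found in text to out, one per line.
   Returns the number of URLs written; stops early if out is full.
   A bare "http://" with nothing after it is emitted as well. */
size_t extract_urls(const char *text, char *out, size_t out_size)
{
    size_t i = 0, mark = 0, used = 0, count = 0;
    int state = 0, seen_s = 0;

    if (out_size == 0)
        return 0;
    out[0] = '\0';

    for (;;) {
        char c = text[i];

        if (state == BODY) {
            if (is_legal(c)) {      /* still inside the URL */
                i++;
                continue;
            }
            /* Emit text[mark..i), excluding the current character. */
            size_t len = i - mark;
            if (used + len + 2 > out_size)
                break;
            memcpy(out + used, text + mark, len);
            used += len;
            out[used++] = '\n';
            out[used] = '\0';
            count++;
            state = 0;              /* i is not advanced: c is reprocessed */
            continue;
        }

        if (c == '\0')
            break;

        if (c == prefix[state]) {
            if (state == 0) {       /* possible start of a URL */
                mark = i;
                seen_s = 0;
            }
            state++;
            i++;
        } else if (state == 4 && c == 's' && !seen_s) {
            seen_s = 1;             /* optional 's' of "https" */
            i++;
        } else if (state == 0) {
            i++;                    /* ordinary text: skip it */
        } else {
            state = 0;              /* mismatch: reprocess c from state 0 */
        }
    }
    return count;
}
```

Calling extract_urls("See http://a.com/x and https://b.org now", buf, sizeof buf) with a sufficiently large char buffer buf leaves two lines in buf, one per URL. Note that punctuation such as a trailing comma is in the assumed legal_chars set, so it would be kept as part of a URL it directly follows.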
To execute extract_url on a text file, you should issue a command line similar to the following:
The source and EXE files for extract_url can be found here: http://www.mailsend-online.com/wp/extract_url.zip
Unless otherwise noted, all code and text entries are Copyright ©2009 by James K. Lawless
Views expressed in this blog are those of the author and do not necessarily reflect those of the author's employer. Views expressed in the comments are those of the responding individual.
