Originally published on: Tue, 17 Nov 2009 01:17:18 +0000
A question arose in a forum that I recently read asking how one might most efficiently (in terms of both processing time and memory use) extract URLs from text.
I wrote the following program as an example of the approach I'd probably take.
extract_url.c
When building hand-coded lexers, I generally create a map of states for each token and keep incrementing the state to the next legal token-character state until the input character is no longer valid. At that point, my lexer will do something with the characters that had been collected to that point.
In the case of an HTTP(S) URL processor, I defined seven possible states:
After the input state has reached six, the code looks in the string legal_chars to determine whether it should keep tracking valid URL characters. (I may have missed a few. You might need to add characters to this string, or remove some from it, for the code to work properly.)
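As a rough illustration of that lookup, here is a minimal sketch. The contents of legal_chars below are my assumption (the unreserved and reserved characters from RFC 3986, plus '%' for percent-escapes), and is_url_char is a hypothetical helper name, so the string in extract_url.c may well differ.

```c
#include <string.h>

/* Assumed character set: RFC 3986 unreserved and reserved characters,
   plus '%' for percent-escapes. Not necessarily the original string. */
static const char legal_chars[] =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "-._~:/?#[]@!$&'()*+,;=%";

/* Hypothetical helper: returns nonzero if c may appear in a URL body. */
int is_url_char(int c)
{
    /* strchr() also matches the string's terminating NUL, and EOF is
       negative, so reject anything outside 1..127 before searching. */
    if (c <= 0 || c > 127) {
        return 0;
    }
    return strchr(legal_chars, c) != NULL;
}
```

With this set, is_url_char('/') is nonzero, while is_url_char(' ') and is_url_char('"') are zero, so a space or a closing quote ends the URL.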
Once an invalid character or EOF is reached, the code outputs the URL starting from the variable mark up to (but not including) the current input character. The state is then reset to zero and the current character is returned to the input loop, where it is processed again as if it were being seen for the first time.
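The whole cycle described above (advance through the prefix states, collect legal characters, emit from mark, reset, and reprocess the current character) might be sketched as follows. This is not the code from extract_url.c: the named states, the function extract_urls, the in-memory buffer, and the url_chars string (standing in for legal_chars) are all assumptions made for illustration, and the state numbering does not match the original's seven states.

```c
#include <stdio.h>
#include <string.h>

/* Assumed stand-in for legal_chars; adjust to taste. */
static const char url_chars[] =
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "-._~:/?#[]@!$&'()*+,;=%";

/* Hypothetical state names: each one records how much of the
   "http://" or "https://" prefix has been matched so far. */
enum state {
    S_START, S_H, S_HT, S_HTT, S_HTTP, S_HTTPS, S_COLON, S_SLASH, S_BODY
};

/* Writes every URL found in the NUL-terminated text to out, one per
   line, and returns how many were written. */
size_t extract_urls(const char *text, FILE *out)
{
    enum state state = S_START;
    size_t mark = 0;   /* index of the first character of the candidate URL */
    size_t count = 0;
    size_t i = 0;

    for (;;) {
        char c = text[i];            /* the terminating NUL plays the role of EOF */
        enum state next = S_START;   /* default: no legal transition */

        switch (state) {
        case S_START: if (c == 'h') { mark = i; next = S_H; } break;
        case S_H:     if (c == 't') next = S_HT;    break;
        case S_HT:    if (c == 't') next = S_HTT;   break;
        case S_HTT:   if (c == 'p') next = S_HTTP;  break;
        case S_HTTP:  if (c == 's') next = S_HTTPS;
                      else if (c == ':') next = S_COLON;
                      break;
        case S_HTTPS: if (c == ':') next = S_COLON; break;
        case S_COLON: if (c == '/') next = S_SLASH; break;
        case S_SLASH: if (c == '/') next = S_BODY;  break;
        case S_BODY:
            if (c != '\0' && strchr(url_chars, c) != NULL) {
                next = S_BODY;       /* still inside the URL */
            } else {
                /* Emit text[mark] up to, but not including, text[i]. */
                fprintf(out, "%.*s\n", (int)(i - mark), text + mark);
                count++;
            }
            break;
        }

        if (next == S_START && state != S_START) {
            /* Mismatch or end of URL: reset and look at c again
               from the start state without advancing. */
            state = S_START;
            continue;
        }
        state = next;
        if (c == '\0') {
            break;
        }
        i++;
    }
    return count;
}
```

Calling extract_urls("see http://a.com/x and <https://b.org/?q=1>", stdout) writes the two URLs on separate lines and returns 2. Reprocessing the current character after a reset is what lets input such as "hhttp://c.io" still yield http://c.io. Note that this sketch, like any purely character-based scan, keeps trailing punctuation that happens to be legal in a URL, and it reports a bare "http://" if nothing legal follows the prefix.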
To execute extract_url on a text file, you should issue a command line similar to the following:
The source and EXE files for extract_url can be found here: http://www.mailsend-online.com/wp/extract_url.zip
Unless otherwise noted, all code and text entries are Copyright ©2009 by James K. Lawless