Jim Lawless' Blog


Extracting URL Addresses from Text in C

Originally published on: Tue, 17 Nov 2009 01:17:18 +0000

A question arose in a forum that I recently read asking how one might most efficiently (in terms of both processing time and memory use) extract URLs from text.

I wrote the following program as an example of the approach I'd probably take.

extract_url.c


// extract_url
// Extract http and https URLs from text.
//
// License: MIT / X11
// Copyright (c) 2009 by James K. Lawless
// jimbo@radiks.net http://www.radiks.net/~jimbo
// http://www.mailsend-online.com
//
// Permission is hereby granted, free of charge, to any person
// obtaining a copy of this software and associated documentation
// files (the "Software"), to deal in the Software without
// restriction, including without limitation the rights to use,
// copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the
// Software is furnished to do so, subject to the following
// conditions:
//
// The above copyright notice and this permission notice shall be
// included in all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
// EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
// OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
// NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
// HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
// WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
// FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
// OTHER DEALINGS IN THE SOFTWARE.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>

// states
#define S_h (1)
#define S_t1 (2)
#define S_t2 (3)
#define S_p (4)
#define S_s (5)
#define S_col (6)

   // Lower-case alpha only. tolower() will be used
   // when searching for legal characters.
char *legal_chars =
   "abcdefghijklmnopqrstuvwxyz0123456789"
   "./\\~#%&()_-+=;?";

int _state;

void print_urls(char *);

int main(void) {
   char buff[1024];
   while(fgets(buff,sizeof(buff),stdin)!=NULL) {
      print_urls(buff);
   }
   return 0;
}

void print_urls(char *s) {
   char *p,*mark;
   _state=0;
   for(p=s;*p;p++) {
      switch(_state) {
         case 0:
            if(*p=='h') {
               _state=S_h;
               mark=p;
            }
            break;
         case S_h:
            if(*p=='t')
               _state=S_t1;
            else
               _state=0;
            break;
         case S_t1:
            if(*p=='t')
               _state=S_t2;
            else
               _state=0;
            break;
         case S_t2:
            if(*p=='p')
               _state=S_p;
            else
               _state=0;
            break;
         case S_p:
            if(*p==':')
               _state=S_col;
            else
            if(*p=='s')
               _state=S_s;
            else
               _state=0;
            break;
         case S_s:
            if(*p==':')
               _state=S_col;
            else
               _state=0;
            break;
         case S_col:
            // Cast first: passing a negative char to tolower() is undefined.
            if(strchr(legal_chars,tolower((unsigned char)*p))==NULL) {
               while(mark<p) {
                  fputc(*mark,stdout);
                  mark++;
               }
               fputc('\n',stdout);
               _state=0;
               p--; // backtrack
            }
      }
   }
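   // Only flush when a complete "http:" or "https:" prefix was seen;
   // a line ending in a partial prefix such as "htt" is not a URL.
   if(_state==S_col) {
      while(mark<p) {
         fputc(*mark,stdout);
         mark++;
      }
      fputc('\n',stdout);
   }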
}

When building hand-coded lexers, I generally create a map of states for each token and keep incrementing the state to the next legal token-character state until the input character is no longer valid. At that point, my lexer will do something with the characters that had been collected to that point.

In the case of an HTTP(S) URL processor, I defined seven possible states:

    Zero - Starting state. If the input character is an 'h', set the variable mark to refer to this position in the input line and move to state one.
    One - We have an 'h' from http or https.
    Two - We have the first 't' from http or https.
    Three - We have the second 't' from http or https.
    Four - We have the 'p' from http or https.
    Five - We have the 's' from https.
    Six - We have the colon-character from http or https.
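As a rough sketch, the prefix states listed above could also be written as a single transition function. The names url_state, restart and next_prefix_state are hypothetical and do not appear in extract_url.c. Unlike the listing, this version treats a mismatching 'h' as the possible start of a new prefix, so input such as "hhttp:" still reaches the final state.

```c
/* Hypothetical names; the program above uses the S_h ... S_col macros. */
enum url_state { ST_START, ST_H, ST_T1, ST_T2, ST_P, ST_S, ST_COLON };

/* On a mismatch, an 'h' may itself begin a new prefix. */
static enum url_state restart(char c) {
    return c == 'h' ? ST_H : ST_START;
}

/* Return the state that follows `st` after reading character `c`
   while the "http:" or "https:" prefix is being matched. */
enum url_state next_prefix_state(enum url_state st, char c) {
    switch (st) {
    case ST_START: return restart(c);
    case ST_H:     return c == 't' ? ST_T1 : restart(c);
    case ST_T1:    return c == 't' ? ST_T2 : restart(c);
    case ST_T2:    return c == 'p' ? ST_P  : restart(c);
    case ST_P:
        if (c == ':') return ST_COLON;
        if (c == 's') return ST_S;
        return restart(c);
    case ST_S:     return c == ':' ? ST_COLON : restart(c);
    case ST_COLON: return ST_COLON; /* URL body is handled elsewhere */
    }
    return ST_START;
}
```

A caller using this function would need to update mark every time the returned state is ST_H, since that is where a candidate URL begins.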

After the state has reached six, the code looks up each input character in the string legal_chars to determine whether it should keep tracking valid URL characters. (I may have missed a few. You might need to add or remove some from this string in order for the code to work properly. For example, ':' is not in the string, so a URL with an explicit port is cut off just before the port.)
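If the strchr() scan over legal_chars ever shows up as a cost, one possible alternative is a 256-entry lookup table built once at startup. This is only a sketch; url_char_ok, init_url_chars and is_url_char are hypothetical names that are not part of extract_url.c.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical table: nonzero entries mark characters allowed in a URL. */
static unsigned char url_char_ok[256];

/* Build the table once from a legal-character string such as legal_chars. */
void init_url_chars(const char *legal) {
    memset(url_char_ok, 0, sizeof url_char_ok);
    for (; *legal; legal++)
        url_char_ok[(unsigned char)*legal] = 1;
}

/* Case-insensitive membership test using a single array index.
   The casts keep tolower() and the index within unsigned char range. */
int is_url_char(char c) {
    return url_char_ok[(unsigned char)tolower((unsigned char)c)];
}
```

After calling init_url_chars(legal_chars) once in main(), the test in state six could become !is_url_char(*p) in place of the strchr() call.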

Once an invalid character or the end of the input line is reached, the code outputs the URL starting from the variable mark up to (but not including) the current input character. When an invalid character ended the URL, the state is then reset to zero and the current character is returned to the input loop for first-time processing.
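The mark-and-emit step can be sketched on its own as a function that copies each URL into a caller-supplied buffer instead of printing it. To keep the focus on that step, this sketch matches the prefix with strncmp() rather than the state machine, and next_url and url_legal are hypothetical names that do not appear in extract_url.c.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Same characters as legal_chars in the listing (assumed, lower-case only). */
static const char *url_legal =
    "abcdefghijklmnopqrstuvwxyz0123456789./\\~#%&()_-+=;?";

/* Copy the next URL at or after *cursor into out (truncating if needed).
   Returns 1 if a URL was found, 0 otherwise. On success *cursor is left
   on the character that ended the URL, so the caller rescans from there. */
int next_url(const char **cursor, char *out, size_t outsz) {
    const char *p = *cursor;
    if (outsz == 0) return 0;
    for (; *p; p++) {
        size_t skip, len;
        const char *mark;
        if (strncmp(p, "http:", 5) == 0) skip = 5;
        else if (strncmp(p, "https:", 6) == 0) skip = 6;
        else continue;

        mark = p;                 /* start of the candidate URL */
        p += skip;
        /* Advance until the first illegal character or the terminator. */
        while (*p && strchr(url_legal, tolower((unsigned char)*p)) != NULL)
            p++;

        len = (size_t)(p - mark); /* up to, not including, *p */
        if (len >= outsz) len = outsz - 1;
        memcpy(out, mark, len);
        out[len] = '\0';
        *cursor = p;
        return 1;
    }
    *cursor = p;
    return 0;
}
```

Calling next_url() in a loop until it returns 0 visits every URL on a line, with the cursor playing the role of the backtracked input position.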

To run extract_url on a text file, issue a command line similar to the following:


extract_url < tmp.txt

The source and EXE files for extract_url can be found here: http://www.mailsend-online.com/wp/extract_url.zip

Unless otherwise noted, all code and text entries are Copyright ©2009 by James K. Lawless
