Jim Lawless' Blog


Extracting URL Addresses from Text in C

Originally published on: Tue, 17 Nov 2009 01:17:18 +0000

Please note! If you're having difficulties compiling the C source code presented below, please see my post: Compiling C from the Command Line with Pelles C

A question arose in a forum that I recently read asking how one might most efficiently (in terms of both processing time and memory use) extract URLs from text.

I wrote the following program as an example of the approach I'd probably take.

extract_url.c


// extract_url
// Extract http and https URLs from text.
//
// License: MIT / X11
// Copyright (c) 2009 by James K. Lawless
// jimbo@radiks.net http://www.radiks.net/~jimbo
// http://www.mailsend-online.com
//
// Permission is hereby granted, free of charge, to any person
// obtaining a copy of this software and associated documentation
// files (the "Software"), to deal in the Software without
// restriction, including without limitation the rights to use,
// copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the
// Software is furnished to do so, subject to the following
// conditions:
//
// The above copyright notice and this permission notice shall be
// included in all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
// EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
// OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
// NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
// HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
// WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
// FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
// OTHER DEALINGS IN THE SOFTWARE.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <ctype.h>

// states
#define S_h (1)
#define S_t1 (2)
#define S_t2 (3)
#define S_p (4)
#define S_s (5)
#define S_col (6)

// Lower-case alpha only. tolower() will be used
// when searching for legal characters.
char *legal_chars =
   "abcdefghijklmnopqrstuvwxyz0123456789"
   "./\\~#%&()_-+=;?";

int _state;

void print_urls(char *);

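int main(int argc,char **argv) {
   char buff[1024];
   // fgets() stores at most sizeof(buff)-1 characters plus the
   // terminating NUL, so the full buffer size can be passed.
   while(fgets(buff,sizeof(buff),stdin)!=NULL) {
      print_urls(buff);
   }
   return 0;
}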

void print_urls(char *s) {
   char *p,*mark;
   _state=0;
   for(p=s;*p;p++) {
      switch(_state) {
         case 0:
            if(*p=='h') {
               _state=S_h;
               mark=p;
            }
            break;
         case S_h:
            if(*p=='t')
               _state=S_t1;
            else {
               // Mismatch: re-scan this character in state 0, since
               // it may be the 'h' that starts a new URL.
               _state=0;
               p--;
            }
            break;
         case S_t1:
            if(*p=='t')
               _state=S_t2;
            else {
               _state=0;
               p--;
            }
            break;
         case S_t2:
            if(*p=='p')
               _state=S_p;
            else {
               _state=0;
               p--;
            }
            break;
         case S_p:
            if(*p==':')
               _state=S_col;
            else
            if(*p=='s')
               _state=S_s;
            else {
               _state=0;
               p--;
            }
            break;
         case S_s:
            if(*p==':')
               _state=S_col;
            else {
               _state=0;
               p--;
            }
            break;
         case S_col:
            if(strchr(legal_chars,tolower((unsigned char)*p))==NULL) {
               while(mark<p) {
                  fputc(*mark,stdout);
                  mark++;
               }
               fputc('\n',stdout);
               _state=0;
               p--; // backtrack
            }
      }
   }
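   // Flush a URL that runs to the end of the buffer. Only S_col
   // means a complete "http:" or "https:" prefix was seen.
   if(_state==S_col) {
      while(mark<p) {
         fputc(*mark,stdout);
         mark++;
      }
      fputc('\n',stdout);
   }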
}

When building hand-coded lexers, I generally create a map of states for each token and keep incrementing the state to the next legal token-character state until the input character is no longer valid. At that point, my lexer will do something with the characters that had been collected to that point.

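In the case of an HTTP(S) URL processor, I defined seven possible states. They are numbered as in the #define block of the listing; state 0 has no symbolic name in the code:

0       initial state, waiting for an 'h'
S_h     (1) 'h' seen
S_t1    (2) "ht" seen
S_t2    (3) "htt" seen
S_p     (4) "http" seen
S_s     (5) "https" seen
S_col   (6) ':' seen, now collecting URL characters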
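As a rough sketch, the state map can also be written as a pure transition function. The names S_START, restart, next_state, and run_prefix are hypothetical and do not appear in extract_url.c. In this sketch a mismatch restarts on the current character, so an 'h' that breaks a match can begin a new one.

```c
#include <assert.h>

/* The seven states, with the same numeric values as in extract_url.c.
   S_START is a made-up name for state 0, which the listing leaves unnamed. */
enum { S_START, S_h, S_t1, S_t2, S_p, S_s, S_col };

/* Hypothetical helper: on a mismatch, start over, letting c itself
   begin a new match if it is an 'h'. */
static int restart(char c) {
    return c == 'h' ? S_h : S_START;
}

/* Hypothetical helper: the state reached from `state` on character c. */
static int next_state(int state, char c) {
    assert(state >= S_START && state <= S_col);
    switch (state) {
    case S_START: return restart(c);
    case S_h:     return c == 't' ? S_t1 : restart(c);
    case S_t1:    return c == 't' ? S_t2 : restart(c);
    case S_t2:    return c == 'p' ? S_p : restart(c);
    case S_p:     return c == ':' ? S_col : (c == 's' ? S_s : restart(c));
    case S_s:     return c == ':' ? S_col : restart(c);
    default:      return S_col; /* in S_col the legal-character test takes over */
    }
}

/* Hypothetical helper: run a whole string through the map and
   report the state it ends in. */
static int run_prefix(const char *s) {
    int state = S_START;
    for (; *s != '\0'; s++)
        state = next_state(state, *s);
    return state;
}
```

Feeding the characters of "https:" through next_state one at a time, starting from S_START, ends in S_col. A string such as "htp" falls back to S_START.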

After the input state has reached six (S_col), the code looks in the string legal_chars to determine if it should keep tracking valid URL characters. (I may have missed a few. You might need to add or remove some from this string in order for the code to work properly.)
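If the strchr() scan per character ever matters for speed, one possible alternative is a 256-entry lookup table built once from the same string. This is only a sketch: legal_table, build_legal_table, and is_url_char are hypothetical names, and the character set is copied from the listing unchanged.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>

/* Same character set as legal_chars in extract_url.c. */
static const char *legal_chars =
    "abcdefghijklmnopqrstuvwxyz0123456789"
    "./\\~#%&()_-+=;?";

/* One flag per possible unsigned char value. */
static unsigned char legal_table[UCHAR_MAX + 1];
static int table_ready = 0;

/* Hypothetical helper: fill the table once at startup. Upper-case
   letters are marked as well, which stands in for the tolower()
   call made before strchr() in the listing. */
static void build_legal_table(void) {
    const char *c;
    for (c = legal_chars; *c != '\0'; c++) {
        legal_table[(unsigned char)*c] = 1;
        legal_table[(unsigned char)toupper((unsigned char)*c)] = 1;
    }
    table_ready = 1;
}

/* Hypothetical helper: constant-time test for a legal URL character.
   build_legal_table() must have been called first. */
static int is_url_char(char ch) {
    assert(table_ready);
    return legal_table[(unsigned char)ch];
}
```

Adding or removing a character then means editing only the string. Note that ':' is not in the set as published, so a URL such as http://example.com:8080/ would be cut off at the port separator. Whether to add it is a judgment call, because a colon that follows a URL in running text would then be captured too.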

Once an invalid character or the end of the input line is reached, the code outputs the URL starting from the variable mark up to (but not including) the current input character. After an invalid character, the state is reset to zero and the current character is returned to the input loop for first-time processing.
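The mark-to-current-character idea can also be packaged so that the caller receives each URL as a pointer and a length instead of having it printed. The following is a minimal sketch under that assumption, not the code from extract_url.c. The function name next_url and its signature are hypothetical, and the state numbers follow the listing (0 through 6).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Same character set as legal_chars in extract_url.c. */
static const char *legal_chars =
    "abcdefghijklmnopqrstuvwxyz0123456789"
    "./\\~#%&()_-+=;?";

/* Hypothetical helper: find the first URL at or after s. On success,
   store the span [*start, *start + *len) and return 1. Return 0 if
   no complete "http:" or "https:" prefix is found. Nothing is copied. */
static int next_url(const char *s, const char **start, size_t *len) {
    static const char rest[] = "ttp";   /* characters expected in states 1..3 */
    const char *p, *mark = NULL;
    int state = 0;

    assert(s != NULL && start != NULL && len != NULL);
    for (p = s; *p != '\0'; p++) {
        if (state == 6) {
            if (strchr(legal_chars, tolower((unsigned char)*p)) != NULL)
                continue;               /* still inside the URL */
            break;                      /* *p is the first character after it */
        }
        if (state >= 1 && state <= 3 && *p == rest[state - 1])
            state++;
        else if (state == 4 && *p == 's')
            state = 5;
        else if ((state == 4 || state == 5) && *p == ':')
            state = 6;
        else if (*p == 'h') {           /* start, or restart, a match here */
            state = 1;
            mark = p;
        } else
            state = 0;
    }
    if (state != 6)
        return 0;
    *start = mark;
    *len = (size_t)(p - mark);          /* up to, but not including, *p */
    return 1;
}
```

A caller would loop, passing start + len back in as the next s, and write each span with something like fwrite(start, 1, len, stdout). Because the span points into the caller's buffer, no extra memory is needed per URL.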

To run extract_url on a text file, issue a command line similar to the following:


extract_url < tmp.txt

The source and EXE files for extract_url can be found here: http://www.mailsend-online.com/wp/extract_url.zip

Unless otherwise noted, all code and text entries are Copyright ©2009 by James K. Lawless



Views expressed in this blog are those of the author and do not necessarily reflect those of the author's employer. Views expressed in the comments are those of the responding individual.
