switchScanner 0.6


Overview

switchScanner is a small Python script that generates simple lexical scanners in C, built out of nested switch statements. The resulting scanners are generally portable (Win32, x86 Linux, OS X) and are incredibly fast.

The switchScanner home page is here:

	http://www.midwinter.com/~larry/programming/switchScanner/

And you can download a fresh copy of the source code here:

	http://www.midwinter.com/~larry/programming/switchScanner/switchScanner.0.6.tar.gz
	http://www.midwinter.com/~larry/programming/switchScanner/switchScanner.0.6.zip

The Gory Details

You use it from Python like so:

	# assumption: the module exports a class that is also named switchScanner
	from switchScanner import switchScanner
	s = switchScanner("myScannerName")
	s.addKeyword("keywordGoesHere")
	s.addKeyword("secondKeyword")
	...
	s.write()

This will create scanner.c and scanner.h in the current directory. There will be one entry point in scanner.h:

	extern token myScannerName(char **s);

The "token" type is an enum; its values are auto-generated from the keywords. To use the scanner, #include "scanner.h", point s at the string you want to scan, then call:

	token t;
	while ((t = myScannerName(&s)) > TOKEN_NOERROR)
		{
		// handle t as necessary
		}

The scanner will return TOKEN_EOF if it reaches the end of the input string without incident, and TOKEN_ERROR if it encounters an unknown token.
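
To make the calling convention concrete, here is a minimal, self-contained sketch of a driver loop. Nothing in it is real switchScanner output: demoScanner, the keywords "on" and "off", countTokens, the blank-skipping behavior, and the ordering of the enum values (TOKEN_ERROR and TOKEN_EOF below TOKEN_NOERROR, keyword tokens above it) are all assumptions made for illustration. Check your generated scanner.h for the real values.

```c
#include <string.h>

/* Stand-in for the enum in a generated scanner.h. The ordering is an
   assumption: error/EOF sort below TOKEN_NOERROR, keywords above it. */
typedef enum
{
    TOKEN_ERROR,
    TOKEN_EOF,
    TOKEN_NOERROR,
    TOKEN_ON,
    TOKEN_OFF
} token;

/* Hypothetical hand-written scanner with the same signature as the
   generated entry point: it advances *s past each token it returns. */
static token demoScanner(char **s)
{
    char *p = *s;

    while (*p == ' ')   /* assumption: blanks separate keywords */
        p++;
    *s = p;
    if (*p == '\0')
        return TOKEN_EOF;
    if (!strncmp(p, "on", 2) && (p[2] == ' ' || p[2] == '\0'))
    {
        *s = p + 2;
        return TOKEN_ON;
    }
    if (!strncmp(p, "off", 3) && (p[3] == ' ' || p[3] == '\0'))
    {
        *s = p + 3;
        return TOKEN_OFF;
    }
    return TOKEN_ERROR;
}

/* Runs the loop from the text, counting keyword tokens. The token that
   ended the loop (TOKEN_EOF or TOKEN_ERROR) is stored in *last. */
static int countTokens(char *input, token *last)
{
    char *s = input;
    token t;
    int count = 0;

    while ((t = demoScanner(&s)) > TOKEN_NOERROR)
        count++;
    *last = t;
    return count;
}
```

With these stand-ins, countTokens("on off on", &last) counts three keywords and leaves last set to TOKEN_EOF, while an input like "on maybe" stops after one keyword with last set to TOKEN_ERROR. Checking the token that ended the loop is how a caller tells a clean finish from a bad input.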

Okay, so what's the point of all this? Simple: the scanners generated by this script are lightning-fast. They make exactly one pass through each character in the scanned string. And they are explicitly not data-driven; they use switch statements, recognizing each letter in sequence in the keywords. For instance, a scanner that looked for the strings bar, car, cat, and cur might look like this:

	switch (*s++)
		{
		case 'b':
			if (!strcmp(s, "ar"))
				return TOKEN_BAR;
			break;
		case 'c':
			switch (*s++)
				{
				case 'a':
					switch (*s++)
						{
						case 'r':
							return TOKEN_CAR;
						case 't':
							return TOKEN_CAT;
						}
					break;
				case 'u':
					if (*s == 'r')
						return TOKEN_CUR;
					break;
				}
			break;
		}

This is greatly simplified compared to the actual code, which handles case sensitivity and ensures that the keywords terminate. (The above example, for instance, would return TOKEN_CUR for the word curtsey.) The real thing also has some additional silly little optimizations.
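
The termination check mentioned above can be sketched by hand. This is a rough illustration and not the code switchScanner emits: the names endsKeyword and scanKeyword are hypothetical, and the rule that a keyword ends at any character that is not a letter, digit, or underscore is an assumption. Case-insensitive matching is left out.

```c
#include <ctype.h>
#include <string.h>

typedef enum { TOKEN_ERROR, TOKEN_BAR, TOKEN_CAR, TOKEN_CAT, TOKEN_CUR } token;

/* Nonzero when c cannot continue a keyword. Assumption: keywords are
   made of letters, digits, and underscores only. */
static int endsKeyword(char c)
{
    return !(isalnum((unsigned char)c) || c == '_');
}

/* Recognizes bar, car, cat, and cur at the start of s, but only when
   the keyword is followed by a terminating character. */
static token scanKeyword(const char *s)
{
    switch (*s++)
    {
    case 'b':
        if (!strncmp(s, "ar", 2) && endsKeyword(s[2]))
            return TOKEN_BAR;
        break;
    case 'c':
        switch (*s++)
        {
        case 'a':
            switch (*s++)
            {
            case 'r':
                if (endsKeyword(*s))
                    return TOKEN_CAR;
                break;
            case 't':
                if (endsKeyword(*s))
                    return TOKEN_CAT;
                break;
            }
            break;
        case 'u':
            if (*s == 'r' && endsKeyword(s[1]))
                return TOKEN_CUR;
            break;
        }
        break;
    }
    return TOKEN_ERROR;
}
```

With the extra check, scanKeyword("cur") still returns TOKEN_CUR, but scanKeyword("curtsey") now returns TOKEN_ERROR because the character after "cur" is a letter. The cost is one additional comparison per recognized keyword; the scanner still reads each character once.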

Trying out switchScanner

I've included a simple (hacked-up!) sample program for switchScanner. Under UNIX, simply run "make"; Win32 developers should run "nmake /f win32.mak". This will produce sstest (on Windows, sstest.exe), with the scanner in scanner.c.

According to the (super-simple!) benchmark in sstest, my scanner can recognize over 15 million symbols per second on my 933MHz Pentium 3 Linux server.

Notes And Warnings

Licensing

Here's the license:

[BEGIN NOTICE]

Copyright 2005-2006 Larry Hastings

This software is provided 'as-is', without any express or implied warranty.
In no event will the authors be held liable for any damages arising from
the use of this software.

Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute
it freely, subject to the following restrictions:

1. The origin of this software must not be misrepresented; you must not
   claim that you wrote the original software. If you use this software
   in a product, an acknowledgment in the product documentation would be
   appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
   misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
The switchScanner homepage is here:
	http://www.midwinter.com/~larry/programming/switchScanner/

[END NOTICE]

In non-legalese, my goal was to allow you to do anything you like with the software, except claim that you wrote the original. If my license prevents you from doing something you'd like to do, contact me (my email address is in the source) and we can discuss it.

Furthermore, I'd like to point out that my license makes no claim on the output of switchScanner. Scanners you create with switchScanner are entirely your property.

Version History

0.6
Thursday, July 27th, 2006
Bugfix release.
0.5
Tuesday, February 8th, 2005
Initial public release.

Happy scanning!


larry