A Little Distance From GitHub

In years past, I’ve been pretty cavalier about just letting old programs disappear, but I’m trying to do a better job these days.  After all, as a programmer, one of the most valuable things I produce is the source of my programs.

Like the rest of the world, I started putting code on GitHub, and it’s a really nice service. I can see why it’s so popular. However, it has always nagged me that I was putting my work on a server I don’t control. Occasionally, I see stories about trouble at the company or controversial developments as well.

This site is hosted by DreamHost, which offers SVN hosting as part of the deal, so I’ve started using that. I know some people will think of SVN as a step down, but I’m comfortable with both git and SVN, and either will suffice for my personal work.

This move is part of a broader trend for me, away from “free” hosts that want to own my content. Blogs are on the decline since Facebook and Twitter took over. It may be easier to find an audience in a crowded social network, but I’m not sure that compensates me for giving away my content.  So, I’ve started paying for hosting again, and I plan to use that server as the hub for all the on-line presence I have.

Re-Implementing BasCat In Scala

One look at my GitHub repos will make it clear: I love to play with programming languages.  I routinely implement programs multiple times in different languages so I can compare them.

Today I re-implemented bascat, one of my Go programs, in Scala. In general, I got the trade-offs I expected:

  • The Scala program is a lot shorter, and the code is more declarative. I’d much rather maintain this version.
  • The Go program uses far fewer runtime resources, and is quicker to start up.
  • I had a lot fewer choices to make when writing the Go version, because Go gives you a relatively small number of constructs with which to work.  Progress on the Go version, as a result, was steady. In contrast, I spent most of the time on the Scala version thinking over all the implementation options available. As a result, I subjectively enjoyed writing the Go version more.

 

A Better Sed API

I retooled my sed package last night to improve the interface for embedding in a Go program.

Originally, the interface matched the needs of the driver program directly. The driver identifies the input stream and the output stream (stdout), and just needs to connect them. As a result, a call to the library looked like this:

engine, parseError := sed.New(pgm)
engine.Run(os.Stdin, os.Stdout)

That worked great for the go-sed driver. However, what if you don’t want to process the entire input in one shot? After a few minutes, I realized how much better the interface would be if the sed engine simply wrapped an io.Reader. Plenty of other libraries work this way; you wrap Readers for decompression and for decryption. Why not for sed processing?
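To make the wrapping pattern concrete, here is a minimal sketch using the standard library's compress/gzip package, which is one of the libraries that works this way. The helper names compress and readAllGzip are made up for this illustration; only gzip.NewWriter, gzip.NewReader, and io.ReadAll are real library calls.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// compress gzips s in memory so the example has something to read back.
// (Hypothetical helper, only here to build test input.)
func compress(s string) (*bytes.Buffer, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(s)); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return &buf, nil
}

// readAllGzip wraps r in a gzip reader and drains it. The caller only
// ever sees an io.Reader; decompression happens as bytes are pulled.
func readAllGzip(r io.Reader) (string, error) {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return "", err
	}
	defer zr.Close()
	out, err := io.ReadAll(zr)
	return string(out), err
}

func main() {
	buf, err := compress("hello, reader\n")
	if err != nil {
		panic(err)
	}
	text, err := readAllGzip(buf)
	if err != nil {
		panic(err)
	}
	fmt.Print(text)
}
```

The appeal is that the consumer decides how much to read and when, which is exactly what the one-shot Run call could not offer.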

Implementation

Unfortunately, it wasn’t an easy change. The original library expects to have an io.Writer in hand, and just write to it at will. Any errors on the write mean an error for the overall process.

In contrast, supporting the io.Reader.Read method means that the library fills a fixed-size []byte, and running out of room is not an error. The vm needs to pick up where it left off on the next call to Read.

I solved this issue by:

  • Adding an overflow string to the vm state.
  • Creating a writeString function that all vm output goes through.
  • Using writeString to copy any excess to overflow.
  • Creating a new bufferFull error value to signal that the last vm instruction wrote to overflow.
  • Having Read() check and output any overflow data prior to restarting the vm at the instruction where it left off.

It took a couple of hours from start to finish. I also added a test to the engine_test.go file that exercises Read() with a pathological 2-byte buffer, to make sure that all the bytes are transferred correctly across vm restarts.

Outcome

Now, say you are processing an io.Reader and one of the first steps you need to accomplish is removing blank lines and comment lines. You can use an embedded sed engine quickly and easily to get an equivalent io.Reader where that part is already done.

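func process(infile io.Reader) error {
   pgm := strings.NewReader(`s/#.*$// ; /^$/d`)
   engine, parseError := sed.New(pgm)
   if parseError != nil {
      return parseError
   }
   // nocomm yields infile's content with comments and blank lines removed.
   nocomm := engine.Wrap(infile)

   // ... now use nocomm instead of infile ...
   return nil
}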

A utility like this will never be as fast as custom filter code, but the convenience makes it the right choice in a number of scenarios.

Speed

The new version is only marginally slower than the original on the tests I ran. Processing a huge hexdump took about 30 seconds with both programs, and the difference in processing time was less than 1 second. That’s acceptable to me for the benefit I’m getting from the better interface.