Archive for February 2009

New Learnings

Over the course of the interviews I have gone through in recent weeks, a significant portion of my time was spent refreshing my knowledge of basic theoretical computer science. The reason was fairly simple: I wanted to be able to answer any of the basic knowledge questions more or less instantly. I could have independently come up with various graph traversal algorithms given a bit of time, as I did have them stored deep in my grey matter. That said, it seemed better to freshen up on the details so those particular nodes in my brain had a bit more weight to them.

So there was lots of refreshing of information, which took up a notable amount of my brain time and meant less time learning wholly new things. This is okay, but one of the best things about software development is the constant learning. So when a company mentioned that they were using Hadoop and HBase, this was instantly exciting to me, as it gave me a perfectly good reason to go and research some new technologies.

If you haven’t heard of these two projects (and unless you are specifically working with them, you probably haven’t), Hadoop and HBase are free software implementations of two systems designed by Google, MapReduce and BigTable respectively. MapReduce is a framework developed by Google to facilitate processing and working with large datasets across a distributed network. BigTable is essentially a way to store structured data across a distributed network, though it is important to note that this data is structured in terms of nested hashes, not in a traditional relational manner.
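To make the "nested hashes" idea a little more concrete, here is a rough sketch in plain Ruby. It is only a toy, in-memory picture of a row key pointing to columns pointing to timestamped values; it is not how BigTable or HBase actually store anything, and the row keys, column names and the `latest` helper are all made up for illustration.

```ruby
# Toy picture of the data model: row key -> column name -> timestamp -> value.
# Every key and value below is invented for the example.
table = {
  "com.example.www" => {
    "contents:html"        => { 3 => "<html>v1</html>", 5 => "<html>v2</html>" },
    "anchor:other.example" => { 9 => "Example" }
  }
}

# Hypothetical helper: return the most recent version of a cell, or nil
# if the row or column does not exist.
def latest(table, row, column)
  versions = table.fetch(row, {}).fetch(column, {})
  return nil if versions.empty?
  versions[versions.keys.max]
end

puts latest(table, "com.example.www", "contents:html")
```

Note that there is no fixed schema here: a row only carries the columns it actually has, which is quite different from a relational table.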

Google had a few motivations for building something like this. They regularly worked with gigantic datasets: their search index itself, search logs and Google Maps tilesets, to name a few examples. Analyzing these datasets took massive CPU resources, and a distributed approach was more or less deemed the only practical way to actually compute solutions. MapReduce takes the classic divide and conquer approach to solving a problem. The problem space is split up and doled out to dozens, hundreds or thousands of computers. This is the map phase. The results from each piece of the calculation are then gathered and reduced down into some sort of final output. This is the reduce phase.
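The flow described above can be sketched in a few lines of Ruby. This is a single-process toy using word counting as the example problem, not Hadoop's API or Google's implementation; the method names `map_chunk`, `reduce_counts` and `word_count` are my own. In a real cluster each chunk would be handed to a different machine and the framework would do the grouping.

```ruby
# Map step: turn one chunk of text into [word, 1] pairs.
def map_chunk(chunk)
  chunk.downcase.scan(/[a-z']+/).map { |word| [word, 1] }
end

# Reduce step: combine every value that was emitted for one key.
def reduce_counts(_word, counts)
  counts.inject(0) { |sum, n| sum + n }
end

# Glue: run the map step on every chunk, group the pairs by key,
# then run the reduce step once per key.
def word_count(chunks)
  pairs   = chunks.flat_map { |chunk| map_chunk(chunk) }
  grouped = pairs.group_by { |word, _| word }
  grouped.each_with_object({}) do |(word, word_pairs), totals|
    totals[word] = reduce_counts(word, word_pairs.map { |_, n| n })
  end
end

chunks = ["to be or not to be", "that is the question"]
p word_count(chunks)
```

The point of the framework is that only `map_chunk` and `reduce_counts` are specific to the problem; everything in `word_count` is the sort of plumbing that gets handled for you.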

This approach has been used for solving all sorts of distributed problems, but MapReduce was unique in that it was a framework, usable for a variety of tasks. The hard parts of managing a distributed network for a single calculation are often just that, the management tasks. Deciding which servers get which chunks of data depending on where they are in the network. Deciding if a server has crashed, or if it’s just slow. How and what, if any, data do you duplicate across multiple nodes to guarantee that you get an answer. These are all concerns that must be addressed when creating a distributed application.

What Google did with MapReduce was abstract away the management tasks so their developers could focus on actually writing the algorithms to solve the problems. Hadoop is a Java version of MapReduce. But I digress.

What was really nice was that, in the midst of all this review and stress (looking for meaningful employment is not without some concerns), there was an opportunity to read a couple of new papers and learn some new things. So I promptly went and wrote a distributed program to calculate how many lines each character in the collected works of Shakespeare spoke. This was not something that required much computer horsepower, but it was pretty cool to run the thing on two computers and get chunks distributed to each of them.

So that went over well. I am going to post this now as I have been sitting on this for a while and not finishing it up. I need to lower my standards somewhat as, to paraphrase Joel and Jeff, I only have one reader anyway.

Book Review 2.0 – The Ruby Way

I took this book out from the library and was instantly enamoured with it. So much so that it went on my wish list before I had returned it. As with Rails Recipes, this book is primarily for those who have a decent grasp of the Ruby programming language, but do not yet know all of the tricks. If you have memorized the language API, this probably is not the book for you.

As my Ruby experiences are measured in months rather than years at this point in time, more information is always better than less. And I have always been a sucker for a nice heavy book to keep beside the keyboard. 


Basically, this book is a list of common and not so common tasks, with code snippets to show you how to handle them in Ruby. Another way to look at it is that this book is the logical reverse of the language API. Rather than look up a method and find out what it does, you look up a task and it tells you the method. This can be very, very handy.

Ruby is a concise language with many excellent one liners. In the core classes, however, there are often a large number of methods. String is one good example: there are something like 100 methods in the class and I have, more than once, accidentally re-implemented one of them while trying to solve a problem. This is mainly due to a simple lack of experience with the language. Now that I have a copy of this book, if I get the feeling that maybe Ruby has something built in to solve problem A, I can look up problem A and find out if there is a one liner rather than writing some custom method.
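As a small illustration of the kind of accidental re-implementation I mean, here is a sketch of my own (not an example taken from the book). The `count_char` method is hypothetical; the one liners after it are all standard String methods.

```ruby
# Hand-rolled: count how often one character appears in a string.
def count_char(text, char)
  total = 0
  text.each_char { |c| total += 1 if c == char }
  total
end

puts count_char("mississippi", "s")       # prints 4

# The built-in one liner does the same job.
puts "mississippi".count("s")             # prints 4

# A few other String methods that are easy to rewrite by accident.
puts "hello world".capitalize             # prints Hello world
puts "too    many   spaces".squeeze(" ")  # prints too many spaces
puts "title".center(11, "*")              # prints ***title***
puts "hello".tr("el", "ip")               # prints hippo
```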

Even in my relatively small projects, this book has already saved me a handful of hours. Currently I am reading through the book, random chapters at a time. The main goal I have now is to build an internal catalogue of problems that this book solves so I can remember to reference it if those problems crop up. Once I have used a method a couple of times, it becomes part of the repertoire and I won’t need to look it up again.

So, in a nutshell, if you need a good book to learn to make your Ruby more concise and make better use of the language, go with this one.