New Learnings

During the interviews I have gone through in recent weeks, a significant portion of my time was spent refreshing my knowledge of basic theoretical computer science. The reason was fairly simple: I wanted to be able to answer any of the basic knowledge questions more or less instantly. Given a bit of time, I could have independently come up with the various graph traversal algorithms, as I did have them stored deep in my grey matter. That said, it seemed better to freshen up on the details so those particular nodes in my brain had a bit more weight to them.

So there was lots of refreshing of information, which was taking up my brain time and leaving less time for learning wholly new things. This is okay, but one of the best things about software development is the constant learning. So when a company mentioned that they were using Hadoop and HBase, I was instantly excited, as it gave me a perfectly good reason to go research some new technologies.

If you haven't heard of these two projects (and unless you are specifically working with them, you probably haven't), Hadoop and HBase are free software implementations of two systems designed by Google: MapReduce and BigTable, respectively. MapReduce is a framework developed by Google to facilitate processing and working with large datasets across a distributed network. BigTable is essentially a way to store structured data across a distributed network, though it is important to note that the data is structured in terms of nested hashes, not in a traditional relational manner.

Google had a few motivations for building something like this. They regularly worked with gigantic datasets: the search index itself, search logs and Google Maps tilesets, to name a few. Analyzing these datasets took massive CPU resources, and a distributed approach was more or less deemed the only practical way to actually compute solutions. MapReduce takes the classic divide and conquer approach to solving a problem. The problem space is split up and doled out to dozens, hundreds or thousands of computers; this is the map phase. The results from each piece of the calculation are then returned to the master computer, which reduces them down into some sort of final output.
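To make the two phases concrete, here is a minimal, single-process sketch of the map/reduce shape using a word count, the customary toy example. This is just an illustration of the idea, not how Hadoop's actual API looks; in a real system the map and reduce phases run on many machines rather than in one script.

```python
# A single-process sketch of the map/reduce idea. In a real MapReduce
# system the chunks would be mapped on separate machines and the results
# shipped back for the reduce step; here everything runs locally.
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in a chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# The "master" splits the input into chunks, hands each chunk to a mapper,
# then reduces all the emitted pairs into the final answer.
chunks = ["to be or not to be", "that is the question"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(mapped)
print(counts["to"])  # 2
print(counts["be"])  # 2
```

The point of the framework is that you write only the two small functions; deciding where the chunks go and what happens when a machine dies is someone else's problem.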

This approach has been used for solving all sorts of distributed problems, but MapReduce was unique in that it was a framework, usable for a variety of tasks. The hard parts of managing a distributed network for a single calculation are often just that: the management tasks. Deciding which servers get which chunks of data depending on where they are in the network. Deciding whether a server has crashed or is just slow. Deciding how much data, if any, to duplicate across multiple nodes to guarantee that you get an answer. These are all concerns that must be addressed when creating a distributed application.

What Google did with MapReduce was abstract away the management tasks so their developers could focus on actually writing the algorithms to solve the problems. Hadoop is a Java implementation of MapReduce. But I digress.

What was really nice was that in the midst of this review and stress (looking for meaningful employment is not without its concerns), I had an opportunity to read a couple of new papers and learn some new things. So I promptly went and wrote a distributed program to calculate how many lines each character in the collected works of Shakespeare spoke. It was not something that required much computing horsepower, but it was pretty cool to run the thing on two computers and watch chunks get distributed to each of them.
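The lines-per-character problem fits the map/reduce mold neatly. The following is a toy sketch of that idea, not my actual program: it assumes a simplified input format where every spoken line is already prefixed with the character's name and a colon, which real play texts would need some preprocessing to reach.

```python
# Toy lines-per-character count in the map/reduce style. Assumes each
# input line looks like "NAME: speech text" -- a simplification; real
# Shakespeare texts need preprocessing to get into this shape.
from collections import Counter

def map_line(line):
    """Map: turn one 'NAME: speech' line into a (name, 1) pair."""
    name, _, _ = line.partition(":")
    return (name.strip(), 1)

def reduce_counts(pairs):
    """Reduce: sum the per-line counts for each character."""
    totals = Counter()
    for name, n in pairs:
        totals[name] += n
    return totals

script = [
    "HAMLET: To be, or not to be, that is the question:",
    "HAMLET: Whether 'tis nobler in the mind to suffer",
    "OPHELIA: Good my lord,",
]
counts = reduce_counts(map_line(line) for line in script)
print(counts["HAMLET"])   # 2
print(counts["OPHELIA"])  # 1
```

In the distributed version, each machine maps its own chunk of the text and only the small (name, count) pairs travel back over the network for the reduce.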

So that went over well. I am going to post this now, as I have been sitting on it for a while without finishing it up. I need to lower my standards somewhat since, to paraphrase Joel and Jeff, I only have one reader anyway.