New Learnings
During the course of the interviews I have gone through in recent weeks, a
significant portion of my time was spent refreshing my knowledge of basic
theoretical computer science. The reason was fairly simple: I wanted to be
able to answer any of the basic knowledge questions more or less instantly.
I could have independently come up with the various graph traversal
algorithms given a bit of time, as I did have them stored deep in my grey
matter. That said, it seemed better to freshen up on the details so those
particular nodes in my brain had a bit more weight to them.
So there was a lot of refreshing of information, which was taking up my
brain time and leaving less time for learning wholly new things. This is
okay, but one of the best things about software development is the constant
learning. So when a company mentioned that they were using Hadoop and HBase,
I was instantly excited, as it gave me a perfectly good reason to go and
research some new technologies.
If you haven't heard of these two projects (and unless you are specifically
working with them, you probably haven't), Hadoop and HBase are free software
implementations of two systems designed by Google: MapReduce and BigTable,
respectively. MapReduce is a framework developed by Google to facilitate
processing large datasets across a distributed network. BigTable is
essentially a way to store structured data across a distributed network,
though it is important to note that the data is structured in terms of
nested hashes, not in a traditional relational manner.
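To make the nested hashes idea concrete: you can picture a BigTable (or HBase) row as a map of maps, with a row key pointing at column families, which point at columns, which point at timestamped values. Here is a toy sketch of that layout in plain Java; the web page row is borrowed from the BigTable paper's example, and none of this is the actual HBase API.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    // A toy model of the BigTable/HBase data layout:
    // row key -> column family -> column qualifier -> (timestamp -> value)
    public class NestedHashTable {
        private final Map<String, Map<String, Map<String, TreeMap<Long, String>>>> rows =
                new HashMap<>();

        public void put(String rowKey, String family, String qualifier, String value) {
            rows.computeIfAbsent(rowKey, k -> new HashMap<>())
                .computeIfAbsent(family, k -> new HashMap<>())
                .computeIfAbsent(qualifier, k -> new TreeMap<>())
                .put(System.currentTimeMillis(), value);
        }

        public String getLatest(String rowKey, String family, String qualifier) {
            TreeMap<Long, String> versions = rows
                    .getOrDefault(rowKey, Map.of())
                    .getOrDefault(family, Map.of())
                    .getOrDefault(qualifier, new TreeMap<>());
            return versions.isEmpty() ? null : versions.lastEntry().getValue();
        }

        public static void main(String[] args) {
            NestedHashTable table = new NestedHashTable();
            // A row keyed by reversed URL, as in the BigTable paper's web table example.
            table.put("com.example.www", "contents", "html", "<html>...</html>");
            table.put("com.example.www", "anchor", "cnn.com", "Example Site");
            System.out.println(table.getLatest("com.example.www", "contents", "html"));
        }
    }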
Google had a few motivations for building something like this. They
regularly worked with gigantic datasets: the search index itself, search
logs, and Google Maps tilesets, to name a few. Analyzing these datasets took
massive CPU resources, and a distributed approach was more or less deemed
the only practical way to actually compute solutions. MapReduce takes the
classic divide and conquer approach to solving a problem. The problem space
is split up and doled out to dozens, hundreds, or thousands of computers.
This is the map phase. The results from each piece of the calculation are
then returned to the master computer, which reduces them down into some sort
of final output.
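In miniature, and leaving out everything that makes it genuinely hard (the network, the failures, the scheduling), the shape of a MapReduce job looks something like this single machine sketch, where the workers are just threads and the "master" is the main method:

    import java.util.*;
    import java.util.concurrent.*;

    public class MiniMapReduce {
        public static void main(String[] args) throws Exception {
            List<String> documents = List.of(
                    "to be or not to be",
                    "brevity is the soul of wit",
                    "the rest is silence");

            ExecutorService workers = Executors.newFixedThreadPool(4);

            // Map phase: each chunk of the input is handed to a worker,
            // which emits partial word counts for its chunk.
            List<Future<Map<String, Integer>>> partials = new ArrayList<>();
            for (String doc : documents) {
                partials.add(workers.submit(() -> {
                    Map<String, Integer> counts = new HashMap<>();
                    for (String word : doc.split("\\s+")) {
                        counts.merge(word, 1, Integer::sum);
                    }
                    return counts;
                }));
            }

            // Reduce phase: the master merges the partial results into a final answer.
            Map<String, Integer> totals = new HashMap<>();
            for (Future<Map<String, Integer>> partial : partials) {
                partial.get().forEach((word, count) -> totals.merge(word, count, Integer::sum));
            }
            workers.shutdown();

            System.out.println(totals);
        }
    }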
This approach has been used for solving all sorts of distributed problems,
but MapReduce was unique in that it was a framework, usable for a variety of
tasks. The hard parts of managing a distributed network for a single
calculation are often just that, the management tasks. Which servers get
which chunks of data, depending on where they sit in the network? Has a
server crashed, or is it just slow? What data, if any, do you duplicate
across multiple nodes to guarantee that you get an answer? These are all
concerns that must be addressed when creating a distributed application.
What Google did with MapReduce was abstract away the management tasks so
their developers could focus on actually writing the algorithms that solve
the problems. Hadoop is an open source Java implementation of MapReduce. But
I digress.
What was really nice was that in the midst of all this review and stress
(looking for meaningful employment is not without its worries), I had an
opportunity to read a couple of new papers and learn some new things. So I
promptly went and wrote a distributed program to calculate how many lines
each character in the collected works of Shakespeare spoke. This was not
something that required much computing horsepower, but it was pretty cool to
run the thing on two computers and see chunks distributed to each of them.
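The Hadoop version of that program is basically the canonical word count example with a different parsing rule. Here is a rough sketch of the mapper and reducer, assuming purely for illustration that each line of dialogue starts with the speaker's name in capitals followed by a period, as in "HAMLET. To be, or not to be":

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SpeakerLineCount {

        // Map: emit (speaker, 1) for every line of dialogue.
        public static class SpeakerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text speaker = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                // Assumed format: "HAMLET. To be, or not to be" -- the speaker's
                // name in capitals, then a period, then the spoken line.
                String text = line.toString().trim();
                int dot = text.indexOf('.');
                if (dot > 0) {
                    String name = text.substring(0, dot);
                    // Crude check that the prefix looks like a speaker name in capitals.
                    if (name.equals(name.toUpperCase())) {
                        speaker.set(name);
                        context.write(speaker, ONE);
                    }
                }
            }
        }

        // Reduce: sum the 1s for each speaker to get a total line count.
        public static class SpeakerReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text speaker, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int total = 0;
                for (IntWritable count : counts) {
                    total += count.get();
                }
                context.write(speaker, new IntWritable(total));
            }
        }
    }

Hadoop takes care of splitting the input, shipping the pieces out to whichever machines are in the cluster, and shuffling the (speaker, count) pairs to the reducers; all you write are those two methods, which is exactly the point.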
So that went over well. I am going to post this now, as I have been sitting
on it for a while without finishing it up. I need to lower my standards
somewhat since, to paraphrase Joel and Jeff, I only have one reader anyway.