Posts tagged ‘Code’

Number crunching and buses!

A few days ago, Translink announced that they would be releasing their bus, train and seabus route information in a standard format. A list of every bus stop, route, time, etc. might not seem overly exciting to most people, but I love datasets. Admittedly, I often don’t know exactly what to do with datasets, but that’s hardly the real issue here. Anyhow, this seemed like a promising one to play with, so I downloaded it, unzipped it and spent a couple of hours prepping a Rails project to serve as a new home for it.

The dataset holds roughly 500 routes, 8700 stops, 126000 trips and 3.4 million timepoints at those stops. Not a whopping amount of data, but enough to start having some fun. My initial plan was just to be able to plot the stops for a given route onto Google Maps. That’s done in all its ugly glory at my stopfinder. If you want to search for a 1 or 2 digit route, put in the leading zeros. Sorry, I haven’t handled that yet.

My next steps are going to be to publish a number of primitive operations on the data with results in JSON format. Things like ‘closest stop to lat,lon’, ‘how to get from stop x to stop y’, and other similar sorts of things. The idea is that if I can build up a suitable library of common operations on the dataset, any future ideas that do come to mind should be relatively easy to implement.
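As a rough idea of what one of those primitives could look like, here is a minimal sketch of ‘closest stop to lat,lon’ in Python. Everything in it is an assumption for illustration: the stop records are made up rather than taken from the real dataset, the field names (stop_id, name, lat, lon) are hypothetical, and it does a plain linear scan with the haversine formula instead of anything clever.

```python
import json
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in metres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def closest_stop(stops, lat, lon):
    """Return the stop nearest to (lat, lon) as a JSON string."""
    best = min(stops, key=lambda s: haversine_m(lat, lon, s["lat"], s["lon"]))
    distance = haversine_m(lat, lon, best["lat"], best["lon"])
    return json.dumps(dict(best, distance_m=round(distance, 1)))


# Made-up sample stops with hypothetical field names.
SAMPLE_STOPS = [
    {"stop_id": "A", "name": "Hypothetical Stop A", "lat": 49.2827, "lon": -123.1207},
    {"stop_id": "B", "name": "Hypothetical Stop B", "lat": 49.2606, "lon": -123.2460},
    {"stop_id": "C", "name": "Hypothetical Stop C", "lat": 49.2250, "lon": -123.0030},
]

if __name__ == "__main__":
    print(closest_stop(SAMPLE_STOPS, 49.2800, -123.1200))
```

A linear scan over 8700 stops is cheap enough for a first pass. A spatial index would only start to matter if the service had to answer many of these queries at once.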

That and if anyone does want to do some data mining, well, this is an option. I’ll post any updates, formats and that sort of thing on this site as I work through it. In general, the services will be simply URL-based and will return raw JSON. Nothing special, but fairly easy to parse and work with. I have a relatively irrational dislike of XML which I will probably get over at some point, but it will take someone making a very good argument.

Quality is Job One

Uh, yeah.

So it is, but actually stating that, or anything along those lines? Way to kill the team, boss! (See Peopleware)

That said, quality assurance, quality control, QA, call it what you want, but it’s one of the more misunderstood aspects of software development. Oh sure, everyone knows that they need to do more QA or better QA, but lip service is about all that is ever paid towards it. I am notably not including in my ‘everyone’ those who feel that QA can be completely automated. You guys are wrong and I’m going to leave it at that. You also may think you don’t need to do it; see this article for some classic arguments against that fallacy.

I’m not going to go into depth about QA, how to do it, best practices or anything along those lines as I’m fairly unqualified. That said, I’m not really qualified to talk about anything, but that doesn’t really stop me.

QA is a process, not a task

This particular fail case is something I’ve seen in multiple organizations now. The most obvious symptom of this is when management has decreed that there is a block of a few hours set aside to ‘do QA’ on an application with a few hundred known use cases. Another obvious indicator is when other employees are volunteered to do a few hours of QA on top of their normal job. Think you’re going to get good results from that?

The root cause of this failure is simply not understanding how QA works, so let’s walk through it a bit. In a very broad sense, the general list of tasks for QA is something like this:

1. Go through the basic cases

2. Go through the corner cases

3. Go through obscure, known failure cases

4. Exploratory testing

5. Automating 1, 2 and 3.

So, how does this fit into a day of work? Let’s find out:

First off, we’re going to go through the basic use cases for the application. Then, there is a pile of corner cases that are pretty valid that need to be checked out. Then it’s time to check all the really obscure, but horribly embarrassing failures that have been seen before. From there we can finally…What? You changed the code? Okay, first off, we’re going to go through the basic use cases for the application…

Interruption here! “Silly tester,” says the savvy developer, “You only need to re-test the parts of the system that were changed.” Nice theory, but wrong in many, many ways. Simply put, if this were the case, testing by anyone other than the developers would never be needed. That generally goes well.

Back to the task at hand, do the basic cases, do the corner caWHAT? Changed again? Basic cases…

The real job of QA starts at step 4, which we haven’t even seen yet. Exploratory testing is finding the embarrassing defects before they get out into the wild. A good tester at this phase is going to break your application in ways you haven’t even dreamed of. In ways that only 0.1% of your users would ever try to do. Of course, if 0.1% of your users do it, and you get 10k uniques per day? That’s 10 people per day that are going to hit this embarrassing bug that how could you possibly let into the wild and I’m taking my business elsewhere right now as I obviously cannot trust you with my data. And if one of those has a blog? Heh. Have fun with that.

So the epic fail with having 16 hours scheduled in to test your quarter-million-line application? If you’ve got bug fixing going on at the same time, any of your competent testers will never get past step 1. Any testers that listen to the savvy developer, or worse, are the savvy developer, will miss basic cases and you deploy with fundamental breaks.

The purpose of QA is not to have someone say, “Wonderful developer, your application is perfect!” If I hear that from a tester, I assume the person isn’t doing their job very well. QA should hurt your feelings. Assumptions you made should be laid bare and justified or thrown out if incorrect. This is often the last line of defense before your customers see your application. Take it seriously.

New Learnings

During the course of interviews I have gone through in recent weeks, a significant portion of my time was spent refreshing my knowledge of basic theoretical computer science type things. The reason was fairly simple: I wanted to be able to answer any of the basic knowledge questions more or less instantly. I could have independently come up with various graph traversal algorithms given a bit of time as I did have them stored deep in my grey matter. That said, it seemed better to freshen up on the details so those particular nodes in my brain had a bit more weight to them.

So there was lots of refreshing of information, which was notably taking up my brain time, which meant less time learning wholly new things. This is okay, but one of the best things about software development is the constant learning. So when a company mentioned that they were using Hadoop and HBase, this was instantly exciting to me as it gave me a perfectly good reason to go and research some new technologies.

If you haven’t heard of these two projects, and unless you are specifically working with them, you probably haven’t, Hadoop and HBase are free software implementations of two systems designed by Google, MapReduce and BigTable respectively. MapReduce is a framework developed by Google to facilitate processing and working with large datasets across a distributed network. BigTable is essentially a way to store structured data across a distributed network, though it is important to note that this is structured in terms of nested hashes, not in a traditional relational manner.
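To make ‘nested hashes’ a bit more concrete, here is a toy in-memory sketch in Python. It is not the HBase or BigTable API, and the class and method names are made up for illustration. It only assumes the nesting of row, then column family, then column, then timestamp, with each cell keeping multiple versions of its value.

```python
from collections import defaultdict


class NestedHashTable:
    """Toy model: row -> column family -> column -> timestamp -> value."""

    def __init__(self):
        # Three levels of auto-created dicts, with a plain dict of
        # timestamped versions at the bottom.
        self._rows = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

    def put(self, row, family, column, value, timestamp):
        """Store one version of a cell."""
        self._rows[row][family][column][timestamp] = value

    def get(self, row, family, column):
        """Return the most recent value for a cell, or None if it is absent."""
        versions = self._rows.get(row, {}).get(family, {}).get(column, {})
        if not versions:
            return None
        return versions[max(versions)]


if __name__ == "__main__":
    table = NestedHashTable()
    # Hypothetical row key and column names, purely for illustration.
    table.put("com.example.www", "contents", "html", "<html>v1</html>", timestamp=1)
    table.put("com.example.www", "contents", "html", "<html>v2</html>", timestamp=2)
    table.put("com.example.www", "anchor", "example.org", "Example", timestamp=1)
    print(table.get("com.example.www", "contents", "html"))
```

There are no joins and no fixed schema beyond the nesting itself: a row only has the columns that were actually written to it, which is the main way this differs from a relational table.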

Google had a few motivations for building something like this. They regularly worked with gigantic datasets: the search index itself, search logs and Google Maps tilesets, to name a few examples. Analyzing these datasets took massive CPU resources and a distributed approach was more or less deemed the only practical way to actually compute solutions. MapReduce takes the classic divide and conquer approach to solving a problem. The problem space is split up and doled out to dozens, hundreds or thousands of computers. This is the map phase. Results from each piece of the calculation are then returned to the master computer, which then reduces the results down into some sort of final output.

This approach has been used for solving all sorts of distributed problems, but MapReduce was unique in that it was a framework, usable for a variety of tasks. The hard parts of managing a distributed network for a single calculation are often just that, the management tasks. Deciding which servers get which chunks of data depending on where they are in the network. Deciding if a server has crashed, or if it’s just slow. What data, if any, do you duplicate across multiple nodes, and how, to guarantee that you get an answer? These are all concerns that must be addressed when creating a distributed application.

What Google did with MapReduce was abstract away the management tasks so their developers could focus on actually writing the algorithms to solve the problems. Hadoop is a Java version of MapReduce. But I digress.

What was really nice was that in the midst of this review and stress (looking for meaningful employment is not without some concerns), there was an opportunity to read a couple of new papers and learn some new things. So I promptly went and wrote a distributed program to calculate how many lines each character in the collected works of Shakespeare spoke. This was not something that required much computer horsepower, but it was pretty cool to run the thing on two computers and get chunks distributed to each of them.
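For a feel of the shape of that calculation, here is a single-process sketch in Python. It is not the Hadoop program and does not use the Hadoop API. It assumes, purely for illustration, that the text has already been flattened into lines of the form ‘SPEAKER: speech’, and the chunks stand in for the pieces that would be handed out to separate machines.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Map step: count spoken lines per character within one chunk."""
    counts = Counter()
    for line in lines:
        speaker, sep, speech = line.partition(":")
        # Skip lines with no speaker prefix or no speech after it.
        if sep and speech.strip():
            counts[speaker.strip().upper()] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial tallies into one."""
    return left + right


def split_into_chunks(lines, n_chunks):
    """Split the input into at most n_chunks roughly equal pieces."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def count_lines_per_character(lines, n_chunks=2):
    """Run the map step on each chunk, then reduce the partial results."""
    chunks = split_into_chunks(lines, n_chunks)
    partials = map(map_chunk, chunks)
    return reduce(reduce_counts, partials, Counter())


if __name__ == "__main__":
    # Placeholder lines in the assumed format, not real quotations.
    sample = [
        "HAMLET: first placeholder line",
        "OPHELIA: second placeholder line",
        "HAMLET: third placeholder line",
        "HORATIO: fourth placeholder line",
    ]
    for speaker, total in sorted(count_lines_per_character(sample).items()):
        print(speaker, total)
```

Because each chunk is counted independently and the partial tallies merge by simple addition, the number of chunks does not change the final answer. That is the property that lets a framework spread the map step over as many machines as it likes.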

So that went over well. I am going to post this now as I have been sitting on this for a while and not finishing it up. I need to lower my standards somewhat as, to paraphrase Joel and Jeff, I only have one reader anyway.