Tuesday, January 24, 2012

The sleep-deprived ramblings of a developer in transition...

Late-night software deployments are no fun. No fun whatsoever, particularly when you feel compelled (out of a little niggling paranoia that you’ve done something just a hair’s-breadth away from perfect) to stay up to watch the automated jobs pick up and start running the new code.

I’m going to be very happy when this project can be put to bed, so that I can get back to doing things that I’m a little more emotionally invested in. I have a great opportunity to do some systems architecture coming up; we’re rewriting my core projects in Java and JSF, which is allowing for some much-needed central refactoring, so that things can be decoupled and streamlined. I might not have the technical knowledge that the other developers on the project have (I’m terrifically out-of-date on Java), but I do have solid domain knowledge, so I’ll be able to establish the game plan—write up most of the user stories and make sure that that’s all coherent, make sure that all the systems that need to be redesigned get done correctly, &c, &c.

Of course, the worst part of staying up late to work on code is when you forget that there’s a time zone difference between you and the server it’s running on… and that when it runs at “2:00 am”, it’s actually referring to Mountain Standard Time. Bah!
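A quick way to check that kind of schedule before staying up for it, as a minimal sketch. It assumes the server runs on Mountain Standard Time (UTC-7) and that the person watching is on Eastern Standard Time (UTC-5); the Eastern offset and the function name are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for January, when neither zone observes daylight saving time.
MST = timezone(timedelta(hours=-7), "MST")  # server time, per the job schedule
EST = timezone(timedelta(hours=-5), "EST")  # assumed local zone of the developer


def local_run_time(server_time: datetime, local_zone: timezone) -> datetime:
    """Convert a timezone-aware server timestamp into the local zone."""
    return server_time.astimezone(local_zone)


# The job is scheduled for 2:00 am on the server.
scheduled = datetime(2012, 1, 24, 2, 0, tzinfo=MST)
print(local_run_time(scheduled, EST).strftime("%H:%M %Z"))  # prints "04:00 EST"
```

Under those assumptions, a job scheduled for “2:00 am” on the server does not start until 4:00 am local time.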

Time for bed.

Monday, January 2, 2012

In which documentation leads to rearchitecture

All year—well, mainly just the second half of 2011—I’ve been intending to write some XML Schema documents to formalise the input and output of an XML API I wrote a year and change ago. Is it really Important to do this? Yes and no. Mostly no (which explains why it hasn’t been done yet), since there are only three applications that use the API so far and I wrote all of them (and the PHP and JS libraries that handle it). From the functional perspective, it’s not critical that these schemata get written. However, I’m also trying to finish off some formal documentation of the tool that the API serves, which includes these schemata. Why would it need to include the schemata? Hopefully, we’re going to get some more developers working on this project, so having a formal document that describes the input and output would be good; it would give these developers something to refer to, it could be used for validation, and it would improve the testability of the whole system.

I’ve been trying to finally finish these schemata, and I’ve found two things about how I implemented the API:

  • The output has a small logic flaw, I think—I use a <message/> element both inside and outside the element that indicates what objects have been affected, depending (primarily) on the context of the action being performed.
  • In order to be able to validate against these schemata, I’d have to make some obnoxiously redundant changes to the libraries that generate the input. The XML is somewhat polymorphic—depending on the value of the command attribute on an <action/> element, the legal child elements change. I’d love to be able to handle that without having to use the xsi:type attribute to indicate that the delete action is an instance of type ActionDelete, but it’s becoming apparent that XML Schema validators just don’t work that way.
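When the schema can’t express a rule like that, one fallback is to check it in code after parsing. The following is only a sketch of the idea: the command names and child element names are hypothetical, since the real API’s vocabulary isn’t shown here, and `validate_action` is an invented helper rather than part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: which child elements each command allows.
# The real API's commands and element names are not shown in this post.
ALLOWED_CHILDREN = {
    "create": {"name", "value"},
    "update": {"id", "value"},
    "delete": {"id"},
}


def validate_action(xml_text: str) -> list[str]:
    """Return a list of problems found in a single <action/> element."""
    action = ET.fromstring(xml_text)
    if action.tag != "action":
        return [f"expected <action/>, got <{action.tag}/>"]
    command = action.get("command")
    if command not in ALLOWED_CHILDREN:
        return [f"unknown command: {command!r}"]
    allowed = ALLOWED_CHILDREN[command]
    # Report every child element that this command does not permit.
    return [
        f"<{child.tag}/> is not legal for command {command!r}"
        for child in action
        if child.tag not in allowed
    ]


print(validate_action('<action command="delete"><id>42</id></action>'))
print(validate_action('<action command="delete"><value>x</value></action>'))
```

The first call reports no problems; the second flags the `<value/>` child as illegal for a delete. The trade-off is that the rule now lives in code rather than in the schema, which is exactly the gap being described.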

So, the decision to make the XML polymorphic based on an <action/> element may have been shortsighted. Probably was, in fact. The question is, though, do I rewrite the API entirely to allow this business logic to be expressed in the schema, or do I write a more generalised pair of schemata now, then clean up the API in a v2.0 so that it can be more rigorously and specifically validated against the business logic? The output definitely needs to be revisited (especially since it's currently sending both XML and JSON, depending on the action), but what to do about the input? Probably simpler to do the minimal amount of redesign now, then when the applications get refactored later this year, look into the more comprehensive enhancements.

We shall see, we shall see…

Monday, June 27, 2011

On Coe’s First Law of Software Development

Last week, I discussed my Second Law of Software Development (self-documenting code isn’t), in reference to why proper, discrete documentation is a Good Thing. I didn’t get into what I think is the best time to write documentation (before you write code), because that’s a whole other rant in and of itself, but I did briefly mention my First Law of Software Development:

When it happens, you’ll know.

I’m not ashamed to admit that I’ve borrowed the phrasing from The Simpsons, but it’s a really good line. The First Law originally applied to baking in security when you’re developing SaaS, but other things keep coming up, wherein if I’d kept in mind that when it happens, you’ll know, I could have avoided a whole lot of hassle.

Simply put, the First Law is all about trying to see things coming, and being prepared for them. I could have borrowed from Scouts and gone with be prepared, but it doesn’t quite appeal to my sense of humour. The fact is, something will eventually go wrong, and when it finally happens, you’ll know. And when you look at the block of code that’s to blame, you’ll ask yourself why you didn’t code defensively for it in the first place.

It originally came to me when I was writing a CRM tool for the company I worked for in 2007. Inspired by a software engineering professor at my university, I wanted to code against bad input. At first, it was about malicious input, but as time has gone on, it really is about just generally bad input. The original motivation was about accepting the fact that at some point in time, someone, somewhere, is going to discover and exploit a weakness in your software. You don’t want to assume that all of your users will be nefarious little pissants, but in the interest of your good users, you ought to assume that your average long-term number of less-than-trustworthy users is nonzero.

So, eventually, someone will try to misuse your software. But does it stop there? The correct answer is no, no it doesn’t. While you’re validating your input against inappropriate behaviour, you can just as easily, if not more easily, validate for correct behaviour—that your users haven’t accidentally done something wrong. Type checking falls under this umbrella, and it’s useful both in the functions that are retrieving user input and the functions that are processing it. This is particularly important in weakly typed languages, because you can’t reliably just cast your input into a variable of type x (particularly in JavaScript, where concatenation and mathematical addition use the same operator). When users provide input that isn’t what it should be (but they still have honourable intentions), then you have a problem (maybe it’s in your documentation… but that’s another post). Maybe the input is well-formed, but has unexpected side-effects. When it happens, you’ll know.
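As a rough illustration of validating for correct behaviour as well as against misuse, here is a minimal sketch in Python, where `+` also concatenates strings. The function name `parse_quantity` and the 0 to 1000 range are made up for the example.

```python
def parse_quantity(raw: object, minimum: int = 0, maximum: int = 1000) -> int:
    """Turn raw user input into an int, or raise ValueError with a reason."""
    if isinstance(raw, bool):
        # bool is a subclass of int in Python, so reject it explicitly.
        raise ValueError("expected a number, got a boolean")
    if isinstance(raw, int):
        value = raw
    elif isinstance(raw, str):
        text = raw.strip()
        # Accept only an optional sign followed by ASCII digits.
        digits = text[1:] if text.startswith(("+", "-")) else text
        if not (digits.isascii() and digits.isdigit()):
            raise ValueError(f"not a whole number: {raw!r}")
        value = int(text)
    else:
        raise ValueError(f"unsupported input type: {type(raw).__name__}")
    # Well-formed input can still be out of bounds for the application.
    if not minimum <= value <= maximum:
        raise ValueError(f"{value} is outside {minimum}..{maximum}")
    return value


# Unvalidated form input: the "+" operator concatenates the two strings.
print("1" + "2")
# Validated input: the same operator now adds two integers.
print(parse_quantity("1") + parse_quantity("2"))
```

The first line prints `12` and the second prints `3`. Honest mistakes such as `"4.5"` or `"ten"` raise a `ValueError` with a message that can be shown back to the user.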

Now that you’re validating your user input for validity and intent, are you done? Probably not. In this day and age, software doesn’t exist in a vacuum (apart from the little noddy programs you write to prove that you can handle the concept you were just taught). There are external subsystems that you rely on. In an ideal world, you’ll get perfect data from them, but this isn’t an ideal world. Databases get overloaded and refuse connections. Servers get restarted and services don’t always come up correctly, if at all. Connections time out, or you forget whether or not this request has to go through a proxy. When it happens, you’ll know, because all of a sudden, your software breaks. Hard. You need to figure out what subsystem failed, and more importantly, why, so that you can prevent it from happening that way in the future.

However, that isn’t enough. You should have been ready for that failure. You can’t assume that all the other subsystems will be there 100% of the time. Assume that your caching layer will disappear at an inconvenient moment. Know that your database won’t always give you a result set. If you have to call out to a separately managed web service, do not rely on it being there, or having the same API forever. Code defensively for the fact that eventually, something will go wrong, and you won’t be watching when it happens.
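A minimal sketch of what that can look like, assuming a cache in front of a database. Everything named here (`CacheUnavailable`, `fetch_profile`, and the `cache_get` and `db_get` callables) is hypothetical and stands in for whatever clients a real system uses.

```python
import logging

logger = logging.getLogger("defensive")


class CacheUnavailable(Exception):
    """Raised by the hypothetical cache client when the cache is down."""


def fetch_profile(user_id, cache_get, db_get):
    """Look up a profile, surviving a missing cache and an empty result set.

    cache_get and db_get are hypothetical callables standing in for real
    clients. Returns None when no profile exists.
    """
    try:
        cached = cache_get(user_id)
        if cached is not None:
            return cached
    except CacheUnavailable:
        # Nobody is watching when this happens, so leave a trail in the logs.
        logger.warning("cache unavailable; falling back to database")
    row = db_get(user_id)
    if row is None:
        # The database will not always return a result set.
        logger.info("no profile found for user %r", user_id)
    return row


def broken_cache(user_id):
    """Simulate the caching layer disappearing at an inconvenient moment."""
    raise CacheUnavailable()


database = {1: {"id": 1, "name": "Alice"}}
print(fetch_profile(1, broken_cache, database.get))
print(fetch_profile(2, broken_cache, database.get))
```

With the cache down, the first lookup still returns the stored profile and the second returns `None` instead of blowing up; the log entries are what tell you afterwards which subsystem failed.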

So there’s a very good reason why the First Law of Software Development is when it happens, you’ll know. Eventually, “it” will happen, and when you figure out what “it” was, and where it caused you problems, it’ll seem so obvious that there was a point of failure, or a weakness, that you’ll ask yourself why this problem wasn’t coded against in the first place.

Wednesday, June 15, 2011

Coe's Second Law Of Software Development

Before I begin, let me just put it out there that I really like the Zend Framework. I know, I know, laying down your cards about your favourite editor/framework/OS/whatever is liable to set off holy wars, but I really like ZF. It’s clean, and I like the mix-and-match properties of it that allow me to use only as much of the framework as I need. It’s this very property that I’ve used to great advantage in Project Alchemy and Portico. But I’ve always had one complaint about it.

The documentation in the reference guide, and to a roughly equal extent, in the PHPDoc navigator, is really flaky. The full functionality isn’t properly described in the reference guide, and the PHPDoc doesn’t provide enough information about the API. I’ve found, on several occasions, that I have to dig into the code simply in order to figure out how to use some methods.

Anthony Wlodarski, of PHPDeveloper.org, sees this as a positive of Zend Framework; that when ZF community wonks tell you to RTFS, it really is for your own good; that ZF really is that self-documenting. He says,

One thing I learned early on with ZF was that the curators and associates in the ZF ecosystem always fall back to the root of “read the code/api/documentation”. With good reason too! It is not the volunteers simply shrugging you off but it is for your own good

Unfortunately, it’s been my experience that self-documenting code isn’t. Let’s call that “Coe’s Second Law Of Software Development” (the First being when it happens, you’ll know). This is how strongly I feel about the issue. Far and away, the code itself is always the last place you want to look to figure out how it works, and only ever if you have a fair amount of time on your hands, because deciphering another developer’s idiosyncrasies is harder than writing new code. And if you have to look through that code to figure out what the correct usage of the tool is, then someone isn’t doing their job properly, particularly when Zend Framework has the backing of Zend.

I’ve been working for $EMPLOYER$ for more than a year now, and I’ve worked with a number of our internal tools, and across the board, I keep getting bit by the fact that our self-documenting code isn’t. Self-documenting code means that there’s no easily-accessible, central repository of information about how the tools are supposed to work, or about what inputs to provide and outputs to expect. Self-documenting code means that when something goes wrong, and the person who originally wrote the code isn’t around, the poor sap who has to correct the problem now has to figure out what the code is supposed to do. Self-documenting code means that when your prototypes (or even production tools!) fail without a clear error code, you have to either start shotgun debugging, trying to figure out what you’re doing wrong (or what changed); or you have to ask the project manager, who will ask a developer to dig through the code. This increases the turnaround time on all problems.

Self-documenting code is fine when you’re writing little noddy programs for first-, second-, and even some third-year classes, where the functionality is defined in the assignment description, and the problems straightforward enough that what you’re doing is actually clear at a quick glance. When you’re writing tools for distribution to parties outside of the small group of people who are developing it, you owe it to yourself, to your QA team, to your present and future colleagues, and more importantly to your users, to write good, clear documentation, and to do so ahead of time, because then you only have to update the documentation to reflect what changed about your plan.

But when someone tells you excitedly that their code is self-documenting, remember: self-documenting code isn’t.

Monday, October 18, 2010

In which there is a first time for everything

So, this Thursday is the Toronto Facebook Developer Garage… This is actually going to be the first industry event I’ll have ever attended, and hopefully not the last. Should be interesting—I’m going with the development team from work (all three of us), and I know that a few of the guys from my first Facebook-related job will be there (naturally; that company is one of the hosts). I haven’t so much as touched the Facebook API since I left that job, so it’s certainly going to be interesting to see what cool stuff people are doing with the platform, and how it’s changed in the last two years.

I’m also, admittedly, really curious to see just what goes on at one of these events. Apparently it grew from a bunch of grassroots, community-organised gatherings of just developers who wanted to show off some cool stuff, but this one’s got a proper agenda, with keynote speeches and whatnot. I know that the Garage in NYC involved getting some swag (saw pictures, said former employer co-hosted that one too), so obviously it’s much more… regimented than just an informal gathering.

Should be fun, though. The biggest issue of getting deeper integration of Facebook into what my company does is the monetisation issue, and unless I’m much mistaken, you can’t post your own ads inside of a Facebook app canvas (probably because it takes eyeballs away from the ads on the side of the page), so the issue becomes, how would we get Facebook users to buy the paid parts of the service? Or at least how do we get Facebook users to come to our site (because we’re already one of the most popular sites in Canada; can’t go wrong with getting some more eyes), so that they’ll…

  1. see the ads that our advertisers are paying for, and
  2. continue through the regular flow and maybe buy the paid parts of the service?

Therein lies the challenge… and maybe there will be answers at this Garage. At the very least, there should be ideas on what can be done with the API that’s been introduced since 2008!

Friday, July 23, 2010

In which work ethics are considered

Over the past couple of weeks, I’ve had two very clear indicators—to myself—of just how much I’m enjoying my current job. I say that because there’s a subtle, but important, difference between saying “I love my job because I love doing x” and observing your own behaviours and thought patterns, and noticing that the way that you do your job, and approach your job, demonstrates how much you enjoy it.

The first way is something I realised about two weeks ago, when I met with the general manager (my boss’s boss, and an all-around good guy) for a quick catch-up chat. I’ve been working at this job for eleven weeks, and it feels like I’ve been there a lot longer, and more importantly, the environment and the nature of the work are just naturally enjoyable, and it’s something I’m keenly interested in, so I like going to work on general principle. But when I was talking to him, it dawned on me that I like my work so much that, for the first time in years (at least two, if not four), I’ve been able to get so wrapped up in my work that I lose track of the time. Almost every other job I’ve had, with possibly one exception, since I moved to Toronto, I’ve tended to kind of check out around 4:30 or 4:45; I’d start trying to find something to do that would be productive work, but wouldn’t take too long to do, because I wanted to get out the door. Where I am now, as often as not, it’ll be almost 5:30 when I look at the clock and realise that I should probably go home.

The second, even clearer indication came this past Friday. I’ve been working on coming up with a way to better integrate two of the projects I’ve been working on, and particularly a way to do it across a subdomain divide. The products need to communicate on both the server and client sides; I managed to hack it on the pure client side by spawning some IFrames, but getting a particular server-side action in Project A to trigger an action in Project B has been a little less clear-cut. So, I decided that a simple XML interface into Project B was needed, with a wrapper for Project A (and for Project C, which another developer is working on). So, I spent Thursday and Friday afternoons working on this API.

Two very cool things came from this.

  1. Apart from a few minor syntax errors (missing a parenthesis, putting a colon where a semicolon should’ve been, &c.), the stripped down API worked perfectly right off the bat. A few hundred lines of code, and it Just Worked. I haven’t been that successful in a while.
  2. I ran the first test at about 4:40. I had other things that needed to be done that evening which necessitated my leaving as close to 5:00 as possible, but I had a thought I haven’t had in a long time: I wish I didn’t have to leave right away. I wanted to take my work home with me.

This hasn’t happened in ages, and I didn’t realise how much I missed that feeling until this past Friday. I’ve been telling people how much I like my job based on its perks: catered lunches on Fridays, stocked kitchen, and an amazing sense of community with my co-workers. But being able to say, “there are days that I don’t want to stop working”… that might be the surest sign that you’ve got a great job.

It’s kind of funny, because I normally try to fight against that really, typically Protestant work ethic of, when you boil it down, “living to work”. I can leave my work at the office; most days, it’s a case of losing track of the time because I’m so wrapped up in what I’m doing, so when I realise what time it is, I clean up what I was doing, get it to a state where I can leave, and I go home. And I think that’s what this is an extension of—I got so wrapped up in what I was doing that, had I not had other things to do, I almost certainly would have stuck around, ignoring the clock.

I think that might be the difference. Most days, I don’t care what time it is; it’s irrelevant to me how many hours I spend at the office, as long as I get done what I want to get done. When I compare that against the Toronto workaholics who not only live to work, but take it as a point of some perverse kind of pride that they work sixty- or eighty-hour weeks, I can see much better what the difference is. I’m doing what I love, and I take pride in the result of my work, whereas some people take pride in the amount of work that they do.

Thursday, June 3, 2010

That thing you use? I made that.

Monday night (technically Tuesday morning, but who’s counting? Other than Blogger, that is) I mentioned that being able to say to somebody, “you know that thing that you use? I made that” is a great feeling. I just ran into a former colleague from my previous contract, who let me know just on what scale my stuff is operating.

One of the projects I worked on—made, really; the requirements were small enough—was a carbon and cost savings calculator for Sears Canada’s website, so that people looking to replace one of their appliances could see about how much money they’d save by switching. Fairly simple to do; the worst of it was extracting the formulas from the Big Ugly Interactive Spreadsheet that Sears provided. I worked hard to provide the best little jQuery applet I could, complete with pretty transitions and everything. Fully translated into French, too, and I made sure it’d work acceptably well in IE6. It was kind of a focus of their recent/current Green promotion, but how many people, really, were going to wind up using it?

As it turns out, a lot. In the linked article, you can see that last Friday, they opened a six-kiosk booth in the Vancouver Robson store that runs that “little jQuery applet” in a touchscreen interface! I’m… speechless. To the best of my knowledge, nothing I’ve ever made has seen such a wide userbase. They’re adding more booths to more stores, too. I kind of hope that one will show up in Toronto so I can play with it and show people.

Tuesday, June 1, 2010

In which a scale is found tipping

There’s a certain aphorism that I’ve been thinking about lately: “there are twenty-four useful hours in every day.” I don’t remember where I heard it first—as I recall, the version I heard growing up was “you’ve got the same amount of time in the day as the rest of us”—but I’ve realised two things about that first saying:

  1. There are decidedly not twenty-four useful hours in every day. Depending on your particular sleep needs, there are between fifteen and, say, eighteen waking hours in every day. Then when you factor in time spent in transit between work/school and home, and mealtimes, along with basic hygiene needs, your time-per-day number drops considerably. I’d estimate around twelve. Your mileage may vary, depending on, well, the mileage between home and what you do to make a living.
  2. I need more. I think this is why I’m reminded of my father’s different, truer version of the saying.

There are people in this industry, in this city, who love the freelancing life; people who love seeking out the Next Big Contract, and who get off on working late into the night to make a higher paycheque than the next guy. I’m not one of those people, as I’ve been discovering this year.

Don’t get me wrong. I like networking, and I like making things happen, and I like being able to say, “you know that thing that you use? I made that.” It’s a great feeling. But for the past few months, I’ve really felt like I’ve had at least twenty-four hours of work to do, every single day. So I perpetually feel like I’m behind. And that’s a pretty crappy feeling.

I read a book a few years back called The Hacker Ethic. It discusses the Protestant work ethic, which I think is a huge influence, in this city, on why people will willingly put in ten- or twelve-hour working days, five or six days a week. I don’t get that. I’ve been doing that for months, and it’s awful. All you’re thinking about is what you have to do. Your main focus is making more money. But why? So you can buy more things?

Seriously, everybody should read that book. I got into computers professionally because I love using them, and bending them to my will. But I also love my wife, and it’s important to find that work/life balance that keeps you sane and healthy.

This is becoming a bit of a rant, so I think I’d best cut it off. Too much work to do, anyway.

Friday, March 12, 2010

In which an error is realised

I’ve been giving some deeper thought to the multiple-hash table I described in my last post, and I’ve figured out where the problem is. All the usual actions can be performed in constant time for any specific table, as usual… the issue is what kind of time they require. For n-dimensional tables, the number of operations for find increases linearly with n, but the number of operations required for add and delete increases factorially. For n=2, add requires three actions, but for n=3, add takes seven. n=4 causes add to take twenty-five, unless I’ve mucked up my math.
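The three figures quoted above (three, seven, and twenty-five) are all consistent with n! + 1 operations per add. That formula is inferred from the numbers rather than stated anywhere in the post, so treat this as a sanity check on the arithmetic and a rough look at how fast the cost would grow if the pattern holds; `add_operations` is a name invented for the example.

```python
from math import factorial


def add_operations(dimensions: int) -> int:
    """Operations per add, assuming the count follows n! + 1 (an inference)."""
    return factorial(dimensions) + 1


# Dimensions 2 through 4 reproduce the figures quoted in the post.
for n in range(2, 7):
    print(n, add_operations(n))
```

If the pattern holds, n=5 would already need 121 operations per add and n=6 would need 721.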

This is, of course, bad. Unless you can magically prepopulate the hashtable, and you never have to modify it… and that’s exceedingly unlikely. I think I need to find a better way of storing the indices. I think the problem that’s causing the factorial growth is the fact that I was originally thinking of storing the sub-hashtables (say, where n=2, the tables one dimension deep) as independent of each other, and relying on encapsulation. Didn’t I specifically mention that the annoying part about classic multidimensional arrays is that they’re based on encapsulation?!

Back to the drawing board.

Sunday, March 7, 2010

In which an algorithm is partially puzzled out

This past week, I was working on a small project at work that seemed, at first glance, to need a unique type of multidimensional array—I needed to be able to store a collection of data that had a composite key, where the values of each component of the key were bounded sets. I realised after an hour or so of trying to figure it out that I was overthinking the problem (a regular multidimensional array would suffice, because as the programmer I can fully control which key component would be the primary axis of the array), but it left me with an interesting idea in my head that, after a brief bit of research, I've realised nobody's really approached... and while I thought of the idea at work, I also tossed it out as being unnecessary, so I don't think they can really lay claim to the intellectual property, because I'm not doing any development of the idea on their dime! Only part of what I'm going to get into was something done at the office, so the vast bulk of this is my own work, for my own amusement.

That idea is, in essence, an n-dimensional multi-hash table (which I'll just refer to as an MHT, for brevity's sake). The problem that I thought I had was that I would need to allow the user to specify either of the components, then update the available choices from the other component appropriately, or retrieve a subset of the data that had that key (like I said, it turned out to not be the case), but I've realised since then, what if, someday, I really need to do that, and I want to be able to do it quickly? My approach to the problem follows, but this is fairly off-the-cuff, and hasn't been implemented. I'm really just trying to think this out, and I want to make sure that if this turns out to be Important, that there's a published record of when I came up with it.

Let S be a set of data. Each item s in S contains (for the purpose of this example) a binary composite key, made of values j and k, where j and k are members of sets J and K. s has a unique key value n that may or may not be exposed to the user, but is available to the software. Instead, the user wishes to access s by specifying j and k, but may need to obtain a list of all items in S with a specific value of either j or k. Since S may be of arbitrary size, maintaining a low order for the algorithm is important. Hash tables are particularly well-suited to this, as they can typically maintain O(1) performance for most access functions.

In order to accomplish this, multiple hash tables must be maintained (it's actually quite similar to maintaining hash indices in a DBMS; perhaps I should look into the specific algorithms used there). A hash table of all J values must be maintained, and each item in the J hash table must point to a hash table with K values as its keys. These relationships must in turn be transposed, so that the user can access all the s in S with a particular k.

The issue, of course, becomes one of implementation. We also want to maintain a low storage requirement, so each item in the J-to-K hash table should simply be a pointer into S; this is why I mentioned that each s in S has a unique key that is programmatically available, but not necessarily exposed to the user.

So, what's the best approach to doing this? It seems to me that n+1 hash tables are required, where n is the number of dimensions that the user needs to use to access a given s. A master S hash table will exist with all the items. Then each dimension will have its own hash table. For dimension J, each record would be a key-value pair with the key j, and the value being a hash table where the keys are values from K, and the values are pointers into S. Dimension K would exist similarly. When adding an item x to S, after calculating its unique key value, the add function would need to add x_k to J (under x_j) and x_j to K (under x_k). The reverse would be necessary for deletions. So, for a 2-dimensional MHT, it seems that adds and deletes could be performed in 3 O(1) steps. It's not perfect, but it's better than O(n). Fortunately, the data storage requirements for the two-dimensional case are 3n (so still O(n))—each record would exist one time in each of the S, J, and K hash tables.
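Since none of this has been implemented, the following is only a sketch of the two-dimensional case as described above, using plain Python dictionaries as the hash tables. The class and method names are invented for the example, and the internal unique key is simply a counter.

```python
from itertools import count


class TwoDimensionalMHT:
    """Sketch of a two-dimensional multi-hash table (names are illustrative)."""

    def __init__(self):
        self._next_id = count(1)  # source of the internal unique keys
        self._items = {}          # master table S: id -> (j, k, value)
        self._by_j = {}           # J index: j -> {k: id}
        self._by_k = {}           # K index: k -> {j: id}

    def add(self, j, k, value):
        """Three hash-table writes: one each to S, J, and K."""
        if k in self._by_j.get(j, {}):
            raise KeyError(f"composite key {(j, k)!r} already exists")
        item_id = next(self._next_id)
        self._items[item_id] = (j, k, value)
        self._by_j.setdefault(j, {})[k] = item_id
        self._by_k.setdefault(k, {})[j] = item_id
        return item_id

    def find(self, j, k):
        """Look up one item by its full composite key."""
        return self._items[self._by_j[j][k]][2]

    def delete(self, j, k):
        """The reverse of add: remove the pointers from J and K, then from S."""
        item_id = self._by_j[j].pop(k)
        self._by_k[k].pop(j)
        del self._items[item_id]
        # Drop sub-tables that have become empty.
        if not self._by_j[j]:
            del self._by_j[j]
        if not self._by_k[k]:
            del self._by_k[k]

    def all_with_j(self, j):
        """Every value whose key has this j component."""
        return [self._items[i][2] for i in self._by_j.get(j, {}).values()]

    def all_with_k(self, k):
        """Every value whose key has this k component."""
        return [self._items[i][2] for i in self._by_k.get(k, {}).values()]


table = TwoDimensionalMHT()
table.add("red", "small", "A")
table.add("red", "large", "B")
table.add("blue", "small", "C")
print(table.find("red", "large"))
print(table.all_with_k("small"))
```

The J and K indices hold only the internal ids, so each record's data lives once in the master table, and lookups by either component cost one hash access plus a walk over the matching entries.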

The difficulty that I'm realising as I write this could very easily come up is this: what do you do when you need to create an n-dimensional multihash table where n>2? The mechanism I've described above works fine where each j in J refers to a simple hash table, but what happens when you have keys J, K, and L? What would each j in J refer to? A multidimensional array? Then updating J may necessitate updating multiple tables with the same data (which is something I'd like to avoid). I think the answer lies in what each j in J points to, but I haven't worked out how that's going to work, and I tend to do my best figuring out of this type of thing when I can see the problem—and being 12:30 in the morning, at home, I don't exactly have access to a gigantic whiteboard on which to work, or the opportunity to try hacking on the problem to see what works best.

But there's my approach to a 2-dimensional hash table. This is definitely something I want to try implementing, and that I want to continue working with in a bid to make an n-dimensional, any-addressable MHT. I don't know if it will earn me any kind of accolades, but I think it'll be a fun problem to solve... and considering the fact that I can't find any evidence of anyone else trying to approach it this way on The Google, I'm starting to think that it might even be patentable... so I think this would qualify as prior art!

Now I'm even more excited.