Quality over Features: 8 Excuses You Don’t Have To Make

Key to being able to execute Continuous Delivery (CD) is keeping your software deployable at all times. Most CD books talk heavily about automation, but that is only part of what you need to achieve it. CD is often built on Agile and Lean foundations, and key to both of those is building quality into your process at all stages. Agile methodologies like XP have many technical practices to help developers do this, and “building quality in” is one of the key principles of Lean Software Development.

In one organisation I worked in, the long release cycle meant people let quality drop for a while and then spent a significant part of the cycle re-stabilising. Rough estimates put this at around 20-40% of engineering time spent fixing bugs on old branches. Clearly that time could be better spent elsewhere.

A team that wants to do CD must value being able to release software at any time. To achieve this you must keep quality high, which means valuing quality over features.

Over the years, I’ve seen many excuses for writing crappy code, and I’ll address them. 

FAQs: Frequently made excuses

1. But we don’t have time!

Time for what? The only people who can give estimates are the developers actually working on the code. If someone is trying to beat you into compromising quality then you have bigger problems than CD.

2. But doing it right will take too long!

Shortcuts make long delays — JRR Tolkien 

Similar to the above, but developers are terrible at taking shortcuts: they’re optimistic about the time saved and pessimistic about the value of doing it right. If, during a sprint, the initial investment is 20% more for the right choice, then it shouldn’t even be debated – more than likely it will pay back that investment in less than a day.

3. But I’m building an MVP or prototype!

Never compromise the quality of your code. An MVP is a minimal set of features; a prototype is about proving a concept or a technical challenge. Fundamentally, both are about making sure the scope of your work is correct and small.

4. But this feature is too big to implement without bugs! We’ve over-committed this sprint!

It’s not always easy, but sometimes the plans have to change. We learn more as we implement features, so surprises are inevitable. Try breaking the feature down into smaller parts, or use feature toggles to deploy something incomplete but safe.
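
A feature toggle can be as simple as a conditional around the new code path. Here is a minimal, hypothetical C# sketch – the IFeatureToggles interface, the CheckoutService class and the “new-payment-flow” toggle name are all made up for illustration:

public interface IFeatureToggles
{
    // Hypothetical toggle source; could be backed by a config file, database or environment variable.
    bool IsEnabled(string featureName);
}

public class CheckoutService
{
    private readonly IFeatureToggles toggles;

    public CheckoutService(IFeatureToggles toggles)
    {
        this.toggles = toggles;
    }

    public void Checkout()
    {
        if (toggles.IsEnabled("new-payment-flow"))
        {
            // The new, incomplete work ships dark behind the toggle...
            NewPaymentFlow();
        }
        else
        {
            // ...while the existing, known-good path stays live.
            ExistingPaymentFlow();
        }
    }

    private void NewPaymentFlow() { /* under construction */ }
    private void ExistingPaymentFlow() { /* current behaviour */ }
}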

5. But I don’t understand this code!

Take the time you need; you shouldn’t need permission to code things right. If you don’t understand the code you’re using, then how do you know it is right?

6. But I’ll refactor it later!

No: refactor mercilessly (see the XP practice of the same name). Refactoring is something you do all the time. If you aren’t spending hours each day refactoring then your code will suffer. A feature isn’t finished if there are obvious refactorings left to do: just do them.

That said, do not over-engineer your code; it must be fit for purpose.

7. But I don’t have a spec!

Agile doesn’t get rid of specs; it changes them into user stories. If you don’t have something that resembles a spec, then how do you even know what you’re doing?

8. But it will take only 5 minutes to explain why this code is weird!

Ideally, code should be self-explanatory and unsurprising. Weird code needs explaining to someone else, and if you have to interrupt someone for an explanation then you’re breaking their flow for a time far greater than the help they give you. Don’t underestimate the cumulative effect of all these “5 minutes”. These interruptions are an impediment to scaling a team.

Keeping code quality high should be a top priority; it will give you agility and in turn raise the quality of the product. I have deliberately not said anything about what quality is. There are lots of practices that can help raise quality, but if you aren’t trying then quality isn’t going to get better on its own!

A Programme Manager in an Agile Organisation

In a previous job, a Programme Manager came to me for guidance about what he should be doing. Since then, I’ve gained more insight into his skills and where I could have made better use of him, so I thought I’d write up some notes.

A Programme Manager should be a communication hub. Ideally, if a stakeholder wants to find out about something happening inside the programme, they should be able to come to him for that information.

To that end, I think there are some artifacts he should maintain. When I say artifacts, I mean these things should result in a web page or document that anyone can access.

Roadmap

Ensures the roadmap is well formed and refined. He is not responsible for populating the roadmap – that belongs to the product team – but he must ensure that it is flowing correctly.

Facilitates the flow of work to the relevant teams; ensures that things are progressively refined.

Project Lifecycle Status

Knows the status of any piece of “work”, or at least who to go to to find out. This status could sometimes be “not started” or “maintenance”.

RAG project status

Maintains and publishes a status of any current projects in progress. That said, Agile organisations should be thinking more in terms of products than projects so this might translate better into KPIs for particular initiatives.

Maturity matrices

Facilitates the development of Maturity Matrices and the assessment of each team’s current level.

Other stuff

  • Champion the process: should be involved in and help facilitate the day-to-day Agile processes.
  • Manage Risks: within the traditional scope of a programme manager. Do we need a risk register? Probably not…
  • Dependencies: Agile processes seek to minimize dependencies, but when they do exist having someone to traverse teams is useful.
  • Security: works best as a team sport and therefore should be implemented as a process.

Updated Joel Test

Back when I started my career, Joel Spolsky published his list of 12 Steps to Better Code. It’s incredible how well it’s stood the test of time. Some of the items were written from the point of view of FogBugz, a bug database.

I’m not the first person to do this, but I’ve updated the list with a more modern twist – my version probably won’t last as long as Joel’s, but I think the changes make it more representative of modern best practice. I’ve not changed numbers 5, 11, or 12; I’m pretty confident these still make sense, although I’ve never done hallway usability testing – maybe I should give it a try!

Original test:

  1. Do you use source control?
  2. Can you make a build in one step?
  3. Do you make daily builds?
  4. Do you have a bug database?
  5. Do you fix bugs before writing new code?
  6. Do you have an up-to-date schedule?
  7. Do you have a spec?
  8. Do programmers have quiet working conditions?
  9. Do you use the best tools money can buy?
  10. Do you have testers?
  11. Do new candidates write code during their interview?
  12. Do you do hallway usability testing?

Updated list:

  1. Do you use source control that supports lightweight branching? 
  2. Can you checkout, build and run your software in one step? 
  3. Do you build (and unit test) every pushed commit? 
  4. Do you have a backlog and roadmap?  
  5. Do you fix bugs before writing new code?
  6. Do you deliver progress every sprint? 
  7. Is your spec validated automatically? 
  8. Can developers work from anywhere? 
  9. Can developers pick the best tool for the job? 
  10. Is testing everyone’s responsibility? 
  11. Do new candidates write code during their interview?
  12. Do you do usability testing?

What do you guys think, agree or disagree? How well do you score on this – old and new? What would your updates be?

Is Agile For You?

This is a post I wrote for the now-defunct pebblecode.com, aimed at more conservative enterprises.

If you have engaged a development consultant in recent years, you have probably heard them wax lyrical about, “Waterfall is dead, we’re Agile” – but what does this actually mean for your product delivery? What is the business case for engaging with Agile?

Neither Agile nor Waterfall is actually a software development method. Waterfall is the collective name for more traditional project management methods that have separate, sequential stages running from planning, design, implementation and testing through to maintenance. Agile methods are incremental and adaptive methods for delivering software; their values and principles can be found in the Agile Manifesto.

Across the software industry there is no common definition of project success. Is it on time, on budget, to specification or fit for purpose? By any of these measures, both traditional methods and Agile can be used to deliver software products successfully; however, research has found that Agile methods are a more effective way to deliver software.

Agile methods have many benefits but are not a panacea for software delivery. There are no guarantees, however, we believe these methods allow us to deliver the best quality and offer value for money for our customers.

Here at pebble {code} we use Agile methods in the development of all our projects. We have distilled the core practices from methodologies like Scrum and XP to create a lightweight Agile framework that gives us the flexibility to apply them to the kinds of projects we run with the minimum of ceremony.

Will the spec change over the life of the project?

One of the pathologies of traditional project management is scope creep. In traditional methods everyone agrees to the contract up front, and spec changes are managed through some change control process. Each change will cost you: it means creating a change request, getting budget approval, and going back and forth over a purchase order, and so on.

Traditional methods work hard to keep things on schedule by controlling scope. Unfortunately, this means that at the end of the project, after a large investment, you risk getting a product that is not fit for purpose and losing your competitive advantage.

How certain are you about your assumptions?

Assumptions add risk to your project because they can turn out to be false. In fixed-cost pricing, you have to plan up front, and no doubt those costs are passed on to you. Worse is when a false assumption is simply missed and derails your project.

Traditional project management methods invest a huge amount of effort in documenting, tracking and managing assumptions and risks (if they do anything at all).

Are there any other uncertainties?

…there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know. – Donald Rumsfeld, 2002

There are many unpredictable things that could cause delays to your project. Developers get sick, get poached by rivals or go AWOL. Technologies have security flaws discovered. Suppliers go bust.

We can never identify everything that might go wrong – there are almost infinite possibilities – so you have to mitigate this risk by building in a significant contingency. These risks might never happen; forcing someone into a fixed-price contract will inevitably lead to them either adding this risk to the cost or failing to deliver on commitments when something goes wrong.

Mitigating Risks and Increasing Agility

Agile methods embrace change; changes are easily incorporated into the process of prioritization and planning that takes place at the start of each sprint. It is a much more efficient use of your budget. Change is managed, not prevented; you become adaptable.

Typically teams take work from a Product Backlog, which is ordered by priority. The Product Backlog contains things like User Stories and bugs. Product Backlog items are deliberately left as open as they can be. Detail is deferred which allows the specifics to be filled in as the product is developed and more information is gathered.

Software development is a process of continuous discovery. Agile’s iterative approach allows assumptions to be tested each sprint, for example, by getting the software in front of real users or, perhaps, doing performance testing on a piece of technology for its feasibility. If you find your assumptions are incorrect, you adapt during your prioritisation session and waste little or nothing.

Agile technical practices, like collective code ownership, reduce the risk that any one developer’s absence will cause delays or stop work. This practice also helps the architecture of the software evolve through sensible practices rather than mirroring the social structure of the team.

Commitment to decisions are deferred as long as possible. This is sometimes called the Last Responsible Moment. Making decisions at the Last Responsible Moment is a risk avoidance strategy; decisions made too early in a project are hugely risky. The earlier you make a decision, the less information you have. In the absence of information, work might have to be re-done or thrown away. Another way to think of the last responsible moment is before your options expire.

When building software products, it is near impossible to get everything right first time. Agile methods allow you to evolve quickly, respond, and adapt. Through frequent testing and stakeholder (customer) engagement we can be sure we get it right as quickly as possible.

Agile methods can respond quickly and effectively to the complexity and uncertainty that characterise today’s business needs.

Thoughts on Code Reviews

Code reviews are a widely accepted practice in both enterprise and open source. In the words of Coding Horror: Just Do it. As we use Github and Visual Studio Online for VCS, this is built into our tooling with Pull Requests. However, sometimes making the most of a code review can be hard and feel like you’re just going through the motions. I’ve had developers struggle and lose sight of why we’re doing them so I attempted to write a list of some points to think about.

Prefer pair programming

We don’t often do pair programming, but it can be a superior alternative to code review. You might also want to consider pairing through a code review: it gives you much more context, allowing more effective communication between author and reviewer. It’s also harder to ignore or miss comments when someone is sat next to you.

I don’t understand – Stupidity is a virtue

The hardest thing for a developer to admit is that they don’t understand; in some ways it is an admission of imperfection. However, the default position for a code reviewer should be that they don’t understand: maybe it hadn’t occurred to the author that the code is complex, or they may even have made a mistake. I find that being walked through the story of a change by its author is much better for understanding their intent than trying to work it out from the raw text alone.

Code reviews for knowledge sharing

Code reviews do offer some value in catching bugs early, but can we also use them as a tool for knowledge sharing? Think of the reviewer as the person who will next work on the code: what do you need to tell them to hand it over? In the more Agile teams that I’ve worked in, that knowledge transfer was far more likely to happen.

Code reviews as an opportunity to ask for help

Some of the best code reviews that I’ve had are when I could ask questions of the reviewer about approaches that they prefer. Maybe this is just pair programming in disguise.

Code review is too late

If the author of the code has completed the task and it works to specification, but it doesn’t meet standards or is overly complex, is it really fair to ask them to re-write it all? Sometimes that is what is necessary, but at the very least it’s inefficient. Consider reviewing smaller chunks or, again, pair programming.

Big Picture

Do we excessively focus on the line-for-line changes? Recently, in a code review, one reviewer commented on the naming of a method on an interface. The other reviewer looked at the bigger picture and suggested removing the interface entirely.

Do our tools encourage micro-review?

Similarly to the above, Pull Requests encourage you to look only at the changes line-for-line. I’m convinced this makes me miss things sometimes.

Patches Welcome

Instead of writing a code review with loads of comments, why not just check out the code and actually make the changes? Sometimes this can be the most effective way.

Objective vs Subjective?

When is it OK to reject a reviewer’s comment? The assumption shouldn’t be that the reviewer is always correct or vice-versa.


Slow turnaround

Pull Requests often take a long time to be reviewed, which causes context-switching problems for the developer who committed the code. One of the reasons I prefer pairing on the review is that you can choose someone who is immediately available while the code is still in your head. Similarly, I think it’s best if you can turn reviews around quickly when they are assigned to you.

Good enough

When is the code good enough? Think continuous improvement or kaizen. When is the point of diminishing returns? You don’t have to raise everything.

Automation

I see a lot of comments on whitespace, formatting, etc. You should automate this: most IDEs have linter support these days, and you should already have a build. If the code doesn’t comply with your coding standards then the build should fail!

Sprawling coding standards

Related to the above: are your coding standards too big to enforce? Will only the rules-lawyers apply them? Avoid standards that can’t be enforced automatically.

On Proliferation of Small Libraries in .Net

I’m currently working on a Unity3d project and often find myself in need of 3rd-party DLLs. Unity supports .Net 3.5 libraries through an old version of Mono, but most libraries these days are distributed via NuGet packages, usually only for the latest version of the .Net framework, so I’ve found myself having to find or create back-ports of those libraries. Unity isn’t the only framework I’ve had this problem with, and Portable Class Libraries are an attempt to solve it (but sadly they’re not supported on Unity).

Most of the time, this is actually really simple: add a reference to System.Threading (the TPL back-port) and no other changes are required.

Recently, I’ve been focused on performance and I’ve needed to back-port Disruptor-net and Microsoft.IO.RecyclableMemoryStream. These libraries are small; I only need maybe one public class from each, yet I have to include a DLL each time. For just these two public classes I’ve needed to add three DLLs. I could ILMerge them with my assemblies, but that can cause other problems.

I’d rather get the source. Given their simplicity, I’ll probably never need to upgrade these libraries, so I’d like to be able to just grab a single source file and drop it into my project. This is a workflow that works anywhere, even where NuGet isn’t supported. I wouldn’t have had to create back-ports; the source just compiles (you could even use preprocessor macros to hide some of the compile-time differences).
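
For example, here is a hypothetical sketch (not taken from either library) of hiding a small difference between framework versions behind a conditional-compilation symbol – in this case Stopwatch.Restart, which only exists from .Net 4 onwards; the NET35 symbol would be defined in the older project’s build configuration:

public static class Clock
{
    // Restart a stopwatch in a way that compiles on both target frameworks.
    public static void Restart(System.Diagnostics.Stopwatch stopwatch)
    {
#if NET35
        // .Net 3.5 has no Stopwatch.Restart(), so do the equivalent by hand.
        stopwatch.Reset();
        stopwatch.Start();
#else
        stopwatch.Restart();
#endif
    }
}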

There are also genuine performance reasons for not having lots of libraries; most people don’t care, but the JIT can’t optimise across assembly boundaries. To be honest, this level of performance doesn’t bother me too much – but why deliberately slow your program down?

In the Javascript world there are plenty of libraries designed to be a single file, and NPM supports linking directly to git repositories.

NuGet actually supports distribution of sources: when you install such a package, it will automatically add the source files to your csproj. I’d even commit these files into version control, which saves you from having to do a package restore too.

People often want to share snippets of code. Sometimes they do this by creating a misc/util library; usually I don’t want all the baggage. I’ve written about that before here.

Next time you want to create a new library, ask yourself: what is better, one .cs file or another DLL to manage and version?

IIRC: Second Life Mono internals

This is the third part of my brain dump about my work on Second Life. One of the main projects I worked on during my time at Linden Lab was implementing user Scripts on Mono.

The LSL2 VM was Second Life’s original VM for executing user code. Internally, Scripts were actor-like: each had a queue for messages and could communicate with other Scripts via messages; however, they could also generate side effects on the world via library calls.

Each Sim was associated with a particular region, and the objects containing the Scripts could roam from one region to another. The VM provided transparent migration of Scripts as they moved between Sims.

Migration is described as transparent if the program can be written in the same way as non-mobile code: when the code is migrated, it should resume in exactly the same state as before. LSL Scripts did not require any knowledge of where they were being executed.

To achieve this, the full state of the Script is stored: the program code, heap objects, the message queue, the stack and the program counter. As I said in my previous post, the LSL2 VM achieved this by putting all of this state into a 16K block of memory. Storing everything in this block made migration simple because there was no serialisation step; when a Script was suspended, these blocks could just be shipped around and interpreted by any Sim.

Under Mono we wanted to provide the same quality of transparent migration. CLR (.Net) code is portable – write once, run anywhere – and objects can be serialised, but the stack and program counter cannot be accessed from user code.

The CLR stack is pretty conventional; it is composed of frames. A new frame is created each time a method is called and destroyed when a method exits. Each frame holds local variables and a stack for operands, arguments and return values.

To be able to migrate a Script, we need to suspend it and capture its state. Scripts are written by users and are not trusted, so they cannot be relied on to cooperatively yield. You can suspend CLR threads, but you have no way of knowing exactly what they are doing, and doing so can cause deadlocks.

The Script’s code is stored in an immutable asset, which the Sim loads by fetching it via HTTP on demand. Multiple copies of the same Script reuse the same program code.

Classes cannot be unloaded in the CLR; however, AppDomains can be destroyed. A Script was defined as live if it was inside an object rooted in the region or on an Avatar. When a new Script was instantiated, it was placed in the nursery AppDomain. If more than half the Scripts in the nursery AppDomain were dead, the live Scripts were moved to the long-lived domain. The long-lived AppDomain had a similar scheme, except that it was replaced with a new domain. Migration was used to move Scripts between AppDomains.
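
Here is a rough, hypothetical sketch of how such a scheme might look; ScriptHandle and Migrate are stand-ins for the real machinery, reconstructed from memory rather than taken from the actual code:

using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the per-Script bookkeeping; not the real Linden code.
class ScriptHandle
{
    public bool IsLive;   // live = inside an object rooted in the region or on an Avatar
}

class ScriptDomains
{
    private AppDomain nursery = AppDomain.CreateDomain("nursery");
    private readonly AppDomain longLived = AppDomain.CreateDomain("long-lived");
    private readonly List<ScriptHandle> nurseryScripts = new List<ScriptHandle>();

    public void CollectNursery()
    {
        // If more than half of the nursery's Scripts are dead, move the live ones out
        // and throw the whole domain away; unloading the AppDomain is the only way the
        // CLR will let you unload their code.
        if (nurseryScripts.Count(s => !s.IsLive) * 2 > nurseryScripts.Count)
        {
            foreach (var script in nurseryScripts.Where(s => s.IsLive))
            {
                Migrate(script, longLived);
            }
            AppDomain.Unload(nursery);
            nursery = AppDomain.CreateDomain("nursery");
            nurseryScripts.Clear();
        }
    }

    private void Migrate(ScriptHandle script, AppDomain destination)
    {
        // In the real system this reused the same suspend/serialise/restore machinery
        // that was used when a Script crossed between regions.
    }
}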

The Script’s heap state is either referred to by a frame or referenced from the Script object itself. The queue of messages is just an object on the Script base class. These objects are serialised and then restored at the destination.

The stack serialisation was achieved by rewriting the program’s byte code to insert blocks that capture and restore the current thread’s state. Doing it at the code level meant that we did not have to modify Mono, and the code could also be ported to a standard Windows .Net VM.

The assembly rewriter was implemented using RAIL, which is similar to Mono Cecil (which was released part way through the project).

Each method on the Script object is modified. At the start of each method, a block of code is injected to do the restore: it gets the saved frame, restores the local variables and the state of the stack, and then jumps to the part of the method that was previously executing. At the end of the method, code is inserted to do the save, which populates a stack frame (a sketch of what that frame might hold follows the example below). At various points within the method, yield points are injected where the program can potentially be suspended.

Here is a pseudo-code example of the kind of thing the assembly re-writing did.
Original:

void method(int a)
{
    var result = 2;
    while (result < a)
    {
       result <<= 1;
    } 
}

Re-written:

void method(int a)
{
    Frame frame = null;
    if (IsRestoring)
    {
        frame = PopFrame();
        switch (frame.pc)
        {
        case 0:
            // instructions to restore locals and stack frame
            goto PC0;
        case 1:
            // instructions to restore locals and stack frame
            goto PC1;
            // ...
        }
    }
    // snip

PC0:
    var result = 2;
    while (result < a)
    {
        result <<= 1;
        // A backwards jump is one of the places to insert a yield point
        if (IsYieldDue)
        {
            frame = new Frame();
            frame.pc = 1;
            frame.locals = ...
            frame.stack = ...
            goto Yield;
        }
    PC1:
    }

Yield:
    if (IsSaving)
    {
        PushFrame(frame);
    }
    // return normally
}
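
For reference, the saved frame populated at each yield point could be as simple as the following hypothetical sketch; the real structure also had to be serialisable so that it could migrate with the Script:

// Hypothetical shape of the saved frame used in the pseudo-code above.
[System.Serializable]
class Frame
{
    public int pc;            // which yield point the method was suspended at
    public object[] locals;   // the captured local variables
    public object[] stack;    // the captured operand stack
}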

Despite all the overhead injected into the methods, Mono still performed several orders of magnitude faster than the original VM.

We would have liked to allow other languages, like C#; however, supporting all of the instructions in CIL was a challenge – there are hundreds. We controlled the LSL compiler and knew it only used a subset of the CIL instruction set.

The techniques I’ve described for code re-writing were influenced by some Java implementations of mobile agents. See Brakes and JavaGoX (I’m unable to find the original material, but I think this is the original paper)

For those who are really keen, you can also find the actual test suite I wrote for the LSL language here: LSL Language Tests (I was Scouse Linden, for those who look at the page history). We had to break it into two parts because of the arbitrary 16K limit on LSL program size; the basic tests were too big! If you look at the tests, you can see some of the quirks of the LSL language, from the assignment of variables in while loops to the rendering of negative-zero floating point numbers.

Also, here is the source to the LSL compiler, but sadly it looks like they spelled my name wrong when they migrated to Git.

I’m not intending to write any more on the topic of Scripting in Second Life unless there is any specific interest in other areas. I hope you’ve found it interesting or useful.

IIRC: User Scripting in Second Life

I’ve had some positive feedback on the last post I wrote, so I thought I’d write up some information about how user scripting functioned in Second Life at the time I worked on it. This post covers the background of scripting, the LSL language and a bit about how the legacy LSL2 VM worked.

One of the unique features of Second Life is the way users can create content. Using the Editor in the Viewer (the name for the client), they can place and combine basic shapes (known as Prims) to make more complex objects. Prims can be nested inside another prim or joined together with static or physical joints. They can also be attached to the user’s Avatar (guess what the most popular “attachment” was?).

To give an object behaviour, Prims can have any number of scripts placed inside them. A script is effectively an independent program that executes concurrently with all the other scripts. There is a large library of hundreds of functions they can call to interact with the world, other objects and avatars. A script can do a wide variety of things: for example, it can react to being clicked, give users items or even send an email.

Users write scripts in LSL, a custom language which has first-class concepts of states and event handlers. A script can consist of variables, function definitions and one or more named states. A state is a collection of event handlers; only one state can be active at any one time.

The Second Life viewer contained a compiler which converted scripts into program bytecode for its bespoke instruction set and layout.

default
{
  state_entry()
  {
    llSay(0, "Hello, Avatar!");
  }

  touch_start(integer total_number)
  {
    llSay(0, "Touched.");
  }
}

The listing above shows the default script, created when you add a new script to an object. When created or reset it will print “Hello, Avatar!” to the chat in the region; when clicked it will print “Touched.”.

LSL has a limited set of data types: integer, float, string, list (a heterogeneous list), vector (a 3d floating-point vector) and key (a UUID).

The approach of treating each script as an independent program led to many regions containing thousands of individual scripts. To give the effect of concurrency (in the single-threaded Simulator), each script would get a timeslice of the main loop in which to execute a number of instructions before control was passed to the next.
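
The effect is roughly that of a cooperative round-robin scheduler. Here is a minimal sketch of the idea – not the real Simulator code; IScript is a hypothetical stand-in for the per-script VM state:

using System.Collections.Generic;

// IScript is a hypothetical stand-in for the per-script VM state.
interface IScript
{
    bool HasPendingWork { get; }
    void ExecuteNextInstruction();   // interpret a single bytecode instruction
}

static class ScriptScheduler
{
    // Called once per iteration of the Sim's main loop.
    public static void RunSlice(IList<IScript> scripts, int instructionsPerScript)
    {
        foreach (var script in scripts)              // effectively round-robin
        {
            int budget = instructionsPerScript;      // this script's timeslice
            while (budget-- > 0 && script.HasPendingWork)
            {
                script.ExecuteNextInstruction();
            }
        }
    }
}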

The individual scripts within the original LSL VM are quite limited: they are allocated only 16KB for program code, stack and heap. This fixed program size made managing memory limits simple; the heap and stack grew towards each other in the address space, and when they collided the script threw an out-of-memory exception.
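
Here is a minimal sketch of that fixed-block memory model – my reconstruction for illustration, not the real VM code:

using System;

// Reconstruction of the fixed 16KB memory model for illustration only.
class ScriptMemory
{
    private readonly byte[] block = new byte[16 * 1024];  // program code + heap + stack
    private int heapTop;       // grows upwards from the end of the program code
    private int stackBottom;   // grows downwards from the end of the block

    public ScriptMemory(int programSize)
    {
        heapTop = programSize;
        stackBottom = block.Length;
    }

    public int AllocateOnHeap(int size)
    {
        if (heapTop + size > stackBottom)
            throw new OutOfMemoryException("Script exceeded its 16KB allocation");
        int address = heapTop;
        heapTop += size;
        return address;
    }

    public int PushStackFrame(int size)
    {
        if (stackBottom - size < heapTop)
            throw new OutOfMemoryException("Script exceeded its 16KB allocation");
        stackBottom -= size;
        return stackBottom;
    }
}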

With LSL2, when a Prim moves between regions, the executing script has to be migrated to the destination server. Migration of scripts between servers was relatively simple because all the program state was stored in a contiguous block of memory: the running form is identical to the serialised form, so it could simply be sent over the wire and execution would continue the next time it was scheduled, with no explicit serialisation required. Unfortunately, this simple VM design led to poor performance: the code was interpreted; there was no JIT compiler and no native instructions were generated.

In an attempt to rate-limit particular actions that scripts could perform, some library calls would cause the script to sleep for a given time. For example, calling llSay (the function that prints chat text) would cause the script to yield control for 0.1 seconds. Users worked around this limitation pretty easily: they would create a Prim containing multiple scripts and farm out the rate-limited work to them using message passing. There was no limit to the number of scripts a user could have, so this effectively gave them unlimited calls to the restricted function.

For later features, we replaced these sleeps with per-Sim rate limits. The rate limit was fixed for scripts attached to an avatar and proportional to the quantity of land owned within a region: own 10% of the land, get 10% of the budget. The same style of limit was applied to the number of Prims within a region. This meant that the rate limit was now, at least somewhat, tied to actual resource usage.

Users applied similar techniques to give their objects more storage. Again, by placing multiple scripts in a single object, users could distribute the values to be stored across multiple scripts. To retrieve a value, they could broadcast a message to all their scripts and the relevant one would respond.

On the Sim, scheduling of scripts was effectively round-robin; however, there were some differences in the way events were scheduled. These differences were discovered and exploited by users: for example, users would craft scripts that could get more CPU by creating a loop using the colour system. A user could add a handler to the colour-changed event and then, within the handler, change the colour. This short-circuited the scheduling and allowed the script to jump the queue.

Like anything in SL, users created scripts within the Viewer. There is an editor window with basic autocomplete into which users can write or paste code. The code is then compiled on the client and uploaded like any other asset. The upload service was a simple, Apache-hosted CGI Perl script. It did no validation of the bytecode, and users would occasionally try to exploit this by uploading garbage (or deliberately malicious) bytecode.

Users wanted to be able to use a mainstream programming language: the professionals wanted to be able to hire normal programmers with language experience, and the amateurs wanted a transferable skill. The language and runtime often got in the way of adding new functionality, and the limited set of data types made some methods very inefficient.

Second Life is fundamentally about user-created content, and user scripting was one of the main tools for creating it. It suffered many serious flaws and limitations, including the problems I’ve described above. We wanted better performance, fewer limitations and new programming languages, so we replaced the legacy VM with Mono and the compiler backend with one that could generate CIL instructions. This wasn’t a simple swap, however; we had to solve many complex problems. The implementation is the topic of my next post.

IIRC: Persistence in Second Life

In a recent presentation, a colleague mentioned Second Life in the context of persistence in virtual worlds. As I used to work at Linden Lab, I thought I’d follow up with some more information and notes about how it actually worked. None of this is secret: it was published on their wiki, mentioned in office hours, and I’ve seen other presentations cover it too. I worked on the small team responsible for the Simulator.

The Simulator, or Sim, was responsible for all simulation of a 256m x 256m region and all the connections of players within it. The state of a Sim was periodically saved to disk and uploaded by another process to a SAN (often called the asset database). The state was also saved to disk in the event of a crash. Upon restarting, the Sim would attempt to use that saved state; if it could not, it loaded the last normal save from the SAN. This meant that there was up to a 15-minute window between a change in the state of a Sim and it becoming persistent.

This gap could be exploited by players, who would take an item into their inventory and then deliberately cause crashes with various exploits. This duplicated the item: it ended up in their inventory and was also left in place in the Sim.

To mitigate this, we managed to fix all of the reported crash causes. Discovering and fixing all those bugs took years, and it was a constant battle to keep on top of them. We had great reporting tools and call-stack statistics, and there was also an army of support people who could manually replace lost items. New features inevitably introduced new opportunities for exploits, and old exploits were still occasionally discovered. Although this mostly mitigated the problem, it did not solve it. Sadly, even if a Sim never crashes you cannot be sure that your transaction will be durable; Sims died for other reasons too. For example, they were killed if their performance degraded too much, and occasionally there were accidental cable pulls, network problems and power outages.

The Sim state file could get pretty large (at least 100MB); it contained the full serialised representation of the whole hierarchy of entities (known as Prims in SL jargon) within the region. This was unlike the inventory database, which just held URLs to the items within it. It was a legacy from a time when everything was a file.

The Second Life Sim was effectively single-threaded; it had a game loop with a time slice for message handling, physics and scripts. IIRC, it wrote the state by forking the process to do the write. If we had attempted to write the entire Sim state on every modification, it would have been a problem for performance, with the potential for users to mount DoS attacks.

Users were not the only things that could modify Sim state; scripted items could spawn objects or even modify themselves, which is why saving was done at a limited rate. Serialised Sim state was not the only form of persistence: Second Life had a rich data model, including user representations, land ownership, classified adverts and many other areas.

One of the largest databases was the residents’ (the name for players) inventory, which was stored in a sharded set of MySQL databases. The items contained in the inventory were serialised and stored as files on the SAN or in S3, much like the Sim region data; the database contained a URL to the resource that represented each item. Some residents had huge inventories with hundreds of thousands of items. The inventory DB was so large (combined with operational decisions to use commodity hardware) that it needed to be sharded. The sharding strategy was to bucket users based on a hash of their UUID.
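
The bucketing itself is the simple part. A hypothetical sketch – I can’t vouch for the actual hash function or shard count used:

using System;

// Hypothetical sketch of hash-based sharding by resident UUID; the actual hash
// function and shard count are assumptions for illustration.
static class InventoryShards
{
    public static int ShardFor(Guid residentId, int shardCount)
    {
        // Derive a stable number from the UUID so that a given resident always
        // maps to the same MySQL shard.
        uint bucket = BitConverter.ToUInt32(residentId.ToByteArray(), 0);
        return (int)(bucket % (uint)shardCount);
    }
}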

Having a centralised store for inventory was essential; most users had inventories far too big to be migrated around the world. It also had operational advantages: the Dev/Ops team were well versed in maintaining and optimising MySQL. Unfortunately, by sharding like this, you lose the ability to do transactions across users, as they are no longer part of one database.

Back when Second Life was still growing, the main architectural aim was scalability; reliability was secondary. The database had historically been a point of failure, and significant effort was put into partitioning it so that it would not be again. The architectural strategy was to migrate all clients of the DB to loosely-coupled REST web services.

REST web services are a proven, scalable technology; they are what the Internet is built from. Provided the services are stateless, they scale well and can often exploit caching. The web technologies used (specifically LAMPy) were well known by Dev/Ops; they made scaling a deployment issue.

A secondary goal of this initiative was to allow a federated virtual world: something that would let other companies and individuals host regions while still using their SL identity. We got part-way through this before I left, but I don’t think it was ever completed, since the growth of SL stopped.

Second Life went the long, hard way round to achieving durable transactions. This was partly due to having a hard-to-change monolith in the Simulator, which caused many other problems; it made architectural changes hard. Importantly, it wasn’t hard to change because the individual classes and files were badly coded; rather, the parts were too highly coupled, so a change in one part could affect something seemingly unrelated. The Sim suffered from the ball-of-mud anti-pattern; it wasn’t originally badly designed, but it grew too organically and lacked structure.

I spent significant time introducing seams to produce sensible, workable sub-systems. Persistence is a hard thing to get right; Second Life needed to change several times to support its scale. That said, the state of the art in distributed databases (and even in conventional databases, including MySQL) has progressed significantly since that time. Get it right – engineer in the qualities of transactions and persistence that you need from the start – and it will save you significant effort.

Gitflow-style-releases with Teamcity versioning

On my current project we’re using the Gitflow branching model, and we use TeamCity for CI. Using Gitflow with SemVer means that you specify the version number each time you release, giving it a specific meaning based on the changes within that release.

Previously, when using SemVer, I’ve just used the pre-release tag to identify builds from the build server, preferably with something that ties a version to a specific revision. This is fine, but there is some duplication: Gitflow calls for you to tag the release revision with a label, and TeamCity has a build number that you can specify. The two overlap, and I’d rather not have to type the number twice; I want to make the process of releasing as simple as possible.

There is a meta-runner for TeamCity that uses GitVersion to set the version number. This might provide the functionality you need, but unfortunately there were two problems for me. The first is that we run build agent accounts that cannot use Chocolatey, and the meta-runner attempts to use it to install GitVersion. The second is to do with the limited checkout of branches that TeamCity does: it doesn’t have all the tags. GitVersion attempts to use a full checkout to get the branches, and I’d rather not do this, as TeamCity has its own style of checkout that I don’t want to work against.

The first thing we need to do is get the version from the tag. As I’ve already mentioned, as with GitVersion, you can’t always get the tag with Teamcity’s limited checkout. Instead, the approach I’ve taken uses Artifact Dependencies.

Gitflow assumes one releasable thing per repository, or at least, only one version number. At the moment I’m using a major version of zero so I’m not tracking API breaking changes.

Using TeamCity, create a build configuration with a VCS trigger for your master branch – the branch from which your releases are built. To do this, add the trigger with the filter +:refs/heads/master

Then add a PowerShell script build step that executes the following:

# Get the most recent version tag (e.g. v1.2.0) reachable from this commit
$TagVersion = git describe --tags --match v*
# Report it to TeamCity as the build number via a service message
Write "##teamcity[buildNumber '$TagVersion']"
$Version = "$TagVersion".TrimStart("v")
Write "$Version" > library.version
# Compute the next minor version (bump the minor part, reset the patch)
$parts = $Version.split(".")
$parts[1] = [int]$parts[1] + 1
$parts[2] = "0"
$SnapshotVersion = $parts -join "."
Write "$SnapshotVersion" > library-next-minor.version

This will generate two files: one containing the current version from the tag, and another containing the next minor version. Add these files (library.version and library-next-minor.version, renamed as appropriate) as artifacts. These artifacts can then be used by other build configurations that produce outputs which need to be versioned.

In your build configuration, use an Artifact Dependency on the latest successful build of the master branch. To make TeamCity update the build number, add the following build step:

# Work out which branch and commit we are building
$branch = git rev-parse --abbrev-ref HEAD
$hash = git rev-parse --short HEAD
# These version files come from the artifact dependency on the master build
$thisVersion = Get-Content corelibrary.version
$nextVersion = Get-Content corelibrary-next-minor.version
if ($branch -eq "master") {
    # Releases get the clean tagged version
    $version = "$thisVersion"
} elseif ($branch -eq "develop") {
    # Development builds are snapshots of the next minor version
    $version = "$nextVersion-$hash-SNAPSHOT"
} else {
    # Other branches include the branch name in the pre-release tag
    $upper = $branch.replace("/", "-").toupper()
    $version = "$nextVersion-$hash-$upper-SNAPSHOT"
}

Write "##teamcity[buildNumber '$version']"

If you need to access this later from your build scripts, you can get it from the BUILD_NUMBER environment variable.