December 15, 2014

Monday’s downtime

You may have noticed we had a bit of downtime last Monday... 15 hours to be exact.

First things first: we're really sorry for the hassle this caused. Many of you rely on Niice as a key part of your workflow, and we take that responsibility seriously.

I thought it would be useful to explain what happened and what we’re doing to make sure it won't happen again.

What happened

We woke up on Monday morning to find that Niice was down. Our hosting provider had performed a server migration during the night, and there had been issues with the server we were hosted on. They were looking into it, but weren't able to give us an estimate for when it would be fixed. By this stage (9am GMT), the site had been down for around 8 hours.

Our response

With no clear idea of when the server would come back online, we began the process of getting Niice set up on another hosting provider. A few hours later (around 1pm GMT), we were nearly ready to go: the only problem was data.

There was a gap between our latest backup and the point the site went down, and until the server issue was fixed we couldn't access the data created during that gap (about 50 people had created accounts and moodboards in that time).

We had hoped to regain access to the old server first, so that the data on the new server would match it exactly, but after another few hours (4pm GMT) we decided to go ahead and get the site live using the backup data.

We’ve since gotten access to our old server, and are in the process of merging the 'gap' data into the new database to make it available again for the few who were affected. It's taking some time though, as we’re being extra careful not to lose any data in the process. Not having access to data is bad, but losing it would be completely unacceptable.
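For those curious about what that merge involves, here's a rough sketch of the approach. To be clear, it's illustrative only: the table names, timestamps and even the database engine (SQLite, just to keep the example self-contained) are stand-ins, not a description of our actual setup.

    # Copy rows created during the backup gap from the old database into the
    # new one, without ever overwriting a row that already exists. Every name
    # below (accounts, moodboards, created_at, the file paths) is a placeholder.
    import sqlite3

    GAP_START = "2014-12-08 01:00:00"   # placeholder: time of the last good backup
    GAP_END = "2014-12-08 16:00:00"     # placeholder: time the restored site went live

    def merge_gap_rows(old_db_path, new_db_path, table):
        old = sqlite3.connect(old_db_path)
        new = sqlite3.connect(new_db_path)
        try:
            rows = old.execute(
                "SELECT * FROM {0} WHERE created_at BETWEEN ? AND ?".format(table),
                (GAP_START, GAP_END),
            ).fetchall()
            for row in rows:
                placeholders = ", ".join("?" for _ in row)
                # INSERT OR IGNORE skips rows whose primary key is already
                # present, so re-running the merge can never destroy data.
                new.execute(
                    "INSERT OR IGNORE INTO {0} VALUES ({1})".format(table, placeholders),
                    row,
                )
            new.commit()
            return len(rows)
        finally:
            old.close()
            new.close()

    if __name__ == "__main__":
        for table in ("accounts", "moodboards"):
            copied = merge_gap_rows("old_server.db", "new_server.db", table)
            print("{0}: merged {1} gap rows".format(table, copied))

The key property is that the merge only ever adds rows that are missing from the new database, so it can be run and re-run safely while we double-check the results.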

Lessons learned

Despite doing all we can to prevent it, sometimes things go wrong and break. It's important that we have the right processes in place to minimise the impact of these rare, but frustrating, events.

This episode highlighted an important weakness in our recovery process: our data backups. While our current process ensured that we didn't lose any data, it still left some people's data unavailable for an unacceptable amount of time. Here's what we're doing to fix this:

In the short term, we're moving to more frequent data backups, which will mean less data 'falling through the gap' and needing to be merged back into the database later.
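As a rough illustration (the database, credentials and paths here are placeholders, and a MySQL-style database is an assumption rather than a description of our real stack), an hourly backup can be as simple as a small script run from cron:

    # Hourly database dump, written as a compressed, timestamped file.
    # Schedule from cron, e.g.:
    #   0 * * * * /usr/bin/python /opt/niice/hourly_backup.py
    import datetime
    import gzip
    import os
    import subprocess

    BACKUP_DIR = "/var/backups/niice"    # placeholder path
    DB_NAME = "niice_production"         # placeholder database name

    def hourly_backup():
        stamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M")
        target = os.path.join(BACKUP_DIR, "{0}-{1}.sql.gz".format(DB_NAME, stamp))

        # Credentials come from ~/.my.cnf rather than the command line;
        # --single-transaction avoids locking InnoDB tables during the dump.
        dump = subprocess.Popen(
            ["mysqldump", "--single-transaction", DB_NAME],
            stdout=subprocess.PIPE,
        )
        with gzip.open(target, "wb") as out:
            for chunk in iter(lambda: dump.stdout.read(64 * 1024), b""):
                out.write(chunk)
        if dump.wait() != 0:
            raise RuntimeError("mysqldump exited with status {0}".format(dump.returncode))
        return target

    if __name__ == "__main__":
        print("backup written to", hourly_backup())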

In the medium term, we're moving to duplicating our data completely across two separate locations, so even if one becomes unavailable we'll be able to go live from the other at a moment's notice.
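Again purely as a sketch (assuming MySQL-style replication to a standby server, which may not match our eventual setup, with a placeholder hostname), the important operational piece is checking that the second copy is actually keeping up:

    # Warn if the standby copy falls too far behind the primary.
    import subprocess

    REPLICA_HOST = "standby.example.com"   # placeholder standby location
    MAX_LAG_SECONDS = 60                   # alert if more than a minute behind

    def replica_lag_seconds():
        # Credentials come from ~/.my.cnf; \G gives one field per line.
        out = subprocess.check_output(
            ["mysql", "--host", REPLICA_HOST, "-e", "SHOW SLAVE STATUS\\G"]
        ).decode("utf-8")
        for line in out.splitlines():
            key, _, value = line.strip().partition(":")
            if key == "Seconds_Behind_Master":
                value = value.strip()
                return None if value == "NULL" else int(value)
        return None

    if __name__ == "__main__":
        lag = replica_lag_seconds()
        if lag is None or lag > MAX_LAG_SECONDS:
            print("WARNING: standby copy is not keeping up (lag: {0})".format(lag))
        else:
            print("standby copy is {0}s behind the primary".format(lag))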

In the long term, this highlights a larger issue with cloud services like Niice in general: what happens when the site becomes unavailable?

Over the past few years we've all been burned by services we depended on for work becoming unavailable due to everything from server issues to acquihires. No amount of apologies or 'thanks for joining us on our adventure' emails will make up for being let down by a service you depend on to do your job effectively. With features like Dropbox Two-way Sync we're taking steps towards making Niice available offline, but we have a way to go yet. I'd love to hear your thoughts and suggestions on what more we can do to improve in this area.

Once again, I’m really sorry for the hassle caused by our downtime. We’re working to make sure this won't happen again, and to prove that Niice is a service you can depend on.
