We invite each of our featured SoftLayer Tech Marketplace Partners to contribute a guest post to the SoftLayer Blog, and this week, we're happy to welcome David Mytton, Founder of ServerDensity. Server Density is a hosted server and website monitoring service that alerts you when your website is slow, down or back up.
Tech Partners Marketplace: http://www.softlayer.com/marketplace/serverdensity
5 Ways to Minimize Downtime During Summer Vacation
It's a fact of life that everything runs smoothly until you're out of contact, away from the Internet or on holiday. However, you can't be available 24/7 on the chance that something breaks; instead, there are several things you can do to ensure that when things go wrong, the problem can be managed and resolved quickly. To help you set up your own "get back up" plan, we've come up with a checklist of the top five things you can do to prepare for an ill-timed issue.
How will you know when things break? Using a tool like Server Density — which combines availability monitoring from locations around the world with internal server metrics like disk usage, Apache and MySQL — means that you can be alerted if your site goes down, and have the data to find out why.
Surprisingly, the most common problems we see are some that are the easiest to fix. One problem that happens all too often is when a customer simply runs out of disk space in a volume! If you've ever had it happen to you, you know that running out of space will break things in strange ways — whether it prevents the database from accepting writes or fails to store web sessions on disk. By doing something as simple as setting an alert to monitor used disk space for all important volumes (not just root) at around 75%, you'll have proactive visibility into your server to avoid hitting volume capacity.
Additionally, you should define triggers for unusual values that will set off a red flag for you. For example, if your Apache requests per second suddenly drop significantly, that change could indicate a problem somewhere else in your infrastructure, and if you're not monitoring those indirect triggers, you may not learn about those other problems as quickly as you'd like. Find measurable direct and indirect relationships that can give you this kind of early warning, and find a way to measure them and alert yourself when something changes.
2. Dealing with Alerts
It's no good having alerts sent to someone who isn't responding (or who can't at a given time). Using a service like Pagerduty allows you to define on-call rotations for different types of alerts. Nobody wants to be on-call every hour of every day, so differentiating and channeling alerts in an automated way could save you a lot of hassle. Another huge benefit of a platform like Pagerduty is that it also handles escalations: If the first contact in the path doesn't wake up or is out of service, someone else gets notified quickly.
3. Tracking Incidents
Whether you're the only person responsible or you have a team of engineers, you'll want to track the status of alerts/issues, particularly if they require escalation to different vendors. If an incident lasts a long time, you'll want to be able to hand it off to another person in your organization with all of the information they need. By tracking incidents with detailed notes information, you can avoid fatigue and prevent unnecessary repetition of troubleshooting steps.
We use JIRA for this because it allows you to define workflows an issue can progress along as you work on it. It also includes easy access to custom fields (e.g. specifying a vendor ticket ID) and can be assigned to different people.
4. Understanding What Happened
After you have received an alert, acknowledged it and started tracking the incident, it's time to start investigating. Often, this involves looking at logs, and if you only have one or two servers, it's relatively easy, but as soon as you add more, the process can get exponentially more difficult.
We recommend piping them all into a log search tool like (fellow Tech Partners Marketplace participant) Papertrail or Loggly. Those platforms afford you access to all of your logs from a single interface with the ability to see incoming lines in real-time or the functionality to search back to when the incident began (since you've clearly monitored and tracked all of that information in the first three steps).
5. Getting Access to Your Servers
If you're traveling internationally, access to the Internet via a free hotspot like the ones you find in Starbucks isn't always possible. It's always a great idea to order a portable 3G hotspot in advance of a trip. You can usually pick one up from the airport to get basic Internet access without paying ridiculous roaming charges. Once you have your connection, the next step is to make sure you can access your servers.
Both iPhone and Android have SSH and remote desktop apps available which allow you to quickly log into your servers to fix easy problems. Having those tools often saves a lot of time if you don't have access to your laptop, but they also introduce a security concern: If you open server logins to the world so you can login from the dynamic IPs that change when you use mobile connectivity, then it's worth considering a multi-factor authentication layer. We use Duo Security for several reasons, with one major differentiator being the modules they have available for all major server operating systems to lock down our logins even further.
You're never going to escape the reality of system administration: If your server has a problem, you need to fix it. What you can get away from is the uncertainty of not having a clearly defined process for responding to issues when they arise.
-David Mytton, ServerDensity
These Partners have built their businesses on the SoftLayer Platform, and we're excited for them to tell their stories. New Partners will be added to the Marketplace each month, so stay tuned for many more come.