Posts Tagged ‘operations’

A Day in the Life of a DC Supervisor

Friday, August 15th, 2008

Eric Bush

Hosting companies often run their data centers with The Wizard of Oz’s “Pay no attention to the man behind the curtain” mentality, so you rarely get to see what data center operations look like day to day. You might say I’m one of The Planet’s “behind the curtain” employees … not only do I work in one of The Planet’s data centers, I’m also the overnight supervisor. While some of these activities may seem a bit mundane, each is designed to give our customers the best possible operational environment. The processes and procedures described here are performed by every shift in all of our data centers.

Shift Change

Our overnight shift begins daily at 7:00 PM (CDT) with a hand-off of any issues from the previous shift that may need special attention. These hand-offs are typically “work already in progress,” such as operating system reloads and system upgrades.

Once this “shift change” is complete and the new shift’s operations staff is in place, each member of the team has specific tasks to perform. These include assessing the maintenance scheduled during the shift, reviewing pending customer orders and server upgrades, and responding to any reboot requests submitted during the shift change.

After these initial checks, we settle into normal operations for most of the night. During this period, we monitor and respond to high-priority issues (usually submitted as tickets) and carry out any scheduled maintenance. In addition to this technical work, a staff member performs a perimeter patrol every four hours.

Perimeter Patrols

While the name doesn’t sound like much, the perimeter patrol is an integral part of our data center operations, designed so our staff can constantly monitor and control the data center’s status and operational readiness. Each patrol takes about an hour and is a top-to-bottom inspection of our facility and server environment. To give you a sense of its significance, here are some of the system and facility checks a patrol covers:

- Temperature and Humidity Readings for all Environmental Units

Each phase of our data centers has as many as 18 heating, ventilation and cooling units. Each unit operates independently of the others and uses dual-compressor technology to provide heating, air conditioning and humidity control. Our data centers are kept at or below 76 degrees Fahrenheit. If the temperature at any one unit exceeds that threshold, redundant units are brought online and a facilities engineer is contacted immediately, regardless of the time of day.

- Commercial UPS System Status Displays

Our data center is fully protected by redundant, commercial-grade uninterruptible power systems. In the unlikely event of an interruption to electricity supplied to our facility, these UPS systems provide stop-gap power and automatically signal our generators to start. Within seconds, our generators are running and providing power. In my data center, we’ve got 10 display panels to check for operational readiness on each patrol.

- Generator System Panels and Status Displays

There are two levels of generator monitoring in our data center. The first is a remote set of annunciation panels installed in our monitoring center. These displays provide an instant status of our power generating system. The second and more comprehensive set of screens we check are in our power generation and transfer control room. On those displays, we check important generator operational parameters such as engine coolant temperature, fuel level, and battery voltage. An interesting fact about our generators: the engines are temperature controlled, even when they are not running. If you ever win Jeopardy with that information, I’ll expect a cut of your prize.

- Electrical Power Transfer Switch Status Displays

Each one of our power transfer switches has a detailed display that shows us the status of the power entering our facility, whether backup power is online and available, and other, more detailed parameters. Here, we also check the total amperage in use by the facility for any abnormal variances.

- Life Safety System Status, Piping Pressures, and Fire Suppression Tank Levels

Our data center uses a combination of fire suppression technologies, so we regularly verify their operational status by reading air and water pressures, inspecting the piping for integrity, and checking the master status panel in our monitoring center. Our fire suppression system is constantly monitored by both our data center staff and an outside monitoring company. In the event of an alarm, emergency units are automatically dispatched to our facility.

- Exterior Doors and Intrusion Detection System Panels

All entry points are physically verified as closed and locked on a regular basis. An intrusion alarm panel in our monitoring center shows the status of the entire facility, and data center staff monitor it constantly.

- Closed Circuit Video Monitoring System

We employ a comprehensive video monitoring system in our data center with remotely controlled cameras mounted at strategic points both inside and outside of the data center.

- Outdoor Generator Enclosures and Engine Components

On each patrol, we actually open the generator compartments and look inside at the massive engines that power our data center during utility outages. We check the area around each generator for leaks and debris and manually verify the fuel level indicators. To get a sense of how big these generators are, check out Kevin’s “Data Centric” post.

- Outdoor Air Conditioning Condensers

For each of our heating, ventilation, and cooling systems inside the data center, there is a corresponding condenser outside the facility. We walk by each condenser and make sure the fans are running at optimal performance.
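To make the escalation rule from the Temperature and Humidity Readings check concrete, here is a minimal sketch of how one patrol’s readings could be turned into follow-up actions, together with the amperage check done at the transfer switches. It is an illustration, not The Planet’s actual tooling: only the 76-degree limit comes from this post, while the unit IDs, the baseline amperage and the 10 percent variance tolerance are hypothetical.

```python
from dataclasses import dataclass

# The 76 F ceiling comes from the post; everything else here is a hypothetical illustration.
TEMP_LIMIT_F = 76.0


@dataclass
class UnitReading:
    unit_id: str          # hypothetical label for an environmental (HVAC) unit
    temperature_f: float  # temperature reading taken on the patrol


def units_over_limit(readings, limit_f=TEMP_LIMIT_F):
    """Return the IDs of units reading above the limit (at or below is acceptable)."""
    return [r.unit_id for r in readings if r.temperature_f > limit_f]


def amperage_variance(current_amps, baseline_amps):
    """Fractional deviation of total facility amperage from an expected baseline."""
    return abs(current_amps - baseline_amps) / baseline_amps


def patrol_actions(readings, current_amps, baseline_amps, amp_tolerance=0.10):
    """Turn one patrol's readings into follow-up actions.

    amp_tolerance is a hypothetical threshold; the post only says the team looks
    for "abnormal variances" in total amperage.
    """
    actions = []
    hot_units = units_over_limit(readings)
    if hot_units:
        actions.append("bring redundant units online: " + ", ".join(hot_units))
        actions.append("contact facilities engineer immediately")
    if amperage_variance(current_amps, baseline_amps) > amp_tolerance:
        actions.append("investigate abnormal amperage variance")
    return actions


readings = [UnitReading("CRAC-01", 72.5), UnitReading("CRAC-02", 77.0)]
# Prints the two temperature actions; a 4 percent amperage change stays under the tolerance.
for action in patrol_actions(readings, current_amps=1040.0, baseline_amps=1000.0):
    print(action)
```

A reading of exactly 76 degrees is treated as acceptable, matching “at or below 76 degrees Fahrenheit”; only readings above the limit trigger escalation.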

I’d love to sit and chat a little longer, but I’ve got some work to do. If this post is helpful and informative, I’ll start thinking about a few other data center tasks you might enjoy learning about.

-Eric
H2 Data Center Operations Supervisor (overnights)

SAS 70 Type II

Monday, June 30th, 2008

Kevin Hazard

Over the course of the last several months, we’ve been working with Weaver & Tidwell, L.L.P., a highly regarded certified public accounting firm out of Fort Worth, to complete an exhaustive Statement on Auditing Standards No. 70 (SAS 70) Type II audit. Developed by the American Institute of Certified Public Accountants (AICPA), SAS 70 is a widely recognized auditing standard, and completing the audit certifies that The Planet’s internal processes and controls have been through a rigorous evaluation by an independent third-party auditor.

Voluntarily undergoing an exhaustive third-party audit that takes months to complete.

A SAS 70 Type II audit is certainly a big-time undertaking. Some even think starting the process of a future review is worthy of a dedicated blog post … we just got it done.

In the process of the audit, we checked and evaluated the controls and processes for our network, customer provisioning systems, physical and environmental security, problem management and resolution through our customer portal, human resources department organization and administration, data center operations, and most importantly, our data centers themselves.

Daniel Golding, vice president and research director for Tier1 Research, explains the significance of SAS 70 compliance in the context of the hosting industry:

Hosting providers that want to offer meaningful IT services to larger enterprises see SAS 70 as the means of both meeting Sarbanes-Oxley auditing requirements, while reassuring IT decision makers that their processes, facilities and staff are capable of providing true enterprise-grade services.

The Sarbanes-Oxley legislation sets standards required of every public company, which also makes it important to any company considering or anticipating an IPO. While searching for more information on how SAS 70 relates to SOX compliance, I came across a great resource: www.sas70.com. The site has a dedicated Sarbanes-Oxley page, where the significance of a Type II audit is masterfully described:

Section 404 [of Sarbanes-Oxley] draws attention to the significant processes that feed and comprise the financial reporting process for an organization. In order for management to make its annual assessment on the effectiveness of its internal control, management is required to document and evaluate all controls that are deemed significant to the financial reporting processes. If the organization uses a service provider to process transactions, host data, or other significant services, management may need to evaluate the design and test the operating effectiveness of the service organization’s controls.


Management will either need to conduct an evaluation of the service organization’s controls, or management may obtain a Type II SAS 70 service auditor’s report from the service organization, if a service auditor has been engaged, to gain an understanding of the service organization’s controls. The relevant audit guidance for SAS 70 already requires that a service auditor’s report contain information on the five components of internal control as it relates to the service organization.

The difference between a Type I audit and a Type II audit is pretty significant: Both say “we have well-designed processes, controls and goals,” but a Type II audit must also show that those controls and processes have been practiced over a period of time and were successful in achieving the initial goals. The proof is in the pudding.

What Does It Mean?

It’s clear that the successful completion of the SAS 70 Type II review is important to all of our customers. It reinforces our commitment to providing the best hosting experience in the industry. Our processes, practices, procedures and controls have been tested and have been proven successful in helping us achieve our operational goals.

-Kevin

New Schedules in Data Center Operations

Monday, May 12th, 2008

Scott King

At The Planet, we work in a continuous-improvement environment, always looking for new ways to improve our operations and our service. One of the recent changes we’ve implemented in Data Center Operations is a new 12-hour schedule for all data center technicians and supervisors.

With the new schedule, our employees work 12-hour days on a two-week rotation: three days one week, followed by four days the next. These schedules actually put more people on each shift throughout the entire day, and they eliminate wasted overlap between shifts as well as days with double staffing.

Every technician affected by the schedule change alternates between three- and four-day weekends. Because the shifts are longer, each technician comes into the office seven out of every 14 days, compared with 10 out of every 14 days under the old schedules. The new schedule also streamlines our support communications, especially with regard to ownership and handoffs. Because we’re eliminating a few mid-day shift changes, projects and tickets are less likely to bounce between shifts. Not only does this benefit our customers, but our employees also like the additional freedom it provides.
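The rotation arithmetic can be checked with a small sketch. It assumes one hypothetical layout of the two weeks (three shifts followed by four days off, then four shifts followed by three days off); the post gives the counts, not which weekdays are worked, and the old 10-of-14-day schedule is not modeled.

```python
# Hypothetical layout of the two-week rotation; the post gives the counts, not the weekdays.
# True = on site for a 12-hour shift, False = day off.
ROTATION = [True] * 3 + [False] * 4 + [True] * 4 + [False] * 3
SHIFT_HOURS = 12


def days_on_site(pattern):
    """Number of workdays in one pass through the rotation."""
    return sum(pattern)


def off_stretches(pattern):
    """Lengths of consecutive days-off runs, treating the pattern as a repeating cycle."""
    # Start the cycle on a workday so no run of days off is split across the wrap-around.
    start = pattern.index(True)
    cycle = pattern[start:] + pattern[:start]
    runs, current = [], 0
    for working in cycle:
        if working:
            if current:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)
    return runs


print(days_on_site(ROTATION), "of", len(ROTATION), "days on site")  # 7 of 14 days on site
print("weekends:", off_stretches(ROTATION))                         # weekends: [4, 3]
print("hours per rotation:", days_on_site(ROTATION) * SHIFT_HOURS)  # hours per rotation: 84
```

Under this layout, seven 12-hour shifts come to 84 hours per two-week rotation, and the days off fall into one four-day and one three-day weekend.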

The efficiencies gained from the new schedules will reduce our data center operations costs by approximately $500,000 a year. With that savings, we can fund new positions and projects to improve service. It also increases our efficiency and effectiveness, which are two of the most important mantras in any “operations” handbook.

We’ve implemented the new schedules in four of the six data centers so far, with the final two migrating to the new schedules within the next few weeks. The feedback from the technicians and management staff in the first four data centers has been overwhelmingly positive.

I’d like to personally thank all of our helpful and open-minded technicians who have allowed us to make a significant change in their lives. Our success is fundamentally tied to the team’s 100 percent commitment to the new structure, and we couldn’t be happier with the results.

-Scott

A New Spin on Policy Creation at The Planet

Friday, May 2nd, 2008

Scott King

One of the great aspects of being in a new operations role is the ability to take a look at policies and procedures with a fresh eye. As the new senior director of Data Center Operations at The Planet, that’s been one of my first priorities.

I have never been a fan of policies written by senior executives and then simply rolled out with a directive to follow them. For any policy to be implemented effectively, I contend that the team that must follow it should have a hand in crafting it, or at least understand why it was created. More importantly, these processes come together through the team rather than through a single individual’s effort. The best method I have seen for this comes from the Six Sigma Continuous Process Improvement (CPI) methodology.

In our data centers, we have created a Continuous Process Review (CPR) team made up of technicians, supervisors and managers across multiple cities. This team is solely responsible for the review and creation of all operations policies, processes and procedures across our six data centers.

The results of the team’s work have been outstanding. In the last two months, we have produced new or updated policies for operational functions like ticket prioritization, technical escalations and management escalations. Because all levels of the data center operations organization are represented, the response to the new policies has been fantastic.

I owe many thanks to all the people who have participated on this new CPR team. None of this could have been achieved without their hard work and dedication.

The Planet’s goal is to provide the best customer experience in the hosting industry, and the data center operations team plays a huge role in that experience. Our ability to adapt to the evolving hosting landscape and respond directly to customer feedback is instrumental in improving the way our data centers run, so if you have any suggestions with regard to our data center policies and procedures, please let me know.

-Scott

Earth Day 2008

Tuesday, April 22nd, 2008

Yvonne Donaldson

Earth Day typically inspires widespread environmental introspection. How can we cut down on waste? Can we be more efficient? Are we actively pursuing “greener” operations? And how can we reduce our costs and be fiscally responsible?

Houston is recognized as the energy capital of the world, so it may come as a surprise that, amid that distinction, The Planet is doing its part to reduce energy costs. In fact, we have been featured in several “green technology” articles over the past few months and recognized for our common-sense approach. Ultimately, we aim to save money, reduce consumption and improve data center efficiency. And in the coming weeks, we’ll announce an expanded program that takes those efficiency gains to the next level.

Tier1 is a leading research firm, and Martin Levy is the firm’s “green” analyst. In his report on The Planet, his headline was simple: “Down-to-earth solutions help improve efficiency at The Planet.”

Martin goes on to say the following:

Not a word about carbon offsets. Nobody planting trees. Nothing about Renewable Energy Credits (RECs). No recycling bins at the entrance to the datacenters. Instead, today’s announcement from The Planet was all about core datacenter efficiency. The company runs six datacenters and because of a focus on efficiency, it expects to save over one million dollars during 2008 … T1R is impressed. The Planet has shown that going green can be done the old-fashioned way. Make the technology work better and the company sees a positive ROI. That’s still good for the environment and even better for the bottom line!

Our facilities team is always on the lookout for new ways to reduce energy costs, since energy is one of our biggest expenses. At the end of last year, our vice president of facilities, Jeff Lowenberg, took on an interesting challenge: Cut power costs by $1 million in 2008, while we continue to grow and provision new servers in our six world-class data centers.

In his Sustainable IT blog, Ted Samson reported on a few of the initiatives aimed at improving our efficiency:

  • Rearranging floor tiles to better manage cold airflow
  • Installing seals and grommets in the ceilings, walls, and floors to reduce bypass airflow
  • Installing blanking plates in server cabinets to direct airflow more efficiently
  • Sealing power distribution units to reduce bypass airflow

Ted also explained the significance of those “minor” improvements:

Cool air was going to only where it was needed: the server intakes … Six months later, the company finds that its efforts have paid off substantially. Even though critical server loads increased by 5 percent, the facility’s overall cooling power needs dropped by 31 percent … The Planet also improved its “coefficient of efficiency,” an EPA- and Uptime Institute-recognized measurement of the total power necessary to operate a data center, divided by critical power, which represents the energy required to operate its computers. The company increased its rating to 1.7 – a near-ideal number – from its previous “good” ranking of 2.0.
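The ratio in Ted’s description matches what is commonly called Power Usage Effectiveness (PUE): total facility power divided by the power drawn by the critical (server) load, where lower is better and 1.0 would mean every watt reaches the computers. The sketch below is illustrative only; the kilowatt figures are hypothetical, chosen so the ratios land on the 2.0 and 1.7 values quoted above, and the function names are made up for this example.

```python
def facility_ratio(total_facility_kw: float, critical_kw: float) -> float:
    """Total power needed to run the data center divided by the critical (server) load."""
    if critical_kw <= 0:
        raise ValueError("critical load must be positive")
    return total_facility_kw / critical_kw


def overhead_kw(total_facility_kw: float, critical_kw: float) -> float:
    """Power spent on everything other than the servers, such as cooling and UPS losses."""
    return total_facility_kw - critical_kw


# Hypothetical loads, picked only so the ratios match the 2.0 and 1.7 figures quoted above.
before_total, before_critical = 2000.0, 1000.0
after_total, after_critical = 1785.0, 1050.0  # critical load up 5 percent

before = facility_ratio(before_total, before_critical)
after = facility_ratio(after_total, after_critical)
print(f"ratio: {before:.2f} -> {after:.2f}")  # ratio: 2.00 -> 1.70
print(f"overhead: {overhead_kw(before_total, before_critical):.0f} kW -> "
      f"{overhead_kw(after_total, after_critical):.0f} kW")  # overhead: 1000 kW -> 735 kW
```

With these made-up numbers, the facility’s total draw falls from 2,000 kW to 1,785 kW even though the server load grew; the actual load figures for The Planet’s facilities are not given in the post.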

Matt Stansberry at Search Data Center also spoke with Jeff about our progress and shared a few additional details in the quest to improve data center cooling:

Data center cooling is where most of infrastructure energy efficiency is lost. The fundamental rule in energy efficient cooling is to keep hot air and cold air separate … The Planet uses a method of extending the height of its computer room air conditioning (CRAC) units’ return-air plenums to optimize air cooling … By extending the plenums higher, it ensures that the CRAC units are not sucking in any cold air from the cold aisles, as it allows for the hottest air to be sucked into the units. In this scenario, the top of the plenums must be at least 2 feet from the ceiling.

To get an idea of what “plenums” are, you can visit Matt’s post or Heather Clancy’s recent article about The Planet at ZDNet’s GreenTech Pastures … and while you’re there, be sure to check out the post’s opening line. :-)

To stay in the loop about what is being done in the “green tech” sphere, keep an eye on Ted Samson’s Sustainable IT blog, GreenerComputing, The Daily T1R from Tier1 Research, ZDNet’s GreenTech Pastures and Search Data Center.

And watch for more news from us.

-Yvonne