By Gene Kim, CTO-Founder of Tripwire, Inc. and co-founder of the IT Process Institute
Have you ever had this happen to you?
Project Killer Kumquat is finally going to deliver the set of features that’s going to allow us to catch up to the competition. We’ve had over 300 developers working on this project for nine months. It’s been a death march for them.
This is one of those damned date-driven projects where senior management made some promise to Wall Street and customers that we were going to ship this week.
The developers were over two months late delivering their code. But, instead of doing what rational people would do, the business just said, “That’s okay. We’ll just cut the time dedicated to the downstream tasks, like QA and Production Deployment.”
QA and Production Deployment. I’m the QA Manager. Between us and the deployment team, it’s like being stuck between the truck and the loading dock. It sucks.
29 hours ago, the developers checked in all their code, and we started the QA testing. Not only did things not go as planned, we now have a potential catastrophe on our hands. This was supposed to be a damned 4 hour deployment, and we’re 29 hours in, with no end in sight.
I look blearily at the clock that says it’s 3am, and I regret the decision I made twelve hours ago not to cancel this whole damned release and initiate a rollback. Now, it’s too late. We’re in so deep that we’ll be lucky if we have everything running by the time the East Coast customers start trying to access the systems in three hours.
I just knew something really bad was going to happen when the deployment team kept saying, “I just need another hour”, and I had already given them five hours. At some point, we should just put down the shovel and step away from the hole.
Now it’s pretty clear what happened. And upon some reflection, and after taking a 15 minute walk outside to clear my head, I’m starting to think that this is what happened to us in our last release, too. (But nowhere near as painful…)
28 hours ago, when we started testing, my team started finding failures left and right. Which is what we expected, given all the corners that were cut by the developers because of deadlines. But, for some of these issues, it took us hours to figure out whether it was a problem with the code, or something wrong with the QA environment, like an incorrectly configured OS, library, database, or variance between what we’re using and what Dev used.
And so, being the heroes that we are, once my team started finding the errors, we bent over backwards to fix them. We changed mount points, modified configuration settings, changed file permissions, modified database stored procedures, added user accounts, and so on.
The problem is, none of those changes were systematically replicated downstream to production.
In fact, our problem right now is that my team is so tired from 28 hours of firefighting, they can’t remember what they did to get things running. (Jeez. I’m looking at one of my guys trying to read what he had written on his hand eight hours ago to figure out what he did, but it’s long since faded.)
And so now, we’re repeating the whole firefight again, but this time in production. And frankly, we’re now screwing up more stuff than we’re actually fixing.
But, actually, that’s not the worst part. Some stuff is breaking because this happened in our last release, and all *those* changes weren’t systematically replicated into our Dev and QA environments!
Lesson: Preproduction changes must be captured and systematically replicated on downstream systems (e.g., Production), as well as queued up to be replicated in upstream systems (e.g., Dev, Integration Test) for the next release.
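To make that lesson concrete, here is a minimal sketch in Python of what "capturing" a change could look like. It is an illustration only, not how Tripwire or any particular tool works. The environment names come from the story above; the `Change` and `ChangeLog` classes, their methods, and the commands in the demo are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed pipeline order, from earliest to latest stage (hypothetical).
PIPELINE = ["Dev", "Integration Test", "QA", "Production"]


@dataclass(frozen=True)
class Change:
    environment: str  # where the fix was made by hand
    description: str  # what was changed, in words a teammate can follow
    command: str      # the exact command or script that was run


@dataclass
class ChangeLog:
    changes: List[Change] = field(default_factory=list)

    def record(self, environment: str, description: str, command: str) -> Change:
        """Write the change down at the moment it is made, not hours later."""
        if environment not in PIPELINE:
            raise ValueError(f"unknown environment: {environment}")
        change = Change(environment, description, command)
        self.changes.append(change)
        return change

    def downstream_plan(self, target: str) -> List[Change]:
        """Changes made earlier in the pipeline than `target`, in recorded order."""
        limit = PIPELINE.index(target)
        return [c for c in self.changes if PIPELINE.index(c.environment) < limit]

    def upstream_queue(self, target: str) -> List[Change]:
        """Changes made later in the pipeline than `target`, to queue for the next release."""
        limit = PIPELINE.index(target)
        return [c for c in self.changes if PIPELINE.index(c.environment) > limit]


if __name__ == "__main__":
    log = ChangeLog()
    # Hypothetical fixes of the kind described in the story.
    log.record("QA", "add service account", "useradd kumquat")
    log.record("QA", "loosen config permissions", "chmod 640 app.conf")
    log.record("Production", "raise connection limit", "edit db.conf")

    print("Replay on Production before the release:")
    for change in log.downstream_plan("Production"):
        print(f"  [{change.environment}] {change.description}: {change.command}")

    print("Queue for Dev for the next release:")
    for change in log.upstream_queue("Dev"):
        print(f"  [{change.environment}] {change.description}: {change.command}")
```

The point of the sketch is the habit rather than the data structure: every hand-made fix is recorded where and when it happens, so that the production deployment becomes a replay of a known list instead of a second firefight, and the next release starts from environments that already match.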
One of my favorite uses of Tripwire is controlling pre-production environments, so that we can move releases into production faster than ever without introducing chaos and disruption to the production environment. I’ll write more about this later.
Gene Kim is the CTO and founder of Tripwire, Inc. and co-founder of the IT Process Institute. He is currently working on a series of cross-industry projects to capture and codify how “best in class” organizations have IT operations, security, audit, management, and governance working together to achieve common objectives. Gene co-chaired the “Generally Accepted IT Principles Summit” with the Institute of Internal Auditors in July 2005 to help codify how to create reasonable IT audit scope for SOX-404. In 2004, he co-wrote the Visible Ops Handbook, codifying how to successfully transform IT organizations from “good to great.” In 2003, he co-chaired two conferences with SANS and the Software Engineering Institute, and was named by InfoWorld as one of the “Four Up and Coming CTOs to Watch.” Gene is certified in both IT management and audit processes, holding both ITIL Foundations and CISA certifications.
Tripwire helps over 6,500 enterprises worldwide reduce security risk, attain compliance, and increase operational efficiency across virtual and physical environments. With its industry-leading configuration assessment and change auditing software solutions, IT organizations achieve and maintain configuration control. Tripwire is headquartered in Portland, Ore., with offices worldwide.
The Author gives permission to link, post, distribute, or reference this article for any lawful purpose, provided attribution is made to the author, Tripwire, and to Information-Security-Resources.com.