In part 1, we described a data model where the entire business was mapped with its supporting processes and technologies. We were not only able to set an accountability matrix, but also set the stage for connecting other systems and metrics as we saw fit.
Choosing One Metric
From an org-wide metrics perspective, an industry standard is to define a Service Level Agreement (SLA) for each business process. In this example, we are going to use availability, or “uptime”: a 99.9x% SLA corresponds to a number of hours per year your service must be up and running. For example, with a 99.95% SLA, a typical year of 365 days * 24 hours a day comes to 8,760 hours, and the remaining 0.05% of that leaves about 4.38 hours of allowable downtime.
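The downtime-budget arithmetic above can be sketched in a few lines. This is just an illustration of the math, not our actual tooling, and the helper name is made up:

```python
def downtime_budget_hours(sla_percent: float, hours_in_period: float = 365 * 24) -> float:
    """Hours of allowed downtime for a given SLA over a period (default: one year)."""
    return hours_in_period * (1 - sla_percent / 100)

# A 99.95% SLA leaves about 4.38 hours of downtime per year;
# 99.9% leaves about 8.76 hours.
print(round(downtime_budget_hours(99.95), 2))  # 4.38
print(round(downtime_budget_hours(99.9), 2))   # 8.76
```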
Another question that comes up at this point: what counts as downtime? We determine it by how much an event affects the business process, and that can vary from organization to organization because of geographic factors, partial outages, and varying degrees of service disruption. At the end of the day, we put these all into a formula and categorize each event by severity as part of the main uptime calculation. For example, an application can suffer a Severity 1 incident for 1 hour, and that hour is docked against the downtime budget; it can also suffer many Severity 3 or 4 incidents, which are recorded but not counted against the SLA. This way, we are able to tie any event to a single measurement (for the purposes of this post we will keep it at one measurement).
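The severity rule described above can be sketched as a simple filter. This is a hypothetical illustration: the post only states that Severity 1 counts against the SLA and Severity 3/4 do not, so treating Severity 2 as counted here is an assumption:

```python
from dataclasses import dataclass

@dataclass
class Incident:
    severity: int          # 1 = most severe
    duration_hours: float

def sla_downtime(incidents: list[Incident],
                 counted_severities: frozenset = frozenset({1, 2})) -> float:
    """Total hours that count against the SLA budget.

    Lower-severity incidents are still recorded (they stay in the list),
    they just don't contribute to the downtime total.
    """
    return sum(i.duration_hours for i in incidents if i.severity in counted_severities)

incidents = [Incident(1, 1.0), Incident(3, 0.5), Incident(4, 2.0)]
print(sla_downtime(incidents))  # 1.0 -- only the Severity 1 hour counts
```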
From here, we had to design a system to track downtime and tie it to all three components from part 1. There is a lot of simplification here, but at the very base unit, we tied various events, incidents, cases, and other data points to our technology records. That is, if there is a disruption, we always tie it to a Technology, and from our data model we can then link the Technology to Services, and then to Capabilities.
So in a (simplified) nutshell: Support Cases, Incidents, Tech Assets (servers), and Requests are all tied to one Technology, and these then roll up to the Services/Capabilities data model we’ve built so far.
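The roll-up chain above can be sketched as two lookups. The mapping names and example records are illustrative (the real data model lives in dedicated systems), but the Recruiting example matches the one used later in this post:

```python
# Illustrative mappings: Technology -> Service -> Capability.
technology_to_service = {"Workday Integration": "Recruiting"}
service_to_capability = {"Recruiting": "Recruit to Retire"}

def roll_up(technology: str) -> tuple[str, str]:
    """Resolve a Technology to the Service and Capability it supports."""
    service = technology_to_service[technology]
    capability = service_to_capability[service]
    return service, capability

# A disruption recorded against the integration server rolls up to
# the Recruiting service and the Recruit to Retire capability.
print(roll_up("Workday Integration"))  # ('Recruiting', 'Recruit to Retire')
```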
Still with me here? Let’s use our previous example of Recruiting. The Workday integration server that pushes data gets corrupted, and all of a sudden the hiring team’s source of truth has been erased.
The moment the integration server goes down, our monitoring agents know immediately; if not, there are other ways to find out, like Chatter posts, word of mouth, or someone reporting that something is wrong. All paths create an Incident, which then automatically informs the Application Manager and Service Manager. The teams restart the integration, the issue is fixed within the hour, and all data is back to normal.
In this scenario, we’ve properly documented and tracked each step of the disruption, and can run follow-up processes like a Root Cause Analysis to prevent this same type of event from happening again.
With the data recorded, we can tie 1 hour of disruption to the Workday integration Technology, which affects the Recruiting Service, which affects the Recruit to Retire business capability.
So what does the data look like?
Taking the hours of downtime from any source into consideration, we can get a view of which Technologies were down in a given time period:
While the technologies look red and yellow, what really matters is how they roll up to Services:
As you can see, we can track how each Service Manager is performing, as well as prioritize which Services to proactively monitor and perform preventative maintenance on. This view is still too granular, though, so we roll up once again to Capabilities…
Here is what we call the CEO view, where, at a glance, the CEO can determine the health of the organization and check in with other leaders about any issues they need to be on the lookout for.
Our Operations teams also track which Root Cause Analysis projects are in flight:
Obviously, there is a lot of simplification in what is going on here, and we are only zeroing in on one specific measurement and slice of data. There are other systems in place as well as other metrics we care about, but I hope this gives a partial look at how we run operations at Salesforce. I gave a Dreamforce talk on this here: https://www.salesforce.com/video/3620722/ , where our computer crashed at the beginning and all our demos broke, but it may still be worth a listen. Please do let me know if you want me to discuss other topics in this realm and I’d be happy to post what I am allowed to.