Here's your copy of
Incident Management Basics
Transcript of webinar
Introduction:
Joseph: The presenter for this webinar is Neil Thomas, managing director and VP, Professional Services at Service Sphere. He has extensive industry experience in IT, IT support and ITIL consulting, and he specializes in service catalog and CMDB. Neil, take it over.
Neil: Brilliant, thank you Joseph. Welcome everybody, and good to have you back on this webinar series, the third in a row that we've been doing. So that's all about me.
Slide 3: What we do. Very little about what we do, but my company changed its name just very recently to SERVICE SYMMETRY; we're an offshoot of what was Service Sphere. So we do a whole load of bits and pieces around ITIL: ITIL training, Service Desk Institute training, consultancy, learning, all those good things you've got there. So if you're interested, get in touch with me. But more important is the fact that all that good stuff has allowed me to have many years of experience within the industry, especially in terms of ITIL as a consultant, but also as a product manager working for a vendor, constructing and building IT service management product sets right the way through from incident to the Service Catalogue. So I have a pretty extensive knowledge of how this stuff works, both from the inside and also how people in the real world actually take it on and use it.
The Webinar Series - 1:27
Slide 4: So without further ado, this is the third in the series with ManageEngine, really taking us right through from the Service Catalogue, talking about how services affect the business and are used by the business, to what delivers them, that is the stuff in the configuration management database, configuration items and the like, and then of course the good stuff that we're all familiar with of incident, problem and change. And then the last in the series, in a few weeks' time, will be how do we measure service desk performance.
So today is all about incident, and it's very interesting to see quite a few of you on the call today. Why are people so interested in incident? It's been around for a huge number of years, and what we're looking at is getting back to some of the basics. It's very interesting that we should be doing this because, in my role as a consultant, I come across so many organizations that, you know, want to do incident management, they tick the box, and unfortunately they get it wrong. It's so often that the fundamental principles of incident management are forgotten. So ManageEngine, being the good people they are, thought it would be a good time to just revisit some of those things. So for those of you who are out there already, it's a bit of a catch-up; for those of you who are just first into the game and want to know a bit more about it, this covers some of the basics that we have around incident management.
Topics Today - 2:48
Slide 5: So we're going to be looking at where it fits within the whole ITIL framework. Just doing incident isn't enough, because there are so many other bits and pieces that you don't necessarily have to do, but you certainly have to understand and consider, to put the thing in its context. We're looking at a bit of how it fits in the service desk, how incident compares to service requests, and at incident workflows, looking at a bit of knowledge and how that fits in, as well as the whole thing of service level agreements and the good old relationship between incident and problem and incident and change. It's all part of the same ITIL framework.
Slide 6: So without further ado, we come on to a slide you've seen several times before. These are the services that the company puts out that IT is supporting and delivering. So these are things that you might find in the Service Catalogue, but certainly are things the company does, from sales to marketing to field support to credit check in finance, you name it, all the pieces, all the functions within an organization that it needs to survive. Of course it's supported by IT and IT services, and of course those IT services sometimes don't perform quite as they should, which is where you get the whole incident thing coming in.
Slide 7: So let's have a look at that in context, so those are services right in the middle and ITIL is there and IT service management is there to support those services and that's the key thing incident management doesn't just exist to make us feel good and to get us in to work on a Monday, although it does provide a lot of people with a lot of jobs. No, incident management is all about making sure the services that are delivered to an organization carry on being delivered because without them the company is not being as efficient as it should and it's likely to have trouble.
So the user needs something, to have a service delivered to them by service request or through the Service Catalogue, and if something goes wrong with that service, that's the good old job of incident management. It is fundamental that if something breaks, we need to fix it; we need to get that service back up and running, because every minute that service is down it's costing the company money, time, inefficiency, competitors are getting in the way, etc. So it's a very vital piece of the whole jigsaw. Obviously allied to that is how quickly those people, those services, are supported. So the whole idea of service level management comes in here; we touch on that briefly today.
Now if something keeps going wrong, the link between incident and problem comes in here. If something keeps going wrong, then obviously we move over to the whole principle of problem management, and these are processes, best practices, that have been put out by ITIL as part of their version 3 framework.
Now if it keeps going wrong, we find out what the issue is and we want to fix it, then obviously we start looking into change management, changing things in a very rigorous and defined process manner, so that we don't have risk introduced to the organization; we manage risk out of it. How we deliver those changes to the business is all to do with release management. We then have what delivers the services, configuration management; how we make sure that all those services are there in the future, all those other esoteric bits and pieces that we get from ITIL, from availability, capacity and service continuity management; and then of course finally we get through to the managing of those services with the service portfolio and financial management. But you see here that the delivery of service, the stuff of business, the stuff that makes business work, is absolutely vital and a cornerstone of that, which is why everybody sort of does incident management; it comes down to the fact that if it goes wrong we need to get it up and running quickly.
Slide 8: So, on we go to incident management. Really this slide recaps just what I've been wittering on about: it's all there to restore normal service as quickly as possible. I think certainly the people on the service desk, any of you guys that are out there, will recognize this, and certainly I do from my own point of view, having done this way back in my history. The key thing that users want, and get frustrated about, is getting on with their job. They know that every minute that they are not working, there is a downside to the business, that inefficiencies could come in, and sometimes of course, with big outages, with big problems, we need to get on these things quickly. So having a good incident management process and being able to manage this properly is absolutely vital. So I guess a definition, good old ITIL stuff here: an incident is any event that disrupts or could disrupt one of those services that we've been talking about.
Key Elements - 7:34
Slide 9: So the key things, some key facts I suppose, about incidents. Well, really an incident could apply to just about anything; it doesn't have to be IT. Those of you who've got combined service desks know that it could be about facilities, it could be about HR, it could be about services or IT equipment. It can span a whole range of different things; the span and the breadth is really down to you guys. And of course these days they can be reported by just about any means possible, not just people, as in my day, running up to you and saying "look, I've got a problem, how do I get to do this?" or "my computer's broken", but reported by email, by phone, by self-service. Now we get into the whole social media aspect, with things like Twitter and Facebook, something that I've been working on very recently.
Another area where things go wrong, if you like, is in the infrastructure, and there's a whole piece that's now been carved out in ITIL version 3 called event management, where events that are detected by network management tool sets automatically raise an incident and, coincidentally, can automatically close incidents as well. But at least you have visibility of what has happened. I'm not going to touch on that very much today, except to say that incidents can be raised in that particular way. Obviously all these incidents are normally reported and recorded in the service desk to ensure there is somebody who is managing the whole process and working out what's going on. And all that stuff produces data; data is absolutely vital in ITIL to be able to carry on and improve and resolve services, making sure that you can improve things.
9:16
Slide 10: So there we have the key elements going down a bit more detail detecting them, recording them, classifying, investigating, diagnosing them and then resolving them and then recovery and then closing it down and in that closure, so this sort of process that ITIL has put forward, the whole point here is that the right information is kept and captured all the way through and that the process is followed.
ITIL is all about best practice, and with incident management you have to stick to the process, or processes as we'll come on to later. It's very important to make sure all the different pieces are adhered to, otherwise things start going wrong. Certainly with incident management I've seen so many groups where they have a process and fantastic tool sets, but the problem is that they just put in garbage data, and of course when you're putting garbage data in, out of the back end comes garbage data: garbage in, garbage out. You need to make sure that people put in the data in the first place, that the diagnosis is accurate, that it is checked, and that the resolutions, once they come in and are applied to those incidents, are also correct. And obviously a part of this as well is the whole aspect of ownership: a person or a group within the Service Desk owning an incident, monitoring it, working out who to tap into, tracking it through the various groups and communicating to the various groups within the service desk, and of course to the good old user at the end, who is the person sitting there waiting for that service to be restored to him. Incident management, ITIL and the Service Desk do not exist for their own benefit; they exist for the benefit of the organization, to ensure that services are restored as quickly as possible.
Slide 11: So there is an incident management process showing the different feeds in from the top, and it's worth just running through some of these things here whichever way it comes in, we identify the incident we'd start categorizing and ITIL and indeed the whole service desk principle is about trying to manage things through classification. We look at the types, we look at the different pre-existing types that we've got, so that we can start working out whether it's hardware, software, a particular piece of software, a particular person or time of day or group. So we want to start understanding data and then once we've got the data we can look at it and use trend analysis to ensure that we understand what's going on.
So to go with that data, we categorize it, and we'll come on to some of these pieces in a minute. Is it a service request? A service request is something that somebody wants, and as I say we'll come back to that in a minute. If it isn't, we then start saying, right, what's the priority? And of course we prioritize according to the service that is being affected. Is it a major incident, again something that we'll come on to in a minute, or is it just a general run-of-the-mill thing? If it's major it gets treated in a different way. You then do an initial diagnosis, a quick look at it. Now, does that incident need to be escalated? Have we got all the data? Have we got the skill set and the tools to be able to resolve that here and now? If we have, we get on and do it. If we haven't, then it goes off to other groups for a hierarchical or peer escalation. And of course as we come into the end of the process we look at resolving it, making sure we've got the resolution accurately put away within the incident, and then we look at the recovery and what needs to be done to get back to a full level of service. It may be more than just telling the chap at the other end how you fixed it, but also what he needs to do to get back up to speed, to get data restored, whatever it happens to be.
And finally, of course there's an incident closure, really we should be looking at with an incident closure, making sure that the user is happy with what has happened. So often I see incidents being closed by different teams and what's happened is that they closed it down without even referring back to the user. Well, you've got to find out whether the user actually is happy with what has happened. So a part of a good closure is either customer satisfaction or going back to the person making sure that it's resolved to their satisfaction.
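As an illustration only, the flow just described (identify, categorize, prioritize, diagnose, optionally escalate, resolve, recover, close) could be sketched as a small state machine. The stage names, the `Incident` class and the `user_confirmed` flag are assumptions made for this sketch, not anything from ITIL or a particular tool, and the service request branch is left out for brevity.

```python
from enum import Enum


class Stage(Enum):
    IDENTIFIED = 1
    CATEGORIZED = 2
    PRIORITIZED = 3
    DIAGNOSED = 4
    ESCALATED = 5
    RESOLVED = 6
    RECOVERED = 7
    CLOSED = 8


# Allowed moves between stages (one illustrative reading of the slide).
ALLOWED = {
    Stage.IDENTIFIED: {Stage.CATEGORIZED},
    Stage.CATEGORIZED: {Stage.PRIORITIZED},
    Stage.PRIORITIZED: {Stage.DIAGNOSED},
    # After initial diagnosis: fix it here and now, or escalate.
    Stage.DIAGNOSED: {Stage.ESCALATED, Stage.RESOLVED},
    # An escalated incident may be passed on again before it is resolved.
    Stage.ESCALATED: {Stage.ESCALATED, Stage.RESOLVED},
    Stage.RESOLVED: {Stage.RECOVERED},
    Stage.RECOVERED: {Stage.CLOSED},
    Stage.CLOSED: set(),
}


class Incident:
    """Hypothetical incident record that only allows valid stage changes."""

    def __init__(self, summary):
        self.summary = summary
        self.stage = Stage.IDENTIFIED
        self.history = [Stage.IDENTIFIED]

    def advance(self, new_stage, user_confirmed=False):
        if new_stage not in ALLOWED[self.stage]:
            raise ValueError(
                f"cannot go from {self.stage.name} to {new_stage.name}"
            )
        # Closure requires checking back with the user first.
        if new_stage is Stage.CLOSED and not user_confirmed:
            raise ValueError("check with the user before closing")
        self.stage = new_stage
        self.history.append(new_stage)
```

In this sketch, calling `advance(Stage.CLOSED)` without `user_confirmed=True` raises an error, which mirrors the point that an incident should not be closed without referring back to the user.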
Slide 12: So here we go, going into a bit more detail. We record it; it's normally recorded at the service desk. We record all incidents, although we'll talk in a few minutes' time about whether we record absolutely every single one, and we have a look at whether it complies with the service level agreement for a particular service. So when we start looking at the service that's being affected, we look at the service level that has already been agreed to make sure that we are going to treat it in the right way. Record all relevant data: that little sentence there, as I was just saying a couple of minutes ago, is a very short four-worded sentence, but it encapsulates a whole host of trouble for a lot of people. A lot of people in service desks do not record properly, and what happens is, as I say, you put garbage in, you get garbage out. It's very difficult to go back and learn from mistakes, to use those incidents to recover and restore from other things of a similar nature that happen in the future.
So it's very important that the recording basis is done properly, and obviously there's the facility for other people not on the service desk to report incidents quickly as well. I would encourage those of you who haven't yet done so to get yourselves into some sort of self-service mode, whatever that happens to be, by email or on the internet or even by Facebook and Twitter, and to really consider that as a mechanism for getting data into the system for logging calls. It does mean, however, talking about the data side, that we have to consider quite carefully what has happened to the data, and maybe have to massage it to make sure that we understand what the true problem, the true incident, is and get it up to speed, so that it sits in the database with sufficient data.
Slide 13: We then categorize it. Now, I mentioned this categorization before; there are two aspects really. We're trying to determine the incident type, you know, for example, the IT service has been degraded, and the thing that is affected. We talked last time about the things that deliver service, those things being assets or, as we call them in ITIL, configuration items. We look at the ones that have been affected, and when we note it down on the incident we really should understand the type of service that's being affected and indeed the configuration item that's also been affected by the incident, because both of those things will dictate how quickly we should then respond to it. And obviously we should use some sort of standardized coding criteria. Normally these things are very well handled by tool sets such as you have with ManageEngine, so that it's done for you and you just have to pick from a predetermined list; the tool sets are just perfect for that. ITIL cries out, and has been crying out for a long time, for good tool sets to help people do this stuff more effectively, and of course that's what we have here.
Slide 14: We then prioritize it, and of course all these things here, categorizing and then especially the prioritization, are all to do with the service. We don't just look at an incident and say "well, that feels like it, oh, I think that's quite severe. Who's on the call? Oh, it's somebody very nice, a mate of mine, I'm going to do that one for him". No, we do it on the basis of a set routine, a set number of things like how important that service is to the organization. We can't really go into service level agreements much in this webinar, but really it's about making sure the agreements are there between IT and the business, so that IT understands what is important. So if the business says the finance department at month end is absolutely vital, you have to keep them up and running, they're your priority at month end, then sure enough you have to make sure the finance people are the most important people and you prioritize them.
So here we have an example list of prioritization and severity. At level 4, there's no business impact, there's no loss of service. Level 3, a minor loss of service, right the way up to a complete loss of service. Now some of these things are very standard things that you get from the ITSM books, or you can make them up yourselves, and some of the tools have these predefined in them. But it is worth being quite simple on these things. I wouldn't go into a lot of priorities and severities. There's one organization I saw the other day which had 18 different levels of severity. When you come to use that on the desk it's very difficult to work out: is it a 17 or a 16? Hmm, I don't know, is the sun shining? It gets very, very esoteric at that point. No: very, very clear priority and severity levels, as you have here, maybe 1 to 4 or 1 to 5. And then of course once you've done that, you move on.
Slide 15: Then it's a case of escalating it, and the tool sets help you escalate very, very effectively, because what we need to do is manage what happens to those incidents. If the first person who gets hold of it within an incident process cannot resolve it, hasn't got the tool set or the experience, then obviously it needs to be escalated as quickly as possible to allocate the right resources, either horizontally, i.e. to someone within the same group, the same incident management function, or vertically into other groups, specifically a second line or third line, however you like to look at it.
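A minimal sketch of a short priority scale like the one on the slide, assuming four levels. Only levels 4 (no loss of service), 3 (minor loss) and the top level (complete loss) come from the talk; the label for level 2 and the `prioritize` lookup are assumptions added for illustration.

```python
from enum import IntEnum


class Priority(IntEnum):
    COMPLETE_LOSS = 1   # complete loss of service
    MAJOR_LOSS = 2      # assumed label; the talk does not name this level
    MINOR_LOSS = 3      # minor loss of service
    NO_IMPACT = 4       # no business impact, no loss of service


# Hypothetical mapping from an assessed service loss to a priority level.
_LOSS_TO_PRIORITY = {
    "complete": Priority.COMPLETE_LOSS,
    "major": Priority.MAJOR_LOSS,
    "minor": Priority.MINOR_LOSS,
    "none": Priority.NO_IMPACT,
}


def prioritize(service_loss):
    """Return the priority for a service loss of 'none', 'minor', 'major' or 'complete'."""
    try:
        return _LOSS_TO_PRIORITY[service_loss]
    except KeyError:
        raise ValueError(f"unknown service loss: {service_loss!r}") from None
```

Keeping the scale this small means the person on the desk only ever chooses between a handful of clearly different outcomes, which is the point being made about the 18-level scheme.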
And obviously, there should be rules around how those things escalate, and one of the things that we do need to have here is the fabled bounce count. We need to work that out and make sure that it stays very low. Bounce count, for those who've not come across this somewhat entertaining term, is all to do with the number of times an incident bounces back and forth, if you like, between the different groups. It's a bit like hot potatoes: someone says "oh, I don't want this, it's a bit too complex" and hands it on to another group, who think exactly the same and pass it on to yet another group. Incidents can go rebounding around the system like a ball around a squash court many times if you don't keep a careful eye on how many times it's actually been passed from one group to another, and that's the function of the service desk or whoever is owning that particular incident. I would say ownership is a vital piece of any incident management or service management process. And I've said that right at the bottom again: accurate data. I should have put it on every slide actually. Every time anything is touched within an incident process, accurate data all the way through.
Slide 16: And then of course, we come to the end of the process where we have to resolve, recover and restore. So when we come to resolve it, and you'll see this in a minute, we look for known errors, I'll explain that in a second, or existing workarounds. We resolve the incident with the solution or workaround, and sometimes the solution will mean that a request for change, an RFC, needs to be submitted, so the incident isn't just an end in itself; it actually does move on, as we'll see in a second, to being resolved through the problem process and indeed through the change process as well. And let's not forget, and I think this should be the mantra, this piece in red should be stuck across everybody's screen, or maybe we can talk to ManageEngine and make it flash up on people's desktops every three seconds: the goal of incident management is to restore service. That's what it's all about. It would be best for the whole organization if nothing ever failed, but that's a nirvana scenario which doesn't happen. Things happen, people happen, equipment fails, people don't turn up for work because they're sick, things just don't work. A good incident management process restores the business to normal operation as quickly as possible. That is our mantra; we should almost be saying it every couple of seconds as we're walking around trying to fix incidents. That's the goal, and that's really what incident management is all about, that process that I've just run through.
Slide 17: So, some key things I'll just sort of throw out here that are very important. They don't particularly fit anywhere nicely, so I've put them right in the middle of the presentation. Take ownership of an incident, either as an individual or as a group, and tool sets that allow you to do that let you do it even better. I've got a workload list of the things that I am doing; with the person responsible recorded you can say, right, who's got this? Ah, George over there has got this particular incident, he's going to run it through. There's a contact point for the user, there's someone who's got a responsibility to make these things work. If he can't, he has a responsibility to pass it off to someone who can, and depending on how you are working, he either keeps hold of the incident, manages it on the way through and ensures other people do the job, or he passes it off and ensures that the next person takes the ownership, the baton if you like of the relay race, on with them.
We want to keep a focus on the incident. There's no blindsiding, forgive me if I use the wrong term here, for those of you in the States, but no blindsiding; we want to make sure that the incident is the key thing. You don't want to be put off by other things that crop up that we find within the incident. One incident may spawn other incidents of a different type, that's fine, but what we need to do is keep really, really focused on the thing that is in hand and not get put off by other things around the edge. Escalate to the right people at the right time in an appropriately swift manner, and also swiftly escalate to management if there is a problem in terms of doing things or getting stuff done.
That often happens in service desks: things just sort of dribble around and sort of sit there because people don't like to touch them. It's almost like a negative bounce count. With bounce count, one very hot potato incident goes round to different people; here you've almost got the opposite, the type of incident that sits there, nobody wants to touch it and nothing ever happens to it. Again, that's what managers are there for, and that's why a service desk and good service desk management help people understand and take ownership of the problem.
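To make the bounce count concrete, here is a small sketch that counts how many times an incident has been handed from one group to a different one. The list-of-group-names representation and the threshold of 3 are assumptions for illustration; a real tool would keep this in the incident's assignment history.

```python
# Assumed threshold: more hand-offs than this and the owner should step in.
MAX_BOUNCES = 3


def bounce_count(assignments):
    """Count hand-offs between different groups in an assignment history.

    `assignments` is the ordered list of groups that have held the incident,
    e.g. ["Service Desk", "Network", "Server"].
    """
    return sum(
        1 for prev, cur in zip(assignments, assignments[1:]) if prev != cur
    )


def needs_owner_attention(assignments, limit=MAX_BOUNCES):
    """True when the incident has bounced more often than the agreed limit."""
    return bounce_count(assignments) > limit
```

For example, an incident that went Service Desk, Network, Server, Network, Desktop has a bounce count of 4 and would be flagged under the assumed limit.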
I suppose, this next point should be in pulsating red letters "keeping the customer informed" again the customer is the key point here and I think actually these days we all understand that the customer is the user that's on the end of the phone to us, because we're often the customer on the end of the phone to other people in our everyday lives when we use IT equipment we're on to the bank at home or whatever we're doing on the Internet, we like to have service given to us very quickly, we like to be served and looked after very, very effectively and very well. So we understand these things now and being informed about what's going on is, is a really important part of ensuring that the business carries on because if the customer knows they're not going to get a speedy delivery because of whatever or resolution, then maybe they can do something to help themselves by doing something else keep them informed, keep them happy.
Act as an interface between the different groups again if you're owning it, but other people doing things on it and then keep track of time and activities all very it's all very good just fixing one problem but if your workload list has got 20 on it, just keep an eye on those other things, it's very important to make sure that the priority calls that are coming in are looked after. You may be doing something for somebody that is important but if the if you're Amazon and the main website goes down the ordering piece then really you have to get on to that as quickly as possible, it's all to do with those services.
Slide 18: And going back to a different view into the management process obviously as we know from ITIL that although there are a couple of set processes within the, the ITIL framework. There are many interpretations and many implementations that's the good thing about ITIL. It's flexible, it is best practices, and it gives you ideas of how to do things. So look at your incident process and see how it can be adapted to your business to your organization and see if tweaks can work this particular one is quite an interesting one, because we have the customer to stop contacting the Service Desk creation of a ticket issue being identified, is it an incident? But no, it might be a service request more about in a minute. If it is, we come on to this thing called known errors and outages if it is, is there a workaround that we can apply, if not then it goes for escalation. So what we're going to be doing now is moving into the whole area of how incidents interrelate with things like problems and those sorts of things as well.
Slide 19: First off though, touching on service level agreements, these are the things within an incident that are so important to be aware of, because these are the things that allow us to identify which thing to work on first. Now don't forget service level agreements are those things that should be, but not always are, should be negotiated and agreed with the organization of a certain level of response. It doesn't always happen sometimes SLA just imposed by IT on the rest of the organization, which sometimes has to happen because that's just the way it works and organizations don't necessarily always help, but these things should be agreed with the organization.
So they know that when something happens they're going to get a response in a certain time frame, again we're setting expectations we're helping people out, we're trying to be good members of the organization. So they can work within the framework, if something takes four hours or eight hours or two weeks to fix that's how long it takes. But we need to tell the business that and get their agreement on that. Therefore we can start looking at ways around it to ensure the business carries working. We work with the business in terms of the SLAs, Different AFT different SLA, different things, different priorities, and different configuration items. So a server probably has a much higher priority than someone's desktop PC, unless maybe that desktop PC is something to do with the payroll system at the end of the month.
A particular user, you know, the MD, the managing director, chief executive is always an important person to get it right, but are they actually more important than the sales guy at the end of the month. He's trying to get some get some contracts through and you can't get a particular email out system for whatever reason the attachment doesn't work, sometimes those people at the end of the month when they're trying to get contracting are more important than the even the chief executive, however it might appear the other way around very important to be very clear about what is important, who is important and when is important and it's got to be appropriate to the organization. Obviously different organizations vary so considerably one to another. But again has come back to service what are the services and the aim here is to restore those things as soon as possible, given the importance of the service. Now I keep going on about this, but it is actually so vital that we get that into our heads and make sure when you when you set the tool set up a lot of these things have got default settings. Just have a look at them, check them through, see if they fit your organization don't necessarily take them as gospel. Ok, which brings us into the whole area of how incidents linked into problems and other things, they don't just exist on their own.
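As a rough sketch of how agreed SLA targets can decide "which thing to work on first", the example below orders a workload list by the time each incident is due. The target durations per priority level (the four hours, eight hours and two weeks mentioned above, plus an assumed two days) and the tuple layout are illustrative assumptions; in practice the targets are whatever has been agreed with the business.

```python
from datetime import datetime, timedelta

# Hypothetical resolution targets per priority level (1 = most urgent).
SLA_TARGETS = {
    1: timedelta(hours=4),
    2: timedelta(hours=8),
    3: timedelta(days=2),   # assumed value, not from the talk
    4: timedelta(weeks=2),
}


def due_time(opened_at, priority):
    """Time by which the incident should be resolved under the assumed SLA."""
    return opened_at + SLA_TARGETS[priority]


def work_order(incidents):
    """Order incidents by soonest SLA due time.

    `incidents` is a list of (incident_id, opened_at, priority) tuples.
    Returns the incident ids, most urgent first.
    """
    ranked = sorted(incidents, key=lambda inc: due_time(inc[1], inc[2]))
    return [incident_id for incident_id, _, _ in ranked]
```

Note that ordering by due time rather than by priority alone means an older, lower-priority incident can overtake a newer, higher-priority one once its own deadline gets close.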
Slide 20: And here we have the idea of a major incident, but what's the difference in a major incident and just a normal incident, well, it's very hard to hard to assess, I think this will depend on the organization and in turn and in some senses it's going to be a question of looking at the service look at the thing that's affected and coming to a judgmental decision. But basically these are things that are happening unplanned or temporary interruptions to a service with severe negative consequences. Here is an idea, here above an iceberg where you have the nice visible fluffy portion of floating above the ways and beneath it the Titanic sinking rock that is floating around underneath that is that is absolutely there to sink an organization.
So the more you have of an incident, the more likely it is that you're going to be having a particular major problem on route. So the more you have, the more likely it is that you've got one, which is what this is trying to show. Here but also major incident the incidents that combine to create one can be very many and varied and not necessarily, obviously interrelated.
They could be between things that are happening or processes, human error, oversight, shortcuts. Things are just going on, it could be a number of these things all coming together. But as you come into the more and more of these things and you get a huge iceberg floating around, that's that when a major incident can start coming together almost like a typhoon that comes together to then wreck the organization and that's where we need to be very clear and this is where a good tool and a good service desk is absolutely likely to spot the disparate trends come in. Now some things are fairly straight forward if something goes down, if a server goes down takes a major application with it you're going to get the service desk flooded with a number of calls about the same thing, that's probably fairly, obviously a major incident depending on obviously the service that's gone down.
But do take care to think about the breadth of things that may be under the water that you can't see that could be equating to a major incident and that's where we start bringing problem management and we start thinking about I had a few of these things and they might gut feel is, this is indicating that we've got something else going on and this is where incidents lead on to problems and a whole problem process and the whole point around the problem process is, that we can get access to a bit more data we go into things in a bit more rigor than we otherwise would do.
Slide 21: So from a problem point of view, it's trying to get to the cause, probably unknown, of one or more incidents. The activities someone would use are things like root cause analysis, looking at workarounds, and systematic removal of the root cause, eventually by change requests. The key thing we're looking at is what the underlying problem is, and there is a process in ITIL, a problem process, that we can use to do that.
So what we normally have is a number of incidents coming together, either of the same type or, as I've just been suggesting, things that are vaguely connected and that suggest something much more fundamental, and we use them together to ensure that we resolve these things properly.
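As a rough illustration of spotting incidents that "come together", here is a minimal Python sketch that flags configuration items with repeated incidents as candidates for a problem record. The field name configuration_item, the threshold of three and the function name problem_candidates are all assumptions made for the example; they are not taken from ITIL or from any particular tool, and grouping by configuration item is only one of many ways incidents might be related.

```python
from collections import Counter
from typing import Dict, Iterable, List


def problem_candidates(
    incidents: Iterable[Dict[str, str]], threshold: int = 3
) -> List[str]:
    """Return configuration items that have at least `threshold` incidents.

    Grouping by configuration item is an illustrative assumption; a real
    service desk might also group by category, service or symptom.
    """
    counts = Counter(incident["configuration_item"] for incident in incidents)
    return sorted(ci for ci, n in counts.items() if n >= threshold)


# Hypothetical sample data
incidents = [
    {"id": "INC-1", "configuration_item": "mail-server-01"},
    {"id": "INC-2", "configuration_item": "printer-07"},
    {"id": "INC-3", "configuration_item": "mail-server-01"},
    {"id": "INC-4", "configuration_item": "mail-server-01"},
]
candidates = problem_candidates(incidents)
```

Anything the function returns is only a prompt for someone to look more closely; the gut feel Neil describes still has to come from the service desk.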
Slide 22: Now when we do identify the key underlying cause, and it might take many incidents to understand the root cause, we get what's known as a known error. It's something that's sitting there: we know that if we poke this particular system in a particular way, it's always going to respond in a particular way; it falls over or we get a particular error message. So that is something that we know and we can publish; we can communicate that sort of thing.
But also, when we're looking in the incident process at resolving some of these things, we need to understand what those known errors already are, so that we can readily identify them. That helps in two ways: one, we can tell users about them, so that they don't have to keep phoning up for us to tell them; but also we can then identify them very straightforwardly and see what's happening. And then, if possible, we give people a workaround. If a workaround does exist, it's recorded in the system as a workaround, and we can start using that to quickly get people back up and running if we have one of these known issues floating around. But it must be part of the incident process: we have this thing that's just cropped up, so have we got something to resolve it?
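The check Neil describes ("have we got something to resolve it?") can be sketched as a lookup of a new incident against a list of known errors. This is a minimal Python illustration only: the KnownError structure, the keyword matching and the sample entries are hypothetical, and a real tool would use its own known error database and search.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class KnownError:
    error_id: str
    symptom_keywords: List[str]   # all keywords must appear in the description
    workaround: Optional[str]     # None when no workaround exists yet


# Hypothetical known errors, for illustration only
KNOWN_ERRORS = [
    KnownError("KE-001", ["outlook", "crash"], "Start Outlook in safe mode"),
    KnownError("KE-002", ["vpn", "timeout"], None),
]


def match_known_error(
    description: str, known_errors: List[KnownError]
) -> Optional[KnownError]:
    """Return the first known error whose keywords all occur in the description."""
    text = description.lower()
    for known_error in known_errors:
        if all(keyword in text for keyword in known_error.symptom_keywords):
            return known_error
    return None


match = match_known_error(
    "Outlook keeps crashing when I open a calendar invite", KNOWN_ERRORS
)
```

If a match has a workaround, the service desk can offer it straight away; if it has none, the incident can at least be linked to the known error.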
Slide 23: There's our incident process again, looking at the known errors, workarounds, etc.
Slide 24: Which brings us on to another area, just before we get on to the change side: another way that we can help incidents to be resolved. We've got the ability to use knowledge and incidents together in two particular ways. One is in self-service: people are coming in to log a call because they've got a problem or an issue, or they don't understand something, and there's the whole concept of self-help within things like self-service.
So that means going into the knowledge base or an FAQ document, and using script-based help if it's available in the tool set. What we're trying to do here is to guide users through doing things themselves to try to sort their own problem out. Those are still incidents and we still need to record that fact, because we need to understand how often these things are cropping up in everyday life, and we need to make sure that we understand what those things are, so that we can fix them or make things easier. So that's the first way that we can use knowledge.
The second way is turning the whole thing round and looking at how incidents can construct knowledge. What's here is that we've got incidents that get put into the system. They obviously contain a description of the issue that's actually occurred. Now, we've been talking about doing data properly: this works if we've made sure that the data is put in in a good format in the first place, and maybe we have someone who goes into the incidents afterwards and makes sure that they've been done properly.
We can then use those incidents and their descriptions within the knowledge base as something to be searched on, so that in the future we can find out what the resolution for that particular incident was. We can either publish that so the user can help themselves, or it's an available resource for the service desk to go and sort out their own problems and see if it's been fixed in the past. So basically knowledge informs both ways: helping people out as they come in through self-service, and becoming part of the knowledge base itself to help resolve problems in the future. Incidents also inform the problem process, i.e. they're part of its input: the descriptions, the things that have been tried and some of the issues go into the problem process and again can inform the whole resolution piece for a particular problem.
Slide 25: And of course any talk about incidents wouldn't be finished if we didn't talk about its application within the service desk as well. I've talked about it all the way through, to be honest: the service desk is the choice, and it uses the incident management process almost like a light saber to protect itself and protect the organization. That comes through incident logging; ensuring that customers get the satisfaction they need, with incidents resolved as fast as they can be; prioritizing those calls efficiently and effectively; and providing that first-line support, trying to resolve calls on the first go. So if someone rings up, the best way of doing it is to resolve it there and then, rather than have to push it through to a set of second-line or third-line experts, a networking group or whatever it happens to be. If we can fix it on first line, the customer gets up and running as quickly as possible, and service is guaranteed and maintained as fast as it can be.
Then there's escalating to other members of the team if we don't know the answer ourselves, or to other teams who have different expertise, and obviously communication between those groups and back to the customer. And finally the service desk is, and should be, responsible for collating, measuring and bringing together all those metrics that service management produces: how many did we have? How many of this particular category? How many for this configuration item? What severity? How many for this service? And then we start looking at that data to highlight where we can improve service. ITIL version 3 brought in this whole concept of continual service improvement, to be honest as if it was some great thing that no one else had ever thought of, but of course people have been doing it for years. You guys out there have long been looking at the service and trying to improve it, seeing where it could be done.
ITIL version 3 put a name to it, which is the whole idea of continual service improvement, but it's done by looking at the data that we've been capturing, analysing the trends, going over the same thing and seeing if we can improve where things have been breaking. So it's almost like problem management on the data of the service desk itself. The service desk is obviously an absolutely vital function, and the incident piece is a vital part of its armory.
Slide 26: But I would say, just as a word of warning, and this is coming from years and years of experience of what I've seen with different service desks: do know when to stop when it comes to looking at that data. It's possible, and I've seen it sometimes, that people put so much time and effort into analysing every last detail and capturing every last piece of information that it takes longer to log a call than it should, and we start being counterproductive by slowing the whole process down.
We do want to understand which configuration item is affected, but we don't need to ask the person for the serial number, the RAM and the rest of it. If that's important, then we should start thinking about where it should come from, maybe configuration or asset management databases. But we need to start looking at what's appropriate management information to manage on. An end-of-month report that says we've closed 4,000 calls, 4,000 incidents, is a great statistic, but what does it actually mean? Better to say that we've closed the one call that caused the company's website to shut down. That's the sort of thing: it's the resolution of service that's important, it's the significance of the data that we're putting forward.
Those of you who know about network management know that we have these things called SNMP traps, the things that routers, servers and the like send out when they are having a crisis. Occasionally network hubs and switches can produce many, many thousands of these things. Now just telling people that we've had 45,000 traps is absolutely immaterial, except if one of them happens to take down the key service that is making the company an awful lot of money. We have to make sure that the data is relevant and significant, so we should be talking about things like the amount of time the services were up and available for use, and the number of people that gave good customer satisfaction ratings, whether they're external or internal customers: key facts, key data that is important to that organization.
So in terms of the key performance indicators that we should be looking at, again it comes down to what's important. If you're Amazon, it's making sure you have 100% uptime on your web store. If it's a manufacturing plant, it's making sure that you are producing to within whatever tolerance your machines happen to be rated for, whether that's 98%, 100% or 50%. Whatever is important to your organization should be put in and then measured against.
And some of the key performance indicators I would suggest for the service desk are things like customer satisfaction, because that really does tell you how you're doing. The customers are the people you're delivering the service to, on behalf of the business; if they're happy, generally it means that things are going pretty well.
Time to resolution is obviously a very important one, and so is key incident resolution, key incidents being those that affect key services: how quickly were they resolved, and were they resolved within the agreed service level? So if you've agreed a two-hour service level for some of your key applications and services and you're hitting it, then you're good, you're on target. If you start wandering over the top of that, then obviously that's where the business starts to be at risk of not being able to do what it needs to do.
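To make the two-hour example concrete, here is a small Python sketch that works out what share of key incidents were resolved within the agreed service level. The function name sla_compliance and the sample resolution times are invented for illustration; only the two-hour target comes from the talk.

```python
from datetime import timedelta
from typing import List, Optional


def sla_compliance(
    resolution_times: List[timedelta], target: timedelta
) -> Optional[float]:
    """Percentage of incidents resolved within the target, or None if no data."""
    if not resolution_times:
        return None
    met = sum(1 for elapsed in resolution_times if elapsed <= target)
    return 100.0 * met / len(resolution_times)


# Hypothetical resolution times for four key incidents
key_incident_times = [
    timedelta(hours=1),
    timedelta(hours=1, minutes=30),
    timedelta(hours=3),
    timedelta(hours=2),
]
compliance = sla_compliance(key_incident_times, target=timedelta(hours=2))
```

Here three of the four incidents meet the two-hour target, so the figure is 75%. Reported per key service rather than across all calls, a number like this says more than a raw count of closed incidents.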
Slide 27: And finally, how does it fit against the Service Catalogue and service requests? What's the difference? Well, it's a very fine line, to be quite honest. The Service Catalogue is obviously the place where a lot of those services are delivered, and the service request is how you get access to those services, be it software, hardware or whatever it happens to be.
Slide 28: So how do you determine whether it's a service request or an incident? Well, some people put service requests in as incidents, and actually there's nothing wrong with that. The purist might complain slightly and say that you shouldn't do it. I take a much more pragmatic view: it's whatever works for you as an organization. From a purist point of view, yes, it is possible to separate these things out, because they have definable ways of doing things and it's possible to handle them in a different way. So what we need to do is to understand what a service request is and when an incident is really a service request.
So some of the alternative kinds of incident that we have here are things like new hire, leaver, equipment request, software provision and a virus scan. There isn't a right answer as to whether each should be an incident or a service request, but what we're trying to do here is to ensure that we have the right approach, that we do things in the right way and at the right time, and that the service levels are set accordingly.
Slide 29: What happens from there is that you define the process for that particular thing, you manage it by priority, and then you set realistic SLAs. So whether it's an incident or a service request is really entirely up to you, but whichever it is, make sure those things are defined properly, that you understand which ones are service requests and how you treat them, and that they're managed accordingly.
New Hire Process - 46:02
Slide 30: Some of the things that you might have as incident or service request processes are things like the new hire process. There's the list of HR tasks that you have here, from the recruitment request through to arranging health cover and all those pieces, and then the IT tasks that go along with it. That's the new hire process, and it can be encapsulated within incidents or within a service request.
Slide 31: And of course when it comes to change, many incidents make a problem, and a problem then gets sorted out and resolved through the change process. That means you get an accurate analysis, an identification of the configuration items that are involved, and a good problem analysis that touches all the relevant incidents. What we're trying to do here is to have a link to known errors and to workarounds, so that once we've seen all the different incidents, we have a particular problem, and then we see the linked change that solves the problem, which in turn resolves all those incidents.
Slide 32: And finally, obviously, it comes down to configuration management: knowing what things deliver a service, the configuration items. That's defined by all the relationships contained within the configuration management database. Know what is important and how things connect to each other, and obviously that involves understanding, through impact analysis, the impact that a change then has on the organization.
Slide 33: So finally, why incident management? It's all to do with knowing what's important: what is the most important service, who is the customer, who is the contact, what's the expected fix time. What we try to do is not fight the same fires over and over again. We're trying to build a better and more repeatable process around incidents, so that we actually do this firefighting in a much more efficient and effective way, building on the knowledge from a call, documenting what's happened, who did what and when they did it. If we're doing this, we avoid the duplication of work, doing the same thing time and time again. We don't keep making the same mistakes and allowing the business to get into difficulty. We're really trying to ensure that things are done properly. And, as I said before, it also avoids the difficult tasks being bounced around, etc.
And finally it should ensure that communication occurs between the individual users and the groups that have been involved, and also between the different groups within the service desk, the different incident management groups, etc. So just to wrap up then, incident management is an absolutely vital piece of the jigsaw in terms of delivering service to an organization: restoring the service, the good service that's absolutely vital to ensure the business carries on working. The longer a service is down, the more at risk the business is from competitors, or of losing money, or whatever.
So incident management does work. It allows us to restore a service as quickly as possible, it informs the problem and the change process, and it allows us to effectively do for the business what it needs, keeping those services alive and running. So on that note I'm going to pass back to Joseph for a little piece from ManageEngine.
Q & A time: 52:10
Joseph: Thank you, Neil. It was a wonderful presentation. We have a couple of questions for you. First of all: who should be the major decision maker in creating an SLA, IT or the other parties involved?
Neil: Well, I mean, to be honest, the service that's being delivered is a business service. It's the service that the business needs. So in that case I'd say it really should be the business that's in the driving seat. But obviously it's got to be IT that can put some common sense around it. There's no point the business shouting and banging its fist on the table and stamping its feet and saying, no, no, we've got to have a fix within 10 minutes. You know, that is just unrealistic. We have to bring to the table the reality of IT and how long it can take for IT problems to be resolved. So we need to be realistic, and we also need to bring alternatives. Sometimes we need to say, well, it's going to be two weeks, and we know that's not going to be acceptable, but how about doing this, that or the other? So I think it's probably, I'd say, IT should be in the driving seat, but it has to take its prioritization, its importance, from the business.
Joseph: OK, thank you, Neil. One more question before we go into the 10-minute product brief about our ServiceDesk Plus. The last question for Neil would be: we usually resolve incidents and let the system close them automatically after three days. What is the recommended approach?
Neil: Well, this is one of those things: what ITIL is all about is best practice, and it's really what works best for you. What some service desks do is go back to the individual user and confirm with that user that everything's all right. Now we can do that, and if we haven't heard back from them in three days, I mean, that sounds a very reasonable way of working. There are other ways of doing it: we could say, we're going to close it unless we hear from you. But I think there has to be some involvement with the end user, some agreement with the end user that everything is OK with the thing that's been fixed. But I think it's very sensible to say that if you haven't heard back, then yes, go ahead and close it in three days. I'd say it is eminently sensible.
Joseph: Okay, thank you, Neil! Now we have Bhaskar with us. He's a product consultant and he will take us through a 10-minute session on how ServiceDesk Plus handles incident management. We also have a couple more questions, which Bhaskar will take after his 10-minute brief about the product. Bhaskar, take over.
Bhaskar: Hi! This is Bhaskar, product consultant from ManageEngine and I'm going to explain to you how ServiceDesk Plus handles incident management.
Let's move on to the creation of an incident. We have different modes of creating a request. One is email, where you can configure an email address for your support, and any email sent to that address will be converted into an incident like the one you see on the screen.
Another way is a phone call, where an end user calls up the support agent, who will in turn file a request using this form.
Another option is a self-service portal, where you give the end users access to a portal. Once they come in here to file a request, they can search the knowledge base. If they find a solution, they can fix their own problem; if they don't, they can go ahead and file a new incident. Once they file a new incident, it will be listed under the Requests tab, where the technicians can view these tickets and pick them up.
And before creating a ticket, when you input the requester details such as the username: ServiceDesk Plus has an option to import requesters from Active Directory, so the details of these users are automatically populated, and the assets belonging to the user are also shown here. This way we avoid asking the requester unnecessary questions about which asset he is using and its hardware configuration.
Next, we move on to urgency and impact, and the priority. You can choose the impact and the urgency, and this can be configured as a priority matrix under the admin section. This helps you to determine the right priority based on the business impact and urgency. This is also one of the IT best practices. We have also given you the flexibility to modify the urgency, the impact and the priority, so users and technicians can override this priority matrix, but we do not recommend this.
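The idea of a priority matrix can be sketched in a few lines of Python. This is an illustration of the concept only: the level names and the particular mapping below are assumptions, not ServiceDesk Plus defaults, and in the product the matrix is configured in the admin section rather than in code.

```python
# Hypothetical priority matrix: (impact, urgency) -> priority.
# Both the level names and the mapping are illustrative assumptions.
PRIORITY_MATRIX = {
    ("high", "high"): "critical",
    ("high", "medium"): "high",
    ("high", "low"): "medium",
    ("medium", "high"): "high",
    ("medium", "medium"): "medium",
    ("medium", "low"): "low",
    ("low", "high"): "medium",
    ("low", "medium"): "low",
    ("low", "low"): "low",
}


def determine_priority(impact: str, urgency: str) -> str:
    """Look up the priority for an impact/urgency pair (case-insensitive)."""
    key = (impact.lower(), urgency.lower())
    if key not in PRIORITY_MATRIX:
        raise ValueError(f"unknown impact/urgency combination: {key}")
    return PRIORITY_MATRIX[key]


priority = determine_priority("High", "Medium")
```

Deriving the priority from the matrix, instead of letting each person pick it by hand, is what keeps prioritization consistent across the service desk.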
Next we're going to look into the classification of incidents. You can classify different types of incidents into different categories, like you see on the screen. We support a hierarchy of category, subcategory and item, with which a type of incident can be granularly categorized. Incident categorization is very important, as it helps you understand the source of all incidents. Next, you can choose the group and the technician and file the request.
We're going to look into business rules. All these parameters, like category, subcategory, group and technician, can be set automatically with the help of a business rule. Business rule automation in ServiceDesk Plus helps in organizing the incoming requests and can perform actions such as categorizing an incident, delivering it to a group and assigning it to a technician. To configure a business rule, you go into admin > business rules. We can define a criteria and an action to perform, such as placing all high priority tickets into a group. We also have an option to email or SMS a particular technician when a business rule is executed, which is helpful so that the technician is aware of the ticket being assigned to him.
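The criteria-plus-action pattern of a business rule can be illustrated with a short Python sketch. Everything here is hypothetical: the BusinessRule class, the rule names, the group names and the "first rule to set a field wins" behaviour are assumptions made for the example and do not describe how ServiceDesk Plus implements its rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Incident = Dict[str, str]


@dataclass
class BusinessRule:
    name: str
    criteria: Callable[[Incident], bool]  # does the rule apply to this incident?
    actions: Dict[str, str]               # fields to set when it does


# Hypothetical rules, evaluated in order
RULES = [
    BusinessRule(
        "High priority to escalation group",
        lambda incident: incident.get("priority") == "high",
        {"group": "Escalation"},
    ),
    BusinessRule(
        "Printer problems",
        lambda incident: "printer" in incident.get("subject", "").lower(),
        {"category": "Hardware", "subcategory": "Printer", "group": "Desktop Support"},
    ),
]


def apply_rules(incident: Incident, rules: List[BusinessRule]) -> Incident:
    """Return a copy of the incident with fields filled in by matching rules.

    A field that is already set is left alone, so earlier rules take precedence.
    """
    updated = dict(incident)
    for rule in rules:
        if rule.criteria(updated):
            for field_name, value in rule.actions.items():
                if not updated.get(field_name):
                    updated[field_name] = value
    return updated


routed = apply_rules({"subject": "Printer jam on floor 2", "priority": "high"}, RULES)
```

In this example the ticket is categorized as a printer problem but stays with the escalation group, because the high priority rule ran first.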
And now we're going to look into the templates, which can be configured under the admin section. Frequently occurring issues or incidents, such as email or printer problems, can be configured as templates with all these parameters pre-populated, which makes incident logging much easier. For example, if this is a printer issue template, you can configure these values and save it as a template. Then whenever a technician would like to create a request for a printer problem, he can go to the new incident button and select the printer problem template. This will automatically fill in those fields.
And next, we're going to look into how acknowledgments are sent to the end users. Once I create a ticket with a requester name, an acknowledgement email is sent to the end user with the ticket number, so that the next time he calls the helpdesk he can refer to the ticket number.
And we're going to look into the service level agreement. Once an incident is created in the service desk, the SLA helps us to set a due-by time for a particular request. Here you can see the created date, the due-by time and the response due-by time. The due-by date is the resolution due date, and the response due-by time is when the first response for this particular ticket is due. This is configured under admin > service level agreement.
So if a request is left unattended by a service desk agent for a set period of time, it will send an automatic escalation email to another technician saying that the SLA for the incident has been violated. You can configure the response SLA and the resolution SLA over here. For this one, I have set a priority of high for a mail fetching issue: the first response should be within one hour, and the resolution within eight hours. You can also enable escalation, so that the other technician gets notified when this SLA is violated. So once a ticket is created with an SLA, a technician looks for a resolution for this particular ticket.
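The due-by times in that example can be worked through with a minimal Python sketch. The one-hour response and eight-hour resolution figures come from the demo; the Sla class and function names are hypothetical, and the calculation assumes plain calendar time, whereas a real service desk tool would normally take operational hours and holidays into account.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple


@dataclass
class Sla:
    response_within: timedelta
    resolution_within: timedelta


# Figures from the demo: high priority, 1 hour to respond, 8 hours to resolve
HIGH_PRIORITY_SLA = Sla(timedelta(hours=1), timedelta(hours=8))


def due_times(created: datetime, sla: Sla) -> Tuple[datetime, datetime]:
    """Return (response due-by, resolution due-by), ignoring business hours."""
    return created + sla.response_within, created + sla.resolution_within


def response_overdue(
    created: datetime, first_response: Optional[datetime], now: datetime, sla: Sla
) -> bool:
    """True when nobody has responded and the response due-by time has passed."""
    response_due, _ = due_times(created, sla)
    return first_response is None and now > response_due


created = datetime(2024, 1, 1, 9, 0)
response_due, resolution_due = due_times(created, HIGH_PRIORITY_SLA)
escalate = response_overdue(
    created, None, datetime(2024, 1, 1, 10, 30), HIGH_PRIORITY_SLA
)
```

A ticket created at 09:00 is due a first response by 10:00 and a resolution by 17:00; with no response by 10:30, it would be a candidate for the escalation email.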
In ServiceDesk Plus, we have provided the convenience of searching for a solution from within a particular ticket, based on, you know, the subject of the email. So here you get the list of solutions available in the knowledge base; you can select a solution and copy it to the actual incident. Once it is copied, you can change the status over here to resolved, and as a technician I can also add a work log of how long I have been working on this particular ticket: the time spent and the troubleshooting that I have done on this incident.
An incident must be closed only if the end user confirms that the resolution I have provided is a valid one. For this, we have a setting under admin > request closing rules. Here is where we define the mandatory fields which have to be filled in before you close a particular request, and this is what I was talking about, acknowledgment from the end user. Neil was also talking about automatic closing of a ticket: here you can set that once a ticket is moved to the resolved state, it gets automatically closed after a number of days that you specify, for example three days. This is also an important ITIL best practice, where the technician doesn't have to constantly follow up for a response from the end user.
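The closing rule described here (close on user confirmation, otherwise close automatically three days after resolution) can be sketched as a small Python function. The status names and the function next_status are assumptions made for illustration; in ServiceDesk Plus this behaviour is configured under the request closing rules rather than written as code.

```python
from datetime import datetime, timedelta
from typing import Optional

AUTO_CLOSE_AFTER = timedelta(days=3)  # the three-day figure used in the talk


def next_status(
    status: str,
    resolved_at: Optional[datetime],
    user_confirmed: bool,
    now: datetime,
) -> str:
    """Decide whether a resolved ticket should move to closed."""
    if status != "resolved":
        return status            # only resolved tickets are candidates for closing
    if user_confirmed:
        return "closed"          # the end user agreed the fix works
    if resolved_at is not None and now - resolved_at >= AUTO_CLOSE_AFTER:
        return "closed"          # no reply within the window: close automatically
    return "resolved"


resolved_at = datetime(2024, 1, 1, 12, 0)
after_one_day = next_status("resolved", resolved_at, False, datetime(2024, 1, 2, 12, 0))
after_three_days = next_status("resolved", resolved_at, False, datetime(2024, 1, 4, 12, 0))
```

This keeps the end user involved, as Neil recommends, without leaving tickets open indefinitely when nobody replies.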
And next we're going to look into end user satisfaction. We have an option under admin > survey settings. There was also a question from one of our attendees: what method do you use to evaluate customer satisfaction? For that, we have an option to enable a survey, which can be sent for every ticket created in ServiceDesk Plus once it has been closed. The results of the survey will be shown under the survey results, and any additional comments will also be listed below.
And going to the reporting module, for how incident management can best be evaluated: here you have more than 100 predefined reports that you can generate, based on SLA violations, the number of incoming tickets per day, the pending tickets for each group or technician, and the list of tickets with high priority. You can run a report under the reporting section, and you've also got the option of making your own custom reports for all requests, choosing the fields that you require. So this is how incident management works best with ServiceDesk Plus.
I'll also look into another question that was submitted by an end user: we are currently using ServiceDesk Plus in our IT department, and we would like to expand the system to the HR department, finance and others as well, for the users to enter service requests. How can we do this?
In ServiceDesk Plus, the Enterprise version has the Service Catalogue feature, with which you can add a service category. These are the predefined service categories which are available, and adding a service to a service category will allow you to choose the users to whom you want to show that service. The user groups can be configured under admin > user groups. You have criteria that you can choose here, such as job title, so you can set a filter which will show you the list of users who belong to the IT department. Once a service request is configured, you can show it only to the IT department, so that the other user groups will not have access to this service.
So by this, we conclude the session on how best we can use incident management in ServiceDesk Plus. Thank you all for attending.