IT incidents are something most IT people would rather stay away from. Is it a blessing for an organisation never to have encountered a single IT Incident? Is an IT Staff who has never been involved in IT Incident recovery or troubleshooting fortunate? Some would say yes and some would say no, depending on the angle they are coming from. Those who agree that it is indeed a blessing are looking at it from the angle of sufficient preventive protection and no disruption to the business. Those who disagree are looking at it from the angle that the IT Staff has not yet been tested by the real-life experience of overcoming an actual IT Incident, should one occur.
On the other hand, there is also the type of IT Staff who gets involved in IT incidents in every organisation they join. This is the type of IT Staff who is always challenged to bring services back to normalcy in every incident. We used to joke that anyone in this position should “mandi bunga” (Note: “mandi bunga” refers to seeking help from a traditional healer to cleanse bad luck). Of course, it is just a figure of speech. Not that any of these people actually needs one. Frankly, being in this category myself, I can say that IT incidents really toughen you up and mentally prepare you for whatever may come. Every IT Incident is unique. If you encounter the same incident repeatedly, it means you and your organisation did not learn from it.
Incident Management vs Problem Management
Everyone in the organisation, including the Management, should be aware of the terms Incident Management and Problem Management. Knowing the difference helps in setting the right expectations of the IT Team.
1. Incident Management
- The objective of Incident Management is to restore services to normalcy as soon as possible, to avoid any further damage or loss to the company’s operations.
- An incident can halt business operations, so it is critical to bring the situation back to an operational state. For example, multiple servers overheat and shut down abruptly because the data centre’s precision cooling unit has failed.
- A quick decision is needed: whether to activate the Disaster Recovery (DR) site or to bring in a mobile cooling unit while waiting for the precision cooling unit to be repaired.
- This is not the time for anyone to come down and harass the IT Staff, asking why the incident happened and who is responsible for it. The priority is to bring up the servers so that business can resume.
2. Problem Management
- The objective of Problem Management, however, is to identify the root cause of the incident so that corrective action can be taken to prevent a similar incident from recurring. It is to investigate from every angle and to come up with an analysis and lessons learnt.
- If your organisation has a big enough IT Team to spare, then most probably both Incident Management and Problem Management can be carried out at the same time, with expedited results.
- This is when the IT Team needs to analyse the incident from every possible angle and put together an end-to-end historical timeline to ascertain the source of the problem.
- If a company was hit by ransomware, this forensic investigation would identify which user accidentally clicked the malicious link or attachment and why the organisation’s IT Security tools were unable to detect it sooner.
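As a minimal sketch of the end-to-end timeline described above, the Python snippet below merges events from several sources and orders them by time. The event sources, timestamps and descriptions are hypothetical, made up purely for illustration; a real investigation would draw them from the organisation's own logs.

```python
from datetime import datetime

# Hypothetical events gathered from different sources (illustrative only).
events = [
    (datetime(2022, 1, 10, 9, 42), "endpoint", "Files on a user laptop start being encrypted"),
    (datetime(2022, 1, 10, 9, 15), "email gateway", "Email with a suspicious attachment delivered"),
    (datetime(2022, 1, 10, 9, 31), "proxy", "User device contacts an unknown external host"),
]


def build_timeline(events):
    """Return the events sorted from earliest to latest."""
    return sorted(events, key=lambda event: event[0])


for when, source, description in build_timeline(events):
    print(f"{when:%Y-%m-%d %H:%M}  [{source}]  {description}")
```

Ordering the events does not identify the root cause by itself, but it shows what happened first and where the gaps in detection were.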
Handling an IT Incident
In the event of an incident, it is important for CIOs and IT Heads to remain focused and calm. Pressure from the Management is to be expected. It is how we react that determines the outcome of the IT Incident recovery:
1. Objective Mindset
We need to stay focused and calm during any IT Incident. Only when you are clear-headed can you coordinate and strategise the recovery plan effectively. Panicking will not solve any problem, and it will cloud your judgement or make it irrational. It would also hurt morale, as your IT Team is looking to you for instructions. They will end up running around like headless chickens should you be indecisive or unable to pull everyone together. Hence, as I mentioned earlier, it is our positive or negative reaction that determines the outcome of the incident recovery. You may have the best and most competent IT Team, but it means nothing if they are being guided by inconsistent instructions.
2. Stakeholder Management
We need to absorb all the pressure coming from the Management and Business Heads, so that our IT Staff can perform at their best to rectify the situation. You are there at the Management level so that you can shield your team from unnecessary pressure that might hinder their work and delay the recovery. You should not allow anyone to harass your IT Staff directly for a resolution. The way you lead will instil confidence in your subordinates. Besides, we should be allowed to handle the incident the proper way, as we are the ones running the IT Department.
3. Continuous Support
I have seen IT Heads shouting at and pressuring their IT Staff to expedite the rectification. I come from a different school of thought, and I am really against this type of approach. I believe that jumping around or breathing down their necks will not solve any problem. A staff member may crack under pressure when he or she is the only subject matter expert we can rely on. We should support our IT Staff and bring out their best potential to recover the situation. We are in this together and must work as a team to come out of this challenging situation.
4. Expert Guidance
The CIO and the IT Heads are the single point of reference when the IT Staff hit roadblocks or dead ends. This matters because some IT Incidents may drag on for hours or even days before they are resolved. It is to be expected that some of the IT Staff will be extremely fatigued and not in the best state of mind. They become trapped in the ‘problem cocoon’ and are unable to see the way out. Hence, it is essential that we view the issues from a third-party point of view so that we can guide them and show them the potential exits from their problems.
5. Close Supervision
Time is of the essence, and we need to make sure that the recovery activity is heading in the right direction. Even though we absorb the pressure from the top and remain supportive, that does not mean the IT Team can take their own sweet time to carry out the troubleshooting. The sole purpose of Incident Management is to recover the situation as soon as possible, and we need to make sure that all manpower is mobilised to achieve this. We also need to help burnt-out IT Staff get back on track and realign those who deviate from the incident recovery tasks.
Setting the Right Expectations
Sometimes, the Management has a different understanding which does not align with the delivery that IT has committed to. Therefore, CIOs and IT Heads need to work closely with the Management to set correct and realistic expectations. That is why we need a good SOP and process in place for dealing with IT Incidents. These documents are periodically reviewed and presented for the Management’s consent:
Severity of the Issues
Not all IT issues are IT Incidents. Having a clear SOP on issue severity classification helps to set the right expectations with the Management. For example, a phishing email impersonating the CEO instructs the CFO to make a multi-currency payment to a bogus company. This email had already been diverted to the Spam Folder, so the filtering system did its job and the Management needs to trust it. The email was diverted because a genuine internal email (CEO to CFO) is treated as local and could not have originated from outside the domain. Blocking every bogus domain that comes with a phishing email would not solve any problem and would only waste resources. What we need is a good and effective Email Filtering Tool.
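The reasoning in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular Email Filtering Tool works: the domain `example.com`, the function name `is_spoofed_internal` and the `arrived_from_outside` flag are all hypothetical, and a real filter relies on far more signals than the sender's domain.

```python
from email.utils import parseaddr

INTERNAL_DOMAIN = "example.com"  # hypothetical company domain


def is_spoofed_internal(from_header: str, arrived_from_outside: bool) -> bool:
    """Return True when a message claims an internal sender but arrived from outside.

    A genuine CEO-to-CFO email stays inside the domain, so an "internal"
    sender on a message that came in from the Internet is suspicious.
    """
    _, address = parseaddr(from_header)
    sender_domain = address.rpartition("@")[2].lower()
    return sender_domain == INTERNAL_DOMAIN and arrived_from_outside


# Claims to be the CEO, but was received from an external server.
print(is_spoofed_internal("CEO <ceo@example.com>", arrived_from_outside=True))
# Genuine internal email that never left the domain.
print(is_spoofed_internal("CEO <ceo@example.com>", arrived_from_outside=False))
```

A rule of this kind catches the impersonation regardless of which bogus domain the attacker actually sent from, which is why chasing individual domains adds little.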
Situations Out of Our Control
Sometimes, no matter how good or how prepared we are in mitigating IT incidents, there are still things that are out of our control. Take, for instance, the Microsoft Teams outage that happened a few months back. Yes, we understood that the Staff were unable to communicate with each other and that meetings with Clients were at stake, but this outage was totally out of our control. We were unable to commit to any expected recovery time because this depended entirely on the third party, Microsoft. This type of situation needs to be clearly communicated to the Management so that they understand.
Not Everything Is about Negligence
It is normal for the Management to pressure IT to find out who caused the incident. Was it due to negligence? Was it due to a failure to follow the process? Was it due to the IT Staff’s incompetence in carrying out his or her duties? However, we need to set the right understanding that not all incidents originate from human error or negligence. For example, consider a “Zero-Day Malware” attack that cripples some of your sales systems. Internally, all the servers have been patched to the latest version and all the users’ devices are running the latest antivirus signatures. It just so happens that the Endpoint Protection was unable to detect this “Zero-Day Malware” because it is new and the vendor has yet to come out with the updates to combat this threat.
Anticipating an IT Incident
My late father used to say, “Pray for the best, but be prepared for the worst”. His words of wisdom have been guiding me all these years. We may have everything covered and all the risks considered, but that only reduces the likelihood of an incident happening. The key is to anticipate that it might happen, and to always keep your cool when it does.
Catch the When Expert Meets Expert articles by Ts. Saiful Bakhtiar Osman every other Tuesday. Don’t forget to subscribe to stay connected. You are also encouraged to ask questions and seek advice from him.