Event/Problem Management

An effective event/problem management process helps protect institutions from financial risks, operational risks, and reputation risks. Management should ensure appropriate controls are in place to identify, log, track, analyze, and resolve problems that occur during day-to-day operations.

The event/problem management process should be communicated and readily available to all IT operations personnel. Appropriate personnel from IT operations, institution management, internal audit, fraud and loss prevention, information security, and computer security incident response teams should participate in the event/problem management process. Event/problem management plans should cover hardware, operating systems, applications, and security devices and should address, at a minimum:

  • Event/problem identification and rating of severity based on risk;
  • Event/problem impact and root cause analysis;
  • Documentation and tracking of the status of identified problems;
  • The process for escalation;
  • Event/problem resolution;
  • Management reporting; and
  • Contact and communication information, including:
    - Current names and/or positions of individuals who should be contacted;
    - Current phone numbers of contacts; and
    - Who should be notified (e.g., regulators, FBI, public relations group, media, affected business lines) and the circumstances under which they should be notified.
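
As an illustration only, the minimum plan elements listed above (severity rating, status tracking, escalation, and contacts) could be captured in a simple event record. The sketch below is not a prescribed format: the 1-3 impact and likelihood scale, the severity thresholds, and the escalation matrix are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Hypothetical escalation matrix: positions to notify at each severity level.
ESCALATION = {
    Severity.LOW: ["shift supervisor"],
    Severity.MEDIUM: ["shift supervisor", "operations manager"],
    Severity.HIGH: ["operations manager", "information security"],
    Severity.CRITICAL: ["operations manager", "information security",
                        "incident response team", "senior management"],
}


def rate_severity(impact: int, likelihood: int) -> Severity:
    """Map 1-3 impact and likelihood scores to a severity (assumed scale)."""
    score = impact * likelihood  # ranges from 1 to 9
    if score >= 9:
        return Severity.CRITICAL
    if score >= 6:
        return Severity.HIGH
    if score >= 3:
        return Severity.MEDIUM
    return Severity.LOW


@dataclass
class EventRecord:
    event_id: str
    description: str
    severity: Severity
    status: str = "open"
    root_cause: str = ""
    history: list = field(default_factory=list)

    def update(self, status: str, note: str) -> None:
        """Record a status change so the event can be tracked to resolution."""
        self.status = status
        self.history.append((datetime.now(timezone.utc), status, note))

    def contacts(self) -> list:
        """Return the positions to notify for this event's severity."""
        return ESCALATION[self.severity]
```

Keeping the status history on the record itself gives management reporting a single place to see how each event was escalated and resolved.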

Operations personnel plan the work for each shift in advance to ensure that it is finished in an accurate and timely manner. However, unusual events often occur during production, which management should monitor and correct. Examples of common production events include the following:

Production Program Failure - Operations personnel should properly log and record program failures that require immediate intervention. They should also notify the appropriate personnel so proper change management procedures can be initiated. Some production failures require immediate intervention by programming staff to meet an important production goal (such as month-end or cycle processing). In these cases, emergency procedures, sometimes called "fire call" procedures (whom to call, what to report, etc.), are invoked, and the programming staff members perform emergency repairs either at the IT operations facility or from a remote location.

Out-of-Balance Conditions - Personnel responsible for scheduling should document and correct all production processes that do not contain proper run control balances. Personnel should rerun the data to check for operator error or erroneous transactions. When totals do not balance after being rerun, operations personnel should log and record the event and notify management of the need for further investigation and resolution.
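
A run control balance check of the kind described above might look like the following sketch. It assumes the control figures are a record count and an amount total, and that `load_amounts` is a hypothetical callable that reruns the job and returns the processed amounts; neither detail comes from this booklet.

```python
from decimal import Decimal


def check_run_balance(control_count, control_total, amounts):
    """Compare processed transactions against the run control totals.

    Returns a list of discrepancy messages; an empty list means in balance.
    """
    issues = []
    actual_count = len(amounts)
    # Decimal avoids the rounding drift that floats introduce in money totals.
    actual_total = sum((Decimal(a) for a in amounts), Decimal("0"))
    if actual_count != control_count:
        issues.append(f"record count {actual_count} != control {control_count}")
    if actual_total != Decimal(control_total):
        issues.append(f"amount total {actual_total} != control {control_total}")
    return issues


def balance_with_rerun(load_amounts, control_count, control_total):
    """Rerun once on an out-of-balance result, then flag for escalation."""
    issues = []
    for attempt in (1, 2):
        issues = check_run_balance(control_count, control_total, load_amounts())
        if not issues:
            return {"balanced": True, "attempts": attempt, "issues": []}
    # Still out of balance after the rerun: log the event and notify management.
    return {"balanced": False, "attempts": 2, "issues": issues}
```

A result with `balanced` set to `False` corresponds to the case in which totals still do not balance after the rerun and the event should be logged and referred to management.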

Operations Tasks Performed by Different Parties than Normal - Operations personnel customarily are cross-trained and have back-up duties in case another employee is absent or temporarily assigned other functions. For example, operators may act as back-up to tape librarians or production control analysts. In these circumstances, it may be possible for the parties to intentionally or unintentionally cause an error, fraud, or service disruption. Where back-up employees have the potential to compromise segregation of duties, management should establish mitigating controls.
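
One possible way to spot back-up assignments that could compromise segregation of duties is to check each assignment against a list of incompatible duty pairs. In the sketch below, the pairs in `CONFLICTING_DUTIES` are assumptions chosen to mirror the example above; each institution would define its own.

```python
# Hypothetical pairs of duties that one person should not hold at once
# without a mitigating control.
CONFLICTING_DUTIES = {
    frozenset({"computer operator", "tape librarian"}),
    frozenset({"computer operator", "production control analyst"}),
}


def duty_conflicts(primary_duty, backup_duties):
    """Return the back-up duties that conflict with the primary duty."""
    return sorted(
        duty for duty in backup_duties
        if frozenset({primary_duty, duty}) in CONFLICTING_DUTIES
    )
```

Any duty returned by `duty_conflicts` marks an assignment for which management would need a mitigating control, such as independent review of the work performed during the back-up period.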

Logging Issues - Most problem-solving techniques in an IT operations center depend on the ability to read, consolidate, and interpret various operations logs. Consequently, an institution should not destroy or modify its logs. Discovery of log tampering or manipulation is an event that requires management resolution and the involvement of the computer incident response team. Operations management should periodically review all logs for completeness and ensure they have not been deleted, modified, overwritten, or compromised.
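
One common technique for making log modification or deletion detectable is a hash chain, in which each entry's digest also covers the digest of the entry before it. This booklet does not prescribe the technique; the sketch below is only an illustration, and the `GENESIS` starting value and function names are assumptions.

```python
import hashlib

GENESIS = "0" * 64  # assumed starting value for the chain


def chain_entries(entries):
    """Return (entry, digest) pairs where each digest covers the prior digest."""
    chained, prev = [], GENESIS
    for entry in entries:
        digest = hashlib.sha256((prev + entry).encode("utf-8")).hexdigest()
        chained.append((entry, digest))
        prev = digest
    return chained


def first_tampered_index(chained):
    """Return the index of the first entry that fails verification, or None."""
    prev = GENESIS
    for i, (entry, digest) in enumerate(chained):
        expected = hashlib.sha256((prev + entry).encode("utf-8")).hexdigest()
        if digest != expected:
            return i
        prev = digest
    return None
```

A modified or removed entry breaks the chain at that point, so a periodic review can call `first_tampered_index` and refer any non-`None` result to the incident response team. Truncation of the most recent entries is not detected by the chain alone; that would require keeping the latest digest in a separate, protected location.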

Database Operations - Although various security devices protect databases, it may be possible for the operator to use system utilities or unauthorized compilations to modify the system. In such cases, the database may become corrupt or inaccessible. Operations management should regularly and carefully review all logs involving database programs and files and should report all unauthorized modifications to the computer incident response team.

Termination of Operations Personnel - Whenever the employment of someone with access to sensitive or confidential material is terminated for any reason, management should revoke or change all physical and logical access controls including all key locks, badges, common locks, and cyber locks. It is sound practice to ask the employee to leave at the time notice is served. If this is not practical, management should carefully monitor and review the employee's activities to ensure the protection of all data, files, and security devices. There should be written procedures to define the responsibilities for all operations, IT management, and human resources personnel when a termination occurs.

Run Time Anomalies - Management, a shift supervisor, or another independent person should review run time logs, identify any anomalies, and review their cause and resolution. It is possible for computer operators to run programs out of sequence or with improper inputs to cause error or fraud. Automated scheduling programs commonly used in large, complex institutions significantly reduce the risk of this type of event. Unexplained or inadequately explained anomalies should prompt a production rerun. Event report logs for unexplained anomalies should be forwarded to the computer incident response team for review.
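
As a rough sketch of the run time log review described above, the jobs actually run can be compared against the approved schedule to flag work that was run out of sequence, repeated, unscheduled, or skipped. The job names and anomaly labels below are hypothetical.

```python
def find_sequence_anomalies(schedule, run_log):
    """Compare the jobs actually run against the approved schedule order.

    Returns a list of (kind, job) tuples describing each anomaly.
    """
    anomalies = []
    position = {job: i for i, job in enumerate(schedule)}
    last_pos = -1
    seen = set()
    for job in run_log:
        if job not in position:
            anomalies.append(("unscheduled", job))
            continue
        if job in seen:
            anomalies.append(("repeated", job))
        elif position[job] < last_pos:
            # A job scheduled earlier ran after a job scheduled later.
            anomalies.append(("out_of_sequence", job))
        seen.add(job)
        last_pos = max(last_pos, position[job])
    anomalies.extend(("missing", job) for job in schedule if job not in seen)
    return anomalies


# Hypothetical nightly schedule and the order recorded in the run time log.
schedule = ["extract", "post", "interest", "report"]
run_log = ["extract", "interest", "post", "report"]
print(find_sequence_anomalies(schedule, run_log))
# prints [('out_of_sequence', 'post')]
```

Each anomaly returned would still need an independent reviewer to confirm its cause and resolution; the check only identifies what to look at.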

Management should train and test operations personnel on their ability to recognize security events that require referral to the computer security incident response team, security guards, management, or other parties. Social engineering is a growing concern for all personnel, and in some organizations personnel may be easy targets for hackers trying to obtain information through trickery or deception.

Management should consider the safety of its employees as paramount when there is a life-threatening event. Policies and procedures should reflect this philosophy. Management should ensure it trains all operations personnel to act appropriately during significant events. Employees should also receive training to understand event response escalation procedures.

Management should properly train operations personnel to recognize events that could trigger implementation of the business continuity plan (BCP). Although an event may not initially invoke the plan, invoking it may become necessary as conditions and circumstances change. Management should train and test institution personnel to implement and perform appropriate business continuity procedures within the timeframes of the BCP. Operations personnel should properly log and record any events that trigger BCP response and document their ultimate resolutions. Refer to the IT Handbook's "Business Continuity Planning Booklet" for additional discussion on this topic.


