Cyber resilience in an industrial environment
For the most impatient readers, you can go directly to the Key Elements at the end of the article.
Reminder of the state of the threat
ANSSI states in “ÉTAT DE LA MENACE RANÇONGICIEL – À L’ENCONTRE DES ENTREPRISES ET INSTITUTIONS”, published on 05/02/2020: “Since 2018, ANSSI and its partners have observed that more and more cybercriminal groups with significant financial resources and technical skills favour the targeting of particular companies and institutions in their ransomware attacks.”
Faced with this observation, it is more necessary than ever to secure information systems. This involves applying the fundamentals of security: applying patches, managing accounts and passwords, managing network segmentation etc. As a reminder, applying these initial measures significantly reduces the probability that an information system will be hit by ransomware, but can in no way guarantee that this will not happen.
Specificity of the industrial sector
However, even though new defensive solutions are continually being developed, the cost and complexity of deploying some of them ultimately mean they are little used. This is even more true in an industrial environment, where their integration can be complex, as some systems are frozen in a working configuration. Moreover, the budgets allocated to IT security in an industrial environment, although increasing in recent years, are still not sufficient for many sites.
Furthermore, an industrial information system shares a common base with a conventional information system and is therefore subject to the same attacks. Of course, attacks such as Stuxnet, Triton, or BlackEnergy (on a smaller scale) require additional skills. It is also worth remembering that the targets of interest for groups with such means are generally already subject to regulatory obligations (LPM in France, NIS directive etc.), which, if respected, greatly limit the risk of a successful attack against them. However, these systems are not invulnerable and must therefore also be prepared to respond to an attack.
Inevitable attack on industrial systems: how to minimise the impact and restart operations quickly?
It therefore appears that:
- Protecting oneself from the threat is often limited to the application of basic security measures if there is no regulatory obligation applicable to the target information system;
- Identifying the sources of threat and detecting an attack before it reaches its objective requires, in most cases, resources that are too large relative to the budgets of current industrial information systems.
If a successful cyber-attack on an information system, and more specifically a ransomware attack, is almost certain, the following question arises: “How can we prepare for a major cyber-attack and maintain critical activities in a degraded mode, while rapidly regaining confidence in the industrial information system?”
The answer to this question is covered by the last two functions of the NIST Cybersecurity Framework: Respond and Recover. This article attempts to answer it.
Note: the first part of this article, “How to respond to an attack before it is too late?”, is not a prerequisite for implementing the recommendations detailed in the second part, “How to recover after an attack if it has not been contained?”. Although network filtering measures are highly recommended, sites where they would take too long to deploy may prefer to start by preparing the remediation of a cyber-attack, which is easier to implement.
How to respond to an attack before it is too late?
Involving industrial teams
Before talking about the measures that can be put in place to respond to a digital security incident, it may be interesting to remember that industrial staff are used to crisis management.
Indeed, many industries regularly organise crisis management exercises (fire, chemical risk, natural disasters, etc.). On many sensitive sites, procedures are therefore already available to respond to this type of incident, under the direction of a dedicated manager. In addition, autonomous physical protection is generally available: pressure relief valve, non-return valve, sprinkler etc., although these are sometimes replaced by connected instrumented safety systems.
The context is therefore appropriate for adding a new procedure to respond to a computer attack. This will generally consist of isolating the industrial information system from the outside via a procedure known as the “red button”. In order to draw up the associated procedure, the involvement of site personnel will be essential, particularly to ensure that applying it is not more harmful than the attack itself.
A prerequisite for the implementation of the isolation posture: the control of its flows and the implementation of network partitioning/filtering.
It is necessary to measure the impacts generated by using the “red button”. To do this, it is necessary to list the interconnections of the industrial site with other systems.
List the interconnections with other information systems.
It may be interesting to start by listing the flows between the industrial information system and the outside. First of all, it is necessary to define what this system contains. In a basic case, it includes the PLCs, the supervision, as well as the equipment necessary for the interconnection of the first two.
Other equipment can then be added: a Historian server, client stations for supervision, a NAS, etc. This network, referred to below as the industrial network, is generally connected to other networks in order to share information with their equipment.
It is possible to mention:
- Exchanges with the company’s ERP (whether an MES – Manufacturing Execution System is present or not), generally located on the office network;
- Exchanges with partners: regulation of electricity, water and gas networks, etc.;
- Exchanges with service providers: weather, cloud solutions for energy optimisation, predictive maintenance, etc.
These flows, although useful to simplify operations, can generally be temporarily cut off or replaced by alternative means (telephone call to indicate production levels for example).
Moreover, each industrial site is different, and therefore manages these interconnections differently. It is common to see MPLS networks dedicated to industrial sites when the company owns several of them. In other cases, the office network will be used to federate them. The same applies to connections between these industrial networks and the Internet, which sometimes pass through the office network first and sometimes have a direct Internet exit.
List its internal flows
After listing the interconnections between the industrial network and the outside, the internal flows remain to be listed. Most of these flows should be strictly necessary for the proper functioning of the industrial process, such as those between supervision and PLCs. Cutting off these connections would therefore require stopping the industrial process, or at least making it safe.
It may then be interesting to separate the equipment and associated flows into several zones:
- Field network;
- Supervision;
- Others (supervision client stations, historian server, etc.).
Setting up these zones allows the exposure of these components to be drastically reduced. Indeed, only the supervision should have access to the field networks, while the “Others” category should only have access to the supervision.
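The access rule described above can be written down as a short allow-list of zone-to-zone conduits. The following is only a minimal sketch: the zone names ("field", "supervision", "others") and the function name are assumptions chosen for illustration, and traffic inside a zone is assumed to be left unfiltered.

```python
# Hypothetical zone names; adapt them to the zones actually defined on site.
ALLOWED_CONDUITS = {
    ("supervision", "field"),   # only the supervision reaches the field network
    ("others", "supervision"),  # client stations, historian, etc. reach the supervision only
}


def is_conduit_allowed(source_zone: str, destination_zone: str) -> bool:
    """Return True if traffic initiated in source_zone may reach destination_zone."""
    if source_zone == destination_zone:
        return True  # assumption: traffic inside a zone is not filtered
    return (source_zone, destination_zone) in ALLOWED_CONDUITS
```

With this table, a request from the "others" zone to the "field" zone is refused, which is the exposure reduction the zoning aims for.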
Other zones may be necessary to implement such as:
- An administration zone: which could also be used to program the PLCs according to the distribution of roles and responsibilities on site;
- A DMZ: which can accommodate a relay server so that equipment outside the industrial site does not connect directly to the supervision system to retrieve production data, etc.
Depending on the services offered (WSUS server, antivirus server, Terminal Server for remote access etc.) other zones can of course be added.
Evaluate the real need for these flows
After listing all these flows, it is interesting to identify the real need for each of them. For example, is it necessary to be able to access e-mails from a supervision server?
In order to limit the exposure of the industrial network to the outside, it could also be necessary to take some equipment out of it. For example, if a database accessed from the office network is fed by the supervision, but not useful to it, hosting it directly on the office network may prove simpler than trying to limit access.
Once the necessary flows have been clearly identified, the associated filtering rules must be configured in detail (source IP address, destination IP address, destination port). This work generally requires a significant human investment, mainly from the teams in charge of the industrial site, as well as a significant material cost to acquire security equipment. However, it is a prerequisite for setting up the fallback postures described below. In an ideal case, application filtering (level 7 of the OSI model) could also be implemented.
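As an illustration of what "configured in detail" means, the sketch below models a default-deny rule set keyed on source address, destination address and destination port. It is not a firewall configuration: the addresses, the rule list and the names `FlowRule` and `is_flow_allowed` are assumptions made for the example (port 502 is the standard Modbus/TCP port).

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class FlowRule:
    source: str       # source network, CIDR notation
    destination: str  # destination network, CIDR notation
    port: int         # destination port
    comment: str = ""


# Illustrative rules only; every address here is a placeholder.
RULES = [
    FlowRule("10.10.20.5/32", "10.10.10.0/24", 502, "supervision -> PLCs (Modbus/TCP)"),
    FlowRule("10.10.30.0/24", "10.10.20.5/32", 443, "client stations -> supervision"),
]


def is_flow_allowed(src: str, dst: str, port: int, rules=RULES) -> bool:
    """Default deny: a flow passes only if one rule matches all three fields."""
    for rule in rules:
        if (
            ip_address(src) in ip_network(rule.source)
            and ip_address(dst) in ip_network(rule.destination)
            and port == rule.port
        ):
            return True
    return False
```

Keeping the rule list this explicit makes each remaining flow something that can be justified, reviewed and, if necessary, cut.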
This work, although essential to the implementation of isolation postures, is also one of the fundamental actions to be carried out within the framework of securing an information system (industrial or not). Indeed, each flow cut off is a flow that does not need to be monitored, as well as one that is less exploitable by an attacker.
Preparing fallback postures
Complete isolation of all the equipment in an industrial information system is not always desirable, even in the event of an attack. After having listed these flows, it may be interesting not to set up a single isolation posture, but several fallback postures, allowing in some cases to continue working almost normally.
Preventive fallback posture: isolate the plant in the event of an attack on an external network
After identifying the flows between the industrial network and the outside, it is possible to create an associated fallback posture in order to deactivate them if necessary. The objective of this posture is to cut all interconnections of the industrial network with the outside in order to prevent any propagation of an attack. A proven solution is to group these flows on a few dedicated Ethernet ports. Thus, it is sufficient to indicate in the associated procedures to disconnect the associated cables to activate the fallback posture. This also avoids having to intervene on the configuration of firewalls in the event of a cyber security incident.
In addition, it is also necessary to define the cases in which this posture should be activated. If it can be activated without posing any problem to production, or adding too much work to the site staff, the question may arise as to whether these flows are necessary.
If this posture does have an impact on the site’s industrial activities, a good balance must be found between triggering it too early (as soon as the antivirus software on an office workstation raises an alert), or too late (after the first industrial workstations have been encrypted). This will also depend on the context of the company and its resources (dedicated or non-dedicated security monitoring team, etc.).
Specificity (distributed sites, non-autonomous sites, etc.)
If the flows with the outside do not all have the same destination, it may also be interesting to define several specific fallback postures. Indeed, if the service provider in charge of managing the site’s cameras warns that it is undergoing a ransomware attack, it seems preferable to disconnect only the flows between this service provider and the factory network, rather than all the flows, including those to the ERP.
In the case where the industrial process is distributed over several sites (production and distribution plant in particular), the activation of the preventive fallback posture should not cut off the flows between these different sites. Indeed, specific links should be dedicated to this. If this is not the case (for example, if the office network is used to provide these connections), a project to overhaul the industrial network is probably to be expected (deployment of a dedicated VRF or an SD-WAN network, for example).
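One way to prepare such targeted postures is to record, for each external peer, the dedicated port that carries its flows, so that the procedure can name the exact cables to disconnect. The sketch below assumes a hypothetical inventory; the peer names, port labels and the function `ports_to_unplug` are invented for illustration.

```python
# Hypothetical inventory: each external peer and the dedicated port carrying its flows.
EXTERNAL_LINKS = {
    "erp": "FW-IND/eth2",
    "camera_provider": "FW-IND/eth3",
    "energy_cloud": "FW-IND/eth4",
}


def ports_to_unplug(affected_peers=None, links=EXTERNAL_LINKS):
    """Return the ports to disconnect for a fallback posture.

    affected_peers=None means the full preventive posture (all external links);
    otherwise only the links of the named peers are returned.
    """
    if affected_peers is None:
        return sorted(links.values())
    unknown = set(affected_peers) - set(links)
    if unknown:
        raise KeyError(f"no dedicated port recorded for: {sorted(unknown)}")
    return sorted(links[peer] for peer in affected_peers)
```

Calling `ports_to_unplug(["camera_provider"])` returns only the camera provider's port, while calling it without arguments lists every external link.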
Finally, it is always good to remember that each factory is different, so a local study will have to be carried out on each one to understand its specificities.
Last resort fallback posture: switch off the information system in the event of a proven attack on the plant
Finally, it may be interesting to prepare a last resort fallback posture. This should consist of isolating each VLAN (if defined, preferably with a local HMI per VLAN to ensure a degraded mode) or each piece of equipment (turn off the switches) in order to prevent the attacker from continuing their actions, which, in the most advanced attacks, could directly target the site’s industrial process.
The objective is then to secure the site or ensure its essential services. The activation of this posture implies working without an information system and should only be applied in the event of proven compromise of at least one piece of equipment on the site, since it leads to the same immediate result as a ransomware, if not worse.
Upstream work with the operators will be necessary in order to list all the actions to be carried out when this posture is activated and to define degraded modes. Indeed, this will generally require activating on-call duty in order to perform certain tasks manually: checking the correct operation of equipment, especially on remote sites, use of local HMIs, etc. Moreover, some industrial processes can no longer be controlled manually today, and will therefore have to be stopped since no degraded mode is available.
In order to estimate the impacts of activating such a posture, it may be interesting to look at the impacts listed in the event of fire or a general power failure. Moreover, only a real test of this posture can confirm its operational impacts.
How to recover after an attack if it has not been contained?
In some cases, the activation of fallback postures may not be sufficient to protect the entire industrial information system, especially if they are activated too late. It is then essential to be able to proceed with the reconstruction of all or part of the said system in a sufficiently short time to limit the associated impacts.
The main prerequisites for restoring an industrial information system are listed below.
What must be backed up to be able to restore its PLCs?
In order to be able to restart the factory, it is necessary in most cases to start by restoring the PLCs, which requires two main elements.
Having an up-to-date copy of your PLC programs
PLCs are spared in most current attacks, probably because targeting Windows workstations is enough for attackers to achieve their intended objectives. However, attacks are likely to be increasingly targeted, and most PLCs currently in use are not secure (unencrypted and unauthenticated communications, default passwords, administration functionality that cannot be deactivated, etc.).
It is therefore necessary to save these programs, which is already generally the case, particularly on the programming station (sometimes belonging to a service provider) used when the device is commissioned. It should be noted that these backups should be stored on at least one off-line medium, so that they are not encrypted in the same way as the workstation hosting them.
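A simple way to check that an off-line copy is still up to date is to compare file digests between the reference folder on the programming station and the backup medium. The sketch below uses SHA-256 from the Python standard library; the function names and the idea of a "manifest" are assumptions for illustration, not a feature of any PLC vendor tool.

```python
import hashlib
from pathlib import Path


def build_manifest(directory: str) -> dict[str, str]:
    """Map each file under `directory` (relative path) to its SHA-256 digest."""
    root = Path(directory)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def outdated_files(reference: dict[str, str], backup: dict[str, str]) -> list[str]:
    """Return files that are missing from the backup or differ from the reference."""
    return sorted(name for name, digest in reference.items() if backup.get(name) != digest)
```

Running `outdated_files(build_manifest(reference_dir), build_manifest(backup_dir))` each time the off-line medium is connected gives the list of programs to copy again.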
These observations remain valid even for the new generations of PLCs, which, although benefiting from a level of security that is far superior to that of their predecessors, are not invulnerable.
Save a means of downloading these programs to the PLCs
Many PLCs require dedicated software to be programmed. This is the case even if you just want to download an already written program. It is therefore advisable to keep a copy of this software.
In some cases, a programming station disconnected from the network and reserved for this purpose can be a solution. It should be noted, however, that maintaining such a station in a safe condition can quickly become complex. If this solution is selected, this station could also host the copy of the PLC programs. Keeping a second backup set off-line (external hard disk for example) would however be an additional security measure.
Furthermore, if new generations of PLCs are used, with the latest security features enabled, other elements should be backed up such as: PLC program passwords, certificates used for certain communications (or a means of regenerating them) etc.
These prerequisites are also valid for network equipment (firewalls, switches etc.).
What needs to be backed up to be able to restore essential computer hardware?
Identifying what is really needed
Restoring a SCADA system and its associated client workstations is generally equivalent to restoring a Windows system and the associated programs. Several questions must be asked to identify the items to be backed up:
- What equipment is needed? An engineering workstation, a SCADA server, a few operator workstations?
- Is it possible to reinstall the SCADA system from scratch (new installations of Windows and the supervision software) and then deposit a backup of the SCADA configuration? Is this feasible in a sufficiently short time?
- Would a complete copy of the SCADA server disk not be simpler? It would then be sufficient to insert the saved disk and reboot.
- Are changes regularly made to the supervision software? If yes, is it necessary to back them all up? In this case, it seems complex to make a complete copy of the disk each time.
Backing up intelligently
In many cases, backups of Windows workstations are made in the same way as those of PLC programs, by copy/paste. It could then be worth looking at automatic backup mechanisms. However, these are probably to be avoided for factories starting from scratch without enough budget to deploy them properly. Indeed, implementing this type of solution in a secure manner is generally more complex than making a simple bit-by-bit copy of a hard disk.
Do not neglect documentation and training
However, it is not enough to have complete backups available. It is also necessary to draw up detailed operating procedures for restoring these backups. Indeed, if a crisis were to occur, the stress on the teams and the potential unavailability of some of the people holding the knowledge could lead to handling errors in the absence of documentation.
These procedures are not intended to enable a complete restoration of all systems, but at least to enable the essential elements previously identified to be restarted:
- An engineering workstation with the associated PLC programming software;
- A SCADA server;
- Two to three operator workstations;
- The plant’s essential PLCs.
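The order in which these elements are restarted can itself be written into the procedure. The sketch below derives a restoration order from a dependency table using `graphlib.TopologicalSorter` (Python 3.9+). The dependency table is an assumption for illustration: only the constraint that the historian comes after the supervision server is suggested by this article, the others must be defined per site.

```python
from graphlib import TopologicalSorter

# Each key is restored only after every item in its set (illustrative dependencies).
RESTORE_DEPENDENCIES = {
    "essential_plcs": {"engineering_workstation"},
    "scada_server": {"essential_plcs"},
    "operator_workstations": {"scada_server"},
    "historian_server": {"scada_server"},
}


def restore_order(dependencies=RESTORE_DEPENDENCIES):
    """Return one restoration order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())
```

`TopologicalSorter` raises `graphlib.CycleError` if the table contains a circular dependency, which is itself a useful check on the procedure.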
In addition, it is generally recommended to have at least two sets of backups, one stored near the equipment concerned, the other stored on another physical site, with access restricted to a small number of people. It may be tempting to store an additional set of backups online, but it should be noted that in the event of a cyber-attack, once fallback procedures have been activated, it is complex to download these backups and copy them to the systems to be restored.
Finally, it is essential to test all these procedures to ensure that they are exhaustive. A test could, for example, be the opportunity to realise that the backup of the SCADA configuration does not include the licence key, or that the passwords configured when the complete disk was copied have since been modified without keeping the history.
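Part of such a test can be prepared as a checklist comparing what a backup set actually contains with what a restoration needs. The sketch below is deliberately simple; the item names in `REQUIRED_ITEMS` are hypothetical labels and should be replaced by the list drawn up with the site teams.

```python
# Hypothetical checklist of what a restoration needs; adapt to the site.
REQUIRED_ITEMS = {
    "scada_configuration",
    "scada_licence_key",
    "current_passwords",
    "plc_programs",
    "plc_programming_software",
}


def missing_items(backup_contents, required=REQUIRED_ITEMS):
    """Return the required items that are absent from a backup set."""
    return sorted(set(required) - set(backup_contents))
```

A backup set that lacks the licence key would be reported as `["scada_licence_key"]`, before a real crisis reveals the gap.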
Crisis management is an important component of the business for many industrial system operators. These same people are also the most experienced in their perimeter. However, they are generally not IT experts. Pragmatic measures, adapted to their context, will therefore be far more useful than a generic 200-page guide containing all the good practices to be applied to an information system.
As in development with the KISS principle (Keep it simple, stupid), fallback postures, as well as restoration procedures, should be kept simple to understand, and stupid to apply.
Furthermore, although applying a strict network filtering policy is strongly advised, it is not strictly necessary for the implementation of backup and recovery actions. Thus, even if the probability of a successful attack is higher without it, it will still be possible to restore critical systems.
Finally, it should be noted that more and more industrial processes now operate in a just-in-time mode. In this type of context, protecting the industrial system from an attack, or being able to restore it quickly, would not be sufficient to maintain the level of production if order management or distribution, for example, is unavailable. Cyber resilience must therefore be considered at the company level, and not only at the level of the industrial site.
Key Elements
To respond to an attack before it is too late, it is necessary:
- To involve the industrial teams (without which it is highly likely that the IT systems will survive the attack, but that the factory will no longer fulfil its primary mission);
- To control its flows and implement network partitioning/filtering in order to be able to set up fallback postures:
  - Preventive, in order to isolate the factory in the event of an attack on an external network without having too significant an impact on the industrial process;
  - As a last resort, in order to shut down the information system in the event of a proven attack on the factory before the attacker modifies the industrial process.
- To test these fallback postures, in order to ensure that their activation is not worse than the attack.
And where the attack could not be contained, the following elements are generally necessary in order to recover from it:
- Have an up-to-date copy of your PLC programs;
- Keep a means of downloading these programs to the PLCs;
- Have at least one copy of all critical backups on an off-line medium (external hard disk for example);
- Identify your essential computer equipment (in particular so as not to restore the historian server before the supervision server, etc.);
- Back up intelligently: a bit-by-bit copy of the hard disk is sometimes more effective than an automatic copy to a dedicated server, which is generally encrypted at the same time as the systems whose backups it hosts;
- Do not neglect documentation and training (otherwise a forgotten licence key, or someone on holiday, could quickly spell the end of the restoration…).
A new version of the threat assessment was published at the beginning of the year: https://www.cert.ssi.gouv.fr/uploads/CERTFR-2021-CTI-001.pdf