Continuous Monitoring of OT Key Risk Indicators (KRIs)
By Rani Kehat, VP Business Development, Radiflow
The problem with “guesstimating” the probability of a threat
I usually start the process of assessing risk by creating a risk register, filling in the risk statement, description, details on the loss scenario and so on, eventually working my way to the risk analysis part of the register.
The classic formula for risk is:
Risk = Probability x Loss
And in the popular modern Open FAIR™ Risk methodology:
$ Value of Cyber Risk = SUM (LEF X $ ML)
- ML is the Magnitude of Loss for the Asset at Risk
- LEF is the Loss Event Frequency which is derived from the combination of:
  - Threat Event Frequency (TEF): the number of times over the next 12 months the threat is likely to materialize, and
  - Vulnerability: the percentage of threat events that are likely to result in loss events, based on Threat Capability and on Resistance Strength given the security controls installed
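To make the relationship between these terms concrete, here is a minimal Python sketch of a single point estimate. It assumes LEF is the product of TEF and Vulnerability (with Vulnerability expressed as a fraction), which is how the FAIR decomposition is usually read. The function names and the example figures are hypothetical and not taken from a real assessment.

```python
def loss_event_frequency(tef: float, vulnerability: float) -> float:
    """LEF = TEF x Vulnerability (expected loss events per year)."""
    if not 0.0 <= vulnerability <= 1.0:
        raise ValueError("vulnerability must be a fraction between 0 and 1")
    return tef * vulnerability


def cyber_risk_value(scenarios: list[tuple[float, float]]) -> float:
    """Sum LEF x ML over (lef, magnitude_of_loss) pairs, in dollars per year."""
    return sum(lef * ml for lef, ml in scenarios)


# Hypothetical scenario: 4 threat events a year, 25% of them become loss
# events, and each loss event costs $200,000.
lef = loss_event_frequency(tef=4, vulnerability=0.25)
risk = cyber_risk_value([(lef, 200_000)])
print(f"LEF: {lef} per year, risk: ${risk:,.0f} per year")
```

A point estimate like this hides the uncertainty in each input, which is exactly the problem the rest of this article deals with.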
Better input for better results
Many of these parameters are similar and confusing:
“Confusing a loss event with a threat event in an analysis will lead to inaccurate results. Remember, Loss Event Frequency is how often the organization actually suffers a loss and the damaging event materializes.” (Source: the Fair Institute)
Furthermore, since many of these parameters are not available as accurate figures, the calculation uses ranges and a statistical simulation to calculate the probable range of the risk value.
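As a rough illustration of what "ranges and a statistical simulation" means in practice, the sketch below runs a simple Monte Carlo simulation over (min, most likely, max) ranges. It is a simplification under stated assumptions: triangular distributions, and an annual loss modeled as TEF x Vulnerability x ML per run. Real tools may use other distributions and sampling schemes, and all input figures here are hypothetical.

```python
import random
import statistics


def simulate_annual_risk(tef_range, vuln_range, ml_range, runs=10_000, seed=42):
    """Monte Carlo over (min, most_likely, max) ranges.

    Returns the sorted list of simulated annual loss values.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(runs):
        # random.triangular takes its arguments as (low, high, mode).
        tef = rng.triangular(tef_range[0], tef_range[2], tef_range[1])
        vuln = rng.triangular(vuln_range[0], vuln_range[2], vuln_range[1])
        ml = rng.triangular(ml_range[0], ml_range[2], ml_range[1])
        samples.append(tef * vuln * ml)
    samples.sort()
    return samples


def summarize(samples):
    """Return the 10th percentile, mean and 90th percentile of sorted samples."""
    n = len(samples)
    return samples[int(0.10 * n)], statistics.fmean(samples), samples[int(0.90 * n)]


# Hypothetical inputs: same loss magnitude, wide versus narrow frequency inputs.
loss_magnitude = (50_000, 200_000, 1_000_000)
wide = simulate_annual_risk((1, 4, 12), (0.1, 0.5, 0.9), loss_magnitude)
narrow = simulate_annual_risk((3, 4, 5), (0.4, 0.5, 0.6), loss_magnitude)
print("wide inputs:  ", summarize(wide))
print("narrow inputs:", summarize(narrow))
```

Comparing the two summaries shows how the width of the input ranges carries through to the width of the resulting risk range.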
For the ML section, filling in the boxes (for potential loss) isn’t too bad, as they can be defined based on known business parameters such as revenue loss, cost of response and so on.
But then we come to the “head-scratcher” boxes of LEF, TEF and Vulnerability. What values do you enter in those?
Am I to hunt through the latest cyber reports and look at my geolocation and industry breach history? Does past history predict future scenarios? Are all food manufacturers the same? And where am I in this benchmarking cauldron?
So, as often happens with too many such vague parameters, you reach a state of GIGO (Garbage In, Garbage Out): either your calculations are wrong or the range of the resulting risk score is too wide to be useful for decision-making.
Applying a Data-Driven approach
In this article I propose that in order to answer the above questions, we need to use a data-driven approach which combines OT breach attack simulation (OT-BAS) with statistical simulation techniques.
By using virtual OT-BAS we are able to obtain data points on system vulnerability, i.e., Threat Capability and Resistance Strength, for the specific production system under consideration (SUC), not just generic information. Combining breach attack simulation with statistical simulation means that the values entered into the statistical simulation tools come from data rather than from “guesstimating”.
Using the above approach, we can reduce our input variance on threat actor capabilities and resistance strength and thus narrow down the value range of risk we get as an output.
For example, I’ll use the FAIR-U tool on a Phishing database breach scenario supplied with the tool.
For the purpose of this post, I won’t change the Loss Magnitude (ML) side, and only concentrate on the left side, Loss Event Frequency (LEF).
Our starting point: using common sense (I hope my sense is common), with no “prior” information on what values to enter, i.e., the “guesstimate” methodology.
The initial values I entered are: 50/50 on threat capability, 50/50 on probability of action, 40-60-80% on resistance strength, and [1-2-4] on Contact frequency.
As you can see in the chart below, the starting-point LEF looks good, with values in the tolerable risk area.
I’ll run the same statistical simulations with values from an OT-BAS simulation.
After entering the resistance strength and threat capability (TI), taking into account the security level achieved (SLA) at the site, the digital image, and the relevant threat intelligence, the resulting risk values now look very different, and not for the better.
| # | Parameter | Min | Most likely | Max | Notes |
|---|---|---|---|---|---|
| 4 | Probability of Action | 50 | 50 | 50 | |
| 5 | Risk | $0 | Avg $138K | $8.1M | Within tolerable risk |
| 7 | Resistance strength | 30 | 40 | 50 | Site SLA using BAS |
| 8 | Threat Capability | 60 | 70 | 80 | TI, MITRE, BAS |
| 9 | Contact Frequency | 1 | 2 | 4 | No change |
| 10 | Probability of Action | 60 | 70 | 80 | Insight From (7)(8) |
| 11 | Risk | $0 | Avg $4.3M | $22.7M | Exceeding tolerable risk |
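To show why the LEF side moves so much between the two runs, here is a small Python sketch that feeds both sets of inputs into a simplified model. It is not a reproduction of FAIR-U and will not reproduce the dollar figures above. It assumes the usual FAIR decomposition (TEF = contact frequency x probability of action, and a threat event becomes a loss event when the sampled threat capability exceeds the sampled resistance strength), triangular distributions, and a flat 50 for the "50/50" starting values. The function names are hypothetical.

```python
import random


def tri(rng, low, mode, high):
    """Sample a triangular distribution given (min, most likely, max)."""
    return rng.triangular(low, high, mode)


def mean_lef(threat_cap, resistance, contact_freq, prob_action,
             runs=20_000, seed=7):
    """Estimate the mean Loss Event Frequency (loss events per year).

    Assumptions: TEF = contact frequency x probability of action, and a
    threat event counts as a loss event only in runs where the sampled
    threat capability exceeds the sampled resistance strength.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        tef = tri(rng, *contact_freq) * tri(rng, *prob_action) / 100.0
        breached = tri(rng, *threat_cap) > tri(rng, *resistance)
        total += tef if breached else 0.0
    return total / runs


# Starting-point ("guesstimate") inputs described in the text.
guess = mean_lef(threat_cap=(50, 50, 50), resistance=(40, 60, 80),
                 contact_freq=(1, 2, 4), prob_action=(50, 50, 50))

# Inputs taken from the OT-BAS rows of the table.
bas = mean_lef(threat_cap=(60, 70, 80), resistance=(30, 40, 50),
               contact_freq=(1, 2, 4), prob_action=(60, 70, 80))

print(f"mean LEF, guesstimate: {guess:.2f} per year")
print(f"mean LEF, OT-BAS:      {bas:.2f} per year")
```

With the OT-BAS inputs, the threat capability range (60 to 80) sits entirely above the resistance strength range (30 to 50), so in this simplified model every threat event becomes a loss event. That alone accounts for much of the jump in LEF.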
Continuous Risk Monitoring
In today’s ever-changing environment, an annual risk assessment is no longer sufficient. To keep LEF current as the threat landscape and vulnerabilities change, we need to continuously monitor key risk indicators (KRIs) that alert us to changes.
The Institute of Operational Risk (IOR) defines KRIs as metrics that provide information on the level of exposure to a given operational risk which the organization has at a particular point in time.
KRIs are an early warning system of changes in our threat landscape and system vulnerabilities, which provide the needed time to proactively address changes in our risk posture.
Using the changes in these KRIs, we re-run our OT-BAS and enter the new values from the OT-BAS in our risk register, using a statistical simulation tool for the probability ranges.
In order to continuously track changes in LEF, we recommend assigning KRIs to TEF and Vulnerability, in particular to probability of action, threat capability and resistance strength.
Example KRIs for LEF:
TEF KRIs: “How many times will the asset face a threat action?”
- Industrial sector
- Active adversaries
- Adversary capability and ATT
Vulnerability KRIs: “What percentage of threat events are likely to result in loss events?”
- Production functionality and topology
- Distribution of security controls
- Possible threat scenarios causing a loss
- Connectivity, interdependencies
- Escalation and propagation of a loss scenario
- Vulnerabilities in system and procedures
Use the OT-BAS to determine and prioritize “Key” Indicators
It’s important to note that KRI scores are dynamic. They may not change as frequently as the events per second (EPS) fed to a SIEM, but KRI alerts need to be timely enough to identify a shift in risk posture and give us the needed time to adjust our defenses.
Changes to a KRI signal a change in the level of risk exposure associated with specific processes and activities. Thus, KRIs are proactive metrics used by organizations to provide an early signal of increasing risk exposures in various areas of the enterprise.
We recommend the following workflow:
- Understand TEF
- Understand Vulnerability
- Understand whether a threat event scenario is a loss scenario
- Address only scenarios that cause loss
- Use KRIs for each loss threat category – no more than 5 KRIs.
- Monitor changes in the KRIs and re-evaluate risk for each such change
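The last two steps of this workflow can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the `KriReading` structure, the normalized 0 to 1 score scale and the change tolerance are all hypothetical, and a real deployment would pull KRI values from monitoring and threat-intelligence sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KriReading:
    name: str        # e.g. "Active adversaries"
    lef_factor: str  # "TEF" or "Vulnerability"
    value: float     # normalized score, 0.0 to 1.0 (hypothetical scale)


def changed_kris(previous, current, tolerance=0.05):
    """Return KRIs that are new or whose score moved by more than `tolerance`."""
    before = {kri.name: kri.value for kri in previous}
    changed = []
    for kri in current:
        old = before.get(kri.name)
        if old is None or abs(kri.value - old) > tolerance:
            changed.append(kri)
    return changed


def needs_reassessment(previous, current, max_kris=5):
    """Enforce the 'no more than 5 KRIs' rule and report whether any KRI
    changed enough to justify re-running the OT-BAS and the risk analysis."""
    if len(current) > max_kris:
        raise ValueError(f"track at most {max_kris} KRIs per loss threat category")
    return bool(changed_kris(previous, current))


last_month = [
    KriReading("Active adversaries", "TEF", 0.30),
    KriReading("Connectivity, interdependencies", "Vulnerability", 0.40),
]
this_month = [
    KriReading("Active adversaries", "TEF", 0.55),
    KriReading("Connectivity, interdependencies", "Vulnerability", 0.42),
]

for kri in changed_kris(last_month, this_month):
    print(f"KRI changed: {kri.name} ({kri.lef_factor}), re-evaluate risk")
```

Tagging each KRI with the LEF factor it feeds makes it clear which input of the statistical simulation needs a new value after the OT-BAS re-run.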
Example of a risk register with KRIs
KRIs are added to the risk register to proactively recommend mitigation controls that would reduce the risk before a loss event happens.
Below is an example of the extended risk register with quantitative values to address LEF.
| Risk category: | OT operational |
| Risk Program: | Cyber Origin / network connectivity |
| Risk Title (scenario): | Loss of control on heat level – boiler tank A12. Remote safe shut-down not possible |
| Frequency of scenario as security target LEF (times per year): | |
| Current values after BAS | |
| Current Threat likelihood of risk title (simulated): | 80% (High) |
| Estimated current LEF | 1<N<=10 |
| BAS simulation to SLT3 | |
| Threat likelihood of risk mitigated to IEC62443 SLT3 (simulated): | 45% (Medium) |
| Estimated simulated mitigated LEF at SLT3 | 1<N<=5 |
| Overall Impact rating: | High (I omitted the pre overall impact calculation stages for simplicity) |
| Overall risk rating: | High (80%) |
| Risk tolerance: | Medium (45%) |
| Risk response: | Mitigate overall risk rating down by reducing threat likelihood SLA = SLT3 to reach tolerance level |
| KRIs for risk title: | New ATT and cyber tools; Change of asset vendor; Change in connectivity to asset; Change of project on logic controller |
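For readers who keep their risk register in code or export it to other tools, here is a minimal sketch of how an entry like the one above could be represented. The `RiskRegisterEntry` class, its field names and the response labels are hypothetical choices of mine, and the decision rule is a simplification that only compares simulated likelihoods with the risk tolerance.

```python
from dataclasses import dataclass, field


@dataclass
class RiskRegisterEntry:
    title: str
    threat_likelihood: float     # current simulated likelihood, 0.0 to 1.0
    mitigated_likelihood: float  # simulated likelihood at the target security level
    tolerance: float             # highest likelihood the organization accepts
    kris: list[str] = field(default_factory=list)

    def response(self) -> str:
        """Pick a risk response by comparing likelihoods with the tolerance."""
        if self.threat_likelihood <= self.tolerance:
            return "accept"
        if self.mitigated_likelihood <= self.tolerance:
            return "mitigate to target security level"
        return "mitigate further or escalate"


entry = RiskRegisterEntry(
    title="Loss of control on heat level, boiler tank A12",
    threat_likelihood=0.80,     # 80% (High), current values after BAS
    mitigated_likelihood=0.45,  # 45% (Medium), BAS simulation to SLT3
    tolerance=0.45,             # Medium (45%)
    kris=[
        "New ATT and cyber tools",
        "Change of asset vendor",
        "Change in connectivity to asset",
        "Change of project on logic controller",
    ],
)
print(entry.title, "->", entry.response())
```

Keeping the KRIs on the entry itself means that a change in any of them points directly at the scenario that needs to be re-simulated.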
So the next time we are challenged by the “head-scratcher” Loss Event Frequency, we recommend a data-driven approach: use statistical tools such as FAIR-U, and add data points derived from breach attack simulation on your specific production environment, thus narrowing the ranges of inputs for the statistical calculation.
Assessing OT network risk requires knowing both the impact of a materialized threat on each specific business process and the Loss Event Frequency associated with a specific threat. “Guesstimating” these values would result in skewed findings and mitigation recommendations; therefore, a data-driven approach based on breach attack simulation is needed.