
Saturday, December 17, 2016

High Reliability Organization and Resilience Engineering

Notes on journal papers about High Reliability Organizations and Resilience Engineering

 
When learning a new body of theory, it is best to start with the opposing critiques.


Disclaimer: dry notes on academic journal articles




The Limits to Safety? Culture, Politics, Learning and Man-Made Disasters
Nick Pidgeon
Journal of Contingencies and Crisis Management Volume 5 Number 1 March 1997 

Man-Made Disasters: Organizations as Vulnerable Socio-Technical Systems

A disaster involves a significant disruption or collapse of the existing cultural beliefs and norms about hazards and the means for dealing with them and their impacts. All organizations operate with such cultural beliefs and norms, which might be formally laid down in rules and procedures, or more tacitly taken for granted and embedded within working practices.

In Turner's terms, disaster is then differentiated from an accident by the recognition (often accompanied by considerable surprise) that there has been some critical divergence between those assumptions and the 'true' state of affairs.

MMD also highlights how system vulnerability often arises from unintended and complex interactions between contributory preconditions, each of which would be unlikely, singly, to defeat the established safety systems.

This point was explored later by Perrow (1984) in his more deterministic account of the causes of normal accidents in technological systems.

 

Why do organizations turn a blind eye to latent and accumulating risks?

Four classes of information difficulties are central to this cultural process of defective reality testing and denial. They stem from the attempts of both individuals and organizations to deal with problems that are, in foresight at least, highly uncertain and ill-structured.

1. Critical errors and events may initially remain latent, or are misunderstood, because of wrong assumptions about their significance. This leads to a selective problem representation at the level of the organization as a whole, a situation which, in turn, structures the interpretations and decisions of the organization's individual members. Such a representation may arise through organizational rigidity of beliefs about what is and is not to be counted a 'hazard'.

A related syndrome described in MMD is the 'decoy phenomenon'. Here personnel who are dealing directly with risk and hazard management, or others who suspect there is something amiss, may be distracted or misled into thinking the situation has been resolved by attention to related (that is, decoy) events.
 

2. Dangerous preconditions may also go unnoticed because of the inherent difficulties of handling information in ill-structured and constantly changing situations, leading to a condition described by Turner as variable disjunction of information. Here the problem may become so complex, vague or dynamic, and the information available at any one time so dispersed across many locations and parties, that different individuals and organizations can only ever hold a partial (and often very different and changing) interpretation of the situation.
 

3. Uncertainty may also arise about how to deal with formal violations of safety regulations. Violations might occur because regulations are ambiguous, in conflict with other goals such as the needs of production, or thought to be outdated because of technological advance. Alternatively, safety waivers may be in operation, allowing relaxation of regulations under certain circumstances as occurred in the case of the Space Shuttle Challenger O-ring seals. 

4. Finally, when things do start to go obviously wrong, the outcomes are often worse than they might have been because those involved will tend to minimize danger as it emerges, or to deny that danger threatens them.

The 'radius of foresight' is much shorter than the 'radius of action', shaping blindness to certain forms of hazard.

 

 

Political Design for Political Problems?

Can institutional resilience be a realistic design goal, through changes to an organization's safety culture?

A first issue to resolve is the problem of warnings.

Few would probably disagree that foresight is indeed limited and, as such, the identification of 'signals' in advance of a major failure is problematic. But just how limited? For if the identification of system vulnerability sets an impossible task, then high reliability cannot be achieved irrespective of politics.

On a more pragmatic level, one needs to know whether differences in safety performance observed across contexts and in foresight are more than mere error variance. Most of the time, as Sagan's (1993) account only too readily illustrates, it is a matter of judgement as to whether the current safety glass is half-empty or half-full. Certainly, careful observation and measurement of theoretically relevant events (unsafe acts; known barriers to communication; diffusion and fragmentation of responsibilities; financial constraints) is one route to follow, and with some success (Wagenaar et al., 1994), although it remains to be seen precisely which empirical questions will differentiate vulnerable from resilient systems.

 

 

 

High reliability organizations (HROs)
Kathleen M. Sutcliffe
Best Practice & Research Clinical Anaesthesiology 25 (2011) 133–144 

How to define what counts as an HRO
One can identify this subset by answering the question, 'how many times could this organisation have failed, resulting in catastrophic consequences, that it did not?' If the answer is on the order of tens of thousands of times, the organisation is 'high reliability'.
Heh. From a practical standpoint, having an accident is not strange at all; the real mystery is why so many shoddy operations never have one.
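The 'tens of thousands of times' criterion can be given a rough statistical reading. A minimal sketch (my own illustration, not from Sutcliffe's paper) using the standard 'rule of three': after n independent incident-free opportunities, the 95% upper confidence bound on the per-opportunity failure probability is roughly 3/n, so even a spotless record only bounds the risk, it never proves safety.

```python
def failure_prob_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """95% upper bound on per-trial failure probability given zero
    failures in n_trials, from solving (1 - p)**n = 1 - confidence.
    For confidence=0.95 this is approximately 3 / n_trials."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)

# ~10,000 incident-free opportunities bound the failure rate near 3e-4;
# a mere 100 opportunities bound it only near 3e-2.
print(failure_prob_upper_bound(10_000))  # ~0.0003
print(failure_prob_upper_bound(100))     # ~0.0296
```

Which is one way to read the next quote: the bound assumes a stable process, and says nothing about an organisation that is drifting.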

there are no safe organisations because past performance cannot determine the future safety of any organisation.
They deserve a more accurate name:
Reliability-seeking organisations are not distinguished by their absolute errors or accident rate, but rather by their "effective management of innately risky technologies through organisational control of both hazard and probability".

 

Competing approaches to achieving reliability

Prevention
Prevention or anticipation requires that organisational members try to anticipate and identify the events and occurrences that must not happen, identify all possible causal precursor events or conditions that may lead to them and then create a set of procedures for avoiding them.
Studies show how HROs are obsessed with detailed operating procedures, contingency plans, rules, protocols and guidelines as well as using the tools of science and technology to better control the behaviour of organisational members to avoid errors and mistakes.

Nevertheless, research also shows that adherence to rules and procedures alone will not prevent incidents. There are limits to the logic of prevention.

One limitation is that unvarying procedures cannot handle what they do not anticipate. Moreover, even if procedures could be written for every situation, there are costs of added complexity that come with too many rules.
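One back-of-envelope way to see the cost of 'too many rules' (my own arithmetic, not from the paper): every new procedure can potentially interact or conflict with every existing one, so the number of pairwise interactions to check grows quadratically with the rule count.

```python
# Pairwise interactions among n rules: n * (n - 1) / 2
for n_rules in (10, 100, 1000):
    pairs = n_rules * (n_rules - 1) // 2
    print(f"{n_rules:>5} rules -> {pairs:>7} potential pairwise conflicts")

#    10 rules ->      45 potential pairwise conflicts
#   100 rules ->    4950 potential pairwise conflicts
#  1000 rules ->  499500 potential pairwise conflicts
```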


Resilience

HROs are unique in that they understand that reliability is not the outcome of organizational invariance, but rather, results from a continuous management of fluctuations in job performance and human interactions. To be able to become alert and aware of these inevitable fluctuations, to cope with, circumscribe or contain untoward events, such as mistakes or errors, ‘as they occur’ and before their effects escalate and ramify, HROs also build capabilities for resilience.

Resilience involves three abilities:
(1) the ability to absorb strain and preserve functioning in spite of the presence of adversity (e.g., rapid change, ineffective leadership, performance and production pressures, increasing demands from stakeholders);
(2) an ability to recover or bounce back from untoward events, as the team, unit or system becomes better able to absorb a surprise and stretch rather than collapse; and
(3) an ability to learn and grow from previous episodes of resilient action.

 

Characteristics of HROs
Mindful organising (situation awareness) forms a basis for individuals to interact continuously as they develop, refine and update a shared understanding of the situation they face and their capabilities to act on that understanding. Mindful organising proactively triggers actions that forestall and limit errors and crises.

First, HROs build a group and organisational culture where it is the norm for people to interact respectfully. Second, they foster a culture where people interrelate heedfully, so that they become more consciously aware of how their work fits in with the work of others and the goals of the system. Third, HROs establish a set of practices that enable them to track small failures, resist oversimplification of what they face, remain sensitive to current operations, maintain capabilities for resilience and take advantage of shifting locations of expertise.








 

 

Applying HRO and resilience engineering to construction: Barriers and opportunities
Eleanor J. Harvey, Patrick Waterson, Andrew R.J. Dainty
Safety Science 2016

The evolutionary arc of safety thinking




Normal Accidents Theory (NAT) originated from the Three Mile Island nuclear accident.

HRO theory originated as a rebuttal to NAT, built on five years of field observation and interviews on aircraft carrier operations.
High reliability organizations are characterized by their capacity to respond, learn, and feedback quickly through accurate communications, and their flexibility to improvise by recombining resources, skills and experience.

RE, meanwhile, blends many earlier ideas: human factors, organizational culture and system safety.
The characteristics of a resilient organisation are less well-defined than high reliability organisations, but the RE community believes any organisation can become resilient, with different industries managing stability and flexibility in different ways 

All of these theories remain conceptual; it is hard to define which organizations are HROs or resilient and which are not, let alone to go further and empirically compare them to establish an effect size.




Earlier, traditional views

Before the age of systems safety, accidents were believed to have a root cause - a technical malfunction or individual failure on which events could be blamed.

Blaming the people who err and the equipment that underperforms has its advantages: this simplistic model is emotionally satisfying and has legal and financial benefits.

The prominence of the ‘Zero Accidents’ discourse also confirms this model.

Accidents in HRO are described in causal terms, as the result of an unfortunate combination of a number of errors; hence, detecting failures as they develop through sensitivity to weak signals is advocated. (Note: the weaker the signal, the higher the risk of Type I and Type II misjudgements; see the sketch below.)
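That parenthetical note can be made concrete with a toy signal-detection model (my own sketch, assuming unit-variance Gaussian noise; the numbers are purely illustrative): the weaker the precursor signal, the more any alarm threshold forces a trade between false alarms (Type I) and misses (Type II).

```python
from statistics import NormalDist

noise = NormalDist(mu=0.0, sigma=1.0)         # background fluctuation
weak_signal = NormalDist(mu=1.0, sigma=1.0)   # precursor barely above noise

for threshold in (0.5, 1.0, 1.5, 2.0):
    type1 = 1.0 - noise.cdf(threshold)    # alarm raised on pure noise
    type2 = weak_signal.cdf(threshold)    # true precursor goes unflagged
    print(f"threshold={threshold:.1f}  Type I={type1:.2f}  Type II={type2:.2f}")

# For a signal only one sigma above the noise, no threshold makes both
# error rates small at once -- exactly the complaint in the note above.
```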

Based on this interpretation, risk analysis depends upon the systematic identification of causal chains and implies safety is a static commodity that can be quantified, not a dynamic process.

For RE, safety is a dynamic process, human behaviour cannot be categorised in a bimodal way and the causes of accidents are far more subtle and complex – nothing worth reporting happens. Instead, accidents are caused by an undetectable ‘‘drift into failure” which is a natural part of operations in resource constrained environments.

The efficiency thoroughness trade off (ETTO) principle (the tendency to sacrifice thoroughness for efficiency) is key to understanding the 'drift' that means failure can develop out of normal behaviour. Humans have a natural tendency towards efficiency (Hollnagel, 2009). Rational decision-making is also limited by context, subject to social and cultural factors (Perrow, 1984), and constrained by finite cognitive resources, so people "muddle through", making what they perceive to be "sensible adjustments" to cope with current and future situational demand.
Heh. RE's viewpoint turns out to be strikingly similar to NAT's.

 



 

Drift, adaptation, resilience and reliability: Toward an empirical clarification
Kenneth A. Pettersen, Paul R. Schulman
Safety Science 2016 

The paper's key question
How can we differentiate adaptation and resilience from an organizational drift which undermines reliability and safety?


Resilience comes in more than one kind:
Resilience has alternately been conceived as:
"rebound" from failures;
"robustness" (absorbing shocks without major failures);
"graceful extensibility" (extending boundaries or "stretching" organizational capacity to reduce brittleness and cope with surprises); and
sustained adaptability.

 

Anticipation and preparation before the event
Precursor resilience, which is about monitoring and keeping operations within a bandwidth of conditions, and acting quickly to restore these conditions as a way of managing risk.

The capacity and speed of recovery after an incident
Restoration resilience, which consists of rapid actions to resume operations after temporary disruption.

The system's capacity to adapt
Recovery resilience, which is about putting damaged systems back together to establish a ‘‘new normal” at least as reliable and robust as before, if not improved. 

Heh. For many organizations gradually decaying under the second law of thermodynamics, the 'adaptations' they make actually leave all three kinds of resilience above progressively worse (just as with ageing: cognition and memory, bone density, muscular endurance and reaction speed only decline):
“adaptations” actually become a negative drift in relation to the pursuit of larger reliability and safety goals in these organizations.


So-called 'adaptations':
Viewed positively, they are MacGyver-style improvisation, making do with whatever is at hand to get the job done under limited resources and assorted constraints.
Viewed negatively, they amount to a collective ostrich mentality, corner-cutting and a structure of complicity (the work only goes as far as the available resources and everyone's attitude will carry it).
 

What a razor-sharp passage:
No organization is exempt from drifting into failure.
The reason is that routes to failure trace through the structures, processes and tasks that are necessary to make an organization successful.
Failure does not come from the occasional, abnormal dysfunction or breakdown of these structures, processes and tasks, but is an inevitable by-product of their normal functioning.
The same characteristics that guarantee the fulfillment of the organization’s mandate will turn out to be responsible for undermining that mandate.[Dekker, 2011, p. xiii]
 


The latent conflict between the two theories, HRO and RE
High Reliability Organization (raising reliability = reducing variance): education and training to improve people's awareness and competence; building SOPs and checklists; audit and error-proofing procedures.

Resilience Engineering (enlarging the organization's and the system's capacity to tolerate faults and absorb variance): for example, layers of protection, redundancy and backup on the engineering side; on the management side, imagining extreme scenarios in advance in order to draft contingency plans and BCPs, and to prepare alternatives.

These approaches force trade-offs; you cannot have it both ways.
Raising reliability = standardization and strict control (= losing the flexibility to improvise and act on one's own judgement).
The flexibility and the investment in spares and backups that build resilience can, from some angles and to penny-pinching management, look like inefficient, unproductive waste (the sketch below puts rough numbers on this trade-off).
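A minimal numeric sketch (mine, assuming independent layer failures, an assumption that real common-cause failures violate; the p_layer and cost figures are invented): k redundant protection layers cut the joint failure probability to p^k, while cost grows linearly and the marginal risk reduction shrinks.

```python
p_layer = 0.01        # assumed failure probability of a single layer
cost_per_layer = 1.0  # invented unit cost, purely illustrative

for k in range(1, 5):
    p_system = p_layer ** k  # all k independent layers must fail together
    print(f"layers={k}  P(system failure)={p_system:.0e}  cost={k * cost_per_layer:.0f}")

# Risk falls geometrically while cost rises linearly -- which is why
# a cost-cutting manager is tempted to strip the 'redundant' later layers.
```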


High reliability organizations are characterized by well understood and relatively stable technologies. They feature elaborate analysis, anticipatory planning and modeling of their technical systems (in American nuclear plants, for example, it is a violation of federal regulations to operate them ‘‘outside of analysis”). This analysis and anticipation is reflected in detailed procedures which govern most of the work.
Even innovation is subjected to careful system-wide analysis (Schulman, 1993).

The techniques by which HROs achieve safety
HROs in fact significantly reduce uncertainty by means of (at least) the following features:

• Reliability and safety goals are clear and well monitored. There is a strong shared recognition of the stakes of system failure. Further, no production or output goal is allowed to come before safety and reliability. HROs will shut down operations rather than operate in unsafe conditions, including uncertainty, and there is public and political support for this priority. (When it is not safe, production stops.)

• There is careful management of bandwidths in organization and operation. Tasks and the operation of the organization and its technical systems are kept well within the limits of its known reliability envelope. There is also effective "precursor resilience" (as we will describe) to monitor and restore operations within specified bandwidths (Roe et al., 2002). (Do not challenge or test the limits; a toy sketch of such bandwidth monitoring follows after this list.)

• Protection of social structures (e.g. social networks are carefully managed and new members are trained and socialized over long periods of time). (The interpersonal fabric of the team.)

• Skepticism concerning change. Improvement is seen as a necessity, but change to accomplish this improvement is approached with caution. If practical change is introduced, this will be done under a systemic and not simply a localized perspective (Binci and Cerruti, 2012). The dominant attitude concerning change in HROs we have observed is to always assure that every step toward improvement keeps the organization at least as reliable as it currently is. HRO managers do not take risks to reduce risks. (Conservative toward change and innovation.)
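The bandwidth bullet above (and the "precursor resilience" defined earlier) can be caricatured in a few lines. A minimal sketch with invented thresholds and names (OPERATING_BAND, HARD_LIMIT and monitor are mine, not from the paper): deviations are corrected while still far from the real safety limit, and operations stop rather than run unsafe.

```python
OPERATING_BAND = (60.0, 80.0)  # band operations are deliberately kept within
HARD_LIMIT = 95.0              # actual safety limit, never approached normally

def monitor(reading: float) -> str:
    low, high = OPERATING_BAND
    if low <= reading <= high:
        return "normal operations"
    if reading < HARD_LIMIT:
        # Precursor resilience: restore conditions while the deviation
        # is still well inside the known reliability envelope.
        return "restore to bandwidth"
    # HRO priority: shut down rather than operate in unsafe conditions.
    return "shut down"

for r in (72.0, 85.0, 96.0):
    print(r, "->", monitor(r))  # normal / restore / shut down
```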

 

Organizational drift and risk of catastrophic failure

In apparently safe states it is difficult to maintain a commitment to system safety over time as safety goals will be compromised, particularly under conditions of scarce resources.

When a technology is viewed as reliable and many successes and no failures are observed, there is increased confidence, coupled with political incentives to revise the likelihood of failure downwards in order to justify shifting resources to other activities. With an improved safety record and long periods of safe performance, resources gradually shift away from safety toward support of efficiency goals. This leads to reduced safety margins and a drift away from safety concerns that may eventually push an organization toward increased vulnerability and allow another catastrophe.

 

the Cycle of Failure
[Figure from the paper: the cycle of failure; not reproduced here.]

Questions to ask when assessing drift

1. Drift with respect to what?

Is drift connected to a shift in goals, values, psychology, protocols or practices? (E.g. goal displacement from safety to efficiency, or a cognitive change from formal to schema-based decision-making.)

2. Drift with respect to whom?

Who is making behavioral changes? Who is going to be making representational errors because of them: operators, directors, regulators, stakeholders, and/or the public?

 

 

 

Can Safety Change Management counter organizational drift and decay under the second law of thermodynamics?
Safety change management – A new method for integrated management of organizational and technical changes
Marko Gerbec
Safety Science xxx (2016)
 

Which changes count as critical?

a. The technical/technological and organizational changes are interconnected in an organization, so changes should be managed in an integrated way.

b. The complexity and propagation of the impacts likely spans over more than one organizational level, so the changes shall be managed considering implications on all relevant levels.

c. The ‘‘pure” technical/technological impact(s), as well as organizational issues impacted at various management levels, shall be clearly identified, categorized and subject to careful safety evaluation, planning and documentation.

Purely technical or material changes are not the point; what matters are the changes that implicate multiple levels and different units of responsibility.

 

The overall purpose is to prevent risk information gaps among the stakeholders in a change; thus the proposed approach builds on the concept of situational awareness / common operational picture.
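A minimal data-structure sketch of that idea (entirely my own; the level names and the ChangeRecord class are hypothetical, not Gerbec's notation): a change is not considered fully evaluated until every organizational level has recorded its impact assessment, which is one way to close a risk information gap.

```python
from dataclasses import dataclass, field

LEVELS = ("operations", "plant management", "corporate")  # hypothetical levels

@dataclass
class ChangeRecord:
    description: str
    impacts: dict = field(default_factory=dict)  # level -> evaluated impact

    def missing_evaluations(self) -> list:
        # The gap the method targets: levels that have not yet assessed
        # the change and so lack a common operational picture of it.
        return [lvl for lvl in LEVELS if lvl not in self.impacts]

change = ChangeRecord("replace pump seal material")
change.impacts["operations"] = "new seal needs a different torque procedure"
print(change.missing_evaluations())  # ['plant management', 'corporate']
```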





What each level needs to look at




Heh. Well-intentioned in concept, but a massive and complicated engineering effort to execute.
Expecting safety staff to second-guess the company's product portfolio?!


