    Effectively Addressing NASA’s Organizational and Safety Culture: Insights from Systems Safety and Engineering Systems¹

    By

    Nancy Leveson, Joel Cutcher-Gershenfeld, Betty Barrett, Alexander Brown,

    John Carroll, Nicolas Dulac, Lydia Fraile, Karen Marais

    MIT

1.0 Introduction

    Safety is an emergent, system property that can only be approached from a systems perspective. Some aspects of safety can be observed at the level of the particular components or operations, and substantial attention and effort is usually devoted to the reliability of these elements, including elaborate degrees of redundancy. However, the overall safety of a system also includes issues at the interfaces of particular components or operations that are not easily observable if approached in a compartmentalized way. Similarly, system safety requires attention to dynamics such as drift in focus, erosion of authority, desensitization to dangerous circumstances, incomplete diffusion of innovation, cascading failures, and other dynamics that are primarily visible and addressable over time, and at a systems level.

    This paper has three goals. First, we seek to summarize the development of System Safety as an independent field of study and place it in the context of Engineering Systems as an emerging field of study. The argument is that System Safety has emerged in parallel with Engineering Systems as a field and that the two should be explicitly joined together. For this goal, we approach the paper as surveyors of new land, placing markers to define the territory so that we and others can build here.

    Second, we will illustrate the principles of System Safety by taking a close look at the two space shuttle disasters and other critical incidents at NASA that are illustrative of safety problems that cannot be understood with a decompositional, compartmentalized approach to safety. While such events are rare and are, in themselves, special cases, investigations into such disasters typically open a window into aspects of the daily operations of an organization that would otherwise not be visible. In this sense, these events help to make the systems nature of safety visible.

    Third, we seek to advance understanding of the interdependence between social and technical systems when it comes to system safety. Public reports following both shuttle disasters pointed to what were termed organizational and safety culture issues, but more work is needed if leaders at NASA or other organizations are to be able to effectively address these issues. We offer a framework for systematically taking into account social systems in the context of complex, engineered technical systems. Our aim is to present ways to address social systems that can be integrated with the technical work that engineers and others do in an organization such as NASA. Without a nuanced appreciation of what engineers know and how they know it, paired with a comprehensive and nuanced treatment of social systems, it is impossible to expect that they will incorporate a systems perspective in their work.

    Our approach contrasts with a focus on the reliability of systems components, which are typically viewed in a more disaggregated way. During design and development, the Systems Safety approach surfaces questions about hazards and scenarios at a system level that might not otherwise be seen. Following incidents or near misses, System Safety seeks root causes and systems implications, rather than just dealing with symptoms and quick-fix responses. Systems Safety is an exemplar of the Engineering Systems approach, providing a tangible application that has importance across many sectors of the economy.

     1 This paper was presented at the Engineering Systems Division Symposium, MIT, Cambridge, MA, March 29-31, 2004. © Copyright by the authors, March 2004. All rights reserved. Copying and distributing without fee is permitted provided that the copies are not made or distributed for direct commercial advantage and provided that credit to the source is given. Abstracting with credit is permitted. This research was partially supported by the NASA Ames Engineering for Complex Systems grant NAG2-1543.

2.0 System Safety: An Historical Perspective

    While clearly engineers have been concerned about the safety of their products for a long time, the development of System Safety as a separate engineering discipline began after World War II.² It resulted from the same factors that drove the development of System Engineering, that is, the increasing complexity of the systems being built overwhelmed traditional engineering approaches.

    Some aircraft engineers started to argue at that time that safety must be designed and built into aircraft just as are performance, stability, and structural integrity.³ ⁴ Seminars were conducted by the Flight Safety Foundation, headed by Jerome Lederer (who would later create a system safety program for the Apollo project) that brought together engineering, operations, and management personnel. Around that time, the Air Force began holding symposiums that fostered a professional approach to safety in propulsion, electrical, flight control, and other aircraft subsystems, but they did not at that time treat safety as a system problem.

    System Safety first became recognized as a unique discipline in the Air Force programs of the 1950s to build intercontinental ballistic missiles (ICBMs). These missiles blew up frequently and with devastating results. On the first programs, safety was not identified and assigned as a specific responsibility. Instead, as was usual at the time, every designer, manager, and engineer had responsibility for ensuring safety in the system design.

    These projects, however, involved advanced technology and much greater complexity than had previously been attempted, and the drawbacks of the then standard approach to safety became clear when many interface problems went unnoticed until it was too late. Investigations after several serious accidents in the Atlas program led to the development and adoption of a System Safety approach that replaced the alternatives, "fly-fix-fly" and "reliability engineering."⁵

    In the traditional aircraft fly-fix-fly approach, investigations are conducted to reconstruct the causes of accidents, action is taken to prevent or minimize the recurrence of accidents with the same cause, and eventually these preventive actions are incorporated into standards, codes of practice, and regulations. Although the fly-fix-fly approach is effective in reducing the repetition of accidents with identical causes in systems where standard designs and technology are changing very slowly, it is not appropriate in new designs incorporating the latest technology and in which accidents are too costly to use for learning. It became clear that for these systems it was necessary to try to prevent accidents before they occur the first time.

    Another common alternative to accident prevention at that time (and now in many industries) is to prevent failures of individual components by increasing their integrity and by the use of redundancy and other fault tolerance approaches. Increasing component reliability, however, does not prevent accidents in complex systems where the problems arise in the interfaces between operating (non-failed) components.

     2 For a history of system safety, see Nancy Leveson, Safeware, Addison-Wesley, 1995.

     3 C.O. Miller, A Comparison of Military and Civil Approaches to Aviation System Safety, Hazard Prevention, May/June 1985, pp. 29-34.

     4 Robert Stieglitz, Engineering for Safety, Aeronautical Engineering Review, February 1948.

     5 William P. Rogers, Introduction to System Safety Engineering, John Wiley and Sons, 1971.

    System Safety, in contrast to these other approaches, has as its primary concern the identification, evaluation, elimination, and control of hazards throughout the lifetime of a system. Safety is treated as an emergent system property and hazards are defined as system states (not component failures) that, together with particular environmental conditions, could lead to an accident. Hazards may result from component failures but they may also result from other causes. One of the principal responsibilities of System Safety engineers is to evaluate the interfaces between the system components and to determine the impact of component interaction (where the set of components includes humans, hardware, and software, along with the environment) on potentially hazardous system states. This process is called System Hazard Analysis.
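    The distinction between a hazard as a system state and a component failure can be illustrated with a small sketch. The toy system below (a valve controller and a heater controller on a pressure vessel) is entirely hypothetical and is not drawn from any NASA system; every name in it is an assumption made for illustration. Each controller meets its own specification, yet enumerating the combined states still turns up a hazardous one.

```python
from itertools import product

# Hypothetical toy system: both controllers below satisfy their own
# component-level specification; neither ever "fails".

def valve_controller(mode):
    """Spec: the vent valve is open only in 'fill' mode."""
    return mode == "fill"

def heater_controller(temp_low):
    """Spec: the heater is on whenever the temperature is low."""
    return temp_low

def is_hazard(valve_open, heater_on):
    """System-level hazard (assumed): heating a sealed vessel."""
    return heater_on and not valve_open

def find_hazardous_states():
    """Enumerate every input combination and keep those that produce a hazard."""
    hazards = []
    for mode, temp_low in product(["fill", "hold"], [False, True]):
        valve_open = valve_controller(mode)
        heater_on = heater_controller(temp_low)
        if is_hazard(valve_open, heater_on):
            hazards.append((mode, temp_low))
    return hazards

print(find_hazardous_states())
```

    The one state reported (hold mode with a low temperature) involves no component failure at all; it only becomes visible when the interaction between the two controllers is examined, which is the kind of question a system hazard analysis is meant to raise.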

    System Safety activities start in the earliest concept formation stages of a project and continue through design, production, testing, operational use, and disposal. One aspect that distinguishes System Safety from other approaches to safety is its primary emphasis on the early identification and classification of hazards so that action can be taken to eliminate or minimize these hazards before final design decisions are made. Key activities (as defined by System Safety standards such as MIL-STD-882) include top-down system hazard analyses (starting in the early concept design stage and continuing through the life of the system); documenting and tracking hazards and their resolution (i.e., establishing audit trails); designing to eliminate or control hazards and minimize damage; maintaining safety information systems and documentation; and establishing reporting and information channels.
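    As a rough illustration of what "documenting and tracking hazards and their resolution" might look like in practice, the following minimal sketch keeps a hazard log in which every status change must carry a rationale, so that an audit trail accumulates automatically. The class names, fields, and status values are assumptions made for this example; they are not taken from MIL-STD-882 or from any NASA tracking system.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum

class Status(Enum):
    # Illustrative status values only (an assumption, not a standard's list).
    OPEN = "open"
    CONTROLLED = "controlled"
    CLOSED = "closed"

@dataclass
class Hazard:
    ident: str
    description: str
    severity: str                      # free-form label in this sketch
    status: Status = Status.OPEN
    history: list = field(default_factory=list)

    def update(self, new_status, rationale, who, when):
        """Change status and append an audit-trail entry; a rationale is required."""
        if not rationale.strip():
            raise ValueError("a status change needs a recorded rationale")
        self.history.append((when, who, self.status, new_status, rationale))
        self.status = new_status

class HazardLog:
    def __init__(self):
        self._hazards = {}

    def add(self, hazard):
        if hazard.ident in self._hazards:
            raise KeyError(f"duplicate hazard id: {hazard.ident}")
        self._hazards[hazard.ident] = hazard

    def unresolved(self):
        """Hazards that have not been closed, in insertion order."""
        return [h for h in self._hazards.values() if h.status is not Status.CLOSED]

# Hypothetical usage
log = HazardLog()
h1 = Hazard("H-001", "Heater energized while vessel is sealed", "high")
log.add(h1)
h1.update(Status.CONTROLLED, "Interlock added between heater and valve",
          "design review", date(2004, 3, 29))
print([h.ident for h in log.unresolved()])
```

    Requiring a rationale at the point of change, rather than reconstructing it later, is one simple way to keep the record of why a hazard was considered resolved.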

    One unique feature of System Safety, as conceived by its founders, is that preventing accidents and losses requires extending the traditional boundaries of engineering. In 1968, Jerome Lederer, then the director of the NASA Manned Flight Safety Program for Apollo, wrote:

        System safety covers the total spectrum of risk management. It goes beyond the hardware and associated procedures of system safety engineering. It involves: attitudes and motivation of designers and production people, employee/management rapport, the relation of industrial associations among themselves and with government, human factors in supervision and quality control, documentation on the interfaces of industrial and public safety with design and operations, the interest and attitudes of top management, the effects of the legal system on accident investigations and exchange of information, the certification of critical workers, political considerations, resources, public sentiment and many other non-technical but vital influences on the attainment of an acceptable level of risk control. These non-technical aspects of system safety cannot be ignored.⁶

3.0 System Safety in the Context of Engineering Systems

    During the same decades that System Safety was emerging as an independent field of study, the field of Engineering Systems was emerging in a parallel process. In the case of Engineering Systems, its codification into a distinct field is not yet complete.⁷ Though the two have emerged independently on separate trajectories, there is now great value in placing System Safety in the larger context of Engineering Systems.

     6 Jerome Lederer, How far have we come? A look back at the leading edge of system safety eighteen years ago. Hazard Prevention, May/June 1986, pp. 8-10.

     7 See ESD Internal Symposium: Symposium Committee Overview Paper (2002) ESD-WP-2003-01.20

    Engineering Systems brings together many long-standing and important domains of scholarship and practice.⁸ As a field, Engineering Systems bridges across traditional engineering and management disciplines in order to constructively address challenges in the architecture, implementation, operation, and sustainment of complex engineered systems.⁹ From an Engineering Systems perspective, the tools and methods for understanding and addressing systems properties become core conceptual building blocks. In addition to safety, which is the central focus of this paper, this includes attention to systems properties such as complexity, uncertainty, stability, sustainability, robustness and others as well as their relationships to one another.¹⁰ Scholars and practitioners come to the field of Engineering Systems with a broad range of analytic approaches, spanning operations management, systems dynamics, complexity science, and, of course, the domain known as systems engineering (which was pioneered in significant degree by the Air Force to enable project management during the development of the early ICBM systems, particularly Minuteman).

    A defining characteristic of the Engineering Systems perspective involves simultaneous consideration of social and technical systems, as well as new perspectives on what are typically seen as external, contextual systems. Classical engineering approaches might be focused on a reductionist approach to machines, methods and materials with people generally seen as additional component parts and contextual factors viewed as "given." By contrast, the focus here is not just on technical components, but also on their interactions and operation as a whole.

    When it comes to the social systems in a complex engineered system, the field of Engineering Systems calls for examination in relation to the technical aspects of these systems. This includes both a nuanced and comprehensive treatment of all aspects of social systems, including social structures and sub-systems, social interaction processes, and individual factors such as capability and motivation. Similarly, contextual elements, such as physical/natural systems, economic systems, political/regulatory systems, and other societal systems that are often treated as exogenous are instead treated as highly interdependent aspects of complex engineered systems.

    Thus, System Safety is both illustrative of the principles of Engineering Systems and appropriately considered an essential part of this larger, emerging field. In examining the issues of NASA’s organizational and safety culture in the context of the two space shuttle tragedies and other critical incidents, we will draw on the principles of System Safety and Engineering Systems. This will involve a more comprehensive look at the organizational and cultural factors highlighted in the two accident reports. In taking this more comprehensive approach, the challenge will be for the problems to still be tractable and for the results to be useful; indeed, more useful than other, simpler alternatives.

4.0 A Framework to Examine Social Systems

    In its August 2003 report on the most recent Space Shuttle tragedy, the Columbia Accident Investigation Board (CAIB) observed: "The foam debris hit was not the single cause of the Columbia accident, just as the failure of the joint seal that permitted O-ring erosion was not the single cause of Challenger. Both Columbia and Challenger were lost also because of the failure of NASA’s organizational system."¹¹ Indeed, perhaps the most important finding of the report was the insistence that NASA go beyond analysis of the immediate incident to address the "political, budgetary and policy decisions" that impacted the Space Shuttle Program’s "structure, culture, and safety system," which was, ultimately, responsible for flawed decision-making.¹²

     8 The roots of this field extend back to the work of early systems theorists such as von Bertalanffy (1968) and Forrester (1969), include important popular work on systems thinking (Senge, 1990) and extend forward through the use of advanced tools and methods, such as multi-dimensional optimization, modeling and simulation, system dynamics modeling, and others.

     9 For example, researchers at MIT’s Engineering Systems Division are examining the Mexico City transportation system, space satellite systems architectures, lean enterprise transformation systems, aluminum recycling systems, global supply chains, and much more.

     10 For example, Carlson and Doyle (2002) argue that fragility is a constant in engineered systems with attempts to increase one form of robustness invariably creating new fragilities as well.

    Concepts such as organizational structure, culture and systems are multi-dimensional, resting on vast literatures and domains of professional practice. To its credit, the report of the Columbia Accident Investigation Board called for a systematic and careful examination of these core, causal factors. It is in this spirit that we will take a close look at the full range of social systems relevant to effective safety systems, including:

    • Organizational Structure

    • Organizational Sub-Systems

    • Social Interaction Processes

    • Capability and Motivation

    • Culture, Vision and Strategy

    Each of the above categories encompasses many separate areas of scholarship and many distinct areas of professional practice. Our goal is to simultaneously be true to literature in each of these domains and the complexity associated with each, while, at the same time, tracing the links to system safety in ways that are clear, practical, and likely to have an impact. We will begin by defining these terms in the NASA context.

    First, consider the formal organizational structure. This includes formal ongoing safety groups such as the HQ System Safety Office and the Safety and Mission Assurance offices at the NASA centers, as well as formal ad hoc groups, such as the Columbia Accident Investigation Board (CAIB) and other accident investigation groups. It also includes the formal safety roles and responsibilities that reside within the roles of executives, managers, engineers, union leaders, and others. This formal structure has to be understood not as a static organizational chart, but a dynamic, constantly evolving set of formal relationships.

    Second, there are many organizational sub-systems with safety implications, including: communications systems, information systems, reward and reinforcement systems, selection and retention systems, learning and feedback systems, and complaint and conflict resolution systems. In the context of safety, we are interested in the formal and informal channels for communications, as well as the supporting information systems tracking lessons learned, problem reports, hazards, safety metrics, etc. and providing data relevant to root cause analysis. There are also key issues around the reward and reinforcement systems, both in the ways they support attention to system safety and in the ways that they do not create conflicting incentives, such as rewards for schedule performance that risk compromising safety. Selection and retention systems are relevant regarding the skill sets and mindsets that are emphasized in hiring, as well as the knowledge and skills that are lost through retirements and other forms of turnover. Learning and feedback systems are central to the development and sustainment of safety knowledge and capability, while complaint and conflict resolution systems provide an essential feedback loop (including support for periodic whistle-blower situations).

    Third, there are many relevant social interaction processes, including: leadership, negotiations, problem-solving, decision-making, teamwork, and partnership. Here the focus is on the leadership shown at every level on safety matters, as well as the negotiation dynamics that have implications for safety (including formal collective bargaining and supplier/contractor negotiations and the many informal negotiations that have implications for safety). Problem solving around safety incidents and near misses is a core interaction process, particularly with respect to probing that gets to root causes. Decision-making and partnership interactions represent the ways in which multiple stakeholders interact and take action.

     11 Columbia Accident Investigation Board report, August 2003, p. 195.

     12 Ibid.

    Fourth, there are many behavioral elements, including individual knowledge, skills and ability; various group dynamics; and many psychological factors including fear, satisfaction and commitment that impact safety. For example, with the outsourcing of certain work, retirements and other factors, we would be concerned about the implications for safety knowledge, skills and capabilities. Similarly, for contractors working with civilian and military employees, and with various overlays of differing seniority and other factors, complex group dynamics can be anticipated. As well, schedule and other pressures associated with shifting to the "faster, better, cheaper" approach have complex implications regarding motivation and commitment. Importantly, this does not suggest that changing from "faster, better, cheaper" to another mantra will "solve" such complex problems. That particular formulation emerged in response to a changing environmental context involving reduced public enthusiasm for space exploration, growing international competition, maturing of many technical designs, and numerous other factors that continue to be relevant.

    Finally, culture itself can be understood as multi-layered, including what Schein terms surface-level cultural artifacts, mid-level rules and procedures and deep, underlying cultural assumptions. In this respect, there is evidence of core assumptions in the NASA culture that treat safety in a piecemeal, rather than a systemic way. For example, the CAIB report notes that there is no one office or person responsible for developing an integrated risk assessment above the subsystem level that would provide a comprehensive picture of total program risks. In addition to culture, there are the related matters of vision and strategy. While visions are often articulated by leaders, there is great variance in the degree to which these are shared visions among all key stakeholder groups. Similarly, while strategies are articulated at many levels, the movement from intention to application is never simple. Consider a strategy such as lean enterprise transformation. All of the social systems elements must be combined together in service of this strategy, which can never happen all at once. In this respect, both the operational strategy and the change strategy are involved.

    Technical leaders should have at least a basic level of literacy in each of these domains in order to understand how they function together as social systems that are interdependent with technical systems. In presenting this analysis, we should note that we do not assume that NASA has just one organizational culture or that it can always be treated as a single organization. Where appropriate, we will note the various ways that patterns diverge across NASA, as well as the cases where there are overarching implications. While the focus is on the particular case of NASA, this paper can also serve as a more general primer on social systems in the context of complex, engineering systems.

    In providing a systematic review of social systems, we have organized the paper around the separate elements of these systems. This decompositional approach is necessary to present the many dimensions of social systems. In our examples and analysis, however, we will necessarily attend to the inter-woven nature of these elements. For this reason, we do not have a separate section on "Culture, Vision and Strategy" (the last item in the framework presented above). Instead these issues are woven throughout the other four sections that follow. For example, in the discussion of safety information systems, we also take into account issues of culture, leadership, and other aspects of social systems. A full presentation of the separate elements of social systems in the context of complex engineered systems is included in the appendix to this paper. Presentation of the many interdependencies among these elements is beyond the scope of the paper, but this chart in the appendix provides what we hope is a useful overview of the elements in this domain.

5.0 Organizational Structure

    The organizational structure includes the formal organizational chart, various operating structures (such as integrated product and process design teams), various formal and informal networks, institutional arrangements, and other elements. As organizational change experts have long known, structure drives behavior, so this is an appropriate place to begin.

    The CAIB report noted the Manned Space Flight program had confused lines of authority, responsibility, and accountability in a "manner that almost defies explanation." It concluded that the current organizational structure was a strong contributor to the negative safety culture, and that structural changes are necessary to reverse these factors. In particular, the CAIB report recommended that NASA establish an independent Technical Engineering Authority responsible for technical requirements and all waivers to them. Such a group would be responsible for bringing a disciplined, systematic approach to identifying, analyzing, and controlling hazards through the life cycle of the Shuttle system. While the goal of an independent authority is a good one, careful consideration is needed for how to accomplish the goal successfully.

    When determining the most appropriate placement for safety activities within the organizational structure, some basic principles should be kept in mind, including:

    (1) System Safety needs a direct link to decision makers and influence on decision making

    (2) System Safety needs to have independence from project management (but not engineering)

    (3) Direct communication channels are needed to most parts of the organization

    These structural principles serve to ensure that System Safety is in a position where it can obtain information directly from a wide variety of sources so that information is received in a timely manner and without filtering by groups with potential conflicting interests. The safety activities also must have focus and coordination. Although safety issues permeate every part of the development and operation of a complex system, a common methodology and approach will strengthen the individual disciplines. Communication is also important because safety-motivated changes in one subsystem may affect other subsystems and the system as a whole. Finally, it is important that System Safety efforts do not end up fragmented and uncoordinated. While one could argue that safety staff support should be integrated into one unit rather than scattered in several places, an equally valid argument could be made for the advantages of distribution. If the effort is distributed, however, a clear focus and coordinating body are needed. We believe that centralization of system safety in a quality assurance organization (matrixed to other parts of the organization) that is neither fully independent nor sufficiently influential has been a major factor in the decline of the safety culture at NASA.

    A skillful distribution of safety functions has the potential to provide a stronger foundation, but this cannot just be a reactive decentralization. The organizational restructuring activities required to transform the NASA safety culture will need to attend to each of the basic principles listed above: influence and prestige, independence, and oversight.
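    One way to make the three structural principles above concrete is to treat an organization chart as a small graph and check a proposed placement of the safety function against each principle. The sketch below is only an illustration under stated assumptions: the unit names, the `reports_to` and `channels` structures, and the reading of "most parts of the organization" as "more than half of the other units" are all hypothetical and are not taken from NASA's actual organization or from the CAIB report.

```python
from collections import deque

def chain_of_command(reports_to, unit):
    """Return every unit above `unit`, following reporting lines transitively."""
    seen, queue = set(), deque([unit])
    while queue:
        for boss in reports_to.get(queue.popleft(), ()):
            if boss not in seen:
                seen.add(boss)
                queue.append(boss)
    return seen

def check_placement(reports_to, channels, safety, decision_maker, project_mgmt, units):
    """Check a safety unit's placement against the three principles (as assumed here)."""
    others = [u for u in units if u != safety]
    reached = sum(1 for u in others if u in channels.get(safety, set()))
    return {
        # (1) direct link to decision makers
        "direct_link_to_decision_maker": decision_maker in reports_to.get(safety, ()),
        # (2) independence from project management
        "independent_of_project_mgmt": project_mgmt not in chain_of_command(reports_to, safety),
        # (3) direct channels to most other units ("most" read as more than half)
        "direct_channels_to_most_units": reached > len(others) / 2,
    }

# Hypothetical organization
units = ["safety", "engineering", "operations", "project_office", "contractors"]
channels = {"safety": {"engineering", "operations", "contractors"}}

independent = {
    "safety": ["administrator"],
    "project_office": ["administrator"],
    "engineering": ["project_office"],
    "operations": ["project_office"],
}
# Same organization, but with safety reporting through the project office.
captured = dict(independent, safety=["project_office"])

print(check_placement(independent, channels, "safety", "administrator", "project_office", units))
print(check_placement(captured, channels, "safety", "administrator", "project_office", units))
```

    A check of this kind covers only formal reporting lines; it says nothing about the informal influence and prestige questions taken up in the subsections that follow.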

5.1 Influence and Prestige of Safety Function: In designing a reorganization of safety at NASA, it is important to first recognize that there are many aspects of system safety and that putting them all into one organization, which is the current structure, is contributing to the dysfunctionalities and the negative aspects of the safety culture. As noted in the earlier Lederer quote about the NASA Manned Space Safety Program during Apollo, safety concerns span the life cycle and safety should be involved in just about every aspect of development and operations. The CAIB report noted that they had expected to see safety deeply engaged at every level of Shuttle management, but that was not the case. "Safety and mission assurance personnel have been eliminated, careers in safety have lost organizational prestige, and the Program now decides on its own how much safety and engineering oversight it needs."¹³ Losing prestige has created a vicious circle of lowered prestige leading to stigma, which limits influence and leads to further lowered prestige and influence. The CAIB report is not alone here. The SIAT report¹⁴ also sounded a warning about the quality of NASA’s Safety and Mission Assurance (S&MA) efforts.

    In fact, safety concerns are an integral part of most engineering activities. The NASA matrix structure assigns safety to an assurance organization (S&MA). One core aspect of any matrix structure is that it only functions effectively if the full tension associated with the matrix is maintained. Once one side of the matrix deteriorates to a "dotted line" relationship, it is no longer a matrix; it is just a set of shadow lines on a functionally driven hierarchy. This is exactly what has happened with respect to providing "safety services" to engineering and operations. Over time, this has created a misalignment of goals and inadequate application of safety in many areas.

    During the Cold War, when NASA and other parts of the aerospace industry operated under the mantra of "higher, faster, further," a matrix relationship between the safety functions, engineering, and line operations operated in service of the larger vision. The post-Cold War period, with the new mantra of "faster, better, cheaper," has created new stresses and strains on this formal matrix structure and requires a shift from the classical strict hierarchical, matrix organization to a more flexible and responsive networked structure with distributed safety responsibility.¹⁵

    Putting all of the safety engineering activities into the quality assurance organization with a weak matrix structure that provides safety expertise to the projects has set up the expectation that system safety is an after-the-fact or auditing activity only. In fact, the most important aspects of system safety involve core engineering activities such as building safety into the basic design and proactively eliminating or mitigating hazards. By treating safety as an assurance activity only, safety concerns are guaranteed to come too late in the process to have an impact on the critical design decisions. This gets at a core operating principle that guides System Safety, which is an emphasis on "prevention" rather than on auditing and inspection.

    Beyond associating safety only with assurance, placing it in an assurance group has had a negative impact on its stature and thus influence. Assurance groups in NASA do not have the prestige necessary to have the influence on decision making that safety requires. This can be seen in both accidents: in the case of Challenger, the safety engineers were silent and not invited to be part of the critical decision-making groups and meetings, and in the case of Columbia, they were a silent and non-influential part of the equivalent meetings and decision making.

    5.2 Independence of Safety Function: Ironically, organizational changes made after the Challenger accident in order to increase independence of safety activities have had the opposite result. The project manager now decides how much safety is to be “purchased” from this separate function. Therefore, as noted in the CAIB report, the very livelihoods of the safety experts hired to oversee the project management depend on satisfying this “customer.” Boards and panels that were originally set up as independent safety reviews and alternative reporting channels between levels have, over time, been effectively taken over by the Project Office.

     13 CAIB, p. 181.
     14 Henry McDonald (Chair), Shuttle Independent Assessment Team (SIAT) Report, NASA, February 2000.
     15 Earll Murman, Tom Allen, Kirkor Bozdogan, Joel Cutcher-Gershenfeld, Hugh McManus, Debbie Nightingale, Eric Rebentisch, Tom Shields, Fred Stahl, Myles Walton, Joyce Warmkessel, Stanley Weiss, and Sheila Widnall. Lean Enterprise Value: Insights from MIT’s Lean Aerospace Initiative, New York: Palgrave/Macmillan (2002).


    As an example, the Shuttle SSRP (originally called the Senior Safety Review Board and now known as the System Safety Review Panel) was established in 1981 to review the status of hazard resolutions, review technical data associated with new hazards, and review the technical rationale for hazard closures. The office of responsibility was SR&QA (Safety, Reliability, and Quality Assurance) and the membership (and chair) were from the safety organizations.

    In time, the Space Shuttle Program asked to have some people support this effort on an advisory basis. This evolved to having program people serve on the function. Eventually, program people began to take leadership roles. By 2000, the office of responsibility had completely shifted from SR&QA to the Space Shuttle Program. The membership included representatives from all the program elements, who outnumbered the safety engineers; the chair had changed from the JSC Safety Manager to a member of the Shuttle Program office (violating a NASA-wide requirement for chairs of such boards); and limits were placed on the purview of the panel. In effect, what had been created originally as an independent safety review lost its independence and became simply an additional program review panel, with added limitations on the things it could review (for example, the reviews were limited to out-of-family issues, thus effectively omitting those, like the foam, that were labeled as in-family).

    One important insight from the European systems engineering community is that this type of migration of an organization toward states of heightened risk is a very common precursor to major accidents.16 Small decisions are made that do not appear by themselves to be unsafe, but together they set the stage for the loss. The challenge is to develop the early warning systems (the proverbial canary in the coal mine) that will signal this sort of incremental drift.

    The CAIB report recommends the establishment of an Independent Technical Authority, but there needs to be more than one type and level of independent authority in an organization. For example, there should be an independent technical authority within the program but independent from the Program Manager and his/her concerns with budget and schedule. There also needs to be an independent technical authority outside the programs to provide organization-wide oversight and maintain standards.

    Independent technical authorities within NASA programs existed in the past, but their independence over time was usurped by the project managers. For example, consider MSFC (Marshall Space Flight Center).17 During the 1960’s moon rocket development, MSFC had a vast and powerful in-house research, design, development, and manufacturing capability. All relevant decisions were made by Engineering, including detailed contractor oversight and contractor decision acceptance. Because money was not a problem, the project manager was more or less a budget administrator and Engineering was the technical authority.

    During the 1970’s Shuttle development, MSFC Engineering was still very involved in the projects and had strong and sizable engineering capability. Quality and safety were part of Engineering. The project manager delegated technical decision making to Engineering, but retained final decision authority, especially for decisions affecting budget and schedule. Normally the project managers did not override a major engineering decision without consultation and the concurrence of the Center Director. However, some technical decisions were made by the projects because of schedule pressure and fiscal constraints, sometimes at the expense of increased risk and sometimes over the objections of Engineering.

     16 Jens Rasmussen, Risk Management in a Dynamic Society: A Modeling Problem. Safety Science, 27, 1997, pp. 183-213.
     17 The information included here was obtained from Dr. Otto Goetz, who worked at MSFC for this entire period and served as the SSME Chief Engineer during part of it.


    In the period of initial return to flight after Challenger, the SSME chief engineer reported to the Director of Engineering with a dotted line to the project manager. While the chief engineer was delegated full technical authority by the project manager, the project manager was legally the final approval authority and any disagreement was brought before upper management or the Center Director for resolution. The policy was that all civil service engineering disciplines had to concur in a decision.

    Following the post-Challenger return to flight period, the chief engineer was co-located with the project manager’s office and also reported to the project manager. Some independence of the chief engineer was lost in the shift, and some technical functions the chief engineer had previously exercised were delegated to the contractors. More responsibility and final authority were shifted away from civil service and to the contractor, effectively reducing many of the safeguards on erroneous decision making. We should note that such shifts took place in the context of a larger push for the re-engineering of government operations, in which ostensible efficiency gains were achieved through the increased use of outside contractors. The logic driving this push for efficiency did not include sufficient checks and balances to ensure the role of System Safety in such shifts.

    Independent technical authority and review is also needed outside the projects and programs. For example, authority for tailoring or relaxing of safety standards should not rest with the project manager or even the program. The amount and type of safety applied on a program should be a decision that is also made outside of the project. In addition, there needs to be an external safety review process. The Navy, for example, achieves this review partly through a project-independent board called the Weapons System Explosives Safety Review Board (WSESRB) and an affiliated Software Systems Safety Technical Review Board (SSSTRB). WSESRB and SSSTRB assure the incorporation of explosives safety criteria in all weapon systems by reviews conducted throughout all the system’s life cycle phases. Similarly, a Navy Safety Study Group is responsible for the study and evaluation of all Navy nuclear weapon systems. An important feature of these groups is that they are separate from the programs and thus allow an independent evaluation and certification of safety.

    5.3 Safety Oversight: As contracting of Shuttle engineering has increased, safety oversight by NASA civil servants has diminished and basic system safety activities have been delegated to contractors. The CAIB report noted:

        Aiming to align its inspection regime with the ISO 9000/9001 protocol, commonly used in industrial environments (environments very different than the Shuttle Program), the Human Space Flight Program shifted from a comprehensive ‘oversight’ inspection process to a more limited ‘insight’ process, cutting mandatory inspection points by more than half and leaving even fewer workers to make ‘second’ or ‘third’ Shuttle system checks.18

    According to the CAIB report, the operating assumption that NASA could turn over increased responsibility for Shuttle safety and reduce its direct involvement was based on the mischaracterization in the 1995 Kraft report that the Shuttle was a mature and reliable system.19 The heightened awareness that characterizes programs still in development (continued “test as you fly”) was replaced with a view that less oversight was necessary: that oversight could be reduced without reducing safety. In fact, increased reliance on contracting necessitates more effective communication and more extensive safety oversight processes, not less.

     18 CAIB, ibid, p. 181.
     19 Christopher Kraft, Report of the Space Shuttle Management Independent Review Team, February 1995. Available online at http://www.fas.org/spp/kraft.htm
