
FYxx Plan for ILCAccelerator ControlsILCTA Support

By Virginia Scott,2014-06-17 17:47

FY10 Plan for Grid / Grid Services

    Prepared by: Gabriele Garzoglio, Philippe Canal, Burt Holzman,
    Andrew Baranovski, Parag Mhashilkar, Eileen Berman
    Date: Aug 19, 2009
    Relevant Strategic Plans: Strategic Plan for Grids, Strategic Plan for Scientific Facilities,
    Computing Division Strategic Plan (2010-2012)

Grid Services Goals

    o Provide leadership in the area of middleware development for Fermilab and the

    Open Science Grid (OSG).

    o Provide a middleware infrastructure for Fermilab and the OSG, with focus on

    interoperations with major peer grids, such as Enabling Grids for E-sciencE

    (EGEE), TeraGrid, etc., supporting the needs of Fermilab’s scientific community.

Grid Services Strategy

    o To enhance and expand the body of grid software, business methods, and

    deployment community that is broadly accepted by the FNAL site and FNAL

    based virtual organizations.


Tactical Objectives for FY10

    1. Maintain the infrastructure for VO membership registration, focusing on the

    convergence of VOMRS and VOMS-admin. Investigate new mechanisms for VO and

    site policy definition, publication, and enforcement.

    2. Work closely with stakeholders to identify and appropriately prioritize their

    maintenance needs from the Gratia software stack (text and graphical reports, probes

    and collectors).

    3. Provide expertise and code updates as needed by the groups operating the production

    instance of Gratia Collectors within OSG and the Fermilab Computing Division.

    4. Ensure that any (potential) Gratia extensions provided by external projects are well
    integrated into the existing code base and into the test, release, and support
    mechanisms. Improve the quality assurance process for the software.

    5. Develop and deploy a metrics analysis and correlation service to prepare dynamic

    reports on the scientific use cases of Grid Services. Focus on US CMS, RunII, and the

    future neutrino experiment.

    6. In the context of the CEDPS program, provide tools and services for on-demand

    collection of diagnostic information generated by storage software on OSG, including

    dCache and Hadoop. Interface this software to general purpose troubleshooting

    middleware, including netlogger.

    7. Finish development activities for the Resource Selection Service (ReSS) project.

    Move all software to maintenance. Close the project. Provide second-level support to

    the FermiGrid operations of the service for OSG and CD.

    8. Provide maintenance and support for the Glidein Workload Management System

    (glideinWMS) for CMS, FermiGrid, CDF, OSG, and other stakeholders. Enhance and

    further develop glideinWMS based on stakeholder input.

    9. In the context of GlideIn WMS, package gLexec authorization software in

    collaboration with gLite developers along with OSG-specific components. Provide

    maintenance and support to OSG.

    10. CMS Info System: IS IT HERE?

    11. Perform security-focused reviews of several software projects.

    12. Develop and maintain the SAZ service to enable user/VO/role/CA banning on campus

    grid facilities, in particular on FermiGrid, and to provide support to customers of the

    SAZ software.

FY09 Accomplishments:

    1. Improve usability and operability of the Virtual Organization (VO) Services

    infrastructure.

    The VO Services project has followed its program of improvements to the
    authorization and registration infrastructures, in particular on GUMS and gLexec. For
    details, see project closing report: docdb 3249.

    2. Deploy and support the VO Services infrastructure for the stakeholders on OSG.

    Focus on reducing maintenance and on fostering interoperability of the authorization

    systems.

    The Authorization Interoperability project was successfully completed. This allows

    software developed in the US to be seamlessly deployed in the EU and vice versa.

    Maintenance is reduced by providing a common code base for authorization call-out

    modules between OSG, EGEE, and Globus. See details at the Authorization

    Interoperability project closing report: docdb 3238.

    3. Integrate emerging standards and increasingly complex use cases in the VO Service

    infrastructure. These include new mechanisms for identity management, support for

    finer-grain storage privileges, VO and site policy definition, publication, and

    enforcement.

    The VO Services project has supported storage groups in defining the next
    generation storage authorization models through the authorization interoperability
    project; it has fostered the convergence of VOMS-admin 2.5 with VOMRS; it has
    investigated mechanisms to define and enforce VO and Site Authorization Policies as
    part of an SBIR Phase II grant; and it has evaluated VOMS-signed attribute validation
    mechanisms for OSG. For details, see project closing report: docdb 3249.

    4. “Provide maintenance and support for the Resource Selection Service (ReSS)

    Workload Management System (WMS) for OSG and FermiGrid VO’s. Focus on the

    operational qualities of the infrastructure.”

    Implemented and deployed support for MPI jobs. Improved support for advertising

    Storage Elements. Released test suite to identify common deployment/configuration

    issues in Cemon for ReSS. Verified compliance of ReSS with OSG 1.2. Deployed

    ReSS services in FermiGrid in High Availability Mode and, for this mode,

    implemented classad monitoring.

    5. Develop new accounting reports and enhance existing ones for the Gratia system.

    Work closely with stakeholders to identify and appropriately prioritize their needs.

    Met weekly with stakeholders to ensure proper prioritization of new
    features and reports, resulting in the delivery of the expected features within the
    agreed-upon timeline.

    6. Provide support for the production instance of the Gratia accounting system for OSG
    and Computing Division (CD).

    Provided patch releases and expertise as needed to ensure smooth operation of the
    OSG and Computing Division (CD) instances of Gratia.

    7. “Develop and deploy a science-dashboard infrastructure to display customized

    metrics of running Grid services. Focus on the use cases of storage for US CMS.”

    The MCAS project has developed and deployed a prototype service to prepare and
    display metrics reports for US CMS storage, DZero Monte Carlo, and DZero
    production.

    8. “Transition Glide-in Workload Management System to maintenance and operation

    mode. Focus on deployment, maintenance, and support of the infrastructure for

    CMS.”

    Stable versions 1.6 and 2.0 were released and deployed. CMS glideinWMS

    installations transitioned to maintenance and operations, executing over 10,000 jobs

    concurrently.

    9. “Continue to improve interoperability of EGEE and OSG information systems and

    move to maintenance and operations mode. Begin investigation of interoperation

    between other peer grids and campus grid infrastructures.”

    EGEE-OSG interoperability activity transitioned to maintenance mode. Initial

    proposals circulated on end-to-end information system work allowing interoperation

    between peer and campus grid infrastructures.

    10. “Plan and coordinate Fermilab OSE working group”

    Meetings of the OSE working group transitioned to an 'as needed' basis, in response

    to the completion of many of the docket items. In the past year the OSE working

    group met several times and discussed the following items:
    o the grid incident response procedure, which was agreed to by the CSExec
    o reviewed the OSG trust documentation to ensure its alignment with Fermilab
      policies
    o finalized the OSE baseline and completed the process to get it accepted by the
      CSExec
    o began discussing D0 compliance with the OSE baseline
    o began development of the Fermilab VO trust policies and procedures

    The group's web pages are kept up-to-date with minutes and docket items.

    11. “Implement a software security review process.”

    The process “A code inspection process for security reviews” (cd-docdb 3021) was

    developed. The process was presented as a poster at CHEP09. A paper on the process
    was written and is to be published in the Journal of Physics: Conference Series.

    12. “Perform security-focused reviews of several software projects.”

    We used the code inspection process for the security review of the Site AuthoriZation

    Service.

    13. “CEDPS: In the context of the dCache/SRM and CEDPS troubleshooting projects,

    interface existing or collaboratively develop implementations for collection of storage

    service events and supplemental logging information for general purpose

    troubleshooting and operations control middleware”

    As a first step in building a coherent end-to-end event reporting infrastructure, we have
    designed and implemented a common session id protocol between the SRM client, the SRM
    server, and dCache. These common ids are now used to trace user activity from
    the front-end service (user job) to any back-end services (dCache pool, mover, name
    space database). This helps to quickly identify the context of a problem report. In
    the context of troubleshooting, CEDPS effort was spent on designing the adaptation of
    the MCAS infrastructure to the present and future use cases of storage, with a focus on
    the US CMS T1 use cases.
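
    The session-id tracing described above can be illustrated with a minimal sketch.
    It assumes that every component tags its log records with the shared session id;
    the record layout, the component names, and the "session_id" field are hypothetical
    and do not reflect the actual SRM or dCache log formats.

```python
from collections import defaultdict

# Hypothetical log records. The component names and the "session_id"
# field are illustrative assumptions, not the real SRM/dCache log format.
records = [
    {"component": "srm-client", "session_id": "s-001", "event": "put request sent"},
    {"component": "srm-server", "session_id": "s-001", "event": "request accepted"},
    {"component": "dcache-pool", "session_id": "s-002", "event": "mover started"},
    {"component": "dcache-pool", "session_id": "s-001", "event": "mover failed"},
]


def trace_by_session(records):
    """Group records by session id, keeping arrival order within each group."""
    traces = defaultdict(list)
    for rec in records:
        traces[rec["session_id"]].append((rec["component"], rec["event"]))
    return dict(traces)


traces = trace_by_session(records)

# Follow one user request from the front end to the back-end services.
for component, event in traces["s-001"]:
    print(f"{component}: {event}")
```

    With all records for one request grouped under a single id, a problem report that
    quotes the id can be matched to the back-end events without searching each
    component's log separately.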

Not Accomplished in FY09:

    1. “CEDPS: Lead in establishing requirements of a Data Placement service and the

    characteristics of underlying storage and movement services, in order to provide a

    general dynamic storage service for advertising storage resource and accessibility.”

    2. “CEDPS: Research and prototype quality of service negotiation tools to mitigate

    vulnerabilities of storage systems to overload and resource exhaustion.”

    The primary focus of this activity was to provide a formal description of dCache
    managed resources in order to enable their optimal use by “future” storage-computing
    workflow planners. After our investigation, we found that the current
    state-of-the-art Grid has not reached the point where optimal data and computing
    co-placement can make an impact on the quality of the user experience on OSG. This
    has led to a shift of focus in the effort to activities with a more immediate payoff,
    such as troubleshooting and metric analysis.

    3. “Integrate emerging standards and increasingly complex use cases in the VO Service

    infrastructure. These include new mechanisms for identity management, support for

    finer-grain storage privileges, VO and site policy definition, publication, and

    enforcement.”

    Authorization validation was one of the activities from last year. While the
    development was ready to be deployed, the alerting infrastructure (RSV) could not

    support the error reporting use cases required by the validation probe. Responsibility

    for the end-to-end deployment of this functionality has been given to the Software

    Tools Group at the closure of the VO Services project.

    4. “Integrate software security best practices and procedures into the software

    development life cycle.”

    The security-related best practices appropriate for the Grid Services development

    environment were analyzed. The integration of the practices was discussed with the

    Office of Project Management. In order to encompass multiple domains, the current

    project management processes are high-level and not limited to software development.

    We are planning to evaluate the integration of best practices only for the software

    domain in FY10.

    5. “Provide maintenance and support for the Resource Selection Service (ReSS)

    Workload Management System (WMS) for OSG and FermiGrid VO’s. Focus on the

    operational qualities of the infrastructure.”

    Due to increased responsibilities in the GlideIn WMS area, the following two ReSS
    work items were deferred to FY10: (1) finalize ReSS compliance with the FermiGrid
    Software Acceptance process; (2) improve security for resource registration.

Activities and Work Definition

ADD % OF EFFORT CHARGED TO DIFFERENT FUNDS

Grid / Grid Services / Authorization / Maintenance and Consultation
    o Activity type: Service
    o Description: Maintenance and consultation for Authorization and VO Registration
    o Timescale: Continuous through FY10
    o Metrics: Number of VOMRS releases in response to bug reports. Number of VOs
      and sites interested in evaluating SVOPME for policy publication and verification.

Grid / Grid Services / Authorization / SAZ / Development
    o Activity type: Project
    o Description: Development of the Site AuthoriZation Service
    o Timescale: Continuous through Spring FY10
    o Milestones: User input validation, improved DB connection management, rewriting
      of calls to shell commands using APIs, and code simplification (Oct 09). Address
      resource exhaustion for sockets and threads (Nov 09). Integrate XACML call-out
      protocol (Dec 09). Address OSG user-banning requirements (Spring 10).
    o Metrics: -------

Grid / Grid Services / Authorization / SAZ / Management
    o Activity type: Project
    o Description: Coordination of the Site AuthoriZation Service project
    o Timescale: Continuous through FY10
    o Milestones: Quarterly internal meetings
    o Metrics: -------

Grid / Grid Services / Authorization / SAZ / Support
    o Activity type: Service
    o Description: Support for the Site AuthoriZation Service
    o Timescale: Continuous through FY10
    o Milestones: -------
    o Metrics: Number of users banned through SAZ. Number of recommended
      security improvements implemented.

Grid / Grid Services / WMS / ReSS / Development and Maintenance
    o Activity type: Project
    o Description: Software development and maintenance of the ReSS WMS system
    o Timescale: Project closure foreseen in Dec 2009
    o Milestones: Full compliance with FermiGrid Software Acceptance process (Oct
      09); improved resource registration functions (Nov 09); project closure (Dec 09).
    o Metrics: -------

Grid / Grid Services / WMS / ReSS / Management and Outreach
    o Activity type: Project
    o Description: Management of the ReSS WMS system and engagement of new
      communities
    o Timescale: Project closure foreseen in Dec 2009
    o Milestones: Quarterly stakeholders meeting
    o Metrics: -------

Grid / Grid Services / WMS / ReSS / Support and Deployment
    o Activity type: Service
    o Description: Support of the ReSS WMS system and assistance with deployment
      activities
    o Timescale: Continuous through FY10
    o Milestones: -------
    o Metrics: Number of support tickets for ReSS.

Grid / Grid Services / WMS / GlideIn WMS / Development
    o Activity type: Project
    o Description: Development of the glideinWMS software
    o Timescale: Continuous through FY10
    o Milestones: Project release encompassing stakeholder requirements
    o Metrics: -------

Grid / Grid Services / WMS / GlideIn WMS / Maintenance and Support
    o Activity type: Service
    o Description: Maintenance of code and user support
    o Timescale: Continuous through FY10
    o Milestones: -------
    o Metrics: Job throughput, efficiency, and uptime of deployments.

Grid / Grid Services / WMS / GlideIn WMS / Management and Outreach
    o Activity type: Project
    o Description: Project management and outreach to new potential stakeholders
    o Timescale: Continuous through FY10
    o Milestones: Quarterly stakeholders meeting
    o Metrics: -------

Grid / Grid Services / WMS / GlideIn WMS / Corral
    o Activity type: Project
    o Description: Integration of the GlideIn WMS system with Corral as per the funded
      NSF grant.
    o Timescale: Continuous through FY10
    o Milestones: Quarterly stakeholders meeting
    o Metrics: -------

Grid / Grid Services / Information System
    o Activity type: Service
    o Description: Grid information system (GIP/BDII)
    o Timescale: Continuous through FY10
    o Milestones: -------
    o Metrics: Reduction in the number of tickets on the Information System

Grid / Grid Services / Accounting / Maintenance
    o Activity type: Service
    o Description: Software maintenance and improved quality assurance processes of
      the Gratia accounting system, including US CMS use cases and an integration of
      CD accounting use cases.
    o Timescale: Continuous through FY10
    o Milestones: -------
    o Metrics: Number of issues about non-working or inaccurate reports/data.
      Turnaround time to resolve these issues.

Grid / Grid Services / Accounting / Management
    o Activity type: Project
    o Description: Coordination of the activities for the Gratia accounting system
    o Timescale: Continuous through FY10
    o Milestones: Quarterly stakeholders’ meetings. Regular weekly stakeholders’ meetings.
    o Metrics: -------

Grid / Grid Services / Metrics Management / MCAS / Development
    o Activity type: Project
    o Description: Development activities for the Metrics Correlation and Analysis
      Service
    o Timescale: Continuous through FY10
    o Milestones: Minos portal (Sep 09); Data source administration portal (Nov 09);
      Warehouse to production (Nov 09); Data source workflow engine/portal (Dec 09);
      Warehouse operations support tools (Dec 09); User friendly data analysis/query
      front end (Lower priority; Mar 10); MCAS to Google warehouse adaptation
      (Lower priority; Apr 10); Additional operational tools (May 10); New rendering
      primitives (ongoing); Feature adjustment/support (ongoing).
    o Metrics: -------

Grid / Grid Services / Metrics Management / MCAS / Management and Outreach
    o Activity type: Project
    o Description: MCAS management and engagement of new stakeholders
    o Timescale: Continuous through FY10
    o Milestones: Monthly meetings with the CDF, DZero, CMS, and Minos stakeholders
      individually. Bi-monthly meetings with other groups monitoring the project to
      resolve overlap in functions.
    o Metrics: -------

Grid / Grid Services / SciDAC2 CEDPS / Storage
    o Activity type: Project
    o Description: Investigate, improve, and troubleshoot storage solutions
    o Timescale: Continuous through FY10
    o Milestones: Specification of the log forwarding service (Nov 09). Prototype of the
      service (Jan 10). Integration into OSG (Summer 10).
    o Metrics: -------

Grid / Grid Services / SciDAC2 CEDPS / MCAS
    o Activity type: Project
    o Description: Participate in the troubleshooting activities of CEDPS via relevant
      activities in the MCAS project
    o Timescale: Continuous through FY10
    o Milestones: Build a dashboard of custom metrics collected using dynamic log
      record forwarding (Jul 10)
    o Metrics: -------

Grid / Grid Services / Security
    o Activity type: Service
    o Description: Security reviews for the Grid Services software and increased group
      acumen.
    o Timescale: Continuous through FY10
    o Milestones: -------
    o Metrics: Number of security reviews performed.

Priorities: The activities in this tactical plan are independently managed projects, each
    with its own personnel and internal priorities. In general, activities that affect operations, such

    as support, have higher priority than maintenance activities. Generally, maintenance

    activities have higher priority than development activities. For SAZ, depending on other

    higher operational priorities, we expect that some development milestones might be

    delayed.

Staffing: We rely on the presence of a new hire for the development activities of the
    GlideIn WMS program. The same new hire is also necessary for supporting the storage
    troubleshooting (VDT) and investigation programs, which are covered by different tactical plans.

Change control:

    Any changes to the current plan need to be communicated to the CD management and

    principal stakeholders, including, depending on the activities, OSG, CMS, and FermiGrid.

Risk Assessment for Grid Services:

    1. Failure to make progress on the convergence of VOMRS and VOMS-admin results in

    the need to modernize VOMRS to allow low-cost maintenance; this is roughly

    estimated at a development effort of 0.3 FTE months for 6 months, and a continuous

    maintenance budget of 0.05 - 0.1 FTE months.

    2. Failure to understand stakeholders’ maintenance needs for the Gratia accounting
    system will limit the ability to introduce new metrics to measure the functional
    properties of the OSG as a system.

    3. Failure to appropriately support Gratia, the system that gathers accounting data and
    provides reports for OSG and Fermilab, would make it difficult for OSG and
    Fermilab to present proper information to their stakeholders and funding agencies.

    4. Failure to appropriately coordinate external effort to expand Gratia will probably
    result in a fork of the project. This and the lack of improved quality assurance
    processes will result in additional support effort by customers and stakeholders and
    possibly compatibility issues.

    5. The two key features of the MCAS project are

    a) providing quick and easy access to building custom user-level dynamic reports

    b) documenting the schema of user/system metrics

    Lack of common services to generate reports on metrics will encourage users to do

    uncoordinated development/integration to achieve objectives similar to the ones of

    MCAS. These activities will inevitably lead to duplication of effort and potential

    software vulnerabilities, due to the lack of common development standards. Failure to

    formalize and catalog metrics will hinder the ability to report the performance of

    computing services operations to stakeholders, including to CD management.

    6. For CEDPS, without a dashboard of custom metrics collected using dynamic log
    record forwarding, the project incurs the low risk of not being able to efficiently
    demonstrate the results of the CEDPS effort at the upcoming funding reviews.

    7. Failure to provide support to ReSS WMS may adversely affect FermiGrid and OSG
    VO operations, including operations of DZero, Engagement, and DES. Also,

    ReSS currently supports only the Glue Schema v1.3 for resource description; the

    budgeted level of effort assumes that stakeholders will not request compliance with

    the Glue Schema v2 in FY10.

    8. Failure to operate and support glideinWMS may have a major effect on the efficiency
    of ongoing data analyses for CMS, CDF, and the Minos experiment. Failure to
    respond to new stakeholder requests may delay deployment by FermiGrid and OSG,
    effectively lowering efficiency across those cyberinfrastructures.

    9. Failure to package new gLexec releases for OSG may prevent widespread use of
    pilot-based workload management systems (glideinWMS, ATLAS's PANDA) on
    OSG, lowering efficiency and possibly decreasing the pool of potentially available
    resources.

    10. Risk on CMS Info System: IS IT HERE?

    11. Many production-level software products perform security-related functions as part of
    their normal operations. Ensuring that these products conform to security practices
    decreases the possibility of security incidents.

    12. Failure to maintain and support the SAZ service will result in increased operational

    complexity in reacting to security incidents for the FermiGrid facility.
