• NSCC "Beta" Phase


    "Beta" - where certain policies of typically occurring in beta phase will be deployed on a test basis

  • What is NSCC "Beta" Test?

    The NSCC supercomputer is undergoing its final phases of tests and refinements before it is released for production use. In this "Beta" (Alpha 2) phase, certain policies that would normally take effect in the beta phase will be deployed on a test basis. Until outstanding issues are resolved and we are satisfied with the state of the deployment, this is NOT a full beta yet.


  • Warning

    Because "Beta" Test Users are challenged to stress/break the system, please note that all your accounts and storage provided in the "Beta" Test carry a minor risk of disruption and corruption from unexpected system issues.


    You will be required to sign an acceptable usage policy and a declaration, pursuant to the laws of Singapore, that you will not misuse the system, e.g. for the design of weapons of mass destruction or nuclear weapons, or for attacking other computer systems, etc.


    By signing up, you hereby undertake to adhere to all aspects of the testbed as indicated on this webpage.

  • Timeline

    Proposed "Beta" Test Period: From 27 June 2016 until further notice.


    We aim to recover all the lost data from the archival system failure in May, to resolve all outstanding system issues (including the GPU performance and the storage system issues), and to test out all beta policies before we roll into the full-blown Beta phase. 

  • Post Alpha Testbed Announcement

    The Alpha Testbed will officially end on 17 June (Fri) 12 noon.

    What is going to happen?

    Leading up to 17 June, we will be gradually closing the job queues in order for existing jobs to drain. The system will then be taken offline for a series of system tests and re-configuration before coming back online on 27 June (Mon), no later than 4pm.


    The good news is that we have decided to retain your existing files in your home directory. However, you are encouraged to take this opportunity to do housekeeping of your home directory.

    Town Hall Meetings

    We are organising a series of identical town hall meetings to explain the changes that we will be making, and to listen to your concerns and suggestions:

    1. 15 June (Wed) 10am-12pm - Fusionopolis, MPH 2, Level 1, Tower A, Innovis
    2. 15 June (Wed) 2-4pm - NTU, SPMS COMP LAB 1 (SPMS-MAS-03-02)
    3. 16 June (Thu) 10am-12pm - Biopolis, Creation Theatre, Level 4, Matrix
    4. 16 June (Thu) 2-4pm - NUS, CIT Auditorium, L2 Computer Centre

    Implementation Phases Explained

    Alpha - this is the "free for all" stage where we configure the system based on initial inputs gathered long ago and in consultation with our stakeholders. From the way the system is used, we can then use the statistics to fine-tune the system to serve you better. This is also the phase where we try to identify and rectify any previously undetected issues. As this is a very initial setup, the system is provided "as-is" without any guarantees. In addition, some services may not be available yet.

    "Beta" (Alpha 2) - certain policies of typically occurring in beta phase will be deployed on a test basis.
    Beta - by now there should be enough knowledge gleamed from the Alpha phase to fine-tune the system setup to an almost final configuration. This is the "burn-in" phase where we test the system for correctness and reliability before making all the necessary fine-tunes and officially going live.
    Production - the system is fully commissioned and ready for full-scale production job runs, along with the associated guarantees.

    Transition into Beta Phase

    As we transition into the Beta phase, we will be making a number of changes to the system based on the lessons learnt and the feedback received during the Alpha phase. The main changes which will affect you directly as end-users are listed below.

    Once again, these changes are intended to allow us to serve you better. Therefore please let us know as soon as possible if you have any concerns and we will try to address them in a timely manner.

    1. Home Quota
    We will be implementing a quota system for the individual home directories, but only in the full-blown Beta phase. After much deliberation we have decided that this quota will be set at 50GB. This amount will be more than sufficient for the casual user with low computational requirements (e.g. students learning MPI programming). Files on the home disk will be backed up daily for the first seven days, and then weekly for another 3 weeks, for a total retention window of 4 weeks. Users with larger requirements are invited to apply for a project quota (please see Section 2 below). Large/common databases are addressed in Section 4 below.

    Users who currently have more than 50GB of files, who are not applying for a project quota any time soon, and who are not making use of the common databases should download their files as soon as possible. Those who face difficulties downloading their files can write to help@nscc.sg for help.
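
    One possible way to check your current usage and to pull files down to your own machine is sketched below (myuser and the paths are placeholders to replace with your own):

        # On an NSCC login node: check how much space your home directory uses
        du -sh $HOME

        # On your own machine: copy a directory down from NSCC
        scp -r myuser@login.nscc.sg:~/big_dataset /path/to/local/backup/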

    In any case, please be assured that your existing files will not be removed during this transition period.

    2. Project Quota
    Users who are embarking on larger projects may apply for a larger quota per project (again, only in the full-blown Beta phase). These are time-based quotas linked to each project, with start and end dates. All such requests will be evaluated by NSCC's Resources Allocation Committee (RAC), which comprises representatives from the main NSCC stakeholders (i.e. A*STAR, NTU and NUS). We are in the midst of setting up the Project Quota Application web portal and will be sending out all the necessary information very soon.

    3. Scratch Disk
    The scratch disk will be directly accessible from the login nodes. Each user will be allocated a directory on the scratch disk indicated by:

    • a new environment variable $SCRATCH, and
    • a new softlink "scratch" in their home directories.

    A nominal quota of 10TB per user will be in place. A 30-day purge policy will be strictly enforced.
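
    As a sketch of how the scratch space might be used in a job script, assuming PBS-style directives (the resource request, application and file names are illustrative placeholders):

        #!/bin/bash
        #PBS -q normal
        #PBS -l select=1:ncpus=4            # illustrative resource request

        # Stage input into the per-user scratch area, run there, then copy
        # results home before the 30-day purge removes them.
        cd $SCRATCH
        cp $HOME/input.dat .
        $HOME/my_app input.dat > output.dat
        cp output.dat $HOME/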

    4. Large/Common Databases
    We will make separate arrangements for users who need to make use of large databases. If you have already installed such databases, we will be contacting you directly regarding these arrangements.

    5. Queue Names
    We will be streamlining the number of queues to just five, namely

    • large - for large memory jobs
    • normal - for most jobs
    • gpu - for jobs requiring GPUs
    • vis - for visualisation jobs
    • ime - for users who wish to experiment with the DDN IME burst buffer.
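
    For example, a job script would pick one of these queues with the -q directive. A minimal sketch, assuming PBS-style directives and an illustrative resource request (my_app is a placeholder):

        #!/bin/bash
        #PBS -q normal                      # one of: large, normal, gpu, vis, ime
        #PBS -l select=1:ncpus=4            # illustrative resource request
        cd $PBS_O_WORKDIR
        ./my_app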

    More details are available under Job Queue Implementation below.


    6. qstat Output
    Currently everybody is able to see all the entries in the queues. However, as NSCC engages the industry, there will be a number of users with sensitive jobs which should not be visible to all users. Therefore the default behaviour of qstat will be changed so that you can see only your own jobs.
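
    In practice, this means a plain qstat will behave like today's per-user query (myuser is a placeholder for your actual username):

        qstat              # after the change: lists only your own jobs
        qstat -u myuser    # the current way to filter to one user's jobs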

    7. System Drain Time
    In the event of an emergency shutdown, we will endeavour to drain the system for 24 hours before shutting down. This will be sufficient for running normal jobs to complete cleanly, but any long job may inadvertently be killed. Therefore, as far as possible, the long queues should be avoided unless absolutely necessary.

    8. Job Exclusive Nodes
    This feature will be disabled, for two reasons. The first is to ensure that resources are as fully available to all users as possible: a user who reserves nodes but does not make use of all the available cores deprives other users of the unused cores. The second is proper accounting: a user who reserves an entire node should be logged as using the entire node, and not just the number of cores specified. Users who need entire nodes for performance purposes should therefore request entire nodes and then use the required number of cores/sockets.
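
    For instance, assuming 24-core nodes (an assumption; check the actual node size on the system), a job that needs a whole node to itself but only 12 MPI ranks could request:

        # Request all 24 cores of one node (so accounting sees the whole
        # node), but place only 12 MPI ranks on it:
        #PBS -l select=1:ncpus=24:mpiprocs=12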

    9. Memory Limits
    All jobs in the normal and gpu queues will be limited to 4GB of memory per core. Users who need larger amounts of per-core memory can specify more cores or make use of the large memory queue.
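
    As a worked example under this policy: a job needing roughly 40GB of memory in the normal queue must request at least 40/4 = 10 cores, e.g.:

        #PBS -q normal
        #PBS -l select=1:ncpus=10:mem=40gb    # 10 cores x 4GB/core = 40GB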

    10. Job Limits
    As a policy, each user without any projects will be allowed to use up to 3,000 cores concurrently. Over-provisioning for performance or memory-limit reasons (Sections 8 and 9 above, respectively) will also count towards this limit. This is to ensure that all users have fair access to the system. Notwithstanding this, we encourage users to run a smaller number of large jobs; in other words, a single job which requires 3,000 cores will be accorded higher priority than 3,000 single-core jobs.
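
    For illustration, again assuming 24-core nodes (an assumption), a single job using the full 3,000-core allowance would span 125 nodes (125 x 24 = 3,000):

        #PBS -q normal
        #PBS -l select=125:ncpus=24:mpiprocs=24
        cd $PBS_O_WORKDIR
        mpirun ./my_mpi_app                   # my_mpi_app is a placeholder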

    11. Queue Limits
    By default, each user will be limited to 100 jobs in the queue. This is to ensure that the scheduler is not overloaded.

    12. Unsatisfiable Jobs
    Currently, jobs with unsatisfiable requirements (e.g. requests for 64 cores on one node, etc.) are left to queue forever. The system will check every job submitted and try to flag any such job, together with the reason/s, so that the user can make the appropriate changes to the job script immediately. While this method is not fool-proof, we hope to keep the number of unsatisfiable jobs in the queue down to the barest minimum.
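
    For example, assuming 24-core nodes (an assumption), the first request below can never be satisfied on a single node, while the second asks for the same 64 cores as four chunks that each fit on a node:

        #PBS -l select=1:ncpus=64    # unsatisfiable: no single node has 64 cores
        #PBS -l select=4:ncpus=16    # satisfiable: 64 cores in 4 chunks of 16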

    13. Login Nodes
    A*STAR, NTU and NUS users will be able to log in to the system via their own local login nodes, i.e. astar.nscc.sg, ntu.nscc.sg and nus.nscc.sg, respectively. These nodes reside physically in the respective organisations' campus networks and will therefore offer a higher bandwidth from within the campuses than the existing login.nscc.sg which most of you have been using.
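
    For example (myuser is a placeholder for your actual username):

        ssh myuser@astar.nscc.sg    # A*STAR users
        ssh myuser@ntu.nscc.sg      # NTU users
        ssh myuser@nus.nscc.sg      # NUS users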

  • Job Queue Implementation

    Moving into the "Beta" phase, we will revamp the job queue implementation.

    We will be streamlining the number of queues to just five, namely

    • large - for large memory jobs
    • normal - for most jobs
    • gpu - for jobs requiring GPUs
    • vis - for visualisation jobs
    • ime - for users who wish to experiment with the DDN IME burst buffer.

    Internally, these jobs will be further split into: