Web portal for a supercomputer management system SCMS
Today, access to supercomputer resources requires users to know not only the properties of the application software and its specific languages, but also purely technical details: the cluster OS, job submission, compilers, runtime environments, and so on.
The appearance of grid technologies has not simplified the situation. It has added one more level of complexity, which requires knowledge of grid command-line tools, new job submission syntax, inter-cluster compatibility of runtime environments, and so on.
At the same time, maintaining a computing cluster remains a difficult and time-consuming task. Our work attempts a comprehensive solution to both problems. We propose an integrated solution in the form of a web portal for a cluster management system, which lets users control their task flow without having to study the numerous details of a supercomputer's operating environment. A convenient service for cluster administration has also been developed.
Our project has been implemented and is used successfully in a number of leading supercomputer centers in Ukraine.
Keywords: supercomputer, management system, web portal.
Introduction
Experience with supercomputer usage in Ukrainian academic institutions shows that a number of problems hinder the wide adoption of high-performance computing in various areas of science and technology. Let us highlight some of them.
The majority of supercomputers run the Linux operating system. This is due to the high flexibility and reliability of Linux and its support for high-performance hardware and modern computer architectures and technologies. However, Linux is taught only to a limited extent in higher education, which leads to a low level of Linux knowledge among scientists.
The common user interface to a supercomputer is a console (command line). The console interface gives access to almost all operating system capabilities, but it is difficult to learn and, even at the entry level, assumes solid knowledge of a large number of commands. For a novice user, the console is a barrier to mastering the supercomputer.
To access a supercomputer from a Windows environment, the user needs to install additional software for console emulation and file transfer, such as PuTTY and WinSCP. Tools of this kind are built into any Unix OS, but they are completely unfamiliar to an ordinary Windows user.
The academic grid, which is gaining popularity, has a difficult low-level access interface. Development of tools to simplify it has only just started.
In the academic environment, scientists are forced to engage in cluster maintenance in addition to their scientific activity. Cluster maintenance normally involves daily routine operations that can and should be automated.
The absence of supercomputer self-diagnostics allows errors to persist for a long time, which heavily impacts the quality of the services provided.
Often a supercomputer has no means of collecting resource usage statistics or generating reports automatically. As a result, the organization's management has no overall picture of how resources are consumed.
Often there are no means of notification. The supercomputer cannot report errors, critical situations, overheating, or failures that can lead to serious consequences, up to damage to electronic components.
In this article we present a system that combines user and administrative interfaces. The purpose of the system is to solve the problems listed above. SCMS (SuperComputer Management System) consists of a web portal and middleware that interacts with the supercomputer system software.

Main SCMS features
The system has the following features:
- SCMS can be installed on virtually any supercomputer that uses the Slurm or Torque resource manager, or a similar one.
- SCMS has an intuitive GUI intended for untrained users, which requires minimal study of documentation.
- The working environment of SCMS is a web browser. It is a platform-independent solution, and a browser is installed on almost all desktop systems. SCMS works in all popular modern browsers: Internet Explorer 6.0+, Firefox 2.0+, Opera 9.0+, Google Chrome.
- SCMS makes wide use of modern Web 2.0 technologies, in particular Ajax. Data transmission is encrypted using OpenSSL, and each user has access only to his or her own files and tasks.
- The SCMS GUI has multilingual support. This allows scientists from many countries to work together on one supercomputer.
- System status information is updated periodically in the background while the user works on other tasks. Full information about the cluster status is a couple of mouse clicks away. Error alerts are displayed without a page reload.
- Urgent messages about critical errors are sent via e-mail or SMS. Examples include node overheating, cooling system failure, and storage failure.
- SCMS allows transparent operation in the grid, similar to operation on a local supercomputer.
Structure and capabilities of SCMS 4.0
The system core consists of middleware scripts for interaction with the cluster hardware, the resource manager, grid software, etc. The scripts perform all service requests from the user interface and from the monitoring and diagnostics tools.
The SCMS web portal incorporates company news, articles, documentation, the user interface, and management elements. The capabilities of the user and administrator interfaces are shown in the table below.
User operation in the grid is conducted in the same way as operation on a local cluster.
The cluster administrator can extend the existing functionality by adding new monitoring and diagnostics modules.
Portal content management
The portal management system allows the administrator to create new sections and to add articles, news, and documentation. It has a built-in visual text editor similar to MS Word.
Graphical user interface
The interface is intended to let users perform all possible operations through it alone. It should match user needs as closely as possible.
The majority of scientists work with off-the-shelf software packages. They need an easy and convenient environment for editing input files, launching parallel programs, and viewing task output online. Application programmers use the cluster as a tool for developing and testing parallel programs.
They also need a compilation environment that supports popular compilers and application-oriented libraries, and a source code editor with syntax highlighting.
The main operations performed by cluster users are:
- file operations;
- task submission;
- tracking task execution and viewing task results;
- communication between users and administrators;
- operation in the grid.
Let's consider user operations in more detail.
User files are located in personal directories protected from unauthorized access. The file management tool provides the usual file operations: creation, editing, removal, etc. The list of files can be sorted by name, size, or creation time.
The large volume of the cluster file system makes search tools necessary. The graphical interface provides search by regular expression and by file creation date.
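A search of this kind could be sketched as follows. This is only an illustration under stated assumptions, not the SCMS implementation: the function name `search_files` and its parameters are hypothetical, and the file modification time is used in place of the creation date, which many Linux file systems do not expose portably.

```python
import re
from pathlib import Path


def search_files(root, name_pattern, newer_than=None):
    """Return files under root whose name matches the regular expression.

    newer_than is an optional timestamp in seconds since the epoch;
    files older than it are skipped.
    """
    regex = re.compile(name_pattern)
    matches = []
    for path in Path(root).rglob("*"):
        if not path.is_file() or not regex.search(path.name):
            continue
        # Assumption: modification time stands in for the creation date.
        if newer_than is not None and path.stat().st_mtime < newer_than:
            continue
        matches.append(path)
    return sorted(matches)
```

For example, `search_files("/home/user", r"\.log$")` would list all log files in a (hypothetical) home directory.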
Data is exchanged between a workstation and the cluster by uploading and downloading files. Directories are uploaded and downloaded as archives. The text file editor supports syntax highlighting for popular programming languages.
Tasks are submitted to resource manager queues through a dedicated interface form, which lets the user set all necessary parameters of the computing task.
An intelligent compilation system is provided for source files. It detects the programming language and selects the corresponding compilation scenario. Scenarios for various compilers, such as Intel and GNU, are supported.
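The selection of a compilation scenario can be illustrated with a minimal sketch. It assumes that the language is detected from the file extension and that a scenario reduces to a single compiler command; the tables, the function name `compile_command`, and the `-O2` flag are illustrative choices, not details taken from SCMS.

```python
from pathlib import Path

# Hypothetical mapping from file extension to language.
LANGUAGE_BY_SUFFIX = {
    ".c": "c", ".cpp": "c++", ".cc": "c++", ".cxx": "c++",
    ".f": "fortran", ".f90": "fortran",
}

# Hypothetical scenario table: (language, toolchain) -> compiler executable.
COMPILERS = {
    ("c", "gnu"): "gcc", ("c++", "gnu"): "g++", ("fortran", "gnu"): "gfortran",
    ("c", "intel"): "icc", ("c++", "intel"): "icpc", ("fortran", "intel"): "ifort",
}


def compile_command(source, toolchain="gnu"):
    """Build the compiler command line for a source file."""
    language = LANGUAGE_BY_SUFFIX.get(Path(source).suffix.lower())
    if language is None:
        raise ValueError(f"unknown source type: {source}")
    compiler = COMPILERS[(language, toolchain)]
    output = str(Path(source).with_suffix(""))  # heat.c -> heat
    return [compiler, "-O2", "-o", output, source]
```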
The interface has a special operation mode for parallel software packages (Gamess, Gromacs, Abinit, etc.). In this mode some launch parameters are filled in automatically with package default values, which considerably simplifies working with these packages.
The main tool for controlling a parallel task is reviewing its runtime logs. A log reflects the program status and any errors that occurred. For this purpose the log viewer has a separate mode that tracks file changes, similar to tailf. Additional monitoring tools show the processor and memory load on the occupied nodes.
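The change-tracking mode can be approximated by polling the log file from the last known position. The sketch below is an assumption about how such a viewer might work, not the SCMS code; `read_new_lines` is a hypothetical helper that a web handler could call on each background request, passing back the offset returned by the previous call.

```python
def read_new_lines(path, offset=0):
    """Return (lines, new_offset) for complete lines appended after offset."""
    with open(path, "rb") as log:
        log.seek(offset)
        chunk = log.read()
    # Keep an unfinished last line for the next poll.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end
```

Working with byte offsets rather than line counts means each poll reads only the data appended since the previous one, which matters for large task logs.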
A built-in messaging service is provided for communication between users and administrators.
After adding a grid certificate to his or her account, the user can work with grid resources as with a local supercomputer. Operation with the remote file system is integrated into the file manager. Launching grid tasks is similar to launching tasks on a local supercomputer. After a task completes, its results are copied to the local supercomputer in the background.
Administrator interface
The administrator organizes the computing process on a supercomputer. Like ordinary users, administrators have login accounts, but with extended capabilities.
Switching to another user
This capability is needed to help users resolve their difficulties. Switching to a specific user makes it possible to reproduce errors and localize them in the environment where they arise.
Task queues administration
Task queues demand constant supervision. The interface makes it possible to review task queues and to cancel tasks in case of errors or for other reasons.
Cluster hardware status monitoring
Hardware status requires the administrator's constant attention. Early notification of failures is one of the primary goals of the system.
The monitoring subsystem interacts with the resource manager and receives information about node status. It has modules for checking the status of the hardware and software components of a supercomputer:
- nodes, by means of IPMI;
- hard disks and RAID arrays of servers, by means of the operating system;
- the Lustre file system;
- error counters of Ethernet and InfiniBand network switches;
- health status of UPS batteries;
- node temperatures;
- grid software operability.
Cluster resource management
Compute nodes are the main cluster resource. The administrator can quickly change the total number of available nodes. It is possible to switch nodes on and off, to lock nodes, and to block task submission to a specific queue or to all queues.
User accounts database management
The administrator has a full range of user account management capabilities, from accepting user registration requests and editing login accounts to removing users.
LDAP and passwd-file user databases are supported. Users are authenticated by means of LDAP or PAM functions.
Launching diagnostic tasks
Diagnostic tasks are a special class of tasks. They make it possible to obtain the performance characteristics of a cluster and to check the reliability of the system as a whole. These tasks can be launched both on a schedule and on demand. An intelligent analysis of the results, with detection of weak components, is provided.
Diagnostic tools check node performance, the InfiniBand network, and Lustre file system health. A special overheating protection unit shuts down nodes whose temperature considerably exceeds the critical limit.
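The decision rule of the overheating protection unit can be sketched as follows. The function name, the `margin` parameter that models "considerably exceeds", and the sample values are assumptions made for illustration; the actual power-off action is outside the scope of the sketch.

```python
def nodes_to_shut_down(temperatures, critical_limit, margin=5.0):
    """Return the names of nodes hotter than critical_limit + margin.

    temperatures maps a node name to its temperature in degrees Celsius.
    margin is a hypothetical tolerance above the critical limit.
    """
    threshold = critical_limit + margin
    return sorted(node for node, value in temperatures.items() if value > threshold)


# Example with made-up readings: only nodes above 85 degrees are selected.
readings = {"n01": 62.0, "n02": 91.5, "n03": 86.0}
overheated = nodes_to_shut_down(readings, critical_limit=80.0)
```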

Viewing system logs
Viewing cluster logs makes it possible to reveal unknown problems that the corresponding monitoring sections are not designed to detect.
The log viewer can filter by keywords, which simplifies the analysis of large volumes of text.
Analyzing resource usage statistics
The system collects information about completed tasks and reports from monitoring sensors. The statistics are available for review by the administrator. Resource usage statistics can be grouped by user and by organization, and can be exported to CSV or Excel format.
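Grouping and CSV export of this kind might look like the sketch below. The record fields (`user`, `organization`, `cpu_hours`) and both function names are hypothetical; the source does not describe how SCMS stores its statistics.

```python
import csv
import io
from collections import defaultdict


def usage_by(records, key):
    """Sum CPU hours of completed tasks, grouped by the given record field."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["cpu_hours"]
    return dict(totals)


def to_csv(totals, key):
    """Render grouped totals as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([key, "cpu_hours"])
    for name in sorted(totals):
        writer.writerow([name, totals[name]])
    return buffer.getvalue()
```

Calling `usage_by(records, "organization")` instead of `usage_by(records, "user")` gives the per-organization view from the same task records.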
Notification about dangerous situations
The system notifies administrators about noncritical errors by e-mail and about critical ones by SMS, for example node overheating or failure of the cooling system or hard disks.
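The routing rule can be written down in a few lines. This is a sketch under the assumption that each event carries a severity label; the labels, the event structure, and the function names are hypothetical, and the actual delivery of e-mail or SMS is not shown.

```python
def notification_channel(severity):
    """Choose the delivery channel for an event of the given severity."""
    if severity == "critical":
        return "sms"
    if severity == "noncritical":
        return "email"
    raise ValueError(f"unknown severity: {severity}")


def group_by_channel(events):
    """Group event messages by channel so that each channel is used once."""
    grouped = {"sms": [], "email": []}
    for event in events:
        grouped[notification_channel(event["severity"])].append(event["message"])
    return grouped
```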
Peculiarities of system implementation
Web portal of computing center
The graphical interface should be accessible from any user or administrator workstation. Therefore it should be cross-platform and should not require installation of additional software. For this reason we selected web services as the technological basis of our project. They are currently the most cross-platform option, and their capabilities are sufficient for the tasks highlighted above.
The portal style and functionality can be arbitrary; they are defined at the design stage for a particular supercomputer.
Structure of interface software
The interface has a modular structure and consists of two software layers: the web part, which is responsible for dialogue with the user, and the middleware, which interacts with the system core of the supercomputer.

Let us review the following modules in detail.
Authorization module
Authorization is supported by means of the PAM and LDAP subsystems. It is carried out directly, without any intermediate software. User database operations include user authorization, adding and removing users, and editing their accounts. The settings for interaction with LDAP are specified in the configuration file. The php-auth-pam module is used for operation with PAM.
Diagnostics tools
Diagnostic tests analyze key parameters of a supercomputer: hard disk status, network and InfiniBand health, node temperatures, and IPMI sensor data. Diagnostics and monitoring scripts are included in the middleware. They are executed with superuser privileges on the cluster gateway. The system scheduler cron runs the monitoring scripts. The schedule is set according to the administrator's needs.
Performing administrative operations
Some administrative operations are executed with superuser privileges, namely editing the user database and cancelling tasks. For such operations an auxiliary superuser is created on the supercomputer, which can execute the given commands through sudo.
Separation of Web server from supercomputer
High-performance file systems are often insufficiently reliable. Failures in Lustre FS or NFS-RDMA are frequent causes of supercomputer downtime, as they lead to fatal failures in the operation of components: servers, nodes, and in particular the web server.
To solve the problem of web server reliability we developed specialized client-server software. This software allows web interface scripts to initiate execution of commands on the cluster gateway on behalf of users and administrators, and also to exchange data. It works in the following modes:
- executing user commands and viewing results;
- reading data from a file;
- writing data received by the web server into a file on the cluster file system (reverse data transmission mode);
- executing a permitted list of administrator commands with superuser privileges.
Thus, even in case of a considerable failure, the web server keeps working, and users can receive information about the causes of the problem, repair time estimates, etc.
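The last mode, a permitted list of administrator commands, can be illustrated with a minimal allowlist check on the gateway side. The set of commands and the function name are assumptions for the example, and the real protocol between the web server and the gateway is not described in the source. The `-n` option makes sudo fail rather than prompt for a password.

```python
import shlex

# Hypothetical allowlist of executables that may run with superuser privileges.
ALLOWED_ADMIN_COMMANDS = {"scancel", "useradd", "userdel"}


def build_admin_command(command_line):
    """Return the sudo argument list for a permitted command, or refuse it."""
    args = shlex.split(command_line)
    if not args or args[0] not in ALLOWED_ADMIN_COMMANDS:
        raise PermissionError(f"command not permitted: {command_line!r}")
    # The list is meant to be executed without a shell, so characters
    # such as ';' or '|' in the arguments are not interpreted.
    return ["sudo", "-n", *args]
```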
Task launch by means of cluster resource manager
Tasks are launched through a dedicated form of the interface. As shown earlier in the figure, this form is selected from the task launch menu. The user sets the task parameters: the program file, command-line parameters, number of processors, time limit, and either an MPI environment or a custom task launch script.
Source files are additionally compiled on demand. In this case a source compilation scenario is created automatically, and after it completes successfully the newly compiled program is started.
Parameters of previously submitted tasks are stored in launch profiles, which make it easy to repeat the same action. The next time, it is sufficient to select a saved profile from a list, edit some fields if needed, and click the ''Launch'' button. Over time a kind of library of these profiles is formed, reflecting the specifics of the user's recurring actions.

For beginners, on the other hand, such a library can be prepared by the administrator, so that the user need not go into the details of task launching and can focus on the package language and task results.
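A launch profile can be thought of as a named set of form parameters. The sketch below stores profiles as JSON files; the storage format, the function names, and the parameter keys are assumptions, since the source does not say how SCMS keeps profiles.

```python
import json
from pathlib import Path


def save_profile(directory, name, params):
    """Store the launch parameters of a task under a profile name."""
    path = Path(directory) / f"{name}.json"
    path.write_text(json.dumps(params, indent=2, sort_keys=True))
    return path


def load_profile(directory, name, overrides=None):
    """Load a saved profile and apply the fields the user edited."""
    params = json.loads((Path(directory) / f"{name}.json").read_text())
    params.update(overrides or {})
    return params
```

The `overrides` argument corresponds to editing a few fields of a saved profile before launching again.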
The launch is performed through the corresponding GUI modules. The user sets the task parameters on the ''Task Launch'' page. All the data is transferred to the launch module, which performs the launch through the cluster resource manager. The type of resource manager is set in the configuration file. The task is placed in a queue and starts executing immediately after the requested resources are allocated.
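How a launch module might translate form parameters into a submission command for the configured resource manager is sketched below. The function and its parameters are hypothetical and cover only a few common options: `sbatch` with `--nodes`, `--ntasks-per-node`, and `--time` for Slurm, and `qsub` with `-l nodes=...:ppn=...` and `-l walltime=...` for Torque.

```python
def submit_command(manager, script, nodes, tasks_per_node, time_limit):
    """Build the submission command for the configured resource manager.

    manager is the value assumed to be read from the configuration file.
    time_limit uses the HH:MM:SS form accepted by both managers.
    """
    if manager == "slurm":
        return [
            "sbatch",
            f"--nodes={nodes}",
            f"--ntasks-per-node={tasks_per_node}",
            f"--time={time_limit}",
            script,
        ]
    if manager == "torque":
        return [
            "qsub",
            "-l", f"nodes={nodes}:ppn={tasks_per_node}",
            "-l", f"walltime={time_limit}",
            script,
        ]
    raise ValueError(f"unsupported resource manager: {manager}")
```

Keeping the manager-specific syntax in one function is one way to let the rest of the interface stay independent of the resource manager in use.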
Interaction with a user
Some components of the interface display information that changes frequently: resource modules, task queues, and the viewer of a running task's log file. The interface updates this information by means of AJAX technology, without the need to constantly reload the browser window.
This approach lets users work interactively with computing tasks on a cluster, which is very important for many research areas in physics and chemistry where algorithms are incompletely formalized.
Conclusion
In this article we described a web-based interface for a supercomputer management system. In our view, the described environment promotes wider use of multiprocessor computing systems, as it drastically simplifies their use by scientists and programmers.
Our project has been implemented and is used successfully on grid clusters at the V.M. Glushkov Institute of Cybernetics of NAS of Ukraine, Kiev, at the Institute for Low Temperature Physics and Engineering of NAS of Ukraine, Kharkov, at the Institute for Scintillation Materials of NAS of Ukraine, and also in a number of other academic institutions of the Ukrainian Academic Grid (UAG). The system is continuously improved thanks to close interaction with supercomputer users.
Bibliography
- http://en.wikipedia.org/wiki/Ajax_(programming)
- http://www.lustre.org
- http://ru.wikipedia.org/wiki/LDAP
- http://nfs-rdma.sourceforge.net
- http://icybcluster.org.ua
- http://cluster.ilt.kharkov.ua
- https://grid.isma.kharkov.ua
- http://lcg.bitp.kiev.ua


