2019 dedicated monitoring and alerting server Zenoss
2020 upcoming changes and updates
Tuesday's (1/21) power outage removed BLCR's kernel modules from the compute nodes' kernels. If you need to do checkpointing, the new tool is Distributed MultiThreaded Checkpointing (DMTCP). Details on how to use it can be found here: DMTCP. If you need help, let me know (the “tails” also have DMTCP installed for debugging).
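As a rough illustration of the DMTCP workflow, the sketch below builds the two command lines involved: starting a program under checkpoint control and restarting it from saved images. `dmtcp_launch` (with its `--interval` and `--ckptdir` options) and `dmtcp_restart` are DMTCP commands; the helper function names, the program `./my_app`, the checkpoint directory, the image file name, and the one-hour interval are all placeholders for illustration, not settings from our cluster.

```python
import shlex


def dmtcp_launch_cmd(program, interval_s=3600, ckpt_dir="ckpt"):
    """Build a dmtcp_launch command line (hypothetical helper).

    program    -- the application and its arguments, as a list of strings
    interval_s -- seconds between automatic checkpoints (placeholder value)
    ckpt_dir   -- directory for checkpoint images (placeholder value)
    """
    return [
        "dmtcp_launch",
        "--interval", str(interval_s),
        "--ckptdir", ckpt_dir,
        *program,
    ]


def dmtcp_restart_cmd(images):
    """Build a dmtcp_restart command line from checkpoint image paths."""
    return ["dmtcp_restart", *images]


if __name__ == "__main__":
    # Only prints the commands; nothing is executed here.
    launch = dmtcp_launch_cmd(["./my_app", "--input", "data.in"])
    restart = dmtcp_restart_cmd(["ckpt/ckpt_my_app_example.dmtcp"])
    print(shlex.join(launch))
    print(shlex.join(restart))
```

The commands are built as lists rather than strings so they can be passed to `subprocess.run` or pasted into a job script without quoting surprises.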
The HPCC has invested in a new solution for our Home Directories file server. The TrueNAS/ZFS solution selected is described here: Home Dir Server. We will implement it with very large user quotas. The storage is 190 TB usable with inline compression (475 TB effective usable if a 2.5x compression ratio is achieved, scalable to 1.2 PB raw). Other features include unlimited snapshots (point-in-time restores), read cache SSD, write cache SSD, self-healing (checksums on reads and writes and on a schedule), RAIDZ2 protection, and high availability (dual controllers). We will not be implementing de-duplication. We may add replication in the future. This will take a long time to deploy.
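The effective-capacity figure is simply usable capacity multiplied by the compression ratio. A minimal sketch of that arithmetic follows; only the 190 TB and 2.5x figures come from the announcement, and the other ratios are illustrative, since the ratio we actually achieve will depend on the data stored.

```python
def effective_capacity_tb(usable_tb, compression_ratio):
    """Effective capacity in TB for a given inline compression ratio."""
    return usable_tb * compression_ratio


USABLE_TB = 190  # usable capacity quoted in the announcement

if __name__ == "__main__":
    # 2.5x is the ratio quoted above; the lower ratios are illustrative only.
    for ratio in (1.0, 1.5, 2.0, 2.5):
        effective = effective_capacity_tb(USABLE_TB, ratio)
        print(f"{ratio:.1f}x compression: {effective:.0f} TB effective")
```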
The HPCC has also invested in more GPU and CPU compute capacity. At the time of this writing, 12 nodes are crossing Iowa from CA, headed our way: a total of 48 GPUs (model rtx2080s with 384 GB memory) and 24 CPUs (228 physical cores with 1,152 GB memory). Details of the selection process can be found here: Turing/Volta/Pascal
With the additional GPU nodes we are also launching and committing to the Nvidia GPU Cloud (NGC). We will deploy their cloud Docker containers, albeit on premise. Since I did not know much about this, an overview can be found here, with more details to be provided later: NGC Docker Containers
Nvidia GPU Cloud (browse the online Catalog)
Lots of work! Lots to learn!