ARC System Changes - January 2023

Notes and Guidance for January 2023 cluster changes

Changes to /home on Cascades and Dragonstooth

HOME decoupled

While /home has been “universal” across ARC clusters in recent history, the Cascades/Dragonstooth /home is being decoupled from the others starting January 17, 2023.

Prior to the start of the January maintenance, the /home filesystem was universal across the Tinkercliffs, Dragonstooth, Cascades, and Infer clusters. This is because the same network-attached storage system was mounted on /home on all the clusters.

During the maintenance outage, a larger, faster replacement system for this purpose was brought online to serve /home for Tinkercliffs and Infer. Data was synchronized between the old and new systems to make the transition transparent for continued use of those systems.

Because they are being decommissioned, Cascades and Dragonstooth remain on the previous /home system, where the old data is still intact, and are not connected to the new one.

As a result, any file actions (added files, removed files, changes to files) performed on Tinkercliffs/Infer /home will not be reflected on the Cascades/Dragonstooth /home directories. The converse is true as well: any file actions performed on Cascades/Dragonstooth will not be reflected on Tinkercliffs/Infer.

New policies for /fastscratch

Quota to be implemented on FASTSCRATCH

Starting in January 2023, quota limits on the usage of /fastscratch will be put in place.

All ARC systems down for maintenance

During a maintenance outage in January 2023, the Cascades and Dragonstooth clusters will be decommissioned. This means that jobs will no longer be accepted or started on the compute nodes.

/work and /groups on Cascades and Dragonstooth will also be decommissioned in the following weeks.

The login nodes will remain accessible for a limited time (tentatively, for about 3 weeks or until February 7th) to allow people the opportunity to retrieve data from those systems.

A new backend storage system will come online to host /home directories on all current mainline ARC systems (Tinkercliffs and Infer).

  • At the time of transition, all data from the previous system will be replicated to the new system. No user action is needed.

  • The previous storage system for HOME directories will still serve the Cascades and Dragonstooth login nodes while they remain online. Changes on these nodes will not be reflected in /home on Tinkercliffs/Infer or vice versa.

Rationale

We would prefer to keep the other clusters online until the new resource is available, but these older systems have rapidly become a liability because

  • their compute nodes are failing (25% loss at this point) and are no longer supported by manufacturers,

  • their storage has endured a startling number of component failures and replacements recently,

  • their provisioning/configuration management/administration systems are defunct, and

  • their software stacks are outdated (OS kernel, glibc, compilers, libraries, software deployment system).

To reduce the risk of catastrophic failure during operations and to align engineering time and effort toward new systems and services, these clusters are being taken offline.

A new CPU system in the works

As of December 2022, ARC is in the final phases of purchasing a new CPU system to replace these, but this new system is not likely to be available (due to acquisition, engineering, and testing timelines) before Summer 2023.

What is NOT directly affected?

The Tinkercliffs and Infer clusters and storage systems will resume normal operations in their current state after the end of the maintenance. The /projects, /fastscratch, and /home storage on those systems will remain in operation.

Actions you may need to take

The 3-6 weeks after the mid-January maintenance will be available for people to migrate any data they need to keep from the Cascades and Dragonstooth storage systems.

A copy of all the /groups directories was made to /projects when Tinkercliffs was launched in fall 2020, and ARC will not make another bulk copy like this. ARC personnel are available to consult with PIs and labs as needed to assist with archiving older data sets and merging those in active use. ARC's documentation includes a page with information about data transfers.

Clean up data in /groups and /work

The hardware hosting the data that is currently stored in /groups or /work on Cascades is due to be decommissioned. If you need to preserve any files from those locations, please consider the following steps.

Please audit data before moving it

Please avoid making bulk data transfers from /groups or /work until you have thoroughly reviewed the data. ARC systems are not intended for indefinite, permanent storage and keeping old, unused files greatly increases the cost of the filesystems and can cause performance degradation.

  • Check to see if your data is already in /projects on Tinkercliffs or in some other storage repository.

  • Delete any old, duplicate, or unneeded data and files.

  • Consolidate old results or data so that only the necessary elements are kept.

  • Package old results or data into larger, more manageable files using tar and/or zip utilities. An ideal file size for archival or transfer across networks is often between 1GB and 100GB. Data sets which are smaller than 1GB or larger than 100GB will often be more cumbersome to work with.
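As one way to approach the audit steps above, the following minimal Python sketch walks a directory tree and reports its total size, files that have not been modified recently, and groups of files with identical contents. The function names (file_digest, audit_tree), the one-year default cutoff, and the choice of SHA-256 are illustrative assumptions, not ARC-provided tools or policy. The sketch only reports; it does not delete anything.

```python
import hashlib
import os
import time
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def audit_tree(root, older_than_days=365):
    """Report total size, stale files, and duplicate groups under root.

    older_than_days is an assumed cutoff for "old" data; adjust as needed.
    """
    cutoff = time.time() - older_than_days * 86400
    total_bytes = 0
    stale = []
    by_size = defaultdict(list)

    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symbolic links and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            st = os.stat(path)
            total_bytes += st.st_size
            if st.st_mtime < cutoff:
                stale.append(path)
            by_size[st.st_size].append(path)

    # Files can only be identical if their sizes match, so hash only those.
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            by_hash[file_digest(path)].append(path)

    duplicates = sorted(sorted(group) for group in by_hash.values() if len(group) > 1)
    return {
        "total_bytes": total_bytes,
        "stale": sorted(stale),
        "duplicates": duplicates,
    }
```

Calling audit_tree on a directory you own returns a dictionary that can be reviewed before deciding what to delete, consolidate, or package. Hashing large trees reads every candidate file, so it may be slow on very large data sets.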

tar vs. zip

tar packages a directory tree into a single file and does not compress it by default, while zip utilities also compress the files they package. Test your data for compressibility before attempting to zip it. Many modern data formats do not compress well.
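A rough compressibility check can be sketched in Python using only the standard library. The sketch below compresses a sample from the start of a file to estimate how well it compresses, then packages a directory into a single tar archive with or without gzip compression. The function names, the 4 MiB sample size, and any threshold you apply to the ratio are illustrative assumptions; a sample from the start of a file may not represent the whole file.

```python
import os
import tarfile
import zlib


def sample_compression_ratio(path, sample_bytes=4 * 1024 * 1024):
    """Compress the first sample_bytes of a file.

    Returns compressed size divided by original size: values near 1.0
    (or above) suggest the data will not benefit from compression.
    """
    with open(path, "rb") as f:
        sample = f.read(sample_bytes)
    if not sample:
        return 1.0  # Empty file: nothing to gain from compression.
    return len(zlib.compress(sample, 6)) / len(sample)


def package_tree(src_dir, archive_path, compress):
    """Write src_dir into one tar archive, gzip-compressed only if requested."""
    mode = "w:gz" if compress else "w"
    # Store paths relative to the directory name, not the full source path.
    top_level = os.path.basename(os.path.normpath(src_dir))
    with tarfile.open(archive_path, mode) as tar:
        tar.add(src_dir, arcname=top_level)
    return archive_path
```

For example, if sample_compression_ratio returns a value close to 1.0 for typical files in a directory, calling package_tree with compress=False produces a plain tar archive and avoids spending time on compression that saves little space. Write the archive to a location outside the directory being packaged.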

Get Help

ARC personnel can assist with assessing and performing these steps. The best way to request such help is via a 4Help ticket or by attending ARC office hours.