Migrating data off VT Archive
VT Archive has reached end-of-life. The hardware is out of warranty and is becoming increasingly unstable. The useful data on it needs to be copied somewhere in preparation for the hardware being decommissioned. All data which is not transferred somewhere else will be lost when the hardware goes away.
Current status: Call for deletion of unwanted datasets
As of March 2024, the extraction of bulk datasets from VTARCHIVE is hindered by lack of space in its front-end disk pool. This is where
metadata is stored
files entering/leaving tape are staged
and where files which are smaller than a configured threshold are permanently stored.
Because of this, we are focusing first on the review and deletion from VTARCHIVE of datasets which are unwanted or duplicated elsewhere.
Deleting data from VTARCHIVE is a metadata-only operation, which means it is very fast.
Metadata operations do not require physically loading any tapes, so this is low-impact on the system.
Deleting an unwanted dataset from VTARCHIVE will reduce the amount of metadata on the system
Deletion of files smaller than the threshold will directly reduce the usage of the disk pool
Please check your storage on VTARCHIVE and delete any datasets which do not need to be retained.
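As a concrete sketch, checking a dataset's size and then removing it looks like the following. The paths here are hypothetical stand-ins (a temporary directory is created so the commands can be demonstrated); substitute your own archive directory, e.g. /vtarchive/users/hokiebird/old-project.

```shell
# Stand-in for an unwanted archive dataset (hypothetical path)
DATASET=$(mktemp -d)/old-project
mkdir -p "$DATASET"
head -c 65536 /dev/zero > "$DATASET/run1.dat"

du -sh "$DATASET"   # see how much space the dataset occupies
rm -r "$DATASET"    # delete it once you are certain it is no longer needed
```

Because deletion on VTARCHIVE is a metadata-only operation, the `rm` returns quickly even for large datasets.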
Overview
Preparation is the key to minimizing migration time and making sure valuable data is retained. Following these steps is recommended:
Identify
ARC can provide a list of files on the archive to make it easier to classify the data. The listings include file sizes and last access times. Please give us the names of the directories on the archive containing your data and we will be happy to generate the list of files for you.
Getting a list of the files
There are nearly 530 million files on the archive. Hence ARC needs to know the archive partition and directories containing your files in order to produce a listing. Here is how you determine the partition.
Suppose you access the archive on tinkercliffs1.arc.vt.edu and your archive files are under the directory /vtarchive/users/hokiebird. The following set of commands will display the information needed for ARC to provide the file listings.
$ whoami
hokiebird
$ ls /vtarchive/users/hokiebird
...
$ hostname
tinkercliffs1
$ mount | awk '/vtarchive/{print $3, $1}'
/vtarchive arnfs2-isb.cc.vt.edu:/gpfs/archive/arc
From this you would send the following information in a request to the ARC help desk asking for a listing of the files on VT Archive:
ARC userid: hokiebird
directory: /vtarchive/users/hokiebird
host: tinkercliffs1
mount-info: /vtarchive arnfs2-isb.cc.vt.edu:/gpfs/archive/arc
The first of the mount-info items, /vtarchive, matches the root of the path to the directory. This should be the case if the archive is mounted at /vtarchive, as it is for all ARC resources. It may be mounted in a different location on the system you use. For example, the path may be /coldstorage/hokiebird. In that case, replace /vtarchive/ with /coldstorage/ in the awk command above.
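For example, with /coldstorage/ substituted the filter becomes `mount | awk '/coldstorage/{print $3, $1}'`. The sketch below demonstrates it on one simulated line of `mount` output (the server name is hypothetical):

```shell
# Simulated `mount` output line; on your system run `mount` itself
printf 'coldfs.example.edu:/gpfs/archive/arc on /coldstorage type nfs (rw)\n' \
  | awk '/coldstorage/{print $3, $1}'
# prints: /coldstorage coldfs.example.edu:/gpfs/archive/arc
```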
The second of the mount-info items gives the server that the archive is being mounted from and the path on the remote server, in this case arnfs2-isb.cc.vt.edu:/gpfs/archive/arc. The information before the colon is the server which exports the archive (arnfs2-isb.cc.vt.edu) and the information after the colon (/gpfs/archive/arc) is the path to the archive partition. The last directory in the path is the partition name. In this case, the partition is arc.
Data may be located in other places on the archive partition, for example /vtarchive/groups/<group>. Include all directories that possibly contain your data in your request to the ARC help desk. A listing of all the files in those directories will be generated.
Identify which files need to be transferred
The next step is to identify the files that need to be transferred and those that are no longer needed. This is the most critical step. It is also likely to be the most tedious. Please keep in mind that files that don’t have to be transferred make the process easier and faster. Some of the data stored in the archive is approaching a decade old, hasn’t been accessed in a long time, and is therefore probably not needed anymore.
Note: transferring data from the archive puts additional stress on the hardware and increases the possibility of hardware failure. It is therefore wise to order the files by priority so that the most important files are transferred first. Likewise, transfer your data off as soon as possible rather than waiting: the probability of failure is lower earlier in the process of getting everyone’s data off the archive.
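One way to triage, sketched here on a temporary directory with synthetic access times (substitute your real archive path), is to list files with their last-access dates, oldest first, so stale data can be dropped and important data queued for transfer first:

```shell
DIR=$(mktemp -d)                    # stand-in for your archive directory
touch -a -d '2015-06-01' "$DIR/old_results.dat"
touch -a -d '2024-01-15' "$DIR/current_inputs.dat"

# GNU find: print last-access date and path, then sort oldest-first
find "$DIR" -type f -printf '%AY-%Am-%Ad %p\n' | sort
```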
Plan
It is always wise to create a plan of action before beginning a large scale task such as transferring large amounts of data. Here are the main things you need to consider.
Compression
Many modern data formats have built-in, native compression and this makes the resulting files relatively incompressible. For large datasets, compression can take hours or even days. If you’re not getting at least 50% compression then it may not be worth the effort/time.
You can use the du (disk usage) command to measure the size of a directory and its contents:
du -s ./datadir
Use tar to create (-c) a single-file archive and compress (-z) it using gzip:
tar -zcf datadir.tar.gz ./datadir
If datadir is large (>10GB) but not at least twice the size of datadir.tar.gz, then it’s probably not worth the CPU time to compress similar directories.
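The ratio check above can be sketched end-to-end as follows, using a small synthetic text directory (substitute your real data directory for $DIR):

```shell
# Build a small directory of compressible data for demonstration
DIR=$(mktemp -d)/datadir
mkdir -p "$DIR"
seq 1 100000 > "$DIR/table.txt"     # plain text compresses well

# Compress, then compare on-disk sizes in KB
tar -zcf "$DIR.tar.gz" -C "$(dirname "$DIR")" "$(basename "$DIR")"
orig_kb=$(du -sk "$DIR" | cut -f1)
comp_kb=$(du -sk "$DIR.tar.gz" | cut -f1)
echo "original: ${orig_kb}K  compressed: ${comp_kb}K"
```

If the compressed size comes out at more than half the original, skip compressing similar directories and copy them as-is.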
Location
The first decision is where to put the data. We recommend creating an archive directory within your project’s space under /projects. Your current storage allocation may not be sufficient to accommodate the archive files, which may necessitate requesting a larger allocation. If so, please contact ARC help.
Time
The time it will take to transfer the data off of the archive depends on the amount of data and how busy the archive is at the time. The archive is a hierarchical storage system which is backed up to tape. Your data is very likely on tape and so retrieval time will be quite a bit longer than data which is stored on disk.
If you haven’t done so, please consider reducing the number of files that need to be transferred. Do you really need the files that were archived a decade ago and haven’t been touched since? A common data retention requirement is five years after the conclusion of a grant. Fewer files mean faster transfers so we strongly recommend reducing the set of files that need to be transferred.
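One way to spot deletion candidates is to look for files not accessed in roughly five years (1825 days). Demonstrated here on a temporary directory with synthetic access times; substitute your archive path:

```shell
DIR=$(mktemp -d)                       # stand-in for your archive directory
touch -a -d '10 years ago' "$DIR/stale.dat"
touch "$DIR/fresh.dat"

# Files whose last access is more than 1825 days ago; lists only stale.dat
find "$DIR" -type f -atime +1825
```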
Aggregation
File system performance suffers with large numbers of small files. We continue to strongly encourage aggregating sets of files that constitute a working set using tar or zip, just as when transferring files. Now is the time to improve the layout of the data by aggregating while transferring archived files to /projects.
Transfer
Transferring the files off the archive is straightforward once the plan is in place. Once again, we strongly suggest that you transfer the data now rather than wait.
Copying
Copying is straightforward. cp has many options, but you will probably only need -p (preserve mode, ownership, and timestamps), -a (archive mode, which implies -p plus recursion), and --sparse=always (if a file may contain long runs of zero bytes).
$ cp -pa --sparse=always /vtarchive/user/hokiebird/data /projects/hokiebird-project/archive/data
Aggregation
As mentioned before, aggregating large numbers of related files into a single file improves performance of the file system and the network so please aggregate where possible.
$ tar -zcf /projects/hokiebird-project/archive/data.tgz /vtarchive/user/hokiebird/data
or
$ zip -r /projects/hokiebird-project/archive/data.zip /vtarchive/user/hokiebird/data
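Whichever method you use, it is prudent to verify the copy before the archive hardware goes away. One approach (not part of the steps above) is to checksum source and destination and compare; a minimal sketch with sha256sum, using temporary files as stand-ins for your real source and destination:

```shell
SRC=$(mktemp) && DST=$(mktemp)
echo "important result" > "$SRC"
cp -p "$SRC" "$DST"

# Checksum both copies and compare
src_sum=$(sha256sum < "$SRC")
dst_sum=$(sha256sum < "$DST")
[ "$src_sum" = "$dst_sum" ] && echo "checksums match"
```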