Migrating data off VT Archive
VT Archive has reached end-of-life. The hardware is out of warranty and is becoming increasingly unstable. The useful data on it needs to be copied somewhere in preparation for the hardware being decommissioned. All data which is not transferred somewhere else will be lost when the hardware goes away.
Current status: Call for deletion of unwanted datasets
As of March 2024, the extraction of bulk datasets from VTARCHIVE is hindered by lack of space in its front-end disk pool. This is where
metadata is stored
files entering/leaving tape are staged
and where files which are smaller than a configured threshold are permanently stored.
Because of this, we are focusing first on the review and deletion from VTARCHIVE of datasets which are unwanted or duplicated elsewhere.
Deleting data from VTARCHIVE is a metadata-only operation, which means it is very fast.
Metadata operations do not require physically loading any tapes, so this is low-impact on the system.
Deleting an unwanted dataset from VTARCHIVE will reduce the amount of metadata on the system
Deletion of files smaller than the threshold will directly reduce the usage of the disk pool
Please check your storage on VTARCHIVE and delete any datasets which do not need to be retained.
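As a concrete sketch, checking a dataset's size and then removing it looks like the following. The paths here are hypothetical stand-ins (a temporary directory is created so the commands can be demonstrated); substitute your own archive directory, e.g. /vtarchive/users/hokiebird/old-project.

```shell
# Stand-in for an unwanted archive dataset (hypothetical path)
DATASET=$(mktemp -d)/old-project
mkdir -p "$DATASET"
head -c 65536 /dev/zero > "$DATASET/run1.dat"

du -sh "$DATASET"   # see how much space the dataset occupies
rm -r "$DATASET"    # delete it once you are certain it is no longer needed
```

Because deletion on VTARCHIVE is a metadata-only operation, the `rm` returns quickly even for large datasets.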
Overview
Preparation is the key to minimizing migration time and making sure valuable data is retained. Following these steps is recommended:
Identify
ARC can provide a list of files on the archive to make it easier to classify the data. The listings include file sizes and last access times. Please give us the names of the directories on the archive containing your data and we will be happy to generate the list of files for you.
Getting a list of the files
There are nearly 530 million files on the archive. Hence ARC needs to know the archive partition and directories containing your files in order to produce a listing. Here is how you determine the partition.
Suppose you access the archive on tinkercliffs1.arc.vt.edu and your archive files are under the directory /vtarchive/users/hokiebird. The following set of commands will display the information needed for ARC to provide the file listings.
$ whoami
hokiebird
$ ls /vtarchive/users/hokiebird
...
$ hostname
tinkercliffs1
$ mount | awk '/vtarchive/{print $3, $1}'
/vtarchive arnfs2-isb.cc.vt.edu:/gpfs/archive/arc
From this you would send the following information in a request to the ARC help desk asking for a listing of the files on VT Archive:
ARC userid: hokiebird
directory: /vtarchive/users/hokiebird
host: tinkercliffs1
mount-info: /vtarchive arnfs2-isb.cc.vt.edu:/gpfs/archive/arc
The first of the mount-info items, /vtarchive, matches the root of the path to the directory. This should be the case if the archive is mounted at /vtarchive, as it is for all ARC resources. It may be mounted in a different location on the system you use. For example, the path may be /coldstorage/hokiebird. In that case, replace /vtarchive/ with /coldstorage/ in the awk command above.
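For example, with /coldstorage/ substituted the filter becomes `mount | awk '/coldstorage/{print $3, $1}'`. The sketch below demonstrates it on one simulated line of `mount` output (the server name is hypothetical):

```shell
# Simulated `mount` output line; on your system run `mount` itself
printf 'coldfs.example.edu:/gpfs/archive/arc on /coldstorage type nfs (rw)\n' \
  | awk '/coldstorage/{print $3, $1}'
# prints: /coldstorage coldfs.example.edu:/gpfs/archive/arc
```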
The second of the mount-info items gives the server that the archive is being mounted from and the path on the remote server, in this case arnfs2-isb.cc.vt.edu:/gpfs/archive/arc. The information before the colon is the server which exports the archive (arnfs2-isb.cc.vt.edu) and the information after the colon (/gpfs/archive/arc) is the path to the archive partition. The last directory in the path is the partition name. In this case, the partition is arc.
Data may be located in other places on the archive partition, for example /vtarchive/groups/<group>. Include all directories that possibly contain your data in your request to the ARC help desk. A listing of all the files in those directories will be generated.
Identify which files need to be transferred
The next step is to identify the files that need to be transferred and those that are no longer needed. This is the most critical step. It is also likely to be the most tedious. Please keep in mind that files that don’t have to be transferred make the process easier and faster. Some of the data stored in the archive is approaching a decade old, hasn’t been accessed in a long time, and is therefore probably not needed anymore.
Note: transferring data from the archive puts additional stress on the hardware and increases the possibility of hardware failure. It is therefore wise to order the files by priority so that the most important files are transferred first. Likewise, transfer your data off as soon as possible rather than waiting: the probability of failure is lower earlier in the process of getting everyone’s data off the archive.
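One way to triage, sketched here on a temporary directory with synthetic access times (substitute your real archive path), is to list files with their last-access dates, oldest first, so stale data can be dropped and important data queued for transfer first:

```shell
DIR=$(mktemp -d)                    # stand-in for your archive directory
touch -a -d '2015-06-01' "$DIR/old_results.dat"
touch -a -d '2024-01-15' "$DIR/current_inputs.dat"

# GNU find: print last-access date and path, then sort oldest-first
find "$DIR" -type f -printf '%AY-%Am-%Ad %p\n' | sort
```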
Plan
It is always wise to create a plan of action before beginning a large scale task such as transferring large amounts of data. Here are the main things you need to consider.
Compression
Many modern data formats have built-in, native compression and this makes the resulting files relatively incompressible. For large datasets, compression can take hours or even days. If you’re not getting at least 50% compression then it may not be worth the effort/time.
You can use the du (disk usage) command to measure the size of a directory and its contents:
du -s ./datadir
Use tar to create (-c) a single-file archive and compress (-z) it using gzip:
tar -zcf datadir.tar.gz ./datadir
If datadir is large (>10GB) but not at least twice the size of datadir.tar.gz, then it’s probably not worth the CPU time to compress similar directories.
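The ratio check above can be sketched end-to-end as follows, using a small synthetic text directory (substitute your real data directory for $DIR):

```shell
# Build a small directory of compressible data for demonstration
DIR=$(mktemp -d)/datadir
mkdir -p "$DIR"
seq 1 100000 > "$DIR/table.txt"     # plain text compresses well

# Compress, then compare on-disk sizes in KB
tar -zcf "$DIR.tar.gz" -C "$(dirname "$DIR")" "$(basename "$DIR")"
orig_kb=$(du -sk "$DIR" | cut -f1)
comp_kb=$(du -sk "$DIR.tar.gz" | cut -f1)
echo "original: ${orig_kb}K  compressed: ${comp_kb}K"
```

If the compressed size comes out at more than half the original, skip compressing similar directories and copy them as-is.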
Location
The first decision is where to put the data. We recommend creating an archive directory within your project’s space under /projects. Your current storage allocation may not be sufficient to accommodate the archive files, which may necessitate requesting a larger allocation. If so, please contact ARC help.
Time
The time it will take to transfer the data off of the archive depends on the amount of data and how busy the archive is at the time. The archive is a hierarchical storage system which is backed up to tape. Your data is very likely on tape and so retrieval time will be quite a bit longer than data which is stored on disk.
If you haven’t done so, please consider reducing the number of files that need to be transferred. Do you really need the files that were archived a decade ago and haven’t been touched since? A common data retention requirement is five years after the conclusion of a grant. Fewer files mean faster transfers so we strongly recommend reducing the set of files that need to be transferred.
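One way to spot deletion candidates is to look for files not accessed in roughly five years (1825 days). Demonstrated here on a temporary directory with synthetic access times; substitute your archive path:

```shell
DIR=$(mktemp -d)                       # stand-in for your archive directory
touch -a -d '10 years ago' "$DIR/stale.dat"
touch "$DIR/fresh.dat"

# Files whose last access is more than 1825 days ago; lists only stale.dat
find "$DIR" -type f -atime +1825
```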
Aggregation
File system performance suffers with large numbers of small files. We continue to strongly encourage aggregating sets of files that constitute a working set using tar or zip, just as when transferring files. Now is the time to improve the layout of the data by aggregating while transferring archived files to /projects.
Transfer
Transferring the files off the archive is straightforward once the plan is in place. Once again, we strongly suggest that you transfer the data now rather than wait.
Copying
Copying is straightforward. cp has many options, but you will probably only need -p (preserve mode, ownership, and timestamps), -a (archive mode, which implies -p plus recursion), and --sparse=always (if a file may contain long runs of zero bytes).
$ cp -pa --sparse=always /vtarchive/user/hokiebird/data /projects/hokiebird-project/archive/data
Aggregation
As mentioned before, aggregating large numbers of related files into a single file improves performance of the file system and the network so please aggregate where possible.
$ tar -zcf /projects/hokiebird-project/archive/data.tgz /vtarchive/user/hokiebird/data
or
$ zip -r /projects/hokiebird-project/archive/data.zip /vtarchive/user/hokiebird/data
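Whichever method you use, it is prudent to verify the copy before the archive hardware goes away. One approach (not part of the steps above) is to checksum source and destination and compare; a minimal sketch with sha256sum, using temporary files as stand-ins for your real source and destination:

```shell
SRC=$(mktemp) && DST=$(mktemp)
echo "important result" > "$SRC"
cp -p "$SRC" "$DST"

# Checksum both copies and compare
src_sum=$(sha256sum < "$SRC")
dst_sum=$(sha256sum < "$DST")
[ "$src_sum" = "$dst_sum" ] && echo "checksums match"
```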