Storage Resources

Overview

ARC offers several different storage options for users’ data:

Persistent Storage

| Name    | Intent                                             | File System              | Environment Variable | Per User Maximum                                             | Data Lifespan                            | Available On            |
|---------|----------------------------------------------------|--------------------------|----------------------|--------------------------------------------------------------|------------------------------------------|-------------------------|
| Home    | Long-term storage of files                         | Qumulo                   | $HOME                | 640 GB, 1 million files                                      | As long as the user account is active    | Login and Compute Nodes |
| Project | Long-term storage of shared group files            | GPFS                     | n/a                  | 50 TB, ~10 million files per faculty researcher (expandable) | As long as the project account is active | Login and Compute Nodes |
| Archive | Long-term storage for infrequently-accessed files  | IBM Cloud Object Storage | n/a                  | Available for purchase                                       | Length of the purchase agreement         | Managed by ARC staff    |

Scratch (temporary) storage

| Name                    | Intent                                                   | Per User Maximum                     | Data Lifespan | File System                                          | Environment Variable | Available On            |
|-------------------------|----------------------------------------------------------|--------------------------------------|---------------|------------------------------------------------------|----------------------|-------------------------|
| Scratch                 | Short-term access to working files. Automatic deletion.  | No size limits enforced              | 90 days       | Vast                                                 | n/a                  | Login and Compute Nodes |
| Local Scratch (TMPDIR)  | Fast, temporary storage. Auto-deleted when job ends      | Size of node hard drive              | Length of Job | Local disk hard drives, usually spinning disk or SSD | $TMPDIR              | Compute Nodes           |
| Local Scratch (TMPNVME) | Fast, temporary storage. Auto-deleted when job ends      | Size of node NVMe drive              | Length of Job | Local NVMe drives                                    | $TMPNVME             | Compute Nodes           |
| Memory (TMPFS)          | Very fast I/O                                            | Size of node memory allocated to job | Length of Job | Memory (RAM)                                         | $TMPFS               | Compute Nodes           |

Home

Home provides long-term storage for system-specific data or files, such as installed programs or compiled executables. Home can be reached via the $HOME environment variable, so a user who wishes to navigate to their Home directory can simply type cd $HOME. Each user is provided a maximum of 640 GB in their Home directory (across all systems). Home directories are not allowed to exceed this limit; note that running jobs will fail if they try to write to a Home directory once the hard limit is reached.

Note

Avoid reading/writing data to/from HOME in a job or using it as a working directory. Stage files into a scratch location, such as /scratch or Local Scratch, to keep unnecessary I/O off of the HOME filesystem and improve performance.

Project

Project provides long-term storage for files shared among a research project or group, facilitating collaboration and data exchange within the group. Each Virginia Tech faculty member can request group storage up to the prescribed limit at no cost by requesting a storage allocation via ColdFront. Additional storage may be purchased through the investment computing or cost center programs.

Data ownership passes to the shared directory creator/owner

A project PI requests a shared storage directory and grants access to others. Users who are given access to the shared directory can add, modify, and delete files in it according to the mode (set of permissions) of the directory.

Modes vary in many ways, and this sometimes results in group members, even the group owner, not having access to some files or subdirectories in their shared directory. The owner(s) of such files can fix this on their own with some chmod and/or chown commands. ARC encourages shared directory owners to work with their group members to establish best practices and to ensure that members' files are properly culled, organized, and handed over to the group before those members leave the group. ARC personnel can be consulted on the commands to use and best practices for implementing such a transfer.

Tip

By default, new files or folders created under a project can be read and modified by all members of the group. This means that if a user creates a folder in the project, other users can add new files to that folder; similarly, if a user places a file in the folder, other users can modify the file. To prevent other users from creating or modifying files and folders, the owner must explicitly remove those permissions: run chmod g-w file_or_folder_name to remove write permission from the rest of the group. Note that permissions apply individually per file or folder, so subsequent files or folders added to the directory may also need their permissions adjusted to explicitly control who can modify data and prevent unintended data loss. You can also run chmod -R g-w folder_name to remove group write permission from every file and folder recursively.

New files and folders created under /projects will be, by default:

  • User (owner): read & write (and execute for directories)

  • Group: read & write (and execute for directories)

  • Others: no permissions

Below is an overview of Linux permissions, examples of common folder‐permission layouts, how to modify permissions/ownership, an FAQ, and a few additional recommendations.

Basic Linux permissions

Every file or directory on Linux has the following permissions:

  • Type: read (r), write (w), and execute (x)

  • Level: user (u), group (g), and others (o)

Therefore, it can be explicitly controlled who (user or group) can do what (read or write) on each file or directory. This is useful to manage varied types of permissions on different folders within the same /projects shared directory.
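As a quick illustration, ls -l displays these bits in the order user, group, other. The file, owner, and group names below are hypothetical:

    [mypid@tinkercliffs2 ~]$ ls -l /projects/project_123/results.csv
    -rw-rw---- 1 mypid project_123 2048 Aug 20 10:15 /projects/project_123/results.csv

Reading left to right: the leading - marks a regular file (a d would mark a directory), the first rw- is the user (owner) permission, the second rw- is the group permission, and the trailing --- means others have no access.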

New default umask and its effect

  • Old umask 002 prior to August 20, 2025 → new files 664, directories 775

  • New umask 007 after August 20, 2025 → new files 660, directories 770

Thus, when you run e.g. touch /projects/project_123/foo.txt, its permissions will be -rw-rw---- (660), owned by you and the project_123 group. Every member in the project_123 group will be able to modify the foo.txt file.

If you wish to remove the writable permissions of the group on foo.txt you can use the command chmod g-w /projects/project_123/foo.txt
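A minimal sketch of checking this behavior for yourself, assuming the new umask 007 and the hypothetical project_123 directory used above:

    [mypid@tinkercliffs2 ~]$ umask
    0007
    [mypid@tinkercliffs2 ~]$ touch /projects/project_123/foo.txt
    [mypid@tinkercliffs2 ~]$ mkdir /projects/project_123/newdir
    [mypid@tinkercliffs2 ~]$ ls -ld /projects/project_123/foo.txt /projects/project_123/newdir
    -rw-rw---- 1 mypid project_123    0 Aug 20 10:20 /projects/project_123/foo.txt
    drwxrwx--- 2 mypid project_123 4096 Aug 20 10:20 /projects/project_123/newdir

With umask 007 the “other” bits are masked off, so new files come out as 660 and new directories as 770.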

Common /projects folder layout examples

Here are common examples for folders and permissions under /projects/project_123/

Read & write by everyone in the project (default behavior)

mkdir -p /projects/project_123/folder_read_write_everyone

chmod 770 /projects/project_123/folder_read_write_everyone

Readable (but not writeable) by everyone in the project. Only the owner can write.

mkdir -p /projects/project_123/folder_readable_by_everyone

chmod 750 /projects/project_123/folder_readable_by_everyone

Only readable & writable by the owner (e.g. user johndoe)

mkdir -p /projects/project_123/private_folder_johndoe

chmod 700 /projects/project_123/private_folder_johndoe

chown johndoe:johndoe /projects/project_123/private_folder_johndoe

Users and groups can create a hierarchy of folders with increasingly restricted permissions as needed by the research group.

Other examples for changing permissions & ownership

Remove group write:

chmod g-w file_or_folder

Add group write recursively:

chmod -R g+rwX folder_name

Change group ownership recursively:

chown -R :project_123 folder_name

Change owner & group:

chown johndoe:project_123 file_or_folder

FAQ

  1. Will existing files/folders be affected? No. This change only applies to new files and directories. To update existing objects, use chmod/chown manually (see examples above).

  2. Does this affect the visibility of my /home or /scratch files and directories? No. Despite the umask change, de facto privileges for /home and /scratch remain the same since those areas are intended for private user space. Files and folders under /home and /scratch are by default created with the user’s self group. Optionally, users can use chmod and chown to grant additional read/write permissions to other users/groups for files/folders on /scratch

  3. Can I retain the old default behavior? Yes. To revert to umask 002 for your login shell, add the line umask 002 to your ~/.bashrc, for example: echo 'umask 002' >> ~/.bashrc

  4. What happens when I move (mv) files into /projects? The file keeps its original group owner. If you need it to belong to the project group, run: chown :project_123 moved_file_or_folder

Note that mv preserves the original permissions and group.

Additional recommendations

Set setgid bit on shared directories:

chmod g+s /projects/project_123/shared_folder

Ensures that new items inherit the directory’s group automatically.
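You can verify the setgid bit with ls -ld: an s appears in the group execute position (the directory below is the hypothetical example from the command above):

    [mypid@tinkercliffs2 ~]$ ls -ld /projects/project_123/shared_folder
    drwxrws--- 2 mypid project_123 4096 Aug 20 10:25 /projects/project_123/shared_folder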

Consider POSIX ACLs for finer control:

setfacl -m g:othergroup:rwX /path/to/dir

Especially useful when you need per‑user or per‑group rules beyond standard Unix bits.
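To review the ACLs currently applied to a directory, getfacl can be used. The path and group names below are hypothetical:

    [mypid@tinkercliffs2 ~]$ getfacl /path/to/dir
    # file: path/to/dir
    # owner: mypid
    # group: project_123
    user::rwx
    group::rwx
    group:othergroup:rwx
    mask::rwx
    other::---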

Check your group membership:

Use groups or id to verify you are in the correct project group.

Update scripts and workflows:

If you have automated jobs that assume world‑readable output, please adjust them or explicitly chmod where needed.

Be mindful of sensitive data:

Stricter “other” permissions reduce the risk of accidental exposure of private data.

When a project owner removes a user from their shared directory via ColdFront, that user will no longer be able to access the directory to make such changes.

Note: when a user is removed, a script runs which ensures the group id (gid) and mode (permissions) of every file that user owned in the shared directory are set to allow all group members access.

Best Practices for long-term storage

ARC policy limits the number of files and the total size of data stored in these filesystems, which are designed for long-term storage. To avoid violating these policies, users should tar and compress data stored in these filesystems.

Tip

To “tar” means to package files into a single file, thereby reducing inodes. We “extract” data from a tar or zip file. To “compress” means to reduce byte size. We “decompress”/“unzip” data from a compressed/zipped file.
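For example, a results directory can be packaged and compressed in one step with tar (the directory name below is hypothetical):

    # Package a directory into one file (fewer inodes) and gzip-compress it
    tar -czf results_2025.tar.gz results_2025/

    # Later: decompress and extract the contents again
    tar -xzf results_2025.tar.gz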

Consider the fundamental tradeoff between compression and time. The best protocol depends heavily on the research workflow. Some questions to ask yourself:

  1. How compressible is the data?

    • Because compression algorithms take advantage of redundancy to reduce size, some data are more compressible than others. Additionally, there are many compression options that affect the runtime and the size reduction.

    • Take the following example:

      | Command                | Size | Time       |
      |------------------------|------|------------|
      | uncompressed plaintext | 954M | N/A        |
      | gzip                   | 309M | 0m44.706s  |
      | gzip -2                | 347M | 0m20.155s  |
      | gzip -9                | 308M | 1m0.805s   |
      | xz                     | 220M | 7m59.689s  |
      | xz -2                  | 266M | 1m35.004s  |
      | xz -9                  | 204M | 11m39.037s |
      | xz -T 24               | 223M | 0m34.547s  |
      | xz -T 24 -9            | 206M | 2m40.020s  |

    Here we are testing different tools, gzip and xz, with different options. We are varying the “compression value”, where a higher value means more compression (and more time). We also see that parallelizing reduces the time cost.

    [slima2@tc049 text_data]$ time bash -c "xz -T 24 enwik9.tar"
    
    real    0m34.547s
    user    10m21.351s
    sys     0m1.509s
    
    • While these operations are acceptable on a login node, a compute node may be preferable, especially if parallelizing across more cores than the login node limits allow.

    • In every case, weigh the reduction in size against the time cost. See the compression benchmark figure below for differences between a selection of example datasets.

  2. How often do you plan to extract/transfer the data?

    • If your workflow involves copying to Local Scratch for better IO performance, it may be worthwhile to compress the data.

    • Some workflows use software that works directly on compressed files. This is especially common with bioinformatics tools such as cellranger and bedtools, which handle gzipped files in particular. Luckily, the sequencing center probably prepared your data correctly before sending it to you.
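    For example, many common command-line tools can read gzipped files directly, with no separate decompression step (the file name below is hypothetical):

      # Peek at a gzipped file without writing a decompressed copy
      zcat sample_R1.fastq.gz | head
      # Count lines in the compressed file directly
      zcat sample_R1.fastq.gz | wc -l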

Archive

Archive provides users with long-term storage for data that does not need to be accessed frequently, e.g. important or historical results, or data kept for preservation purposes to comply with the mandates of federal grants (data retention policies). Archive is not mounted on the clusters and is accessible only through ARC staff. Researchers can compress their datasets on the clusters and ARC staff will transfer them to the archive.

Archive storage may be purchased through the investment computing or cost center programs. Please reach out to us to acquire archive storage.

Scratch Filesystems

VAST - Scratch

Note

Files and directories stored here are subject to automatic deletion. Do not use it for long term storage.

Local scratch storage options (see below) generally provide the best performance, but they are constrained to the duration of a job and are strictly local to the compute node(s) allocated to a job. In contrast, the VAST storage system provides temporary staging and working space with better performance characteristics than HOME or PROJECTS. It is “global” in the sense that it is accessible from any node on the cluster.

It is a shared resource and has limited capacity, but individual use at any point in time is unlimited provided the space is available. A strict automatic deletion policy is in place: any file on /scratch will be automatically deleted once it reaches an age of 90 days.

Best practices

  • Create a directory for yourself, e.g. mkdir /scratch/<username>.

  • Stage files there for a job or set of jobs (see the sketch after this list).

  • Be aware that some file transfer tools, such as mv and rsync with -a, preserve timestamps and other file metadata; a preserved timestamp can make a file look older than it is and trigger the 90-day auto-deletion sooner than expected.

  • Check timestamps using ls -l.

  • Keep the number of files and directories relatively small (e.g., less than 10,000). It is a network-attached filesystem and incurs the same performance overhead for file operations that you would get with /home or /projects.

  • Immediately copy any files you want to keep to a permanent location to avoid accidental deletion.

  • Always remember the 90-day automatic deletion policy.
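A minimal sketch of staging files this way; the username, project, and file names are hypothetical:

    # Create a personal staging area and copy inputs in (cp assigns fresh timestamps)
    mkdir -p /scratch/mypid/myjob
    cp /projects/project_123/inputs/data.tar.gz /scratch/mypid/myjob/

    # Check the timestamps the 90-day policy will count from
    ls -l /scratch/mypid/myjob/

    # When done, copy anything worth keeping back to permanent storage
    cp /scratch/mypid/myjob/output.csv /projects/project_123/results/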

Tips for managing filestamps

  • rsync gives new timestamps by default. Do not use the -t/--times or -a/--archive options, which preserve source timestamps.

  • cp gives new timestamps by default. Avoid the -p/--preserve option, which preserves source timestamps.

  • mv preserves source timestamps by default and has no option to override this. Use cp instead; this is a general best practice for inter-filesystem transfers anyway.

  • wget preserves server timestamps by default. Override this with wget --no-use-server-timestamps ...
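Putting these tips together, the following transfer commands leave fresh timestamps on /scratch (the paths and URL are hypothetical):

    # rsync without -a/-t: copied files get new timestamps
    rsync -r /projects/project_123/inputs/ /scratch/mypid/inputs/

    # cp without -p likewise assigns new timestamps
    cp -r /projects/project_123/inputs /scratch/mypid/

    # wget: do not keep the server's timestamp on the downloaded file
    wget --no-use-server-timestamps -P /scratch/mypid/ https://example.org/dataset.tar.gz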

Automatic Deletion Details

As mentioned above, files and directories in /scratch will be automatically deleted based on aging policies. Here is how that works:

  1. The storage system runs an hourly job to identify files which have exceeded the aging policy (90 days) and adds these to the deletion queue.

  2. The storage system runs an automated job at 12:00am UTC (7:00PM EST) every day to process the deletion queue.

  3. Additionally, the storage system will detect and delete all empty directories regardless of age.
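To see which of your files are approaching the 90-day limit, you can list anything not modified in, say, the last 75 days; the path and threshold below are only an illustration:

    # Files under your scratch directory with a modification time older than 75 days
    find /scratch/mypid -type f -mtime +75 -exec ls -lh {} +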

Restoring files

In some situations, deleted files and directories may be restored from “snapshots”. Snapshots are an efficient way to keep several instances of the status of a file system at regular points in time.

For the /scratch file system, these are kept in the hidden directory /scratch/.snapshot, which contains a set of snapshots named according to their type (daily, weekly, or monthly) and the date-time when they were recorded. For example:

/scratch/.snapshot/week_2023-11-13_12_00_00_UTC

is an instance of a weekly snapshot which was recorded on 2023-11-13 at 12:00:00PM UTC.
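If a deleted file still exists in one of these snapshots, it can simply be copied back out. The snapshot name is taken from the example above; the user and file names are hypothetical:

    # List the available snapshots
    ls /scratch/.snapshot/

    # Copy a file from a snapshot back into your scratch directory
    cp /scratch/.snapshot/week_2023-11-13_12_00_00_UTC/mypid/results.csv /scratch/mypid/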

Snapshots may be recorded in daily, weekly, and monthly cycles, but ARC reserves the right to adjust the frequencies and quantities of snapshots which are retained. Changes in the frequencies and quantities may occasionally be needed to adjust how much of the storage system capacity is dedicated to snapshot retentions.

Note

While snapshots provide some level of protection against data loss, they should not be viewed as a “backup” or as part of a data retention plan.

Local Scratch

Running jobs are given a workspace on the local drives of each compute node allocated to the job. The path to this space is specified in the $TMPDIR environment variable. This provides a higher-performing option for I/O, which is a bottleneck for some tasks that involve handling either a large volume of data or a large number of file operations.

Note

Any files in local scratch are removed at the end of a job, so any results or files to be kept after the job ends must be copied to another location as part of the job. /scratch is a good choice for most people.
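A minimal sketch of a batch job that works this way, assuming the Slurm scheduler; the account, partition, program, and file names are hypothetical:

    #!/bin/bash
    #SBATCH --account=myaccount
    #SBATCH --partition=normal_q
    #SBATCH --nodes=1
    #SBATCH --time=2:00:00

    # Stage input data onto the node-local drive for fast I/O
    cp /scratch/mypid/inputs/data.tar.gz $TMPDIR/
    cd $TMPDIR
    tar -xzf data.tar.gz

    # Run the analysis against the local copy
    ./my_analysis data/ > results.csv

    # Copy results to a permanent location before the job ends and $TMPDIR is wiped
    cp results.csv /projects/project_123/results/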

Local Drives

Running jobs are given a workspace on the local drives on each compute node. The path to this space is specified in the $TMPDIR environment variable.

Solid State Drives (SSDs)

Solid state drives do not use rotational media (spinning disks/platters) but memory-like flash storage, which gives them better performance characteristics. When an SSD is available on the compute nodes allocated to a job, the environment variable $TMPSSD is set to a directory on the SSD accessible to the job owner.

Memory as storage

Running jobs have access to an in-memory mount on compute nodes via the $TMPFS environment variable. This should provide very fast read/write speeds for jobs doing I/O to files that fit in memory. Please note that these files are removed at the end of a job, so any results or files to be kept after the job ends must be copied to a permanent location such as Home or Project.

NVMe Drives

Same idea as Local Scratch, but on NVMe media which “has been designed to capitalize on the low latency and internal parallelism of solid-state storage devices.” Running jobs are given a workspace on the local NVMe drive on each compute node if it is so equipped. The path to this space is specified in the $TMPNVME environment variable. This provides another option for users who would prefer to do I/O to local disk (such as for some kinds of big data tasks). Please note that any files in local scratch are automatically removed at the end of a job, so any results or files to be kept after the job ends must be copied to Home or Project.

Checking Usage

You can check your current storage usage (in addition to your compute allocation usage) with the quota command:

[mypid@tinkercliffs2 ~]$ quota
USER       FILESYS/SET                         DATA (GiB)   QUOTA (GiB) FILES      QUOTA      NOTE 
mypid      /home                               584.2        596         -          -           

           GPFS                                                                              
mypid      /projects/myproject1                109.3        931                                
mypid      /projects/myproject2                2648.4       25600