Data transfer tools

SCP

Use scp to transfer a file or directory to or from ARC systems.

Call the scp command from the shell command line. It is available in the default installations of Linux, Windows (PowerShell), and macOS. scp significantly outperforms GUI-based tools on Windows such as MobaXterm and WinSCP.

The basic syntax is scp <source> <destination>. Both the source and the destination can be local or remote. When specifying a remote source or destination, provide the hostname and the full path to the file or directory, separated by a colon, like this:

user@host.domain:/path/to/file

ARC clusters share the /home and /projects mountpoints. Therefore, any files you transfer are automatically visible on all clusters.

We recommend using the host datatransfer.arc.vt.edu to improve the performance of the data transfer.

Note: We strongly recommend using Globus for transferring large data sets. Globus transfers files in parallel, while scp transfers files one at a time and therefore does not perform nearly as well.

Example copy from ARC to your computer

In this example we “pull” data onto the local computer (e.g. a laptop, workstation, or even a shell on another ARC node) from ARC systems. The <source> therefore uses the hostname:filename format, and the <destination> is the current working directory, which is referenced with a period “.”. Replace myVTpid with your actual username.

scp myVTpid@datatransfer.arc.vt.edu:/home/myVTpid/filename.zip .

Example copy from your computer to ARC

In this example we push a directory dirname and its contents from the local system to a /projects/mygroup/ directory:

scp -r dirname myVTpid@datatransfer.arc.vt.edu:/projects/mygroup/

The “-r” option requests a “recursive” transfer, which means the referenced directory and all of its contents will be transferred. The directory structure on the destination will be identical to the source. If “-r” is not specified but the source is a directory, scp will fail with an error like:

scp: omitting directory 'dirname'

RSYNC

rsync, “a fast, versatile, remote (and local) file-copying tool”, is a standard tool on Linux and Unix systems with a long list of options for customizing the nature of the transfer. It is particularly well-suited for synchronization, where different versions of a data collection reside in two locations, because it can minimize the amount of data transferred and can resume a partially completed transfer. scp or cp, on the other hand, will always perform an entire copy from source to destination, even if files and directories already exist at the destination.
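
As a sketch of typical usage (dirname and mygroup are placeholders, as in the scp examples above), the following pushes dirname to /projects/mygroup/ and can be re-run to resume or update a partially completed transfer:

rsync -avP dirname myVTpid@datatransfer.arc.vt.edu:/projects/mygroup/

Here “-a” (archive) recurses into the directory and preserves permissions and timestamps, “-v” prints each file as it is transferred, and “-P” shows progress and keeps partially transferred files so an interrupted transfer can be resumed.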

Best practices for transfers

Package datasets with a large number of files before transferring

If you need to transfer a dataset that contains a large number of small files, use tar or zip to package the dataset into a smaller number of larger files. Most tools will process files in a dataset sequentially, and there is significant overhead from the OS, network, and storage devices when many small files are transferred this way. A single, large-file transfer, on the other hand, will only incur this per-file overhead once; the rest of the time is spent in high-bandwidth transfer.
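
For instance, one way to package and transfer such a dataset (the paths and names below are placeholders):

tar -czf dataset.tar.gz dataset/
scp dataset.tar.gz myVTpid@datatransfer.arc.vt.edu:/projects/mygroup/

On the destination side, unpack the archive with tar -xzf dataset.tar.gz.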

For context, in these scenarios:

  • “small files” means files smaller than 10 MB

  • “large number of files” means on the order of thousands (1000+)

This is applicable for any transfer of a large number of small files, even intra-cluster. In many cases, it can be very effective for copying a data set (for example AI/ML training data) to local scratch space on compute nodes. See this example for more details.

Parallelize data transfers when possible

Most, if not all, of ARC’s networked storage systems (e.g. /home, /scratch, /projects) can handle many simultaneous data flows, so a single data transfer achieves much lower performance than the aggregate of several streams running in parallel. Standard tools like cp, mv, scp, and rsync process their source arguments serially, which means only one file is copied at a time. To engage the full bandwidth of the networked storage system, we need to parallelize, i.e. run multiple simultaneous transfers.

In the example benchmark below, GNU parallel is used to launch a varying number of simultaneous copies from /scratch to $TMPNVME on a DGX compute node. Performance improves dramatically with parallelization, but plateaus at around eight simultaneous copies.

[Figure: copy performance on a DGX node versus number of simultaneous copies from /scratch to $TMPNVME]
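
A minimal sketch of this approach (the source path is a placeholder; adjust -j to set the number of simultaneous copies):

find /scratch/myVTpid/dataset -type f | parallel -j 8 cp {} $TMPNVME/

GNU parallel reads the list of files on standard input and runs up to eight cp processes at a time, substituting each filename for {}.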

rclone

rclone can transfer data between ARC storage and cloud services such as Google Drive. Because the initial configuration requires a web browser for authentication, start from an OnDemand Remote Desktop session:

  • Log in to OnDemand: https://ood.arc.vt.edu

  • Start a Remote Desktop session

  • Start a shell via the link in the job card

List the available tmux sessions:

tmux ls

Attach to the tmux session for your job. For example, for the job with id 447439:

tmux a -t OOD.tmux.447439

Then load the rclone module:

module load rclone

Now follow: https://rclone.org/drive/

Example: Configure rclone for metfaces

As an example, to download the metfaces dataset (it is large, so beware), run rclone config and answer the prompts as follows (the storage number selects Google Drive and may differ between rclone versions):

rclone config
> n
> metfaces
storage> 11
client_id> {blank}
client_secret> {blank}
scope> 1

Next is the root folder id. For metfaces, it is the last component of the Google Drive URL https://drive.google.com/drive/folders/1w-Os4uERBmXwCm7Oo_kW6X3Sd2YHpJMC:

root_folder_id> 1w-Os4uERBmXwCm7Oo_kW6X3Sd2YHpJMC
service_account_file> {blank}
Y/n> y

Now copy the address shown, for instance “http://127.0.0.1:53682/auth”.

Go to the Remote Desktop, start Firefox, and head to that web address.

Go back to the rclone config in the shell and finish the remaining prompts:

Y/n> n
Y/e/d> y
/n/d/r/c/s/q> q

To start using the rclone remote you just set up, you can, for instance:

Get a listing of files:

rclone ls metfaces:

Download the data in the metfaces Google Drive to the current directory:

rclone copy metfaces: ./
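
rclone also copies multiple files in parallel, which fits the parallelization advice above. As a sketch (the destination path is a placeholder), --transfers controls how many files are copied at once and --progress shows live statistics:

rclone copy metfaces: /projects/mygroup/metfaces/ --transfers 8 --progress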

FileZilla

FileZilla is a popular tool with a more intuitive graphical user interface for transferring data. With FileZilla installed, the application can securely move data in and out of ARC’s storage systems.

Example

Use the Quickconnect bar to set up the connection between the local and remote systems. The host can be datatransfer.arc.vt.edu (recommended) or any login node of the clusters. Enter your PID as the username along with your password, and use port 22.

[Screenshot: FileZilla Quickconnect setup]

After entering the fields, complete the 2-Factor Authentication (2FA) prompt required to make the secure connection. Once the connection is made, transfers can be performed through the interface.

[Screenshot: FileZilla connected to ARC]

Globus

ARC purchased a High Assurance Subscription to Globus and established a Globus data transfer node, globus.arc.vt.edu. Detailed documentation about using Globus on ARC is available here.