Artificial Intelligence

Large Language Models

ARC offers Large Language Models (LLMs) for research. There are three main ways researchers can use LLMs in ARC:

https://llm.arc.vt.edu offers a no-effort web interface to an LLM. Use this tool for casual queries; do not use it with high-risk data. It does not offer API access. ARC will continuously update it to run the best-performing model publicly available from Hugging Face. Inference runs on GPUs within the ARC infrastructure. No data is sent to any third party outside the university. Data, prompts, and logs are preserved within the user’s space. Access is restricted to the VT network or VPN. It does not require an ARC account.
https://ood.arc.vt.edu offers a dedicated LLM via Open OnDemand. Use this tool for intensive queries. It offers web access and token-secured API access through the OpenAI API. Users can select their preferred model from a list of models publicly available on Hugging Face. Inference runs on GPUs within the ARC infrastructure. No data is sent to any third party outside the university. Data, prompts, and logs are preserved within the user’s space. Access is restricted to the VT network or VPN. It requires an ARC account and a compute allocation.
Advanced custom development via personalized Slurm scripts. Users can download any software to their user directory or run centrally installed software (e.g., vLLM and Ollama) with custom Slurm scripts. Models downloaded from Hugging Face are available at /common/data/models/. Access is restricted to the VT network or VPN. It requires an ARC account and a compute allocation. An example is provided below.

Running your own LLM using vLLM

The following example Slurm script launches a vLLM instance serving the model openai/gpt-oss-120b on two NVIDIA L40S GPUs on the Falcon cluster. The job is limited to 1 day, the server listens on port 8000, and the OpenAI API endpoint key is a3b91d38-6c74-4e56-b89f-3b2cfd728d1a. Adjust the settings to select the model, number of GPUs, context length, port, and API key you need.

#!/bin/bash
#SBATCH --account=<account_name>
#SBATCH --partition=l40s_normal_q
#SBATCH --time=1-0:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:l40s:2
#SBATCH --output=gpt-oss-120b.log

module load vLLM

vllm serve /common/data/models/openai--gpt-oss-120b \
--served-model-name gpt-oss-120b \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--max-seq-len-to-capture 32768 \
--swap-space 16 \
--port 8000 \
--api-key a3b91d38-6c74-4e56-b89f-3b2cfd728d1a
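The API key in the script above is an example value and should be replaced with your own. The example key has the shape of a UUID, so one way to produce a fresh key of the same shape is Python's standard uuid module. This is a minimal sketch, assuming the server treats the key as an opaque string; the function name new_api_key is hypothetical.

```python
import uuid


def new_api_key() -> str:
    """Return a random UUID4 string, e.g. for use as the --api-key value.

    Assumption: the key is an opaque string, so any hard-to-guess
    value would work equally well.
    """
    return str(uuid.uuid4())


if __name__ == "__main__":
    # Print a fresh key to paste into the Slurm script.
    print(new_api_key())
```

Keep the key private: anyone on the cluster who knows the node, port, and key can query your instance.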

Run the Slurm script using sbatch myscript.sh and monitor the status of the job using squeue. Once the job starts running, allow a few minutes for the model to spin up. When the endpoint is ready, the log file will show the following lines:

INFO:     Started server process
INFO:     Waiting for application startup.
INFO:     Application startup complete.
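Rather than checking the log by hand, you could poll it for the startup message shown above. The following is a rough sketch, not an ARC-provided tool: the function names, the timeout, and the polling interval are all assumptions, and the log path must match the --output value in your Slurm script.

```python
import time
from pathlib import Path

# Marker line taken from the log excerpt above.
READY_MARKER = "Application startup complete."


def is_ready(log_text: str) -> bool:
    """Return True if the log text contains the startup-complete marker."""
    return READY_MARKER in log_text


def wait_until_ready(log_path, timeout_s: float = 1800.0, poll_s: float = 10.0) -> bool:
    """Poll the log file until the marker appears or the timeout expires.

    Returns True if the marker was found, False on timeout.
    The default timeout and interval are arbitrary assumptions.
    """
    path = Path(log_path)
    deadline = time.monotonic() + timeout_s
    while True:
        if path.exists() and is_ready(path.read_text(errors="replace")):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

For the example job, a call such as wait_until_ready("gpt-oss-120b.log") would return once the server reports that startup is complete.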

At this point, you can use the OpenAI-compatible REST API to submit queries. Note the compute node where the instance is running (see squeue). In this example, the node is fal036, the port is 8000, and the API key is a3b91d38-6c74-4e56-b89f-3b2cfd728d1a. Therefore, the following query works from anywhere within the Falcon cluster.

curl -v http://fal036:8000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer a3b91d38-6c74-4e56-b89f-3b2cfd728d1a" \
  -d '{
    "prompt": "This is a cake recipe:\n\n1.",
    "max_tokens": 200,
    "temperature": 1,
    "top_p": 0.9,
    "seed": 10
  }'
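The same request can be issued from Python with only the standard library. The sketch below mirrors the curl command above (same path, headers, and sampling parameters); the function names are hypothetical, and the node, port, and key must be replaced with the values for your own job.

```python
import json
import urllib.request


def build_completion_request(node, port, api_key, prompt, max_tokens=200):
    """Build a POST request for the /v1/completions endpoint.

    Mirrors the curl example: JSON body plus a Bearer token header.
    """
    url = f"http://{node}:{port}/v1/completions"
    payload = {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 1,
        "top_p": 0.9,
        "seed": 10,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request, timeout_s=120):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.load(resp)
```

For the example job, send(build_completion_request("fal036", 8000, "a3b91d38-6c74-4e56-b89f-3b2cfd728d1a", "This is a cake recipe:\n\n1.")) returns the parsed response, with the generated text under ["choices"][0]["text"].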

If you wish to connect software running on your computer to the LLM running on a compute node of the cluster, you must use SSH port forwarding to redirect network traffic from your computer to the compute node via the login node. For example:

ssh -N -L 8000:fal036:8000 user@falcon2.arc.vt.edu

At this point, you can use the OpenAI-compatible REST API to submit queries via localhost on your computer.

curl -v http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer a3b91d38-6c74-4e56-b89f-3b2cfd728d1a" \
  -d '{
    "prompt": "This is a cake recipe:\n\n1.",
    "max_tokens": 200,
    "temperature": 1,
    "top_p": 0.9,
    "seed": 10
  }'
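If a query through the tunnel fails, a quick first check is whether anything is listening on the forwarded local port at all. The following sketch uses only Python's socket module; the function name and the default timeout are assumptions. It confirms that the port accepts TCP connections, not that the model has finished loading.

```python
import socket


def port_is_open(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

With the SSH command above running, port_is_open("localhost", 8000) should return True. A False result usually means the tunnel is not active or a different local port was chosen.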

The following Python script is a chat completion example using the openai package. Run it from your computer while the SSH port forwarding is active:

import argparse

from openai import OpenAI

# Modify OpenAI's API key and API base to use the server.
openai_api_key = "a3b91d38-6c74-4e56-b89f-3b2cfd728d1a"
openai_api_base = "http://localhost:8000/v1"

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Virginia Tech known for?"},
]


def parse_args():
    parser = argparse.ArgumentParser(description="Client for API server")
    parser.add_argument(
        "--stream", action="store_true", help="Enable streaming response"
    )
    return parser.parse_args()


def main(args):
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    # Ask the server which models it serves and use the first one.
    models = client.models.list()
    model = models.data[0].id

    # Chat Completion API
    chat_completion = client.chat.completions.create(
        messages=messages,
        model=model,
        stream=args.stream,
    )

    print("-" * 50)
    print("Chat completion results:")
    if args.stream:
        for c in chat_completion:
            print(c)
    else:
        print(chat_completion)
    print("-" * 50)


if __name__ == "__main__":
    args = parse_args()
    main(args)

More usage examples with the OpenAI API (multimodal, reasoning, embeddings, tools, etc.)