Using Hugging Face on ARC systems

Hugging Face is a platform and database for developers in the fields of artificial intelligence (AI), machine learning (ML), and data science. It provides tools for creating, training, and deploying ML and natural language processing (NLP) models, and is known for its library of open-source models. In this tutorial, we will learn the basics of using Hugging Face to deploy and run a pre-trained AI model on ARC.

Installing Hugging Face through a conda environment

There are several ways to install Hugging Face; in this tutorial we will use a conda environment. First, create your own custom conda environment on the partition you're planning to use for your computation. For more information on how to manage conda environments on ARC, please refer to this page Using Anaconda on ARC systems. Then activate that environment and install Hugging Face inside it as follows:

[jdoe2@tinkercliffs2 ~]$ source activate torch_env
(torch_env) [jdoe2@tinkercliffs2 ~]$ conda install -c huggingface transformers
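
To confirm that the installation succeeded, you can start Python inside the activated environment and import the library, for example:

import transformers

# Print the installed Transformers version
print(transformers.__version__)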

We installed the Transformers package - an integral part of Hugging Face - which is an open-source Python library that provides access to thousands of pre-trained models for deep learning. These models support common tasks in different modalities, such as natural language processing, computer vision, audio, and multi-modal applications. The library supports multiple deep learning frameworks, including PyTorch and TensorFlow, and provides an easy-to-use interface for training and fine-tuning models, making it a popular choice for developers and researchers working on AI projects.
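
As a quick illustration of that interface, the following minimal sketch builds a default sentiment-analysis pipeline and classifies a single sentence. This assumes PyTorch is available in your environment, as the torch_env name above suggests; with no model specified, the library downloads a small default English model on first use:

from transformers import pipeline

# Build a sentiment-analysis pipeline; with no model specified,
# the library downloads a default English model on first use
classifier = pipeline('sentiment-analysis')

result = classifier('Hugging Face makes working with pre-trained models easy.')
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]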

Setting Hugging Face environment variables

You can configure Hugging Face in your ARC account through environment variables. By default, Hugging Face downloads and stores its pre-trained models and datasets in your home directory under ~/.cache/huggingface/, which can consume tens or hundreds of gigabytes of your 640GB home directory quota, especially if you're using one of the Large Language Models (LLMs). A good use of these environment variables is therefore to control where Hugging Face caches its data. For example, you can set the path where Hugging Face caches its datasets using an environment variable called HF_DATASETS_CACHE. The following Python code illustrates how to set such a variable:

import os

# Set the environment variable HF_DATASETS_CACHE
os.environ['HF_DATASETS_CACHE'] = '/path/to/your/cache/directory'

# Verify the environment variable is set
print(f"HF_DATASETS_CACHE is set to: {os.environ['HF_DATASETS_CACHE']}")

Replace '/path/to/your/cache/directory' with the path to the directory where you want to cache your datasets. This sets HF_DATASETS_CACHE for the duration of the Python session; set it before importing the libraries that read it, and add the os.environ line to your script or application if you need the setting to persist. We recommend using one of the large ARC storage systems to accommodate your models and datasets, such as the long-term Project storage system, where you can have up to 25TB of free storage that you can share with your students or collaborators.

The most important variables for this purpose are:

TRANSFORMERS_CACHE: The path where pre-trained models are downloaded and locally cached.

HF_HOME: The path where Hugging Face data is locally stored. Defaults to "~/.cache/huggingface".

HF_HUB_CACHE: The path where repositories from the Hub (models, datasets, and spaces) are cached locally. Defaults to "$HF_HOME/hub" (i.e., "~/.cache/huggingface/hub" by default).

HF_TOKEN_PATH: The path to the file holding the User Access Token used to authenticate to the Hub.
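
Since the Hub cache and token path default to locations under HF_HOME, relocating HF_HOME alone is often the simplest way to move all Hugging Face data off your home directory at once. A minimal sketch, where /projects/mygroup/hf_cache is a hypothetical example of a Project storage path:

import os

# Relocate all Hugging Face data; set this before importing transformers
# or datasets ('/projects/mygroup/hf_cache' is a hypothetical example path)
os.environ['HF_HOME'] = '/projects/mygroup/hf_cache'

from transformers import AutoTokenizer  # models now cache under the new HF_HOME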

For a complete list of Hugging Face environment variables, please refer to the official page Environment variables - Hugging Face.

The following code prints the values of all Hugging Face environment variables that are currently set (those whose names begin with HF_; note that TRANSFORMERS_CACHE lacks this prefix and will not be listed):

import os

# Get all environment variables
env_vars = os.environ

# Print only Hugging Face related environment variables and their values
for var, value in env_vars.items():
    if var.startswith('HF_'):
        print(f"{var}: {value}")

Dealing with Access Tokens

Hugging Face uses access tokens to authenticate and authorize users when interacting with Hugging Face services, such as accessing private models and datasets or using the Hugging Face Hub API. You can generate an access token through the Access Tokens section of your Hugging Face account settings page.

You can set the generated token by either of the following methods:

  1. Add the following line to your ~/.bashrc:

export HF_TOKEN="hf_XXXXXXXXXXXXX" # replace hf_XXXXXXXXXXXXX with your own token
  2. In your Python script, set the token as follows:

import os

# Set the environment variable HF_TOKEN
os.environ['HF_TOKEN'] = 'hf_XXXXXXXXXXXXX' # replace hf_XXXXXXXXXXXXX with your own token

The first method can also be used to set other Hugging Face environment variables in your ~/.bashrc.
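
Alternatively, the huggingface_hub library, which is installed as a dependency of Transformers, provides a programmatic login. A minimal sketch:

from huggingface_hub import login

# Authenticate to the Hub for the current session
login(token='hf_XXXXXXXXXXXXX') # replace hf_XXXXXXXXXXXXX with your own token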

Downloading and using a pre-trained model

After installing and configuring Hugging Face on ARC, the following Python code shows how to download and use a pre-trained BERT model from Hugging Face. The code performs named entity recognition (NER) on a sample text using the dbmdz/bert-large-cased-finetuned-conll03-english model, a BERT model fine-tuned for NER:

import os
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

# Set the environment variable HF_TOKEN (not required for this public model,
# but needed when accessing gated or private models)
os.environ['HF_TOKEN'] = 'hf_XXXXXXXXXXXXX' # replace hf_XXXXXXXXXXXXX with your own token

# Load the pre-trained model and tokenizer using AutoTokenizer and
# AutoModelForTokenClassification
model_name = 'dbmdz/bert-large-cased-finetuned-conll03-english'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Define a sample text for NER
text = 'Hugging Face Inc. is a company based in New York City.'

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt")

# Perform inference to get the predictions
outputs = model(**inputs).logits

# Convert predictions to labels
predictions = outputs.argmax(dim=2)

# Decode the tokenized input back to original text and get labels
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'].squeeze())
labels = model.config.id2label

# Collect recognized entities
entities = []
for token, label_id in zip(tokens, predictions.squeeze().tolist()):
    label = labels[label_id]
    if label != 'O':  # Filter out non-entity tokens
        entities.append((token, label))

# Print the recognized entities
print("Recognized entities:")
for entity in entities:
    print(f"Token: {entity[0]}, Label: {entity[1]}")