
Deploy llama-2 on AWS

This tutorial demonstrates how to use Walrus to deploy llama-2 on AWS with a CPU-only instance, and how to use it through a user-friendly web UI.

Prerequisites

To follow this tutorial, you will need:

  1. An AWS account with associated credentials, and sufficient permissions to create EC2 instances.
  2. Walrus installed.

Note: While running on CPU is cheaper than on GPU, it still incurs costs for the EC2 instance.
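
If you want to confirm your AWS credentials and permissions before starting, a quick sanity check with the AWS CLI (assuming it is installed and configured with the same credentials you will give Walrus) looks like this:

# verify the credentials resolve to the expected account and user
aws sts get-caller-identity
# verify you can query EC2 in the target region
aws ec2 describe-instances --region ap-northeast-1 --max-results 5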

The Simple Way

With Walrus, you can have a running llama-2 instance on AWS with a user-friendly web UI in about a minute. Just follow these steps:

Add the llama-2 Template

  1. Log in to Walrus, click on Operations in the left navigation, go to the Templates tab, and click the New Template button.
  2. Enter a template name, e.g., llama-2.
  3. In the source field, enter https://github.com/walrus-tutorials/llama2-on-aws.
  4. Click Save.

[Screenshot: llama2-add-template]
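
Optionally, if you'd like to review what the template provisions before using it, you can clone the template source and browse it locally:

# fetch the template source referenced above and list its contents
git clone https://github.com/walrus-tutorials/llama2-on-aws
ls llama2-on-aws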

Configure AWS Credentials

  1. In the left navigation, click on Operations and click the Connectors tab.
  2. Click the New Connector button and select the Cloud Provider type.
  3. Enter a connector name, e.g., aws.
  4. Choose Development for the Applicable Environment Type option.
  5. Choose AWS for the Type option.
  6. Select Tokyo (ap-northeast-1) for the Region option.
  7. Click Save.

Note: This region is specified because the subsequent steps use an AMI from it. If you want to deploy in a different region, you can copy the AMI to that region or refer to the section below on building the llama-2 image from scratch.
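
For example, one way to move the pre-built image to another region is to copy the AMI with the AWS CLI. This is a rough sketch: the AMI ID and destination region below are placeholders to replace with your own values.

# copy the pre-built llama-2 AMI from ap-northeast-1 into another region
# ami-xxxxxxxxxxxxxxxxx and eu-west-1 are placeholders
aws ec2 copy-image --source-region ap-northeast-1 --source-image-id ami-xxxxxxxxxxxxxxxxx --region eu-west-1 --name llama-2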

[Screenshot: llama2-add-connector]

Configure Environment

  1. In the left navigation, click Applications, then select the default project from the breadcrumb at the top.
  2. In the Environments tab, click the New Environment button.
  3. Enter an environment name, e.g., dev.
  4. Choose Development for the Environment Type option.
  5. Click the Add Connector button and select the aws connector created in the previous step.
  6. Click Save.

[Screenshot: llama2-add-environment]

Create the llama-2 Service

  1. In the Environments tab, click on the name of the dev environment to enter its view.
  2. Click the New Resource button.
  3. Enter a resource name, e.g., my-llama-2.
  4. Enable the Use Template option.
  5. Choose llama-2 in the Template option.
  6. Click Save.

Note: The default configuration assumes your AWS account has a default VPC in the corresponding region. If you don't have a default VPC, create a VPC and associate a subnet and a security group with it in the AWS VPC console. The security group must allow inbound TCP traffic on port 7860 (used to access the llama-2 web UI). You can then set your VPC name and security group name in the service configuration.
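
As a rough sketch of that setup with the AWS CLI (the VPC and security group IDs are placeholders, and 0.0.0.0/0 opens the port to everyone, so restrict the CIDR range if you can):

# create a default VPC if your account does not have one
aws ec2 create-default-vpc --region ap-northeast-1
# create a security group in your VPC (vpc-xxxxxxxx is a placeholder)
aws ec2 create-security-group --group-name llama-2 --description "llama-2 web UI" --vpc-id vpc-xxxxxxxx --region ap-northeast-1
# allow inbound TCP 7860 on that group; sg-xxxxxxxx is the ID returned by the previous command
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port 7860 --cidr 0.0.0.0/0 --region ap-northeast-1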

Accessing the llama-2 Web UI

You can see the deployment and running status of the llama-2 service on its details page. Once the deployment completes, you can access the web UI by clicking the resource's access link in the Walrus UI.

[Screenshot: llama2-resource-detail]

[Screenshot: llama2-webui]
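
If the access link does not open right away, you can check from a terminal whether the web UI port is responding yet (replace the placeholder with your instance's public IP address):

# the web UI listens on TCP port 7860; an HTTP response means it is up
curl -I http://<instance-public-ip>:7860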

Deep Dive: Building the llama-2 Image from Scratch

The above instructions utilized a pre-built llama-2 image. This approach saves time as you don't need to download the large language model (often with a significant file size) or build the inference service when creating a new llama-2 instance. This section explains how such a llama-2 image is built.

You can find the complete build process in the walrus-tutorials/llama2-on-aws repository.

Key steps include:

# get text-generation-webui
git clone https://github.com/oobabooga/text-generation-webui && cd text-generation-webui
# configure text-generation-webui: link the Docker build files into the project root
ln -s docker/{Dockerfile,docker-compose.yml,.dockerignore} .
cp docker/.env.example .env
# point the web UI at the quantized model file and make it listen on all interfaces
sed -i '/^CLI_ARGS=/s/.*/CLI_ARGS=--model llama-2-7b-chat.ggmlv3.q4_K_M.bin --wbits 4 --listen --auto-devices/' .env
# remove the GPU "deploy" section (from "deploy:" to the end of the file) so the container runs on CPU
sed -i '/^\s*deploy:/,$d' docker/docker-compose.yml
# get quantized llama-2
curl -L https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_K_M.bin --output ./models/llama-2-7b-chat.ggmlv3.q4_K_M.bin
# build and run
docker compose up --build

In essence, this process downloads the quantized llama-2-7b-chat model, then builds and utilizes text-generation-webui to launch the llama-2 service.
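
Once the service has been built and verified on an EC2 instance in this way, the result can be captured as a reusable machine image. A minimal sketch with the AWS CLI follows; the instance ID is a placeholder, and this is one possible approach rather than the exact pipeline used for the pre-built image.

# bake the configured instance into an AMI that new llama-2 services can boot from
aws ec2 create-image --instance-id i-xxxxxxxxxxxxxxxxx --name llama-2-cpu --description "llama-2 with text-generation-webui"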