Meta released Llama 2 in the summer of 2023. The new version of Llama is trained on 40% more tokens than the original Llama model, doubles its context length, and significantly outperforms other available open-source models. The fastest and easiest way to access Llama 2 is via an API through an online platform, but if you want the best experience, installing and loading Llama 2 directly on your computer is the way to go.

With that in mind, we’ve created a step-by-step guide on how to use Text-Generation-WebUI to load a quantized Llama 2 LLM locally on your computer.


Why Install Llama 2 Locally

There are many reasons why people choose to run Llama 2 directly. Some do it for privacy concerns, some for customization, and others for offline capabilities. If you’re researching, fine-tuning, or integrating Llama 2 into your projects, then accessing Llama 2 via API might not be for you. The point of running an LLM locally on your PC is to reduce reliance on third-party AI tools and use AI anytime, anywhere, without worrying about leaking potentially sensitive data to companies and other organizations.

With that said, let’s begin with the step-by-step guide to installing Llama 2 locally.


Step 1: Install the Visual Studio 2019 Build Tools

To simplify things, we will use a one-click installer for Text-Generation-WebUI (the program used to load Llama 2 with a GUI). However, for this installer to work, you need to download the Visual Studio 2019 Build Tools and install the necessary resources.

Download: Visual Studio 2019 (Free)

Run the installer, and when asked which workloads to install, check Desktop development with C++ and let the installation finish. Now that you have Desktop development with C++ installed, it’s time to download the Text-Generation-WebUI one-click installer.

Step 2: Install Text-Generation-WebUI

The Text-Generation-WebUI one-click installer is a script that automatically creates the required folders, sets up the Conda environment, and installs everything necessary to run an AI model.

To install the script, download the one-click installer by clicking on Code > Download ZIP.

Once downloaded, extract the ZIP file and run the start script for your operating system. During setup, the installer will ask what GPU hardware you have installed; select your GPU vendor, or choose the CPU-only option if you don’t have a dedicated graphics card.

Download: Text-Generation-WebUI Installer (Free)

Step 3: Download the Llama 2 Model

However, Text-Generation-WebUI is only a model loader. Let’s download Llama 2 for the model loader to launch.

There are quite a few things to consider when deciding which iteration of Llama 2 you need, including parameters, quantization, hardware optimization, size, and usage. All of this information is denoted in the model’s name.


Note that some models may be named differently and may not display the same types of information. However, this type of naming convention is fairly common in the HuggingFace model library, so it’s still worth understanding.

In this example, the model can be identified as a medium-sized Llama 2 model trained on 13 billion parameters, optimized for chat inference on a regular CPU.
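To make the convention concrete, here is one such model name broken down piece by piece. The filename below is an illustrative example; exact names vary between uploads:

```python
# Illustrative breakdown of a typical quantized model filename.
name = "llama-2-13b-chat.ggmlv3.q4_K_S.bin"
# llama-2  -> base model family (Llama 2)
# 13b      -> parameter count: 13 billion (a medium-sized model)
# chat     -> fine-tuned for chat/dialogue use
# ggmlv3   -> GGML format, meant for CPU inference
# q4_K_S   -> 4-bit quantization method (small, inference-friendly)
# .bin     -> the model weights file itself
```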


For those running on a dedicated GPU, choose a GPTQ model, while those using a CPU should choose GGML. If you want to chat with the model as you would with ChatGPT, choose chat, but if you want to experiment with the model’s full capabilities, use the standard model. As for parameters, know that bigger models provide better results at the expense of performance. I would personally recommend you start with a 7B model. As for quantization, use q4, as it’s sufficient if you’re only running inference.
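If you’re unsure whether your machine can handle a given size, a rough rule of thumb is about half a byte per parameter at 4-bit precision, plus some overhead. The numbers below are ballpark estimates, not measurements:

```python
# Back-of-envelope RAM estimate for a 4-bit (q4) quantized model.
# Assumes ~0.5 bytes per parameter plus ~20% overhead; actual usage
# varies with context length, loader, and quantization variant.
def approx_ram_gb(params_billion: float, bits: int = 4) -> float:
    bytes_per_param = bits / 8
    # 1 billion params * bytes per param = gigabytes
    return params_billion * bytes_per_param * 1.2

for size in (7, 13, 70):
    print(f"{size}B at q4 needs roughly {approx_ram_gb(size):.1f} GB")
```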

Download: GGML (Free)

Download: GPTQ (Free)

Now that you know what iteration of Llama 2 you need, go ahead and download the model you want.

In my case, since I’m running this on an ultrabook, I’ll be using a GGML model fine-tuned for chat, llama-2-7b-chat-ggmlv3.q4_K_S.bin.
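If you prefer scripting the download, the huggingface_hub Python package can fetch the file for you. This is a minimal sketch; the repo and file names below are what TheBloke’s GGML uploads typically look like, so double-check them against the actual model page:

```python
# Sketch: downloading a quantized model file from HuggingFace.
# Verify repo_id and filename on the model page before running.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGML",       # assumed repo name
    filename="llama-2-7b-chat.ggmlv3.q4_K_S.bin",  # assumed file name
)
print("Model saved to:", path)  # move this file into the models folder
```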

After the download is finished, place the model in text-generation-webui-main > models.
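Optionally, you can sanity-check the downloaded file outside the WebUI with the llama-cpp-python package, which loads GGML models directly. A minimal sketch, assuming the model path below matches where you saved the file:

```python
# Quick sanity check that the GGML file loads and generates text.
# pip install llama-cpp-python (use a GGML-era release from mid-2023;
# newer versions expect the GGUF format instead).
from llama_cpp import Llama

llm = Llama(
    model_path="text-generation-webui-main/models/llama-2-7b-chat-ggmlv3.q4_K_S.bin",
    n_ctx=2048,  # context window size
)
output = llm("Q: What is Llama 2? A:", max_tokens=64)
print(output["choices"][0]["text"])
```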

Now that you have your model downloaded and placed in the models folder, it’s time to configure the model loader.

Step 4: Configure Text-Generation-WebUI

Now, let’s begin the configuration phase. Launch Text-Generation-WebUI by running the start script inside the text-generation-webui-main folder, then open the local web address the terminal prints in your browser. In the web interface, head to the Model tab, select the model you placed in the models folder from the dropdown, and click Load. Once the model finishes loading, you can start prompting it from the text generation tab.

Congratulations, you’ve successfully loaded Llama 2 on your local computer!
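Once the model is loaded, you can also talk to it programmatically. Versions of Text-Generation-WebUI from this era shipped an optional API extension, enabled by launching with the --api flag; the endpoint and payload shape below match that extension but may differ in your version, so treat this as a sketch:

```python
# Sketch: querying a running Text-Generation-WebUI instance via its
# optional API extension (start the WebUI with the --api flag first).
# Endpoint, port, and response format vary between versions.
import json
import urllib.request

payload = {"prompt": "Q: What is Llama 2? A:", "max_new_tokens": 64}
request = urllib.request.Request(
    "http://localhost:5000/api/v1/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    result = json.load(response)
    print(result["results"][0]["text"])
```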

Try Out Other LLMs

Now that you know how to run Llama 2 directly on your computer using Text-Generation-WebUI, you should be able to run other LLMs besides Llama, too. Just remember the naming conventions of models, and that only quantized versions of models (usually q4 precision) can be loaded on regular PCs. Many quantized LLMs are available on HuggingFace. If you want to explore other models, search for TheBloke in HuggingFace’s model library, and you should find many models available.
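If you’d rather browse programmatically, the huggingface_hub package can list TheBloke’s quantized uploads for you. A small sketch, with the search term as an assumption you can change:

```python
# Sketch: listing quantized model uploads from TheBloke on HuggingFace.
from huggingface_hub import HfApi

api = HfApi()
# "GGML" filters for CPU-friendly quantized uploads; try "GPTQ" for GPU.
for model in api.list_models(author="TheBloke", search="GGML", limit=10):
    print(model.modelId)
```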