This post will explore setting up Ollama to run a text-based chat on one of our new GPU-backed VPSs. Ollama is a free, open-source tool that simplifies running large language models (LLMs) locally on your computer.
What is an LLM?
An LLM is a complex neural network trained on massive datasets of text and code. This training lets the model learn patterns, relationships, and nuances in language, enabling it to perform a variety of tasks like translating languages, generating text, and answering questions in an informative way.

The “Large” in LLM refers to the sheer size of these models, specifically the number of parameters: the adjustable variables within the neural network that determine its ability to learn. More parameters generally mean a more complex and more capable network, able to capture subtler linguistic patterns. These models do not just memorize text; they develop a statistical understanding of language, allowing them to generalize to new situations and produce original content.
The best-known LLM services are backed by large server farms, but many models publish their weights, which can be downloaded and run locally, using a video card (GPU) to improve performance.
They can be powerful tools for automating tasks that traditionally required human language skills. There is ongoing research focusing on improving their accuracy, reducing bias, and expanding their capabilities. While they do have limitations, they represent a significant advancement in the field of Artificial Intelligence and are rapidly changing how we interact with technology.
Setup Requirements
To try out the examples here, we started with a Debian 12 VPS, with an Nvidia A10 GPU. You can order your own VPS from our site now, if you want to follow along. See https://rimuhosting.com/order/v2orderstart.jsp#variable_plan

LLM models and the system libraries may need a fair amount of disk space. We recommend allocating at least 8G RAM and 30G disk to your order.
Ideally, a GPU should have at least as much video memory (VRAM) as the size of the language model you want to run, so the whole model can be loaded at once. The Nvidia A10 GPU used for this document has 24G VRAM, so the majority of the language models offered via Ollama can be fully loaded into memory.
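Once the drivers are installed (see the next section), you can check the card's total and used VRAM with nvidia-smi's query options:
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv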
Setup Nvidia drivers and CUDA
Installing the cuda package will also set up the Nvidia driver needed by the Debian kernel.
wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt update
apt-get -y install cuda
At the time of writing, this will use about 10G of disk space. You might prefer to install just the Nvidia driver and the required CUDA packages more selectively.
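For example, Nvidia's repositories provide a cuda-drivers metapackage that pulls in just the driver stack. Exact package names can vary between repository versions, so check with apt-cache search cuda first:
apt-get install -y cuda-drivers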
You should reboot your VPS at this point to make sure the GPU drivers are loaded properly. Then verify the GPU is available by running nvidia-smi:
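nvidia-smi
If everything is working, the output will show the driver and CUDA versions in its header, along with the A10 GPU and its current memory usage.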
Setup Ollama
Ollama provides a command line interface to a number of publicly accessible models.
Ollama can be installed as described at https://ollama.ai/downloads but we recommend quickly checking the install script before running it.
curl -o install.sh https://ollama.ai/install.sh
less install.sh
sh install.sh
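Once the script completes, you can confirm the ollama binary is installed and its background service is running:
ollama --version
systemctl status ollama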
Ollama downloads the requested model weights from the internet the first time they are requested, saves them locally, and loads them into GPU memory when the model is run. After that, prompts and responses are processed entirely on your own server.
One of the benefits of Ollama is that it also exposes a local API, so you can easily integrate it into your own website or application using the development language of your choice.
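For instance, once the service is running and a model has been pulled (models are covered below), you can request a response over the local HTTP API on port 11434. Here is a minimal sketch using curl, assuming the neural-chat model from the examples further down:
curl http://localhost:11434/api/generate -d '{
  "model": "neural-chat",
  "prompt": "Why is the sky blue?",
  "stream": false
}'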
Which model should I use?
There are a few considerations when selecting which LLM to run.
- Smaller LLMs will often respond faster
- Larger LLMs can handle more complex tasks and may provide better interactions
- LLMs are often distributed under specific licenses. It is important to understand the terms of use before using a model, especially for commercial applications. Check the license details in the Ollama model library.
If you want to quickly check which models you already have available locally, run this:
ollama list
You can see a list of downloadable LLM models at https://ollama.com/search
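Models are downloaded automatically the first time you run them, but you can also fetch one ahead of time with ollama pull, which is handy for pre-seeding a server:
ollama pull neural-chat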
Start Experimenting
Example: a basic interactive chatbot
This model needs about 4G on disk. It is good for more formal conversations, and might be fun to use as a foundation for your own custom LLMs.
ollama run neural-chat
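The first time you run this, the model will be downloaded; after that it drops you straight into an interactive prompt. Ollama's chat interface supports a handful of slash commands, for example:
/?      list the available session commands
/bye    exit the chat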

Example: chat with a larger (smarter) interactive model
This model is larger at 17G on disk, though it has smaller variants. In our testing it was significantly better at carrying out conversations and figuring out how to be useful. It even handles bad grammar and spelling with grace.
ollama run gemma2:27b
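If your GPU has less VRAM, the same model family also comes in smaller tags, for example:
ollama run gemma2:9b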

Example: run a programming assistant
This model is great for natural-language-to-code chat and instruction. You might find it useful for generating or converting code in a number of well-known languages to implement specific functions.
ollama run codegemma:instruct
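You can also pass a prompt directly on the command line to get a single answer without starting an interactive session, for example:
ollama run codegemma:instruct "Write a bash one-liner that counts the lines in every .py file under the current directory"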

Beyond the Basics
LLM models can be fine-tuned for specific use cases. There are plenty of great guides online, but here are some ideas to flesh out what is possible …
- use a specific personality, such as your favourite superhero.
- behave in certain ways: pedantic, shy, silly, or speaking only in koans
- incorporate specialised data such as your offline documentation.
- focus responses on specific knowledge areas.
You may just be able to give an LLM instructions to behave a certain way.
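For a quick experiment along those lines, you can set a system prompt from inside an interactive session; the /set command is part of the Ollama chat interface, and the prompt here is just an illustration:
ollama run neural-chat
/set system "You are a grumpy pirate. Stay in character no matter what."
Note the second line is typed inside the session, not at the shell.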
Ollama also supports custom model configuration. Here is just a quick peek at how that can be done:
ollama show codegemma:instruct --modelfile > mymodel.modelfile
editor mymodel.modelfile
ollama create mymodel --file mymodel.modelfile
ollama run mymodel
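As a rough sketch, a modelfile that gives the model a personality might look something like this after editing (the SYSTEM prompt and temperature value here are purely illustrative):
FROM codegemma:instruct
SYSTEM You are a patient programming tutor who explains every answer step by step.
PARAMETER temperature 0.7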
There are also open-source tools that can be used to achieve similar results programmatically. One well-known option is the Hugging Face Transformers Python library.
Mixed media and other AI fun things
- Using Stable Diffusion to generate unique images: https://blog.rimuhosting.com/2023/06/23/stable-diffusion-rimuhosting-vm/
- Other interesting things to do with AI modeling: https://blog.rimuhosting.com/2023/04/28/gpu-options/