I have been playing around with Koboldcpp for writing stories and chats. It’s really easy to set up and run compared to KoboldAI. The best part is that it runs locally and, depending on the model, uncensored. The only downsides are the memory requirements of some models and the generation speed, which is around 65 s per generation with an 8 GB model. You can use the included UI for stories or chats, or connect it to Tavern AI for a Character AI-like experience. This thread could be for model recommendations and character/story sharing.

Setting up Koboldcpp:

  1. Download Koboldcpp and put the .exe in its own folder to keep things organized.

  2. Download a ggml model and put the .bin in the same folder as Koboldcpp. I’ve used gpt4-x-alpaca-native-13B-ggml the most for stories, but you can find other ggml models at Hugging Face. Generally, the bigger the model, the slower but better the responses are.

  3. Open Koboldcpp and, if you have a GPU, select CLBlast GPU #1 for faster generation. If it crashes during the first generation, relaunch and leave it on OpenBLAS. Click Launch and select the ggml model. After a little while it will open a new browser tab with the UI.
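
Once the UI tab is open, the same local server can also be scripted, which is how front ends like Tavern AI talk to it. Here is a minimal Python sketch, assuming Koboldcpp is serving its KoboldAI-compatible API at http://localhost:5001 (check the console window for the address it actually prints). The helper names `build_request`, `extract_text` and `generate` are my own, not part of Koboldcpp.

```python
import json
import urllib.request

# Assumed default address; use whatever the Koboldcpp console reports.
API_URL = "http://localhost:5001/api/v1/generate"


def build_request(prompt, max_length=80, url=API_URL):
    """Build a POST request carrying the prompt as JSON."""
    payload = {"prompt": prompt, "max_length": max_length}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body):
    """Pull the generated text out of a KoboldAI-style JSON response."""
    return json.loads(response_body)["results"][0]["text"]


def generate(prompt, max_length=80):
    """Send the prompt to the local server and return the continuation."""
    with urllib.request.urlopen(build_request(prompt, max_length)) as resp:
        return extract_text(resp.read())
```

With Koboldcpp running, calling `generate("Once upon a time")` should return the model's continuation as a string. Expect it to take about as long as a generation in the UI.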


  • Memory and Author’s Note are important for coherent stories. Memory is for things that the AI should always remember, like character descriptions and places. Author’s Note is for directing the AI, such as setting the theme and story direction.

  • You can select presets under Settings to change how the AI reacts. I personally use Godlike for better descriptions.

  • In Scenarios there are some built-in ones, but you can also import scenarios from aetherroom.club
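
To make the Memory and Author’s Note tip more concrete, here is a rough Python sketch of how the two could be combined with the story text before it goes to the model. This is only an illustration of the idea, not Koboldcpp’s actual code: the function name `build_context`, the bracket format, and the default depth of 3 lines are all assumptions on my part.

```python
def build_context(memory, story, authors_note, depth=3):
    """Combine memory, story text and author's note into one prompt.

    Memory goes at the very top so it is always in view; the author's
    note is slipped in `depth` lines before the end so it sits close to
    the text the model is about to continue. (Illustrative only.)
    """
    lines = story.split("\n")
    if authors_note:
        # Clamp to 0 so short stories still get the note at the start.
        insert_at = max(len(lines) - depth, 0)
        lines.insert(insert_at, f"[Author's note: {authors_note}]")
    if memory:
        lines.insert(0, memory)
    return "\n".join(lines)
```

The design idea is that Memory holds stable facts (who the characters are, where they are), while the Author’s Note sits near the end of the prompt where it has the most influence on the next few sentences.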

That is all, have fun with your own AI!

I will later on make a Tavern AI tutorial.


I will have to install this sometime. I have KoboldAI installed, but I’m a bit bothered that it is written in Python instead of a non-prototyping language, so the upgrade would be nice.
I should probably also benchmark it just to see if it is faster.

I’m staying away from this until they find a way to make it more efficient. My computer can’t even run this in its current form.

It isn’t as fast because it only uses the CPU to generate. The prompt ingestion can run on either the CPU or the GPU, but that only shaves off a few seconds. Koboldcpp is kind of like a Nintendo Switch: it isn’t the fastest, but it is small and portable.

It isn’t that hard to run; it can even run on smartphones. You just need a small model (less than 6B) with 4-bit quantization. Altogether it should need around 5 GB of storage and RAM if you’re using OpenBLAS.
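
For a rough idea of where numbers like that come from, here is a back-of-the-envelope Python sketch: the weights take roughly (parameters × bits per weight ÷ 8) bytes, plus some extra for the context buffer and quantization bookkeeping. The 1 GB overhead is a guess of mine, not a measured value, and `estimate_model_gb` is just a helper name I made up.

```python
def estimate_model_gb(n_params, bits_per_weight=4, overhead_gb=1.0):
    """Very rough size estimate for a quantized model, in GB.

    n_params        -- number of parameters, e.g. 6e9 for a 6B model
    bits_per_weight -- 4 for 4-bit quantization
    overhead_gb     -- assumed allowance for context and scratch buffers
    """
    weight_bytes = n_params * bits_per_weight / 8
    return weight_bytes / 1e9 + overhead_gb


if __name__ == "__main__":
    for name, params in [("6B", 6e9), ("13B", 13e9)]:
        print(f"{name}: about {estimate_model_gb(params):.1f} GB")
```

By this estimate a 4-bit 6B model lands at about 4 GB and a 4-bit 13B model at about 7.5 GB, which is in the same ballpark as the figures mentioned in this thread. Real file sizes vary with the exact quantization format.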