I used LLMUnity, which incorporates llama.cpp.
CakeMix uses a fixed internal context size of 4096 tokens. That isn’t user-editable because, in my testing, pushing it higher doesn’t reliably improve results and can actively hurt coherence or performance depending on the model and quantization. It’s the sweet spot for this game.
For the kinds of models most people run with CakeMix (roughly 6B–13B):
- 4096 tokens is stable on 8 GB GPUs with some headroom
- 8192 can work in some setups, but leaves very little safety margin and tends to increase latency and drift
- 16384+ isn’t realistic on 8 GB cards
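To see why those numbers fall where they do, here's a rough back-of-the-envelope KV-cache estimate. The architecture numbers are assumptions for a Llama-2-7B-class model (32 layers, 32 KV heads, head dim 128, fp16 cache); GQA models keep fewer KV heads and need proportionally less, so treat this as a sketch, not a measurement:

```python
# Rough KV-cache memory estimate per context size.
# Assumed architecture (Llama-2-7B-like): 32 layers, 32 KV heads,
# head dim 128, fp16 cache (2 bytes per value). These are illustrative
# numbers, not CakeMix's actual models.

def kv_cache_bytes(ctx_tokens, layers=32, kv_heads=32, head_dim=128, bytes_per_val=2):
    # 2x for the separate key and value tensors
    return 2 * layers * kv_heads * head_dim * bytes_per_val * ctx_tokens

for ctx in (4096, 8192, 16384):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>6} tokens -> ~{gib:.1f} GiB KV cache")
```

With a Q4-quantized 7B model taking roughly 4 GB of VRAM for weights, a ~2 GiB cache at 4096 tokens leaves headroom on an 8 GB card, ~4 GiB at 8192 is already tight, and ~8 GiB at 16384 doesn't fit at all.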
The character limit when typing to the AI isn’t a hard model limit; it’s a soft UI cap. Before I added the counter, people were pasting in full paragraphs, which could confuse the AI. The counter is there as a reminder to keep things concise, but you can type more if you want.
So there’s no way for a user to change the context size, but you can change various other settings for each LLM in its config .ini file if you like:
[LLMCharacter]
; If deleted, this file will be created with default values.
; maximum number of tokens that the LLM will predict (-1 = infinity). Note: Values lower than 512 may result in replies being cut off.
numPredict = 1024
; slot of the server to use for computation (affects caching)
slot = 0
; grammar file used for the LLMCharacter (.gbnf format)
grammar =
; grammar file used for the LLMCharacter (.json format)
grammarJSON =
; seed for reproducibility (-1 = no reproducibility).
seed = -1
; LLM temperature; lower values give more deterministic answers.
temperature = 0.8
; Top-k sampling: higher = more diverse, lower = focused. 0 = disabled.
topK = 40
; Top-p sampling: cumulative probability cutoff. 1.0 = disabled.
topP = 0.95
; Minimum probability for token to be used.
minP = 0.05
; Penalty for repeated tokens.
repeatPenalty = 1.1
; Penalty based on token presence in previous responses. (0.0 = disabled)
presencePenalty = 0.7
; Penalty based on token frequency in previous responses. (0.0 = disabled)
frequencyPenalty = 0.5
; Locally typical sampling. 1.0 = disabled.
typicalP = 1
; Last n tokens to consider for repeat penalty. 0 = disabled.
repeatLastN = 64
; Penalise newline tokens during repeat penalty.
penalizeNl = True
; Prompt used for penalty evaluation. Null/empty = original prompt.
penaltyPrompt =
; Mirostat sampling: 0 = off, 1 = Mirostat, 2 = Mirostat 2.0
mirostat = 0
; Mirostat target entropy (tau)
mirostatTau = 5
; Mirostat learning rate (eta)
mirostatEta = 0.1
; If > 0, response includes top N token probabilities.
nProbs = 0
; Ignore end-of-stream token and keep generating.
ignoreEos = False
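If you want to sanity-check an edited config before launching the game, a file with that layout can be read with Python's standard configparser. The key names match the file above, but how CakeMix/LLMUnity actually parses them internally is an assumption; this is just an illustrative reader with the same defaults:

```python
# Hypothetical sketch: load a per-character LLMCharacter config file.
# Key names and defaults mirror the .ini shown above; the game's real
# parsing code (C#/LLMUnity) is not shown here, so this is only a
# convenient way to inspect/validate a file.
import configparser

def load_character_settings(path):
    cfg = configparser.ConfigParser()
    cfg.read(path)
    section = cfg["LLMCharacter"]
    return {
        # fallback values match the defaults listed in the file above
        "numPredict": section.getint("numPredict", fallback=1024),
        "temperature": section.getfloat("temperature", fallback=0.8),
        "topK": section.getint("topK", fallback=40),
        "topP": section.getfloat("topP", fallback=0.95),
        "minP": section.getfloat("minP", fallback=0.05),
        "repeatPenalty": section.getfloat("repeatPenalty", fallback=1.1),
        "ignoreEos": section.getboolean("ignoreEos", fallback=False),
    }
```

Missing keys fall back to the defaults, matching the file's own note that a deleted file is recreated with default values.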