Request Tool KoboldAI + Erebus Model for text-based adventure

smirk

Member
Jul 2, 2019
145
197
Have you guys tried generating adventures with Meta's leaked LLaMA models through llama.cpp? I started last week and am having a blast.
I've tried Alpaca and Vicuna. Neither was great in Adventure mode, but Alpaca seemed to work well enough for story writing. It needed a lot of hand-holding when it came to X-rated stuff (which is understandable, since it wasn't trained on that material).
 

rance4747

Member
May 25, 2018
235
538
I've tried Alpaca and Vicuna. Neither was great in Adventure mode, but Alpaca seemed to work well enough for story writing. It needed a lot of hand-holding when it came to X-rated stuff (which is understandable, since it wasn't trained on that material).
Try this
It's a 4-bit version of a Pygmalion 7B fork made specifically for adventuring. Fair warning though, I had to edit webui.py manually (line 163: run_cmd("python server.py --chat --model-menu --model_type llama")) to get it to work, and I used oobabooga instead of Kobold, since it's a bit less buggy for me. You can still use Ooba with Tavern like you would Kobold; you just need to check "api" in Ooba's parameters and click relaunch to get the Tavern link. Best part: it only needs about 8GB of VRAM to run locally.
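For reference, the webui.py change is just appending --model_type llama to the launch command, something like this (the line number and the original command may differ between oobabooga releases, so treat it as a sketch):

# webui.py, around line 163 of the one-click installer
# before (roughly): run_cmd("python server.py --chat --model-menu")
run_cmd("python server.py --chat --model-menu --model_type llama")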
 
  • Like
Reactions: HomeInvasion04

Lakius

Member
Mar 22, 2019
158
650
I've had some pretty great results with a merge of Pygmalion and Vicuna I found through the KoboldAI group. I've only tried it in the NSFW story context so far.

It uses just shy of 16GB of VRAM (~14GB), so it fits entirely in VRAM on a lot of cards, which means pretty damn fast outputs (15-20 tokens per second for me). I didn't have to do anything crazy to get KoboldAI to recognize it; I just had to make a folder for the model in the right spot. It's more prone to re-using phrases, but honestly it's as good as or better than Erebus 13B: the lows aren't nearly as bad (fewer of those incoherent freak-outs that no non-schizophrenic author would write) and it doesn't chew up all of my VRAM.
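In case anyone wants to copy the folder trick: I just made a subfolder under KoboldAI's models directory, dropped the model files in, and loaded it from there. Roughly like this (the folder name is just whatever I called mine, and your file list may differ):

KoboldAI/
  models/
    Pygmalion-Vicuna-13B/          <- any name works, pick it from the load menu
      config.json
      tokenizer files
      model weights (.bin or .safetensors)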

The phrase re-use is standard AI stuff, honestly. Same with girls suddenly calling their partner "sir", or a typo in someone's name turning them into a different person as far as the AI is concerned, or missing a tag on a world entry so the AI can't identify or remember who this "captain" is.

The only downside to this model is that it sometimes generates just a few words of story instead of the full output length specified. I like starting sentences for the AI to finish, and sometimes it finishes those and goes "Right, my job here is done" after like 3 words and a period. Given how fast it generates, though, clicking submit again ain't bad.
 

Lakius

Member
Mar 22, 2019
158
650
Oh, and there's a KoboldAI dev branch with a new UI that debuted about 5 months ago. If you're writing on PC, the new experience is a lot, lot better, though they took out the scroll bar and that kinda sucks. You will need to force the install_requirements.bat script to delete its dependencies and redownload them to be sure you can update to the new branch with the update-koboldai.bat script, as PyTorch doesn't update properly from some older releases.
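If it helps, the sequence I ended up with was roughly this, run from the KoboldAI folder (the exact menu prompts may be worded differently depending on your install, so treat it as a sketch):

install_requirements.bat    <- pick the option that wipes and redownloads the runtime, so PyTorch gets pulled fresh
update-koboldai.bat         <- choose the development/United branch if it asks which version you want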
 
  • Like
Reactions: Dir.Fred

Gluttonous

Newbie
Feb 18, 2018
47
12
I've tried Alpaca and Vicuna. Neither was great in Adventure mode, but Alpaca seemed to work well enough for story writing. It needed a lot of hand-holding when it came to X-rated stuff (which is understandable, since it wasn't trained on that material).
If your video memory is around 11GB, I recommend using KoboldAI's 4-bit branch to try this model. Both adventure mode and chat mode are excellent, and the NSFW side is great too.
 
  • Like
Reactions: Lakius

Lakius

Member
Mar 22, 2019
158
650
If your video memory is around 11GB, I recommend using KoboldAI's 4-bit branch to try this model. Both adventure mode and chat mode are excellent, and the NSFW side is great too.
4-bit models are great! Chronos-Hermes-13B in 4-bit fits under 10GB total for me, and it's my new favorite by a long shot. Unquantized, a 13B model normally overflows even 20GB of VRAM. I use Occam's KoboldAI fork (I haven't seen any missing features compared to main), but there are others out there.
 

Lakius

Member
Mar 22, 2019
158
650
There are also some 8K-context models hitting the streets of Hugging Face lately. I'll be very interested to see how those turn out... needing to sacrifice either character complexity, scene complexity, or world context is a pain.
 

Lakius

Member
Mar 22, 2019
158
650
There are also some 8K-context models hitting the streets of Hugging Face lately. I'll be very interested to see how those turn out... needing to sacrifice either character complexity, scene complexity, or world context is a pain.
So I tried some 8K SuperHOT merges out... and either the hoops I had to jump through to get them working didn't work, or merging SuperHOT's 8K into existing models just isn't good yet. They're less coherent overall, and it gets worse as the context grows, though not nearly as "AI screaming in pain" as trying to stretch a 2K-context model. Give it another month. KoboldAI isn't ready for it; Oobabooga is, though the Ooba UI is unbearable copycat shite and not meant for stories.

I'd like to give an honest shoutout to SillyTavern, which is quite good for chatting and requires a backend like Ooba or KoboldAI. It's a fun side trip to try out models in a more conversational context. With character cards! It's like hopping into a curated adventure with a character, and some are places or groups instead. This reads like an ad, sure, but I genuinely enjoyed a change of pace.

SillyTavern:

Character cards:


As for 4-bit... if you've got 24GB of VRAM, you can fit a 4-bit 33B model with 2K context, provided you don't run anything else. Airoboros blows Erebus out of the water, no competition. Airoboros 33B is consistently competent, and I find myself adding flavor to its output instead of just deleting sentences. Every other output I'll correct a weirdly worded line, or add a clarification to help steer, maintain a character's image, and stop them from monologuing. I'd never go back to any of the base models in KoboldAI's AI picker; they're quite dated and too much of a chore by comparison. There's only one reason not to use 4-bit GPTQ, and that's to use GGML models designed for CPU and Apple silicon (or training, or merging, but let's be real, if you were doing that you'd be on a different site). For creative writing, the larger models feel like an editor was involved and the writer was focused, whereas some of the smaller models feel like a first draft written on a bus, where every page came from an entirely different mindset. Yes, you can still get sparks of truly wonderful writing from smaller models.
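If you want a rough idea of whether a model fits before downloading it, the back-of-the-envelope math I use is just weights plus KV cache plus some overhead. Quick sketch, with ballpark assumptions rather than measurements:

# rough 4-bit GPTQ VRAM estimate; kv_cache_gb is roughly the fp16 cache for 2K context
# (group size, act-order and the backend all shift these numbers around a bit)
def vram_estimate_gb(params_billions, kv_cache_gb, overhead_gb=1.5):
    weights_gb = params_billions * 0.5   # 4 bits = half a byte per parameter
    return weights_gb + kv_cache_gb + overhead_gb

print(vram_estimate_gb(33, kv_cache_gb=3.0))   # ~21 GB -> squeezes onto a 24GB card
print(vram_estimate_gb(13, kv_cache_gb=1.7))   # ~9.7 GB -> fine on 12GB, tight on 10GB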

I've heard good things about Airoboros 13B as well, if you're a little more VRAM-limited. I suspect it will fit on a 12GB card, overflow a bit but still be usable on 10GB, and probably not be great on 8GB. Try out a bunch of models; they each have their own flavour you may or may not like.

Airoboros 33B:

Airoboros 13B:


Between larger and smaller models, your mileage may vary, but I find you either go with a smaller, faster model, resubmit a bunch, and edit down to keep what you like, or you take a bigger, slower model and resubmit less. Don't get me wrong, there are still some issues, but by my standards Erebus 6.7B is a 5-7/10 (now that the rose-tinted honeymoon glasses are off) and Airoboros 33B is a solid 8/10. Hard to believe how much has changed for me and the software involved in half a year. A 33B model used to require something like 100GB of VRAM to run.

With everyone going hardcore SFW and locking down their models to the detriment of flexibility, I'm so glad this stuff exists. Even distinctly non-smut writing prompts often get the "As an AI Language Model..." bullshit if there's a hint of violence, gore, or sex involved. Or your output just gets rejected and deleted.
 

Lakius

Member
Mar 22, 2019
158
650
Make sure you don't update beyond NVIDIA driver 531.79 if you're running a model that's big compared to your GPU's VRAM. They fucked up VRAM handling big time: as you near the limit, the driver aggressively offloads VRAM to shared system memory, which essentially acts like a pagefile for the GPU, letting you avoid crashes but fucking you over on speed. If you were only safely under the limit by a smidge, you're screwed on the newer drivers.
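If you're not sure which driver you're on, a quick check from a terminal (the driver version is printed in the header of the output):

nvidia-smi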

Keep your eyes peeled for Stable Diffusion in the patch notes if you want to know when to upgrade; that crowd is a bigger market for NVIDIA, has the same issue, and will be the indicator of when this is properly fixed.
 
  • Like
Reactions: Dir.Fred

Gluttonous

Newbie
Feb 18, 2018
47
12
So I tried some 8K SuperHOT merges out... and either the hoops I had to jump through to get them working didn't work, or merging SuperHOT's 8K into existing models just isn't good yet. They're less coherent overall, and it gets worse as the context grows, though not nearly as "AI screaming in pain" as trying to stretch a 2K-context model. Give it another month. KoboldAI isn't ready for it; Oobabooga is, though the Ooba UI is unbearable copycat shite and not meant for stories.

I'd like to give an honest shoutout to SillyTavern, which is quite good for chatting and requires a backend like Ooba or KoboldAI. It's a fun side trip to try out models in a more conversational context. With character cards! It's like hopping into a curated adventure with a character, and some are places or groups instead. This reads like an ad, sure, but I genuinely enjoyed a change of pace.

SillyTavern:

Character cards:


As for 4-bit... if you've got 24GB of VRAM, you can fit a 4-bit 33B model with 2K context, provided you don't run anything else. Airoboros blows Erebus out of the water, no competition. Airoboros 33B is consistently competent, and I find myself adding flavor to its output instead of just deleting sentences. Every other output I'll correct a weirdly worded line, or add a clarification to help steer, maintain a character's image, and stop them from monologuing. I'd never go back to any of the base models in KoboldAI's AI picker; they're quite dated and too much of a chore by comparison. There's only one reason not to use 4-bit GPTQ, and that's to use GGML models designed for CPU and Apple silicon (or training, or merging, but let's be real, if you were doing that you'd be on a different site). For creative writing, the larger models feel like an editor was involved and the writer was focused, whereas some of the smaller models feel like a first draft written on a bus, where every page came from an entirely different mindset. Yes, you can still get sparks of truly wonderful writing from smaller models.

I've heard good things about Airoboros 13B as well, if you're a little more VRAM-limited. I suspect it will fit on a 12GB card, overflow a bit but still be usable on 10GB, and probably not be great on 8GB. Try out a bunch of models; they each have their own flavour you may or may not like.

Airoboros 33B:

Airoboros 13B:


Between larger and smaller models, your mileage may vary, but I find you either go with a smaller, faster model, resubmit a bunch, and edit down to keep what you like, or you take a bigger, slower model and resubmit less. Don't get me wrong, there are still some issues, but by my standards Erebus 6.7B is a 5-7/10 (now that the rose-tinted honeymoon glasses are off) and Airoboros 33B is a solid 8/10. Hard to believe how much has changed for me and the software involved in half a year. A 33B model used to require something like 100GB of VRAM to run.

With everyone going hardcore SFW and locking down their models to the detriment of flexibility, I'm so glad this stuff exists. Even distinctly non-smut writing prompts often get the "As an AI Language Model..." bullshit if there's a hint of violence, gore, or sex involved. Or your output just gets rejected and deleted.

My friend tried this one; it has stronger comprehension and associative skills than the 13B models of the past, though it's still hard for it to catch up with the 30B models.
 
  • Like
Reactions: Lakius

19o8

Newbie
Oct 7, 2018
22
72
Yeah, the model is great.
clara went to the shop with jake and the two of them came back with a bag of chips. I asked her what she was doing and she said "I'm buying some chips." I told her that I had to go buy some chips too because i didn't have any. She asked me if I wanted to share a chip. I said yes and we shared one together. It was so nice.
I love this :ROFLMAO:
 

littleknight

Member
May 28, 2017
101
887
I actually tested the 33B Airoboros. It seemed to work fine at first, until the chatbot started to go insane. Maybe I fed too much complicated information into the conversation between the two of us, because the chatbot started going crazy and talking nonsense. Ridiculous. I'm still deciding between templates and presets for chatbots. I don't know which type is suitable for role-play chat, and whether the chatbot can actually respond in line with its character.
 

Lakius

Member
Mar 22, 2019
158
650
I actually tested the 33B Airoboros. It seemed to work fine at first, until the chatbot started to go insane. Maybe I fed too much complicated information into the conversation between the two of us, because the chatbot started going crazy and talking nonsense. Ridiculous. I'm still deciding between templates and presets for chatbots. I don't know which type is suitable for role-play chat, and whether the chatbot can actually respond in line with its character.
There are new Llama 2 models available that perform better than (or at least on par with) 33B models at 13B. Chronos-Beluga v2 13B and Huginn 13B are great for RP as far as I've heard, while I'd personally use some flavor of Airoboros 2 or its mixes for writing. The other huge advantage of Llama 2 is the 4096-token context size, i.e. double the memory of Llama 1 or the older OPT and GPT-J models. Llama 2 doesn't have a 30/33B release currently, only 7B, 13B, and 70B. Llama 2 is, in my opinion, just better at understanding context. I couldn't get any Llama 1 models to actually remember and follow instructions that ChatGPT apparently handles easily, but Llama 2, in my limited testing, does.

As for insanity and consistency: you really, really need to give the AI something to build off of for it to generate what you want. Short shots with limited context are just not something it can handle. This is where grabbing or at least looking at other people's character cards helps, but really, you need examples of how characters talk and act. Using the same model, I've had stories I wouldn't take money to read, and others I wish I could recreate on my own. But you've got to steer and trim messages, because the AI will fixate on things you've glossed over, or on subtle things that carry more meaning than their literal interpretation. You or I could omit words; the AI kinda needs them, or it makes the wrong assumption.


In other news, I tried out some new backends (or modules). ExLlama was fucking fast compared to HF GPTQ but absolutely batshit crazy when I tried it, making it pretty worthless for me. It was literally 10-20x faster, but again, kinda pointless to receive more useless results that felt as bad as doubling the context size in an unsupported way. It also used more VRAM, nudging 33B 4-bit models above the 24GB mark. The real kicker was trying out KoboldCPP for real and learning how to set it up. GGML models aren't just for CPU inference, apparently: with CuBLAS you can run the entire model on the GPU, taking me from ~3.5-4 T/s on GPTQ to 33 T/s! Crazy fast.
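If anyone wants to try it, the setup is basically one command. The flag names are from memory, so check koboldcpp's --help, and the model filename here is just a placeholder:

python koboldcpp.py your-model.ggmlv3.q5_K_M.bin --usecublas --gpulayers 43 --contextsize 4096

Set --gpulayers high enough to cover every layer (koboldcpp prints the layer count when it loads the model); anything that doesn't fit stays on the CPU and slows things down.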

The major downside is that the UI is KoboldAI Lite, which doesn't have server-side story saving, so you're going to want a different front end if you want to seamlessly switch between desktop and mobile. It saves stories as downloadable files, which is just the worst. Good luck finding a front end as good as KoboldAI's for stories, but there are plenty for chats and RP.
 
  • Like
Reactions: Dir.Fred

Gluttonous

Newbie
Feb 18, 2018
47
12
There are new Llama 2 models available that perform better than (or at least on par with) 33B models at 13B. Chronos-Beluga v2 13B and Huginn 13B are great for RP as far as I've heard, while I'd personally use some flavor of Airoboros 2 or its mixes for writing. The other huge advantage of Llama 2 is the 4096-token context size, i.e. double the memory of Llama 1 or the older OPT and GPT-J models. Llama 2 doesn't have a 30/33B release currently, only 7B, 13B, and 70B. Llama 2 is, in my opinion, just better at understanding context. I couldn't get any Llama 1 models to actually remember and follow instructions that ChatGPT apparently handles easily, but Llama 2, in my limited testing, does.

As for insanity and consistency: you really, really need to give the AI something to build off of for it to generate what you want. Short shots with limited context are just not something it can handle. This is where grabbing or at least looking at other people's character cards helps, but really, you need examples of how characters talk and act. Using the same model, I've had stories I wouldn't take money to read, and others I wish I could recreate on my own. But you've got to steer and trim messages, because the AI will fixate on things you've glossed over, or on subtle things that carry more meaning than their literal interpretation. You or I could omit words; the AI kinda needs them, or it makes the wrong assumption.


In other news, I tried out some new backends (or modules). ExLlama was fucking fast compared to HF GPTQ but absolutely batshit crazy when I tried it, making it pretty worthless for me. It was literally 10-20x faster, but again, kinda pointless to receive more useless results that felt as bad as doubling the context size in an unsupported way. It also used more VRAM, nudging 33B 4-bit models above the 24GB mark. The real kicker was trying out KoboldCPP for real and learning how to set it up. GGML models aren't just for CPU inference, apparently: with CuBLAS you can run the entire model on the GPU, taking me from ~3.5-4 T/s on GPTQ to 33 T/s! Crazy fast.

The major downside is that the UI is KoboldAI Lite, which doesn't have server-side story saving, so you're going to want a different front end if you want to seamlessly switch between desktop and mobile. It saves stories as downloadable files, which is just the worst. Good luck finding a front end as good as KoboldAI's for stories, but there are plenty for chats and RP.
I'm not sure I agree on the LL2 13B models over the old 33B ones. In my personal experience a 13B is still just a 13B, and its comprehension is in many ways no match for a 33B model, and comprehension is actually very important for erotic roleplaying. One of the wackiest erotic roleplays I've done in the last couple of weeks was giving the protagonist the ability to convert other people into onaholes; the LL2 13B model showed no comprehension of the setting, while the 33B model handled it perfectly. And the mainstream 33B models can now also reach 6K or 8K of memory. I'm very much looking forward to the release of an LL2 34B model, but at the moment the 13B models still can't compare to the 33B ones.
 

Lakius

Member
Mar 22, 2019
158
650
I'm not sure I agree on the LL2 13B models over the old 33B ones. In my personal experience a 13B is still just a 13B, and its comprehension is in many ways no match for a 33B model, and comprehension is actually very important for erotic roleplaying. One of the wackiest erotic roleplays I've done in the last couple of weeks was giving the protagonist the ability to convert other people into onaholes; the LL2 13B model showed no comprehension of the setting, while the 33B model handled it perfectly. And the mainstream 33B models can now also reach 6K or 8K of memory. I'm very much looking forward to the release of an LL2 34B model, but at the moment the 13B models still can't compare to the 33B ones.
I do not have any comprehension issues with 13b airoboros L2 beyond what 33b L1 already had. Niche content is always going to be a crapshoot with AI. The variety between finetunes is enough to make some entirely incapable of anything other than blowjobs and missionary sex. Mixes can do weird things for overall model intelligence as well. The super smutty finetunes tend to have some real fuckin insane sentence structures, in my experience, in exchange for learning about niche content.

Meta has been pretty damn quiet about the LL2 33b version, to the point of scrubbing its mention from everything but internal comparison whitepapers, so I wouldn't get too excited for more. The only charts I've seen didn't paint a pretty picture for it, so it's likely still in the metaphorical oven.

4-bit 33B with 6K or 8K context uses way too much VRAM; the most I could fit before OOM on a 3090 was 4.5K. Granted, that was GPTQ; I might be able to stomach a big performance hit with GGML and some CPU layers. There's also a 16K Airoboros 33B out there if you're really horny for extra context, though the SuperHOTs were hot trash for me. And you can usually boost context by about 50% with minimal perplexity gain if you want to stick with what you've got.
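By "boost by 50%" I mean the linear RoPE scaling knob most loaders expose these days (I'm assuming yours calls it a compression or scale factor, so double-check the name). The factor is just target context over native context:

# linear RoPE scaling factor for stretching a 2K-native model to ~3K
native_ctx = 2048
target_ctx = 3072                   # ~50% boost
scale = target_ctx / native_ctx     # 1.5 -- in ooba this is the compress_pos_emb setting, iirc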
 

Gluttonous

Newbie
Feb 18, 2018
47
12
I do not have any comprehension issues with 13b airoboros L2 beyond what 33b L1 already had. Niche content is always going to be a crapshoot with AI. The variety between finetunes is enough to make some entirely incapable of anything other than blowjobs and missionary sex. Mixes can do weird things for overall model intelligence as well. The super smutty finetunes tend to have some real fuckin insane sentence structures, in my experience, in exchange for learning about niche content.

Meta has been pretty damn quiet about the LL2 33b version, to the point of scrubbing its mention from everything but internal comparison whitepapers, so I wouldn't get too excited for more. The only charts I've seen didn't paint a pretty picture for it, so it's likely still in the metaphorical oven.

4-bit 33B with 6K or 8K context uses way too much VRAM; the most I could fit before OOM on a 3090 was 4.5K. Granted, that was GPTQ; I might be able to stomach a big performance hit with GGML and some CPU layers. There's also a 16K Airoboros 33B out there if you're really horny for extra context, though the SuperHOTs were hot trash for me. And you can usually boost context by about 50% with minimal perplexity gain if you want to stick with what you've got.
I don't agree with this at all. I'm using Airoboros GPT4 m2.0 33B Q5_K_M with koboldcpp, and without fully loading the GPU it only takes about three minutes to complete a response of around 100 tokens at 6K context. Its accuracy and obedience are far better than the GPTQ 13B LL2 model, and it doesn't require any additional context expansion. The other comprehension gap, aside from some quirky situations the LL2 13B model simply can't grasp, is that it struggles to play a convincing role in an adventure according to the character's personality.
 

Lakius

Member
Mar 22, 2019
158
650
I don't agree with this at all. I'm using Airoboros GPT4 m2.0 33B Q5_K_M with koboldcpp, and without fully loading the GPU it only takes about three minutes to complete a response of around 100 tokens at 6K context. Its accuracy and obedience are far better than the GPTQ 13B LL2 model, and it doesn't require any additional context expansion. The other comprehension gap, aside from some quirky situations the LL2 13B model simply can't grasp, is that it struggles to play a convincing role in an adventure according to the character's personality.
Let's just agree to disagree and move on. :)

You on Linux or Windows? KCPP takes a significant performance hit on Windows due to kernel overhead. Consider switching to Occam's new ExLlama KoboldAI fork, which is actually pretty great, especially if you like the vanilla/United Kobold UI. It also uses a fair amount less VRAM than HF GPTQ. The huge ExLlama perplexity issues seem to have been fixed, and for whatever reason the ExLlama module was not merged into KoboldAI with the other GPTQ code (likely some bugs, such as not being able to load another model without errors). I'm getting ~0.7GB of VRAM remaining for a 33B model at 2048 context on a very unclean desktop, with 13.7 T/s, or 25.4 T/s for a 13B model with double the context. HF GPTQ, meanwhile, was ~5 T/s for a 13B, and I believe KCPP got me ~10, cratering to <4 with even one layer off of VRAM. 4 is about the limit of my patience; any slower and I can write faster myself. At least until I get stuck.
 

Gluttonous

Newbie
Feb 18, 2018
47
12
Let's just agree to disagree and move on. :)

You on Linux or Windows? KCPP takes a significant performance hit on Windows due to kernel overhead. Consider switching to Occam's new ExLlama KoboldAI fork, which is actually pretty great, especially if you like the vanilla/United Kobold UI. It also uses a fair amount less VRAM than HF GPTQ. The huge ExLlama perplexity issues seem to have been fixed, and for whatever reason the ExLlama module was not merged into KoboldAI with the other GPTQ code (likely some bugs, such as not being able to load another model without errors). I'm getting ~0.7GB of VRAM remaining for a 33B model at 2048 context on a very unclean desktop, with 13.7 T/s, or 25.4 T/s for a 13B model with double the context. HF GPTQ, meanwhile, was ~5 T/s for a 13B, and I believe KCPP got me ~10, cratering to <4 with even one layer off of VRAM. 4 is about the limit of my patience; any slower and I can write faster myself. At least until I get stuck.
I'm using Windows. I only have 11GB of video memory, so with KCPP I load as much of a 30B+ model into VRAM as I can and end up generating around 2 T/s. In my opinion that makes the 34B-class models perform great, but they still lack enough training to reach a really good level of writing.
 

Itsdatboi1

New Member
Sep 9, 2023
1
1
I managed to get Pygmalion 6B to work in KoboldAI, then used TavernAI to run it.
But for whatever reason I can't seem to get anything else to work.

I tried using

and even


And it just won't let me load it in KoboldAI. Anyone know why that is?
I did have an issue getting KoboldAI to work on the new United branch.
I even tried doing what Lakius said to get it to work, but that didn't work for me either.
Is that why I can't load these other models?
 
  • Like
Reactions: toroduro

toroduro

New Member
Aug 3, 2017
2
2
I managed to get Pygmalion 6B to work in KoboldAI, then used TavernAI to run it.
But for whatever reason I can't seem to get anything else to work.

I tried using

and even


And it just won't let me load it in KoboldAI. Anyone know why that is?
I did have an issue getting KoboldAI to work on the new United branch.
I even tried doing what Lakius said to get it to work, but that didn't work for me either.
Is that why I can't load these other models?
It's probably the software. A lot of people have been trying 0cc4m's fork because it's easy to run on Windows, but there are other ways to run these models, so yours may just not be made to run on that version, or with KoboldAI at all. Models from TheBloke usually work, if you manage to find the right ones (GPTQ / 4-bit). There are also the "usual models" available in 4-bit versions for download directly in KoboldAI, but in my opinion I only had fun using the 13B versions of those, and to make them work I have to use system RAM/CPU to help the GPU, which makes it extremely slow!
 
  • Like
Reactions: Itsdatboi1