Improving Llama.cpp Model Output for Agent Environments with WizardLM and Mixed-Quantization Models
Llama.cpp is a powerful tool for generating natural language responses in an agent environment. One way to speed up generation is to cache the prompt-ingestion stage using the --session parameter, giving each prompt its own session name so that a repeated prefix does not have to be re-evaluated on every call. Trying the impressively fast WizardLM 7B (q5_1) and comparing its results with other new fine-tunes like TheBloke/wizard-vicuna-13B-GGML can also be useful, especially when prompt-tuning. Additionally, adding the llama.cpp parameter --mirostat has been found to improve model output. (Mirostat is not a training optimizer; it is a sampling algorithm that adaptively adjusts sampling during generation to keep the perplexity of the output near a target value, which helps avoid both repetitive and incoherent text.)
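As a rough illustration, the two flags might be combined like this. The model path and prompt files are placeholders, and exact flag names vary by llama.cpp version (--session was later renamed --prompt-cache, and Mirostat's tau/eta are exposed as --mirostat-ent and --mirostat-lr), so treat this as a sketch rather than a canonical invocation:

    # Cache the prompt-ingestion stage; one cache file per distinct prompt,
    # so the shared prefix is only evaluated once across runs.
    ./main -m ./models/wizardlm-7b.ggmlv3.q5_1.bin \
        --prompt-cache ./cache/npc-dialogue.bin \
        -p "$(cat prompts/npc-dialogue.txt)" -n 256

    # Same model with Mirostat v2 sampling enabled: tau (--mirostat-ent)
    # is the target surprise, eta (--mirostat-lr) the controller's step size.
    ./main -m ./models/wizardlm-7b.ggmlv3.q5_1.bin \
        --mirostat 2 --mirostat-ent 5.0 --mirostat-lr 0.1 \
        -p "$(cat prompts/npc-dialogue.txt)" -n 256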
It's worth noting that new mixed-quantization model files are likely coming, which should also help improve model quality and speed. However, open-source LLMs are still not suitable for the agent environment and require more fine-tuning.
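For reference, llama.cpp's quantize tool is how such files are produced; the mixed-quantization formats apply different quantization levels to different tensors. A minimal sketch, assuming a locally converted f16 GGML file — the paths and the exact type name are assumptions and depend on your build:

    # Quantize an f16 model to a mixed-quantization format: some tensors
    # are kept at higher precision than others, trading a little file size
    # for better quality than a uniform 4-bit quantization.
    ./quantize ./models/wizardlm-7b.ggmlv3.f16.bin \
               ./models/wizardlm-7b.ggmlv3.q4_K_M.bin q4_K_M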
When working with Vicuna to control NPCs in an open-world RPG, it's important to go over all instructions/prompts one by one and tune them to work better. For example, when summarizing speaker memories, make sure the model actually understands what you mean by phrases like "the most possible relationship between ... and ..." and "embellish." A more scientific approach is to work through the individual parts of your agent one by one, testing and re-phrasing each prompt independently rather than trying to change everything at once, as sketched below.
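One low-tech way to do that isolated testing is to run each sub-prompt directly against the model and compare outputs across phrasings while holding everything else fixed. The character names and prompt wordings here are hypothetical:

    # Compare two phrasings of the memory-summarization prompt in isolation;
    # only the wording changes, so any difference in output is down to the prompt.
    ./main -m ./models/wizardlm-7b.ggmlv3.q5_1.bin -n 128 \
        -p "What is the most possible relationship between Aldric and Mira, given these memories: ..."
    ./main -m ./models/wizardlm-7b.ggmlv3.q5_1.bin -n 128 \
        -p "Based on these memories, describe how Aldric and Mira most likely relate to each other: ..."

Here the "..." stands in for the actual memory text your agent would splice into the prompt.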
As for whether this approach would carry over to something like AutoGPT, it's difficult to say without testing. However, given the success of prompt-tuning with WizardLM, similar approaches could plausibly work with other language models.