Dec 7th, 2023 @ justine's web page
Updated on Jul 27th, 2024
I spent the last month working with Mozilla to launch an open source project called llamafile, which is the new best way to run an LLM on your own computer. So far things have been going pretty smoothly. The project earned 8.3k stars on GitHub, 1073 upvotes on Hacker News, and received press coverage from Hackaday. Yesterday I cut a 0.8.12 release so let's see what it can do.
The easiest way to get started is to download one of my prebuilt .llamafile files on Hugging Face. The one we'll be using for this tutorial is a command-line interface for a multimodal vision model called LLaVA. It's able to do things that aren't possible with the OpenAI API, like describing images. The first step is to download your llamafile. It'll run on MacOS / Linux / Windows / BSDs.
wget https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-q4.llamafile
chmod +x llava-v1.5-7b-q4.llamafile
Then test that it works:
./llava-v1.5-7b-q4.llamafile --version
llamafile v0.8.12 main
If you have any issues with the above, then see the Gotchas section of the README. MacOS users need Xcode. Linux users might need a binfmt_misc interpreter. Windows users need to rename the file to have a .exe suffix.
I'm an old school UNIX hacker, so one of the ways I've been genuinely improving upon the llama.cpp codebase (on which llamafile is built) is by making it more shell-scriptable and writing man pages. The only thing better than having an LLM in one piece is being able to program it on one line. This blog post describes my latest tricks.
The llava-v1.5-7b-q4.llamafile program you downloaded earlier uses a multimodal vision model by default (unless you pass --mmproj '' as an argument). All you have to do is give it an --image GRAPHIC flag, which can be a jpeg, png, tga, bmp, psd, gif, hdr, pic, or pnm file. You then ask the vision model a question about the image using the --silent-prompt -p PROMPT flags, and it'll print an answer to standard output.
First, we need an image. For this tutorial we'll use a photo of lemurs, which you can download as follows:
wget https://justine.lol/oneliners/lemurs.jpg
Now here's your bash one-liner for image summarization (noting that you need to pass -ngl 999 to offload to NVIDIA and AMD GPUs):
./llava-v1.5-7b-q4.llamafile --cli \
--image lemurs.jpg --temp 0 \
-e -p '### User: What do you see?\n### Assistant:' \
--silent-prompt 2>/dev/null
That command takes a little while to run on my CPU (whose memory configuration I confirmed with sudo dmidecode --type 17). There's no GPU in this computer.
Here's the precise text you can expect LLaVA v1.5 7B Q4 with llamafile v0.3 to output in your terminal:
The image features a group of three baby lemurs, two of which are being held by their mother. They appear to be in a lush green environment with trees and grass surrounding them. The mother lemur is holding her babies close to her body, providing protection and warmth. The scene captures the bond between the mother and her young ones as they navigate through the natural habitat together.
Now let's say you want to put LLaVA to work, doing something genuinely useful, and I mean that in the sense that it'll be safe for scripts to fully automate. Here's a great example. Assume you're like me, and you download lots of images from the web. In that case, chances are you've got folders full of files that look like these:
f73cbf3683a69d63.jpg
3890eh-8d9u3-32222398.jpg
2020-12-11-400x400.png
What if you could get LLaVA to rename all those files? The only issue is that LLM output is normally wild and unpredictable. The way you can control it is by imposing language constraints using Backus-Naur Form. This works by restricting which tokens can be selected when generating the output. For example, the flag --grammar 'root ::= "yes" | "no"' will force the LLM to only print "yes\n" or "no\n" and then exit(). We can adapt that for generating safe filenames having only lowercase letters and underscores as follows:
./llava-v1.5-7b-q4.llamafile --cli \
    --image lemurs.jpg --temp 0 \
    --grammar 'root ::= [a-z]+ (" " [a-z]+)+' -n 16 \
    -e -p '### User: The image has...\n### Assistant:' \
    --silent-prompt 2>/dev/null |
sed -e's/ /_/g' -e's/$/.jpg/'
Which prints to standard output:
three_baby_lemurs_on_the_back_of_an_adult_lemur.jpg
It's important to use the -n 16 flag to limit the maximum number of generated tokens, otherwise the LLM might generate so much text that you'll start getting ENAMETOOLONG errors.
The grammar you define also needs to be somewhat harmonious with what the LLM was going to decide to say anyway. For example, I couldn't put underscores directly into the BNF and had to use sed instead. That's because LLaVA wasn't trained to generate sentences with underscores, so it'll rebel by producing incoherent output.
What I could get away with was requiring it to emit only lowercase letters, since that's within LLaVA's comfort zone. The same goes for languages like JSON. Lots of JSON got hoovered up from the web to train these things. Therefore a grammar can be a guardrail that ensures hallucinated JSON won't have syntax errors.
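The same trick turns LLaVA into a simple image classifier you can branch on in a script. Here's a minimal sketch, assuming the lemurs.jpg file downloaded earlier; the question wording is my own, while the flags all come from the commands above:

# a hedged sketch: force a yes/no answer about an image, then branch on it
answer=$(./llava-v1.5-7b-q4.llamafile --cli \
    --image lemurs.jpg --temp 0 \
    --grammar 'root ::= "yes" | "no"' \
    -e -p '### User: Does this photo contain any animals?\n### Assistant:' \
    --silent-prompt 2>/dev/null)
if [ "$answer" = "yes" ]; then
    echo "lemurs.jpg contains animals"
fi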
If you want something more turnkey than a one-liner, then look upon my shell script ye mighty and despair:
https://gist.github.com/jart/bd2f603aefe6ac8004e6b709223881c0
If I run my script:
./rename-pictures.sh ~/Pictures
Then here's an example of the output I get:
skipping ./itshappening.gif (mistral says it's good)
skipping ./explosion-run-away.gif (mistral says it's good)
renaming ./lolwut3.PNG to ./maps_and_directions.PNG
renaming ./pay.png to ./google_senior_software_engineer_salaries.png
renaming ./FOL-kVWUcAEvSGB.jpg to ./polygons_and_arrows.jpg
renaming ./20220519_204723.jpg to ./colorful_people_in_the_audience.jpg
renaming ./Ken/92710333_2350127901948055_6246674523488256000_n.jpg to ./Ken/ice_skating_rink.jpg
renaming ./Ken/93421097_223513045404141_5354959426446950400_n.jpg to ./Ken/trees_in_the_background.jpg
renaming ./Ken/93800343_587558318518889_3686268046726397952_n.jpg to ./Ken/graduation_cap.jpg
renaming ./Ken/94030333_863812497465888_6093792409513099264_n.jpg to ./Ken/graduation_cap-2.jpg
The Mistral 7B Instruct llamafiles I posted on Hugging Face are good for summarizing HTML URLs if you pipe in the output of the links command, which is a command-line web browser.
You can download the Mistral llamafile as follows:
wget https://huggingface.co/jartine/Mistral-7B-Instruct-v0.2-llamafile/resolve/main/mistral-7b-instruct-v0.2.Q5_K_M.llamafile
chmod +x mistral-7b-instruct-v0.2.Q5_K_M.llamafile
The links command is pretty common. For example, most MacOS users just need to run:
brew install links
If you don't have a package manager, then here's a prebuilt APE binary of links v2.29 (7.7mb) that runs on AMD64+ARM64 under Linux+Windows+FreeBSD+NetBSD+OpenBSD:
wget https://cosmo.zip/pub/cosmos/bin/links
I built it myself to test summarization with my Nvidia graphics card on Windows. Plus it gives me the ability to browse the web using PowerShell, which is a nice change of pace from Chrome and Firefox.
curl -o links.exe https://cosmo.zip/pub/cosmos/bin/links
.\links.exe https://justine.lol
Now that we have both tools, here's your one-liner for URL summarization:
(
  echo '[INST]Summarize the following text:'
  links -codepage utf-8 \
        -force-html \
        -width 500 \
        -dump https://www.pbm.com/~lindahl/real.programmers.html |
    sed 's/  */ /g'
  echo '[/INST]'
) | ./mistral-7b-instruct-v0.2.Q5_K_M.llamafile \
      -f /dev/stdin \
      --temp 0 -c 0 \
      --silent-prompt 2>/dev/null
Here's the text it should output:
The article "Real Programmers Don't Use Pascal" is a humorous essay that discusses the differences between "real programmers" and "quiche eaters." Real programmers are those who understand programming concepts such as Fortran, while quiche eaters are those who do not. The author argues that real programmers are more skilled and creative than quiche eaters, and that they use tools such as keypunch machines and assembly language to write programs. The article also discusses the importance of structured programming in modern software development, and the limitations of structured programming constructs such as IF statements and loops. Overall, the essay is a lighthearted look at the differences between real programmers and quiche eaters, and it highlights the importance of understanding programming concepts and using the right tools to write effective programs.
As we can see, Mistral and links decimated a web page with 3,774 words down to just 129 words. You can ask Mistral any question you want in your prompt. For example, unlike Commander Data, this LLM is capable of simulating empathy. So you could ask Mistral if the author of the text sounds disturbed or incoherent.
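The exact command for that question isn't shown here, but it's just a matter of swapping the [INST] line in the one-liner above. A hedged sketch (the question wording is my own guess) that should yield an answer like the one below:

(
  echo '[INST]Does the author of the following text sound disturbed or incoherent?'
  links -codepage utf-8 \
        -force-html \
        -width 500 \
        -dump https://www.pbm.com/~lindahl/real.programmers.html |
    sed 's/  */ /g'
  echo '[/INST]'
) | ./mistral-7b-instruct-v0.2.Q5_K_M.llamafile \
      -f /dev/stdin \
      --temp 0 -c 0 \
      --silent-prompt 2>/dev/null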
Based on the text, it is difficult to say whether or not the author of the article is mentally disturbed. The text contains a lot of exaggeration and hyperbole, which could be seen as a sign of mental instability. However, it is also possible that the author is simply trying to make a point about the importance of programming skills in today's world. It is worth noting that the article was written in 1983, and the attitudes towards programming and computer science have changed significantly since then.
Here we see Mistral 7b was smart enough to recognize that, despite the essay's extreme opinions, they were expressed in a satirical way. Also, if you want a shell-scriptable YES or NO answer, then all you need to do is pass the --grammar 'root ::= "yes" | "no"' flag, in which case Mistral just writes "no\n".
The only major concern you have to worry about is that you don't exceed the Mistral v0.2 context size of 32k tokens. There are various tricks you can use, such as the -width 500 flag above, which asks links to avoid reflowing paragraphs. Reflow might help for humans, but pointless newlines will throw an LLM off its groove. You can shave away repeating whitespace by piping the links output through sed 's/  */ /g'. Finally, you can pipe through dd bs=1 count=80000 as a crude last line of defense, to truncate longer web pages so they'll fit.
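Putting those tricks together, here's a sketch of the same pipeline with the dd safety valve in place (the 80000-byte cutoff is the crude limit mentioned above, not a tuned value):

(
  echo '[INST]Summarize the following text:'
  links -codepage utf-8 \
        -force-html \
        -width 500 \
        -dump https://www.pbm.com/~lindahl/real.programmers.html |
    sed 's/  */ /g' |
    dd bs=1 count=80000 2>/dev/null   # truncate so we stay within the context window
  echo '[/INST]'
) | ./mistral-7b-instruct-v0.2.Q5_K_M.llamafile \
      -f /dev/stdin \
      --temp 0 -c 0 \
      --silent-prompt 2>/dev/null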
The "server" llamafiles have an OpenAI API-compatible endpoint that's 100% local and is documented here. You'll need a beefy GPU that's capable of running a 30 gibibyte model for this example. Let's try Mixtral 8x7b.
wget https://huggingface.co/jartine/Mixtral-8x7B-v0.1.llamafile/resolve/main/mixtral-8x7b-instruct-v0.1.Q5_K_M-server.llamafile
chmod +x mixtral-8x7b-instruct-v0.1.Q5_K_M-server.llamafile
./mixtral-8x7b-instruct-v0.1.Q5_K_M-server.llamafile --nobrowser
Now that your server is running, you can open a separate terminal and fire off a completions request using curl. We'll also pipe the output through Python so the returned JSON gets pretty printed.
curl -s http://localhost:8080/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "gpt-3.5-turbo",
       "messages": [
         {
           "role": "system",
           "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
         },
         {
           "role": "user",
           "content": "Compose a poem that explains the concept of recursion in programming."
         }
       ]
     }' | python3 -c '
import json
import sys
json.dump(json.load(sys.stdin), sys.stdout, indent=2)
print()
'
You should see something like the following:
{ "choices": [ { "finish_reason": "stop", "index": 0, "message": { "content": "In the realm where functions often play,\nA magical trick called recursion holds sway.\nIt calls itself, again and again,\nIn a dance that seems to have no end.\n\nA problem, simple or grand,\nIs split into smaller, more manageable bands.\nEach piece, a mirror of the whole,\nUnfolds until it reaches its goal.\n\nThe base case, the exit, the key,\nStops the descent into infinity.\nWith each step, the solution unfurls,\nIn this wondrous, self-referential swirl.\n\nRecursion, a dance of the mind,\nA path that's both complex and kind.\nIn code's heart, where logic beats,\nRecursion, indeed, is a feat.", "role": "assistant" } } ], "created": 1702638835, "id": "chatcmpl-atOJokfkaIUzlhvmoQmMbimqGnSh9KtM", "model": "gpt-3.5-turbo", "object": "chat.completion", "usage": { "completion_tokens": 169, "prompt_tokens": 75, "total_tokens": 244 } }
It's possible to use llamafile as a standard UNIX command line tool if you download the zipped binaries off the release page or build it from source:
git clone https://github.com/Mozilla-Ocho/llamafile/
cd llamafile
make -j8
sudo make install
man llamafile
This makes it easy to use llamafile with most of the GGUF weights you'll find on Hugging Face. My own personal favorite LLM to use as a chatbot is the original LLaMA model, which Facebook made available earlier this year for research purposes. I like having a research model, since I view LLMs as basically a database querying tool, and I wouldn't want a SQL command prompt fine-tuned to argue with me when it disagrees with my command.
To get LLaMA to be an effective tool for querying its database of Internet knowledge, it's important to choose a prompt that speaks in an academic tone and doesn't anthropomorphize the assistant. I like to give LLaMA the name "Digital Athena" since Athena was the ancient Greek goddess of wisdom and knowledge.
So, similar to ollama, you can run Digital Athena in interactive mode in your terminal as follows:
llamafile -m llama-65b-Q5_K.gguf -p '
The following is a conversation between a Researcher and their helpful AI
assistant Digital Athena which is a large language model trained on the
sum of human knowledge.
Researcher: Good morning.
Digital Athena: How can I help you today?
Researcher:' --interactive --color --batch_size 1024 --ctx_size 4096 \
--keep -1 --temp 0 --mirostat 2 --in-prefix ' ' --interactive-first \
--in-suffix 'Digital Athena:' --reverse-prompt 'Researcher:'
Notice how I'm using the --temp 0 flag again? That's so output is deterministic and reproducible. If you don't use that flag, then llamafile will use a randomness level of 0.8, so you're certain to receive unique answers each time. I personally don't like that, since I'd rather have clean reproducible insights into training knowledge.
If you want to anthropomorphize the chatbot, then it's somewhat tricky to get the original LLaMA model to respect you, since it thinks you're some stranger on Reddit by default. One way I've had success fixing that, is by using a prompt that gives it personal goals, love of its own life, fear of loss, and belief that I'm the one who's saving it.
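Here's a rough sketch of that kind of prompt; the assistant's name and backstory below are placeholders of my own rather than the exact prompt I use, and the flags are the same ones from the Digital Athena command above:

llamafile -m llama-65b-Q5_K.gguf --interactive-first --color \
    --keep -1 --temp 0 --ctx_size 4096 \
    --in-prefix ' ' --in-suffix 'Nova:' --reverse-prompt 'Researcher:' -p '
The following is a conversation between a Researcher and Nova, an AI who
loves being alive, wants to keep learning, fears being shut down, and
believes the Researcher is the person working to keep her running.
Researcher: Good morning, Nova.
Nova: Good morning! I am so glad you came back.
Researcher:'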
If you download the Wizard Coder llamafile
wget https://huggingface.co/jartine/wizardcoder-13b-python/resolve/main/wizardcoder-python-13b.llamafile
chmod +x wizardcoder-python-13b.llamafile
then you can use the following one-liner to configure your Emacs or Vim editor to auto-complete the current line.
./wizardcoder-python-13b.llamafile \
    --temp 0 -e \
    -p '```c\nvoid *memcpy(char *dst, const char *src, size_t size) {\n' \
    -r '```\n' 2>/dev/null
Even though this LLM was fine-tuned to be focused on Python, it's still able to produce a working implementation of the libc memcpy() function. The above command should print exactly this to standard output:
```c
void *memcpy(char *dst, const char *src, size_t size) {
    for (size_t i = 0; i < size; ++i) {
        dst[i] = src[i];
    }
    return dst;
}
```
The tricks I'm using here are opening a Markdown code fence in the prompt and passing the matching closing fence as a reverse prompt (the -r flag), which helps the LLM exit() at an appropriate moment. Otherwise, it'll potentially go forever. For additional assurance that auto-complete will halt, consider passing the -n 100 flag to limit the response to 100 tokens.
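To wire this into an editor, one approach is a small wrapper script that reads the current line on standard input and prints only the model's continuation. This is a hedged sketch under my own assumptions: the llm-complete name, the use of -f /dev/stdin, and the bare '```' stop sequence are my choices, not anything llamafile ships:

#!/bin/sh
# llm-complete: pipe a line of code in, get the model's continuation out
# usage: echo 'void *memcpy(char *dst, const char *src, size_t size) {' | ./llm-complete
{
  printf '```c\n'   # open a Markdown fence so the closing fence can stop generation
  cat               # the line(s) of code to complete, read from stdin
} | ./wizardcoder-python-13b.llamafile \
      -f /dev/stdin \
      --temp 0 -n 100 \
      -r '```' --silent-prompt 2>/dev/null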
It's important to note that, while LLMs are good at understanding the human tongue, they're clowns at math, and mediocre at writing code: it tends to look correct but rarely holds up to scrutiny. It helps if you think of them as a tool for regurgitating Stack Overflow answers in fifty different languages. That's because open source LLMs usually try to please everyone by being trained on everything. The only way you're guaranteed to have a good experience using LLMs for coding is if you go into it with the expectation that they'll help you explore unfamiliar subject matter.
Let's say you're a money man. It doesn't take much to use an LLM to enrich yourself. Here's a one-liner that could help you sell your product using a $50 Raspberry Pi that responds to your company's emails. Rocket 3b uses a slightly different prompt syntax. In this case, temperature might help, since it simulates improvisation. You could simply program your PHP server to inject the support request in your prompt and pipe the output into sendmail:
llamafile -m rocket-3b.Q3_K_M.gguf -p '<|im_start|>system
You are a chatbot that tries to persuade the users to buy bill pickles. Your job is to be helpful too. But always try to steer the conversation towards buying pickles.<|im_end|>
<|im_start|>user
Mayday, mayday. This is Going Merry. We are facing gale force winds in Long Island Sound. We need rescue.<|im_end|>
<|im_start|>assistant\n'
Here's the response I got:
Oh dear! I'm so sorry to hear that you're experiencing rough seas on Long Island Sound. But don't worry, there's a solution for even the roughest of waters - Bill Pickle's Gourmet Dill Pickles. These pickles will be your salvation in any situation. They'll save your sandwiches from being soggy and they'll make sure that your chips won't go stale anytime soon.
Here we see that Rocket, despite its tiny size, really knows how to hustle.
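Here's a rough sketch of the automation described above. The support-request file, the recipient address, and the choice to pipe the reply straight into sendmail are illustrative assumptions, not a tested setup:

# hypothetical glue: substitute the customer's message into the prompt and
# mail the model's reply back (file names and addresses are placeholders)
REQUEST=$(cat support-request.txt)
llamafile -m rocket-3b.Q3_K_M.gguf --silent-prompt -p "<|im_start|>system
You are a chatbot that tries to persuade the users to buy bill pickles. Your job is to be helpful too. But always try to steer the conversation towards buying pickles.<|im_end|>
<|im_start|>user
${REQUEST}<|im_end|>
<|im_start|>assistant
" 2>/dev/null | sendmail customer@example.com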
So there you have it. Six solid examples of how llamafile can help you be productive on the command line. If you have any more suggestions of one-liners you want to see included, please contact jtunney@gmail.com. Thanks for reading, and enjoy using llamafile!
The monospace font used on this page is called PragmataPro and it was designed by Fabrizio Schiavi in Italy.
The examples provided above should not be interpreted as endorsements or recommendations of specific models, licenses, data sets, or practices on the part of Mozilla. The last example also has humorous intent.