Dec 7th, 2023 @ justine's web page
Updated on Jul 27th, 2024
I spent the last month working with Mozilla to launch an open source project called llamafile, which is the new best way to run an LLM on your own computer. So far things have been going pretty smoothly. The project earned 8.3k stars on GitHub, 1073 upvotes on Hacker News, and received press coverage from Hackaday. Yesterday I cut a 0.8.12 release so let's see what it can do.
The easiest way to get started is to download one of my prebuilt .llamafile files on Hugging Face. The one we'll be using for this tutorial is a command-line interface for a multimodal vision model called LLaVA. It's able to do things that aren't possible with the OpenAI API, like describing images. The first step is to download your llamafile. It'll run on MacOS / Linux / Windows / BSDs.
wget https://huggingface.co/jartine/llava-v1.5-7B-GGUF/resolve/main/llava-v1.5-7b-q4.llamafile
chmod +x llava-v1.5-7b-q4.llamafile
Then test that it works:
./llava-v1.5-7b-q4.llamafile --version
llamafile v0.8.12 main
If you have any issues with the above, then see the Gotchas section of the README. MacOS users need Xcode. Linux users might need a binfmt_misc interpreter. Windows users need to rename the file to have a .exe suffix.
I'm an old school UNIX hacker, so one of the ways I've been genuinely improving upon the llama.cpp codebase (on which llamafile is built) is by making it more shell-scriptable and writing man pages. The only thing better than having an LLM in one piece is being able to program it on one line. This blog post describes my latest tricks.
The llava-v1.5-7b-q4.llamafile program you downloaded earlier uses a multimodal vision model by default (unless you pass --mmproj '' as an argument). All you have to do is give it an --image GRAPHIC flag, which can be a jpeg, png, tga, bmp, psd, gif, hdr, pic, or pnm file. You then ask the vision model a question about the image using the --silent-prompt -p PROMPT flags, and it'll print an answer to standard output.
First, we need an image. For this tutorial we'll use a photo of lemurs, which you can download as follows:
wget https://justine.lol/oneliners/lemurs.jpg
Now here's your bash one-liner for image summarization (noting that you need to pass -ngl 999 to offload to NVIDIA and AMD GPUs):
./llava-v1.5-7b-q4.llamafile --cli \
--image lemurs.jpg --temp 0 \
-e -p '### User: What do you see?\n### Assistant:' \
--silent-prompt 2>/dev/null
That command takes a little while to run on my CPU (whose memory configuration I confirmed with sudo dmidecode --type 17). There's no GPU in this computer.
Here's the precise text you can expect LLaVA v1.5 7B Q4 with llamafile v0.3 to output in your terminal:
The image features a group of three baby lemurs, two of which are being held by their mother. They appear to be in a lush green environment with trees and grass surrounding them. The mother lemur is holding her babies close to her body, providing protection and warmth. The scene captures the bond between the mother and her young ones as they navigate through the natural habitat together.
Now let's say you want to put LLaVA to work, doing something genuinely useful, and I mean that in the sense that it'll be safe for scripts to fully automate. Here's a great example. Assume you're like me, and you download lots of images from the web. In that case, chances are you've got folders full of files that look like these:
f73cbf3683a69d63.jpg
3890eh-8d9u3-32222398.jpg
2020-12-11-400x400.png
What if you could get LLaVA to rename all those files? The only issue is that LLM output is normally wild and unpredictable. The way you can control it is by imposing language constraints using Backus-Naur Form. This works by restricting which tokens can be selected when generating the output. For example, the flag --grammar 'root ::= "yes" | "no"' will force the LLM to only print "yes\n" or "no\n" and then exit(). We can adapt that for generating safe filenames having only lowercase letters and underscores as follows:
./llava-v1.5-7b-q4.llamafile --cli \
    --image lemurs.jpg --temp 0 \
    --grammar 'root ::= [a-z]+ (" " [a-z]+)+' -n 16 \
    -e -p '### User: The image has...\n### Assistant:' \
    --silent-prompt 2>/dev/null |
sed -e's/ /_/g' -e's/$/.jpg/'
Which prints to standard output:
three_baby_lemurs_on_the_back_of_an_adult_lemur.jpg
It's important to use the -n 16 flag to limit the maximum number of generated tokens, otherwise the LLM might generate so much text that you'll start getting ENAMETOOLONG errors.
The grammar you define also needs to be somewhat harmonious with what the LLM was going to decide to say anyway. For example, I couldn't put underscores directly into the BNF and had to use sed instead. That's because LLaVA wasn't trained to generate sentences with underscores, so it'll rebel by producing incoherent output.
What I could get away with was requiring it to emit only lowercase letters, since that's within LLaVA's comfort zone. The same goes for languages like JSON. Lots of JSON got hoovered up from the web to train these things. Therefore a grammar can be a guardrail that ensures hallucinated JSON won't have syntax errors.
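The same trick turns LLaVA into a simple image classifier you can branch on in a script. Here's a minimal sketch, assuming the lemurs.jpg file downloaded earlier; the question wording is my own, while the flags all come from the commands above:

# a hedged sketch: force a yes/no answer about an image, then branch on it
answer=$(./llava-v1.5-7b-q4.llamafile --cli \
    --image lemurs.jpg --temp 0 \
    --grammar 'root ::= "yes" | "no"' \
    -e -p '### User: Does this photo contain any animals?\n### Assistant:' \
    --silent-prompt 2>/dev/null)
if [ "$answer" = "yes" ]; then
    echo "lemurs.jpg contains animals"
fi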
If you want something more turnkey than a one-liner, then look upon my shell script ye mighty and despair:
https://gist.github.com/jart/bd2f603aefe6ac8004e6b709223881c0
If I run my script:
./rename-pictures.sh ~/Pictures
Then here's an example of the output I get:
skipping ./itshappening.gif (mistral says it's good)
skipping ./explosion-run-away.gif (mistral says it's good)
renaming ./lolwut3.PNG to ./maps_and_directions.PNG
renaming ./pay.png to ./google_senior_software_engineer_salaries.png
renaming ./FOL-kVWUcAEvSGB.jpg to ./polygons_and_arrows.jpg
renaming ./20220519_204723.jpg to ./colorful_people_in_the_audience.jpg
renaming ./Ken/92710333_2350127901948055_6246674523488256000_n.jpg to ./Ken/ice_skating_rink.jpg
renaming ./Ken/93421097_223513045404141_5354959426446950400_n.jpg to ./Ken/trees_in_the_background.jpg
renaming ./Ken/93800343_587558318518889_3686268046726397952_n.jpg to ./Ken/graduation_cap.jpg
renaming ./Ken/94030333_863812497465888_6093792409513099264_n.jpg to ./Ken/graduation_cap-2.jpg
The Mistral 7B Instruct llamafiles I posted on Hugging Face are good for summarizing HTML URLs if you pipe in the output of the links command, which is a command-line web browser.
You can download the Mistral llamafile as follows:
wget https://huggingface.co/jartine/Mistral-7B-Instruct-v0.2-llamafile/resolve/main/mistral-7b-instruct-v0.2.Q5_K_M.llamafile
chmod +x mistral-7b-instruct-v0.2.Q5_K_M.llamafile
The links command is pretty common. For example, most MacOS users just need to run:
brew install links
If you don't have a package manager, then here's a prebuilt APE binary of links v2.29 (7.7mb) that runs on AMD64+ARM64 under Linux+Windows+FreeBSD+NetBSD+OpenBSD:
wget https://cosmo.zip/pub/cosmos/bin/links
I built it myself to test summarization with my Nvidia graphics card on Windows. Plus it gives me the ability to browse the web using PowerShell, which is a nice change of pace from Chrome and Firefox.
curl -o links.exe https://cosmo.zip/pub/cosmos/bin/links
.\links.exe https://justine.lol
Now that we have both tools, here's your one-liner for URL summarization:
(
  echo '[INST]Summarize the following text:'
  links -codepage utf-8 \
        -force-html \
        -width 500 \
        -dump https://www.pbm.com/~lindahl/real.programmers.html |
    sed 's/  */ /g'
  echo '[/INST]'
) | ./mistral-7b-instruct-v0.2.Q5_K_M.llamafile \
      -f /dev/stdin \
      --temp 0 -c 0 \
      --silent-prompt 2>/dev/null
Here's the text it should output:
The article "Real Programmers Don't Use Pascal" is a humorous essay that discusses the differences between "real programmers" and "quiche eaters." Real programmers are those who understand programming concepts such as Fortran, while quiche eaters are those who do not. The author argues that real programmers are more skilled and creative than quiche eaters, and that they use tools such as keypunch machines and assembly language to write programs. The article also discusses the importance of structured programming in modern software development, and the limitations of structured programming constructs such as IF statements and loops. Overall, the essay is a lighthearted look at the differences between real programmers and quiche eaters, and it highlights the importance of understanding programming concepts and using the right tools to write effective programs.
As we can see, Mistral and links decimated a web page with 3,774 words down to just 129 words. You can ask Mistral any question you want in your prompt. For example, unlike Commander Data, this LLM is capable of simulating empathy. So you could ask Mistral if the author of the text sounds disturbed or incoherent.
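The exact command for that question isn't shown here, but it's just a matter of swapping the [INST] line in the one-liner above. A hedged sketch (the question wording is my own guess) that should yield an answer like the one below:

(
  echo '[INST]Does the author of the following text sound disturbed or incoherent?'
  links -codepage utf-8 \
        -force-html \
        -width 500 \
        -dump https://www.pbm.com/~lindahl/real.programmers.html |
    sed 's/  */ /g'
  echo '[/INST]'
) | ./mistral-7b-instruct-v0.2.Q5_K_M.llamafile \
      -f /dev/stdin \
      --temp 0 -c 0 \
      --silent-prompt 2>/dev/null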
Based on the text, it is difficult to say whether or not the author of the article is mentally disturbed. The text contains a lot of exaggeration and hyperbole, which could be seen as a sign of mental instability. However, it is also possible that the author is simply trying to make a point about the importance of programming skills in today's world. It is worth noting that the article was written in 1983, and the attitudes towards programming and computer science have changed significantly since then.
Here we see Mistral 7b was smart enough to recognize that, despite the essay's extreme opinions, they were expressed in a satirical way. Also, if you want a shell-scriptable YES or NO answer, then all you need to do is pass the --grammar 'root ::= "yes" | "no"' flag, in which case Mistral just writes "no\n".
The only major concern you have to worry about is that you don't exceed the Mistral v0.2 context size of 32k tokens. There are various tricks you can use, such as the -width 500 flag above, which asks links to avoid reflowing paragraphs. Reflow might help for humans, but pointless newlines will throw an LLM off its groove. You can shave away repeating whitespace by piping the links output through sed 's/  */ /g'. Finally, you can pipe through dd bs=1 count=80000 as a crude last line of defense, to truncate longer web pages so they'll fit.
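Putting those tricks together, here's a sketch of the same pipeline with the dd safety valve in place (the 80000-byte cutoff is the crude limit mentioned above, not a tuned value):

(
  echo '[INST]Summarize the following text:'
  links -codepage utf-8 \
        -force-html \
        -width 500 \
        -dump https://www.pbm.com/~lindahl/real.programmers.html |
    sed 's/  */ /g' |
    dd bs=1 count=80000 2>/dev/null   # truncate so we stay within the context window
  echo '[/INST]'
) | ./mistral-7b-instruct-v0.2.Q5_K_M.llamafile \
      -f /dev/stdin \
      --temp 0 -c 0 \
      --silent-prompt 2>/dev/null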
The "server" llamafiles have an OpenAI API-compatible endpoint that's 100% local and is documented here. You'll need a beefy GPU that's capable of running a 30 gibibyte model for this example. Let's try Mixtral 8x7b.
wget https://huggingface.co/jartine/Mixtral-8x7B-v0.1.llamafile/resolve/main/mixtral-8x7b-instruct-v0.1.Q5_K_M-server.llamafile
chmod +x mixtral-8x7b-instruct-v0.1.Q5_K_M-server.llamafile
./mixtral-8x7b-instruct-v0.1.Q5_K_M-server.llamafile --nobrowser
Now that your server is running, you can open a separate terminal and fire off a completions request using curl. We'll also pipe the output through Python so the returned JSON gets pretty printed.
curl -s http://localhost:8080/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "gpt-3.5-turbo",
       "messages": [
         {
           "role": "system",
           "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
         },
         {
           "role": "user",
           "content": "Compose a poem that explains the concept of recursion in programming."
         }
       ]
     }' | python3 -c '
import json
import sys
json.dump(json.load(sys.stdin), sys.stdout, indent=2)
print()
'
You should see something like the following:
{ "choices": [ { "finish_reason": "stop", "index": 0, "message": { "content": "In the realm where functions often play,\nA magical trick called recursion holds sway.\nIt calls itself, again and again,\nIn a dance that seems to have no end.\n\nA problem, simple or grand,\nIs split into smaller, more manageable bands.\nEach piece, a mirror of the whole,\nUnfolds until it reaches its goal.\n\nThe base case, the exit, the key,\nStops the descent into infinity.\nWith each step, the solution unfurls,\nIn this wondrous, self-referential swirl.\n\nRecursion, a dance of the mind,\nA path that's both complex and kind.\nIn code's heart, where logic beats,\nRecursion, indeed, is a feat.", "role": "assistant" } } ], "created": 1702638835, "id": "chatcmpl-atOJokfkaIUzlhvmoQmMbimqGnSh9KtM", "model": "gpt-3.5-turbo", "object": "chat.completion", "usage": { "completion_tokens": 169, "prompt_tokens": 75, "total_tokens": 244 } }
It's possible to use llamafile as a standard UNIX command line tool if you download the zipped binaries off the release page or build it from source:
git clone https://github.com/Mozilla-Ocho/llamafile/
cd llamafile
make -j8
sudo make install
man llamafile
This makes it easy to use llamafile with most of the GGUF weights you'll find on Hugging Face. My own personal favorite LLM to use as a chatbot is the original LLaMA model, which Facebook made available earlier this year for research purposes. I like having a research model, since I view LLMs as basically a database querying tool, and I wouldn't want a SQL command prompt fine-tuned to argue with me when it disagrees with my command.
To get LLaMA to be an effective tool for querying its database of Internet knowledge, it's important to choose a prompt that speaks in an academic tone and doesn't anthropomorphize the assistant. I like to give LLaMA the name "Digital Athena" since Athena was the ancient Greek goddess of wisdom and knowledge.
So, similar to ollama, you can run Digital Athena in interactive mode in your terminal as follows:
llamafile -m llama-65b-Q5_K.gguf -p '
The following is a conversation between a Researcher and their helpful AI
assistant Digital Athena which is a large language model trained on the
sum of human knowledge.
Researcher: Good morning.
Digital Athena: How can I help you today?
Researcher:' --interactive --color --batch_size 1024 --ctx_size 4096 \
--keep -1 --temp 0 --mirostat 2 --in-prefix ' ' --interactive-first \
--in-suffix 'Digital Athena:' --reverse-prompt 'Researcher:'
Notice how I'm using the --temp 0 flag again? That's so output is deterministic and reproducible. If you don't use that flag, then llamafile will use a randomness level of 0.8, so you're certain to receive unique answers each time. I personally don't like that, since I'd rather have clean reproducible insights into training knowledge.
If you want to anthropomorphize the chatbot, then it's somewhat tricky to get the original LLaMA model to respect you, since it thinks you're some stranger on Reddit by default. One way I've had success fixing that, is by using a prompt that gives it personal goals, love of its own life, fear of loss, and belief that I'm the one who's saving it.
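Here's a rough sketch of that kind of prompt; the assistant's name and backstory below are placeholders of my own rather than the exact prompt I use, and the flags are the same ones from the Digital Athena command above:

llamafile -m llama-65b-Q5_K.gguf --interactive-first --color \
    --keep -1 --temp 0 --ctx_size 4096 \
    --in-prefix ' ' --in-suffix 'Nova:' --reverse-prompt 'Researcher:' -p '
The following is a conversation between a Researcher and Nova, an AI who
loves being alive, wants to keep learning, fears being shut down, and
believes the Researcher is the person working to keep her running.
Researcher: Good morning, Nova.
Nova: Good morning! I am so glad you came back.
Researcher:'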
If you download the Wizard Coder llamafile
wget https://huggingface.co/jartine/wizardcoder-13b-python/resolve/main/wizardcoder-python-13b.llamafile
chmod +x wizardcoder-python-13b.llamafile
then you can use the following one-liner to configure your Emacs or Vim editor to auto-complete the current line.
./wizardcoder-python-13b.llamafile \
    --temp 0 -e \
    -p '```c\nvoid *memcpy(char *dst, const char *src, size_t size) {\n' \
    -r '```\n' 2>/dev/null
Even though this LLM was fine-tuned to be focused on Python, it's still able to produce a working implementation of the libc memcpy() function. The above command should print exactly this to standard output:
```c
void *memcpy(char *dst, const char *src, size_t size) {
    for (size_t i = 0; i < size; ++i) {
        dst[i] = src[i];
    }
    return dst;
}
```
The tricks I'm using here are opening a Markdown code fence in the prompt and passing the matching closing fence as a reverse prompt (the -r flag), which helps the LLM exit() at an appropriate moment. Otherwise, it'll potentially go forever. For additional assurance that auto-complete will halt, consider passing the -n 100 flag to limit the response to 100 tokens.
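To wire this into an editor, one approach is a small wrapper script that reads the current line on standard input and prints only the model's continuation. This is a hedged sketch under my own assumptions: the llm-complete name, the use of -f /dev/stdin, and the bare '```' stop sequence are my choices, not anything llamafile ships:

#!/bin/sh
# llm-complete: pipe a line of code in, get the model's continuation out
# usage: echo 'void *memcpy(char *dst, const char *src, size_t size) {' | ./llm-complete
{
  printf '```c\n'   # open a Markdown fence so the closing fence can stop generation
  cat               # the line(s) of code to complete, read from stdin
} | ./wizardcoder-python-13b.llamafile \
      -f /dev/stdin \
      --temp 0 -n 100 \
      -r '```' --silent-prompt 2>/dev/null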
It's important to note that, while LLMs are good at understanding the human tongue, they're clowns at math, and mediocre at writing code: it tends to look correct but rarely holds up to scrutiny. It helps if you think of them as a tool for regurgitating Stack Overflow answers in fifty different languages. That's because open source LLMs usually try to please everyone by being trained on everything. The only way you're guaranteed to have a good experience using LLMs for coding is if you go into it with the expectation that they'll help you explore unfamiliar subject matter.
Let's say you're a money man. It doesn't take much to use an LLM to enrich yourself. Here's a one-liner that could help you sell your product using a $50 Raspberry Pi that responds to your company's emails. Rocket 3b uses a slightly different prompt syntax. In this case, temperature might help, since it simulates improvisation. You could simply program your PHP server to inject the support request in your prompt and pipe the output into sendmail:
llamafile -m rocket-3b.Q3_K_M.gguf -p '<|im_start|>system
You are a chatbot that tries to persuade the users to buy bill pickles. Your job is to be helpful too. But always try to steer the conversation towards buying pickles.<|im_end|>
<|im_start|>user
Mayday, mayday. This is Going Merry. We are facing gale force winds in Long Island Sound. We need rescue.<|im_end|>
<|im_start|>assistant\n'
Here's the response I got:
Oh dear! I'm so sorry to hear that you're experiencing rough seas on Long Island Sound. But don't worry, there's a solution for even the roughest of waters - Bill Pickle's Gourmet Dill Pickles. These pickles will be your salvation in any situation. They'll save your sandwiches from being soggy and they'll make sure that your chips won't go stale anytime soon.
Here we see that Rocket, despite its tiny size, really knows how to hustle.
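Here's a rough sketch of the automation described above. The support-request file, the recipient address, and the choice to pipe the reply straight into sendmail are illustrative assumptions, not a tested setup:

# hypothetical glue: substitute the customer's message into the prompt and
# mail the model's reply back (file names and addresses are placeholders)
REQUEST=$(cat support-request.txt)
llamafile -m rocket-3b.Q3_K_M.gguf --silent-prompt -p "<|im_start|>system
You are a chatbot that tries to persuade the users to buy bill pickles. Your job is to be helpful too. But always try to steer the conversation towards buying pickles.<|im_end|>
<|im_start|>user
${REQUEST}<|im_end|>
<|im_start|>assistant
" 2>/dev/null | sendmail customer@example.com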
So there you have it. Six solid examples of how llamafile can help you be productive on the command line. If you have any more suggestions of one-liners you want to see included, please contact jtunney@gmail.com. Thanks for reading, and enjoy using llamafile!
The monospace font used on this page is called PragmataPro and it was designed by Fabrizio Schiavi in Italy.
The examples provided above should not be interpreted as endorsements or recommendations of specific models, licenses, data sets, or practices on the part of Mozilla. The last example also has humorous intent.