Last Update: Dec 25, 2024


There’s been a lot of talk on social media about the new Jetson Orin Nano, and I’ve contributed my fair share. But what’s the performance really like? To find out, I ran Ollama and tested 20 different models, measuring their speed. I used a prompt that produces a long response, and I’ve listed the timings below.

I received several DMs with questions about this, so I ran some extensive tests to answer them. These tests aren’t flawless, but they give a good sense of your favorite model’s performance. We’re mostly testing models of 4B parameters and under, as larger ones don’t run well on this Jetson.

Let’s do this. If you want to replicate these tests, here’s how to set up your Jetson Orin Nano to run them.

Setting up Jetson Containers

Jetson containers create a simple, isolated space for AI/ML apps on NVIDIA Jetson platforms. They make managing dependencies easy, ensure consistent performance, and speed up prototyping and deployment.

They package your app and its libraries into a single, self-contained unit, making it easier to share, reproduce, and deploy your work on various Jetson devices.

Huge shoutout to Dustin Franklin for his work. He’s a principal engineer at NVIDIA and has built a ton of great tools for working with the Jetson.

I used his Jetson Containers tool to create a container with Ollama to run these tests. If you want to do the same, follow these instructions:

Clone the repository:

git clone https://github.com/dusty-nv/jetson-containers && cd jetson-containers

Install the tools:

sudo bash install.sh

This script will install the necessary Python requirements and add useful tools like autotag to your system’s PATH.
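Once the install script finishes, open a new shell (or re-source your profile) so the PATH change takes effect, then sanity-check the tools:

# Confirm the helper is on your PATH, then see which image it would select
which autotag
autotag ollama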

Installing Ollama

Ollama is designed to run within a container, which makes the installation process cleaner and ensures compatibility. Here’s how to do it using the jetson-containers tools:

1. Run the Ollama container

The jetson-containers repository provides a convenient way to launch Ollama. Use the following command:

jetson-containers run $(autotag ollama)

Note:

  •   autotag ollama automatically determines the correct Ollama container image based on your Jetson’s JetPack version.
  •   jetson-containers run pulls the image (if it’s not already present) and starts the container.
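If autotag can’t find a matching image, or you want to pin an exact version, you can also pass an image name directly. The tag below is only an example; check which dustynv/ollama tags exist for your JetPack release:

# Skip autotag and run a specific image (example tag; verify it exists first)
jetson-containers run dustynv/ollama:r36.2.0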

You should see something like this:

[Screenshot: jetson-containers asking to pull the Ollama image]

Select yes, and it will start pulling the image. If successful, it should look like this:

[Screenshot: the Ollama image pulled successfully]

You should now be dropped into a running Docker container. You can open another terminal and verify this by running:

sudo docker ps

[Screenshot: sudo docker ps showing the running Ollama container]

  2.  Access Ollama: Once the container is running, you can interact with Ollama through its command-line interface or its REST API, which listens on port 11434 by default. Check the Ollama documentation or the container logs for details.
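For example, from a second terminal on the Jetson you can hit the REST API directly:

# Request a completion over the API (the model must already be pulled)
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2:0.5b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'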

Important Notes:

  •   JetPack Version: The compatibility of Ollama and the container images depends on your Jetson’s JetPack version. Make sure your JetPack version is supported by the jetson-containers repository and Ollama.
  •   Storage: Ollama can require significant storage space for its models. Ensure you have enough free space on your Jetson’s storage. Consider using an NVMe SSD for better performance.
  •   Memory: Running large language models like those used with Ollama can be memory-intensive. Monitor your Jetson’s memory usage and consider adding swap space if necessary (see the sketch after this list).
  •   Documentation: For the most up-to-date and specific instructions, always refer to the official documentation for jetson-containers and Ollama.
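For the memory bullet above, tegrastats (which ships with JetPack) shows RAM and GPU usage live, and adding swap is a few standard commands. A minimal sketch, assuming an 8 GB swap file suits your setup:

# Watch RAM/GPU utilization while a model is running
tegrastats

# One-time swap file setup (add an /etc/fstab entry to make it persistent)
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile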

So, if you have a Jetson and want to do your own tests (or use Ollama), that’s the best way. It only takes a few minutes. Now, let’s look at the tests I ran.

How and What I Tested

I downloaded a bunch of models that you can run on a Jetson Orin Nano. Some run very fast and some not so fast, but there are a ton of workable models to choose from.

Here are the models I have installed that we’ll test today:

[Screenshot: the list of installed Ollama models]
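If you want to mirror this lineup, you can pull everything up front from inside the container, so download time never skews a timing run. This loop covers the models tested below:

# Pre-pull every model used in these tests
for m in qwen2:0.5b qwen2:1.5b qwen2:7b gemma2:2b phi3:latest \
         qwen:0.5b qwen:1.8b qwen:4b \
         llama3.2:1b llama3.2:latest gemma:2b \
         qwen2.5:0.5b qwen2.5-coder:0.5b qwen2.5-coder:1.5b \
         tinyllama orca-mini tinydolphin; do
  ollama pull "$m"
done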

Here’s how I will test them. I’ll send the following prompt, which is sure to generate a nice long answer:

Write a comprehensive technical article (approximately 5000 words) on the evolution of deep learning architectures for natural language processing (NLP) tasks.

Include the following:

- A detailed historical overview: Starting from early neural networks like perceptrons, trace the development of key architectures like recurrent neural networks (RNNs), long short-term memory (LSTM) networks, gated recurrent units (GRUs), convolutional neural networks (CNNs) for text, and transformers (including attention mechanisms).
- A discussion of key breakthroughs: Analyze significant advancements such as word embeddings (Word2Vec, GloVe, FastText), attention mechanisms (self-attention, multi-head attention), and pre-trained language models (BERT, GPT, XLNet).
- An exploration of current trends: Discuss emerging research areas like graph neural networks for NLP, few-shot learning, and the role of unsupervised learning in NLP.
- A critical analysis: Evaluate the strengths and weaknesses of different architectures, discuss their impact on various NLP tasks (e.g., machine translation, sentiment analysis, question answering), and analyze the challenges and future directions of deep learning in NLP.
- A comprehensive bibliography: Include a list of relevant academic papers and publications.

I’ll use the same prompt for consistency. Remember that large language models are not deterministic, so they will generate a different number of tokens, but this should get us close.
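If you’re replicating this, you can avoid pasting that wall of text by hand each run: ollama run accepts a prompt as a command-line argument, and with --verbose the timing stats print after the response. Assuming the prompt is saved in prompt.txt:

# prompt.txt holds the benchmark prompt above
ollama run qwen2:0.5b --verbose "$(cat prompt.txt)"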

I will then run each model in verbose mode in Ollama:

ollama run qwen2:0.5b --verbose

It will then output something like this:

Qwen 2 0.5b results:

| Metric               | Value           |
|----------------------|-----------------|
| Total Duration       | 42.561817973s   |
| Load Duration        | 47.560066ms     |
| Prompt Eval Count    | 260 tokens      |
| Prompt Eval Duration | 1.235s          |
| Prompt Eval Rate     | 210.53 tokens/s |
| Eval Count           | 715 tokens      |
| Eval Duration        | 41.271s         |
| Eval Rate            | 17.32 tokens/s  |

But what does this mean? Here’s a breakdown:

  • Total Duration: The total time it took the model to complete the task. This includes all processing time.
  • Load Duration: The model’s time to load or initialize before starting the task.
  • Prompt Eval Count: The number of tokens (individual words or sub-word units) in the input prompt given to the model.
  • Prompt Eval Duration: The model’s time to process and understand the input prompt.
  • Prompt Eval Rate: The speed at which the model processed the input prompt, measured in tokens per second.
  • Eval Count: The number of tokens the model generated in its response (the prompt tokens are counted separately, above).
  • Eval Duration: The time the model spent generating the response.
  • Eval Rate: The generation speed, measured in tokens per second (Eval Count divided by Eval Duration).
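As a quick sanity check, Eval Rate is just Eval Count divided by Eval Duration, which lines up with the sample output above:

# 715 tokens / 41.271 s
echo "scale=2; 715 / 41.271" | bc   # prints 17.32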

This should give us a good idea of how fast the Jetson Orin Nano can handle models with Ollama. Let’s dive in!

QWEN2 Models

Qwen2 is a new series of large language models from the Alibaba Group.

QWEN2 0.5b

Let’s start with Qwen2 0.5b. This should be a fast one:

ollama run qwen2:0.5b --verbose

[Screenshot: qwen2:0.5b benchmark results]

Not bad at all! This is decent speed, and great for a model this size. Keep in mind that shorter, chat-style conversations will be faster, as shown here:

[Screenshot: qwen2:0.5b timings on a short chat prompt]

But we don’t want the fastest result. We want to make it sweat a little. Let’s go with a bigger model.


QWEN2 1.5b

ollama run qwen2:1.5b --verbose

This one was a bit slower, as expected. Still usable in many cases.

[Screenshot: qwen2:1.5b benchmark results]

QWEN2 7b

ollama run qwen2:7b --verbose

[Screenshot: qwen2:7b benchmark results]

Now we’re looking at… pretty slow. This is to be expected. If you’re looking for real-time fast chat-type applications, it’s best to stick with the smaller models.

Let’s check out some other models.

Gemma2

Google’s Gemma 2 model is available in three sizes (2B, 9B, and 27B), featuring a brand-new architecture designed for class-leading performance and efficiency.

ollama run gemma2:2b --verbose

[Screenshot: gemma2:2b benchmark results]

Phi3 3.8b

Phi-3 is a family of lightweight 3B (Mini) and 14B (Medium) state-of-the-art open models by Microsoft.

ollama run phi3:latest --verbose

[Screenshot: phi3:latest benchmark results]

QWEN 0.5b

Qwen (1.5) is a series of transformer-based large language models by Alibaba Cloud, spanning from 0.5B to 110B parameters and pre-trained on a large volume of data, including web texts, books, and code.

ollama run qwen:0.5b --verbose

[Screenshot: qwen:0.5b benchmark results]

QWEN 1.8b

ollama run qwen:1.8b --verbose

[Screenshot: qwen:1.8b benchmark results]

QWEN 4b

ollama run qwen:4b --verbose

[Screenshot: qwen:4b benchmark results]

Llama 3.2 1b

The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks.

Meta’s Llama 3.2 goes small with 1B and 3B models.

ollama run llama3.2:1b --verbose

[Screenshot: llama3.2:1b benchmark results]

Llama 3.2 3b

ollama run llama3.2:latest --verbose

[Screenshot: llama3.2:latest benchmark results]

Gemma 2b

Gemma is a family of lightweight, state-of-the-art open models built by Google DeepMind, updated to version 1.1.

ollama run gemma:2b --verbose

[Screenshot: gemma:2b benchmark results]



Qwen 2.5

Qwen2.5 models are pretrained on Alibaba’s latest large-scale dataset, encompassing up to 18 trillion tokens. The models support a context window of up to 128K tokens and are multilingual.

Qwen 2.5 0.5b

ollama run qwen2.5:0.5b --verbose

[Screenshot: qwen2.5:0.5b benchmark results]

Note: the 1.5b and 3b models go into an “endless loop” with this prompt. I’ll look into this later.
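If you run into the same loop, one workaround worth trying (I haven’t verified it fixes this case) is capping the response length with Ollama’s num_predict parameter from inside the REPL:

ollama run qwen2.5:1.5b --verbose
>>> /set parameter num_predict 1024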


Qwen 2.5-coder

The latest series of Code-Specific Qwen models, with significant improvements in code generation, code reasoning, and code fixing.

Qwen 2.5-coder 0.5b

ollama run qwen2.5-coder:0.5b --verbose

[Screenshot: qwen2.5-coder:0.5b benchmark results]

Qwen 2.5-coder 1.5b

ollama run qwen2.5-coder:1.5b --verbose

[Screenshot: qwen2.5-coder:1.5b benchmark results]

TinyLlama

The TinyLlama project is an open endeavor to train a compact 1.1B Llama model on 3 trillion tokens.

ollama run tinyllama --verbose

[Screenshot: tinyllama benchmark results]

Orca-mini

A general-purpose model ranging from 3 billion parameters to 70 billion, suitable for entry-level hardware.

ollama run orca-mini --verbose

[Screenshot: orca-mini benchmark results]

TinyDolphin

An experimental 1.1B parameter model trained on the new Dolphin 2.8 dataset by Eric Hartford and based on TinyLlama.

ollama run tinydolphin --verbose

[Screenshot: tinydolphin benchmark results]

Conclusion

So here’s a bar graph of the results. If you want the CSV file, you can download it here.

[Bar graph: eval rates for all tested models]
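If you’d rather generate your own CSV, the verbose stats are easy to scrape. A sketch, assuming the stats land on stderr (as they do in recent Ollama builds) and reusing prompt.txt from earlier:

# Collect each model's eval rate into results.csv
echo "model,eval_rate_tps" > results.csv
for m in qwen2:0.5b gemma2:2b llama3.2:1b; do
  rate=$(ollama run "$m" --verbose "$(cat prompt.txt)" 2>&1 >/dev/null \
    | awk '/^eval rate:/ {print $3}')
  echo "$m,$rate" >> results.csv
done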

Most of these models performed well. I know LLMs and Ollama aren’t the primary use case for these devices, but I’ve been asked about them several times, so I spent a few hours on these tests to give you a good idea of how they perform.

If you have any comments or questions, yell at me!






Published: Dec 25, 2024 by Jeremy Morgan. Contact me before republishing this content.