2024 llama.cpp 7840u
Latest revision as of 17:34, 14 April 2024
I briefly had a MacBook M3 Max with 64GB. It was pretty good at running local LLMs, but I couldn't stand the ergonomics or not being able to run Linux, so I returned it.
I picked up a ThinkPad P16s with an AMD 7840U to give Linux hardware a chance to catch up with Apple silicon. It's an amazing computer for the price, and it can run LLMs. Here's how I set up llama.cpp to use ROCm.
Install ROCm, then set an environment variable for the 780M (the 7840U's integrated GPU):
export HSA_OVERRIDE_GFX_VERSION=11.0.0
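ROCm identifies GPUs by `gfx` target names, and `HSA_OVERRIDE_GFX_VERSION` takes the same target written as `major.minor.stepping`. The Radeon 780M's native target is gfx1103; overriding to `11.0.0` makes the runtime present it as gfx1100, presumably so that ROCm libraries built for gfx1100 will run on it. A minimal sketch of the name mapping, assuming single-digit minor and stepping fields; `gfx_target` is a hypothetical helper, not a ROCm API:

```python
def gfx_target(hsa_version: str) -> str:
    """Convert an HSA_OVERRIDE_GFX_VERSION-style string such as "11.0.0"
    into the matching gfx target name such as "gfx1100".

    Assumption: minor and stepping are single decimal digits, which holds
    for the RDNA targets discussed here. Anything else is rejected rather
    than guessed.
    """
    major, minor, stepping = (int(part) for part in hsa_version.split("."))
    if not (0 <= minor <= 9 and 0 <= stepping <= 9):
        raise ValueError(f"unsupported version for this sketch: {hsa_version}")
    return f"gfx{major}{minor}{stepping}"


print(gfx_target("11.0.3"))  # gfx1103, the 780M's native target
print(gfx_target("11.0.0"))  # gfx1100, what the override makes the runtime report
```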
Clone llama.cpp and compile it:
make -j16 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx1030 DLLAMA_HIP_UMA=ON
Run it like this:
./main -m /home/vid/jan/models/mistral-ins-7b-q4/mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "example code for a lit Web Component that reverses a string" -e -ngl 33 -n -1
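If you'd rather drive the same run from a script, passing the arguments as a list sidesteps shell quoting and keeps the long model path in one piece. A minimal sketch using Python's `subprocess`, assuming the `./main` binary and model path from the command above; `build_argv` and `run_llama` are hypothetical helper names, and only the `-m`, `-p`, `-e`, `-ngl` and `-n` flags from that command are used:

```python
import os
import subprocess

MODEL = "/home/vid/jan/models/mistral-ins-7b-q4/mistral-7b-instruct-v0.2.Q4_K_M.gguf"


def build_argv(model: str, prompt: str, n_gpu_layers: int = 33, n_predict: int = -1) -> list[str]:
    """Assemble the same flags as the ./main command above (hypothetical helper)."""
    return [
        "./main",
        "-m", model,                # model file
        "-p", prompt,               # prompt text
        "-e",                       # process escape sequences in the prompt
        "-ngl", str(n_gpu_layers),  # layers to offload; the log shows 33/33 offloaded
        "-n", str(n_predict),       # tokens to predict; -1 means no fixed limit
    ]


def run_llama(prompt: str) -> subprocess.CompletedProcess:
    """Run ./main with the ROCm override set only for this child process."""
    env = dict(os.environ, HSA_OVERRIDE_GFX_VERSION="11.0.0")
    return subprocess.run(build_argv(MODEL, prompt), env=env, capture_output=True, text=True)
```

Setting the override in `env` rather than exporting it keeps it scoped to the llama.cpp process; `run_llama(...).stdout` then holds the generated text.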
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, compute capability 11.0, VMM: no
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 4095.05 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
..............................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 64.00 MiB
llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.12 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 81.00 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 9.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 1

simple example code for a lit Web Component that reverses a string
```javascript
import { LitElement, html } from 'lit';
import { customElement, property } from 'lit/decorators.js';

@customElement('reverse-string')
class ReverseString extends LitElement {
  static styles = css`
    :host {
      display: block;
    }
  `;

  @property({ type: String }) input = '';

  render() {
    return html`
      <input type="text" value=${this.input} @input=${this._handleInput} />
      <button @click=${this._reverse}>Reverse</button>
      <p>${this._reversed}</p>
    `;
  }

  _handleInput(event) {
    this.input = event.target.value;
  }

  _reverse() {
    this._reversed = this.input.split('').reverse().join('');
  }

  @property private _reversed = '';
}
```
```css
:host {
  display: block;
}
```
This is a simple example of a Lit Web Component that reverses a string. The component has an input field for the user to enter a string, and a button to reverse the string when clicked. The reversed string is displayed below the button.

The component uses the Lit library to define the custom element, and uses the `@customElement` decorator to define the element's name as 'reverse-string'. The `@property` decorator is used to define the input property, and the `static styles` property is used to define the component's styles.

In the `render` method, the input field and button are created using template literals, and the reversed string is displayed using a reactive property `_reversed`.

The input field's value is updated in the `_handleInput` method when the user types in the field, and the string is reversed in the `_reverse` method when the button is clicked. The reversed string is then assigned to the `_reversed` property, which updates the displayed string. [end of text]

llama_print_timings: load time = 2488.76 ms
llama_print_timings: sample time = 24.92 ms / 485 runs ( 0.05 ms per token, 19462.28 tokens per second)
llama_print_timings: prompt eval time = 576.45 ms / 14 tokens ( 41.18 ms per token, 24.29 tokens per second)
llama_print_timings: eval time = 38985.11 ms / 484 runs ( 80.55 ms per token, 12.41 tokens per second)
llama_print_timings: total time = 39890.17 ms / 498 tokens
Log end
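The 64 MiB KV cache in the log can be checked by hand. Assuming Mistral 7B's published shape (32 layers, 8 key/value heads of dimension 128 thanks to grouped-query attention; none of this is printed in the excerpt above) and f16 entries at 2 bytes each, K and V each need n_ctx × layers × kv_heads × head_dim × 2 bytes. The sketch reproduces the 32 MiB per tensor reported for n_ctx = 512, then totals the buffer sizes the log reports to get a rough memory footprint for this run:

```python
MIB = 1024 * 1024

# Assumed Mistral 7B shape (not shown in the log excerpt).
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
F16_BYTES = 2


def kv_tensor_mib(n_ctx: int) -> float:
    """Size in MiB of the K cache (or, equally, the V cache) for n_ctx tokens."""
    return n_ctx * N_LAYERS * N_KV_HEADS * HEAD_DIM * F16_BYTES / MIB


k_mib = kv_tensor_mib(512)
print(f"K: {k_mib:.2f} MiB, K+V: {2 * k_mib:.2f} MiB")  # K: 32.00 MiB, K+V: 64.00 MiB

# Buffer sizes copied from the log above, in MiB.
buffers = {
    "ROCm0 weights": 4095.05,
    "CPU weights": 70.31,
    "ROCm0 KV cache": 64.00,
    "ROCm_Host output": 0.12,
    "ROCm0 compute": 81.00,
    "ROCm_Host compute": 9.01,
}
print(f"total: {sum(buffers.values()):.2f} MiB")  # total: 4319.49 MiB
```

So this run needed roughly 4.2 GiB, almost all of it model weights. The KV cache grows linearly with context: under the same assumptions, `kv_tensor_mib(4096)` gives 256 MiB each for K and V.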
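It's definitely not going to win any speed prizes: even with a smaller 7B model it generates about 12.4 tokens per second (prompt processing ran at about 24.3 tokens per second). Still, it could be OK for results that aren't time-sensitive, or where using a tiny, faster model is useful.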
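The speed figures come straight from the `llama_print_timings` lines: tokens per second is the token count divided by the elapsed seconds. A minimal sketch that parses those lines and recomputes the rates; the regex assumes the line layout shown in this log, which may differ in other llama.cpp versions, and `tokens_per_second` is a hypothetical helper:

```python
import re

# Timing lines copied from the run above.
LOG = """\
llama_print_timings: prompt eval time = 576.45 ms / 14 tokens ( 41.18 ms per token, 24.29 tokens per second)
llama_print_timings: eval time = 38985.11 ms / 484 runs ( 80.55 ms per token, 12.41 tokens per second)
"""

# Captures "<phase> time = <ms> ms / <count> tokens|runs", tolerating extra padding.
TIMING_RE = re.compile(
    r"llama_print_timings:\s*(.+?) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) (?:tokens|runs)"
)


def tokens_per_second(log: str) -> dict[str, float]:
    """Recompute throughput for each timed phase as count / seconds."""
    rates = {}
    for phase, ms, count in TIMING_RE.findall(log):
        rates[phase] = int(count) / (float(ms) / 1000.0)
    return rates


for phase, rate in tokens_per_second(LOG).items():
    print(f"{phase}: {rate:.2f} tokens/s")
# prompt eval: 24.29 tokens/s
# eval: 12.41 tokens/s
```

At that generation rate, the 484-token answer took about 39 seconds, which matches the total time in the log.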