2024 llama.cpp 7840u: Difference between revisions
No edit summary |
No edit summary |
||
(4 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
I briefly had a Macbook M3 Max with 64GB. It was pretty good at running local LLMs, but couldn't stand the ergonomics and not being able to run Linux, so returned it. | I briefly had a Macbook M3 Max with 64GB. It was pretty good at running local LLMs, but couldn't stand the ergonomics and not being able to run Linux, so returned it. | ||
Line 72: | Line 73: | ||
`; | `; | ||
} | } | ||
_handleInput(event) { | _handleInput(event) { | ||
this.input = event.target.value; | this.input = event.target.value; | ||
} | } | ||
_reverse() { | _reverse() { | ||
this._reversed = this.input.split('').reverse().join(''); | this._reversed = this.input.split('').reverse().join(''); | ||
} | } | ||
@property private _reversed = ''; | @property private _reversed = ''; | ||
} | } | ||
Line 107: | Line 108: | ||
It's definitely not going to win any speed prizes even though is a smaller model, but it could be ok for non time sensitive results, or where using a tiny, faster model is useful. | It's definitely not going to win any speed prizes even though is a smaller model, but it could be ok for non time sensitive results, or where using a tiny, faster model is useful. | ||
{{Blikied|April 13, 2024}} | {{Blikied|April 13, 2024}} |
Latest revision as of 17:34, 14 April 2024
I briefly had a Macbook M3 Max with 64GB. It was pretty good at running local LLMs, but couldn't stand the ergonomics and not being able to run Linux, so returned it.
I picked up a Thinkpad P16s with an AMD 7840u to give Linux hardware a chance to catch up with Apple silicon. It's an amazing computer for the price, and can run LLMs. Here's how I set up llama.cpp to use ROCm.
Install ROCm, set an env variable for the 780m: export HSA_OVERRIDE_GFX_VERSION=11.0.0
clone llama.cpp and compile it:
make -j16 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx1030 DLLAMA_HIP_UMA=
ON
run it like this:
./main -m /home/vid/jan/models/mistral-ins-7b-q4/mistral-7b-instruct-v0.2.Q4_K_M
.gguf -p "example code for a lit Web Component that reverses a string" -n 50 -e
-ngl 33 -n -1
ggml_cuda_init: found 1 ROCm devices: Device 0: AMD Radeon Graphics, compute capability 11.0, VMM: no llm_load_tensors: ggml ctx size = 0.22 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: ROCm0 buffer size = 4095.05 MiB llm_load_tensors: CPU buffer size = 70.31 MiB .............................................................................................. llama_new_context_with_model: n_ctx = 512 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: ROCm0 KV buffer size = 64.00 MiB llama_new_context_with_model: KV self size = 64.00 MiB, K (f16): 32.00 MiB, V (f16): 32.00 MiB llama_new_context_with_model: ROCm_Host output buffer size = 0.12 MiB llama_new_context_with_model: ROCm0 compute buffer size = 81.00 MiB llama_new_context_with_model: ROCm_Host compute buffer size = 9.01 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 2 system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800 mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 1 simple example code for a lit Web Component that reverses a string ```javascript import { LitElement, html } from 'lit'; import { customElement, property } from 'lit/decorators.js'; @customElement('reverse-string') class ReverseString extends LitElement { static styles = css` :host { display: block; } `; @property({ type: String }) input = ; render() { return html` <input type="text" value=${this.input} @input=${this._handleInput} /> <button @click=${this._reverse}>Reverse</button> < p>${this._reversed}< /p> `; } _handleInput(event) { this.input = event.target.value; } _reverse() { this._reversed = this.input.split().reverse().join(); } @property private _reversed = ; } ``` ```css :host { display: block; } ``` This is a simple example of a Lit Web Component that reverses a string. The component has an input field for the user to enter a string, and a button to reverse the string when clicked. The reversed string is displayed below the button. The component uses the Lit library to define the custom element, and uses the `@customElement` decorator to define the element's name as 'reverse-string'. The `@property` decorator is used to define the input property, and the `static styles` property is used to define the component's styles. In the `render` method, the input field and button are created using template literals, and the reversed string is displayed using a reactive property `_reversed`. The input field's value is updated in the `_handleInput` method when the user types in the field, and the string is reversed in the `_reverse` method when the button is clicked. The reversed string is then assigned to the `_reversed` property, which updates the displayed string. [end of text] llama_print_timings: load time = 2488.76 ms llama_print_timings: sample time = 24.92 ms / 485 runs ( 0.05 ms per token, 19462.28 tokens per second) llama_print_timings: prompt eval time = 576.45 ms / 14 tokens ( 41.18 ms per token, 24.29 tokens per second) llama_print_timings: eval time = 38985.11 ms / 484 runs ( 80.55 ms per token, 12.41 tokens per second) llama_print_timings: total time = 39890.17 ms / 498 tokens Log end
It's definitely not going to win any speed prizes even though is a smaller model, but it could be ok for non time sensitive results, or where using a tiny, faster model is useful.