2024 llama.cpp 7840u


Latest revision as of 17:34, 14 April 2024

I briefly had a MacBook with an M3 Max and 64GB. It was pretty good at running local LLMs, but I couldn't stand the ergonomics or not being able to run Linux, so I returned it.

I picked up a ThinkPad P16s with an AMD 7840U to give Linux hardware a chance to catch up with Apple silicon. It's an amazing computer for the price, and it can run LLMs. Here's how I set up llama.cpp to use ROCm.

Install ROCm, then set an environment variable for the 780M iGPU:

export HSA_OVERRIDE_GFX_VERSION=11.0.0
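A minimal sketch of applying that override in the current shell and checking which gfx target the ROCm runtime reports. `rocminfo` ships with ROCm; the `grep` pattern is an assumption about its output, so treat the check as a convenience rather than a guarantee.

```shell
# Ask the ROCm runtime to treat the 780M as an 11.0.0 (gfx1100-class) device.
export HSA_OVERRIDE_GFX_VERSION=11.0.0

# Optional check: list the gfx agents the runtime sees.
# Skipped quietly if rocminfo is not on PATH.
if command -v rocminfo >/dev/null 2>&1; then
  rocminfo | grep -i 'gfx' || echo "no gfx agent reported"
fi
```

The export only lasts for the current shell session, so it needs to be set again (or added to a shell startup file) before each run of llama.cpp.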

Clone llama.cpp and compile it:

make -j16 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx1030 DLLAMA_HIP_UMA=ON

Run it like this:

./main -m /home/vid/jan/models/mistral-ins-7b-q4/mistral-7b-instruct-v0.2.Q4_K_M.gguf -p "example code for a lit Web Component that reverses a string" -n 50 -e -ngl 33 -n -1

ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, compute capability 11.0, VMM: no
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  4095.05 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
..............................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =    64.00 MiB
llama_new_context_with_model: KV self size  =   64.00 MiB, K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =    81.00 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =     9.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2

system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 1


 simple example code for a lit Web Component that reverses a string

```javascript
import { LitElement, html } from 'lit';
import { customElement, property } from 'lit/decorators.js';

@customElement('reverse-string')
class ReverseString extends LitElement {

  static styles = css`
    :host {
      display: block;
    }
  `;

  @property({ type: String }) input = '';

  render() {
    return html`
      <input type="text" value=${this.input} @input=${this._handleInput} />
      <button @click=${this._reverse}>Reverse</button>
      <p>${this._reversed}</p>
    `;
  }

  _handleInput(event) {
    this.input = event.target.value;
  }

  _reverse() {
    this._reversed = this.input.split('').reverse().join('');
  }

  @property private _reversed = '';
}
```

```css
:host {
  display: block;
}
```

This is a simple example of a Lit Web Component that reverses a string. The component has an input field for the user to enter a string, and a button to reverse the string when clicked. The reversed string is displayed below the button.

The component uses the Lit library to define the custom element, and uses the `@customElement` decorator to define the element's name as 'reverse-string'. The `@property` decorator is used to define the input property, and the `static styles` property is used to define the component's styles.

In the `render` method, the input field and button are created using template literals, and the reversed string is displayed using a reactive property `_reversed`.

The input field's value is updated in the `_handleInput` method when the user types in the field, and the string is reversed in the `_reverse` method when the button is clicked. The reversed string is then assigned to the `_reversed` property, which updates the displayed string. [end of text]
 
llama_print_timings:        load time =    2488.76 ms
llama_print_timings:      sample time =      24.92 ms /   485 runs   (    0.05 ms per token, 19462.28 tokens per second)
llama_print_timings: prompt eval time =     576.45 ms /    14 tokens (   41.18 ms per token,    24.29 tokens per second)
llama_print_timings:        eval time =   38985.11 ms /   484 runs   (   80.55 ms per token,    12.41 tokens per second)
llama_print_timings:       total time =   39890.17 ms /   498 tokens
Log end
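
The 64 MiB KV cache reported in the log above can be reproduced by hand. This is a rough sketch: `n_ctx`, the layer count, and the f16 element size come from the log, while the 8 KV heads and head dimension of 128 are assumptions about Mistral 7B's architecture that the log does not print.

```shell
# Values taken from the log above.
n_ctx=512
n_layer=32            # "offloading 32 repeating layers to GPU"
bytes_per_elem=2      # K and V are stored as f16

# Assumptions (not printed in the log): Mistral 7B's grouped-query attention
# uses 8 KV heads with a head dimension of 128.
n_kv_heads=8
head_dim=128

# One K entry per token, per layer, per KV head, per head dimension.
k_bytes=$(( n_ctx * n_layer * n_kv_heads * head_dim * bytes_per_elem ))
k_mib=$(( k_bytes / 1024 / 1024 ))
kv_mib=$(( 2 * k_mib ))   # V is the same size as K

echo "K = ${k_mib} MiB, K+V = ${kv_mib} MiB"
```

Under those assumptions the result matches the "K (f16): 32.00 MiB" and "KV self size = 64.00 MiB" lines, and it scales linearly with `n_ctx`, so a larger context costs proportionally more memory.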


At about 12.4 tokens per second for generation, it's definitely not going to win any speed prizes, even though it is a smaller model, but it could be OK for results that aren't time-sensitive, or where using a tiny, faster model is useful.
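
The throughput figures can be checked against the raw timings. A small sketch, with the numbers copied by hand from the llama_print_timings lines above (the variable names are just for illustration):

```shell
# Figures copied from the llama_print_timings lines above.
eval_ms=38985.11
eval_tokens=484
prompt_ms=576.45
prompt_tokens=14

# tokens per second = tokens / (milliseconds / 1000)
gen_tps=$(awk -v t="$eval_tokens" -v ms="$eval_ms" 'BEGIN { printf "%.2f", t / (ms / 1000) }')
prompt_tps=$(awk -v t="$prompt_tokens" -v ms="$prompt_ms" 'BEGIN { printf "%.2f", t / (ms / 1000) }')

echo "generation: ${gen_tps} tok/s, prompt: ${prompt_tps} tok/s"
```

The same two lines of arithmetic work for any other run, which makes it easy to compare different `-ngl` settings or models.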




Blikied on April 13, 2024