[AI] Chinese-LLaMA-Alpaca-2 1.3B, 7B, 13B: choosing a llama.cpp quantization method and testing inference results on x86

1.3B Q4_0 (very fast, but useless)

llm_load_print_meta: model type = ?B

llm_load_print_meta: model ftype = Q4_0

llm_load_print_meta: model params = 1.26 B

llm_load_print_meta: model size = 733.08 MiB (4.87 BPW)

llm_load_print_meta: general.name = LLaMA v2

llm_load_print_meta: BOS token = 1 ''

llm_load_print_meta: EOS token = 2 ''

llm_load_print_meta: UNK token = 0 ''

llm_load_print_meta: PAD token = 0 ''

llm_load_print_meta: LF token = 13 ''

llm_load_tensors: ggml ctx size = 0.03 MiB

llm_load_tensors: offloading 4 repeating layers to GPU

llm_load_tensors: offloading non-repeating layers to GPU

llm_load_tensors: offloaded 5/5 layers to GPU

llm_load_tensors: CPU buffer size = 121.50 MiB

llm_load_tensors: CUDA0 buffer size = 611.58 MiB

..............................

llama_new_context_with_model: n_ctx = 4096

llama_new_context_with_model: freq_base = 10000.0

llama_new_context_with_model: freq_scale = 1

llama_kv_cache_init: CUDA0 KV buffer size = 256.00 MiB

llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB

llama_new_context_with_model: graph splits (measure): 3

llama_new_context_with_model: CUDA0 compute buffer size = 288.00 MiB

llama_new_context_with_model: CUDA_Host compute buffer size = 16.00 MiB

system_info: n_threads = 1 / 6 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

main: interactive mode on.

Input prefix with BOS

Input prefix: ' [INST] '

Input suffix: ' [/INST]'

sampling:

repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000

top_k = 40, tfs_z = 1.000, top_p = 0.900, min_p = 0.050, typical_p = 1.000, temp = 0.500

mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000

sampling order:

CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp

generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 0

== Running in interactive mode. ==

- Press Ctrl+C to interject at any time.

- Press Return to return control to LLaMa.

- To return control without starting a new line, end your input with '/'.

- If you want to submit another line, end your input with '\'.

[INST]

You are a helpful assistant. 你是一个乐于助人的助手。

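[/INST] Hello, is there anything I can help you with?

[INST] Implement a hashmap in GNU C.

[/INST] OK. First we need to understand what GNU C is: it is a standard library for writing high-performance, scalable code libraries. The GNU C language is a general-purpose, procedural programming language that supports multiple operating system platforms and can be embedded in other programs.

[INST] Provide the source code

[/INST] OK, we can obtain the source code in the following ways:

1. Use the compiler's preprocessing features to optimize;

2. Use debugging tools to inspect the program's execution flow;

3. Check for errors or exception messages;

4. Organize the program into multiple layers so that it is easier to understand and maintain.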

[INST]

llama_print_timings: load time = 5540.64 ms

llama_print_timings: sample time = 77.46 ms / 135 runs ( 0.57 ms per token, 1742.86 tokens per second)

llama_print_timings: prompt eval time = 95.38 ms / 73 tokens ( 1.31 ms per token, 765.34 tokens per second)

llama_print_timings: eval time = 546.91 ms / 134 runs ( 4.08 ms per token, 245.01 tokens per second)

llama_print_timings: total time = 211715.93 ms / 207 tokens
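
The per-token figures in these timing lines are just elapsed time divided by token count. As a minimal sketch (the helper names `parse_timing` and `throughput` are my own, not part of llama.cpp), the eval line above can be re-derived like this:

```python
import re

# One line copied from the log above.
LINE = ("llama_print_timings: eval time = 546.91 ms / 134 runs "
        "( 4.08 ms per token, 245.01 tokens per second)")


def parse_timing(line):
    """Return (elapsed_ms, count) from a llama_print_timings line."""
    m = re.search(r"=\s*([\d.]+) ms /\s*(\d+) (?:runs|tokens)", line)
    if m is None:
        raise ValueError("not a timing line with a count: " + line)
    return float(m.group(1)), int(m.group(2))


def throughput(elapsed_ms, count):
    """Return (ms per token, tokens per second)."""
    return elapsed_ms / count, count * 1000.0 / elapsed_ms


elapsed_ms, count = parse_timing(LINE)
ms_per_token, tokens_per_second = throughput(elapsed_ms, count)
print(f"{ms_per_token:.2f} ms per token, {tokens_per_second:.2f} tokens per second")
```

The "total time" line is much larger than the sum of the others, presumably because in interactive mode it also counts the time spent waiting for input, so the eval line is the one to compare between runs.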

7B Q4_0 (acceptable speed, results slightly better than useless)

llm_load_print_meta: model type = 7B

llm_load_print_meta: model ftype = Q4_0

llm_load_print_meta: model params = 6.93 B

llm_load_print_meta: model size = 3.69 GiB (4.57 BPW)

llm_load_print_meta: general.name = LLaMA v2

llm_load_print_meta: BOS token = 1 ''

llm_load_print_meta: EOS token = 2 ''

llm_load_print_meta: UNK token = 0 ''

llm_load_print_meta: PAD token = 0 ''

llm_load_print_meta: LF token = 13 ''

llm_load_tensors: ggml ctx size = 0.22 MiB

llm_load_tensors: offloading 19 repeating layers to GPU

llm_load_tensors: offloaded 19/33 layers to GPU

llm_load_tensors: CPU buffer size = 3773.70 MiB

llm_load_tensors: CUDA0 buffer size = 2063.29 MiB

..............................................................................................

llama_new_context_with_model: n_ctx = 4096

llama_new_context_with_model: freq_base = 10000.0

llama_new_context_with_model: freq_scale = 1

llama_kv_cache_init: CUDA_Host KV buffer size = 832.00 MiB

llama_kv_cache_init: CUDA0 KV buffer size = 1216.00 MiB

llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB

llama_new_context_with_model: graph splits (measure): 5

llama_new_context_with_model: CUDA0 compute buffer size = 288.00 MiB

llama_new_context_with_model: CUDA_Host compute buffer size = 288.00 MiB

main: interactive mode on.

Input prefix with BOS

Input prefix: ' [INST] '

Input suffix: ' [/INST]'

sampling:

repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000

top_k = 40, tfs_z = 1.000, top_p = 0.900, min_p = 0.050, typical_p = 1.000, temp = 0.500

mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000

sampling order:

CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp

generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 0

== Running in interactive mode. ==

- Press Ctrl+C to interject at any time.

- Press Return to return control to LLaMa.

- To return control without starting a new line, end your input with '/'.

- If you want to submit another line, end your input with '\'.

[INST]

You are a helpful assistant. 你是一个乐于助人的助手。

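[/INST] Sorry, I cannot understand your instruction or request. Please try to state the problem more clearly and explicitly so that I can serve you better.

[INST] Implement a hashmap in GNU C.

[/INST] To implement a hash table (Hash Map) in GNU C, you can use the built-in functions of standard C together with custom data structures. Here is one possible approach:

1. Define your custom hash table type:

```c
typedef struct hash_map {
    int key; // key
    void *val; // value
} HashMap;
```

2. Implement the function that computes the hash. It takes a key and returns the corresponding index. You can use any hash function you like:

```c
int hash_map(const int key) {
    // your custom hash function implementation goes here
}
```

3. In the main program, create an empty hash table and assign a unique index to each key. You can do this with a loop or with recursion:

```c
int main() {
    HashMap hash_map; // initialize the hash table
    for (int i = 0; i
        int key = i; // key
        void *val; // value
        hash_map.key = i; // assign the index as the key
        hash_map.val = &val; // store the pointer to the value in the hash table
}
```

4. Use the hash_map function to compute the hash, and use hash_map.key to get the corresponding index. You can then access hash_map.val to retrieve the corresponding value:

```c
void print(int key, void *val) {
    printf("Key: %d Value: %p\n", key, val); // print the key and the value
}

int main() {
    HashMap hash_map; // initialize the hash table
    for (int i = 0; i
        int key = i; // key
        void *val; // value
        hash_map.key = i; // assign the index as the key
        hash_map.val = &val; // store the pointer to the value in the hash table
        print(i, val); // print the key and the value
}
```

[INST] Provide the source code

[/INST] Here is a simple example that implements a hash table with a custom hash function:

```c
#include <stdio.h>

// custom hash function implementation goes here

int main() {
    int key = 0; // initialize the key to 0
    void *val; // initialize the value pointer to nullptr
    while (1) {
        printf("Enter a number to add: ");
        scanf("%d", &key); // read the number entered by the user
        if (scanf("%p", &val) == EOF || val == nullptr) {
            break;
        }
        int hash_index = hash_map(key); // use the custom hash function to compute the hash and get the corresponding index
        printf("Key: %d Value: %p\n", key, val); // print the key and the value
        *(int *)((void ** is defined as a pointer type.**val + 1) - 1) = hash_index; // store the pointer to the value in the hash table
    }
    return 0;
}
```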

[INST]

llama_print_timings: load time = 29506.11 ms

llama_print_timings: sample time = 480.48 ms / 816 runs ( 0.59 ms per token, 1698.32 tokens per second)

llama_print_timings: prompt eval time = 5745.60 ms / 73 tokens ( 78.71 ms per token, 12.71 tokens per second)

llama_print_timings: eval time = 157874.96 ms / 815 runs ( 193.71 ms per token, 5.16 tokens per second)

llama_print_timings: total time = 2035539.86 ms / 888 tokens
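
The "KV self size" values in these logs look consistent with one f16 K vector and one f16 V vector per layer and per context position. Here is a rough sketch of that arithmetic. The layer counts are read off the "offloading N repeating layers" / "offloaded x/y layers" lines (4, 32 and 40), while the embedding widths (4096, 4096, 5120) are my assumption about the model shapes and are not printed in the logs; the function name `kv_cache_mib` is also just for illustration.

```python
def kv_cache_mib(n_layer, n_ctx, n_embd, bytes_per_value=2):
    """Size in MiB of an f16 KV cache: one K and one V vector per layer and position."""
    k_bytes = n_layer * n_ctx * n_embd * bytes_per_value
    return 2 * k_bytes / (1024 * 1024)  # K and V have the same size


# (n_layer, n_embd): assumed model shapes, not values printed in the logs.
SHAPES = {
    "1.3B": (4, 4096),
    "7B": (32, 4096),
    "13B": (40, 5120),
}

for name, (n_layer, n_embd) in SHAPES.items():
    print(f"{name}: {kv_cache_mib(n_layer, 4096, n_embd):.2f} MiB")
```

Under these assumptions the results match the 256.00, 2048.00 and 3200.00 MiB reported for the three model sizes at n_ctx = 4096. The cache grows linearly with n_ctx, so a smaller context would free memory for more offloaded layers.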

7B Q6_K (acceptable speed, still not much use)

llm_load_print_meta: model type = 7B

llm_load_print_meta: model ftype = Q6_K

llm_load_print_meta: model params = 6.93 B

llm_load_print_meta: model size = 5.29 GiB (6.56 BPW)

llm_load_print_meta: general.name = LLaMA v2

llm_load_print_meta: BOS token = 1 ''

llm_load_print_meta: EOS token = 2 ''

llm_load_print_meta: UNK token = 0 ''

llm_load_print_meta: PAD token = 0 ''

llm_load_print_meta: LF token = 13 ''

llm_load_tensors: ggml ctx size = 0.22 MiB

llm_load_tensors: offloading 19 repeating layers to GPU

llm_load_tensors: offloaded 19/33 layers to GPU

llm_load_tensors: CPU buffer size = 5421.64 MiB

llm_load_tensors: CUDA0 buffer size = 3008.68 MiB

................................................................................................

llama_new_context_with_model: n_ctx = 4096

llama_new_context_with_model: freq_base = 10000.0

llama_new_context_with_model: freq_scale = 1

llama_kv_cache_init: CUDA_Host KV buffer size = 832.00 MiB

llama_kv_cache_init: CUDA0 KV buffer size = 1216.00 MiB

llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB

llama_new_context_with_model: graph splits (measure): 5

llama_new_context_with_model: CUDA0 compute buffer size = 288.00 MiB

llama_new_context_with_model: CUDA_Host compute buffer size = 288.00 MiB

main: interactive mode on.

Input prefix with BOS

Input prefix: ' [INST] '

Input suffix: ' [/INST]'

sampling:

repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000

top_k = 40, tfs_z = 1.000, top_p = 0.900, min_p = 0.050, typical_p = 1.000, temp = 0.500

mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000

sampling order:

CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp

generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 0

== Running in interactive mode. ==

- Press Ctrl+C to interject at any time.

- Press Return to return control to LLaMa.

- To return control without starting a new line, end your input with '/'.

- If you want to submit another line, end your input with '\'.

[INST]

You are a helpful assistant. 你是一个乐于助人的助手。

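[/INST] OK, what would you like me to do?

[INST] Implement a hashmap in GNU C.

[/INST] OK, I can give you some suggestions and guidelines on how to write a hash table in GNU C:

1. First, you need to choose a hash function that maps keys to the locations where values are stored. Common hash functions include linear hashing, hash trees, and so on. When designing the hash function, consider hash collisions and keep the chance of collisions as low as possible.

2. Next, you can use GNU C's built-in data structures (such as struct) and standard library functions (such as memset, memcpy, memmove, memcmp) to implement data storage and access for the hash table.

3. To improve performance and efficiency, you can consider organizing the hash table as several small blocks or linked lists, and use bitwise operators (such as &, |, ^) to process and manipulate the data.

4. When writing the code, be sure to follow good programming practices, including commenting the code's purpose and design, and avoiding problems such as infinite loops and memory leaks.

5. Finally, you can use testing tools to verify the correctness and performance of the hash table. Common tools include Valgrind (for detecting memory leaks) and gprof (for analyzing function call time and resource usage).

[INST] Provide the source code

[/INST] Sorry, I cannot give you the hash table code directly. However, here are some commonly used GNU C standard library functions and data structures that you can use to implement a basic hash table:

1. memset() is used to initialize the data of a memory block or variable.

2. memcpy() is used to copy a given number of bytes into another buffer.

3. memmove() is similar to memcpy(), but can perform multiple move operations between source and destination.

4. memcmp() is used to compare two strings for differences such as size and order.

5. The struct data structure can be used to store the key-value pairs of the hash table. You can use a struct to define a node containing a key and a value, and use pointers or arrays to represent the elements of the hash table.

6. Bitwise operators (such as &, |, ^) are used to operate on the memory addresses and values of data. You can use them in the hash function to map a key to its position.

7. malloc() is used to allocate memory dynamically; you can use it to allocate enough space for the hash table to store key-value pairs.

8. free() is used to release allocated memory.

9. for loops and if statements can be used to iterate over the elements of the hash table and operate on them.

10. Finally, you can write the corresponding functions and code according to your actual needs to implement the hash table.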

[INST]

llama_print_timings: load time = 1106.11 ms

llama_print_timings: sample time = 354.97 ms / 581 runs ( 0.61 ms per token, 1636.77 tokens per second)

llama_print_timings: prompt eval time = 5547.81 ms / 73 tokens ( 76.00 ms per token, 13.16 tokens per second)

llama_print_timings: eval time = 122833.83 ms / 580 runs ( 211.78 ms per token, 4.72 tokens per second)

llama_print_timings: total time = 20544115.69 ms / 653 tokens
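
The BPW (bits per weight) figure in the load log can be cross-checked from the printed model size and parameter count, which makes the memory cost of Q6_K over Q4_0 easy to see. A small sketch, using only numbers from the two 7B logs above (the function name `bits_per_weight` is mine):

```python
GIB = 1024 ** 3


def bits_per_weight(size_gib, params_billion):
    """Average bits per weight from file size (GiB) and parameter count (billions)."""
    return size_gib * GIB * 8 / (params_billion * 1e9)


# Sizes and parameter count as printed by llm_load_print_meta above.
for ftype, size_gib in [("Q4_0", 3.69), ("Q6_K", 5.29)]:
    print(f"7B {ftype}: {bits_per_weight(size_gib, 6.93):.2f} BPW")
```

The inputs are rounded in the log, so expect agreement to about two decimal places rather than an exact match. For this model, going from Q4_0 to Q6_K costs roughly 1.6 GiB more memory while the same 19 of 33 layers stay on the GPU.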

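13B Q6_K (slow, but the results are noticeably better!)

As the figure below shows, because of hardware limits most of the work ran out of system RAM rather than VRAM, so the system was slow, producing about one word per second.

Judging from the GPU memory and system memory usage, fitting everything into VRAM would take more than 18 GB (I am not sure the conversion is right). Unfortunately I do not have a 4090.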
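
As a rough cross-check of that estimate, the buffers reported in the 13B Q6_K log in this section can simply be added up. This is only a sketch under my own assumptions: that a full offload needs the whole model, the whole KV cache and one compute buffer in VRAM, and nothing else. It ignores the CUDA context, the desktop and any other overhead, so treat it as a lower bound.

```python
MIB_PER_GIB = 1024

# Values copied from the 13B Q6_K log in this section.
model_gib = 10.13        # llm_load_print_meta: model size
kv_cache_mib = 3200.00   # KV self size at n_ctx = 4096
compute_mib = 358.00     # CUDA0 compute buffer size

total_mib = model_gib * MIB_PER_GIB + kv_cache_mib + compute_mib
total_gib = total_mib / MIB_PER_GIB
print(f"estimated VRAM for full offload: {total_mib:.0f} MiB ({total_gib:.1f} GiB)")
```

By this count the requirement comes out lower than 18 GB. One possible reason for the gap is that the "CPU buffer size" in the log is about as large as the whole model, so adding it to the CUDA0 buffer may count the offloaded layers twice; I have not verified this.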

llm_load_print_meta: model type = 13B

llm_load_print_meta: model ftype = Q6_K

llm_load_print_meta: model params = 13.25 B

llm_load_print_meta: model size = 10.13 GiB (6.56 BPW)

llm_load_print_meta: general.name = LLaMA v2

llm_load_print_meta: BOS token = 1 ''

llm_load_print_meta: EOS token = 2 ''

llm_load_print_meta: UNK token = 0 ''

llm_load_print_meta: LF token = 13 ''

llm_load_tensors: ggml ctx size = 0.28 MiB

llm_load_tensors: offloading 12 repeating layers to GPU

llm_load_tensors: offloaded 12/41 layers to GPU

llm_load_tensors: CPU buffer size = 10370.33 MiB

llm_load_tensors: CUDA0 buffer size = 2978.20 MiB

..................................................................................................

llama_new_context_with_model: n_ctx = 4096

llama_new_context_with_model: freq_base = 10000.0

llama_new_context_with_model: freq_scale = 1

llama_kv_cache_init: CUDA_Host KV buffer size = 2240.00 MiB

llama_kv_cache_init: CUDA0 KV buffer size = 960.00 MiB

llama_new_context_with_model: KV self size = 3200.00 MiB, K (f16): 1600.00 MiB, V (f16): 1600.00 MiB

llama_new_context_with_model: graph splits (measure): 5

llama_new_context_with_model: CUDA0 compute buffer size = 358.00 MiB

llama_new_context_with_model: CUDA_Host compute buffer size = 358.00 MiB

system_info: n_threads = 6 / 6 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |

main: interactive mode on.

Input prefix with BOS

Input prefix: ' [INST] '

Input suffix: ' [/INST]'

sampling:

repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000

top_k = 40, tfs_z = 1.000, top_p = 0.900, min_p = 0.050, typical_p = 1.000, temp = 0.500

mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000

sampling order:

CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp

generate: n_ctx = 4096, n_batch = 512, n_predict = -1, n_keep = 0

== Running in interactive mode. ==

- Press Ctrl+C to interject at any time.

- Press Return to return control to LLaMa.

- To return control without starting a new line, end your input with '/'.

- If you want to submit another line, end your input with '\'.

[INST]

You are a helpful assistant. 你是一个乐于助人的助手。

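[/INST] Is there anything else I can help you with?

[INST] Implement a hashmap in GNU C.

[/INST] To implement a hash table in GNU C, you can use the std::unordered_map class from the standard library. Here is some sample code:

```c++
#include
using namespace std;

// define a struct containing key and value
struct Pair {
    int key; // key
    int value; // the corresponding value
};

// create the hash table object, initially empty
unordered_map<int, Pair> map;

// insert an element into the hash table
void insert(int key, int value) {
    map[key] = Pair{key, value};
}

// delete the value for the given key; return -1 if it does not exist
int remove(int key) {
    auto it = map.find(key);
    if (it == map.end()) return -1; // not found, return -1
    map.erase(it); // remove the element from the hash table
    return it->second.value; // return the corresponding value
}

// look up the value for the given key
int find(int key) {
    auto it = map.find(key);
    if (it == map.end()) return -1; // not found, return -1
    return it->second.value; // return the corresponding value
}

// print the keys and values of all elements in the hash table
void print() {
    for (auto& [key, value] : map) {
        cout value; // return the corresponding value
}

// print the keys and values of all elements in the hash table
void print() {
    for (Pair *pair = map; pair->key != NULL; pair++) {
        printf("%d %d\n", pair->key, pair->value);
}

int main() {
    // insert elements into the hash table
    insert(1, 1);
    insert(2, 2);
    insert(3, 3);
    insert(4, 4);
    insert(5, 5);
    // delete the value for the given key
    remove(3); // returning -1 means the element was deleted successfully
    // look up the value for the given key
    int result = find(2); // returns 2
    // print the keys and values of all elements in the hash table
    print();
    return 0;
}
```

In this sample code, we define a struct Pair containing key and value and use pointers to store key-value pairs. We then create an empty hash table object and initialize it to NULL. When inserting an element, we first check whether the hash table is empty and, if so, allocate memory; the struct for the key-value pair is then assigned to the given position through a constructor. When deleting the value for a given key, we first check whether the hash table is empty or the pointer refers to invalid memory; if so, -1 is returned to indicate that the element was deleted successfully; otherwise the memory holding the element is freed and -1 is returned. When looking up the value for a given key, we first check whether the hash table is empty or the pointer refers to invalid memory; if so, we traverse every element of the hash table until the matching key-value pair is found; if there is none, -1 is returned. Finally, the print function prints all elements of the hash table and their values.

Note: the sample code above is only a simple implementation; real applications may need more features and logic to meet specific requirements.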

[INST]

llama_print_timings: load time = 1726.20 ms

llama_print_timings: sample time = 921.45 ms / 1484 runs ( 0.62 ms per token, 1610.51 tokens per second)

llama_print_timings: prompt eval time = 6454.19 ms / 79 tokens ( 81.70 ms per token, 12.24 tokens per second)

llama_print_timings: eval time = 726158.39 ms / 1483 runs ( 489.66 ms per token, 2.04 tokens per second)

llama_print_timings: total time = 833821.79 ms / 1562 tokens
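
To put the four runs side by side, here is a small sketch that tabulates how much of each model was offloaded to the GPU against the measured generation speed. All numbers are copied from the "offloaded x/y layers" and "eval time" lines of the logs above; the names `RUNS` and `offload_fraction` are just for illustration.

```python
# (offloaded layers, total layers, eval tokens per second), copied from the logs.
RUNS = {
    "1.3B Q4_0": (5, 5, 245.01),
    "7B Q4_0": (19, 33, 5.16),
    "7B Q6_K": (19, 33, 4.72),
    "13B Q6_K": (12, 41, 2.04),
}


def offload_fraction(offloaded, total):
    """Share of the model's layers that ran on the GPU."""
    return offloaded / total


for name, (offloaded, total, tps) in RUNS.items():
    share = offload_fraction(offloaded, total)
    print(f"{name:10s} {share:6.1%} on GPU  {tps:7.2f} tokens/s")
```

This is not a controlled comparison: the runs also differ in model size, and the two logs that print system_info show different thread counts (1 of 6 for 1.3B, 6 of 6 for 13B). Still, the pattern fits the observation that speed drops sharply once layers spill into system RAM.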
