This model is converted from decapoda-research/llama-7b-hf, merged with the ziqingyang/chinese-alpaca-plus-lora-7b LoRA weights, and quantized for use with ggerganov/llama.cpp.

The conversion and quantization were done on Google Colab, following the wiki article of ymcui/Chinese-LLaMA-Alpaca.
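
For reference, the steps look roughly like the sketch below. This is not the exact Colab notebook: the script names and flags follow the ymcui/Chinese-LLaMA-Alpaca wiki and the llama.cpp repo as of this writing and may have changed since. Note that, per the wiki, merging an Alpaca-Plus model requires both the LLaMA-Plus and Alpaca-Plus LoRA weights.

```bash
# 1. Merge the LoRA weights into the base model
#    (script from ymcui/Chinese-LLaMA-Alpaca; Plus models need two LoRAs)
python scripts/merge_llama_with_chinese_lora.py \
    --base_model decapoda-research/llama-7b-hf \
    --lora_model ziqingyang/chinese-llama-plus-lora-7b,ziqingyang/chinese-alpaca-plus-lora-7b \
    --output_dir ./merged-7b

# 2. Convert the merged model to an f16 ggml file (script from llama.cpp)
python convert.py ./merged-7b
```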

The quantization methods in llama.cpp have been updated recently, so please clone the latest repo and recompile before loading the model.
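
A minimal rebuild looks like this (assuming make and a C/C++ toolchain are available; see the llama.cpp README for platform-specific build options):

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
```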

The suffixes q8_0 and q5_1 denote different quantization methods; the former occupies more space and should, in theory, produce better responses.
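
Both files were produced with llama.cpp's quantize tool from the same f16 ggml model, roughly as follows (the f16 path and the q5_1 file name are illustrative):

```bash
./quantize ./models/ggml-model-f16.bin ./models/chinese-Alpaca-7b-plus-ggml-q8_0.bin q8_0
./quantize ./models/ggml-model-f16.bin ./models/chinese-Alpaca-7b-plus-ggml-q5_1.bin q5_1
```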

Example of the q8_0 model on llama.cpp, a Chinese chat asking why the sky is blue (the responses are acceptable, but very short):

> ./main -m ./models/chinese-Alpaca-7b-plus-ggml-q8_0.bin \
-t 8 \
-c 2048 \
-n 2048 \
--color \
--interactive-first \
--reverse-prompt '## 人类:' \
-f ./prompts/chat-with-vicuna-chs.txt
main: build = 0 (unknown)
main: seed  = 1683883289
llama.cpp: loading model from ./models/chinese-Alpaca-7b-plus-ggml-q8_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 49954
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  68.20 KB
llama_model_load_internal: mem required  = 9180.13 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 1024.00 MB

system_info: n_threads = 8 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
main: interactive mode on.
Reverse prompt: '## 人类:'
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 2048, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 这是一段人类和人工智能助理的对话。人工智能助理会对人类的问题给出详细、准确、礼貌的回答。

## 人类:你好,助理。
## 助理:你好,有什么需要帮助的吗?
## 人类:天空为什么是蓝色的?
助理:天空之所以呈现蓝色,是因为太阳光在大气层中发生散射作用。
## 人类:请再详细地解释一下。
助理:太阳光穿过大气层时,一部分被散射到不同波长的角度上,其中蓝色光的波长比其他颜色短。
## 人类:你真是惜字如金啊。
助理:谢谢你的夸奖,但我只是根据我学到的知识和算法来回答你的问题。
## 人类:

llama_print_timings:        load time =  9418.31 ms
llama_print_timings:      sample time =   107.95 ms /    73 runs   (    1.48 ms per run)
llama_print_timings: prompt eval time =  8645.76 ms /    85 tokens (  101.71 ms per token)
llama_print_timings:        eval time = 16303.43 ms /    73 runs   (  223.33 ms per run)
llama_print_timings:       total time = 987546.29 ms