Tuesday, April 16, 2024

Quick Tutorial: Tiny Llama Finetuning

TinyLlama is an LLM small enough to run on your laptop, and it generates reasonably good replies. People are fine-tuning it for their own purposes. I am going to show you how to play with it.

How to play with a model on Hugging Face?

Go to the Hugging Face model page; it already includes instructions on how to start the app: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0

How to run it locally?

Download the files to your disk from the "Files" tab:

https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0/tree/main

You will have to click on the download icon to download each file; copying and pasting the file text leads to corrupted files. Put them all in one folder and load them with a short Python script.
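A minimal sketch, assuming transformers, torch, and accelerate are installed and the files were saved to a folder named TinyLlama-1.1B-Chat-v1.0:

import torch
from transformers import pipeline

# Point the pipeline at the local folder instead of the model name on the Hub.
pipe = pipeline(
    "text-generation",
    model="./TinyLlama-1.1B-Chat-v1.0",
    torch_dtype=torch.bfloat16,
    device_map="auto",  # needs accelerate; drop this argument to run on CPU
)

print(pipe("The capital of France is", max_new_tokens=32)[0]["generated_text"])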


Which one to use?

There are a few models. Use the chat model; it gives better results for implementing a question/answer chatbot.

All models are good at continuing the text that you provide. The chat model separates questions from answers well by marking up user input and assistant output. Surround user input with the <|user|> tag, ending it with </s>. Surround the desired reply with <|assistant|>, also ending it with </s>.
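For example, a single exchange in the tagged format (mirroring the chat template shown on the model card):

<|user|>
What is the capital of France?</s>
<|assistant|>
The capital of France is Paris.</s>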

You can also use it without tags, but expect the output to look more like creative writing than a conversation.

In addition to "text-generation", the Transformers pipeline has a "conversational" mode, which allows the input to be a list of maps, where each map is one message in the conversation. (The pipeline then converts the list of maps into tagged text for you.)
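A sketch of that mode, assuming a transformers version that still ships the (since deprecated) "conversational" pipeline and its Conversation class:

from transformers import Conversation, pipeline

chatbot = pipeline("conversational", model="./TinyLlama-1.1B-Chat-v1.0")
conversation = Conversation("How do I run TinyLlama on a laptop?")
conversation = chatbot(conversation)
print(conversation.messages[-1]["content"])  # the assistant's reply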


How to fine-tune with peft?

Use this script. It fine-tunes only a small fraction of the model's parameters by attaching low-rank (LoRA) adapters to selected components of the network.

https://gist.github.com/yuhanz/740ab396f237e8d2f41f07a10ccd8de9

You want to initialize TrainingArguments with push_to_hub=False, so that it doesn't require Hugging Face credentials. Replace the model name with the relative path to the folder containing the model files on your disk.
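A condensed sketch of that setup (the hyperparameters and target modules below are illustrative, not copied from the gist):

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

model_path = "./TinyLlama-1.1B-Chat-v1.0"  # relative path to the local model files
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attach adapters to the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # confirms only a small fraction of weights will train

training_args = TrainingArguments(
    output_dir="./tinyllama-lora",
    push_to_hub=False,  # no Hugging Face credentials required
    per_device_train_batch_size=1,
    num_train_epochs=1,
)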

Prepare the training data as conversations separated by tags. Use the tokenizer to convert the structured data into tagged text:

text = tokenizer.apply_chat_template(messages, tokenize=False)
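For instance, with messages as a list of role/content maps:

messages = [
    {"role": "user", "content": "What is TinyLlama?"},
    {"role": "assistant", "content": "A 1.1B-parameter Llama-style chat model."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
# text is now one string with <|user|> ... </s> and <|assistant|> ... </s> tags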

Note that the max_seq_length setting caps the token length of each training example; how much raw text fits within that limit depends on the model's tokenizer.
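For example, with trl's SFTTrainer (assuming that is the trainer in use; the one-row dataset is purely illustrative):

from datasets import Dataset
from trl import SFTTrainer

train_dataset = Dataset.from_dict({"text": [text]})  # `text` from apply_chat_template above

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=512,  # training examples are truncated to this many tokens
    tokenizer=tokenizer,
)
trainer.train()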

References:

https://www.kaggle.com/code/tommyadams/fine-tuning-tinyllama

https://www.databricks.com/blog/efficient-fine-tuning-lora-guide-llms

https://huggingface.co/docs/transformers/main/chat_templating


Thursday, April 04, 2024

Handling AWS API Gateway Error: Invalid HTTP endpoint encoding detected for URI

I encountered this error at work when deploying AWS API Gateway:

Invalid HTTP endpoint encoding detected for URI (Service: ApiGateway, Status Code: 400

That was because the URI contained percent-encoded %xx characters that AWS API Gateway doesn't like (even though they are valid and work well with curl). Replacing those characters fixed the issue. Here is the list of unwelcome characters that I found from my experiments:

%21

%24

%2C

%3A

%0A

They correspond to the characters !, $, comma, colon, and newline.
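A small sketch of the workaround, assuming the decoded characters are acceptable in your endpoint URI:

# Replace the percent-escapes that API Gateway rejects with their literal characters.
OFFENDING = {"%21": "!", "%24": "$", "%2C": ",", "%3A": ":", "%0A": "\n"}

def sanitize_uri(uri: str) -> str:
    for encoded, literal in OFFENDING.items():
        uri = uri.replace(encoded, literal)
    return uri

print(sanitize_uri("https://example.com/a%2Cb%3Ac"))  # -> https://example.com/a,b:c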