{" ": 0, "lo": 1, "hel": 2, "r": 3, "wo": 4, "ld": 5}
and the resulting tokens for "hello world" would be [2, 1, 0, 4, 3, 5].
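As a sketch of how that lookup could work (real tokenizers like BPE are more involved than this), greedy longest-prefix matching against the toy vocab reproduces those ids:

# A minimal sketch: repeatedly take the longest vocab entry that prefixes
# the remaining text, and emit its id.
vocab = {" ": 0, "lo": 1, "hel": 2, "r": 3, "wo": 4, "ld": 5}

def tokenize(text):
    tokens = []
    while text:
        match = max((piece for piece in vocab if text.startswith(piece)), key=len)
        tokens.append(vocab[match])
        text = text[len(match):]
    return tokens

print(tokenize("hello world"))  # [2, 1, 0, 4, 3, 5]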
Part of GPT-2's vocab is below. It uses Byte-Pair Encoding, so tokens are often pieces of words rather than whole words. There are a lot of Ġ characters; those are special, marking the start of a new word. Ġaut covers the first three letters of "author" at the start of a word, whereas aut wouldn't; instead, aut's value would be substituted in the middle of a word like "nautilus".
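A quick way to see the Ġ convention in action, assuming the same "gpt2" tokenizer loaded in the session below:

# Words preceded by a space get a Ġ-prefixed token; the word at the very start
# of the string does not. These string tokens correspond to the ids
# [31373, 995, 23748, 995] seen in In [6] below.
tokenizer.tokenize("hello world hello world")
# ['hello', 'Ġworld', 'Ġhello', 'Ġworld']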
There's usually a padding and truncation step so that inputs are normalized to a fixed shape.

In [1]: from transformers import AutoTokenizer
In [2]: tokenizer = AutoTokenizer.from_pretrained("gpt2")
In [3]: tokenizer.get_vocab()
Out[3]:
{'Ġaut': 1960,
'roleum': 21945,
'151': 24309,
'ascal': 27747,
'azeera': 28535,
'Ġchore': 30569,
'][': 7131,
'ĠEns': 48221,
...}
In [4]: len(tokenizer.get_vocab())
Out[4]: 50279
In [5]: tokenizer.encode("hello world", return_tensors="pt")
Out[5]: tensor([[31373, 995]])
In [6]: tokenizer.encode("hello world hello world", return_tensors="pt")
Out[6]: tensor([[31373, 995, 23748, 995]])
In [7]: tokenizer.encode("hello world", return_tensors="pt", max_length=10, padding="max_length", truncation=True)
Out[7]: tensor([[31373, 995, 50259, 50259, 50259, 50259, 50259, 50259, 50259, 50259]])
In [8]: tokenizer.encode("hello world", return_tensors="pt", max_length=1, padding="max_length", truncation=True)
Out[8]: tensor([[31373]])
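A note on the padding above: GPT-2's tokenizer ships without a pad token, and the vocab length of 50279 (vs. GPT-2's base 50257) together with the pad id 50259 suggest extra special tokens were added before this session. A minimal sketch of how a pad token can be registered (the "<pad>" string here is an assumption, not necessarily the token actually used):

# padding="max_length" only works once a pad token exists; this is one way
# to add one to GPT-2's tokenizer.
tokenizer.add_special_tokens({"pad_token": "<pad>"})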
Each token id is then mapped to an embedding, a vector of floats (e.g. [0.14321, 0.098342, -1.12378 ...]). GPT-2 has a vocab size of 50257 and an embedding size of 768, so the embedding layer is (n_vocab, n_embedding). (wte stands for word token embeddings.)

In [1]: from transformers import AutoModel
In [2]: model = AutoModel.from_pretrained("gpt2")
In [3]: model.wte
Out[3]: Embedding(50257, 768)
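The embedding layer is just a lookup table: indexing it with token ids returns one 768-dimensional vector per token. A small sketch, reusing the ids that "hello world" encoded to above:

# model.wte is a torch.nn.Embedding; calling it with a batch of token ids
# returns the corresponding embedding vectors.
import torch
ids = torch.tensor([[31373, 995]])  # "hello world" from the tokenizer session
vectors = model.wte(ids)
print(vectors.shape)                # torch.Size([1, 2, 768])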
At the other end of the model, the language-model head reverses the mapping: a linear layer of shape (n_embedding, n_vocab) projects each 768-dimensional hidden state back to logits over the vocabulary.

In [1]: from transformers import AutoModelForCausalLM
In [2]: model = AutoModelForCausalLM.from_pretrained("gpt2")
In [3]: model.lm_head
Out[3]: Linear(in_features=768, out_features=50257, bias=False)
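Putting both ends together: the model turns token ids into 768-dimensional hidden states, and lm_head projects each hidden state to 50257 logits, one per vocab entry. A minimal sketch of greedy next-token prediction, assuming the tokenizer and model loaded in the sessions above:

ids = tokenizer.encode("hello world", return_tensors="pt")
logits = model(ids).logits        # shape (1, sequence_length, 50257)
next_id = logits[0, -1].argmax()  # highest-scoring next token
print(tokenizer.decode(next_id.item()))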
Finetuning is when you start with an already trained model rather than a randomly initialized one. The benefits are that you can leverage large general-purpose datasets, build on models that have already been trained, and use fewer resources. Available approaches:
"Finetuning Open-Source LLMs." Youtube, uploaded by Sebastian Raschka, 14 Oct. 2023, https://youtu.be/gs-IDg-FoIQ?si=OCUI22mSHWSfmFK6&t=375 ↩