Hank
LLM

Creating a custom language model

I watched and followed along with Andrej Karpathy's YouTube video on creating an LLM.

Scraping content

I then tried to capture as much text from Hank Green as I could and build a model of what he would say next.

I scraped content from Vlogbrothers videos and podcasts to generate the data set.

For the podcasts, we get the data from here and use BeautifulSoup to parse out Hank's text.

For the YouTube videos, someone's created a Python library just for this.

from youtube_transcript_api import YouTubeTranscriptApi
YouTubeTranscriptApi.get_transcript('DWReago8zrM')
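The call returns a list of caption segments rather than one string, so the pieces need to be stitched together before tokenizing. A minimal sketch, assuming each segment is a dict with a "text" key (the shape this library returned at the time of writing); `join_transcript` is a hypothetical helper, not part of the library, and the sample segments are made up:

```python
def join_transcript(segments):
    """Join caption segments into one lowercase string.

    Assumes each segment is a dict with a "text" key, e.g.
    {"text": "good morning john", "start": 0.0, "duration": 1.5}.
    """
    # Captions can contain line breaks inside a segment; flatten them.
    pieces = [" ".join(seg["text"].split()) for seg in segments]
    # Drop empty segments, join with single spaces, and lowercase.
    return " ".join(p for p in pieces if p).lower()


# Made-up segments for illustration only
sample = [
    {"text": "Good morning\nJohn", "start": 0.0, "duration": 1.5},
    {"text": "", "start": 1.5, "duration": 0.2},
    {"text": "it's Friday", "start": 1.7, "duration": 1.0},
]
print(join_transcript(sample))
```

Lowercasing here matches the simplification applied before tokenization.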

In total, I managed to collect 3 MB of text (podcasts and YouTube videos).

Tokenization

To build the list of tokens that the model will accept, I used the Python library Tokenizers (tokenizers==0.13.2 in this example).

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# Create an empty BPE tokenizer (assumed: no pre-tokenizer, which matches tokens that span spaces)
tokenizer = Tokenizer(BPE())

# Customize the trainer
trainer = BpeTrainer(vocab_size=2000, min_frequency=4, show_progress=True)

# Train the tokenizer on the input text (`text` is the whole corpus as one string)
tokenizer.train_from_iterator([text], trainer=trainer)

# Save the tokens
tokenizer.save("my_tokenizer.json")

This generated a JSON of all my tokens, which I then could use to encode and decode text for the model.

Before putting the text through the tokenizer I simplified the input by making everything lowercase.

The tokens ended up looking like this:

chars = [ "john i'll see you ", 'degrees celsius ', "i don't know if ", 'a lot of people ', 'on the internet ', 'if you want to ', 'youtube rewind ', 'something that ', 'mammal biomass ', 'people who are ', "because we're ", 'awesome socks ', 'understanding ', 'and of course ', "i don't know ", 'good morning ', 'a little bit ', "i'll see you ", 'i think that ', "because it's ", " that's what ", 'project for ', 'complicated ', 'things that ', 'part of the ', 'going to be ', 'information ', 'ankle socks ', 'section 230 ', 'communities ', "and there's ", 'live stream ', 'nerdfighteri', 'baked beans ', 'hotter than ', 'crank green ', "we're gonna ", 'a different ', 'and you can ', 'that i have ', 'for example ', 'people were ', "and they're ", 'awesome.com ', 'every month ', 'people who ', 'thing that ', 'on tuesday ', 'about this ', "don't know ", 'figure out ', 'all of the ', 'during the ', "and that's ", 'definitely ', 'one of the ', 'everything ', 'lemon lips ', 'a bunch of ', 'people are ', 'hummus and ', 'because it ', 'think that ', 'absolutely ', 'impossible ', "that's not ", 'started to ', 'telling me ', 'chickenpox ', 'this video ', 'about that ', 'because we ', 'moderation ', 'signs that ', 'hank green ', 'myself and ', 'talk about ', 'particular ', 'ing things ', 'each other ', "ing that's ", 'i actually ', 'that there ', 'you should ', 'out of the ', ' this is a ', 'john green ', 'watercolor ', 'difference ', 'content to ', 'understand ', 'different ', 'something ', 'about the ', 'of course ', 'like this ', 'beautiful ', 'know that ', 'like that ', 'community ', "it's just ", 'wanted to ', 'dr pepper ', 'have been ', "i'm gonna ", "it's very ", 'right now ', 'trying to ', "you don't ", 'you could ', 'of people ', 'sometimes ', 'years ago ', 'formation ', "it's also ", 'decisions ', 'basically ', 'i thought ', 'but it is ', 'more than ', 'turns out ', 'important ', 'just like ', 'very good ', "if you're ", 'there are ', 'everybody ', 'questions ', 'elon 
musk ', "you can't ", "it's like ", 'because i ', 'should be ', 'rough the ', 'in the des', 'last year ', 'crunching ', 'obviously ', 'when they ', "it wasn't ", "it's been ", 'this year ', 'and there ', 'amount of ', "there's a ", 'next year ', 'literally ', 'extremely ', 'available ', 'and it is ', 'i want to ', 'i need to ', 'like what ', 'so if you ', 'this week ', 'one thing ', 'all these ', 'well done ', 'a lot of ', 'ing that ', 'and then ', 'which is ', 'actually ', 'going to ', 'a little ', 'internet ', 'like the ', 'with the ', 'possible ', ' this is ', 'and that ', "and it's ", 'probably ', 'audience ', "it's not ", 'bunch of ', 'would be ', 'ed about ', 'that the ', 'anything ', 'pressure ', 'together ', 'recommend', 'you know ', 'problems ', 'shingles ', 'computer ', ' thought ', 'that was ', 'about it ', 'and they ', 'might be ', 'ing with ', 'from the ', 'cription ', 'ing this ', 'of those ', 'tell you ', 'into the ', " there's ", 'whatever ', 'the same ', ' that is ', 'what the ', 'there is ', 'of these ', 'question ', 'deep sea ', 'going on ', 'at least ', 'that are ', 'how many ', 'grateful ', "couldn't ", "i didn't ", 'the most ', 'for that ', 'with you ', 'somebody ', 'deration ', 'has been ', 'to do it ', 'platform ', 'decrease ', 'opposite ', 'thousand ', ' through ', 'remember ', 'i really ', 'you just ', "it's the ", 'gonna be ', 'ing them ', 'over the ', 'may have ', 'research ', 'national ', 'and when ', 'the moon ', 'it makes ', 'like you ', 'in order ', "we don't ", 'but also ', 'me court ', "i'm sure ", 'they are ', 'they can ', 'it was a ', 'now that ', 'bring me ', 'starting ', 'happened ', 'because ', 'awesome ', 'project ', "i don't ", 'youtube ', " that's ", 'ing the ', 'you can ', 'want to ', 'this is ', 'and the ', 'content ', 'for the ', 'i think ', 'morning ', 'kind of ', "there's ", 'problem ', 'of them ', 'of this ', 'see you ', 'need to ', 'twitter ', 'celsius ', 'have to ', "i'm not ", 'ed that ', 'degrees ', "doesn't ", 
'is that ', 'to make ', 'ing and ', 'do that ', 'interest', 'in this ', 'you are ', 'we were ', 'biomass ', "ouldn't ", 'all the ', 'getting ', 'dollars ', 'es that ', 'percent ', 'college ', 'another ', 'i would ', 'hundred ', ' things ', 'like it ', 'so that ', 'talking ', 'at that ', 'ational ', 'to be a ', 'hard to ', "they're ", 'on that ', 'you get ', 'sort of ', 'platform', 'nerdfigh', "haven't ", 'an that ', 'of that ', 'be like ', 'section ', 'working ', 'happens ', 'animals ', 'between ', 'elon mus', 'en they ', 'used to ', 'er than ', 'is just ', "and i'm ", 'that is ', 'for you ', 'what is ', 'totally ', 'anymore ', 'perfect ', 'spotify ', 'already ', 'i found ', 'so much ', 'my favor', 'where i ', 'example ', 'ans and ', 'ers and ', 'you thin', 'the way ', 'it says ', 'of what ', 'was the ', 'viously ', 'someone ', 'without ', 'mammals ', 'surface ', 'feeling ', 'butt far', 'a lemon ', '[music] ', 'and you ', 'and not ', 'we have ', 'feature ', 'ount of ', 'will be ', 'difficul', 'advertis', 'finally ', 'present ', 'put the ', 'easy to ', 'whether ', 'changed ', 'million ', 'disease ', 'explain ', 'believe ', 'amazing ', 'algorith', 'quickly ', 'ercolor ', 'a month ', 'and now ', 'to have ', 'ed into ', 'you got ', 'like to ', 'so many ', "it's so ", 'just to ', 'us maxim', 'try and ', 'spaceshi', 'looking ', 'see the ', 'communic', 'people ', 'of the ', 'in the ', 'really ', 'ing to ', 'if you ', 'little ', 'it was ', 'on the ', 'hummus ', "that's ", 'things ', 'to the ', 'i have ', 'course ', 'ations ', 'k that ', 's that ', "you're ", "it's a ", 'problem', 'that i ', 'pretty ', 'at the ', "didn't ", 'should ', 'ted to ', 'is the ', 'videos ', 'before ', ' there ', 'underst', 'a good ', "what's ", 'ically ', 'figure ', 'and we ', 'i just ', 'dience ', 'en the ', 'i love ', 'complic', 'system ', 'and so ', 'to get ', 'with a ', 'cannot ', 'why is ', 'person ', 'coffee ', 'ing me ', 'i will ', 'we are ', 'rewind ', 'called ', 'on and ', 'number ', ' 
those ', 'around ', 'pepper ', 'we can ', 'myself ', 'almost ', 'like a ', 'making ', 'all of ', 'complex', 'filled ', 'had to ', 'second ', 'though ', 'and it ', 'having ', 'looked ', 'how to ', 'stream ', 'reason ', 't that ', 'inside ', 'ingles ', 'living ', 'es and ', "wasn't ", 'google ', 'bottom ', 'humans ', 'crunch ', 'mammal ', 'friend ', 'simple ', 'pickle ', ' their ', 'but it ', 'by the ', 'i think', 'ending ', 'no one ', 'ousand ', ' these ', 'itself ', 'coming ', 'selves ', 'folder ', 'he was ', 'school ', 'anding ', 'onally ', 'to see ', 'ateful ', 'like i ', 'for me ', 'th and ', 'as the ', 'member ', 'planet ', 'fairly ', 'chicken', 'king th', 'is not ', 'atever ', 'ed and ', 'we did ', 'have a ', 'make a ', 'change ', 'speech ', 'oxygen ', 'always ', 'to you ', 'i mean ', 'it but ', 'saying ', 'mostly ', 'higher ', 'moment ', 'either ', 'concern', 'better ', 'enough ', 'and to ', 'to say ', 'liable ', 'likely ', 'behind ', 'active ', 'diesel ', 'al and ', 'format ', 'strong ', 'switch ', 'access ', 'matter ', 'places ', 'social ', 'a very ', 'so the ', 'action ', 'in our ', 'just a ', 'a link ', 'when i ', 'arious ', 'answer ', 'organiz', 'sounds ', ' third ', 'or the ', "aren't ", 'it out ', 'out of ', 'we had ', 'we gott', 'one is ', 'unless ', 'ons of ', 'agricul', 'anyway ', 'happen ', 'made a ', 'decide ', 'entire ', ' that ', 'about ', ' this ', 'cause ', "don't ", 'and th', 'thing ', 'which ', 'ed to ', 'awesom', 'a lot ', 'ation ', 'would ', 'there ', 's and ', 'differ', 'i thin', "we're ", 'it is ', 'right ', 'and i ', 'video ', 'other ', 'to be ', 'where ', 'gonna ', 'could ', 'to do ', ' they ', 'years ', 'intern', 'but i ', 'water ', 'times ', 'ing th', 'maybe ', 'green ', 'being ', 'lemon ', 'commun', 'weird ', 'first ', 'great ', 'happen', 'socks ', 'beans ', 'sible ', 'doing ', 'every ', "can't ", 'ed by ', "ere's ", 'might ', 'y and ', 'for a ', 'whole ', 'never ', "isn't ", 'tting ', 'world ', 'grees ', 'stuff ', 'ching ', "esn't 
", 'esday ', 'questi', 'tiful ', "ey're ", 'dollar', 'those ', 'makes ', 'ought ', 'i did ', 'month ', 'think ', 'bunch ', 'i was ', 'money ', 't-rex ', 'found ', ' than ', 'space ', 'exist ', 'these ', 'do it ', 'place ', 'illed ', 'ately ', 'while ', 'human ', 'racks ', 'recomm', 'still ', 'seems ', "we've ", 'going ', 'ction ', 'tally ', 'often ', 'ating ', 'ature ', 'itely ', 'story ', 'rough ', 's out ', 'we do ', 'quite ', 'sense ', 'sions ', 'again ', 'point ', 'signs ', 'paper ', 'ankle ', 'darcy ', 'value ', 'spotif', 'to me ', 'liked ', 'ement ', 'in 201', 'super ', 'least ', 'had a ', 'guess ', 'ously ', 'phone ', 'baked ', 'their ', 's the ', 'a new ', 'of it ', 'es of ', 'share ', 'concer', 'watch ', 'suppor', 'after ', ' them ', 'ities ', 'state ', 'absolu', 'in my ', 'comes ', 'import', 'specif', 'ready ', 'pants ', 'though', 'build ', 'was a ', 'trust ', 'looks ', 'memes ', 'design', 'extra ', 'decaf ', 'three ', 'a few ', 'resear', 'it to ', 'tions ', "she's ", 'pping ', 'close ', 'partic', 'start ', 'sorry ', 'media ', '[music', 's now ', 'iness ', 'ently ', "you'd ", 'ed in ', "i'm a ", 'if it ', 'crank ', 'fully ', 'today ', 'means ', 'americ', 'court ', 'raise ', 'go to ', 'color ', 'funny ', 'virus ', 'clear ', 'taken ', 'aquari', 'temper', 'ingle ', 'algori', 'the ch', 'ed up ', 'es to ', 'vertis', 'works ', 'ounds ', 'solar ', 'until ', 'correc', 'piece ', 'respon', 'e.com ', 'sound ', 'ten th', 'a way ', 'order ', 'i got ', 'ed on ', 'using ', 'on my ', 'aceshi', 'roads ', 'ative ', 'earth ', 'meant ', 'parts ', 'happy ', 'giant ', "ey'll ", 'wants ', 'folks ', ' the ', 'that ', 'like ', "it's ", 'this ', 'ally ', 'just ', 'have ', 'very ', "at's ", 'ould ', 'what ', 'with ', 'know ', 'good ', 'ight ', 'they ', 'ther ', 'ings ', 'more ', 'john ', 'king ', ' thin', 'also ', 'will ', 'inter', 'make ', 'youtu', 'want ', 'were ', 'tion ', 'been ', 'of th', 'from ', 'some ', 'able ', 'time ', 'most ', 'when ', "dn't ", 'ence ', 'ound ', 'here ', 
'your ', 'year ', 'ning ', 'many ', 'ough ', 'look ', 'hard ', 'said ', 'work ', 'kind ', 'ting ', 'into ', 'ated ', 'love ', 'every', 'ding ', 'last ', 'ving ', 'sure ', 'part ', 'well ', 'made ', 'video', 'is a ', 'self ', 'over ', 'body ', 'less ', 't of ', "i'll ", 'much ', 'long ', 'befor', 'ever ', 'same ', 'only ', 'down ', 'ying ', 'under', 'sius ', 'take ', 's of ', 'face ', 'them ', 'en th', 'does ', 'quest', 'sign ', 'done ', 'than ', 'ways ', 'in a ', 'ents ', 'ably ', 'compl', 'hank ', 'rough', 'on tu', 'acks ', "i've ", 'club ', "he's ", 'tell ', 'once ', 'deep ', 'feel ', 'back ', 'next ', 'live ', 'lips ', 'life ', 'side ', 'clear', 'idea ', 'cent ', 'find ', 'onal ', 'of a ', 'then ', 'betwe', 'secon', 'pper ', 'each ', 'five ', 'yeah ', 'ants ', 'bean ', 'defin', 'come ', 'lege ', 'even ', 'speci', 'week ', 'other', 's on ', 'eful ', "sn't ", 'elon ', 'wild ', 'compu', 'sion ', 'both ', 'music', 'crab ', 'y to ', 'used ', 'tive ', 'nerdf', 'heat ', 'excit', 'ment ', 'ched ', 'else ', 'ical ', 'ttom ', 'give ', 'chang', 'care ', 'free ', "en't ", 'best ', 'haven', 'came ', 'mean ', 'butt ', 'favor', 'away ', 'aked ', 'fast ', 'power', 'tely ', 's in ', 'list ', 'so i ', 'decre', 'says ', 'a lem', 'ular ', '2023 ', 'milli', 'name ', 'organ', 'to a ', 'on a ', 'stly ', 'chick', 'math ', 'provi', 'ture ', 'exper', 'as a ', 'lots ', 'gets ', 'fish ', 'notic', 'talk ', 'days ', 'trans', 'four ', 'found', 'gest ', 'inclu', 'itch ', 'stop ', 'consi', 'pped ', 'boba ', 'creat', 'meme ', 'cute ', 'huge ', 'dinos', '.com ', 'case ', 'okay ', 'ones ', 'ower ', 'emper', 'open ', 'form ', 'ters ', 'left ', 'watch', 'stuff', 'high ', 'blue ', 'half ', 'food ', 'hind ', 'hope ', 'kely ', 'nice ', 'uary ', 'ange ', ' than', 'read ', 'esel ', 'oppos', 'main ', 'maxim', 'ases ', 'show ', 'joke ', 'preci', 'poop ', 'moon ', 'fine ', 'dest ', 'head ', 'exist', 'barri', 'molec', 'line ', 'ined ', 'rely ', 'liter', 'i am ', 'tice ', 'be a ', 'extre', 'quick', 'mely ', 
'avail', 'decid', 'banan', 'bird ', 'bers ', 'hour ', 'ivers', 'must ', 'ricul', 'tful ', 'a doc', 'ends ', 'ards ', 'resul', 'revie', 'ster ', 'roll ', 'sely ', 'supre', 'need ', 'told ', 'ridic', 'turn ', 'and ', 'ing ', 'the ', 'you ', 'out ', 'but ', 'for ', 'ere ', 'are ', 'one ', 'ust ', 'was ', 'now ', 'ent ', 'not ', 'peop', 'use ', 'all ', "i'm ", "'re ", 'can ', "n't ", 'get ', 'ant ', 'ough', 'ect ', 'year', 'some', "e's ", 'way ', 'ons ', 'ver ', 'esom', 'ind ', 'vide', 'prob', 'proj', 'who ', 'any ', 'see ', 'diff', 'ess ', 'ose ', 'how ', 'ast ', 'ter ', 'ther', 'day ', 'ell ', 'had ', 'our ', 'ity ', 'litt', 'did ', 'ace ', 'ked ', 'cour', 'ely ', 'mus ', 'comm', 'why ', 'ful ', 'ans ', 'ard ', 'ite ', 'actu', 'ong ', 'ate ', 'with', 'two ', 'has ', 'bit ', 'happ', 'cont', 'ird ', 'gonn', 'ese ', 'thin', 'ted ', 'ack ', 'pres', 'big ', 'sign', 'end ', 'got ', 'ure ', 'ase ', 'ble ', 'der ', 'ever', 'ber ', 'know', 'ous ', 'own ', 'star', 'chan', 'ybe ', 'form', 'ine ', 'bad ', 'ies ', 'ank ', 'twit', 'ple ', 'ole ', 'tty ', 'comp', 'kes ', 'deci', 'eir ', 'new ', 'lem ', 'per ', 'est ', 'doll', 'ass ', 'too ', 'put ', 'rex ', 'crun', 'syst', 'ain ', 'ves ', 'ree ', 'red ', 'its ', 'pers', 'beau', 'say ', 'tly ', 'enti', 'stor', 'stre', 'ice ', "'ve ", 'ers ', 'mamm', 'call', 'high', 'als ', 'ill ', 'ago ', 'biom', 'reas', 'ext ', 'plat', 'from', 'turn', 'his ', 'les ', 'art ', 'hot ', 'ten ', 'ems ', 'ool ', 'supp', 'fee ', "'ll ", 'top ', 'anim', 'ary ', 'sea ', 'crab', 'ger ', 'gott', 'illi', 'ild ', 'pick', 'int ', 'ower', 'few ', 'toge', 'solu', 'hund', 'ame ', 'ope ', 'tran', 'crip', 'feel', 'spee', '100 ', 'gle ', 'old ', 'char', 'sear', 'a th', 'gen ', 'argu', 'deli', 'icul', 'plan', 'medi', 'her ', 'unti', 'prot', 'conf', 'plac', 'perf', 'soci', 'ads ', 'real', 'read', 'stan', 'ask ', 'vers', 'extr', 'she ', 'ound', 'age ', 'cred', 'lly ', 'low ', 'tell', 'inst', 'ink ', 'ors ', 'job ', 'teri', '2023', 'prof', 'prom', 'him ', 'eve ', 'educ', 
'fair', 'man ', 'over', 'ven ', 'win ', 'answ', 'bett', 'dise', 'expl', 'hott', '.com', 'ise ', 'imag', 'lid ', 'beli', 'may ', 'act ', 'deca', 'peep', 'set ', 'show', 'dog ', 'joy ', 'ght ', 'ock ', 'tain', 'tea ', '230 ', 'serv', 'brea', 'kay ', 'situ', 'inde', 'orig', 'lish', 'ustr', 'espe', 'elec', 'exam', 'john', 'part', 'pox ', 'heal', 'piec', 'amaz', 'coun', 'fly ', 'hear', 'ped ', 'tor ', 'eric', 'off ', 'res ', 'sten', 'ash ', 'exhi', 'mean', 'cut ', 'fun ', 'air ', 'dle ', 'gori', 'let ', 'row ', "t's ", 'well', 'them', 'anno', 'resp', 'divi', 'need', 'tast', 'ept ', '000 ', 'ead ', 'ete ', 'ety ', 'frea', 'gas ', 'vacc', 'yet ', 'ths ', 'ony ', 'hasi', 'eshi', 'buy ', 'acci', 'weir', 'shar', 'fort', 'boy ', ' th', 'ing', 'at ', 'is ', 'to ', 'you', 'it ', "'s ", 'ed ', 'er ', 'on ', 'le ', 'of ', 'ly ', 'ke ', 'or ', 've ', 'es ', 'en ', 'ot ', 'so ', 'an ', 'ow ', "'t ", 'in ', 'be ', 'we ', 'ver', 'ch ', 'me ', 're ', 'ld ', 'th ', 'pro', 'et ', 'com', 'ey ', 'do ', 'us ', 'll ', 'con', 'don', 'for', 'al ', 'som', 'ati', 'igh', 'ear', 'a l', 'whi', 'goo', 'ter', 'my ', 'em ', 'mor', 'wor', 'if ', 'ts ', 'se ', 'pre', 'gre', 'as ', 'ks ', 'by ', 'per', 'hum', 'st ', 'joh', 'ay ', 'ce ', 'oun', 'all', 'our', 'de ', 'loo', 'om ', 'ad ', 'ds ', 'wat', 'ge ', 'lem', 'cre', 'par', 'mon', 'sig', 'est', 'can', 'up ', 'mos', 'end', 'ar ', 'no ', 'now', 'ys ', 'id ', 'sel', 'big', 'der', 'el ', 'any', 'onn', 'op ', 'stu', 'go ', 'fin', 'ir ', 'pos', 'ick', 'lic', 'tim', 'dy ', 'tic', 'mem', 'ic ', 'ue ', 'des', 'spe', 'ell', 'col', 'fir', 'mus', 'ner', 'clu', 'soc', 'tal', 'am ', 'sol', 'ous', 'ari', 'tur', 'exi', 'ps ', 'wan', 'str', 'ill', 'cel', 'fig', 'fol', 'te ', 'a c', 'rew', 'he ', 'wee', 'ep ', 'hun', 'a m', '00 ', 'uh ', 'ven', 'fe ', 'ff ', 'sor', 'bun', 'exc', '201', 'duc', 'fun', 'ide', 'num', 'dic', 'fac', 'oh ', 'um ', 'yea', 'fri', 'bas', 'cy ', 'ty ', 'oul', 'sur', 'cof', 'mis', 'twe', 'vir', '10 ', 'cer', 'eas', 'ank', 'a s', 'iti', 'tif', 
'acc', 'sim', 'sub', 'ail', 'gen', 'rec', 'mag', 'imp', 'dur', 'car', 'dar', 'dr ', 'mar', 'ry ', 'sen', 'mom', 'val', 'rea', 'sh ', 'hal', 'mat', 'sch', 'att', 'itu', 'ess', 'dis', 'try', 'ba ', 'bar', 'far', 'won', 'a f', 'but', 'spo', "'d ", 'foo', 'fav', 'gan', 'man', 'ss ', 'ser', 'eng', 'ort', 'whe', 'sul', 'aur', '20 ', 'air', 'cal', 'ft ', 'sal', 'tre', 'ze ', 'ori', 'ign', 'pub', 'ber', 'din', 'dem', 'gs ', 'law', 'ny ', 'nam', 'fur', 'leg', 'ain', 'fer', 'run', 'sci', 'shi', 'doc', 'tri', 'oxy', 'coo', 'dan', 'how', 'nor', 'pt ', 'pic', 'tow', 'emp', 'qui', 'tak', 'flu', 'ama', 'bro', 'cen', 'cor', 'gy ', 'hon', 'lar', 'pe ', 'sta', 'vie', 'new', 'ag ', 'pod', 'fix', 'aqu', 'boo', 'cul', 'den', 'foc', 'hou', 'min', 'mr ', 'mol', 'non', 'rat', 'wal', 'xim', 'ind', 'arr', 'suc', 'joy', 'cru', 'oce', 'ban', 'bus', 'bab', 'fam', 'lon', 'pen', 'pri', 'tan', 'tom', 'the', 'tis', 'ack', 'cap', 'hur', '12 ', 'ck ', 'dro', 'evi', 'gra', 'han', 'hel', 'ili', 'jan', 'jec', 'mic', 'mun', 'm5 ', 'nec', 'pat', 'tin', 'tea', 'ute', 'vel', 'wra', 'are', 're-', 'who', 'beg', 'ham', 'bur', 'ass', "i'm", 'doz', 'tra', 'isl', 'pho', 'e ', 't ', 'th', 's ', 'd ', 'in', 'an', 'y ', 'er', 'o ', 'on', 'ou', 'a ', 'en', 'or', 'ar', 'al', 'of', 're', 'li', 'i ', 'at', 'wh', 'it', 'om', 'us', 'l ', 'ow', 'st', 'be', 'ch', 'ha', 'es', 'ti', 'wa', 'em', 'bu', 'k ', 'ab', 'ig', 'op', 'ma', 'ac', 'el', 'de', 'ro', 'wi', 'di', 'pe', 'oo', 'as', 'ic', 'se', 'os', 'un', 'm ', 'f ', 'ca', 'tu', 'we', "i'", 'r ', 'tt', 'si', 'n ', 'ex', 'sh', 'vi', 'do', 'su', 'ec', 'jo', 'sa', 'um', 'tr', 'p ', 'ol', 'gh', 'pl', 'ir', 'ne', 'aw', 'cr', 'fr', 'ci', 'pp', 'qu', 'fe', 'bo', 'go', 'ff', 'fu', 'il', 'tw', 'to', '0 ', 'ag', 'le', 'oc', 'cl', 'im', 'au', 'h ', 'sp', 'me', 'ta', 'am', 'pu', 'g ', 'is', 'lo', 'ul', 'w ', 'ad', 'x ', 'po', 'ri', 'ea', 'ge', 'ra', '20', 'du', 'pr', 'fi', 'ev', 'mo', 'gu', 'co', 'gi', 'ep', 'bi', 'br', 'hi', 'ey', 'pa', 'cu', 'no', 'b ', 'gr', 'he', 'if', 'mu', 'av', 
'll', 'dr', 'fl', 'ph', 'sy', 'te', 'sc', 'mm', 't-', 'sw', 'iz', '5 ', '23', 'mi', 'so', 'gl', 'la', 'ob', 'af', 'df', 'hu', 'up', '2 ', 'bl', 'da', 'ip', 'pi', 'ho', 'hy', 'my', '6 ', 'ki', '. ', 'ed', 'ei', 'vo', 'ap', 'wr', 'eg', 'sk', 'sl', 'ox', '8 ', '19', 'ju', 'ur', '1 ', '4 ', '] ', 'd-', 'p4', 'dy', 'et', 'od', 'ty', ' ', "'", '-', '.', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '[', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

This list is sorted from longest token to shortest (scroll left and right to go through them all).
The longer tokens are interesting because they make his speech patterns more noticeable, and they act as confirmation that the tokenizer did what we'd expect it to. The top tokens are:

"john i'll see you "
'degrees celsius '
"i don't know if "
'a lot of people '

Hank always ends his videos with "John, I'll see you" and says all of these phrases.
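A longest-first listing like the one above can be produced from a token-to-id mapping. A minimal sketch, assuming the vocabulary is a plain dict such as the one returned by `tokenizer.get_vocab()`; `tokens_longest_first` is a hypothetical helper, the tie-breaking by id is an arbitrary choice, and the toy vocabulary is made up:

```python
def tokens_longest_first(vocab):
    """Return tokens sorted by length (longest first), then by token id."""
    return sorted(vocab, key=lambda tok: (-len(tok), vocab[tok]))


# Toy vocabulary for illustration; a real one could come from tokenizer.get_vocab()
toy_vocab = {"a": 0, "b": 1, "and ": 2, "the ": 3, "th": 4, "good morning ": 5}
print(tokens_longest_first(toy_vocab))
# ['good morning ', 'and ', 'the ', 'th', 'a', 'b']
```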

Train the model

In the video, Karpathy links to the code needed to build the model; it can be found here. I swapped out his tokenization method for mine.

I also reduced the block size to 127. That decreases the number of tokens in each context window, but because the vocabulary is so much larger (and a single token can cover several characters), the outcome is similar.
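To make the block size concrete, here is a minimal sketch of how one training example can be cut from the token stream in a next-token-prediction setup. It is plain Python rather than the tensor code from the video, and `get_example` and `token_ids` are hypothetical names used only for illustration:

```python
import random


def get_example(data, block_size, rng=random):
    """Cut one (context, targets) pair out of a sequence of token ids."""
    # Pick a start index so that both windows fit inside the data
    i = rng.randrange(len(data) - block_size)
    x = data[i : i + block_size]          # what the model sees
    y = data[i + 1 : i + block_size + 1]  # the same window shifted by one token
    return x, y


token_ids = list(range(1000))  # stand-in for the encoded corpus
x, y = get_example(token_ids, block_size=127)
print(len(x), len(y))  # 127 127
```

Each position in `y` is the token that follows the corresponding prefix of `x`, so a block size of 127 means the model never conditions on more than 127 tokens at once.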

The training process and progress looked like this.

step 0: train loss 7.6589, val loss 7.6611
step 500: train loss 4.0463, val loss 3.8552
step 1000: train loss 3.4484, val loss 3.2592
step 1500: train loss 3.1274, val loss 2.9934
step 2000: train loss 2.8747, val loss 2.7594
step 2500: train loss 2.6454, val loss 2.5829
step 3000: train loss 2.4475, val loss 2.4162
step 3500: train loss 2.2536, val loss 2.2557
step 4000: train loss 2.0791, val loss 2.1240
step 4500: train loss 1.9216, val loss 1.9794
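As a quick sanity check on the first line of the log: an untrained model that guesses uniformly over the vocabulary should have a cross-entropy loss of about ln(vocab_size). Assuming the final vocabulary really has the 2000 entries requested from the trainer (it can end up smaller), that works out as:

```python
import math

vocab_size = 2000  # the vocab_size passed to BpeTrainer above
expected_initial_loss = math.log(vocab_size)
print(f"{expected_initial_loss:.4f}")  # 7.6009
```

That is close to the 7.66 reported at step 0, which suggests the model starts out roughly uniform over the new tokens.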

I think my validation numbers are a bit flawed because of the way we are splitting our data:

train_data = data[:n]
val_data = data[n:]

Because the data isn't homogeneous, this means we're validating against a different style of data: the first half of the data is YouTube videos and the second half is podcast data.
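One possible way around this (not what was done for this project) is to split each source separately, so that both the training and validation sets contain YouTube and podcast text. A minimal sketch with made-up stand-in data; `split_per_source` and the 0.9 fraction are assumptions for illustration:

```python
def split_per_source(sources, train_frac=0.9):
    """Split each source separately so train and val both contain every style."""
    train_data, val_data = [], []
    for tokens in sources:
        n = int(train_frac * len(tokens))
        train_data.extend(tokens[:n])
        val_data.extend(tokens[n:])
    return train_data, val_data


youtube_ids = [1] * 100  # stand-in for encoded YouTube transcripts
podcast_ids = [2] * 50   # stand-in for encoded podcast transcripts
train_data, val_data = split_per_source([youtube_ids, podcast_ids])
print(len(train_data), len(val_data))  # 135 15
```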

Preparing for the web

The video uses PyTorch, but PyTorch currently isn't well supported on the web, so I converted the PyTorch model to ONNX. PyTorch has a built-in exporter for this.

I struggled a lot with the conversion. The root of the issue was understanding the input shape the model wanted. This is what ended up working for me.

torch.onnx.export(model, context, "longTokens.onnx",input_names=['input'],  output_names=['output'], dynamic_axes={'input': {0: 'batch_size', 1: 'seq_len'}})

The output of this is a 50 MB .onnx file.

ONNX has also changed a lot over the past few years, so beware of old tutorials. This one, for example, was misleading: it was useful for a better understanding of how ONNX works, but the API is now different.

Making the web app

I made a Vite/Vue/TypeScript project, as I always do.

For this project I tried out Bulma for styling. It's not much of a test, because there is so little in this project.

Getting the onnxruntime-web library working as expected involved a few stumbles.

I needed a utility file to load the WASM files needed for the project.

// @ts-ignore
import * as ort from "onnxruntime-web/dist/ort-web.min.js";

import wasm from "onnxruntime-web/dist/ort-wasm.wasm?url";
import wasmThreaded from "onnxruntime-web/dist/ort-wasm-threaded.wasm?url";
import wasmSimd from "onnxruntime-web/dist/ort-wasm-simd.wasm?url";
import wasmSimdThreaded from "onnxruntime-web/dist/ort-wasm-simd-threaded.wasm?url";

// @ts-ignore
import modelUrl from "longTokens.onnx";

ort.env.wasm.wasmPaths = {
  "ort-wasm.wasm": wasm,
  "ort-wasm-threaded.wasm": wasmThreaded,
  "ort-wasm-simd.wasm": wasmSimd,
  "ort-wasm-simd-threaded.wasm": wasmSimdThreaded,
};

And then the generation function

const data = encode(input).map((val) => String(val));
const idx = new ort.Tensor("int64", data, [1, data.length]);
const outputMap = await props.session.run({ input: idx });
let output = decode(outputMap.output.data as unknown as number[]);

There was a lot of fiddling to get BigInt working, but converting each value into a string and back into an int worked.

Ok, so does this make sense?
No, you probably shouldn't be making models for the front end except in some unique situations.

Should you make a model from nothing?
No, it will not be nearly as good as just fine-tuning an existing model.

Published on: 2023-03-05