Markov name generation

2018/01/01

5 minute read

Generation of random planet names was a task I tackled recently, and these type of ‘random name’ generations are great candidates for Markov Chains. Markov chains are probabilistic structures where the next element in the chain depends on the previous elements. By training the model on a set on input data the process creates a chain with each link containing a set of the next probable elements. By starting at a random link in the chain and following until you reach an ending node it’s possible to get similar names to the original set.

Here is an example of how this works: Say you have the training input

[sally sells seashells]

we will partition the input into 2 letter groups (the order of the chain is 2) and record the next letters that follow each partition. $ means end of word.

(sa) -> l
(al) -> l
(ly) -> $

(se) -> l
(el) -> l
(ll) -> s
(ls) -> $

(se) -> a
(ea) -> s
(as) -> h
(sh) -> e
(he) -> l
(el) -> l
(ll) -> s
(ls) -> $

now we combine the same keys

(sa) -> (l)
(al) -> (l)
(ll) -> (y)
(ly) -> ($)
(se) -> (l,a)
(el) -> (l,l)
(ll) -> (s,s)
(ls) -> ($,$)
(ea) -> (s)
(as) -> (h)
(sh) -> (e)
(he) -> l

we have built our chain. To generate a new name we pick a random starting point and follow the chain by appending one of the characters in the values until we reach an end or have a long enough string. Let’s say we picked ea to start. Here is a sample string that we could end up with

(ea) + s -> eas
e(as) + h -> eash
ea(sh) + e -> eashe
eas(he) + l -> eashel
...

we would end up with eashells. If we had encountered (se) we would randomly choose between l and a. This small example does not product very original results but that’s because our training data was limited. Given a large amount of data and a higher order we would end up with better results. Here is some clojure code to create a markov chain and generate names.

;;the list is truncated for readability
(def words ["Acamar","Achernar","Achird","Acrab","Akrab","Elakrab","Graffias",
            "Acrux","Acubens","Adhafera","Adhara","Ain","Aladfar","Alamak","Alathfar","Alaraph","Albaldah","Albali" ...])
            
;;append ? to the end of each word so we have an easy way of knowing that a word has ended          
(def samples (map  #(str % "?") words))


;;partition the words. we will use the first 2 characters as the key and the last as the value.
;;((\s \a \l) (\a \l \l) (\l \l \y) (\s \e \l) (\e \l \l) (\l \l \s) (\s \e \a) (\e \a \s) (\a \s \h) (\s \h \e) (\h \e \l) (\e \l \l))
(def orders (mapcat #(partition 3 1 %) samples))

;;convert into a list of maps with the keys and values set
;;({"sa" \l} {"al" \l} {"ll" \y} {"se" \l} {"el" \l} {"ll" \s} {"se" \a} {"ea" \s} {"as" \h} {"sh" \e} {"he" \l} {"el" \l})
(def gr (map (fn [[f s t :as o]] {(str f s) t}) orders))

;;convert maps to 2-vectors so we can group them
;;(["sa" \l] ["al" \l] ["ll" \y] ["se" \l] ["el" \l] ["ll" \s] ["se" \a] ["ea" \s] ["as" \h] ["sh" \e] ["he" \l] ["el" \l])
(def mp (map first gr))

;;group by the keys so that we have a list of possible characters for each key
;;{"al" [["al" \l]], "ll" [["ll" \y] ["ll" \s]], "el" [["el" \l] ["el" \l]], "ea" [["ea" \s]], "sa" [["sa" \l]], "se" [["se" \l] ["se" \a]],
;; "sh" [["sh" \e]], "as" [["as" \h]], "he" [["he" \l]]}
(def gb (group-by first mp))

;;clean up the group-by so that only the following characters remain
;;{"ls" (\? \?), "al" (\l), "ll" (\y \s \s), "el" (\l \l), "ly" (\?), "ea" (\s), "sa" (\l), "se" (\l \a), "sh" (\e), "as" (\h), "he" (\l)}
(def markov-map (reduce-kv #(assoc %1 %2 (map second %3)) {} gb))

(defn generate-name
  "create a new name from the markov chain"
  []
  (loop [word ""
    part (subs (rand-nth words) 0 2)]
    (if (= part \?)
      word
      (let [new-word (str word part)
            new-part (rand-nth (markov-map (apply str (take-last 2 new-word))))]
           (recur new-word new-part)))))
           

the generation function works like this

  1. our word is empty and the next part is the first 2 letters from a random word in our training data
  2. if our part is the end of a word stop and return the word so far
  3. other wise append the part to the word, select the a random value from the markov map with the key as the last 2 letters of our word
  4. recurse with the new word and new part

Here are some of the names that the generator came up with

(take 25 (repeatedly generate-name))

("Sape" "Kal" "Torus" "Mei" "Taulurushaukbu" "DuQue" "Cangforishynoriah" "Pavlius" "Gachine" "Arra" "Midea" "Avent" "Alcyoni" "Hava'eloremaias" "Rembus" "Scelsiascidia" "Fimsor" "Drenglossichiropus" "Ohnis" "Nex" "Bram" "Minn" "Altaramelbalia" "Gia" "Propintlah")

Here is a link to the full seed list.

Playing with the order affects the quality of the results. For this example and order 3 was overfitting the data - producing mostly names identical to the ones in the training data. Order 2 seems like the best fit.