Skip to content

Instantly share code, notes, and snippets.

@dpetranek
Forked from ftrain/rhymes.clj
Created December 14, 2017 02:21
Show Gist options
  • Select an option

  • Save dpetranek/cd7d7a19cdef38d98e39b4739fcdaa0f to your computer and use it in GitHub Desktop.

Select an option

Save dpetranek/cd7d7a19cdef38d98e39b4739fcdaa0f to your computer and use it in GitHub Desktop.
Annotated rhyming dictionary
;; So we want a rhyming dictionary in Clojure.
;;
;; https://gist.github.com/jackrusher/8640437
;;
;; I'm going to study this code and learn as I go.
;;
;; First I put it in a namespace.
(ns unscroll.rhyme
(:require
[clojure.string :as string]))
;; So that I can mess with it inside my current environment in
;; emacs.
;;
(def rhyme-txt
(map #(string/split % #"[ ]+")
(string/split-lines
(slurp "data/cmudict.txt"))))
;; Good ol' CMU rhyming dictionary; nice to see you again my friend!
;;
;; The format of the dictionary is such:
;;
;; ABDOMEN AE0 B D OW1 M AH0 N
;;
;; ie.
;;
;; WORD[SPACE]PHONEME_1[SPACE]PHONEME_2[SPACE]...PHONEME_N[NEWLINE]
;;
;; So we slurp in the file, split the lines by newlines, and then
;; split on space.
;;
;; Gotta love slurp; one thing it took me a while to figure out the
;; other day is that slurp starts looking for files at the top level
;; of the Clojure project. Where the source files are means nothing
;; to Clojure (because it means nothing to the JVM).
;;
;; Anyway now that we've done that we can say:
;; (take 3 (drop 1010 rhyme-txt))
;;
;; (We want to fast-forward past a bunch of comments and miscellany
;; to get a clear example, so start around 1010)
;;
;; And we'd see
;; (["ACTUARY" "AE1" "K" "CH" "UW0" "EH1" "R" "IY2"]
;; ["ACTUATE" "AE1" "K" "CH" "UW2" "EY1" "T"]
;; ["ACTUATOR" "AE1" "K" "T" "Y" "UW0" "EY2" "T" "ER0"])
;;
;; So now we're dealing with a list of vectors where the first value
;; is the word and the rest represent its phonemes.
;;
;; There's a lot going on with this one as there always is with
;; Clojure code. Let's see how it gets called first:
;;
;; (get-deepest-values (get-in rhyme-to-word (take 3 (word-to-rhyme "connection"))))
;; Which could probably be a function of its own called "rhyme."
;;
;; Although I've noticed Jack doesn't really care much whether
;; something is a function or a variable, it's like it DOESN'T EVEN
;; MATTER. Hmm.
;;
;; Anyway, we're going to give this function, word-to-rhyme a word
;; and get something back.
;;
;; (word-to-rhyme "connection")
;;
;; and we get
;;
;; [:N :AH :SH :K :EH :N :AH :K]
;;
;; which is the reverse order of phonemes. Which makes sense, we're
;; building a rhyming dictionary, gonna go end to beginning in terms
;; of phonemes.
;;
;; Going to add a crazy amount of indenting here so I can see the
;; levels more clearly as a n00b.
(def word-to-rhyme
(reduce (fn [m [word & rhyme]]
(assoc m
(string/lower-case word)
(mapv
#(keyword
(string/replace %1 #"[0-9]" ""))
(reverse rhyme))))
{} rhyme-txt))
;; This is a funny one because I mess with the parens my java keeps
;; running out of heap space. So it's clearly doing a lot, like
;; loading the whole CMU dictionary into memory.
;; The super hot LISPY action in there is
;; (reduce function=[that function] val=[an empty map {}]
;; coll=[rhyme-txt, our list of vectors])
;; Clojure docs are kind of a bear. For reduce, they say "If val is
;; supplied, returns the result of applying f to val and the first
;; item in coll, then applying f to that result and the 2nd item,
;; etc." AWESOME.
;; So what reduce does here is return the results applying that
;; function in there to {} and the first item in the rhyme-text,
;; then applies that to the second item, etc.
;; So in [m [word & rhyme]] it's going to be going:
;;
;; [{} ["CONNECTION" K AH N EH K SH AH N ]]
;;
;; So that's interesting because we have the word & rhyme -- the way
;; that destructuring works, rhyme will catch all of the phonemes
;; into a list; it's almost like the CMU people could predict this
;; kind of programming would occur using their dictionary.
;; Then we say:
;;
;; (assoc map key val)
;;
;; Or here:
;;
;; (assoc {} "connection" ...)
;;
;; And then a couple things happen
;;
;; First we reverse the rhyme so "K AH0 N EH1 K SH AH0 N" becomes
;; N AH0 SH K EH1 N AH0 K
;; that's (reverse rhyme)
;;
;; SECOND we replace all the numbers with nothing (likely because we
;; just don't need the data, can't remember why CMU uses numbers--
;; syllable stress maybe?)
;;
;; we do that with a regular expression yielding
;;
;; N AH SH K EH N AH K
;;
;; THIRD we run a mapv across that list of phonemes--that is, apply
;; to each element and return a vector. And what we are applying is
;; the "keyword" function which turns a string to a clojure keyword so
;; we end up with a structure like:
;;
;; {"connection" [:N :AH :SH :K :EH :N :AH :K]}
;;
;; NOTE: I'm not sure WHY we're converting to keywords but they are
;; prettier in general and make for better keywords in maps, and I'm
;; assuming they actually are optimized as, like, keywords, so....
;;
;; Anyway, and then we repeat that (lazily, I guess, so in chunks of
;; 32 or whatever it is that Clojure does?) as needed until we've
;; slurped up the whole file into a big map or what I still think of
;; as an associative array. Aaaand now we have a variable that
;; defines a function that when given---
;;
;; Oh god, I SEE. I ACTUALLY SEE. This is a def instead of a
;; function for a reason. It's a var that calls a function which
;; returns a map, but in Clojure a map can operate as a function. So
;; when I say:
;;
;; (word-to-rhyme "connection") I'm causing the interpreter to read
;; the entirety of the dictionary into a map, and assigning that map
;; to "word-to-rhyme" and then because I'm calling word-to-rhyme as
;; the first item in a sexp, the interpreter evaluates it as a
;; function and returns the phonemes that it has assoc'd to that
;; word.
;;
;; Clojure is kind of dense.
;;
;; So I'm going to assume we're in similar territory here with this
;; function.
(def rhyme-to-word
(reduce
(fn [m [word rhyme]]
(assoc-in m rhyme { :terminate word }))
{}
word-to-rhyme))
;; Aand we are, KIND OF. Hmm. So in this case we take the map (now just a
;; map) from word-to-rhyme and do another reduce, except the
;; structure we're building up is going to be a trie (?) so we're
;; going:
;;
;; (assoc-in {} [:N :AH :SH :K :EH :N :AH :K] { :terminate "connection" })
;;
;; And as a result we're getting:
;;
;; {:N {:AH {:SH {:K {:EH {:N {:AH {:K {:terminate "connection"}}}}}}}}}
;;
;; Great but that's one word. NOW Clojure hands that same map back
;; to the reduce with ANOTHER word, and so on for thousands of
;; words, building up a huge nested behemoth of a data structure.
;;
;; So we've passed assoc-in the phonemes for "connection"; we can
;; now pass it "correction" and they should be all jammed up in a
;; really nice way...
;;
;; (assoc-in (assoc-in {} [:N :AH :SH :K :EH :N :AH :K] { :terminate
;; "connection" }) [:N :AH :SH :K :EH :ER :K] {:terminate
;; "correction"})
;;
;; Okay, yes we end up with something that will let us take one
;; word, look up the phonemes (in reverse order) and look for
;; similar phonemes, then map those back to the words. That's what
;; we have here, no doubt. Looks like this:
;;
;; {:N {:AH {:SH {:K {:EH {:ER {:K {:terminate "correction"}}, :N
;; {:AH {:K {:terminate "connection"}}}}}}}}}
;;
;; And since I can assoc-in I can get in too.
;;
;; Okay so on we go...
(defn get-deepest-values [k]
(if (string? k) [k] (mapcat get-deepest-values (vals k))))
;; What the hell is this? What is it for? OH GOD.
;;
;; So here we're looking for strings inside a nest of keywords--that
;; makes sense. What is mapcat? Clojure docs: "Returns the result
;; of applying concat to the result of applying map to f and
;; colls. Thus function f should return a collection."
;;
;; Great, thanks Clojure docs. You're my pal.
;; What it means I think is that you're going to give this function
;; a bundle of stuff and it'll do something to each piece of stuff
;; (map...) and then smush everything together into one nice list
;; (...cat). So we're saying given a nested associated structure
;; like the one we just made, pull out all the values ...
;;
;; wait hold on--let's look at how it's actually called.
;; Okay this is the big mooooment
(get-deepest-values (get-in rhyme-to-word (take 5 (word-to-rhyme "erection"))))
;; And this gives a result thus: ("erection" "direction" "correction" "collection" "inflection" "preelection" "circumspection" "introspection" "imperfection" "perfection" "midsection" "transection" "connection" "protection")
;; But sometimes it's all too last-first for me, so Let's do that
;; using this guy "->>" with comments
(->>
(word-to-rhyme "erection")
;; gives us [:N :AH :SH :K :EH :R :IH]
(take 5)
;; gives us (:N :AH :SH :K :EH)--i.e. five phonemes, or the "ection" part of the rhyme.
(get-in rhyme-to-word)
;; so we're calling (get-in rhyme-to-word '(:N :AH :SH :K :EH))
;; and we get this result:
;;
;; {:R {:IH {:terminate "erection"}}, :ER {:D
;; {:terminate "direction"}, :K {:terminate "correction"}}, :L {:AH
;; {:K {:terminate "collection"}}, :F {:N {:IH
;; {:terminate "inflection"}}}, :IH {:IY {:R {:P
;; {:terminate "preelection"}}}}}, :P {:S {:M {:AH {:K {:ER {:S
;; {:terminate "circumspection"}}}}}, :AH {:R {:T {:N {:IH
;; {:terminate "introspection"}}}}}}}, :F {:ER {:P {:M {:IH
;; {:terminate "imperfection"}}, :terminate "perfection"}}}, :S {:D
;; {:IH {:M {:terminate "midsection"}}}, :N {:AE {:R {:T
;; {:terminate "transection"}}}}}, :N {:AH {:K
;; {:terminate "connection"}}}, :T {:AH {:R {:P
;; {:terminate "protection"}}}}}
;;
;; Okay so THAT's what we're working with when we call....
(get-deepest-values))
;; So it turns out that all that
;;
;; (defn get-deepest-values [k]
;; (if (string? k) [k] (mapcat get-deepest-values (vals k))))
;;
;; Does is say: Get the values from the key/value pairs that are in
;; a map called "k." If you hit any value at all that is a string,
;; return it and you're done for that part. OTHERWISE keep mapping
;; over the values and run this function again on each one of them
;; (until you hit a string). And however nested things are is fine
;; and all, but please return a nice flat list of results (that's
;; why it's mapcat instead of cat")
;;
;; So it's like you gave it a siamese russian doll. And it keeps
;; looking inside the first twin's dolls until it finds a piece of
;; paper with a word or two on it. And then it throws away all the
;; dolls. And it does the same to the other twin. Maybe the first
;; twin is three dolls deep. Maybe the second twin is four dolls
;; deep.
;; So that's the reason we switched strings to keywords, because
;; we knew we'd be looking things up by phonemes and we wanted
;; a data type that is good for "looking things up in deeply
;; nested maps" and if there's ONE data type like that, it's keywords! Maybe;
;; god if I know.
;; Of all of them, these tiny recursive functions are the hardest to
;; write and understand.
;; Anyway, that's how it works, I think.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment