-
-
Save dpetranek/cd7d7a19cdef38d98e39b4739fcdaa0f to your computer and use it in GitHub Desktop.
Annotated rhyming dictionary
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| ;; So we want a rhyming dictionary in Clojure. | |
| ;; | |
| ;; https://gist.github.com/jackrusher/8640437 | |
| ;; | |
| ;; I'm going to study this code and learn as I go. | |
| ;; | |
| ;; First I put it in a namespace. | |
| (ns unscroll.rhyme | |
| (:require | |
| [clojure.string :as string])) | |
| ;; So that I can mess with it inside my current environment in | |
| ;; emacs. | |
| ;; | |
| (def rhyme-txt | |
| (map #(string/split % #"[ ]+") | |
| (string/split-lines | |
| (slurp "data/cmudict.txt")))) | |
| ;; Good ol' CMU rhyming dictionary; nice to see you again my friend! | |
| ;; | |
| ;; The format of the dictionary is such: | |
| ;; | |
| ;; ABDOMEN AE0 B D OW1 M AH0 N | |
| ;; | |
| ;; ie. | |
| ;; | |
| ;; WORD[SPACE]PHONEME_1[SPACE]PHONEME_2[SPACE]...PHONEME_N[NEWLINE] | |
| ;; | |
| ;; So we slurp in the file, split the lines by newlines, and then | |
| ;; split on space. | |
| ;; | |
| ;; Gotta love slurp; one thing it took me a while to figure out the | |
| ;; other day is that slurp starts looking for files at the top level | |
| ;; of the Clojure project. Where the source files are means nothing | |
| ;; to Clojure (because it means nothing to the JVM). | |
| ;; | |
| ;; Anyway now that we've done that we can say: | |
| ;; (take 3 (drop 1010 rhyme-txt)) | |
| ;; | |
| ;; (We want to fast-forward past a bunch of comments and miscellany | |
| ;; to get a clear example, so start around 1010) | |
| ;; | |
| ;; And we'd see | |
| ;; (["ACTUARY" "AE1" "K" "CH" "UW0" "EH1" "R" "IY2"] | |
| ;; ["ACTUATE" "AE1" "K" "CH" "UW2" "EY1" "T"] | |
| ;; ["ACTUATOR" "AE1" "K" "T" "Y" "UW0" "EY2" "T" "ER0"]) | |
| ;; | |
| ;; So now we're dealing with a list of vectors where the first value | |
| ;; is the word and the rest represent its phonemes. | |
| ;; | |
| ;; There's a lot going on with this one as there always is with | |
| ;; Clojure code. Let's see how it gets called first: | |
| ;; | |
| ;; (get-deepest-values (get-in rhyme-to-word (take 3 (word-to-rhyme "connection")))) | |
| ;; Which could probably be a function of its own called "rhyme." | |
| ;; | |
| ;; Although I've noticed Jack doesn't really care much whether | |
| ;; something is a function or a variable, it's like it DOESN'T EVEN | |
| ;; MATTER. Hmm. | |
| ;; | |
| ;; Anyway, we're going to give this function, word-to-rhyme a word | |
| ;; and get something back. | |
| ;; | |
| ;; (word-to-rhyme "connection") | |
| ;; | |
| ;; and we get | |
| ;; | |
| ;; [:N :AH :SH :K :EH :N :AH :K] | |
| ;; | |
| ;; which is the reverse order of phonemes. Which makes sense, we're | |
| ;; building a rhyming dictionary, gonna go end to beginning in terms | |
| ;; of phonemes. | |
| ;; | |
| ;; Going to add a crazy amount of indenting here so I can see the | |
| ;; levels more clearly as a n00b. | |
| (def word-to-rhyme | |
| (reduce (fn [m [word & rhyme]] | |
| (assoc m | |
| (string/lower-case word) | |
| (mapv | |
| #(keyword | |
| (string/replace %1 #"[0-9]" "")) | |
| (reverse rhyme)))) | |
| {} rhyme-txt)) | |
| ;; This is a funny one because I mess with the parens my java keeps | |
| ;; running out of heap space. So it's clearly doing a lot, like | |
| ;; loading the whole CMU dictionary into memory. | |
| ;; The super hot LISPY action in there is | |
| ;; (reduce function=[that function] val=[an empty map {}] | |
| ;; coll=[rhyme-txt, our list of vectors]) | |
| ;; Clojure docs are kind of a bear. For reduce, they say "If val is | |
| ;; supplied, returns the result of applying f to val and the first | |
| ;; item in coll, then applying f to that result and the 2nd item, | |
| ;; etc." AWESOME. | |
| ;; So what reduce does here is return the results applying that | |
| ;; function in there to {} and the first item in the rhyme-text, | |
| ;; then applies that to the second item, etc. | |
| ;; So in [m [word & rhyme]] it's going to be going: | |
| ;; | |
| ;; [{} ["CONNECTION" K AH N EH K SH AH N ]] | |
| ;; | |
| ;; So that's interesting because we have the word & rhyme -- the way | |
| ;; that destructuring works, rhyme will catch all of the phonemes | |
| ;; into a list; it's almost like the CMU people could predict this | |
| ;; kind of programming would occur using their dictionary. | |
| ;; Then we say: | |
| ;; | |
| ;; (assoc map key val) | |
| ;; | |
| ;; Or here: | |
| ;; | |
| ;; (assoc {} "connection" ...) | |
| ;; | |
| ;; And then a couple things happen | |
| ;; | |
| ;; First we reverse the rhyme so "K AH0 N EH1 K SH AH0 N" becomes | |
| ;; N AH0 SH K EH1 N AH0 K | |
| ;; that's (reverse rhyme) | |
| ;; | |
| ;; SECOND we replace all the numbers with nothing (likely because we | |
| ;; just don't need the data, can't remember why CMU uses numbers-- | |
| ;; syllable stress maybe?) | |
| ;; | |
| ;; we do that with a regular expression yielding | |
| ;; | |
| ;; N AH SH K EH N AH K | |
| ;; | |
| ;; THIRD we run a mapv across that list of phonemes--that is, apply | |
| ;; to each element and return a vector. And what we are applying is | |
| ;; the "keyword" function which turns a string to a clojure keyword so | |
| ;; we end up with a structure like: | |
| ;; | |
| ;; {"connection" [:N :AH :SH :K :EH :N :AH :K]} | |
| ;; | |
| ;; NOTE: I'm not sure WHY we're converting to keywords but they are | |
| ;; prettier in general and make for better keywords in maps, and I'm | |
| ;; assuming they actually are optimized as, like, keywords, so.... | |
| ;; | |
| ;; Anyway, and then we repeat that (lazily, I guess, so in chunks of | |
| ;; 32 or whatever it is that Clojure does?) as needed until we've | |
| ;; slurped up the whole file into a big map or what I still think of | |
| ;; as an associative array. Aaaand now we have a variable that | |
| ;; defines a function that when given--- | |
| ;; | |
| ;; Oh god, I SEE. I ACTUALLY SEE. This is a def instead of a | |
| ;; function for a reason. It's a var that calls a function which | |
| ;; returns a map, but in Clojure a map can operate as a function. So | |
| ;; when I say: | |
| ;; | |
| ;; (word-to-rhyme "connection") I'm causing the interpreter to read | |
| ;; the entirety of the dictionary into a map, and assigning that map | |
| ;; to "word-to-rhyme" and then because I'm calling word-to-rhyme as | |
| ;; the first item in a sexp, the interpreter evaluates it as a | |
| ;; function and returns the phonemes that it has assoc'd to that | |
| ;; word. | |
| ;; | |
| ;; Clojure is kind of dense. | |
| ;; | |
| ;; So I'm going to assume we're in similar territory here with this | |
| ;; function. | |
| (def rhyme-to-word | |
| (reduce | |
| (fn [m [word rhyme]] | |
| (assoc-in m rhyme { :terminate word })) | |
| {} | |
| word-to-rhyme)) | |
| ;; Aand we are, KIND OF. Hmm. So in this case we take the map (now just a | |
| ;; map) from word-to-rhyme and do another reduce, except the | |
| ;; structure we're building up is going to be a trie (?) so we're | |
| ;; going: | |
| ;; | |
| ;; (assoc-in {} [:N :AH :SH :K :EH :N :AH :K] { :terminate "connection" }) | |
| ;; | |
| ;; And as a result we're getting: | |
| ;; | |
| ;; {:N {:AH {:SH {:K {:EH {:N {:AH {:K {:terminate "connection"}}}}}}}}} | |
| ;; | |
| ;; Great but that's one word. NOW Clojure hands that same map back | |
| ;; to the reduce with ANOTHER word, and so on for thousands of | |
| ;; words, building up a huge nested behemoth of a data structure. | |
| ;; | |
| ;; So we've passed assoc-in the phonemes for "connection"; we can | |
| ;; now pass it "correction" and they should be all jammed up in a | |
| ;; really nice way... | |
| ;; | |
| ;; (assoc-in (assoc-in {} [:N :AH :SH :K :EH :N :AH :K] { :terminate | |
| ;; "connection" }) [:N :AH :SH :K :EH :ER :K] {:terminate | |
| ;; "correction"}) | |
| ;; | |
| ;; Okay, yes we end up with something that will let us take one | |
| ;; word, look up the phonemes (in reverse order) and look for | |
| ;; similar phonemes, then map those back to the words. That's what | |
| ;; we have here, no doubt. Looks like this: | |
| ;; | |
| ;; {:N {:AH {:SH {:K {:EH {:ER {:K {:terminate "correction"}}, :N | |
| ;; {:AH {:K {:terminate "connection"}}}}}}}}} | |
| ;; | |
| ;; And since I can assoc-in I can get in too. | |
| ;; | |
| ;; Okay so on we go... | |
| (defn get-deepest-values [k] | |
| (if (string? k) [k] (mapcat get-deepest-values (vals k)))) | |
| ;; What the hell is this? What is it for? OH GOD. | |
| ;; | |
| ;; So here we're looking for strings inside a nest of keywords--that | |
| ;; makes sense. What is mapcat? Clojure docs: "Returns the result | |
| ;; of applying concat to the result of applying map to f and | |
| ;; colls. Thus function f should return a collection." | |
| ;; | |
| ;; Great, thanks Clojure docs. You're my pal. | |
| ;; What it means I think is that you're going to give this function | |
| ;; a bundle of stuff and it'll do something to each piece of stuff | |
| ;; (map...) and then smush everything together into one nice list | |
| ;; (...cat). So we're saying given a nested associated structure | |
| ;; like the one we just made, pull out all the values ... | |
| ;; | |
| ;; wait hold on--let's look at how it's actually called. | |
| ;; Okay this is the big mooooment | |
| (get-deepest-values (get-in rhyme-to-word (take 5 (word-to-rhyme "erection")))) | |
| ;; And this gives a result thus: ("erection" "direction" "correction" "collection" "inflection" "preelection" "circumspection" "introspection" "imperfection" "perfection" "midsection" "transection" "connection" "protection") | |
| ;; But sometimes it's all too last-first for me, so Let's do that | |
| ;; using this guy "->>" with comments | |
| (->> | |
| (word-to-rhyme "erection") | |
| ;; gives us [:N :AH :SH :K :EH :R :IH] | |
| (take 5) | |
| ;; gives us (:N :AH :SH :K :EH)--i.e. five phonemes, or the "ection" part of the rhyme. | |
| (get-in rhyme-to-word) | |
| ;; so we're calling (get-in rhyme-to-word '(:N :AH :SH :K :EH)) | |
| ;; and we get this result: | |
| ;; | |
| ;; {:R {:IH {:terminate "erection"}}, :ER {:D | |
| ;; {:terminate "direction"}, :K {:terminate "correction"}}, :L {:AH | |
| ;; {:K {:terminate "collection"}}, :F {:N {:IH | |
| ;; {:terminate "inflection"}}}, :IH {:IY {:R {:P | |
| ;; {:terminate "preelection"}}}}}, :P {:S {:M {:AH {:K {:ER {:S | |
| ;; {:terminate "circumspection"}}}}}, :AH {:R {:T {:N {:IH | |
| ;; {:terminate "introspection"}}}}}}}, :F {:ER {:P {:M {:IH | |
| ;; {:terminate "imperfection"}}, :terminate "perfection"}}}, :S {:D | |
| ;; {:IH {:M {:terminate "midsection"}}}, :N {:AE {:R {:T | |
| ;; {:terminate "transection"}}}}}, :N {:AH {:K | |
| ;; {:terminate "connection"}}}, :T {:AH {:R {:P | |
| ;; {:terminate "protection"}}}}} | |
| ;; | |
| ;; Okay so THAT's what we're working with when we call.... | |
| (get-deepest-values)) | |
| ;; So it turns out that all that | |
| ;; | |
| ;; (defn get-deepest-values [k] | |
| ;; (if (string? k) [k] (mapcat get-deepest-values (vals k)))) | |
| ;; | |
| ;; Does is say: Get the values from the key/value pairs that are in | |
| ;; a map called "k." If you hit any value at all that is a string, | |
| ;; return it and you're done for that part. OTHERWISE keep mapping | |
| ;; over the values and run this function again on each one of them | |
| ;; (until you hit a string). And however nested things are is fine | |
| ;; and all, but please return a nice flat list of results (that's | |
| ;; why it's mapcat instead of cat") | |
| ;; | |
| ;; So it's like you gave it a siamese russian doll. And it keeps | |
| ;; looking inside the first twin's dolls until it finds a piece of | |
| ;; paper with a word or two on it. And then it throws away all the | |
| ;; dolls. And it does the same to the other twin. Maybe the first | |
| ;; twin is three dolls deep. Maybe the second twin is four dolls | |
| ;; deep. | |
| ;; So that's the reason we switched strings to keywords, because | |
| ;; we knew we'd be looking things up by phonemes and we wanted | |
| ;; a data type that is good for "looking things up in deeply | |
| ;; nested maps" and if there's ONE data type like that, it's keywords! Maybe; | |
| ;; god if I know. | |
| ;; Of all of them, these tiny recursive functions are the hardest to | |
| ;; write and understand. | |
| ;; Anyway, that's how it works, I think. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment