# NLP concepts with spaCy

“Natural Language Processing” is a field at the intersection of computer science, linguistics and artificial intelligence which aims to make the underlying structure of language available to computer programs for analysis and manipulation. It’s a vast and vibrant field with a long history! New research and techniques are being developed constantly.

The aim of this notebook is to introduce a few simple concepts and techniques from NLP—just the stuff that’ll help you do creative things quickly, and maybe open the door for you to understand more sophisticated NLP concepts that you might encounter elsewhere.

We'll be using a library called [spaCy](https://spacy.io/), which strikes a good balance between being powerful and state-of-the-art and being easy for newcomers to understand.

(Traditionally, most NLP work in Python was done with a library called [NLTK](http://www.nltk.org/). NLTK is a fantastic library, but it’s also a writhing behemoth: large and slippery and difficult to understand. Also, much of the code in NLTK is decades out of date with contemporary practices in NLP.)

This tutorial is written in Python 2.7, but the concepts should translate easily to later versions.

## Natural language

“Natural language” is a loaded phrase: what makes one stretch of language “natural” while another stretch is not? NLP techniques are opinionated about what language is and how it works; as a consequence, you’ll sometimes find yourself having to conceptualize your text with uncomfortable abstractions in order to make it work with NLP. (This is especially true of poetry, which almost by definition breaks most “conventional” definitions of how language behaves and how it’s structured.)

Of course, a computer can never really fully “understand” human language. Even when the text you’re using fits the abstractions of NLP perfectly, the results of NLP analysis are always going to be at least a little bit inaccurate. But often even inaccurate results can be “good enough”—and in any case, inaccurate output from NLP procedures can be an excellent source of the sublime and absurd juxtapositions that we (as poets) are constantly in search of.

## English only (sorta)

The main assumption that most NLP libraries and techniques make is that the text you want to process will be in English. Historically, most NLP research has been on English specifically; it’s only more recently that serious work has gone into applying these techniques to other languages. The examples in this chapter are all based on English texts, and the tools we’ll use are geared toward English. If you’re interested in working on NLP in other languages, here are a few starting points:
* [Konlpy](https://github.com/konlpy/konlpy), natural language processing in
  Python for Korean
* [Jieba](https://github.com/fxsjy/jieba), text segmentation and POS tagging in
  Python for Chinese
* The [Pattern](http://www.clips.ua.ac.be/pattern) library (like TextBlob, a
  simplified/augmented interface to NLTK) includes POS-tagging and some
  morphology for Spanish in its
  [pattern.es](http://www.clips.ua.ac.be/pages/pattern-es) package.

## English grammar: a crash course

The only thing I believe about English grammar is [this](http://www.writing.upenn.edu/~afilreis/88v/creeley-on-sentence.html):

> "Oh yes, the sentence," Creeley once told the critic Burton Hatlen, "that's
> what we call it when we put someone in jail."

There is no such thing as a sentence, or a phrase, or a part of speech, or even
a "word"---these are all pareidolic fantasies occasioned by glints of sunlight
we see reflected on the surface of the ocean of language; fantasies that we
comfort ourselves with when faced with language's infinite and unknowable
variability.

Regardless, we may find it occasionally helpful to think about language using
these abstractions. The following is a gross oversimplification of both how
English grammar works, and how theories of English grammar work in the context
of NLP. But it should be enough to get us going!

### Sentences and parts of speech

English texts can roughly be divided into "sentences." Sentences are themselves
composed of individual words, each of which has a function in expressing the
meaning of the sentence. The function of a word in a sentence is called its
"part of speech"---i.e., a word functions as a noun, a verb, an adjective, etc.
Here's a sentence, with words marked for their part of speech:

    I       really love entrees       from        the        new       cafeteria.
    pronoun adverb verb noun (plural) preposition determiner adjective noun

Of course, the "part of speech" of a word isn't a property of the word itself.
We know this because a single "word" can function as two different parts of speech:

> I love cheese.

The word "love" here is a verb. But here:

> Love is a battlefield.

... it's a noun. For this reason (and others), it's difficult for computers to
accurately determine the part of speech for a word in a sentence. (It's
difficult sometimes even for humans to do this.) But NLP procedures do their
best!
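
To make the "love" example concrete, here's a minimal sketch in plain Python (no NLP library involved). The tagged sentences are typed in by hand for illustration, and the function name `parts_of_speech_by_word` is just something I made up; the point is only that one word form can turn up with more than one part of speech.

```python
from __future__ import print_function
from collections import defaultdict

# Hand-tagged toy data (an assumption for illustration, not tagger output).
tagged_sentences = [
    [("I", "pronoun"), ("love", "verb"), ("cheese", "noun")],
    [("Love", "noun"), ("is", "verb"), ("a", "determiner"),
     ("battlefield", "noun")],
]

def parts_of_speech_by_word(sentences):
    """Map each lowercased word to the set of parts of speech it appears with."""
    seen = defaultdict(set)
    for sentence in sentences:
        for word, pos in sentence:
            seen[word.lower()].add(pos)
    return dict(seen)

pos_by_word = parts_of_speech_by_word(tagged_sentences)
print(sorted(pos_by_word["love"]))  # ['noun', 'verb']
```

Any word whose set has more than one member is a word that a tagger has to disambiguate from context.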

### Phrases and larger syntactic structures

There are several different ways of talking about larger syntactic structures in sentences. The scheme used by spaCy is called a "dependency grammar." We'll talk about the details of this below.


## Installing spaCy

[Follow the instructions here](https://spacy.io/docs/usage/). When using `pip`, make sure to upgrade to the newest version first, with `pip install --upgrade pip`. (This will ensure that at least *some* of the dependencies are installed as pre-built binaries.)

    pip install spacy
    
(If you're not using a virtual environment, try `sudo pip install spacy`.)

Currently, spaCy is distributed in source form only, so the installation process involves a bit of compiling. On macOS, you'll need to install [Xcode](https://developer.apple.com/xcode/) in order to perform the compilation steps. [Here's a good tutorial for macOS Sierra](http://railsapps.github.io/xcode-command-line-tools.html), though the steps should be similar on other versions.

After you've installed spaCy, you'll need to download the data. Run the following on the command line:

    python -m spacy download en

## Basic usage

Import `spacy` like any other Python module. The `spaCy` code expects all strings to be unicode strings, so make sure you've included `from __future__ import unicode_literals` at the top of your Python 2.7 code—it'll make your life easier, trust me.

In [170]:
from __future__ import unicode_literals
import spacy

Create a new spaCy object using `spacy.load('en')` (assuming you want to work with English; spaCy supports other languages as well).

In [171]:
nlp = spacy.load('en')

And then create a `Doc` (document) object by calling the spaCy object with the text you want to work with. Below I've included a few sentences from the Universal Declaration of Human Rights:

In [204]:
doc = nlp("All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood. Everyone has the right to life, liberty and security of person.")

## Sentences

If you learn nothing else about spaCy (or NLP), then learn at least that it's a good way to get a list of sentences in a text. Once you've created a document object, you can iterate over the sentences it contains using the `.sents` attribute:

In [172]:
for item in doc.sents:
    print item.text

All human beings are born free and equal in dignity and rights.
They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
Everyone has the right to life, liberty and security of person.


The `.sents` attribute is a generator, so you can't index or count it directly. To do this, you'll need to convert it to a list first using the `list()` function:

In [110]:
sentences_as_list = list(doc.sents)

In [111]:
len(sentences_as_list)

3

In [112]:
import random
random.choice(sentences_as_list)

They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
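
If the generator business is unfamiliar, here's a small stand-alone sketch of the same behavior using an ordinary Python generator. The function `toy_sents` is a hypothetical stand-in for `doc.sents` (it is not part of spaCy); it just shows why `len()` fails on a generator and why `list()` fixes that.

```python
from __future__ import print_function

def toy_sents():
    """Hypothetical stand-in for doc.sents: yields sentences one at a time."""
    yield "All human beings are born free and equal in dignity and rights."
    yield ("They are endowed with reason and conscience and should act "
           "towards one another in a spirit of brotherhood.")
    yield "Everyone has the right to life, liberty and security of person."

sents = toy_sents()

# Generators have no length, so len() raises a TypeError.
try:
    len(sents)
    generator_has_len = True
except TypeError:
    generator_has_len = False

# Converting to a list gives us something we can count and index.
as_list = list(sents)
print(generator_has_len)  # False
print(len(as_list))       # 3
print(as_list[0])
```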

## Words

Iterating over a document yields each word in the document in turn. Words are represented with spaCy [Token](https://spacy.io/docs/api/token) objects, which have several interesting attributes. The `.text` attribute gives the underlying text of the word, and the `.lemma_` attribute gives the word's "lemma" (explained below):

In [173]:
for word in doc:
    print word.text, word.lemma_

All all
human human
beings being
are be
born bear
free free
and and
equal equal
in in
dignity dignity
and and
rights right
. .
They -PRON-
are be
endowed endow
with with
reason reason
and and
conscience conscience
and and
should should
act act
towards towards
one one
another another
in in
a a
spirit spirit
of of
brotherhood brotherhood
. .
Everyone everyone
has have
the the
right right
to to
life life
, ,
liberty liberty
and and
security security
of of
person person
. .


A word's "lemma" is its most "basic" form, the form without any morphology
applied to it. "Sing," "sang," "singing," are all different "forms" of the
lemma *sing*. Likewise, "octopi" is the plural of "octopus"; the "lemma" of
"octopi" is *octopus*.

"Lemmatizing" a text is the process of going through the text and replacing
each word with its lemma. This is often done in an attempt to reduce a text
to its most "essential" meaning, by eliminating pesky things like verb tense
and noun number.
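
As a rough sketch of what "lemmatizing" amounts to, the code below replaces each word with its lemma using a plain dictionary. The word/lemma pairs are copied by hand from the output above (first sentence only), and the `lemmatize` function is my own illustration, not how spaCy does it internally.

```python
from __future__ import print_function

# Word/lemma pairs hand-copied from the output above (first sentence only).
LEMMAS = {
    "All": "all", "human": "human", "beings": "being", "are": "be",
    "born": "bear", "free": "free", "and": "and", "equal": "equal",
    "in": "in", "dignity": "dignity", "rights": "right", ".": ".",
}

def lemmatize(words, lemmas):
    """Replace each word with its lemma; unknown words pass through unchanged."""
    return [lemmas.get(word, word) for word in words]

words = ["All", "human", "beings", "are", "born", "free", "and", "equal",
         "in", "dignity", "and", "rights", "."]
lemmatized = lemmatize(words, LEMMAS)
print(" ".join(lemmatized))
# all human being be bear free and equal in dignity and right .
```

With a real spaCy document you wouldn't need the dictionary at all: a list comprehension over `word.lemma_` for each `word` in `doc` gives you the same kind of list.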

Individual sentences can also be iterated over to get a list of words:

In [114]:
sentence = list(doc.sents)[1]
for word in sentence:
    print word.text

They
are
endowed
with
reason
and
conscience
and
should
act
towards
one
another
in
a
spirit
of
brotherhood
.


## Parts of speech

The `.pos_` attribute gives a general part of speech; the `.tag_` attribute gives a more specific designation. [List of meanings here.](https://spacy.io/docs/api/annotation)

In [115]:
for item in doc:
    print item.text, item.pos_, item.tag_

All DET DT
human ADJ JJ
beings NOUN NNS
are VERB VBP
born VERB VBN
free ADJ JJ
and CCONJ CC
equal ADJ JJ
in ADP IN
dignity NOUN NN
and CCONJ CC
rights NOUN NNS
. PUNCT .
They PRON PRP
are VERB VBP
endowed VERB VBN
with ADP IN
reason NOUN NN
and CCONJ CC
conscience NOUN NN
and CCONJ CC
should VERB MD
act VERB VB
towards ADP IN
one NUM CD
another DET DT
in ADP IN
a DET DT
spirit NOUN NN
of ADP IN
brotherhood NOUN NN
. PUNCT .
Everyone NOUN NN
has VERB VBZ
the DET DT
right NOUN NN
to ADP IN
life NOUN NN
, PUNCT ,
liberty NOUN NN
and CCONJ CC
security NOUN NN
of ADP IN
person NOUN NN
. PUNCT .


### Extracting words by part of speech

With knowledge of which part of speech each word belongs to, we can write simple code to extract and recombine words by their part of speech. The following code creates two lists, one of all the nouns in the text and one of all the adjectives:

In [175]:
nouns = []
adjectives = []
for item in doc:
    if item.pos_ == 'NOUN':
        nouns.append(item.text)
    elif item.pos_ == 'ADJ':
        adjectives.append(item.text)

And below, some code to print out a random pairing of an adjective from the text with a noun from the text:

In [177]:
print random.choice(adjectives) + " " + random.choice(nouns)

equal Everyone
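
If you want a whole batch of pairings rather than one, a sketch like the following might do. The word lists are copied by hand from the part-of-speech output above so that the example runs without spaCy, and `random_pairings` is a hypothetical helper of my own.

```python
from __future__ import print_function
import random

# Hand-copied from the part-of-speech output above (ADJ and NOUN rows).
adjectives = ["human", "free", "equal"]
nouns = ["beings", "dignity", "rights", "reason", "conscience", "spirit",
         "brotherhood", "Everyone", "right", "life", "liberty", "security",
         "person"]

def random_pairings(adjs, ns, count):
    """Return `count` strings of the form '<adjective> <noun>'."""
    return [random.choice(adjs) + " " + random.choice(ns)
            for _ in range(count)]

pairings = random_pairings(adjectives, nouns, 5)
for line in pairings:
    print(line)
```

The output is different on every run, which is rather the point.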


Making a list of verbs works similarly:

In [178]:
verbs = []
for item in doc:
    if item.pos_ == 'VERB':
        verbs.append(item.text)

Although in this case, you'll notice the list of verbs is a bit unintuitive. We're getting words like "should" and "are" and "has"—helper verbs that maybe don't fit our idea of what verbs we want to extract.

In [179]:
verbs

[u'are', u'born', u'are', u'endowed', u'should', u'act', u'has']

This is because we used the `.pos_` attribute, which only gives us general information about the part of speech. The `.tag_` attribute allows us to be more specific about the kinds of verbs we want. For example, this code gives us only the verbs in past participle form:

In [180]:
only_past = []
for item in doc:
    if item.tag_ == 'VBN':
        only_past.append(item.text)

In [181]:
only_past

[u'born', u'endowed']
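
The same idea generalizes to any set of tags. Here's a small sketch that works on plain `(text, pos, tag)` tuples, copied by hand from the part-of-speech output above, so it runs without spaCy. The helper name `texts_with_tags` is hypothetical.

```python
from __future__ import print_function

# (text, pos, tag) triples for the verbs, hand-copied from the output above.
verb_tokens = [
    ("are", "VERB", "VBP"), ("born", "VERB", "VBN"), ("are", "VERB", "VBP"),
    ("endowed", "VERB", "VBN"), ("should", "VERB", "MD"),
    ("act", "VERB", "VB"), ("has", "VERB", "VBZ"),
]

def texts_with_tags(tokens, wanted_tags):
    """Return the text of every token whose fine-grained tag is wanted."""
    return [text for text, pos, tag in tokens if tag in wanted_tags]

print(texts_with_tags(verb_tokens, {"VBN"}))        # ['born', 'endowed']
print(texts_with_tags(verb_tokens, {"VB", "VBZ"}))  # ['act', 'has']
```

Passing a set of tags rather than a single tag makes it easy to say "everything except the modal," for example.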

## Larger syntactic units

Okay, so we can get individual words by their part of speech. Great! But what if we want larger chunks, based on their syntactic role in the sentence? The easy way is `.noun_chunks`, which is an attribute of a document or a sentence that you can iterate over to get [spans](https://spacy.io/docs/api/span) of noun phrases, regardless of their position in the document. (Like `.sents`, it's a generator, so pass it to `list()` if you need to index or count the results.)

In [183]:
for item in doc.noun_chunks:
    print item.text

All human beings
dignity
rights
They
reason
conscience
a spirit
brotherhood
Everyone
life
the right to life, liberty
security
person


For anything more sophisticated than this, though, we'll need to learn about how spaCy parses sentences into its syntactic components.

### Understanding dependency grammars

![displacy parse](http://static.decontextualize.com/syntax_example.png)

[See in "displacy", spaCy's syntax visualization tool.](https://demos.explosion.ai/displacy/?text=Everyone%20has%20the%20right%20to%20life%2C%20liberty%20and%20security%20of%20person&model=en&cpu=1&cph=0)

The spaCy library parses the underlying sentences using a [dependency grammar](https://en.wikipedia.org/wiki/Dependency_grammar). Dependency grammars look different from the kinds of sentence diagramming you may have done in high school, and even from tree-based [phrase structure grammars](https://en.wikipedia.org/wiki/Phrase_structure_grammar) commonly used in descriptive linguistics. The idea of a dependency grammar is that every word in a sentence is a "dependent" of some other word, which is that word's "head." Those "head" words are in turn dependents of other words. The finite verb in the sentence is the ultimate "head" of the sentence, and is not itself dependent on any other word. (The dependents of a particular head are sometimes called its "children.")

The question of how to know what constitutes a "head" and a "dependent" is complicated. As a starting point, here's a passage from [Dependency Grammar and Dependency Parsing](http://stp.lingfil.uu.se/~nivre/docs/05133.pdf):

> Here are some of the criteria that have been proposed for identifying a syntactic relation between a head H and a dependent D in a construction C (Zwicky, 1985; Hudson, 1990):
>
> 1. H determines the syntactic category of C and can often replace C.
> 2. H determines the semantic category of C; D gives semantic specification.
> 3. H is obligatory; D may be optional.
> 4. H selects D and determines whether D is obligatory or optional.
> 5. The form of D depends on H (agreement or government).
> 6. The linear position of D is specified with reference to H.

There are different *types* of relationships between heads and dependents, and each type of relation has its own name. Use the displaCy visualizer (linked above) to see how a particular sentence is parsed, and what the relations between the heads and dependents are. (I've listed a few common relations below.)

Every token object in a spaCy document or sentence has attributes that tell you what the word's head is, what the dependency relationship is between that word and its head, and a list of that word's children (dependents). The following code prints out each word in the sentence, the tag, the word's head, the word's dependency relation with its head, the word's children (i.e., dependent words), and the word's subtree (the word together with everything that depends on it, directly or indirectly):

In [148]:
for word in list(doc.sents)[2]:
    print "Word:", word.text
    print "Tag:", word.tag_
    print "Head:", word.head.text
    print "Dependency relation:", word.dep_
    print "Children:", list(word.children)
    print "Subtree:", list(word.subtree)
    print ""

Word: Everyone
Tag: NN
Head: has
Dependency relation: nsubj
Children: []
Subtree: [Everyone]

Word: has
Tag: VBZ
Head: has
Dependency relation: ROOT
Children: [Everyone, liberty, .]
Subtree: [Everyone, has, the, right, to, life, ,, liberty, and, security, of, person, .]

Word: the
Tag: DT
Head: right
Dependency relation: det
Children: []
Subtree: [the]

Word: right
Tag: NN
Head: liberty
Dependency relation: nmod
Children: [the, to]
Subtree: [the, right, to, life, ,]

Word: to
Tag: IN
Head: right
Dependency relation: prep
Children: [life]
Subtree: [to, life, ,]

Word: life
Tag: NN
Head: to
Dependency relation: pobj
Children: [,]
Subtree: [life, ,]

Word: ,
Tag: ,
Head: life
Dependency relation: punct
Children: []
Subtree: [,]

Word: liberty
Tag: NN
Head: has
Dependency relation: dobj
Children: [right, and, security, of]
Subtree: [the, right, to, life, ,, liberty, and, security, of, person]

Word: and
Tag: CC
Head: liberty
Dependency relation: cc
Children: []
Subtree: [and]

Word: securi

Here's a list of a few dependency relations and what they mean. ([A more complete list can be found here.](http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf))

* `nsubj`: this word's head is a verb, and this word is itself the subject of the verb
* `nsubjpass`: same as above, but for subjects in sentences in the passive voice
* `dobj`: this word's head is a verb, and this word is itself the direct object of the verb
* `iobj`: same as above, but indirect object
* `aux`: this word's head is a verb, and this word is an "auxiliary" verb (like "have", "will", "be")
* `attr`: this word's head is a copula (like "to be"), and this is the description attributed to the subject of the sentence (e.g., in "This product is a global brand", `brand` is dependent on `is` with the `attr` dependency relation)
* `det`: this word's head is a noun, and this word is a determiner of that noun (like "the," "this," etc.)
* `amod`: this word's head is a noun, and this word is an adjective describing that noun
* `prep`: this word is a preposition that modifies its head
* `pobj`: this word is a dependent (object) of a preposition
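
To get a feel for the head/dependent structure itself, here's a sketch that stores the parse of "Everyone has the right to life, liberty and security of person." as a plain list of head positions. The heads are copied by hand from the parse output in this section; the ones for "security," "of" and "person" are inferred from the children and subtree listings, since the printed output is cut off. The function names are hypothetical, not spaCy API.

```python
from __future__ import print_function

words = ["Everyone", "has", "the", "right", "to", "life", ",", "liberty",
         "and", "security", "of", "person", "."]
# heads[i] is the position of the head of words[i]; hand-copied (and partly
# inferred) from the parse output in this section. The root is its own head.
heads = [1, 1, 3, 7, 3, 4, 5, 1, 7, 7, 7, 10, 1]

def find_root(heads):
    """The root is the one token that is listed as its own head."""
    return [i for i, h in enumerate(heads) if h == i][0]

def children_of(heads, index):
    """Positions of the tokens whose head is the token at `index`."""
    return [i for i, h in enumerate(heads) if h == index and i != index]

root = find_root(heads)
print(words[root])                                   # has
print([words[i] for i in children_of(heads, root)])  # ['Everyone', 'liberty', '.']
print([words[i] for i in children_of(heads, 7)])     # ['right', 'and', 'security', 'of']
```

Notice that a single list of heads is enough to recover every word's children: the "tree" is just this table read in the other direction.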

### Using .subtree for extracting syntactic units

The `.subtree` attribute evaluates to a generator that can be flattened by passing it to `list()`. The resulting list contains the word itself along with all of its syntactic dependents, direct and indirect, in the order they appear in the sentence: essentially, the "clause" that the word heads.

This function merges a subtree and returns a string with the text of the words contained in it:

In [184]:
def flatten_subtree(st):
    return ''.join([w.text_with_ws for w in list(st)]).strip()

With this function in our toolbox, we can write a loop that prints out the subtree for each word in a sentence:

In [163]:
for word in list(doc.sents)[2]:
    print "Word:", word.text
    print "Flattened subtree: ", flatten_subtree(word.subtree)
    print ""

Word: Everyone
Flattened subtree:  Everyone

Word: has
Flattened subtree:  Everyone has the right to life, liberty and security of person.

Word: the
Flattened subtree:  the

Word: right
Flattened subtree:  the right to life,

Word: to
Flattened subtree:  to life,

Word: life
Flattened subtree:  life,

Word: ,
Flattened subtree:  ,

Word: liberty
Flattened subtree:  the right to life, liberty and security of person

Word: and
Flattened subtree:  and

Word: security
Flattened subtree:  security

Word: of
Flattened subtree:  of person

Word: person
Flattened subtree:  person

Word: .
Flattened subtree:  .
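
If you're curious how a subtree can be computed from nothing but head pointers, here's a sketch in plain Python. Each token is a hand-copied `(text, trailing whitespace, head position)` tuple for the same sentence (some heads are inferred from the output above), and `subtree_indices` and `flatten` are hypothetical helpers of my own that imitate what `.subtree` and `flatten_subtree` produce. This is not how spaCy implements it.

```python
from __future__ import print_function

# (text, trailing whitespace, head position), hand-copied from the output above.
tokens = [
    ("Everyone", " ", 1), ("has", " ", 1), ("the", " ", 3), ("right", " ", 7),
    ("to", " ", 3), ("life", "", 4), (",", " ", 5), ("liberty", " ", 1),
    ("and", " ", 7), ("security", " ", 7), ("of", " ", 7), ("person", "", 10),
    (".", "", 1),
]

def subtree_indices(tokens, index):
    """Positions of a token and all of its descendants, in sentence order."""
    found = set([index])
    frontier = [index]
    while frontier:
        current = frontier.pop()
        for i, (_, _, head) in enumerate(tokens):
            if head == current and i not in found:
                found.add(i)
                frontier.append(i)
    return sorted(found)

def flatten(tokens, index):
    """Join the subtree's text and whitespace, then trim the ends."""
    pieces = [tokens[i][0] + tokens[i][1]
              for i in subtree_indices(tokens, index)]
    return "".join(pieces).strip()

print(flatten(tokens, 7))   # the right to life, liberty and security of person
print(flatten(tokens, 3))   # the right to life,
print(flatten(tokens, 10))  # of person
```

The results line up with the "Flattened subtree" lines printed above for "liberty," "right" and "of."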



Using the subtree and our knowledge of dependency relation types, we can write code that extracts larger syntactic units based on their relationship with the rest of the sentence. For example, to get all of the noun phrases that are subjects of a verb:

In [164]:
subjects = []
for word in doc:
    if word.dep_ in ('nsubj', 'nsubjpass'):
        subjects.append(flatten_subtree(word.subtree))

In [166]:
subjects

[u'All human beings', u'They', u'Everyone']

Or every prepositional phrase:

In [168]:
prep_phrases = []
for word in doc:
    if word.dep_ == 'prep':
        prep_phrases.append(flatten_subtree(word.subtree))

In [185]:
prep_phrases

[u'in dignity and rights',
 u'with reason and conscience',
 u'towards one another',
 u'in a spirit of brotherhood',
 u'of brotherhood',
 u'to life,',
 u'of person']

## Entity extraction

A common task in NLP is taking a text and extracting "named entities" from it—basically, proper nouns, or names of companies, products, locations, etc. You can easily access this information using the `.ents` property of a document.

In [190]:
doc2 = nlp("John McCain and I visited the Apple Store in Manhattan.")

In [192]:
for item in doc2.ents:
    print item

John McCain
the Apple Store
Manhattan


Entity objects have a `.label_` attribute that tells you the type of the entity. ([Here's a full list of the built-in entity types.](https://spacy.io/docs/usage/entity-recognition#entity-types))

In [208]:
for item in doc2.ents:
    print item.text, item.label_

John McCain PERSON
the Apple Store ORG
Manhattan GPE
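
Once you have text/label pairs, it's often handy to group them by label. Here's a sketch that does so with plain tuples copied by hand from the output above; the helper `group_by_label` is hypothetical.

```python
from __future__ import print_function
from collections import defaultdict

# (text, label) pairs hand-copied from the output above.
entities = [
    ("John McCain", "PERSON"),
    ("the Apple Store", "ORG"),
    ("Manhattan", "GPE"),
]

def group_by_label(pairs):
    """Collect entity texts into lists keyed by their label."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

by_label = group_by_label(entities)
print(by_label["PERSON"])  # ['John McCain']
print(sorted(by_label))    # ['GPE', 'ORG', 'PERSON']
```

With a live document, you could build the same list of pairs from `item.text` and `item.label_` for each `item` in `doc2.ents`.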


[More on spaCy entity recognition.](https://spacy.io/docs/usage/entity-recognition)

## Loading data from a file

You can load data from a file easily with spaCy. You just have to make sure that the data is a unicode string, not a plain byte string (which is what reading a file gives you in Python 2.7). An easy way to do this is to call `.decode('utf8')` on the string after you've loaded it:

In [210]:
doc3 = nlp(open("genesis.txt").read().decode('utf8'))
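
An alternative sketch, in case you'd rather not remember to call `.decode()`: the standard library's `io.open` takes an `encoding` argument and hands back unicode directly, on both Python 2.7 and Python 3. The helper name `read_text` is my own invention, and the demo writes a throwaway file so that it runs without `genesis.txt`.

```python
from __future__ import print_function, unicode_literals
import io
import os
import tempfile

def read_text(path, encoding="utf8"):
    """Read a whole file and return its contents as a unicode string."""
    with io.open(path, encoding=encoding) as handle:
        return handle.read()

# Demo with a temporary file (an assumption, standing in for genesis.txt).
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, "w", encoding="utf8") as handle:
    handle.write("In the beginning \u2026")

text = read_text(demo_path)
os.remove(demo_path)
print(type(text).__name__)  # 'unicode' on Python 2, 'str' on Python 3
```

The string returned by `read_text("genesis.txt")` could then be passed straight to `nlp()`. Using `with` also makes sure the file gets closed.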

From here, we can see what entities were here with us from the very beginning:

In [209]:
for item in doc3.ents:
    print item.text, item.label_

earth LOC
the Spirit of God ORG
Day PERSON
Night TIME
first ORDINAL
second ORDINAL
one CARDINAL
Earth LOC
morning TIME
third ORDINAL
night TIME
seasons DATE
two CARDINAL
earth LOC
the day DATE
the night TIME

FAC
evening TIME
morning TIME
fourth ORDINAL
moveth TIME
earth LOC
fifth ORDINAL
earth LOC
earth LOC
sixth ORDINAL


## Further reading and resources

[A few example programs can be found here.](https://github.com/aparrish/rwet-examples/tree/master/spacy)

We've barely scratched the surface of what it's possible to do with spaCy. [There's a great page of tutorials on the official site](https://spacy.io/docs/usage/tutorials) that you should check out!