{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Read in all three of the data files.\n", "Split the play in `midsummer.txt` up so each scene can be considered individually." ] }, { "cell_type": "code", "execution_count": 199, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SCENES:\n", "SCENE I. Athens. A room in the Palace of THESEUS.\n", "\n", "[Enter THESEUS, HIPPOLYTA, PHILOSTRATE, and Atten\n", "--------------------------------------------------\n", "SCENE II. The Same. A Room in a Cottage.\n", "\n", "[Enter SNUG, BOTTOM, FLUTE, SNOUT, QUINCE, and STARVELING.\n", "--------------------------------------------------\n", "SCENE I. A wood near Athens.\n", "\n", "[Enter a FAIRY at One door, and PUCK at another.]\n", "\n", "PUCK\n", "How now, spiri\n", "--------------------------------------------------\n", "SCENE II. Another part of the wood.\n", "\n", "[Enter TITANIA, with her Train.]\n", "\n", "TITANIA\n", "Come, now a roundel a\n", "==================================================\n", "POSITIVE WORDS:\n", "a+\tabound\tabounds\tabundance\tabundant\taccessable\taccessible\tacclaim\tacclaimed\tacclamation\n", "==================================================\n", "NEGATIVE WORDS:\n", "2-faced\t2-faces\tabnormal\tabolish\tabominable\tabominably\tabominate\tabomination\tabort\taborted\n" ] } ], "source": [ "import re\n", "\n", "# ------------------------------------\n", "# Global variables\n", "# ------------------------------------\n", "\n", "PLAY_FILE = 'midsummer.txt'\n", "PLAY_FILE_ENCODING = 'UTF-8-Sig'\n", "NEGATIVE_WORDS_FILE = 'negative-words.txt'\n", "NEGATIVE_WORDS_FILE_ENCODING = 'cp1252'\n", "POSITIVE_WORDS_FILE = 'positive-words.txt'\n", "POSITIVE_WORDS_FILE_ENCODING = 'cp1252'\n", "\n", "# ------------------------------------\n", "# Play text\n", "# ------------------------------------\n", "\n", "play_text = open(PLAY_FILE, encoding=PLAY_FILE_ENCODING).read()\n", "\n", "# Each ACT contains two SCENEs\n", "# 
Note that later processing does not need to know the specific ACT or SCENE, so we simply match with regular expressions\n", "# First, match each act\n", "acts_pat = re.compile(\n", "    r'(?<=^ACT)(?:.*?\\n)(.*?)(?=ACT|End of Project)', re.S | re.M)\n", "acts_text = acts_pat.findall(play_text)\n", "\n", "# Patterns matching the text of the two scenes\n", "scene1_pat = re.compile(r'(SCENE I.*?)(?=SCENE II)', re.S)\n", "scene2_pat = re.compile(r'(SCENE II.*?\\Z)', re.S)\n", "\n", "# Collect the scenes of every act into a single list\n", "scenes = []\n", "for act in acts_text:\n", "    scenes.append(scene1_pat.search(act).group())\n", "    scenes.append(scene2_pat.search(act).group())\n", "\n", "# ------------------------------------\n", "# Positive and negative words\n", "# ------------------------------------\n", "\n", "pos_text = open(POSITIVE_WORDS_FILE,\n", "                encoding=POSITIVE_WORDS_FILE_ENCODING).read()\n", "neg_text = open(NEGATIVE_WORDS_FILE,\n", "                encoding=NEGATIVE_WORDS_FILE_ENCODING).read()\n", "\n", "\n", "def parseValidWords(s):\n", "    \"\"\"\n", "    Extract the valid words from the text, skipping blank lines and lines starting with ';'\n", "    \"\"\"\n", "\n", "    words = []\n", "    lines = s.splitlines()\n", "    for line in lines:\n", "        if line and not line.startswith(';'):\n", "            words.append(line)\n", "    return words\n", "\n", "\n", "neg_words = parseValidWords(neg_text)\n", "pos_words = parseValidWords(pos_text)\n", "\n", "# ------------------------------------\n", "# Quick preview\n", "# ------------------------------------\n", "\n", "print('SCENES:')\n", "for i, scene in enumerate(scenes[:4]):\n", "    if i:\n", "        print('-' * 50)\n", "    print(scene[:100])\n", "print('=' * 50)\n", "print('POSITIVE WORDS:')\n", "print('\\t'.join(pos_words[:10]))\n", "print('=' * 50)\n", "print('NEGATIVE WORDS:')\n", "print('\\t'.join(neg_words[:10]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Develop a single measure based on the word occurrences that will describe the positivity/negativity of the scene." 
] }, { "cell_type": "code", "execution_count": 200, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[(165, 154), (41, 60), (136, 174), (85, 109), (115, 122), (215, 354), (128, 111), (17, 20), (179, 246), (26, 41)]\n" ] } ], "source": [ "# Compile the positive and negative word regexes, ignoring case\n", "# Note: the patterns have no word boundaries, so they also match inside longer words\n", "pos_pat = re.compile('|'.join(map(re.escape, pos_words)), re.I)\n", "neg_pat = re.compile('|'.join(map(re.escape, neg_words)), re.I)\n", "\n", "# Count the positive and negative word matches in every scene\n", "scene_emotions = []\n", "for scene in scenes:\n", "    pos_cnt = len(pos_pat.findall(scene))\n", "    neg_cnt = len(neg_pat.findall(scene))\n", "    scene_emotions.append((pos_cnt, neg_cnt))\n", "\n", "print(scene_emotions)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use a simple strategy here. For each scene's pair of counts (positive words, negative words), compute the mean of the two counts, then:\n", "- if the positive count is the larger one and exceeds the mean by more than a given fraction (say 5%), the scene is labeled positive;\n", "- if the negative count is the larger one and exceeds the mean by more than that fraction, the scene is labeled negative;\n", "- otherwise the scene is labeled neutral." ] }, { "cell_type": "code", "execution_count": 201, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(165, 154) ==> Neutral\n", "(41, 60) ==> Negative\n", "(136, 174) ==> Negative\n", "(85, 109) ==> Negative\n", "(115, 122) ==> Neutral\n", "(215, 354) ==> Negative\n", "(128, 111) ==> Positive\n", "(17, 20) ==> Negative\n", "(179, 246) ==> Negative\n", "(26, 41) ==> Negative\n" ] } ], "source": [ "def judge_emotion(pairs, threshold=0.05):\n", "    \"\"\"\n", "    Label a (positive count, negative count) tuple as Positive, Negative, or Neutral\n", "    \"\"\"\n", "\n", "    emo = ''\n", "    mean = sum(pairs) / 2\n", "    if pairs[0] > pairs[1] and pairs[0] / mean - 1 > threshold:\n", "        emo = 'Positive'\n", "    elif pairs[0] < pairs[1] and pairs[1] / mean - 1 > threshold:\n", "        emo = 'Negative'\n", "    else:\n", "        emo = 'Neutral'\n", "\n", "    return emo\n", "\n", "\n", "for p in scene_emotions:\n", "    print('{} ==> {}'.format(p, judge_emotion(p)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With `threshold` set to 5%, only 1 of these 10 scenes is labeled positive; 7 are negative and 2 are neutral." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make a plot 
of the measure as a y-axis, with scene number as an x-axis." ] }, { "cell_type": "code", "execution_count": 202, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[Figure: grouped bar chart of positive and negative word counts per scene]
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "# References:\n", "# https://matplotlib.org/gallery/lines_bars_and_markers/barchart.html\n", "# https://python-graph-gallery.com/10-barplot-with-number-of-observation\n", "\n", "scene_emotions_pos = [x[0] for x in scene_emotions]\n", "scene_emotions_neg = [x[1] for x in scene_emotions]\n", "\n", "ind = np.arange(1, len(scenes) + 1)\n", "width = 0.35\n", "\n", "plt.bar(ind - width / 2, scene_emotions_pos, width, label='Positive')\n", "plt.bar(ind + width / 2, scene_emotions_neg, width, label='Negative')\n", "plt.xticks(ind)\n", "plt.grid(False)\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When a character starts speaking, their name appears in capitals, on its own line. Which character(s) speak most often?" ] }, { "cell_type": "code", "execution_count": 203, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\"LYSANDER\" appears 50 times.\n", "\"THESEUS\" appears 48 times.\n", "\"HERMIA\" appears 48 times.\n", "\"DEMETRIUS\" appears 47 times.\n", "\"BOTTOM\" appears 47 times.\n", "\"QUINCE\" appears 38 times.\n", "\"HELENA\" appears 36 times.\n", "\"PUCK\" appears 33 times.\n", "\"OBERON\" appears 29 times.\n", "\"TITANIA\" appears 23 times.\n" ] } ], "source": [ "from collections import defaultdict\n", "\n", "# Full text of all scenes\n", "play_content = '\\n'.join(scenes)\n", "\n", "# Pattern for character names: a line consisting only of capital letters\n", "name_pat = re.compile(r'^[A-Z]+$', re.M)\n", "\n", "# Find every character name occurrence\n", "characters = name_pat.findall(play_content)\n", "\n", "# Count how many times each character appears\n", "characters_dct = defaultdict(int)\n", "for c in characters:\n", "    characters_dct[c] += 1\n", "\n", "sorted_characters = sorted(characters_dct.items(), key=lambda kv: kv[1], reverse=True)\n", "for k, v in sorted_characters[:10]:\n", "    print('\"{}\" appears {} times.'.format(k, v))\n" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "\"Lysander\" is a Talkaholic!" ] }, { "cell_type": "code", "execution_count": 204, "metadata": {}, "outputs": [], "source": [ "from jupyterthemes import jtplot\n", "jtplot.reset()\n", "# jtplot.style(theme='oceans16')" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }