kmike · October 16, 2022 16:57
diff --git a/notebook.ipynb b/notebook.ipynb
@@ -0,0 +1,226 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "fd8192cf",
+   "metadata": {},
+   "source": [
+    "Two implementations of url_has_any_extension:\n",
+    "\n",
+    "* The one merged in https://github.com/scrapy/scrapy/pull/5450 (url_has_any_extension_27)\n",
+    "* The version used in Scrapy 2.6 (url_has_any_extension_26)\n",
+    "\n",
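+    "The new implementation is more correct because it also matches multi-part extensions like .tar.gz; the 2.6 version only checks the last suffix returned by posixpath.splitext (.gz in this case)."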
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "8f646b9b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import posixpath\n",
+    "\n",
+    "from scrapy.utils.url import parse_url\n",
+    "\n",
+    "def url_has_any_extension_27(url, extensions):\n",
+    "    \"\"\"Return True if the url ends with one of the extensions provided\"\"\"\n",
+    "    lowercase_path = parse_url(url).path.lower()\n",
+    "    return any(lowercase_path.endswith(ext) for ext in extensions)\n",
+    "\n",
+    "\n",
+    "def url_has_any_extension_26(url, extensions):\n",
+    "    \"\"\"Return True if the url ends with one of the extensions provided\"\"\"\n",
+    "    return posixpath.splitext(parse_url(url).path)[1].lower() in extensions"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ccf338bd",
+   "metadata": {},
+   "source": [
+    "Let's use the extension list from Scrapy's linkextractors; it's going to be by far the most common list of extensions used."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "id": "3fbcbd78",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from scrapy.linkextractors import IGNORED_EXTENSIONS\n",
+    "\n",
+    "# Extensions must start with \".\"; FilteringLinkExtractor does the same.\n",
+    "# We're using set because LinkExtractor uses set for deny_extensions.\n",
+    "extensions = {'.' + e for e in IGNORED_EXTENSIONS}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b1b93a99",
+   "metadata": {},
+   "source": [
+    "Case 1: a URL whose extension is in the list.\n",
+    "\n",
+    "Note that Scrapy's Link Extractor calls parse_url before passing the URL to url_has_any_extension; we'll do the same."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "id": "305279c3",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(True, True)"
+      ]
+     },
+     "execution_count": 25,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "url = parse_url(\"http://example.com/files/1ch124h1/video.mp4?hello\")\n",
+    "url_has_any_extension_26(url, extensions), url_has_any_extension_27(url, extensions)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "id": "ebd8a37b",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1.14 µs ± 4.84 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit url_has_any_extension_26(url, extensions)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 27,
+   "id": "60af1899",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "7.44 µs ± 6.11 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit url_has_any_extension_27(url, extensions)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5ce4aa3d",
+   "metadata": {},
+   "source": [
+    "The new version is slower (7.44 µs vs. 1.14 µs per call), but both are super-fast. There is probably nothing to worry about.\n",
+    "\n",
+    "Case 2: a URL whose extension is not in the list."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 28,
+   "id": "f56ff91c",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(False, False)"
+      ]
+     },
+     "execution_count": 28,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "url = parse_url(\"http://example.com/files/1ch124h1/page.html?hello\")\n",
+    "url_has_any_extension_26(url, extensions), url_has_any_extension_27(url, extensions)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 29,
+   "id": "9d16127d",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1.12 µs ± 7.14 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit url_has_any_extension_26(url, extensions)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 30,
+   "id": "ad968d6d",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "10.4 µs ± 11.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit url_has_any_extension_27(url, extensions)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "43a62e75",
+   "metadata": {},
+   "source": [
+    "Again, the new version is slower (10.4 µs vs. 1.12 µs per call), but both are very fast. There is probably nothing to worry about."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.11"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}