Last active
October 16, 2022 16:57
kmike created this gist on Oct 16, 2022.
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "fd8192cf",
   "metadata": {},
   "source": [
    "Two implementations of url_has_any_extension:\n",
    "\n",
    "* The one merged in https://github.com/scrapy/scrapy/pull/5450 (url_has_any_extension_27)\n",
    "* The version used in Scrapy 2.6 (url_has_any_extension_26)\n",
    "\n",
    "The new implementation is more correct because it also works for multi-part extensions like .tar.gz."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "8f646b9b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import posixpath\n",
    "\n",
    "from scrapy.utils.url import parse_url\n",
    "\n",
    "def url_has_any_extension_27(url, extensions):\n",
    "    \"\"\"Return True if the url ends with one of the extensions provided\"\"\"\n",
    "    lowercase_path = parse_url(url).path.lower()\n",
    "    return any(lowercase_path.endswith(ext) for ext in extensions)\n",
    "\n",
    "\n",
    "def url_has_any_extension_26(url, extensions):\n",
    "    \"\"\"Return True if the url ends with one of the extensions provided\"\"\"\n",
    "    return posixpath.splitext(parse_url(url).path)[1].lower() in extensions"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ccf338bd",
   "metadata": {},
   "source": [
    "Let's use the extension list from Scrapy's link extractors; it's going to be by far the most common list of extensions used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "3fbcbd78",
   "metadata": {},
   "outputs": [],
   "source": [
    "from scrapy.linkextractors import IGNORED_EXTENSIONS\n",
    "\n",
    "# Extensions must start with \".\"; FilteringLinkExtractor does the same.\n",
    "# We're using set because LinkExtractor uses set for deny_extensions.\n",
    "extensions = {'.' + e for e in IGNORED_EXTENSIONS}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1b93a99",
   "metadata": {},
   "source": [
    "Case 1: a URL where an extension is present.\n",
    "\n",
    "Note that Scrapy's link extractor calls parse_url before passing the URL to url_has_any_extension; we'll do the same."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "305279c3",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(True, True)"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "url = parse_url(\"http://example.com/files/1ch124h1/video.mp4?hello\")\n",
    "url_has_any_extension_26(url, extensions), url_has_any_extension_27(url, extensions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "ebd8a37b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.14 µs ± 4.84 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit url_has_any_extension_26(url, extensions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "60af1899",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7.44 µs ± 6.11 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit url_has_any_extension_27(url, extensions)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5ce4aa3d",
   "metadata": {},
   "source": [
    "The new version is slower (about 7.4 µs vs. 1.1 µs per call), but both are very fast. There is probably nothing to worry about.\n",
    "\n",
    "Case 2: no extension from the list is present in the URL."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "f56ff91c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(False, False)"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "url = parse_url(\"http://example.com/files/1ch124h1/page.html?hello\")\n",
    "url_has_any_extension_26(url, extensions), url_has_any_extension_27(url, extensions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "9d16127d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.12 µs ± 7.14 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit url_has_any_extension_26(url, extensions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "ad968d6d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10.4 µs ± 11.1 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)\n"
     ]
    }
   ],
   "source": [
    "%timeit url_has_any_extension_27(url, extensions)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43a62e75",
   "metadata": {},
   "source": [
    "Again, the new version is slower (about 10.4 µs vs. 1.1 µs per call), but both are very fast. There is probably nothing to worry about."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
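
The notebook states that the endswith-based version is more correct for extensions like .tar.gz but does not show a case where the two versions disagree. The following is a minimal sketch of that difference, not the notebook's own code: it assumes the standard library's `urllib.parse.urlparse` as a stand-in for `scrapy.utils.url.parse_url`, and the function names, the small extension set, and the example URLs are made up for illustration. The third function is a hypothetical variant that relies on `str.endswith` accepting a tuple of suffixes; it was not benchmarked here.

```python
import posixpath
from urllib.parse import urlparse  # assumption: stand-in for scrapy's parse_url


def has_ext_splitext(url, extensions):
    """splitext-based check, mirroring url_has_any_extension_26."""
    # splitext only returns the last suffix, e.g. ".gz" for "backup.tar.gz".
    return posixpath.splitext(urlparse(url).path)[1].lower() in extensions


def has_ext_endswith(url, extensions):
    """endswith-based check, mirroring url_has_any_extension_27."""
    lowercase_path = urlparse(url).path.lower()
    return any(lowercase_path.endswith(ext) for ext in extensions)


def has_ext_endswith_tuple(url, extensions):
    """Hypothetical variant: a single endswith call with a tuple of suffixes."""
    return urlparse(url).path.lower().endswith(tuple(extensions))


# Illustrative extension set and URLs (not taken from IGNORED_EXTENSIONS).
extensions = {".tar.gz", ".mp4"}
archive_url = "http://example.com/files/backup.tar.gz?hello"
video_url = "http://example.com/files/video.MP4"

for check in (has_ext_splitext, has_ext_endswith, has_ext_endswith_tuple):
    print(check.__name__, check(archive_url, extensions), check(video_url, extensions))
```

With these inputs the splitext-based check misses the .tar.gz URL because it only compares the final ".gz" suffix against the set, while both endswith-based checks match it. All three agree on the single-part ".mp4" case.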