@almost
Last active November 22, 2018 09:30

Revisions

  1. almost revised this gist Nov 22, 2018. 1 changed file with 2 additions and 0 deletions.
    2 changes: 2 additions & 0 deletions crawler.js
    @@ -1,3 +1,5 @@
    +// Solution https://gist.github.com/almost/9ee99b1a3e7fa240c596be3820c0b6b0
    +
     "use strict";
     const url = require('url');

  2. almost revised this gist Nov 21, 2018. 1 changed file with 16 additions and 5 deletions.
    21 changes: 16 additions & 5 deletions crawler.js
    @@ -4,12 +4,23 @@ const url = require('url');
     const rp = require("request-promise-native");
     const getHrefs = require("get-hrefs");

    -const MAX_CONCURRENT = 10;
    -const MAX_COUNT = 2000;
    +const MAX_CONCURRENT = 3;
    +const MAX_COUNT = 5;
     const ALLOW_DOMAINS = new Set(["almostobsolete.net", "tomparslow.co.uk"]);
     const START_URLS = ["http://almostobsolete.net/"];

    -async function getHrefsFromUrl(url) {
    -  const body = await rp({ url });
    -  return getHrefs(body, { baseUrl: url});
    +async function getHrefsFromUrl(currentUrl) {
    +  const body = await rp({ url: currentUrl });
    +  return getHrefs(body, { baseUrl: currentUrl });
     }
    +
    +function isAllowedDomain(currentUrl) {
    +  return ALLOW_DOMAINS.has(url.parse(currentUrl).hostname);
    +}
    +
    +// TODO
    +// Starting from START_URLS, find links and crawl them.
    +// Only follow links to pages in the ALLOW_DOMAINS
    +// Do not make more than MAX_CONCURRENT requests at any one time
    +// Do not make more than MAX_COUNT requests overall
    +// Do not request the same url twice
  3. almost created this gist Nov 21, 2018.
    15 changes: 15 additions & 0 deletions crawler.js
    @@ -0,0 +1,15 @@
    +"use strict";
    +const url = require('url');
    +
    +const rp = require("request-promise-native");
    +const getHrefs = require("get-hrefs");
    +
    +const MAX_CONCURRENT = 10;
    +const MAX_COUNT = 2000;
    +const ALLOW_DOMAINS = new Set(["almostobsolete.net", "tomparslow.co.uk"]);
    +const START_URLS = ["http://almostobsolete.net/"];
    +
    +async function getHrefsFromUrl(url) {
    +  const body = await rp({ url });
    +  return getHrefs(body, { baseUrl: url});
    +}
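
The TODO added in the Nov 21 revision lists five constraints: start from START_URLS, stay inside ALLOW_DOMAINS, cap in-flight requests at MAX_CONCURRENT, cap total requests at MAX_COUNT, and never request a URL twice. One possible way to satisfy them is sketched below. This is not the author's linked solution, which is not reproduced here. The function name crawl, its options object, and the injected fetchLinks callback are hypothetical names chosen for illustration, and the sketch uses the built-in WHATWG URL class instead of url.parse for the hostname check.

```javascript
"use strict";

// Hypothetical sketch: crawl() and its option names are illustrative,
// not part of the gist. fetchLinks(url) must return a Promise of an
// array of absolute URLs found on that page.
function crawl({ startUrls, allowDomains, maxConcurrent, maxCount, fetchLinks }) {
  const seen = new Set(); // every URL ever queued, so none is requested twice
  const queue = [];       // URLs waiting to be requested
  const visited = [];     // URLs actually requested, in start order
  let started = 0;        // total requests started so far
  let active = 0;         // requests currently in flight

  const isAllowed = (candidate) => {
    try {
      return allowDomains.has(new URL(candidate).hostname);
    } catch (err) {
      return false; // malformed URLs are skipped
    }
  };

  const enqueue = (candidate) => {
    if (!seen.has(candidate) && isAllowed(candidate)) {
      seen.add(candidate);
      queue.push(candidate);
    }
  };

  startUrls.forEach(enqueue);

  return new Promise((resolve) => {
    const pump = () => {
      // Start as many requests as both limits allow.
      while (active < maxConcurrent && started < maxCount && queue.length > 0) {
        const current = queue.shift();
        started++;
        active++;
        visited.push(current);
        Promise.resolve()
          .then(() => fetchLinks(current))
          .then((links) => links.forEach(enqueue), () => {}) // failed pages add no links
          .then(() => {
            active--;
            pump(); // a slot freed up, so try to start more work
          });
      }
      // Nothing in flight and nothing startable: the crawl is finished.
      if (active === 0) resolve(visited);
    };
    pump();
  });
}
```

With the gist's own helpers, a call might pass getHrefsFromUrl as fetchLinks and the four constants as the remaining options. Taking the fetcher as a parameter keeps the scheduling logic independent of request-promise-native and get-hrefs, which also makes it possible to exercise the limits against an in-memory link graph.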