Skip to content

Instantly share code, notes, and snippets.

@arun057
Created November 13, 2019 08:28
Show Gist options
  • Save arun057/d8dee65ece54de42b574f287ed9e601d to your computer and use it in GitHub Desktop.
Save arun057/d8dee65ece54de42b574f287ed9e601d to your computer and use it in GitHub Desktop.

Revisions

  1. arun057 created this gist Nov 13, 2019.
    108 changes: 108 additions & 0 deletions data_ops.sql
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,108 @@
    -- Imported data into postgresql running locally. For larger data sets I would go the map reduce route, seemed overkill here.
    -- All of the questions are answered with sql - I did end up double checking some of the results with python - happy to provide that code on request.


    -- Consider only the rows with country_id = "BDV" (there are 844 such rows).
    -- For each site_id, we can compute the number of unique user_id's found in these 844 rows.
    -- Which site_id has the largest number of unique users? And what's the number?


    SELECT x.site_id, COUNT(x.site_id) as site_count
    FROM (SELECT DISTINCT user_id, site_id FROM visits WHERE country_id = 'BDV') AS x
    GROUP BY x.site_id
    ORDER BY site_count DESC
    LIMIT 1

    -- site_id site_count
    -- "5NPAU" "544"





    -- Between 2019-02-03 00:00:00 and 2019-02-04 23:59:59,
    -- there are four users who visited a certain site more than 10 times.
    -- Find these four users & which sites they (each) visited more than 10 times.
    -- (Simply provides four triples in the form (user_id, site_id, number of visits)
    -- in the box below.)

    SELECT user_id, site_id, (COUNT(site_id)) AS site_count
    FROM visits
    WHERE ts > '2019-02-03 00:00:00'
    and ts < '2019-02-04 23:59:59'
    GROUP BY user_id, site_id
    HAVING COUNT(site_id) > 10
    ORDER BY site_count DESC;

    -- user_id site_id site_count
    -- "LC3A59" "N0OTG" "26"
    -- "LC06C3" "N0OTG" "25"
    -- "LC3C9D" "N0OTG" "17"
    -- "LC3C7E" "3POLC" "15"



    -- For each site, compute the unique number of users whose last visit
    -- (found in the original data set) was to that site.
    -- For instance, user "LC3561"'s last visit is to "N0OTG" based on
    -- timestamp data. Based on this measure, what are top three sites?
    -- (hint: site "3POLC" is ranked at 5th with 28 users whose last
    -- visit in the data set was to 3POLC;
    -- simply provide three pairs in the form (site_id, number of users).)

    SELECT v.site_id, COUNT(v.user_id) as site_count
    FROM visits AS v
    INNER JOIN (
    SELECT MAX(ts) as latest_ts, user_id
    FROM visits
    GROUP BY user_id
    ORDER BY user_id
    ) AS x
    ON v.user_id=x.user_id AND v.ts=x.latest_ts
    GROUP BY v.site_id
    ORDER BY site_count DESC
    LIMIT 3

    -- site_id site_count
    -- "5NPAU" "992"
    -- "N0OTG" "561"
    -- "QGO3G" "289"






    -- For each user, determine the first site he/she visited and
    -- the last site he/she visited based on the timestamp data.
    -- Compute the number of users whose first/last visits
    -- are to the same website. What is the number?


    SELECT COUNT(*) FROM (
    SELECT a.user_id, a.site_id, a.ts
    FROM visits AS a
    INNER JOIN
    (
    SELECT MIN(c.ts) as first_visit, c.user_id
    FROM visits AS c
    GROUP BY c.user_id
    ) AS b
    ON b.first_visit=a.ts AND b.user_id=a.user_id
    ) AS low
    INNER JOIN (
    SELECT a.user_id, a.site_id, a.ts
    FROM visits AS a
    INNER JOIN
    (
    SELECT MAX(c.ts) as first_visit, c.user_id
    FROM visits AS c
    GROUP BY c.user_id
    ) AS b
    ON b.first_visit=a.ts AND b.user_id=a.user_id
    ) AS high
    ON low.site_id=high.site_id AND low.user_id=high.user_id


    -- count
    -- "1670"