Skip to content

Instantly share code, notes, and snippets.

@tazjin
Last active September 28, 2019 17:32
Show Gist options
  • Select an option

  • Save tazjin/c929e4cceb269cd7fd2117c45a4d1ab1 to your computer and use it in GitHub Desktop.

Select an option

Save tazjin/c929e4cceb269cd7fd2117c45a4d1ab1 to your computer and use it in GitHub Desktop.

Revisions

  1. Vincent Ambo revised this gist Sep 28, 2019. 1 changed file with 4 additions and 2 deletions.
    6 changes: 4 additions & 2 deletions description.md
    Original file line number Diff line number Diff line change
    @@ -26,9 +26,11 @@ only 31% of the files. This leads me to believe that most of the `.asc` files
    are actually GPG public keys, but draw your own conclusions.

    *Fun fact #2*: The plain text format (`.txt`) outclasses all of these by a large
    margin, but I suspect that a lot of those are also actually Markdown files.
    margin (9829711 files, 473 gigabytes), but I suspect that a lot of that is data
    of some kind and that a lot of the actual documentation is actually Markdown
    without the `.md` extension.

    Without writing some more sophisticated content analysis function these won't be
    Without writing some more sophisticated content analysis function this won't be
    possible to include in the stats.

    [bq]: https://console.cloud.google.com/marketplace/details/github/github-repos
  2. Vincent Ambo revised this gist Sep 28, 2019. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion description.md
    Original file line number Diff line number Diff line change
    @@ -11,7 +11,7 @@ for the identifiers).
    Read the query for more information, or behold these results:

    | Language | # of files | # of megabytes |
    |------------------------+------------+----------------|
    |------------------------|------------|----------------|
    | Markdown | 7982489 | 26187 |
    | AsciiDoc (with `.asc`) | 124213 | 9059 |
    | AsciiDoc (no `.asc`) | 86765 | 823 |
  3. Vincent Ambo revised this gist Sep 28, 2019. 1 changed file with 17 additions and 7 deletions.
    24 changes: 17 additions & 7 deletions description.md
    Original file line number Diff line number Diff line change
    @@ -10,16 +10,26 @@ for the identifiers).

    Read the query for more information, or behold these results:

    |Language|# of files|# of megabytes|
    |--------|----------|--------------|
    |Markdown|7982489 |26187 |
    |AsciiDoc|124213 |9059 |
    |Org-Mode|24779 |314 |
    | Language | # of files | # of megabytes |
    |------------------------+------------+----------------|
    | Markdown | 7982489 | 26187 |
    | AsciiDoc (with `.asc`) | 124213 | 9059 |
    | AsciiDoc (no `.asc`) | 86765 | 823 |
    | Org-Mode | 24779 | 314 |

    Fun fact: The plain text format (`.txt`) outclasses all of these by a large
    *Fun fact #1*: The first AsciiDoc table row is giving it a significant benefit
    of the doubt by including files with a `.asc` extension since AsciiDoc has no
    [standard extension][adext].

    Removing this extension removes 91% of the content attributed to AsciiDoc, but
    only 31% of the files. This leads me to believe that most of the `.asc` files
    are actually GPG public keys, but draw your own conclusions.

    *Fun fact #2*: The plain text format (`.txt`) outclasses all of these by a large
    margin, but I suspect that a lot of those are also actually Markdown files.

    Without writing some more sophisticated content analysis function these won't be
    possible to include in the stats.

    [bq]: https://console.cloud.google.com/marketplace/details/github/github-repos
    [bq]: https://console.cloud.google.com/marketplace/details/github/github-repos
    [adext]: https://asciidoctor.org/docs/asciidoc-recommended-practices/#document-extension
  4. Vincent Ambo created this gist Sep 28, 2019.
    25 changes: 25 additions & 0 deletions description.md
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,25 @@
    Documentation format popularity
    ===============================

    The SQL query below queries the [GitHub dataset on Google BigQuery][bq] for
    information about the popularity of different documentation formats.

    Identification is based on file extensions in this version, and duplicates (e.g.
    READMEs across forks) are deduplicated by content hashes (which the table uses
    for the identifiers).

    Read the query for more information, or behold these results:

    |Language|# of files|# of megabytes|
    |--------|----------|--------------|
    |Markdown|7982489 |26187 |
    |AsciiDoc|124213 |9059 |
    |Org-Mode|24779 |314 |

    Fun fact: The plain text format (`.txt`) outclasses all of these by a large
    margin, but I suspect that a lot of those are also actually Markdown files.

    Without writing some more sophisticated content analysis function these won't be
    possible to include in the stats.

    [bq]: https://console.cloud.google.com/marketplace/details/github/github-repos
    44 changes: 44 additions & 0 deletions query.sql
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,44 @@
    -- Queries the public BigQuery dataset for Github for all documentation
    -- files and their sizes. Files are identified by content hashes, and can
    -- thus be deduplicated (to exclude duplicated READMEs in forks and such).
    --
    -- Query results in the total count of unique per language, and the
    -- total number of bytes per language.
    #StandardSQL
    WITH
    -- Subtable of all distinct files (deduplicated per file hash) tagged
    -- with the documentation language they are written in, as guessed per
    -- extension.
    docs AS (
    SELECT
    id,
    ANY_VALUE(
    CASE WHEN ENDS_WITH(path, '.md') THEN 'markdown'
    WHEN ENDS_WITH(path, '.asciidoc') THEN 'asciidoc'
    WHEN ENDS_WITH(path, '.adoc') THEN 'asciidoc'
    -- including .asc might also match various GPG keys, but
    -- Markdown will still outclass AsciiDoc significantly.
    WHEN ENDS_WITH(path, '.asc') THEN 'asciidoc'
    WHEN ENDS_WITH(path, '.org') THEN 'org'
    -- Plain text files might actually be Markdown, too, but
    -- lets ignore that
    WHEN ENDS_WITH(path, '.txt') THEN 'plain'
    ELSE NULL
    END
    ) AS language
    FROM `bigquery-public-data.github_repos.files`
    GROUP BY id),

    docs_files AS (
    SELECT d.id AS id, d.language AS language, c.size AS size
    FROM docs d
    INNER JOIN `bigquery-public-data.github_repos.contents` c
    ON d.id = c.id AND c.binary = false
    WHERE d.language IS NOT NULL)

    SELECT
    language,
    COUNT(id) AS total_count,
    SUM(size) AS total_size
    FROM docs_files
    GROUP BY language