Last active
September 28, 2019 17:32
-
-
Save tazjin/c929e4cceb269cd7fd2117c45a4d1ab1 to your computer and use it in GitHub Desktop.
Revisions
-
Vincent Ambo revised this gist
Sep 28, 2019 . 1 changed file with 4 additions and 2 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -26,9 +26,11 @@ only 31% of the files. This leads me to believe that most of the `.asc` files are actually GPG public keys, but draw your own conclusions. *Fun fact #2*: The plain text format (`.txt`) outclasses all of these by a large margin (9829711 files, 473 gigabytes), but I suspect that a lot of that is data of some kind and that a lot of the actual documentation is actually Markdown without the `.md` extension. Without writing some more sophisticated content analysis function this won't be possible to include in the stats. [bq]: https://console.cloud.google.com/marketplace/details/github/github-repos -
Vincent Ambo revised this gist
Sep 28, 2019 . 1 changed file with 1 addition and 1 deletion.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -11,7 +11,7 @@ for the identifiers). Read the query for more information, or behold these results: | Language | # of files | # of megabytes | |------------------------|------------|----------------| | Markdown | 7982489 | 26187 | | AsciiDoc (with `.asc`) | 124213 | 9059 | | AsciiDoc (no `.asc`) | 86765 | 823 | -
Vincent Ambo revised this gist
Sep 28, 2019 . 1 changed file with 17 additions and 7 deletions.There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -10,16 +10,26 @@ for the identifiers). Read the query for more information, or behold these results: | Language | # of files | # of megabytes | |------------------------+------------+----------------| | Markdown | 7982489 | 26187 | | AsciiDoc (with `.asc`) | 124213 | 9059 | | AsciiDoc (no `.asc`) | 86765 | 823 | | Org-Mode | 24779 | 314 | *Fun fact #1*: The first AsciiDoc table row is giving it a significant benefit of the doubt by including files with a `.asc` extension since AsciiDoc has no [standard extension][adext]. Removing this extension removes 91% of the content attributed to AsciiDoc, but only 31% of the files. This leads me to believe that most of the `.asc` files are actually GPG public keys, but draw your own conclusions. *Fun fact #2*: The plain text format (`.txt`) outclasses all of these by a large margin, but I suspect that a lot of those are also actually Markdown files. Without writing some more sophisticated content analysis function these won't be possible to include in the stats. [bq]: https://console.cloud.google.com/marketplace/details/github/github-repos [adext]: https://asciidoctor.org/docs/asciidoc-recommended-practices/#document-extension -
Vincent Ambo created this gist
Sep 28, 2019 .There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -0,0 +1,25 @@ Documentation format popularity =============================== The SQL query below queries the [GitHub dataset on Google BigQuery][bq] for information about the popularity of different documentation formats. Identification is based on file extensions in this version, and duplicates (e.g. READMEs across forks) are deduplicated by content hashes (which the table uses for the identifiers). Read the query for more information, or behold these results: |Language|# of files|# of megabytes| |--------|----------|--------------| |Markdown|7982489 |26187 | |AsciiDoc|124213 |9059 | |Org-Mode|24779 |314 | Fun fact: The plain text format (`.txt`) outclasses all of these by a large margin, but I suspect that a lot of those are also actually Markdown files. Without writing some more sophisticated content analysis function these won't be possible to include in the stats. [bq]: https://console.cloud.google.com/marketplace/details/github/github-repos This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode charactersOriginal file line number Diff line number Diff line change @@ -0,0 +1,44 @@ -- Queries the public BigQuery dataset for Github for all documentation -- files and their sizes. Files are identified by content hashes, and can -- thus be deduplicated (to exclude duplicated READMEs in forks and such). -- -- Query results in the total count of unique per language, and the -- total number of bytes per language. #StandardSQL WITH -- Subtable of all distinct files (deduplicated per file hash) tagged -- with the documentation language they are written in, as guessed per -- extension. docs AS ( SELECT id, ANY_VALUE( CASE WHEN ENDS_WITH(path, '.md') THEN 'markdown' WHEN ENDS_WITH(path, '.asciidoc') THEN 'asciidoc' WHEN ENDS_WITH(path, '.adoc') THEN 'asciidoc' -- including .asc might also match various GPG keys, but -- Markdown will still outclass AsciiDoc significantly. WHEN ENDS_WITH(path, '.asc') THEN 'asciidoc' WHEN ENDS_WITH(path, '.org') THEN 'org' -- Plain text files might actually be Markdown, too, but -- lets ignore that WHEN ENDS_WITH(path, '.txt') THEN 'plain' ELSE NULL END ) AS language FROM `bigquery-public-data.github_repos.files` GROUP BY id), docs_files AS ( SELECT d.id AS id, d.language AS language, c.size AS size FROM docs d INNER JOIN `bigquery-public-data.github_repos.contents` c ON d.id = c.id AND c.binary = false WHERE d.language IS NOT NULL) SELECT language, COUNT(id) AS total_count, SUM(size) AS total_size FROM docs_files GROUP BY language