Skip to content

Instantly share code, notes, and snippets.

@darencard
Last active June 1, 2025 03:49
Show Gist options
  • Save darencard/9497e151882c3ff366335040e20b6714 to your computer and use it in GitHub Desktop.
Save darencard/9497e151882c3ff366335040e20b6714 to your computer and use it in GitHub Desktop.

Revisions

  1. Daren Card revised this gist Jan 19, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion CDS_extract.md
    Original file line number Diff line number Diff line change
    @@ -53,4 +53,4 @@ awk -v OFS="\t" '{ print $1, "0", $2, ($4 + $5) / $2 }'
    ```


    ![Analytics](https://ga-beacon.appspot.com/UA-112753663-1/CDS_extract?pixel)
    [![Analytics](https://ga-beacon.appspot.com/UA-112753663-1/gist/CDS_extract?pixel)](https://gist.github.com/darencard/9497e151882c3ff366335040e20b6714)
  2. Daren Card revised this gist Jan 19, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion CDS_extract.md
    Original file line number Diff line number Diff line change
    @@ -53,4 +53,4 @@ awk -v OFS="\t" '{ print $1, "0", $2, ($4 + $5) / $2 }'
    ```


    [![Analytics](https://ga-beacon.appspot.com/UA-112753663-1/9497e151882c3ff366335040e20b6714?pixel)](https://gist.github.com/darencard/9497e151882c3ff366335040e20b6714)
    ![Analytics](https://ga-beacon.appspot.com/UA-112753663-1/CDS_extract?pixel)
  3. Daren Card revised this gist Jan 19, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion CDS_extract.md
    Original file line number Diff line number Diff line change
    @@ -53,4 +53,4 @@ awk -v OFS="\t" '{ print $1, "0", $2, ($4 + $5) / $2 }'
    ```


    [![Analytics](https://ga-beacon.appspot.com/UA-112753663-1/darencard/9497e151882c3ff366335040e20b6714)](https://gist.github.com/darencard/9497e151882c3ff366335040e20b6714)
    [![Analytics](https://ga-beacon.appspot.com/UA-112753663-1/9497e151882c3ff366335040e20b6714?pixel)](https://gist.github.com/darencard/9497e151882c3ff366335040e20b6714)
  4. Daren Card revised this gist Jan 19, 2018. 1 changed file with 3 additions and 0 deletions.
    3 changes: 3 additions & 0 deletions CDS_extract.md
    Original file line number Diff line number Diff line change
    @@ -51,3 +51,6 @@ gffread -x - -g <genome.fasta> <annotation.gff> | \
    seqtk comp - | \
    awk -v OFS="\t" '{ print $1, "0", $2, ($4 + $5) / $2 }'
    ```


    [![Analytics](https://ga-beacon.appspot.com/UA-112753663-1/darencard/9497e151882c3ff366335040e20b6714)](https://gist.github.com/darencard/9497e151882c3ff366335040e20b6714)
  5. Daren Card revised this gist Jan 17, 2018. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions CDS_extract.md
    Original file line number Diff line number Diff line change
    @@ -40,7 +40,7 @@ printf "%s%s", $i, (i+3>NF?"\n":FS) }' | \
    awk -v name="$name" '{ print ">"name; print $1 }'; \
    done | \
    seqtk comp - | \
    awk -v OFS="\t" '{ print $1, "1", $2, ($4 + $5) / $2 }'
    awk -v OFS="\t" '{ print $1, "0", $2, ($4 + $5) / $2 }'
    ```

    It is even easier if you want to look at GC across the entire CDS.
    @@ -49,5 +49,5 @@ It is even easier if you want to look at GC across the entire CDS.
    ```bash
    gffread -x - -g <genome.fasta> <annotation.gff> | \
    seqtk comp - | \
    awk -v OFS="\t" '{ print $1, "1", $2, ($4 + $5) / $2 }'
    awk -v OFS="\t" '{ print $1, "0", $2, ($4 + $5) / $2 }'
    ```
  6. Daren Card revised this gist Jan 17, 2018. 1 changed file with 3 additions and 3 deletions.
    6 changes: 3 additions & 3 deletions CDS_extract.md
    Original file line number Diff line number Diff line change
    @@ -26,7 +26,7 @@ done \
    > <out.fasta>
    ```

    Now it is easy to calculate something like GC3, the GC content of 3rd codon positions, using a tool like [`seqtk`](https://github.com/lh3/seqtk).
    Now it is easy to calculate something like GC3, the GC content of 3rd codon positions, using a tool like [`seqtk`](https://github.com/lh3/seqtk). We'll make the output in BED format.

    ```bash
    gffread -x - -g <genome.fasta> <annotation.gff> | \
    @@ -40,7 +40,7 @@ printf "%s%s", $i, (i+3>NF?"\n":FS) }' | \
    awk -v name="$name" '{ print ">"name; print $1 }'; \
    done | \
    seqtk comp - | \
    awk -v OFS="\t" '{ print $1, $2, ($4 + $5) / $2 }'
    awk -v OFS="\t" '{ print $1, "1", $2, ($4 + $5) / $2 }'
    ```

    It is even easier if you want to look at GC across the entire CDS.
    @@ -49,5 +49,5 @@ It is even easier if you want to look at GC across the entire CDS.
    ```bash
    gffread -x - -g <genome.fasta> <annotation.gff> | \
    seqtk comp - | \
    awk -v OFS="\t" '{ print $1, $2, ($4 + $5) / $2 }'
    awk -v OFS="\t" '{ print $1, "1", $2, ($4 + $5) / $2 }'
    ```
  7. Daren Card revised this gist Jan 17, 2018. 1 changed file with 1 addition and 1 deletion.
    2 changes: 1 addition & 1 deletion CDS_extract.md
    Original file line number Diff line number Diff line change
    @@ -10,7 +10,7 @@ Let's extract the CDS sequences for each transcript using a genome sequence and
    gffread -x <out.fasta> -g <genome.fasta> <annotation.gff>
    ```

    Pretty straight-forward. We can also set the command to output to STDOUT instead of a file by seeing the option `-x -`. Let's demo this by going a step further with the output and extracting every 3rd codon position (relies on [`bioawk`](https://github.com/lh3/bioawk)).
    Pretty straight-forward. We can also set the command to output to STDOUT instead of a file by seeing the option `-x -`. Let's demo this by going a step further with the output and extracting every 3rd codon position (relies on [`bioawk`](https://github.com/lh3/bioawk)). Key is an `awk` command documented [here](https://stackoverflow.com/questions/22354082/print-every-nth-column-of-a-file).

    ```bash
    gffread -x - -g <genome.fasta> <annotation.gff> | \
  8. Daren Card created this gist Jan 17, 2018.
    53 changes: 53 additions & 0 deletions CDS_extract.md
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,53 @@
    # Extracting spliced sequences (e.g., CDS) from GFF files

    GFF is a common format for storing genetic feature annotations. In the case of gene annotations, subsets of elements are split over multiple lines, as things like exons and CDS features will have gaps based on the full genome sequence. Therefore, while it is easy to extract exon and CDS lines, it can be difficult to associate them together based on a parent (e.g., transcript) ID and perform downstream operations. Even extracting the full CDS sequence using a GFF file can be tricky for this reason, even though it seems trivial.

    Here we'll overcome this difficulty using the [`gffread`](https://github.com/gpertea/gffread) tool. Installation is pretty easy and is documented in the GitHub README. `gffread` has a lot of options, but here we'll just document one that extracts the spliced CDS for each GFF transcript (`-x` option). Note that you can do the same thing for exons (`-w` option) and can also produce the protein sequence (`-y` option).

    Let's extract the CDS sequences for each transcript using a genome sequence and a GFF annotation file.

    ```bash
    gffread -x <out.fasta> -g <genome.fasta> <annotation.gff>
    ```

    Pretty straight-forward. We can also set the command to output to STDOUT instead of a file by seeing the option `-x -`. Let's demo this by going a step further with the output and extracting every 3rd codon position (relies on [`bioawk`](https://github.com/lh3/bioawk)).

    ```bash
    gffread -x - -g <genome.fasta> <annotation.gff> | \
    bioawk -c fastx '{ print $name, $seq }' | \
    while read line; \
    do \
    name=$(echo $line | cut -f 1); \
    echo $line | cut -f 2 | \
    awk -F "" '{ for (i = 3; i <= NF; i += 3) \
    printf "%s%s", $i, (i+3>NF?"\n":FS) }' | \
    awk -v name="$name" '{ print ">"name; print $1 }'; \
    done \
    > <out.fasta>
    ```

    Now it is easy to calculate something like GC3, the GC content of 3rd codon positions, using a tool like [`seqtk`](https://github.com/lh3/seqtk).

    ```bash
    gffread -x - -g <genome.fasta> <annotation.gff> | \
    bioawk -c fastx '{ print $name, $seq }' | \
    while read line; \
    do \
    name=$(echo $line | cut -f 1); \
    echo $line | cut -f 2 | \
    awk -F "" '{ for (i = 3; i <= NF; i += 3) \
    printf "%s%s", $i, (i+3>NF?"\n":FS) }' | \
    awk -v name="$name" '{ print ">"name; print $1 }'; \
    done | \
    seqtk comp - | \
    awk -v OFS="\t" '{ print $1, $2, ($4 + $5) / $2 }'
    ```

    It is even easier if you want to look at GC across the entire CDS.


    ```bash
    gffread -x - -g <genome.fasta> <annotation.gff> | \
    seqtk comp - | \
    awk -v OFS="\t" '{ print $1, $2, ($4 + $5) / $2 }'
    ```