Sample: S1_fastqc.zip

Date: 2017-04-10
Sample data: /Users/kassambara/Documents/R/MyPackages/fastqcr/inst/fastqc_results/S1_fastqc.zip
R packages: Report generated with the R package fastqcr version 0.1.0
Experiment description: Sequencing data

Required R packages

library(fastqcr)
library(dplyr)

Reading the file

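# Read all modules; qc.path holds the path to the FastQC result file
# (here, the S1_fastqc.zip listed under "Sample data")
qc <- qc_read(qc.path)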
# Elements contained in the qc object
names(qc)

##  [1] "summary"                       "basic_statistics"              "per_base_sequence_quality"     "per_tile_sequence_quality"    
##  [5] "per_sequence_quality_scores"   "per_base_sequence_content"     "per_sequence_gc_content"       "per_base_n_content"           
##  [9] "sequence_length_distribution"  "sequence_duplication_levels"   "overrepresented_sequences"     "adapter_content"              
## [13] "kmer_content"                  "total_deduplicated_percentage"

Plotting and Interpreting

Summary

The summary shows which modules were tested and the status of each test result:

normal results (PASS),
slightly abnormal results (WARN: warning),
or very unusual results (FAIL: failure).

Some experiments are expected to produce libraries that are biased in particular ways. Treat the summary evaluations as pointers to where you should concentrate your attention, and try to understand why your library may not look normal.

qc_plot(qc, "summary")
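As a minimal sketch of how the PASS/WARN/FAIL statuses can be used as pointers, the snippet below tabulates a small hand-made summary table and keeps only the modules that deserve a closer look. The data frame and its column names (module, status) are assumptions made for illustration; they are not read from the qc object.

```r
# Hypothetical summary table; the column names are assumed for illustration
# and may differ from the real summary element of the qc object.
qc_summary <- data.frame(
  module = c("Basic Statistics", "Per base sequence quality",
             "Per base sequence content", "Adapter Content"),
  status = c("PASS", "WARN", "FAIL", "PASS"),
  stringsAsFactors = FALSE
)

# Count how many modules fall into each status
status_counts <- table(factor(qc_summary$status,
                              levels = c("PASS", "WARN", "FAIL")))
print(status_counts)

# Keep only the modules that need a closer look
flagged <- qc_summary[qc_summary$status != "PASS", ]
print(flagged)
```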

Basic Statistics

The basic statistics module reports simple metrics such as:

Total sequences: the number of reads,
Sequence length: the length of the reads (minimum - maximum),
%GC: the GC content.

qc_plot(qc, "Basic statistics")

Per base sequence quality

The per base sequence quality plot shows the distribution of quality scores across all reads at each base position. The background color delimits three zones: very good quality (green), reasonable quality (orange) and poor quality (red). In a good sample, the quality scores are all above 28:

qc_plot(qc, "Per base sequence quality")

Problems:

warning if the median for any base is less than 25.
failure if the median for any base is less than 20.
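The median thresholds above can be sketched as a small classifier. This is only an illustration using made-up per-position medians; the function name classify_base_quality is hypothetical and is not part of fastqcr.

```r
# Classify a sample from its per-position median quality scores,
# using only the median thresholds quoted above (warn < 25, fail < 20).
classify_base_quality <- function(median_quality) {
  if (any(median_quality < 20)) {
    "FAIL"
  } else if (any(median_quality < 25)) {
    "WARN"
  } else {
    "PASS"
  }
}

# Made-up medians for a 10-base read whose quality decays towards the end
medians <- c(36, 36, 35, 35, 34, 33, 31, 28, 24, 22)
quality_status <- classify_base_quality(medians)
print(quality_status)

# Positions responsible for the flag
low_positions <- which(medians < 25)
print(low_positions)
```

Listing the offending positions is useful when deciding how much of the read end a quality trimmer would need to remove.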

Common reasons for problems:

Degradation of quality (sequencing chemistry) over the duration of long runs. Remedy: quality trimming.
Short loss of quality early in the run, which then recovers to produce good quality sequence later. This can be explained by a transient problem with the run (bubbles in the flowcell, for example). In these cases trimming is not advisable, as it would remove the later good sequence, but you might want to consider masking bases during subsequent mapping or assembly.
Library with reads of varying length. A warning or error is generated because of very low coverage for a given base range. Before committing to any action, check how many sequences were responsible for triggering the error by looking at the sequence length distribution module results.

Per sequence quality scores

The per sequence quality scores plot shows the frequency of each mean quality score in a sample. It lets you see whether a subset of your sequences has low quality values. If the reads are of good quality, the peak on the plot should be shifted as far to the right as possible (quality > 27).

qc_plot(qc, "Per sequence quality scores")

Problems:

warning if the most frequently observed mean quality is below 27 - this equates to a 0.2% error rate.
failure if the most frequently observed mean quality is below 20 - this equates to a 1% error rate.
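The error rates quoted above follow from the standard Phred definition, Q = -10 * log10(p), where p is the probability that a base call is wrong. A minimal sketch of the conversion (the helper names are chosen here for illustration):

```r
# Convert a Phred quality score to the probability of a wrong base call
phred_to_error <- function(q) 10^(-q / 10)

# Convert an error probability back to a Phred quality score
error_to_phred <- function(p) -10 * log10(p)

# Error rates, in percent, for the two thresholds used by this module
error_pct <- 100 * phred_to_error(c(20, 27))
print(round(error_pct, 2))  # 1.00 and 0.20
```

A quality of 27 corresponds to roughly 2 wrong calls per 1000 bases, and a quality of 20 to 1 wrong call per 100 bases.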

Common reasons for problems:

General loss of quality within a run. Remedy: For long runs this may be alleviated through quality trimming.

Per base sequence content

The per base sequence content plot shows the proportion of each of the four nucleotides at each position. In a random library you expect no nucleotide bias, so the lines should run almost parallel to each other. In a good sequence composition, the difference between A and T, or between G and C, is < 10% at every position.

qc_plot(qc, "Per base sequence content")

It’s worth noting that some types of library will always produce a biased sequence composition, normally at the start of the read. For example, in RNA-Seq data it is common to see bias at the beginning of the reads. This arises during RNA-Seq library preparation, when “random” primers are annealed to the start of sequences. These primers are not truly random, which leads to variation at the beginning of the reads. These primers can be removed with an adapter-trimming tool.

Problems:

warning if the difference between A and T, or G and C is greater than 10% in any position.
failure if the difference between A and T, or G and C is greater than 20% in any position.
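The two thresholds above can be sketched as follows. The table of per-position percentages is made up for illustration, and its layout (one row per position, one column per base) is an assumption rather than the structure of the qc object.

```r
# Hypothetical per-position base percentages (one row per read position)
content <- data.frame(
  position = 1:4,
  A = c(35, 27, 25, 26),
  T = c(20, 25, 26, 25),
  G = c(25, 24, 25, 24),
  C = c(20, 24, 24, 25)
)

# Absolute A/T and G/C differences at each position
at_diff <- abs(content$A - content$T)
gc_diff <- abs(content$G - content$C)
max_diff <- pmax(at_diff, gc_diff)

# Apply the 10% (warning) and 20% (failure) thresholds
content_status <- if (any(max_diff > 20)) {
  "FAIL"
} else if (any(max_diff > 10)) {
  "WARN"
} else {
  "PASS"
}
print(content_status)

# Positions whose imbalance exceeds the warning threshold
biased_positions <- content$position[max_diff > 10]
print(biased_positions)
```

In this made-up table only the first position is imbalanced, which mirrors the start-of-read bias described for RNA-Seq libraries.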

Common reasons for problems:

Overrepresented sequences: adapter dimers or rRNA.
Biased selection of random primers for RNA-Seq. Nearly all RNA-Seq libraries will fail this module because of this bias, but this is not a problem which can be fixed by processing, and it doesn’t seem to adversely affect the ability to measure expression.
Biased composition libraries: some libraries are inherently biased in their sequence composition. For example, a library treated with sodium bisulphite will have had most of its cytosines converted to thymines, so its base composition will be almost devoid of cytosines. This triggers an error, despite being entirely normal for that type of library.
Library which has been aggressively adapter trimmed.

Per sequence GC content

The per sequence GC content plot displays the distribution of GC content over all sequences. In a random library you expect a roughly normal GC content distribution. An unusually shaped or shifted distribution could indicate contamination or some systematic bias:

qc_plot(qc, "Per sequence GC content")

You can generate the theoretical GC content curves files using an R package called fastqcTheoreticalGC written by Mike Love.
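To make concrete what the GC distribution is built from, here is a minimal base R sketch that computes the GC percentage of each read and bins the values. The reads are toy sequences and the helper name gc_percent is hypothetical; real data would be read from a FASTQ file.

```r
# Percentage of G and C bases in each read
gc_percent <- function(reads) {
  bases <- strsplit(toupper(reads), "")
  vapply(bases, function(b) 100 * mean(b %in% c("G", "C")), numeric(1))
}

# Toy reads, for illustration only
reads <- c("ACGTACGT", "GGGCCCAT", "ATATATAT", "GCGCGCGC")
gc_values <- gc_percent(reads)
print(gc_values)

# Bin the per-read values; the module's plot is a finer version of this
gc_bins <- table(cut(gc_values, breaks = seq(0, 100, by = 25),
                     include.lowest = TRUE))
print(gc_bins)
```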

Per base N content

If a sequencer is unable to make a base call with sufficient confidence, it will normally substitute an N for a conventional base call. The per base N content module plots the percentage of base calls at each position for which an N was called.

qc_plot(qc, "Per base N content")

Problems:

warning if any position shows an N content of >5%.
failure if any position shows an N content of >20%.
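As a rough sketch of the quantity being thresholded, the snippet below computes the percentage of N calls at each position for a handful of toy reads and applies the two cut-offs above. The function name n_percent_by_position is hypothetical.

```r
# Percentage of N calls at each read position
n_percent_by_position <- function(reads) {
  max_len <- max(nchar(reads))
  vapply(seq_len(max_len), function(i) {
    calls <- substr(reads, i, i)
    calls <- calls[calls != ""]  # reads shorter than i contribute nothing
    100 * mean(calls == "N")
  }, numeric(1))
}

# Toy reads, for illustration only
reads <- c("ACGTN", "ACNTN", "ACGTA", "ANGTC")
n_pct <- n_percent_by_position(reads)
print(n_pct)

# Apply the 5% (warning) and 20% (failure) thresholds
n_status <- if (any(n_pct > 20)) {
  "FAIL"
} else if (any(n_pct > 5)) {
  "WARN"
} else {
  "PASS"
}
print(n_status)
```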

Common reasons for problems:

General loss of quality.
Very biased sequence composition in the library.

Sequence length distribution

The sequence length distribution module reports whether all sequences have the same length. For some sequencing platforms it is entirely normal to have different read lengths, so warnings here can be ignored. In many cases this will produce a simple graph showing a peak at a single size. This module will raise an error if any of the sequences have zero length.

qc_plot(qc, "Sequence length distribution")
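The checks described for this module can be sketched in a few lines of base R. The reads are toy sequences, and the status logic simply restates the rules above (warning when lengths differ, error when a read has zero length).

```r
# Toy reads of varying length, for illustration only
reads <- c("ACGT", "ACGTAC", "ACGT", "ACG", "ACGT")

# Number of reads observed at each length
read_lengths <- nchar(reads)
length_table <- table(read_lengths)
print(length_table)

# Warning if lengths differ, failure if any read is empty
length_status <- if (any(read_lengths == 0)) {
  "FAIL"
} else if (length(unique(read_lengths)) > 1) {
  "WARN"
} else {
  "PASS"
}
print(length_status)
```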

Sequence duplication levels

The sequence duplication levels module counts the degree of duplication for every sequence in a library and creates a plot showing the relative number of sequences with different degrees of duplication. A high level of duplication most likely indicates some kind of enrichment bias (e.g. PCR over-amplification).

qc_plot(qc, "Sequence duplication levels")

Problems:

warning if non-unique sequences make up more than 20% of the total.
failure if non-unique sequences make up more than 50% of the total.
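The following is a simplified sketch of how duplication levels can be counted. It interprets "non-unique sequences" as reads whose exact sequence occurs more than once, which is an assumption made here; FastQC's own estimate involves additional steps that are not reproduced.

```r
# Toy reads, for illustration only
reads <- c("AAAA", "AAAA", "AAAA", "CCCC", "CCCC", "GGGG",
           "TTTT", "ACGT", "TGCA", "CATG", "GATC", "CTAG")

# Number of copies of each distinct sequence
counts <- as.vector(table(reads))

# Number of distinct sequences seen once, twice, three times, ...
dup_levels <- table(counts)
print(dup_levels)

# Share of reads belonging to a sequence seen more than once
non_unique_pct <- 100 * sum(counts[counts > 1]) / length(reads)

# Share of reads that would remain after removing duplicates
dedup_pct <- 100 * length(counts) / length(reads)

# Apply the 20% (warning) and 50% (failure) thresholds
dup_status <- if (non_unique_pct > 50) {
  "FAIL"
} else if (non_unique_pct > 20) {
  "WARN"
} else {
  "PASS"
}
print(c(non_unique = non_unique_pct, deduplicated = dedup_pct))
print(dup_status)
```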

Common reasons for problems:

Technical duplicates arising from PCR artefacts.
Biological duplicates, which are natural collisions where different copies of exactly the same sequence are randomly selected.

In RNA-Seq data, duplication levels can be as high as 40%. Nevertheless, when analysing transcriptome sequencing data we should not remove these duplicates, because we cannot tell whether they are PCR duplicates or reflect high expression of a gene in our samples.

Overrepresented sequences

The overrepresented sequences section gives information about primer or adapter contamination. Finding that a single sequence is very overrepresented in the set means either that it is highly biologically significant, or that the library is contaminated or not as diverse as you expected. This module lists all of the sequences which make up more than 0.1% of the total.

qc_plot(qc, "Overrepresented sequences")

Problems:

warning if any sequence is found to represent more than 0.1% of the total.
failure if any sequence is found to represent more than 1% of the total.
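A minimal sketch of the 0.1% and 1% thresholds, using a synthetic set of 2000 reads. The two repeated sequences are arbitrary, and the remaining entries are placeholder strings standing in for distinct reads.

```r
# Synthetic library: two repeated sequences plus placeholder unique reads
reads <- c(rep("ACGTACGT", 30),
           rep("TTTTGGGG", 5),
           sprintf("UNIQUE_READ_%04d", 1:1965))

# Percentage of the library taken up by each distinct sequence
counts <- table(reads)
pct <- setNames(100 * as.vector(counts) / length(reads), names(counts))

# Sequences making up more than 0.1% of the total, most frequent first
overrepresented <- sort(pct[pct > 0.1], decreasing = TRUE)
print(overrepresented)

# Apply the 0.1% (warning) and 1% (failure) thresholds
over_status <- if (any(pct > 1)) {
  "FAIL"
} else if (any(pct > 0.1)) {
  "WARN"
} else {
  "PASS"
}
print(over_status)
```

Here a sequence present in 30 of 2000 reads (1.5%) is enough to trigger a failure, while one present in 5 reads (0.25%) would only raise a warning on its own.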

Common reasons for problems:

Small RNA libraries, where sequences are not subjected to random fragmentation and the same sequence may naturally be present in a significant proportion of the library.

Adapter content

The adapter content module checks for the presence of read-through adapter sequences. Knowing whether your library contains a significant amount of adapter helps you decide whether adapter trimming is needed.

qc_plot(qc, "Adapter content")

Problems:

warning if any sequence is present in more than 5% of all reads.
failure if any sequence is present in more than 10% of all reads.

A warning or failure means that the sequences will need to be adapter trimmed before proceeding with any downstream analysis.
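As a simplified sketch of the check, the snippet below measures the share of reads containing a given adapter string and applies the 5% and 10% thresholds. The adapter sequence and the reads are placeholders chosen for illustration, and the real module reports adapter content per position rather than as a single figure.

```r
# Placeholder adapter sequence (an assumption; use the one from your kit)
adapter <- "AGATCGGAAGAGC"

# Toy library: 3 reads with adapter read-through, 37 without
reads <- c(rep("ACGTACGTACGTAGATCGGAAGAGCACAC", 3),
           rep("ACGTACGTACGTACGTACGTACGTACGT", 37))

# Exact (non-regex) search for the adapter in each read
has_adapter <- grepl(adapter, reads, fixed = TRUE)
adapter_pct <- 100 * sum(has_adapter) / length(reads)
print(adapter_pct)

# Apply the 5% (warning) and 10% (failure) thresholds
adapter_status <- if (adapter_pct > 10) {
  "FAIL"
} else if (adapter_pct > 5) {
  "WARN"
} else {
  "PASS"
}
print(adapter_status)
```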

Kmer content

qc_plot(qc, "Kmer content")

Useful Links

FastQC report for a good Illumina dataset
FastQC report for a bad Illumina dataset
Online documentation for each FastQC report
