library(fastqcr)
library(dplyr)
# Read all modules
qc <- qc_read(qc.path)
# Elements contained in the qc object
names(qc)
## [1] "summary" "basic_statistics" "per_base_sequence_quality" "per_tile_sequence_quality"
## [5] "per_sequence_quality_scores" "per_base_sequence_content" "per_sequence_gc_content" "per_base_n_content"
## [9] "sequence_length_distribution" "sequence_duplication_levels" "overrepresented_sequences" "adapter_content"
## [13] "kmer_content" "total_deduplicated_percentage"
The summary lists the modules that were tested and the status of each test result (PASS, WARN or FAIL).
Some experiments are expected to produce libraries that are biased in particular ways. Treat the summary evaluations as pointers to where you should concentrate your attention, and work out why your library may not look normal.
qc_plot(qc, "summary")
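To go from the plot to a short list of modules worth a closer look, the summary table can be filtered directly. The following is a minimal sketch in base R. It assumes that `qc$summary` is a data frame with `status`, `module` and `sample` columns; `summary_demo` and `flagged_modules()` are hypothetical names, and the values are made up so that the example runs without any FastQC output.

```r
# Stand-in for qc$summary. The column names (status, module, sample) are an
# assumption about the object returned by qc_read(); the values are made up.
summary_demo <- data.frame(
  status = c("PASS", "WARN", "FAIL", "PASS"),
  module = c("Basic Statistics", "Per base sequence content",
             "Sequence Duplication Levels", "Adapter Content"),
  sample = "S1",
  stringsAsFactors = FALSE
)

# Keep only the modules that did not pass, i.e. the ones to inspect first.
flagged_modules <- function(summary_tbl) {
  summary_tbl[summary_tbl$status != "PASS", c("module", "status")]
}

flagged <- flagged_modules(summary_demo)
print(flagged)
```

With real data, the call would be `flagged_modules(qc$summary)`, provided the column names match the assumption above.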
Basic statistics shows basic data metrics such as the total number of sequences, the sequence length and the %GC:
qc_plot(qc, "Basic statistics")
qc_plot(qc, "Per base sequence quality")
Problems:
Common reasons for problems:
Degradation of quality (sequencing chemistry) over the duration of long runs. Remedy: quality trimming.
Short loss of quality early in the run, after which the run recovers and produces good-quality sequence. This can be explained by a transient problem with the run (bubbles in the flowcell, for example). In these cases trimming is not advisable, as it will remove the later good sequence, but you might want to consider masking bases during subsequent mapping or assembly.
Library with reads of varying length. A warning or error is generated because of very low coverage for a given base range. Before committing to any action, check how many sequences were responsible for triggering the error by looking at the sequence length distribution module results.
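The check described in the last point can be sketched as follows: given the length distribution, compute what fraction of reads is long enough to cover a given position. This is an illustration in base R. It assumes that `qc$sequence_length_distribution` has numeric `Length` and `Count` columns (FastQC may instead report binned lengths such as ranges, which would need parsing first); `length_dist` and `fraction_reaching()` are hypothetical names, and the counts are made up.

```r
# Stand-in for qc$sequence_length_distribution. The column names (Length, Count)
# are an assumption, and the counts are made up for illustration.
length_dist <- data.frame(
  Length = c(50, 75, 100, 125, 150),
  Count  = c(200, 300, 1500, 3000, 95000)
)

# Fraction of reads that are at least `min_length` bases long, i.e. the
# fraction of the library that contributes data at position `min_length`.
fraction_reaching <- function(dist, min_length) {
  sum(dist$Count[dist$Length >= min_length]) / sum(dist$Count)
}

frac_150 <- fraction_reaching(length_dist, 150)
print(round(frac_150, 3))
```

A small fraction at a given position would suggest that the quality statistics there rest on few reads, so a warning at that position deserves less weight.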
qc_plot(qc, "Per sequence quality scores")
Problems:
Common reasons for problems:
General loss of quality within a run. Remedy: For long runs this may be alleviated through quality trimming.
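To make "quality trimming" concrete, here is a deliberately simple sketch of one variant: cut each read after its last base that reaches a minimum Phred score. It assumes Phred+33 encoded quality strings; `trim_3prime()` and the threshold of 20 are illustrative choices, not part of fastqcr, and dedicated trimming tools use more elaborate algorithms.

```r
# Trim low-quality bases from the 3' end of a single read.
# `qual` is assumed to be a Phred+33 encoded quality string.
trim_3prime <- function(bases, qual, min_q = 20) {
  q <- utf8ToInt(qual) - 33L                 # decode ASCII codes to Phred scores
  keep <- which(q >= min_q)                  # positions with acceptable quality
  end <- if (length(keep) > 0) max(keep) else 0L
  # Everything after the last acceptable base is dropped.
  list(bases = substr(bases, 1, end), qual = substr(qual, 1, end))
}

# "I" encodes Phred 40 and "#" encodes Phred 2 under Phred+33.
trimmed <- trim_3prime("ACGTACGTAC", "IIIIIIII##")
print(trimmed$bases)
```

Note that this variant only removes a low-quality tail. A read whose quality dips and then recovers is kept up to the last good base, which matches the advice above not to trim away later good sequence.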
qc_plot(qc, "Per base sequence content")
It’s worth noting that some types of library will always produce biased sequence composition, normally at the start of the read. In RNA-Seq data, for example, bias at the beginning of the reads is common. It arises during library preparation, when “random” primers are annealed to the start of sequences. These primers are not truly random, which leads to variation in base composition at the beginning of the reads. The primer-derived bases can be removed with an adapter-trimming tool.
Problems:
Common reasons for problems:
Overrepresented sequences: adapter dimers or rRNA
Biased selection of random primers for RNA-seq. Nearly all RNA-Seq libraries will fail this module because of this bias, but this is not a problem that can be fixed by processing, and it doesn’t seem to adversely affect the ability to measure expression.
Biased composition libraries: Some libraries are inherently biased in their sequence composition. For example, a library treated with sodium bisulphite will have most of its cytosines converted to thymines, so the base composition will be almost devoid of cytosines and will trigger an error, despite this being entirely normal for that type of library.
Library which has been aggressively adapter trimmed.
qc_plot(qc, "Per sequence GC content")
You can generate the theoretical GC content curve files using the R package fastqcTheoreticalGC, written by Mike Love.
qc_plot(qc, "Per base N content")
Problems:
Common reasons for problems:
qc_plot(qc, "Sequence length distribution")
qc_plot(qc, "Sequence duplication levels")
Problems:
Common reasons for problems:
Technical duplicates arising from PCR artefacts
Biological duplicates which are natural collisions where different copies of exactly the same sequence are randomly selected.
In RNA-seq data, duplication levels can be as high as 40%. Nevertheless, when analysing transcriptome sequencing data we should not remove these duplicates, because we cannot tell whether they are PCR duplicates or reflect high expression of the corresponding genes.
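The idea behind a duplication level can be illustrated with a few lines of base R: count how many reads remain once exact duplicates are collapsed. This is a toy sketch on made-up reads, not FastQC's exact procedure; `reads`, `dedup_pct` and `dup_pct` are hypothetical names.

```r
# Made-up reads for illustration; real input would come from a FASTQ file.
reads <- c("ACGT", "ACGT", "ACGT", "TTGA", "TTGA",
           "GGCC", "CATG", "AATT", "CCGG", "TGCA")

# Percentage of reads that would remain if exact duplicates were collapsed.
dedup_pct <- 100 * length(unique(reads)) / length(reads)

# Percentage of reads that are repeat copies of an already seen sequence.
dup_pct <- 100 - dedup_pct

print(c(deduplicated = dedup_pct, duplicated = dup_pct))
```

The calculation cannot distinguish PCR duplicates from reads of a highly expressed transcript, since both appear as identical sequences. This is why duplicates are left in place for RNA-seq.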
qc_plot(qc, "Overrepresented sequences")
Problems:
Common reasons for problems:
Small RNA libraries, where sequences are not subjected to random fragmentation and the same sequence may naturally be present in a significant proportion of the library.
qc_plot(qc, "Adapter content")
Problems:
A warning or failure means that the sequences will need to be adapter trimmed before proceeding with any downstream analysis.
qc_plot(qc, "Kmer content")