Googling the cancer genome
Identification and prioritization of cancer-causing structural variations in whole genomes
Genome-wide detection of structural variants using deep learning
The workflow includes the following key steps: 1) Transform read alignments into channels First, split read positions are extracted from the BAM files as candidate regions for SV breakpoints. For each pair of split read positions (rightmost position of the first split part and leftmost position of the second split part) a 2D Numpy array called window is constructed. The shape of a window is [window_size, number_of_channels], where the genomic interval encompassing the window is centered on the split read position with a context of [-100 bp, +100 bp) for a window_size of 200 bp. From all the reads overlapping this genomic interval and from the relative segment subsequence of the reference sequence 79 (number_of_channels) channels are constructed, where each channel encode a signal that can be used for SV calling. The list of channels can be found here. The two windows are joined as linked-windows with a zero padding 2D array of shape [10, number_of_channels] in between to avoid artifacts related to the CNN kernel in the part at the interface between the two windows. The linked-windows are labelled as SV when the split read positions overlap the SV callset used as the ground truth and noSV otherwise, where SV is either DEL,INS,INV,DUP or CTX according to the SV type.
2) Model training The labelled linked-windows are used to train a 1D CNN to learn to classify them as either SV or noSV. Two cross-validation strategies are possible: 10-fold cross-validation and cross-validation by chromosome, where one chromosome is used as the test set and the other chromosomes as the training set.
3) SV calling with a trained model Once a trained model is generated and the BAM file for the test set is converted into linked-windows, the SV calling is performed using the predict.py script.