Remember this?
databio.org/seqcosts: Lower costs → More data
Genomics can be 'big data'.
Parallel computing: splitting a compute task and then completing each split simultaneously.
library(parallel)  # provides mclapply() and detectCores()
lapply(data, func)                               # serial
mclapply(data, func, mc.cores=detectCores())     # parallel across available cores
import subprocess
>>> subprocess.run(["ls", "-l"]) # doesn't capture output
CompletedProcess(args=['ls', '-l'], returncode=0)
>>> subprocess.run(["ls", "-l", "/dev/null"], capture_output=True)
CompletedProcess(args=['ls', '-l', '/dev/null'], returncode=0,
stdout=b'crw-rw-rw- 1 root root 1, 3 Jan 23 16:23 /dev/null\n', stderr=b'')
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    with Pool(5) as p:
        print(p.map(f, [1, 2, 3]))
A workflow or pipeline is a repeatable sequence of tasks that process a piece of data.
A workflow framework is a development toolkit that makes it easier to build workflows.
SAMPLES = ["A", "B"]
rule all:
input:
"plots/quals.svg"
rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"
rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
shell:
"samtools index {input}"
rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
output:
"calls/all.vcf"
shell:
"samtools mpileup -g -f {input.fa} {input.bam} | "
"bcftools call -mv - > {output}"
rule plot_quals:
input:
"calls/all.vcf"
output:
"plots/quals.svg"
script:
"scripts/plot-quals.py"
// Script parameters
params.query = "/some/data/sample.fa"
params.db = "/some/path/pdb"

db = file(params.db)
query_ch = Channel.fromPath(params.query)

process blastSearch {
    input:
    file query from query_ch

    output:
    file "top_hits.txt" into top_hits_ch

    """
    blastp -db $db -query $query -outfmt 6 > blast_result
    cat blast_result | head -n 10 | cut -f 2 > top_hits.txt
    """
}

process extractTopHits {
    input:
    file top_hits from top_hits_ch

    output:
    file "sequences.txt" into sequences_ch

    """
    blastdbcmd -db $db -entry_batch $top_hits > sequences.txt
    """
}
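To run it (assuming the script is saved as main.nf; the paths are placeholders, and any params.* default can be overridden on the command line):
Run: nextflow run main.nf --query /path/to/sample.fa --db /path/to/pdb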
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: Workflow

inputs:
  tarball: File
  name_of_file_to_extract: string

outputs:
  compiled_class:
    type: File
    outputSource: compile/classfile

steps:
  untar:
    run: tar-param.cwl
    in:
      tarfile: tarball
      extractfile: name_of_file_to_extract
    out: [extracted_file]

  compile:
    run: arguments.cwl
    in:
      src: untar/extracted_file
    out: [classfile]
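To run it (assuming the workflow is saved as workflow.cwl; the tarball and file names are placeholders, and inputs can also be supplied via a job YAML file instead of command-line options):
Run: cwl-runner workflow.cwl --tarball src.tar --name_of_file_to_extract Hello.java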
The shell makes common and simple actions really simple, at the expense of making more complex things much more complex.
Typically, a small shell script will be shorter and simpler than the corresponding python program, but the python program will tend to gracefully accept modifications, whereas the shell script will tend to get less and less maintainable as code is added.
This has the consequence that for optimal day-to-day productivity you need shell-scripting, but you should use it mostly for throwaway scripts, and use python everywhere else. -Anonymous
Source: https://www.bigocheatsheet.com/
How big a dataset can you handle?
n = 2:20
data.frame(C=1, logN=log(n), N=n, NLogN=n*log(n), NSq=n^2, TwotoN=2^n, NFactorial=factorial(n))
   C      logN  N     NLogN NSq  TwotoN          NFactorial
 2 1 0.6931472  2  1.386294   4       4                   2
 3 1 1.0986123  3  3.295837   9       8                   6
 4 1 1.3862944  4  5.545177  16      16                  24
 5 1 1.6094379  5  8.047190  25      32                 120
 6 1 1.7917595  6 10.750557  36      64                 720
 7 1 1.9459101  7 13.621371  49     128                5040
 8 1 2.0794415  8 16.635532  64     256               40320
 9 1 2.1972246  9 19.775021  81     512              362880
10 1 2.3025851 10 23.025851 100    1024             3628800
11 1 2.3978953 11 26.376848 121    2048            39916800
12 1 2.4849066 12 29.818880 144    4096           479001600
13 1 2.5649494 13 33.344342 169    8192          6227020800
14 1 2.6390573 14 36.946803 196   16384         87178291200
15 1 2.7080502 15 40.620753 225   32768       1307674368000
16 1 2.7725887 16 44.361420 256   65536      20922789888000
17 1 2.8332133 17 48.164627 289  131072     355687428096000
18 1 2.8903718 18 52.026692 324  262144    6402373705728000
19 1 2.9444390 19 55.944341 361  524288  121645100408832000
20 1 2.9957323 20 59.914645 400 1048576 2432902008176640000
Factorial time arises in problems involving permutations
Traveling salesman: find the minimum distance path connecting all points.
Source: Travelling salesman problem
Shortest common superstring: given a set of strings S, find the shortest string containing every string in S as a substring
Brute force: enumerate all orders
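A minimal brute-force sketch in Python (the helper names and greedy pairwise merge are illustrative, and it assumes no input string is contained in another); trying all n! orderings is exactly what makes this factorial time:
from itertools import permutations

def merge_overlap(a, b):
    # Append b to a, reusing the longest suffix of a that is a prefix of b
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return a + b

def shortest_common_superstring(strings):
    best = None
    for order in permutations(strings):        # n! orderings -- factorial time
        merged = order[0]
        for s in order[1:]:
            merged = merge_overlap(merged, s)
        if best is None or len(merged) < len(best):
            best = merged
    return best

shortest_common_superstring(["AAT", "ATG", "TGC"])   # 'AATGC'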
Exponential time arises in problems with nested subproblems
Longest common subsequence (LCS): find the longest subsequence common to two sequences.
def longest_cmn_subseq(s1, s2, ind1, ind2):
    if (ind1 == -1 or ind2 == -1):  # base case
        return 0
    if (s1[ind1] == s2[ind2]):      # match
        return 1 + longest_cmn_subseq(s1, s2, ind1-1, ind2-1)
    return max(longest_cmn_subseq(s1, s2, ind1-1, ind2),
               longest_cmn_subseq(s1, s2, ind1, ind2-1))
longest_cmn_subseq("TCGA", "TGCTA", 3, 4)
3
Exponential time comes from the double recursive call
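One standard fix (not shown in the original) is to memoize the recursion so each (ind1, ind2) pair is computed only once, turning the exponential recursion into an O(len(s1) × len(s2)) dynamic program; a sketch:
from functools import lru_cache

def lcs_length(s1, s2):
    @lru_cache(maxsize=None)                  # cache results per (ind1, ind2)
    def rec(ind1, ind2):
        if ind1 == -1 or ind2 == -1:          # base case
            return 0
        if s1[ind1] == s2[ind2]:              # match
            return 1 + rec(ind1 - 1, ind2 - 1)
        return max(rec(ind1 - 1, ind2), rec(ind1, ind2 - 1))
    return rec(len(s1) - 1, len(s2) - 1)

lcs_length("TCGA", "TGCTA")   # 3, now in quadratic rather than exponential time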
Nested loops
# Naively align reads to a reference genome
for r in reads:                      # on the order of 10^8 reads?
    for p in reference_positions:    # on the order of 10^9 positions
        score_alignment(r, p)        # roughly 10^17 scorings in total

# Scan for motif matches
for m in motifs:                     # on the order of 10^3 motifs
    for s in sequences:              # on the order of 10^7 sequences
        motif_scan(m, s)             # roughly 10^10 scans in total
Find the overlaps
Array lookup
regions[15]
Indexes
regions[ index(chr1, 1526) ]
Tabix indexing
Input:
chr1 10468 annotation1
chr1 10469 annotation2
chr1 10470 annotation3
Compress: bgzip file.tsv
Index: tabix -s 1 -b 2 -e 2 file.tsv.gz
Retrieve: tabix file.tsv.gz chr5:50000-100000   (the .tbi index must sit alongside file.tsv.gz)
library("microbenchmark")
d = matrix(rnorm(10000000), 10, 1000000)
myMeans = function(d) {
  means = c()
  for (i in 1:ncol(d)) {
    means[i] = mean(d[, i])
  }
  means
}
microbenchmark(
  myMeans(d),
  apply(d, 2, mean),
  colMeans(d),
  times=3)
Unit: milliseconds
              expr  min   lq mean median   uq  max neval
        myMeans(d) 4693 4880 4997   5067 5149 5230     3
 apply(d, 2, mean) 4856 5047 5152   5237 5300 5363     3
       colMeans(d)    7    7    7      7    7    8     3
Method 1 pseudocode:
def quantify_accessibility(peaks_bed, reads_bam):
    with open(peaks_bed) as f:
        for peak in f:                    # stream one peak at a time
            count_reads(peak, reads_bam)
Method 2 pseudocode:
def quantify_accessibility(peaks_bed, reads_bam):
    peaks = load_file(peaks_bed)          # read the whole peaks file into memory
    for peak in peaks:
        count_reads(peak, reads_bam)
Which is faster? Which uses more memory?
Does smaller = faster ?
> c(rep(1,20), rep(0,15))
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> rle(c(rep(1,20), rep(0,15)))
Run Length Encoding
lengths: int [1:2] 20 15
values : num [1:2] 1 0
Duplicate string elimination
Input: "The compression and the decompression leave an impression. Hahahahaha!"
Output: "The compression and t [20 | 3] de [22 | 12] leave [28 | 3] i [42 | 7] . Hah [2 | 7] !"
Example credit: https://sudonull.com
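As a quick illustration in Python (zlib's DEFLATE combines exactly these LZ77-style back-references with the Huffman coding covered next):
import zlib

text = b"The compression and the decompression leave an impression. Hahahahaha!"
packed = zlib.compress(text, 9)
assert zlib.decompress(packed) == text        # lossless round trip
print(len(text), "->", len(packed), "bytes")  # repeated substrings become short back-references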
Minimum redundancy codes: assign the fewest bits to the most common characters.
DNA Input: ACTGAACGATCAGTACAGAAG
Base  Freq  ASCII     2bit  HuffCode
A     9     01000001  00    0
G     5     01000111  01    10
C     4     01000011  10    110
T     3     01010100  11    111
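A small sketch of how such a code can be built with a heap, repeatedly merging the two least frequent subtrees (the function is illustrative; ties may swap which of C and T gets which 3-bit code, but the code lengths match the table):
import heapq
from itertools import count

def huffman_codes(freqs):
    tie = count()   # unique tiebreaker so the heap never compares the dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)       # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

huffman_codes({"A": 9, "G": 5, "C": 4, "T": 3})
# {'A': '0', 'G': '10', 'T': '110', 'C': '111'}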
Requires prefix property for variable-width codes
Requires sequential reading (no random access)
See Okaily et al
@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1
99 = 01100011
1 = 00000001
2 = 00000010
32 = 00100000
64 = 01000000
>>> 99 & 0x1
1
>>> 99 & 0x2
2
>>> 99 & 0x4
0
>>> 99 & 0x8
0
>>> 99 & 0x10 # hexadecimal 16
0
>>> 99 & 0x20 # hexadecimal 32
32
>>> 99 & 0x40 # hexadecimal 64
64
>>> 99 & 0x80 # hexadecimal 128
0
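Putting the bit tests together, a small helper (bit meanings taken from the SAM specification's FLAG definitions; the function itself is illustrative) shows why r001's FLAG of 99 means a properly paired, first-in-pair read whose mate is on the reverse strand:
FLAG_BITS = {
    0x1:  "template has multiple segments (paired)",
    0x2:  "each segment properly aligned",
    0x4:  "segment unmapped",
    0x8:  "next segment (mate) unmapped",
    0x10: "reverse strand",
    0x20: "next segment (mate) on reverse strand",
    0x40: "first segment in the template",
    0x80: "last segment in the template",
}

def decode_flag(flag):
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

decode_flag(99)
# ['template has multiple segments (paired)', 'each segment properly aligned',
#  'next segment (mate) on reverse strand', 'first segment in the template']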
@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001 99 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0;
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
r003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5S6M,30,1;
r001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1
BGZF is block compression implemented on top of the standard gzip file format. The goal of BGZF is to provide good compression while allowing efficient random access to the BAM file for indexed queries. The BGZF format is ‘gunzip compatible’, in the sense that a compliant gunzip utility can decompress a BGZF compressed file.
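In practice (file names here are placeholders; samtools uses the .bai index to seek directly to the requested region instead of scanning the whole file):
Index: samtools index aln.sorted.bam   (creates aln.sorted.bam.bai)
Retrieve: samtools view aln.sorted.bam chr5:50000-100000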
The .bam file contains the aligned read data.
The .bai file contains the index (offsets to locations of bins).
"Reference-based" compression (Fritz 2011)
CRAM 3.1 introduced by Bonfield 2022
Although reference compression is where the original work focussed, it is wrong to assume that this is the primary reason for CRAM’s reduced file size. BAM serializes all data together (first name, chromosome, position, sequence, quality and auxiliary fields, then second name, chromosome and so on). This leads to poor compression ratios as names, sequences and quality values all have very different characteristics. CRAM has a column-oriented approach, where a block of names are compressed together or a block of qualities together.
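A typical BAM-to-CRAM conversion with samtools (file names are placeholders; CRAM needs the same reference FASTA the reads were aligned to):
Convert: samtools view -C -T genome.fa aln.sorted.bam -o aln.sorted.cram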