---
## What is a standard?
> A norm or specification applied to a repeated process. It includes a formal definition of terms and a description of formats and expected behavior of a system
Standards → Interoperability
---
What happens when there is no standard
(or, equivalently, too many standards)
---
### Existing standards in genomics
- Project metadata
- Ontologies
- File formats
- Reference genomes
- Web standards and APIs
Standards → Interoperability
---
### GA4GH
> The Global Alliance for Genomics and Health (GA4GH) is a policy-framing and technical standards-setting organization, seeking to enable responsible genomic data sharing within a human rights framework.
- [ga4gh.org](https://www.ga4gh.org/)
- formed in 2013
---
Research is organized in projects
How do we conceptualize a research project?
---
Each project has 3 components
---
Organizing multiple projects is a challenge
---
How do I re-use a component?
---
A project is a set of edges in a tripartite graph
---
Enable linking with interfaces
---
### Existing standards in genomics
- Project metadata (Data → Code)
- Ontologies (Data → Code)
- File formats (Data → Data)
- Reference genomes (Data → Data)
- Web standards and APIs (Data → Compute)
Standards → Interoperability
---
PEP: Portable Encapsulated Projects
---
PEP format

`project_config.yaml`:

    sample_table: /path/to/samples.csv
    output_dir: /path/to/output/folder

`samples.csv`:

    sample_name,protocol,organism,data_source
    frog_0h,RNA-seq,frog,/path/to/frog0.gz
    frog_1h,RNA-seq,frog,/path/to/frog1.gz
    frog_2h,RNA-seq,frog,/path/to/frog2.gz
    frog_3h,RNA-seq,frog,/path/to/frog3.gz
---
PEP portability features
- Derived attributes
- Implied attributes
- Project amendments
---
Derived attributes
Build new sample attributes from existing ones
Without derived attribute:
| sample_name | t | protocol | organism | data_source |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_0h | 0 | RNA-seq | frog | /path/to/frog0.gz |
| frog_1h | 1 | RNA-seq | frog | /path/to/frog1.gz |
| frog_2h | 2 | RNA-seq | frog | /path/to/frog2.gz |
| frog_3h | 3 | RNA-seq | frog | /path/to/frog3.gz |
Using derived attribute:
| sample_name | t | protocol | organism | data_source |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_0h | 0 | RNA-seq | frog | my_samples |
| frog_1h | 1 | RNA-seq | frog | my_samples |
| frog_2h | 2 | RNA-seq | frog | my_samples |
| frog_3h | 3 | RNA-seq | frog | my_samples |
| crab_0h | 0 | RNA-seq | crab | your_samples |
| crab_3h | 3 | RNA-seq | crab | your_samples |
---
| sample_name | t | protocol | organism | data_source |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_0h | 0 | RNA-seq | frog | my_samples |
| frog_1h | 1 | RNA-seq | frog | my_samples |
| frog_2h | 2 | RNA-seq | frog | my_samples |
| frog_3h | 3 | RNA-seq | frog | my_samples |
| crab_0h | 0 | RNA-seq | crab | your_samples |
| crab_3h | 3 | RNA-seq | crab | your_samples |
Project config file:
```yaml
sample_modifiers:
  derive:
    attributes: [data_source]
    sources:
      my_samples: "/path/to/my/samples/{organism}_{t}h.gz"
      your_samples: "/your/samples/{organism}_{t}h.gz"
```
`{variable}` placeholders refer to sample annotation columns
Benefit: Enables distributed files, portability
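
The derivation step can be pictured as plain string templating. The sketch below is not the actual PEP implementation; it assumes each sample is a simple dict, and `derive_attribute` is a hypothetical helper name introduced only for illustration.

```python
# Hypothetical sketch of how a derived attribute could be resolved.
# SOURCES mirrors the `sources` section of the project config above.
SOURCES = {
    "my_samples": "/path/to/my/samples/{organism}_{t}h.gz",
    "your_samples": "/your/samples/{organism}_{t}h.gz",
}


def derive_attribute(sample, attribute, sources):
    """Return a copy of `sample` with `attribute` expanded from its template."""
    derived = dict(sample)
    key = sample[attribute]
    if key in sources:
        # {placeholders} are filled in from the sample's own columns
        derived[attribute] = sources[key].format(**sample)
    return derived


frog = {"sample_name": "frog_0h", "t": 0, "protocol": "RNA-seq",
        "organism": "frog", "data_source": "my_samples"}
crab = {"sample_name": "crab_3h", "t": 3, "protocol": "RNA-seq",
        "organism": "crab", "data_source": "your_samples"}

print(derive_attribute(frog, "data_source", SOURCES)["data_source"])
print(derive_attribute(crab, "data_source", SOURCES)["data_source"])
```

Moving the files then means editing one template in the config rather than every row of the table.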
---
Implied attributes
Add new sample attributes conditioned on values of existing attributes
Before:
| sample_name | protocol | organism |
| ------------- | :-------------: | -------- |
| human_1 | RNA-seq | human |
| human_2 | RNA-seq | human |
| human_3 | RNA-seq | human |
| mouse_1 | RNA-seq | mouse |
After:
| sample_name | protocol | organism | genome |
| ------------- | :-------------: | -------- | ------ |
| human_1 | RNA-seq | human | hg38 |
| human_2 | RNA-seq | human | hg38 |
| human_3 | RNA-seq | human | hg38 |
| mouse_1 | RNA-seq | mouse | mm10 |
---
| sample_name | protocol | organism |
| ------------- | :-------------: | -------- |
| human_1 | RNA-seq | human |
| human_2 | RNA-seq | human |
| human_3 | RNA-seq | human |
| mouse_1 | RNA-seq | mouse |
Project config file:
```yaml
sample_modifiers:
  imply:
    - if:
        organism: human
      then:
        genome: hg38
    - if:
        organism: mouse
      then:
        genome: mm10
```
Benefit: Separates project-level metadata from sample-level metadata
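
One way to read the `imply` rules is as a list of if/then pairs applied to each sample. This is a minimal sketch under that reading, assuming only exact scalar matches; `apply_implied` is a hypothetical name, not part of any PEP tool.

```python
# Hypothetical sketch of the `imply` modifier: each rule adds attributes
# to samples whose existing attributes match the `if` clause.
IMPLY_RULES = [
    {"if": {"organism": "human"}, "then": {"genome": "hg38"}},
    {"if": {"organism": "mouse"}, "then": {"genome": "mm10"}},
]


def apply_implied(sample, rules):
    """Return a copy of `sample` with implied attributes added."""
    result = dict(sample)
    for rule in rules:
        # Assumption: a rule fires only if every `if` key matches exactly
        if all(sample.get(k) == v for k, v in rule["if"].items()):
            result.update(rule["then"])
    return result


samples = [
    {"sample_name": "human_1", "protocol": "RNA-seq", "organism": "human"},
    {"sample_name": "mouse_1", "protocol": "RNA-seq", "organism": "mouse"},
]
for s in samples:
    print(s["sample_name"], apply_implied(s, IMPLY_RULES)["genome"])
```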
---
Project amendments
Define activatable project attributes.
```yaml
project_modifiers:
  amend:
    diverse:
      sample_table: psa_rrbs_diverse.csv
    cancer:
      sample_table: psa_rrbs_intracancer.csv
```
Benefit: Defines multiple similar projects in a single file
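
Activating an amendment can be thought of as overlaying its settings on the base project. The sketch below assumes a shallow override and invents a placeholder base table (`base_samples.csv`); `activate_amendment` is a hypothetical helper.

```python
# Hypothetical sketch: an amendment overrides base project settings.
BASE_CONFIG = {
    "sample_table": "base_samples.csv",  # placeholder, not from the slides
    "project_modifiers": {
        "amend": {
            "diverse": {"sample_table": "psa_rrbs_diverse.csv"},
            "cancer": {"sample_table": "psa_rrbs_intracancer.csv"},
        }
    },
}


def activate_amendment(config, name):
    """Return the project settings with the named amendment applied on top."""
    amendments = config.get("project_modifiers", {}).get("amend", {})
    if name not in amendments:
        raise KeyError(f"Unknown amendment: {name}")
    settings = {k: v for k, v in config.items() if k != "project_modifiers"}
    settings.update(amendments[name])  # assumption: shallow override
    return settings


print(activate_amendment(BASE_CONFIG, "cancer")["sample_table"])
```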
---
### Is it enough?
Study 1:
| sample_name | t | protocol | organism | data_source |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_2h | 2 | RNA-seq | frog | my_samples |
| frog_3h | 3 | RNA-seq | frog | my_samples |
Study 2:
| sample_name | t | protocol | organism | data_source |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_0m | 0 | RNAseq | frog | my_samples |
| frog_60m | 60 | RNAseq | frog | my_samples |
---
### Ontologies
> A formal vocabulary and definition of concepts, entities, and their relationships.
1. Terms
2. Relations
---
### Gene Ontology
- 3 ontologies:
  - Molecular Function
  - Cellular Component
  - Biological Process

1. Terms: [Hexose Biosynthetic Process](http://amigo.geneontology.org/amigo/term/GO:0019319)
2. Relations: is-a, is-part-of, regulates ([Relation Ontology](https://doi.org/10.1186/gb-2005-6-5-r46))
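
Terms plus relations form a graph, so questions such as "what is this term a kind of?" become graph traversals. The sketch below uses a small, simplified, hand-copied subset of is-a edges around the example term; treat it as illustrative and consult the ontology itself for the authoritative graph. `ancestors` is a hypothetical helper.

```python
# Illustrative, simplified subset of is-a relations around one GO term.
IS_A = {
    "hexose biosynthetic process": [
        "hexose metabolic process",
        "monosaccharide biosynthetic process",
    ],
    "hexose metabolic process": ["monosaccharide metabolic process"],
    "monosaccharide biosynthetic process": ["monosaccharide metabolic process"],
    "monosaccharide metabolic process": [],
}


def ancestors(term, is_a):
    """Collect every term reachable from `term` by following is-a edges."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


print(sorted(ancestors("hexose biosynthetic process", IS_A)))
```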
---
![](images/genomic-data-standards/hexose-biosynthetic-process.png)
---
### EDAM Ontology
![](images/genomic-data-standards/edam-ontology.png)
---
### Sequence Ontology
Term: `open_chromatin_region`
SO Accession: `SO:0001747`
![](images/genomic-data-standards/SO0001747_sm.png)
Definition:
A DNA sequence that in the normal state of the chromosome corresponds to an unfolded, un-complexed stretch of double-stranded DNA.
---
How can we enforce or validate terms?
Study 1:
| sample_name | t | protocol | organism | data_source |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_2h | 2 | RNA-seq | frog | my_samples |
| frog_3h | 3 | RNA-seq | frog | my_samples |
Study 2:
| sample_name | t | protocol | organism | data_source |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_0m | 0 | RNAseq | frog | my_samples |
| frog_60m | 60 | RNAseq | frog | my_samples |
---
### JSON Schema
```json
{
  "productId": 1,
  "productName": "A green door",
  "price": 12.50,
  "tags": [ "home", "green" ]
}
```
- What is productId?
- Is productName required?
- Can the price be zero (0)?
- Are all of the tags string values?
---
### RNA-seq analysis
```json
{
  "sample_name": "frog_2h",
  "t": "2",
  "protocol": "RNA-seq",
  "organism": "frog",
  "data_source": "my_samples"
}
```
- Is sample_name required?
- What is t?
- What are the allowable protocols?
- What are the allowable organisms?
- What is the input file type expected for data source?
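
These questions are what a schema answers. As a rough sketch, here is a hypothetical schema for the sample above, written with JSON Schema keywords, together with a deliberately tiny checker that understands only `required`, `type`, and `enum`. The allowed values and the `check` helper are assumptions made for illustration; a real project would use a full JSON Schema validator.

```python
# Hypothetical schema for the sample object above. The allowed protocol
# and organism values are invented examples, not a real vocabulary.
SAMPLE_SCHEMA = {
    "type": "object",
    "properties": {
        "sample_name": {"type": "string", "description": "Unique sample name"},
        "t": {"type": "integer", "description": "Time point, in hours"},
        "protocol": {"type": "string", "enum": ["RNA-seq"]},
        "organism": {"type": "string", "enum": ["frog", "crab"]},
        "data_source": {"type": "string"},
    },
    "required": ["sample_name", "protocol"],
}

TYPES = {"string": str, "integer": int}


def check(instance, schema):
    """Check `required`, `type` and `enum` only; not full JSON Schema."""
    errors = []
    for name in schema.get("required", []):
        if name not in instance:
            errors.append(f"{name}: missing required property")
    for name, rules in schema.get("properties", {}).items():
        if name not in instance:
            continue
        value = instance[name]
        if "type" in rules and not isinstance(value, TYPES[rules["type"]]):
            errors.append(f"{name}: expected {rules['type']}")
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"{name}: {value!r} not in {rules['enum']}")
    return errors


study1 = {"sample_name": "frog_2h", "t": 2, "protocol": "RNA-seq", "organism": "frog"}
study2 = {"sample_name": "frog_0m", "t": 0, "protocol": "RNAseq", "organism": "frog"}
print(check(study1, SAMPLE_SCHEMA))  # no errors expected
print(check(study2, SAMPLE_SCHEMA))  # flags the protocol spelling
```

An `enum` on `protocol` is enough to catch the `RNAseq` versus `RNA-seq` mismatch between the two studies.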
---
### Schema
```json
{
  "title": "Product",
  "description": "A product from Acme's catalog",
  "type": "object",
  "properties": {
    "productId": {
      "description": "The unique identifier for a product",
      "type": "integer"
    },
    "productName": {
      "description": "Name of the product",
      "type": "string"
    }
  },
  "required": [ "productId", "productName" ]
}
```
---
### Validating objects with schemas
```shell
jsonschema --instance sample.json sample.schema
```
```python
>>> from jsonschema import validate
>>> validate([2, 3, 4], {"maxItems": 2})
Traceback (most recent call last):
    ...
ValidationError: [2, 3, 4] is too long
```
---
### What is a schema?
A formal definition of the properties and allowable values for data.
JSON Schema provides a standard format for describing generic data objects, along with tools to validate them.
---
### A schema for ATAC-seq pipeline
```yaml
description: A PEP for ATAC-seq samples for the PEPATAC pipeline.
imports:
  - http://schema.databio.org/pep/2.0.0.yaml
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        sample_name:
          type: string
          description: "Name of the sample"
        organism:
          type: string
          description: "Organism"
        protocol:
          type: string
          description: "Must be an ATAC-seq or DNAse-seq sample"
        genome:
          type: string
          description: "Refgenie genome registry identifier"
        read_type:
          type: string
          description: "Is this single or paired-end data?"
          enum: ["SINGLE", "PAIRED"]
        read1:
          anyOf:
            - type: string
              description: "Fastq file for read 1"
            - type: array
              items:
                type: string
        read2:
          anyOf:
            - type: string
              description: "Fastq file for read 2 (for paired-end experiments)"
            - type: array
              items:
                type: string
      required_files:
        - read1
      files:
        - read1
        - read2
      required:
        - sample_name
        - protocol
        - read1
        - genome
required:
  - samples
```
---
### Standards: File formats
- FASTA: DNA sequences
- FASTQ: Short DNA sequencer reads
- SAM/BAM: Aligned DNA sequences
- BED: Genomic intervals
- VCF: Genomic variation
---
### File formats are not enough
Say I have two BED files. Are they comparable?
File 1:
```
chr1 15300 17300
chr1 24900 25600
chr1 72420 74440
```
File 2:
```
1 18110 19300
1 64000 65600
1 72400 74700
```
Sequence names must match.
---
### File formats are not enough
Say I have two BED files. Are they comparable?
File 1:
```
chr1 15300 17300
chr1 24900 25600
chr1 72420 74440
```
File 2:
```
chr1 18110 19300
chr1 64000 65600
chr1 72400 74700
```
Sequence names must match. How about now?
The reference sequences must also be compatible.
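
The name check is easy to automate; the reference check is not, because nothing in a BED file records which assembly it was made against. A minimal sketch of the first half, assuming whitespace-separated BED text and hypothetical helper names:

```python
# Sketch: compare the sequence names used by two BED files.
def sequence_names(bed_text):
    """Return the set of first-column names from BED-formatted text."""
    names = set()
    for line in bed_text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, and common BED header lines
        if not fields or line.startswith("#") or fields[0] in ("track", "browser"):
            continue
        names.add(fields[0])
    return names


def shared_names(bed_a, bed_b):
    """Names present in both files; an empty set means nothing can overlap."""
    return sequence_names(bed_a) & sequence_names(bed_b)


file1 = "chr1\t15300\t17300\nchr1\t24900\t25600\nchr1\t72420\t74440\n"
file2_numeric = "1\t18110\t19300\n1\t64000\t65600\n1\t72400\t74700\n"
file2_chr = "chr1\t18110\t19300\nchr1\t64000\t65600\nchr1\t72400\t74700\n"

print(shared_names(file1, file2_numeric))  # no names in common
print(shared_names(file1, file2_chr))      # names agree
```

Agreeing names only shows the files *could* be comparable; they may still refer to different reference assemblies.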
---
| | Names match | Names mismatch |
| ------------------ | :--------------: | :--------------: |
| Reference match | Great! | Adjust names? |
| Reference mismatch | Don't be fooled! | No compatibility |
---
### Standards: Reference genomes
Reference genomes are versioned.
human reference: hg18, hg19, hg38/GRCh38, T2T
mouse reference: mm8, mm9, mm10
They are not compatible!
---
### Converting among genome builds
Liftover ([Luu et al. 2020](https://doi.org/10.1093/nargab/lqaa054))
---
### Standards: Reference genomes
There are also many variations of the *same* version.
- 3 GRCh38 providers: NCBI, Ensembl, UCSC
- hard-, soft-, or no repeat masking?
- how are the chromosomes named (1 vs. chr1 vs. NC_000001.11)?
- what secondary scaffolds are included?
- how is the assembly named (hg38, GRCh38, or GCF_000001405.39)?
- are any decoy sequences included (like EBV)?
Downstream results are not compatible if based on a different reference!
---
Andy Yates' "Genome provider analysis"
Used for hash tables, but also...
---
### Hashing
A hash function maps an input of any size to a fixed-size output. The output, called a *hash* or *digest*, can be seen as a fingerprint or barcode. It is a unique identifier of an input item.
As long as there are no collisions...
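
Applied to sequences, a digest identifies the sequence content itself, regardless of what the sequence is called. The sketch below uses Python's standard `hashlib`; the truncated, base64url-encoded SHA-512 form is my understanding of the style used for GA4GH sequence identifiers, and the function names are hypothetical. It assumes the input is already normalized (uppercase, no whitespace).

```python
import base64
import hashlib


def md5_digest(sequence):
    """Hex MD5 digest of a sequence string."""
    return hashlib.md5(sequence.encode("ascii")).hexdigest()


def sha512t24u(sequence):
    """SHA-512 truncated to 24 bytes, then base64url-encoded."""
    digest = hashlib.sha512(sequence.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


# Identical content gives an identical digest, whatever the name
chr1_ucsc = "ACGTACGT"
chr1_ensembl = "ACGTACGT"
print(sha512t24u(chr1_ucsc) == sha512t24u(chr1_ensembl))

# A single-base change gives a different digest
print(sha512t24u("ACGTACGT") == sha512t24u("ACGTACGA"))
```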
---
### How [GA4GH:refget](https://doi.org/10.1093/bioinformatics/btab524) works
#### Future approach
You implement the GA4GH Data Repository Service (DRS) API. A standard DRS client can then retrieve your data, making it more accessible.
---
### GA4GH:WES
[Workflow Execution Service](https://ga4gh.github.io/workflow-execution-service-schemas/docs/)
Provides a standard way for users to submit workflow requests to workflow execution systems, and to monitor their execution
- GET `/runs` - to list current runs
- POST `/runs` - to start a new run
- GET `/runs/{run_id}` - information about a run
- GET `/runs/{run_id}/status` - status of a run
- POST `/runs/{run_id}/cancel` - cancels a run
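
Because the endpoints are standardized, one small client can target any conforming server. The sketch below only builds the requests and never sends them; the server name and the `/ga4gh/wes/v1` prefix are assumptions, and a real submission would also carry a form body describing the workflow.

```python
import urllib.request

# Hypothetical server; real deployments publish their own base URL.
BASE_URL = "https://wes.example.org/ga4gh/wes/v1"

# Action name -> (HTTP method, path template), following the list above
ROUTES = {
    "list": ("GET", "/runs"),
    "submit": ("POST", "/runs"),
    "info": ("GET", "/runs/{run_id}"),
    "status": ("GET", "/runs/{run_id}/status"),
    "cancel": ("POST", "/runs/{run_id}/cancel"),
}


def wes_request(action, run_id=None):
    """Build (but do not send) the HTTP request for a WES action."""
    method, template = ROUTES[action]
    if "{run_id}" in template and run_id is None:
        raise ValueError(f"'{action}' needs a run_id")
    url = BASE_URL + template.format(run_id=run_id)
    return urllib.request.Request(url, method=method)


req = wes_request("status", run_id="abc123")
print(req.get_method(), req.full_url)
```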
---
### GA4GH:TES
[Task Execution Service](https://ga4gh.github.io/task-execution-schemas/docs/)
- GET `/tasks` - list tasks tracked by the TES server
- POST `/tasks` - create new task
- GET `/tasks/{id}` - get a single task
- POST `/tasks/{id}:cancel` - cancel a task
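
A task is described by a JSON document sent to `POST /tasks`. The sketch below builds such a request without sending it. The server URL and `/v1` prefix are hypothetical, and the task body shows only a minimal subset of fields (a name and one executor with an image and a command) as I understand the TES task model.

```python
import json
import urllib.request

BASE_URL = "https://tes.example.org/v1"  # hypothetical server and prefix


def create_task_request(name, image, command):
    """Build (but do not send) a POST /tasks request with a JSON task body."""
    task = {"name": name, "executors": [{"image": image, "command": command}]}
    return urllib.request.Request(
        BASE_URL + "/tasks",
        data=json.dumps(task).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def cancel_task_request(task_id):
    """Build (but do not send) a POST /tasks/{id}:cancel request."""
    return urllib.request.Request(f"{BASE_URL}/tasks/{task_id}:cancel", method="POST")


req = create_task_request("hello", "alpine", ["echo", "hello"])
print(req.get_method(), req.full_url)
print(json.loads(req.data)["executors"][0]["command"])
```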
---
### Conclusion
Who is the target audience?
- humans?
- computers?
It's easy to build a human-friendly interface to data organized for computers. The opposite is not true.