
Processing VCF files

On this page, we explore the full potential of Pheno-Ranker by processing a VCF file, a challenging yet intriguing task.

A VCF is essentially a specialized form of a TSV

The Variant Call Format (VCF) is a bioinformatics standard created for the 1000 Genomes Project to store genetic variation. It has evolved to version 4.3 and has been extended with formats such as gVCF for more comprehensive data representation.

The body of a VCF file is essentially a tab-separated values (TSV) table. It comprises eight mandatory columns plus any number of optional columns holding per-sample information.
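To make the column layout concrete, here is a minimal Python sketch that splits one VCF data line into its eight fixed fields and the per-sample columns. The example line and the second sample name are made up for illustration, and the function `parse_vcf_line` is hypothetical (it is not part of Pheno-Ranker).

```python
# The eight mandatory VCF columns, in order.
FIXED_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line, sample_names):
    """Split one tab-separated VCF data line into fixed fields and samples."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_FIELDS, columns[:8]))
    # Column 9 (FORMAT) is only present when genotype columns follow.
    format_keys = columns[8].split(":") if len(columns) > 8 else []
    samples = {}
    for name, raw in zip(sample_names, columns[9:]):
        samples[name] = dict(zip(format_keys, raw.split(":")))
    return record, samples


# Illustrative line (not taken from test_1000G.vcf.gz).
line = "1\t15274\t.\tA\tG,T\t100\tPASS\tAC=2\tGT\t0|1\t2|2"
record, samples = parse_vcf_line(line, ["HG00096", "HG00097"])
print(record["ALT"], samples["HG00096"]["GT"])
```

Everything after the eighth column is optional, which is why the sketch treats the FORMAT column and the sample columns separately from the fixed fields.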

Steps

OK, so our goal is to compare samples within the VCF based on their genomic variations. To achieve this, we'll undertake the following steps:

  1. Transpose the VCF data into a TSV format, arranging it so that each row contains all variations for a specific sample.
  2. Transform the TSV into a format that is compatible with Pheno-Ranker, utilizing the provided utility.
  3. Execute Pheno-Ranker in cohort-mode to generate plots using R.
  4. Run Pheno-Ranker in patient-mode to identify the most similar sample.
  5. Generate QR codes for the first 10 samples (and decode them back).
What is the source of your VCF test data?

The dataset test_1000G.vcf.gz is a subset extracted from the 1000 Genomes Project, containing approximately 1K variations. For more detailed information, please visit this page.

About VCF size and content

The idea here is to compare samples (or individuals, if you will) by their genomic variations, as if we were comparing genomic fingerprints. A good example would be comparing samples in a multi-sample VCF from a gene panel (or even an exome) after filtering variations. This method could, of course, be complemented by adding phenotypic information on top of the genomic variations and then applying weights.

We advise against using this method for comparing samples with millions of genomic variations.

Let's go!

Step 1: Transpose the VCF to TSV

We will use the included Python script vcf2pheno-ranker.py.

Can the VCF be multi-allelic?

Yes, the VCF can be multi-allelic. This is how variant information is stored:

"1_15274_A_G,T" : "0|0",
"1_15274_A_G,T" : "0|1",
"1_15274_A_G,T" : "0|2",
"1_15274_A_G,T" : "1|0",
"1_15274_A_G,T" : "1|1",
"1_15274_A_G,T" : "1|2",
"1_15274_A_G,T" : "2|0",
"1_15274_A_G,T" : "2|1",
"1_15274_A_G,T" : "2|2",

In this example, the genotypes are phased, but unphased genotypes (e.g., 0/1) work as well.

utils/csv2pheno_ranker/vcf/vcf2pheno-ranker.py -i test_1000G.vcf.gz -o output.tsv
Regarding paths for executables

Make sure to specify the correct paths for your executables. Here, we show the paths as they exist in the GitHub repository.
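The transposition performed in Step 1 can be sketched in a few lines of Python. This is not the code of vcf2pheno-ranker.py. It is a simplified illustration that assumes an uncompressed VCF held in a string, builds variant keys as CHROM_POS_REF_ALT (matching the key style shown above), and uses a hypothetical function name `transpose_vcf`. The toy VCF and the second sample name are invented.

```python
def transpose_vcf(vcf_text):
    """Return (header, rows) with one row per sample and one column per variant."""
    sample_names, variant_keys, table = [], [], {}
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            sample_names = cols[9:]  # sample columns start after FORMAT
            table = {name: {} for name in sample_names}
            continue
        chrom, pos, _id, ref, alt = cols[:5]
        key = f"{chrom}_{pos}_{ref}_{alt}"  # e.g. 1_15274_A_G,T
        variant_keys.append(key)
        gt_index = cols[8].split(":").index("GT")
        for name, raw in zip(sample_names, cols[9:]):
            table[name][key] = raw.split(":")[gt_index]
    header = ["Sample ID"] + variant_keys
    rows = [[name] + [table[name][k] for k in variant_keys] for name in sample_names]
    return header, rows


# Toy two-sample VCF, invented for illustration.
vcf_lines = [
    "##fileformat=VCFv4.3",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
               "FORMAT", "HG00096", "HG00097"]),
    "\t".join(["1", "15274", ".", "A", "G,T", ".", "PASS", ".", "GT", "0|1", "2|2"]),
    "\t".join(["1", "99490", ".", "C", "T", ".", "PASS", ".", "GT", "0|0", "0|1"]),
]
header, rows = transpose_vcf("\n".join(vcf_lines))
print("\t".join(header))
for row in rows:
    print("\t".join(row))
```

Each output row now describes a single sample, which is the shape the next step expects.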

Step 2: Transform the TSV to a compatible format

utils/csv2pheno_ranker/csv2pheno-ranker -i output.tsv -primary-key-name 'Sample ID'

The resulting output.json has the following format:

[
{
"1_99490_C_T" : "0|0",
"1_99671_A_T" : "0|0",
...
"1_99687_C_T" : "0|0",
"1_99719_C_T" : "0|0",
"Sample ID" : "HG00096"
},
...
]
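As a rough illustration of the shape of this conversion (not the implementation of csv2pheno-ranker), the following Python sketch turns a tiny TSV with a 'Sample ID' column into the same list-of-objects structure. The TSV content is a toy example.

```python
import csv
import io
import json

# Toy TSV: one row per sample, one column per variant.
tsv_text = (
    "Sample ID\t1_99490_C_T\t1_99671_A_T\n"
    "HG00096\t0|0\t0|0\n"
)

reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
records = [dict(row) for row in reader]  # one flat object per sample

print(json.dumps(records, indent=1, sort_keys=True))
```

Every sample becomes one flat object whose keys are the variant identifiers plus the primary key.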

Step 3: Execute Pheno-Ranker in cohort-mode

bin/pheno-ranker -r output.json -config output_config.yaml

This creates the file matrix.txt, a large matrix of 2504 x 2504 pairwise comparisons.
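To give an intuition for what such a pairwise matrix contains, here is a small Python sketch. It assumes that each sample is reduced to a set of "variant:genotype" features, one-hot encoded against the cohort-wide feature list, and compared by Hamming distance. This is a simplified model for illustration; the sample names S1 to S3 and their genotypes are invented, and the code is not taken from Pheno-Ranker.

```python
from itertools import combinations

# Three invented samples with two variants each.
samples = {
    "S1": {"1_99490_C_T": "0|0", "1_99671_A_T": "0|1"},
    "S2": {"1_99490_C_T": "0|0", "1_99671_A_T": "0|0"},
    "S3": {"1_99490_C_T": "1|1", "1_99671_A_T": "0|0"},
}

# Cohort-wide list of "variant:genotype" features, in a fixed order.
vocabulary = sorted({f"{k}:{v}" for s in samples.values() for k, v in s.items()})


def one_hot(sample):
    """Binary vector: 1 if the sample carries the feature, else 0."""
    present = {f"{k}:{v}" for k, v in sample.items()}
    return [1 if feature in present else 0 for feature in vocabulary]


vectors = {name: one_hot(sample) for name, sample in samples.items()}
names = list(vectors)

# Symmetric matrix with zeros on the diagonal.
matrix = {a: {b: 0 for b in names} for a in names}
for a, b in combinations(names, 2):
    distance = sum(x != y for x, y in zip(vectors[a], vectors[b]))
    matrix[a][b] = matrix[b][a] = distance

for a in names:
    print(a, [matrix[a][b] for b in names])
```

The number of entries grows with the square of the number of samples, which is why a 2504-sample cohort already yields a large file.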

Large outputs

The default output is a dense matrix.txt, which is useful for the included R scripts. If you only need a sparse matrix for downstream tools, use Matrix Market output:

bin/pheno-ranker -r output.json -config output_config.yaml --matrix-format mtx -o output.mtx

For graph exports, filter edges explicitly with --graph-max-weight for Hamming distance or --graph-min-weight for Jaccard.
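The two thresholds point in opposite directions because Hamming is a distance (smaller means more similar) while Jaccard is a similarity (larger means more similar). The Python sketch below illustrates that logic on a toy edge list. The function `filter_edges` is hypothetical, and treating the thresholds as inclusive is an assumption, not documented behaviour of the flags.

```python
def filter_edges(edges, metric, threshold):
    """Keep only edges between similar samples.

    `edges` is a list of (sample_a, sample_b, weight) tuples.
    Inclusive thresholds are an assumption made for this sketch.
    """
    if metric == "hamming":
        # Distance: keep edges at or below the maximum weight.
        return [e for e in edges if e[2] <= threshold]
    if metric == "jaccard":
        # Similarity: keep edges at or above the minimum weight.
        return [e for e in edges if e[2] >= threshold]
    raise ValueError(f"unknown metric: {metric}")


# Invented weights for illustration.
hamming_edges = [("S1", "S2", 2), ("S1", "S3", 4), ("S2", "S3", 2)]
jaccard_edges = [("S1", "S2", 0.33), ("S1", "S3", 0.0), ("S2", "S3", 0.33)]

close_by_hamming = filter_edges(hamming_edges, "hamming", 2)
close_by_jaccard = filter_edges(jaccard_edges, "jaccard", 0.3)
print(close_by_hamming)
print(close_by_jaccard)
```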

Now you can create a heatmap + clustering with the included script:

Rscript share/r/heatmap.R

(Running time < 2 min on an Apple M2 Pro)

[Figure: Heatmap of the intra-cohort pairwise comparison]

Step 4: Execute Pheno-Ranker in patient-mode

Initially, we will extract a single sample from the cohort, specifically the first one listed in the VCF: HG00096.

bin/pheno-ranker -r output.json -config output_config.yaml --patients-of-interest HG00096

This creates HG00096.json.

Now we run Pheno-Ranker in patient-mode:

bin/pheno-ranker -r output.json -t HG00096.json -config output_config.yaml
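See results

| RANK | REFERENCE(ID) | TARGET(ID) | FORMAT | LENGTH | WEIGHTED | HAMMING-DISTANCE | DISTANCE-Z-SCORE | DISTANCE-P-VALUE | DISTANCE-Z-SCORE(RAND) | JACCARD-INDEX | JACCARD-Z-SCORE | JACCARD-P-VALUE | REFERENCE-VARS | TARGET-VARS | INTERSECT | INTERSECT-RATE(%) | COMPLETENESS(%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | HG00096 | HG00096 | CSV | 1043 | False | 0 | -3.440 | 0.0002913 | -32.2955 | 1.000 | 3.549 | 0.0053956 | 1043 | 1043 | 1043 | 100.00 | 100.00 |
| 2 | HG01537 | HG00096 | CSV | 1050 | False | 14 | -2.712 | 0.0033449 | -31.5396 | 0.987 | 2.778 | 0.0377136 | 1043 | 1043 | 1036 | 99.33 | 99.33 |
| 3 | HG03598 | HG00096 | CSV | 1054 | False | 22 | -2.296 | 0.0108361 | -31.1101 | 0.979 | 2.342 | 0.0898655 | 1043 | 1043 | 1032 | 98.95 | 98.95 |
| 4 | HG04141 | HG00096 | CSV | 1055 | False | 24 | -2.192 | 0.0141860 | -31.0030 | 0.977 | 2.233 | 0.1087819 | 1043 | 1043 | 1031 | 98.85 | 98.85 |
| 5 | HG04033 | HG00096 | CSV | 1055 | False | 24 | -2.192 | 0.0141860 | -31.0030 | 0.977 | 2.233 | 0.1087819 | 1043 | 1043 | 1031 | 98.85 | 98.85 |
| 6 | HG00237 | HG00096 | CSV | 1056 | False | 26 | -2.088 | 0.0183923 | -30.8960 | 0.975 | 2.125 | 0.1303610 | 1043 | 1043 | 1030 | 98.75 | 98.75 |
| 7 | HG04096 | HG00096 | CSV | 1057 | False | 28 | -1.984 | 0.0236175 | -30.7891 | 0.974 | 2.017 | 0.1546849 | 1043 | 1043 | 1029 | 98.66 | 98.66 |
| 8 | NA12827 | HG00096 | CSV | 1057 | False | 28 | -1.984 | 0.0236175 | -30.7891 | 0.974 | 2.017 | 0.1546849 | 1043 | 1043 | 1029 | 98.66 | 98.66 |
| 9 | NA20534 | HG00096 | CSV | 1057 | False | 28 | -1.984 | 0.0236175 | -30.7891 | 0.974 | 2.017 | 0.1546849 | 1043 | 1043 | 1029 | 98.66 | 98.66 |
| 10 | HG00234 | HG00096 | CSV | 1057 | False | 28 | -1.984 | 0.0236175 | -30.7891 | 0.974 | 2.017 | 0.1546849 | 1043 | 1043 | 1029 | 98.66 | 98.66 |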

Apart from HG00096 itself (rank 1, distance 0), sample HG01537 is the closest. It has a Hamming distance of 14 to HG00096 and a distance p-value of 0.0033449.
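The count columns in the results are numerically consistent with simple set arithmetic. The Python sketch below reproduces the LENGTH, HAMMING-DISTANCE, JACCARD-INDEX and INTERSECT-RATE(%) values reported for HG01537. These relationships are inferred from the numbers in the results, not taken from Pheno-Ranker's source, so treat them as a reading aid. Both variable counts are 1043 here, so the choice of denominator for the rate does not matter in this example.

```python
# Values reported for HG01537 vs HG00096.
reference_vars = 1043
target_vars = 1043
intersect = 1036

union = reference_vars + target_vars - intersect  # matches LENGTH (1050)
hamming = union - intersect                       # features present in only one sample
jaccard = intersect / union                       # shared features over all features
intersect_rate = 100 * intersect / target_vars    # percentage of shared features

print(union, hamming, round(jaccard, 3), round(intersect_rate, 2))
```

The same arithmetic applied to the other rows (for example, an intersect of 1032 giving a length of 1054 and a distance of 22) matches the reported values as well.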

Step 5: Generate QR codes for the first 10 samples

We are going to compress all variant information (1042 variations) into QR codes.

  • First we are going to export the needed files, excluding the primary key 'Sample ID':
bin/pheno-ranker -r output.json -config output_config.yaml -exclude-terms 'Sample ID' --export
  • Now we use the included utility pheno-ranker2barcode:
utils/barcode/pheno-ranker2barcode -i export.ref_binary_hash.json

This creates a QR code (PNG) for each sample inside the directory qr_codes.

[Figure: QR codes for the first 10 samples]

To decode the QR codes back to Pheno-Ranker format:

utils/barcode/barcode2pheno-ranker -i $(ls -1 qr_codes/*png | head -10) -t export.glob_hash.json

This will create the file decoded.json.
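The encode/decode round trip can be pictured as packing a sample's binary feature string into a compact payload and then unpacking it without loss. The Python sketch below does this with zlib and Base64 from the standard library. The payload format is an assumption chosen for illustration, the functions `encode_bits` and `decode_bits` are hypothetical, and no QR image is produced. It does not describe how pheno-ranker2barcode or barcode2pheno-ranker work internally.

```python
import base64
import zlib


def encode_bits(bits):
    """Pack a string of '0'/'1' characters into a compressed text payload."""
    n = len(bits)
    raw = int(bits, 2).to_bytes((n + 7) // 8, "big")
    payload = base64.b64encode(zlib.compress(raw)).decode("ascii")
    return n, payload  # the length is kept so leading zeros can be restored


def decode_bits(n, payload):
    """Invert encode_bits and return the original bit string."""
    raw = zlib.decompress(base64.b64decode(payload))
    return format(int.from_bytes(raw, "big"), f"0{n}b")


# Toy vector with 1042 binary features (values are invented).
bits = "0110" * 260 + "10"
n, payload = encode_bits(bits)
restored = decode_bits(n, payload)
print(n, len(payload), restored == bits)
```

Keeping the original length alongside the payload matters because converting to an integer drops leading zeros, which would otherwise shift every feature position on decoding.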