A bioinformatics pipeline for rare genetic diseases in South African patients

FUNDING: South African Medical Research Council The research fields of bioinformatics and computational biology are growing rapidly in South Africa. Bioinformatics pipelines play an integral part in handling sequencing data, which are used to investigate the aetiology of common and rare diseases. Bioinformatics platforms for common disease aetiology are well supported and continuously being developed in South Africa. However, the same is not the case for rare diseases aetiology research. Investigations into the latter rely on international cloud-based tools for data analyses and ultimately confirmation of a genetic disease. However, these tools are not necessarily optimised for ethnically diverse population groups. We present an in-house developed bioinformatics pipeline to enable researchers to annotate and filter variants in either exome or amplicon next-generation sequencing data. This pipeline was developed using next-generation sequencing data of a predominantly African cohort of patients diagnosed with rare disease.


Introduction
The research fields of bioinformatics and computational biology are both growing rapidly in South Africa, with an ever-increasing number of both small and large laboratories having access to next-generation sequencing (NGS) technologies.This increase in sequencing capability continues to stimulate the development of a variety of platforms, databases and initiatives, such as H3Africa (https://www.h3africa.org),the South African Human Genome Programme (http://sahgp.sanbi.ac.za), and the South African Bioinformatics Institute (https://www.sanbi.ac.za), to support sequencing data analysis.However, to date, the majority of these developments have been focused on common disease related research, because diseases like HIV, fibromyalgia, tuberculosis and malaria are major health challenges in southern Africa. 1 Southern African researchers doing rare-disease related research (diseases affecting 6-8% of the global population) 2 still rely on flexible, international cloud-based tools, such as Bystro 3 , BrowseVCF 4 , and RD-Connect 5 , as their analysis needs have not yet been fully met locally.For European populations, these tools allow researchers to leverage well-established genotype-phenotype correlations to guide investigations into rare disease aetiology.However, these correlations do not always hold for African populations, thus significantly reducing the power of and degree to which these online tools can be relied on in the South African research context.A niche therefore exists for a tool that is both sensitive to population heterogeneity and general enough in nature to enable effective domestic rare disease research.
African population groups are heterogeneous with large genetic variety and limited information on genotypephenotype correlations. 6Even though data are processed using the same reference genome, some aspects, such as allele frequency for disease-causing variants and genotype-phenotype correlation, could differ between ethnically diverse population groups 7 and must be taken into account.
In this paper, we present a bioinformatics pipeline developed in-house to address these limitations.This pipeline processes NGS data of ethnically diverse population groups without any strong prior assumptions regarding genotype-phenotype correlations.This offline pipeline, suitable for the analysis of both exome and amplicon sequencing data, is written in Bash (or the Bourne-Again SHell) and uses only open-source software.To allow other researchers to benefit from this work, the pipeline has been made available on GitHub under the GNU GPLv3 software licence. 8The Ensembl Variant Effect Predictor (VEP) 9 offline script is used for variant annotation, and the Genome Mining (GEMINI) 10 command line database management tool for variant filtering.The pipeline is easily adjustable with regard to what annotations are made, and how they are filtered, which is especially useful when working with NGS data from ethnically diverse patients.[13][14]

Bioinformatics pipeline
Workflows leveraging NGS technology mainly consist of two parts: primary and secondary analysis.Figure 1 shows what this workflow looks like when our newly developed pipeline is incorporated.During primary analysis, patient samples are prepared and sequenced on a specialised platform such as Ion Torrent (used in our research case) or Illumina.These platforms further perform the requisite signalcalling, basecalling, reference sequence alignment and variant calling.The output from this primary analysis, for a single sample, is a Variant Call Format (VCF) file listing all the variants for that sample (variants_patientX.vcf.gz in Figure 1). 15Secondary analysis, often with the identification of disease-causing variants in mind, entails variant annotation and filtering.Variant annotation is typically done using purpose-built tools, such as ANNOVAR 16 and VEP runner (used in our research case) that annotate variants with relevant metadata such as variant type, variant allele frequency and predicted impact.Variant filtering, in which variants of interest are identified, can be done using tools like BrowseVCF, RD-Connect and GEMINI (used in our research case).These tools filter variants based on the metadata associated with each, which allows a researcher to find the variants most relevant to their investigation.

Figure 1:
Next-generation sequencing bioinformatics pipeline followed for identifying disease-causing variants from Ion Torrent sequencing data.The first component of this illustration is primary data analysis, which is a semi-automated process done using the Ion Torrent Software (Torrent Suite).The second component illustrates software used for secondary data analysis.
The pipeline presented in this paper is focused exclusively on secondary analysis because the problematic potential population biases, which stem from variant annotations made based on European-populationcentric research results, influence only the secondary analysis results.
The pipeline utilises the offline VEP script's flexibility to annotate variants sufficiently comprehensively so that population-sensitive variant filtering can be achieved using GEMINI's high-specificity querying capabilities.Our pipeline is intended to form part of a larger NGS data analysis workflow, and is only responsible for variant annotation and filtering.

Secondary data analysis
Bash is the default command shell on most modern Unix-like systems (and Unix-like tools for Windows), and incorporates useful features from the Korn shell and the C shell. 17,18The ability for Bash users to write powerful scripts to automate analysis has made it a popular choice for research implementations, typically in the form of command line tools and analysis pipelines.Many of these command line tools are available as free software, giving researchers who use the Bash shell access to a plethora of bioinformatics tools with which to do their research.
The main advantage of using Bash is that it allows powerful automation of established tools in a natural way while minimising both the number of introduced software dependencies and the need for more advanced analysis infrastructure.These are important advantages considering the rare-disease focus of this pipeline.For researchers who focus exclusively on rare diseases, whose expertise and research interests often lie outside bioinformatics, a pipeline based on Bash allows flexible analyses that remain easily repeatable over long periods of time, while necessitating minimal skills development and infrastructure investment.
Here, we present the bioinformatics pipeline developed using these tools during our research.Scripts were run on Slackware v14.12 with GNU Bash, version 4.4.12.
As primary analysis, using the Ion Torrent system, delivers sequencing results in batches, it seemed prudent to write the secondary analysis scripts to operate in a batch fashion.This approach, coupled with the built-in automation that comes with scripting, ensures consistency across the samples in each batch and across different batches.
In the secondary analysis pipeline, the scripts implement two steps: variant annotation and data mining.Variant annotation is done using the VEP script, which can be downloaded from Ensembl's website. 9ata mining is done using the command line tool GEMINI, which uses the SQLite relational database management system to enable effective sequencing data mining. 10Each of these steps has a Bash script dedicated to it: vep_single.sh and gemini_single.sh,respectively (the '.sh' extension indicates that these text files are Bash scripts).These scripts are embedded into the vep_batch.shand gemini_batch.shscripts, respectively, which run them for each .vcf.gz file in a given directory.The hetero_annotate.shscript binds these two batch scripts into a full pipeline, calling each in sequence and managing their inputs and outputs.See Appendices 1-6 in the supplementary material for the full contents of these scripts.A brief description of what each script does is given below.
The vep_single.sh script (Supplementary appendix 1) takes an Ion Torrent-generated input .vcf.gz file as its first argument, and runs the VEP for it.It takes an output file as its second argument (a .vcffile), to which the annotated output of the VEP is written.Alongside these two arguments, a number of additional arguments are passed to the VEP script, with the most notable being the --fields [list] argument, which allows for the specification of the required annotation fields included in the annotated output .vcffile.This argument (as well as the specification of the VEP arguments) is handled in an extensible way by the vep_single.shscript, and the list of desired fields can be built up over multiple lines, or in multiple groups.A list of all possible fields and a list of all possible additional arguments can be found in the VEP's documentation. 9,19The VEP also generates a 'statistic run report' (.html file) when run, containing general statistics that give information on, among other things, the number of variants processed, number of overlapping genes, and number of novel/reported variants.
The gemini_single.sh (Supplementary appendix 2) script takes a VEPannotated input .vcffile as its first argument and loads it into a SQLite database (a .dbfile).This file is then mined for relevant variants using user-defined SQLite database query specifications.Each line in the queries_spec.txt(Supplementary appendix 3) auxiliary text file contains one such query specification, and consists of two comma-separated fields: the query's name and the relevant SQLite query snippet.Queries can easily be added to or removed from this file, or a different such file can be specified in the gemini_single.shscript.An example of such a query, which filters based on variants' allele frequency in African populations, is: rareAFR,aaf_1kg_afr <= 0.01.
What information these queries should return for each variant is controlled by the cols variable, which is used to build the full SQLite query.Some examples of columns that can be included in a query are is_coding (which is true if the variant is in a coding region), rs_ids (which lists the rsIDs associated with each variant), and aaf_1kg_afr (which stores the allele frequency of the variant in the AFR population as reported in the 1000 Genomes Project).For a list of all possible columns, see GEMINI's documentation. 20The cols variable is handled similarly to the --fields [list] argument to the VEP, and is similarly extensible.The output of each query is a list of variants, with information from the database columns specified in the cols variable, stored in a .txtfile that has the query's name as filename.These files constitute the main output of the pipeline.
Once all the queries have been performed, a meta_info.txtfile is generated that summarises the lengths of the lists contained in the query output files.This file, along with the generated query outputs, is saved in the folder specified as the second argument to the gemini_single.shscript.
The vep_batch.shand gemini_batch.shscripts (Supplementary appendices 4 and 5, respectively) each run their counterpart script, as discussed above, for all the .vcf.gz files in a given directory.Both of these scripts take the directory containing the relevant input files as their first argument, with the second argument being the directory to which output should be written.
Finally, the hetero_annotate.shscript (Supplementary appendix 6) administers the operation of the two batch scripts described above.
The first argument of this script is the directory containing the Ion Torrent output .vcf.gz files (the first argument for the vep_batch.shand vep_single.shscripts), and the second argument is the directory to which the GEMINI output should be written (the second argument to the gemini_batch.shand gemini_single.shscripts).
From here the user can further prioritise and filter the variants according to the criteria of their choice.For our African data set, variants were first filtered based on the novelty of these variants.Second, variants were filtered based on African allele frequencies, as reported in Exome aggregation consortium (ExAC 21 ).In addition, for mostly non-African population groups, known disease-causing variants can be identified from the data set using databases such as the Online Mendelian Inheritance in Man (OMIM 22 ) and ClinVar 23 .

Conclusion
Robust bioinformatics pipelines are key components for diagnosis and research of rare genetic diseases.Here we describe an offline, flexible and open-source bioinformatics pipeline that annotates variants using VEP and filtering of important disease-causing variants using GEMINI from NGS data.It was developed as a first prudent step towards data processing and offers unique advantages for the detection of multiple genetic alterations, including in patients of African descent for whom little information is available.The pipeline can be used on exome as well as amplicon NGS data and was designed using NGS data of a predominantly African cohort with rare disease.Identifying underlying genetic causes for rare disease is particularly challenging in the South African population, with limited bioinformatics support for researchers and non-bioinformaticians.With the increased burden to diagnose rare genetic diseases using NGS and genetic screening, and the limited support and resources in developing countries such as South Africa, it is equally important to develop and provide access to bioinformatics pipelines, such as this one.With continued development, pipelines in South Africa could be further refined and made more user-friendly, making them useful for both researchers and clinicians.These refinements could include, for instance, cloud-based interfaces like those used in developed countries.With such refinements in place, clinicians would more easily be able to investigate genotype-phenotype correlation in rare diseases for African population groups.