Saturday, July 25, 2020

Week #8: More improvements to sequential rs-gfatovcf

As you may remember from last week's post, there were some segfaults which made it impossible to produce the VCF from the COVID-19 GFA; also, it took an excessive amount of time and memory on complex graphs. This week I actually managed to get it to work, but the results weren't what I expected.

Earlier in the week, I discovered what was causing the segfaults: when finding paths between any two given nodes (in the find_all_paths_between function),  I would store all the possible paths to a given node, for each node. This is a well known problem NP-hard problem, so finding a solution that is both fast and space-efficient probably isn't possible. 

I was then suggested to use a threshold, so that after a certain amount of edges have been traversed the procedure stops, and returns all the paths found up to that point. There is also a new option called --max-edges (or -m for short), that allows the user to choose which threshold to use, according to how powerful his/her computer is and how much memory it has. The default value is 100, and it allows for generating the COVID-19 VCF in a matter of minutes. Obviously the higher this value is, the more accurate the VCF.

After finally obtaining a VCF, we had to make sure it was actually correct (i.e. to validate it), but how can we do that? Erik suggested me to use vg construct, a tool that takes in a VCF and a reference genome in FASTA format and returns a GFA. The general idea is that if both GFAs match (or are at least very similar), then the VCF is correct.

So, in order to use vg construct I needed a VCF file (which I obviously had, as it's the output of my program) and a genome in FASTA format. Unfortunately there were no references to genomes in the GFA file, so I had to find a way to discover what reference genome had been used to generate it. Fortunately it's possible to extract the genome from the GFA via odgi build and odgi paths.

I was finally able to run vg construct and then... the VCF file and the reference genome did not match. That means there's some problems in how the algorithm works. As a first attempt, I changed how the VCF header is created in this commit, which should improve the quality of the result. I still have to investigate more on what could be causing this issue.




No comments:

Post a Comment