Earlier in the week, I discovered what was causing the segfaults: when finding paths between any two given nodes (in the find_all_paths_between function), I would store all the possible paths to a given node, for each node. This is a well known problem NP-hard problem, so finding a solution that is both fast and space-efficient probably isn't possible.
I was then suggested to use a threshold, so that after a certain amount of edges have been traversed the procedure stops, and returns all the paths found up to that point. There is also a new option called --max-edges (or -m for short), that allows the user to choose which threshold to use, according to how powerful his/her computer is and how much memory it has. The default value is 100, and it allows for generating the COVID-19 VCF in a matter of minutes. Obviously the higher this value is, the more accurate the VCF.
After finally obtaining a VCF, we had to make sure it was actually correct (i.e. to validate it), but how can we do that? Erik suggested me to use vg construct, a tool that takes in a VCF and a reference genome in FASTA format and returns a GFA. The general idea is that if both GFAs match (or are at least very similar), then the VCF is correct.
So, in order to use vg construct I needed a VCF file (which I obviously had, as it's the output of my program) and a genome in FASTA format. Unfortunately there were no references to genomes in the GFA file, so I had to find a way to discover what reference genome had been used to generate it. Fortunately it's possible to extract the genome from the GFA via odgi build and odgi paths.
So, in order to use vg construct I needed a VCF file (which I obviously had, as it's the output of my program) and a genome in FASTA format. Unfortunately there were no references to genomes in the GFA file, so I had to find a way to discover what reference genome had been used to generate it. Fortunately it's possible to extract the genome from the GFA via odgi build and odgi paths.
I was finally able to run vg construct and then... the VCF file and the reference genome did not match. That means there's some problems in how the algorithm works. As a first attempt, I changed how the VCF header is created in this commit, which should improve the quality of the result. I still have to investigate more on what could be causing this issue.
No comments:
Post a Comment