Sunday, July 19, 2020

Week #7: Going back to sequential rs-gfatovcf

The last week was dedicated to fixing a few scalability issues that my program was experiencing on larger and more complex GFA files. The main objective of this GSoC is to create a tool that is both general and well-performing, so I decided to go back to sequential rs-gfatovcf to fix the "general" part before continuing on with parallelization. Due to its current international relavance, we are especially interested in finding variants (that is, a VCF file) from COVID-19 pangenomes, as part of the COVID-19 PubSeq Project. Maybe by looking at variants we will be able to understand how many different strains of the virus are currently circulating, or maybe we will be able to create ad-hoc treatments for each patient, according to how similar or different the virus samples extracted from him are to other known genomes.

But what actually were these scalability issues I was talking about previously? Basically the only GFA I used to write tests was the one already included in Flavia's GFAtoVCF repository. While it's not a very simple graph, it obviously does not match the complexity of actual "real-world" graphs. For instance, this graph is about 1 kb, while the one from the PubSeq project is about 40 mb. That's about 40 000 times bigger! Also, "real-world" graphs can contain loops or nested bubbles, which makes bubble detection even harder. 

A small section of the graph that was used for the testing.
Noticed something yet? My script now supports JSON output for both the starting graph and its bfs-tree.

When I tried to use my script on the COVID-19 GFA something unexpected happened: Rust returned a stack overflow, which was caused by the BFS function. I was actually surprised at first, because I was quite sure the idea behind the function was correct. What I did not consider was that, when dealing with larger inputs, memory management plays an important role in how a program works (and also that ideas that work well in theory sometimes don't when used in pratice!). So I decided to rewrite BFS in the standard way, that is by using a Queue. A similar problem was found in find_all_paths_between, a function that finds all the paths between two nodes, and was solved in a similar way. Extensive testing was possible thanks to Erik's HLA-Zoo, a collection of GFA files useful for testing programs that use variation graphs.

Another issue found was that sometimes the script would return a segfault. Fixing the stack overflows appears to have solved most segfaults as well. Unfortunately I'm not yet able to extract the VCF from the COVID-19, mostly because the current method is very expensive in terms of memory usage, and after 1.5 hours+ of running I got a segfault. I hope that parallelization can help that significantly; I'm also curious of what would happen when running this program on a machine with more memory.


No comments:

Post a Comment