Saturday, July 25, 2020

Week #8: More improvements to sequential rs-gfatovcf

As you may remember from last week's post, there were some segfaults which made it impossible to produce the VCF from the COVID-19 GFA; also, it took an excessive amount of time and memory on complex graphs. This week I actually managed to get it to work, but the results weren't what I expected.

Earlier in the week, I discovered what was causing the segfaults: when finding paths between any two given nodes (in the find_all_paths_between function), I would store all the possible paths to a given node, for each node. Enumerating all paths between two nodes is a well-known NP-hard problem, so finding a solution that is both fast and space-efficient probably isn't possible.

It was then suggested that I use a threshold, so that after a certain number of edges have been traversed the procedure stops and returns all the paths found up to that point. There is also a new option called --max-edges (or -m for short), which lets users choose the threshold according to how powerful their computer is and how much memory it has. The default value is 100, which allows the COVID-19 VCF to be generated in a matter of minutes. Obviously, the higher this value, the more accurate the VCF.
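To illustrate the idea (a simplified sketch, not the actual find_all_paths_between, which works on handlegraph types), a bounded path search over a plain adjacency map might look like this:

```rust
use std::collections::HashMap;

// Hypothetical sketch of the --max-edges threshold: enumerate paths between
// two nodes with an explicit stack, but stop once `max_edges` edges have been
// traversed, returning whatever paths were found up to that point. Node ids
// and the adjacency map are simplified stand-ins for the real graph types.
fn paths_between(
    adj: &HashMap<u64, Vec<u64>>,
    start: u64,
    end: u64,
    max_edges: usize,
) -> Vec<Vec<u64>> {
    let mut paths = Vec::new();
    let mut stack = vec![vec![start]];
    let mut edges_traversed = 0;

    while let Some(path) = stack.pop() {
        let last = *path.last().unwrap();
        if last == end {
            paths.push(path);
            continue;
        }
        for &next in adj.get(&last).into_iter().flatten() {
            // Give up once the edge budget is exhausted, returning the
            // (possibly incomplete) set of paths found so far.
            if edges_traversed >= max_edges {
                return paths;
            }
            edges_traversed += 1;
            // Skip nodes already on this path to avoid looping forever.
            if !path.contains(&next) {
                let mut extended = path.clone();
                extended.push(next);
                stack.push(extended);
            }
        }
    }
    paths
}
```

The trade-off is exactly the one described above: a small budget keeps memory bounded but may miss paths, while a large budget is more complete but slower.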

After finally obtaining a VCF, we had to make sure it was actually correct (i.e. validate it), but how can we do that? Erik suggested I use vg construct, a tool that takes a VCF and a reference genome in FASTA format and returns a GFA. The general idea is that if both GFAs match (or are at least very similar), then the VCF is correct.

So, in order to use vg construct I needed a VCF file (which I obviously had, as it's the output of my program) and a genome in FASTA format. Unfortunately there were no references to genomes in the GFA file, so I had to find a way to discover which reference genome had been used to generate it. Fortunately, it's possible to extract the genome from the GFA via odgi build and odgi paths.
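In commands, the extraction and validation steps look roughly like this (file names are placeholders, and exact flags may differ between odgi/vg versions):

```shell
# Load the GFA into odgi's internal format, then dump the embedded
# paths as FASTA to recover the reference sequence.
odgi build -g graph.gfa -o graph.og
odgi paths -i graph.og -f > reference.fa

# Rebuild a graph from the reference plus the VCF produced by rs-gfatovcf.
vg construct -r reference.fa -v variants.vcf > reconstructed.vg
```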

I was finally able to run vg construct and then... the graph built from the VCF and the reference did not match the original GFA. That means there are some problems in how the algorithm works. As a first attempt, I changed how the VCF header is created in this commit, which should improve the quality of the result. I still have to investigate further what could be causing this issue.




Sunday, July 19, 2020

Week #7: Going back to sequential rs-gfatovcf

The last week was dedicated to fixing a few scalability issues that my program was experiencing on larger and more complex GFA files. The main objective of this GSoC is to create a tool that is both general and well-performing, so I decided to go back to sequential rs-gfatovcf to fix the "general" part before continuing with parallelization. Due to its current international relevance, we are especially interested in finding variants (that is, a VCF file) in COVID-19 pangenomes, as part of the COVID-19 PubSeq Project. Maybe by looking at variants we will be able to understand how many different strains of the virus are currently circulating, or maybe we will be able to create ad-hoc treatments for each patient, according to how similar the virus samples extracted from them are to other known genomes.

But what actually were these scalability issues I was talking about? Basically, the only GFA I had used to write tests was the one already included in Flavia's GFAtoVCF repository. While it's not a very simple graph, it obviously does not match the complexity of actual "real-world" graphs. For instance, this graph is about 1 KB, while the one from the PubSeq project is about 40 MB: that's about 40,000 times bigger! Also, "real-world" graphs can contain loops or nested bubbles, which makes bubble detection even harder.

A small section of the graph used for testing.
Noticed something yet? My script now supports JSON output for both the starting graph and its BFS tree.

When I tried to use my script on the COVID-19 GFA, something unexpected happened: Rust reported a stack overflow, caused by the BFS function. I was actually surprised at first, because I was quite sure the idea behind the function was correct. What I did not consider was that, when dealing with larger inputs, memory management plays an important role in how a program works (and also that ideas that work well in theory sometimes don't in practice!). So I decided to rewrite BFS in the standard way, that is, using a queue. A similar problem affected find_all_paths_between, a function that finds all the paths between two nodes, and it was solved in a similar way. Extensive testing was possible thanks to Erik's HLA-Zoo, a collection of GFA files useful for testing programs that work with variation graphs.
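A minimal sketch of the queue-based rewrite (using a plain adjacency map rather than the real handlegraph types, which are more involved):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Iterative BFS with an explicit queue instead of recursion, so deep
// graphs can no longer overflow the call stack. Returns the BFS tree
// as a child -> parent map.
fn bfs_tree(adj: &HashMap<u64, Vec<u64>>, start: u64) -> HashMap<u64, u64> {
    let mut parent = HashMap::new();
    let mut visited = HashSet::new();
    let mut queue = VecDeque::new();

    visited.insert(start);
    queue.push_back(start);

    while let Some(node) = queue.pop_front() {
        for &next in adj.get(&node).into_iter().flatten() {
            // `insert` returns true only the first time we see a node.
            if visited.insert(next) {
                parent.insert(next, node);
                queue.push_back(next);
            }
        }
    }
    parent
}
```

The memory the recursive version implicitly kept on the call stack now lives in the heap-allocated queue, which can grow as large as available memory allows.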

Another issue was that sometimes the script would segfault. Fixing the stack overflows appears to have solved most segfaults as well. Unfortunately, I'm not yet able to extract the VCF from the COVID-19 GFA, mostly because the current method is very expensive in terms of memory usage, and after more than 1.5 hours of running I got a segfault. I hope that parallelization can help significantly; I'm also curious about what would happen when running this program on a machine with more memory.


Sunday, July 12, 2020

Weeks #5 and #6: Improvements to sequential rs-gfatovcf + beginning of concurrent version

These last two weeks have marked the beginning of the transition from a "standard" sequential version of rs-gfatovcf to a concurrent one. By adding concurrency, we hope to considerably improve the performance of this program, especially since variation graphs can be quite large. However, adding concurrency increases program complexity, and with Rust's strict semantics many problems are going to arise... but let's go in order.

Two weeks ago, as part of the GSoC timeline, I received a code review from chfi (you may remember him as the author of rs-handlegraph). In his review, Chris pointed out a few ways to make my code more "Rust-like", along with some general tips on code structure (such as using rustfmt and clippy). I made a few commits to address these issues (see this commit as an example), and I think the final result is quite good. I also updated my code to use the latest version of chfi's rs-handlegraph, which only required changing a few function names.

After that was done, I started looking into concurrency. Let me preface this by saying that I'm not a concurrency expert, but I do have some experience with concurrency in C++ and Java. To be honest, concurrency in Rust is quite similar to how it works in C++, but Rust's lifetime rules make it much more difficult, at least in the beginning. As I said in the last post, Rust's online book provides an introduction to concurrency, but in my opinion it's too short and only covers the basics, which really isn't enough for a project like mine.

Another resource recommended to me was Programming Rust: Fast, Safe Systems Development by Jim Blandy and Jason Orendorff, which is far more thorough and helped me understand threads better. In particular, it helped me learn how to better use Arc (an atomic reference counter, useful for sharing the same value across multiple threads) and Mutex (which lets multiple threads modify the same variable while keeping it consistent).
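As a toy illustration of that pattern (not code from rs-gfatovcf), here several threads increment one shared counter through an Arc<Mutex<...>>:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Several threads share one counter: Arc gives each thread its own
// handle to the same allocation, and Mutex serializes the mutation.
fn parallel_count(n_threads: usize, increments: usize) -> usize {
    let counter = Arc::new(Mutex::new(0usize));
    let mut handles = Vec::new();

    for _ in 0..n_threads {
        let counter = Arc::clone(&counter);
        handles.push(thread::spawn(move || {
            for _ in 0..increments {
                // Lock, mutate, and release (the guard drops at end of statement).
                *counter.lock().unwrap() += 1;
            }
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}
```

Note how every increment takes the lock: this is safe, but it also hints at the performance problem described below, because threads spend most of their time waiting on each other.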

So, I started rewriting a few functions using threads, Arcs and Mutexes and... I really wasn't satisfied with the result: the code looked quite confusing, and I was using so many locks that they basically killed any performance gain from concurrency (see this commit as an example). Chris then suggested using Rayon, which promises concurrency by simply replacing standard iterators with parallel ones. I did manage to get a concurrent paths_to_steps working with Rayon, and I'm currently exploring ways to parallelize the other functions.

Using Rayon is currently proving difficult because, as I understand it, Rayon is meant for code where iterations have little to no side effects. Unfortunately, my code makes heavy use of iterators with side effects, so I will have to rethink how some functions work. Hopefully I'll have these issues sorted out soon, and will be able to parallelize BFS and all the remaining functions.