GSOC - Parallel Graph Traversal for Variation Graphs

Saturday, July 25, 2020

Week #8: More improvements to sequential rs-gfatovcf

As you may remember from last week's post, there were some segfaults which made it impossible to produce the VCF from the COVID-19 GFA; also, it took an excessive amount of time and memory on complex graphs. This week I actually managed to get it to work, but the results weren't what I expected.

Earlier in the week, I discovered what was causing the segfaults: when finding paths between any two given nodes (in the find_all_paths_between function), I would store all the possible paths to a given node, for each node. This is a well known problem NP-hard problem, so finding a solution that is both fast and space-efficient probably isn't possible.

I was then suggested to use a threshold, so that after a certain amount of edges have been traversed the procedure stops, and returns all the paths found up to that point. There is also a new option called --max-edges (or -m for short), that allows the user to choose which threshold to use, according to how powerful his/her computer is and how much memory it has. The default value is 100, and it allows for generating the COVID-19 VCF in a matter of minutes. Obviously the higher this value is, the more accurate the VCF.

After finally obtaining a VCF, we had to make sure it was actually correct (i.e. to validate it), but how can we do that? Erik suggested me to use vg construct, a tool that takes in a VCF and a reference genome in FASTA format and returns a GFA. The general idea is that if both GFAs match (or are at least very similar), then the VCF is correct.

So, in order to use vg construct I needed a VCF file (which I obviously had, as it's the output of my program) and a genome in FASTA format. Unfortunately there were no references to genomes in the GFA file, so I had to find a way to discover what reference genome had been used to generate it. Fortunately it's possible to extract the genome from the GFA via odgi build and odgi paths.

I was finally able to run vg construct and then... the VCF file and the reference genome did not match. That means there's some problems in how the algorithm works. As a first attempt, I changed how the VCF header is created in this commit, which should improve the quality of the result. I still have to investigate more on what could be causing this issue.

Sunday, July 19, 2020

Week #7: Going back to sequential rs-gfatovcf

The last week was dedicated to fixing a few scalability issues that my program was experiencing on larger and more complex GFA files. The main objective of this GSoC is to create a tool that is both general and well-performing, so I decided to go back to sequential rs-gfatovcf to fix the "general" part before continuing on with parallelization. Due to its current international relavance, we are especially interested in finding variants (that is, a VCF file) from COVID-19 pangenomes, as part of the COVID-19 PubSeq Project. Maybe by looking at variants we will be able to understand how many different strains of the virus are currently circulating, or maybe we will be able to create ad-hoc treatments for each patient, according to how similar or different the virus samples extracted from him are to other known genomes.

But what actually were these scalability issues I was talking about previously? Basically the only GFA I used to write tests was the one already included in Flavia's GFAtoVCF repository. While it's not a very simple graph, it obviously does not match the complexity of actual "real-world" graphs. For instance, this graph is about 1 kb, while the one from the PubSeq project is about 40 mb. That's about 40 000 times bigger! Also, "real-world" graphs can contain loops or nested bubbles, which makes bubble detection even harder.

A small section of the graph that was used for the testing.
Noticed something yet? My script now supports JSON output for both the starting graph and its bfs-tree.

When I tried to use my script on the COVID-19 GFA something unexpected happened: Rust returned a stack overflow, which was caused by the BFS function. I was actually surprised at first, because I was quite sure the idea behind the function was correct. What I did not consider was that, when dealing with larger inputs, memory management plays an important role in how a program works (and also that ideas that work well in theory sometimes don't when used in pratice!). So I decided to rewrite BFS in the standard way, that is by using a Queue. A similar problem was found in find_all_paths_between, a function that finds all the paths between two nodes, and was solved in a similar way. Extensive testing was possible thanks to Erik's HLA-Zoo, a collection of GFA files useful for testing programs that use variation graphs.

Another issue found was that sometimes the script would return a segfault. Fixing the stack overflows appears to have solved most segfaults as well. Unfortunately I'm not yet able to extract the VCF from the COVID-19, mostly because the current method is very expensive in terms of memory usage, and after 1.5 hours+ of running I got a segfault. I hope that parallelization can help that significantly; I'm also curious of what would happen when running this program on a machine with more memory.

Sunday, July 12, 2020

Weeks #5 and #6: Improvents to sequential rs-gfatovcf + beginning of concurrent version

These last two weeks have marked the beginning of the transition from a "standard" sequential version of rs-gfatovcf to a concurrent one. By adding concurrency, we hope to considerably improve the performance of this program, especially since Variation Graphs can be quite large. However, adding concurrency leads to an increased program complexity, and with Rust's strict semantics many problems are going to arise... but let's go in order.

Two weeks ago, as part of Google's GSOC timeline, I received a code review from chfi (you may remember him as the author of rs-handlegeaph). In his review, Chris pointed out a few ways to improve my code to make it more "Rust-like", and a few general tips on code structure (such as using rustfmt and clippy). I made a few commits to address these issues (see this commit as an example), and I think the final result is quite good. I also updated my code so that it uses the latest version of chfi's rs-handlegraph, and this only required changing a few function names.

After that was done, I started looking into concurrency. Let me preface this by saying that I'm not a concurrency expert, but I have some experience with concurrency in C++ and Java. To be honest, concurrency in Rust is quite similar to how it works in C++, but Rust's lifetimes rules make it much more difficult, at least in the beginning. As I said in the last post, Rust's online book provides an introduction to concurrency, but in my opinion it's too short and only provides the basics, which really isn't enough for a project like mine.

Another resource which was recommended to me was Programming Rust, Fast Safe Systems Development by Jim Blandy and Jason Orendorff which is way better, and helped me understand more about threads. Moreover, it helped me understand how to better use Arc (Atomic reference counters, useful for passing the same value to multiple threads) and Mutex (to also allow multiple threads to modify the same variable while keeping consistency).

So, I started re-writing a few functions using Threads, Arcs and Mutexes and... I really wasn't satisfied with the result, the code looked quite confusing and I was using too many locks which basically killed any performance gain from concurrency (see this commit as an example). Chris then suggested using Rayon, which promises concurrency by simply replacing standard iterators with parallel ones. I did actually manage to get concurrent paths_to_steps to work using Rayon, and I'm currently exploring ways to parallelize the other functions.

Using Rayon is currently proving to be difficult because, as I understand it, Rayon is supposed to be used in code where iterations have little to no side effects. Unfortunately my code makes great use of such iterators, so I will have to rethink how some functions work. Hopefully, I'll have these issues sorted out soon, and will be able to parallelize BFS and all the remaining functions.

Thursday, June 25, 2020

Week #4: Finishing up GFAtoVCF and exploring concurrency in Rust

This first half of this week was dedicated to finishing up GFAtoVCF, now renamed rs-gfatovcf to follow chfi's naming conventions. The only things missing were:

Unit Tests (to make sure everything works correctly)
Documentation (both in terms of comments and repo description on GitHub)
An easier way to run the script (the current one was too long and complex)
A CI/CD pipeline (which is nice to have, especially to make sure there's no build problem)

Unit tests weren't too hard to write, but required me to rethink how certain sections of code were organized, since they were too tightly coupled to each other. Thankfully, moving some parts of the main function to other functions solved this issue rather quickly. I ended up writing more than 250 lines of unit tests, which I'd say is quite good for a relatively small project. Writing the documentation was a good way to reflect on how the overall script works. You can find it here.

Finding an alternative way to run the script was a bit harder. The main problem was finding a way to ship both the program and its dependencies in a simple way. Initially, I required the user to download the dependencies by himself. However, Njagi suggested using git submodules, and that helped me streamline the download process way more (and also add a simple CI/CD pipeline).

With all of the above done, I'd say that rs-gfatovcf is pretty much done... what now? Well, it would be nice if we could find Variants faster! One way we could do that is by using concurrency, as in using multiple threads to run multiple parts of the program at the same time (which also means using multiple cores of a single CPU).

But, how exactly could concurrency be used in rs-gfatovcf? I feel that there are a few main areas that could benefit the most from using concurrency:

BFS: create a new thread for each outgoing edge at a given node
Bubble Detection: when a bubble is opened, create a new thread for each outgoing edge
Variant Identification: for each reference, create as many threads as there are bubbles

Rust offers a rather basic support of concurrency. More specifically:

thread::spawn allows for the creation of a new thread; it requires a closure as a parameter, which specifies the function that will be executed by the thread. The move keyword can be used to pass ownership of certain objects from the main thread to the new thread.
mpsc::channel allows for the creation of communication channels between different threads. A channel is represented as a tuple (transmitter, receiver). By using .clone on the transmitter and/or the receiver, multiple threads are able to communicate on the same channel. This approach of using channels to communicate is called message passing.
std::sync::Mutex allows for sharing the same object between multiple threads, one at a time. This may be too limiting for certain applications, for example when trying to implement a counter where each thread tries to increment the counter by 1 at the same time. In such cases, an object of type Arc<T> (Atomic reference counter) can be used.

The Rust Book recommends referring to external crates for further concurrency needs. The most interesting one appears to be the parking_lot crate. Many concurrency aware data structures are also present, such as dashmap and arrayfire. These data structures will be considered if any particular problem arises when using the default ones. This reddit thread offers more suggestions that will be evaluated during development.

Sunday, June 21, 2020

Week #3: Getting GFAtoVCF to work

As promised last week, here's a first look at a VCF file produced by my program:

See the full size image here

I was actually able to get it to work quite early in the week, with a small caveat: it only worked with a specific dfs-tree given as input, more specifically the one produced by the Python version of GFAtoVCF. When a program only works with a specific set of input, it means there's something radically wrong with it -- I only had to figure out why it was wrong, and how to fix it.

The first thing I actually noticed was that different bubbles were being detected depending on the dfs-tree obtained by the visit. This has always confused me since, starting from the same graph, I expected to always find the same bubbles. More specifically, when starting from the dfs-tree produced by Rust, multiple bubbles of length 1 were being produced, leading to errors at the end of the program.

See the full size image here

So, what I tried to do was finding a way to obtain the same bubbles no matter the dfs-tree, i.e. to "compress" contiguous bubbles of length 1 into a single longer bubble. I never really liked this solution since it wasn't really a solution per se, but more of a workaround.

Then, after discussing for a while with Gianluca Della Vedova, another idea came to mind: what if the DFS was the problem, and instead we needed another kind of visit? And what if that visit was a BFS? If you recall from last week, we found out that both dfs-trees were actually correct, but some nodes had different distances from the root, depending on the visit; with a BFS, such a thing cannot ever happen.

So, I implemented the BFS in Rust, and everything started working correctly, since if a given node has a certain distance from the root element, it will always have that distance. Since I now have a greater understanding of how HandleGraph works, it wasn't really that hard to implement.

Now only tests and an in-depth documentation are missing. Expect a long blog post next week, where I'll explaining how the program works in detail, and what's going to happen next!

Sunday, June 14, 2020

Week #2: Debugging GFAtoVCF

As you can see from the title, this week was mostly spent debugging my Rust version of Flavia95's GFAtoVCF. Recall that GFAtoVCF is a script that is capable of detecting simple bubbles in a Variation Graph; this is in contrast to superbubbles (which can be nested) and ultrabubbles (which can contain loops).

The main topic for the week was the DFS (Depth First Search), or rather the spanning tree obtained by such algorithm. Consider the following image:

See the full size image here

In blue you can see the spanning tree produced by my Rust version of GFAtoVCF, while in red you can see (part of) the one produced by Flavia's. As you can see, nodes 9,10 and 11 are placed differently with respect to each other. Which one is correct, then?

Well, it turns out that both are correct. The difference is probably due to different underlying implementations of the Handlegraph model (in Python ODGI is used, while I'm using rs-handlegraph), so that when the DFS has to choose a new edge, a different one gets chosen.

I am currently trying to fix the last few issues in GFAtoVCF. By next week, I hope I'll be able to show you the resulting VCF!

Sunday, June 7, 2020

Week #1: Re-implementing GFAtoVCF in Rust

As part of my first GSoC week, I was asked to re-implement Flavia95's GFAtoVCF in Rust. This has two objectives:

Getting me familiar with Rust (a language I had no previous experience in)
Getting me familiar with Variation Graphs concepts, and with the Handlegraph API (which also has a Rust implementation)

From what I understand, GFA (which stands for Graphical Fragment Assembly) is a format for storing the results of an assembly in the form of a graph; it can also be used to represent a pangenome. VCF (which stands for Variation Calling Format) is instead a format for storing variants of a given reference.

So, in order convert from GFA to VCF, I have to:

Consider one (or more) paths as a reference
For each reference, explore all other possible paths
When a different node is reached, consider its sequence as a variant (a "bubble")

Looks simple enough! However Rust has a few quirks (such as Ownership, Closures, etc.) that make it somewhat difficult to work with, at least for a beginner.

With that in mind, I started writing Rust code by trying to stay as close as possible to Flavia's code. Unfortunately, since Python and Rust are very different languages, that didn't go as well as expected. For one, it was very hard to tell what variable types I had to use (since in Python types are implicit); secondly, I experienced many issues when trying to use closures/lambdas (something I had rarely done before).

Another unexpected thing I experienced is that some methods found in Flavia's code were missing from Handlegraph's Rust implementation (specifically for_each_path_handle and for_each_step_in_path). So, I created my own fork of rs-handlegraph but I'm still trying to figure out how to correctly implement them.

Currently, I still have to fix a few issues from the list above. Once I solve these issues, I'll start running comparisons between my program and Flavia's, to make sure they work in the same way.