GSOC - Parallel Graph Traversal for Variation Graphs: June 2020

Thursday, June 25, 2020

Week #4: Finishing up GFAtoVCF and exploring concurrency in Rust

This first half of this week was dedicated to finishing up GFAtoVCF, now renamed rs-gfatovcf to follow chfi's naming conventions. The only things missing were:

Unit Tests (to make sure everything works correctly)
Documentation (both in terms of comments and repo description on GitHub)
An easier way to run the script (the current one was too long and complex)
A CI/CD pipeline (which is nice to have, especially to make sure there's no build problem)

Unit tests weren't too hard to write, but required me to rethink how certain sections of code were organized, since they were too tightly coupled to each other. Thankfully, moving some parts of the main function to other functions solved this issue rather quickly. I ended up writing more than 250 lines of unit tests, which I'd say is quite good for a relatively small project. Writing the documentation was a good way to reflect on how the overall script works. You can find it here.

Finding an alternative way to run the script was a bit harder. The main problem was finding a way to ship both the program and its dependencies in a simple way. Initially, I required the user to download the dependencies by himself. However, Njagi suggested using git submodules, and that helped me streamline the download process way more (and also add a simple CI/CD pipeline).

With all of the above done, I'd say that rs-gfatovcf is pretty much done... what now? Well, it would be nice if we could find Variants faster! One way we could do that is by using concurrency, as in using multiple threads to run multiple parts of the program at the same time (which also means using multiple cores of a single CPU).

But, how exactly could concurrency be used in rs-gfatovcf? I feel that there are a few main areas that could benefit the most from using concurrency:

BFS: create a new thread for each outgoing edge at a given node
Bubble Detection: when a bubble is opened, create a new thread for each outgoing edge
Variant Identification: for each reference, create as many threads as there are bubbles

Rust offers a rather basic support of concurrency. More specifically:

thread::spawn allows for the creation of a new thread; it requires a closure as a parameter, which specifies the function that will be executed by the thread. The move keyword can be used to pass ownership of certain objects from the main thread to the new thread.
mpsc::channel allows for the creation of communication channels between different threads. A channel is represented as a tuple (transmitter, receiver). By using .clone on the transmitter and/or the receiver, multiple threads are able to communicate on the same channel. This approach of using channels to communicate is called message passing.
std::sync::Mutex allows for sharing the same object between multiple threads, one at a time. This may be too limiting for certain applications, for example when trying to implement a counter where each thread tries to increment the counter by 1 at the same time. In such cases, an object of type Arc<T> (Atomic reference counter) can be used.

The Rust Book recommends referring to external crates for further concurrency needs. The most interesting one appears to be the parking_lot crate. Many concurrency aware data structures are also present, such as dashmap and arrayfire. These data structures will be considered if any particular problem arises when using the default ones. This reddit thread offers more suggestions that will be evaluated during development.

Sunday, June 21, 2020

Week #3: Getting GFAtoVCF to work

As promised last week, here's a first look at a VCF file produced by my program:

See the full size image here

I was actually able to get it to work quite early in the week, with a small caveat: it only worked with a specific dfs-tree given as input, more specifically the one produced by the Python version of GFAtoVCF. When a program only works with a specific set of input, it means there's something radically wrong with it -- I only had to figure out why it was wrong, and how to fix it.

The first thing I actually noticed was that different bubbles were being detected depending on the dfs-tree obtained by the visit. This has always confused me since, starting from the same graph, I expected to always find the same bubbles. More specifically, when starting from the dfs-tree produced by Rust, multiple bubbles of length 1 were being produced, leading to errors at the end of the program.

See the full size image here

So, what I tried to do was finding a way to obtain the same bubbles no matter the dfs-tree, i.e. to "compress" contiguous bubbles of length 1 into a single longer bubble. I never really liked this solution since it wasn't really a solution per se, but more of a workaround.

Then, after discussing for a while with Gianluca Della Vedova, another idea came to mind: what if the DFS was the problem, and instead we needed another kind of visit? And what if that visit was a BFS? If you recall from last week, we found out that both dfs-trees were actually correct, but some nodes had different distances from the root, depending on the visit; with a BFS, such a thing cannot ever happen.

So, I implemented the BFS in Rust, and everything started working correctly, since if a given node has a certain distance from the root element, it will always have that distance. Since I now have a greater understanding of how HandleGraph works, it wasn't really that hard to implement.

Now only tests and an in-depth documentation are missing. Expect a long blog post next week, where I'll explaining how the program works in detail, and what's going to happen next!

Sunday, June 14, 2020

Week #2: Debugging GFAtoVCF

As you can see from the title, this week was mostly spent debugging my Rust version of Flavia95's GFAtoVCF. Recall that GFAtoVCF is a script that is capable of detecting simple bubbles in a Variation Graph; this is in contrast to superbubbles (which can be nested) and ultrabubbles (which can contain loops).

The main topic for the week was the DFS (Depth First Search), or rather the spanning tree obtained by such algorithm. Consider the following image:

See the full size image here

In blue you can see the spanning tree produced by my Rust version of GFAtoVCF, while in red you can see (part of) the one produced by Flavia's. As you can see, nodes 9,10 and 11 are placed differently with respect to each other. Which one is correct, then?

Well, it turns out that both are correct. The difference is probably due to different underlying implementations of the Handlegraph model (in Python ODGI is used, while I'm using rs-handlegraph), so that when the DFS has to choose a new edge, a different one gets chosen.

I am currently trying to fix the last few issues in GFAtoVCF. By next week, I hope I'll be able to show you the resulting VCF!

Sunday, June 7, 2020

Week #1: Re-implementing GFAtoVCF in Rust

As part of my first GSoC week, I was asked to re-implement Flavia95's GFAtoVCF in Rust. This has two objectives:

Getting me familiar with Rust (a language I had no previous experience in)
Getting me familiar with Variation Graphs concepts, and with the Handlegraph API (which also has a Rust implementation)

From what I understand, GFA (which stands for Graphical Fragment Assembly) is a format for storing the results of an assembly in the form of a graph; it can also be used to represent a pangenome. VCF (which stands for Variation Calling Format) is instead a format for storing variants of a given reference.

So, in order convert from GFA to VCF, I have to:

Consider one (or more) paths as a reference
For each reference, explore all other possible paths
When a different node is reached, consider its sequence as a variant (a "bubble")

Looks simple enough! However Rust has a few quirks (such as Ownership, Closures, etc.) that make it somewhat difficult to work with, at least for a beginner.

With that in mind, I started writing Rust code by trying to stay as close as possible to Flavia's code. Unfortunately, since Python and Rust are very different languages, that didn't go as well as expected. For one, it was very hard to tell what variable types I had to use (since in Python types are implicit); secondly, I experienced many issues when trying to use closures/lambdas (something I had rarely done before).

Another unexpected thing I experienced is that some methods found in Flavia's code were missing from Handlegraph's Rust implementation (specifically for_each_path_handle and for_each_step_in_path). So, I created my own fork of rs-handlegraph but I'm still trying to figure out how to correctly implement them.

Currently, I still have to fix a few issues from the list above. Once I solve these issues, I'll start running comparisons between my program and Flavia's, to make sure they work in the same way.