Week #4: Finishing up GFAtoVCF and exploring concurrency in Rust

The first half of this week was dedicated to finishing up GFAtoVCF, now renamed rs-gfatovcf to follow chfi's naming conventions. The only things missing were:

Unit Tests (to make sure everything works correctly)
Documentation (both in terms of comments and repo description on GitHub)
An easier way to run the script (the current one was too long and complex)
A CI/CD pipeline (which is nice to have, especially to make sure there's no build problem)

Unit tests weren't too hard to write, but required me to rethink how certain sections of code were organized, since they were too tightly coupled to each other. Thankfully, moving some parts of the main function to other functions solved this issue rather quickly. I ended up writing more than 250 lines of unit tests, which I'd say is quite good for a relatively small project. Writing the documentation was a good way to reflect on how the overall script works. You can find it here.

Finding an alternative way to run the script was a bit harder. The main problem was shipping both the program and its dependencies in a simple way. Initially, I required users to download the dependencies themselves. However, Njagi suggested using git submodules, which streamlined the download process considerably (and also made it easy to add a simple CI/CD pipeline).

With all of the above done, I'd say that rs-gfatovcf is pretty much done... what now? Well, it would be nice if we could find variants faster! One way we could do that is by using concurrency, that is, using multiple threads to run multiple parts of the program at the same time (which lets the program take advantage of multiple CPU cores).

But how exactly could concurrency be used in rs-gfatovcf? I feel that these are the main areas that could benefit the most from it:

BFS: create a new thread for each outgoing edge at a given node
Bubble Detection: when a bubble is opened, create a new thread for each outgoing edge
Variant Identification: for each reference, create as many threads as there are bubbles
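
To make the "one thread per outgoing edge" idea more concrete, here is a minimal sketch. It does not use the real rs-gfatovcf data structures: the Graph type alias (a plain adjacency list) and the functions bfs_from and explore_per_edge are hypothetical names made up for illustration. It relies on std::thread::scope, which requires Rust 1.63 or later.

```rust
use std::collections::{HashMap, HashSet, VecDeque};
use std::thread;

/// Hypothetical toy adjacency list: node id -> nodes reached by its outgoing edges.
/// The real project works on a variation graph, not on this type.
type Graph = HashMap<u64, Vec<u64>>;

/// Sequential BFS returning every node reachable from `start` (inclusive), sorted.
fn bfs_from(graph: &Graph, start: u64) -> Vec<u64> {
    let mut visited = HashSet::new();
    let mut queue = VecDeque::new();
    visited.insert(start);
    queue.push_back(start);
    while let Some(node) = queue.pop_front() {
        if let Some(neighbors) = graph.get(&node) {
            for &next in neighbors {
                // `insert` returns true only the first time a node is seen.
                if visited.insert(next) {
                    queue.push_back(next);
                }
            }
        }
    }
    let mut nodes: Vec<u64> = visited.into_iter().collect();
    nodes.sort_unstable();
    nodes
}

/// Spawns one scoped thread per outgoing edge of `node`; each thread runs an
/// independent BFS starting from the node at the other end of that edge.
fn explore_per_edge(graph: &Graph, node: u64) -> Vec<(u64, Vec<u64>)> {
    let neighbors = graph.get(&node).cloned().unwrap_or_default();
    thread::scope(|s| {
        // Spawn every worker first, so they can all run at the same time...
        let handles: Vec<_> = neighbors
            .iter()
            .map(|&n| (n, s.spawn(move || bfs_from(graph, n))))
            .collect();
        // ...then collect the results in edge order.
        handles
            .into_iter()
            .map(|(n, h)| (n, h.join().expect("worker thread panicked")))
            .collect()
    })
}
```

On a small bubble such as 1 -> {2, 3} -> 4, calling explore_per_edge(&graph, 1) runs two searches side by side, one per branch. Note that both workers visit node 4 independently, so a real implementation would probably need a shared visited set, and spawning an OS thread per edge may cost more than it saves on small graphs.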

Rust's standard library offers fairly basic support for concurrency. More specifically:

thread::spawn allows for the creation of a new thread; it requires a closure as a parameter, which specifies the function that will be executed by the thread. The move keyword can be used to pass ownership of certain objects from the main thread to the new thread.
mpsc::channel allows for the creation of communication channels between different threads. A channel is represented as a tuple (transmitter, receiver). mpsc stands for "multiple producer, single consumer": calling .clone on the transmitter lets multiple threads send on the same channel, while the receiver cannot be cloned and stays with a single thread. This approach of using channels to communicate is called message passing.
std::sync::Mutex allows multiple threads to access the same object, one at a time: a thread must acquire the lock before touching the data. A Mutex on its own is not enough when several threads each need to own a handle to it, for example when implementing a counter that every thread increments by 1. In such cases, the Mutex can be wrapped in an Arc<T> (an atomically reference-counted pointer), giving an Arc<Mutex<T>> that can be cloned and moved into each thread.
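
The three primitives above can be sketched together in a few lines. This is only an illustration using the standard library; the function names sum_via_channel and count_with_mutex are made up for this example and have nothing to do with rs-gfatovcf.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Message passing: every worker sends its id over a shared channel,
/// and the calling thread sums whatever it receives.
fn sum_via_channel(workers: u32) -> u32 {
    let (tx, rx) = mpsc::channel();
    for id in 1..=workers {
        // Each thread gets its own clone of the transmitter.
        let tx = tx.clone();
        // `move` transfers ownership of `tx` and `id` to the new thread.
        thread::spawn(move || {
            tx.send(id).expect("receiver was dropped");
        });
    }
    // Drop the original transmitter so the receiving loop ends
    // once every worker has finished and dropped its clone.
    drop(tx);
    rx.iter().sum()
}

/// Shared state: a counter protected by a Mutex and shared through an Arc.
fn count_with_mutex(workers: u32) -> u32 {
    let counter = Arc::new(Mutex::new(0u32));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            // Cloning the Arc only bumps the reference count.
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                // The lock is released when the guard goes out of scope.
                *counter.lock().expect("mutex was poisoned") += 1;
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    let total = *counter.lock().expect("mutex was poisoned");
    total
}
```

With four workers, sum_via_channel(4) adds up the ids 1 through 4, while count_with_mutex(4) ends with the counter at 4. The channel version needs no locking at all, which is one reason message passing is often the first thing to try.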

The Rust Book recommends referring to external crates for further concurrency needs. The most interesting one appears to be the parking_lot crate, which provides alternative implementations of synchronization primitives such as Mutex. Other crates aimed at concurrent or parallel workloads are also available, such as dashmap (a concurrent hash map) and arrayfire (bindings to the ArrayFire parallel computing library). These will be considered if any particular problem arises when using the standard library types. This reddit thread offers more suggestions that will be evaluated during development.

GSOC - Parallel Graph Traversal for Variation Graphs

Thursday, June 25, 2020
