Sunday, August 23, 2020

Week #12: Implementing the Biedged Graph to Cactus Graph conversion algorithm

This week was mostly spent implementing the Biedged Graph to Cactus Graph conversion algorithm presented in the paper by Benedict Paten et al. which should help improve the accuracy and correctness of my tool, rs-gfatovcf. This change was mostly due to two main reasons:

  1. The current Bubble Detection approach does not detect nested Bubbles (also called Superbubbles) correctly
  2. The resulting VCF cannot be used with popular tools (such as vg), meaning that the results may not be 100% accurate
Assuming that Variant Identification phase of rs-gfatovcf is actually correct, replacing the current Bubble Detection phase with the new algorithm should address both issues at the same time. I therefore created a new library, called rs-cactusgraph, that provides various utilities for Biedged and Cactus Graphs; once rs-cactusgraph is fully complete, it will be used as a support library for rs-gfatovcf.

As you may remember from last week's post, one of the main concerns was that Petgraph (the library used for storing the graph) would not scale well with larger graphs. While some more accurate testing has yet to be done, the kind of graph that I'm using (called GraphMap) uses a sparse adjacency matrix representation, with a space complexity of O(|V|+|E|), which I think isn't too bad. Another reason why I chose Petgraph over a HandleGraph implementation is that it provides more flexibility and ease of use. I'm still thinking about doing a Handlegraph implementation, but it will come at a later time.

The first step of the algorithm is to contract all the gray edges. This was relatively straightforward to do: since I store the gray edges in a vector, I just call the edge contraction procedure on each one of them. The only real problem here is to maintain consistency between the graph and the vector, but it can be avoided by carefully updating the edges when necessary.

The second step is to merge all nodes in each 3-edge-connected-component. Detecting these components is a really complex task, but thankfully chfi has already implemented it in rs-3-edge. There's only one problem though: rs-3-edge does not work with Biedged Graphs. So I created a way to convert a Biedged Graph into the appropriate format for rs-3-edge. I can now call rs-3-edge from inside my script, and after the 3-edge-connected-components have been found, I just merge all the nodes inside each component.

The third and final step is to merge all nodes in each cyclic component. By doing a DFS, loops can be found by checking for back edges (edges that connect a node to one of its ancestors in the visit). This step still needs a bit of work, but I think it can be done before GSoC ends.

Next week I will complete step 3 and do some final bugfixing. I will also write a final blog post recapping what has been done during this GSoC.

No comments:

Post a Comment