But what actually were these scalability issues I mentioned previously? Until now, the only GFA I had used for testing was the one already included in Flavia's GFAtoVCF repository. While it's not a trivial graph, it obviously does not match the complexity of actual "real-world" graphs. For instance, this graph is about 1 KB, while the one from the PubSeq project is about 40 MB: that's roughly 40,000 times bigger! Also, "real-world" graphs can contain loops and nested bubbles, which make bubble detection even harder.
A small section of the graph that was used for testing. Noticed anything yet? My script now supports JSON output for both the starting graph and its BFS tree.
When I tried to run my script on the COVID-19 GFA, something unexpected happened: Rust reported a stack overflow, caused by the BFS function. I was surprised at first, because I was quite sure the idea behind the function was correct. What I had not considered was that, when dealing with larger inputs, memory management plays an important role in how a program behaves (and also that ideas that work well in theory sometimes don't in practice!). So I decided to rewrite BFS in the standard way, that is, iteratively with an explicit queue. A similar problem showed up in find_all_paths_between, a function that finds all the paths between two nodes, and it was solved in a similar way. Extensive testing was possible thanks to Erik's HLA-Zoo, a collection of GFA files useful for testing programs that work with variation graphs.
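To illustrate the idea, here is a minimal sketch of both rewrites, assuming a simple adjacency-list representation of the graph (the actual GFAtoVCF code works on a proper variation-graph structure, so node types and names here are hypothetical). The key point is the same in both functions: an explicit `VecDeque` or `Vec` on the heap replaces the call stack, so graph size no longer drives recursion depth.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Iterative BFS over an adjacency list, returning nodes in visit order.
/// An explicit queue avoids the deep recursion that overflows the stack
/// on large graphs.
fn bfs(adj: &HashMap<u64, Vec<u64>>, start: u64) -> Vec<u64> {
    let mut visited: HashSet<u64> = HashSet::new();
    let mut order = Vec::new();
    let mut queue = VecDeque::new();

    visited.insert(start);
    queue.push_back(start);

    while let Some(node) = queue.pop_front() {
        order.push(node);
        for &next in adj.get(&node).into_iter().flatten() {
            // insert() returns true only the first time a node is seen
            if visited.insert(next) {
                queue.push_back(next);
            }
        }
    }
    order
}

/// All simple paths between two nodes, found iteratively with an explicit
/// stack of partial paths instead of recursive calls.
fn find_all_paths_between(
    adj: &HashMap<u64, Vec<u64>>,
    from: u64,
    to: u64,
) -> Vec<Vec<u64>> {
    let mut paths = Vec::new();
    let mut stack = vec![vec![from]];

    while let Some(path) = stack.pop() {
        let last = *path.last().unwrap();
        if last == to {
            paths.push(path);
            continue;
        }
        for &next in adj.get(&last).into_iter().flatten() {
            // skip nodes already on this path, so loops don't recurse forever
            if !path.contains(&next) {
                let mut extended = path.clone();
                extended.push(next);
                stack.push(extended);
            }
        }
    }
    paths
}

fn main() {
    // Tiny bubble-shaped graph: 1 -> {2, 3} -> 4
    let mut adj: HashMap<u64, Vec<u64>> = HashMap::new();
    adj.insert(1, vec![2, 3]);
    adj.insert(2, vec![4]);
    adj.insert(3, vec![4]);

    println!("BFS order: {:?}", bfs(&adj, 1));
    println!("Paths 1->4: {:?}", find_all_paths_between(&adj, 1, 4));
}
```

On the tiny bubble above, BFS visits all four nodes starting from 1, and the path search finds the two branches of the bubble (1-2-4 and 1-3-4), which is exactly the structure that later becomes a variant record.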
Another issue was that the script would sometimes segfault. Fixing the stack overflows appears to have solved most of these segfaults as well. Unfortunately, I'm not yet able to extract the VCF from the COVID-19 GFA, mostly because the current method is very expensive in terms of memory usage: after more than 1.5 hours of running, I got a segfault. I hope that parallelization can help significantly; I'm also curious about what would happen when running this program on a machine with more memory.