Saturday, August 1, 2020

Week #9: Refactoring rs-gfatovcf + VCF validation

As the second evaluation was getting close, all the mentors sent me their suggestions for improving my project last week. The most commonly reported problem was that the code was hard to read, as the whole project was basically contained in a single file with over 1500 lines of code. I first decided to split it into three files: one containing the main function, one with all the I/O-related functions, and another with all the remaining functions. Njagi then suggested splitting functions.rs further into multiple files, and so I did. Now I have bubble_detection.rs and variant_identification.rs, which together with the readme.md should make it fairly straightforward to understand what each function actually does. Other smaller suggestions were also implemented, such as adding logging to report errors and adding more comments to make my code clearer.

This "improvement phase" also allowed me to make a few changes I had been wanting to make for a while. One of them was finding a better way to store Variants, which were previously represented in variant_identification.rs as a struct with 9 fields, all of them Strings. This was wrong for two reasons:
  1. Position and Quality should be integers instead of Strings
  2. Some fields are optional as per the VCF format, but in my implementation they are all required
I fixed both of these issues in this commit. As you can see, multiple fields are now optional and are correctly written as a "." in the VCF file; Position and Quality are also i32 now. I also did some smaller work on creating a Bubble structure in bubble_detection.rs. Both files now first present the data structure that will be used, followed by all the related functions, which should make the code even easier to understand.
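To give an idea of what this looks like, here is a minimal sketch of a variant record with optional fields. The field names and the or_dot helper are made up for illustration and may not match the ones in rs-gfatovcf; the only points taken from the text above are that Position and Quality are i32 and that a missing field becomes a "." on output.

```rust
use std::fmt;

/// Hypothetical VCF record: the field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct Variant {
    pub chromosome: String,
    pub position: i32,
    pub id: Option<String>,
    pub reference: String,
    pub alternate: Option<String>,
    pub quality: Option<i32>,
    pub filter: Option<String>,
    pub info: Option<String>,
    pub format: Option<String>,
}

/// Hypothetical helper: renders a missing field as ".".
fn or_dot<T: fmt::Display>(field: &Option<T>) -> String {
    match field {
        Some(value) => value.to_string(),
        None => ".".to_string(),
    }
}

impl fmt::Display for Variant {
    /// Writes the record as one tab-separated VCF line (no trailing newline).
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}",
            self.chromosome,
            self.position,
            or_dot(&self.id),
            self.reference,
            or_dot(&self.alternate),
            or_dot(&self.quality),
            or_dot(&self.filter),
            or_dot(&self.info),
            or_dot(&self.format),
        )
    }
}
```

With Option, the compiler forces every caller to handle the "missing" case explicitly, instead of relying on an empty String or a magic value.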

I then went back to VCF validation which, as you may remember from last week, didn't go as planned. The main idea was to compare the VCF returned by my tool to the one returned by vg deconstruct. Since vg deconstruct returns a VCF relative to a specific path, I had to add the ability to choose which reference path(s) should be used to detect variants. This can now be done via the -p (or --reference-paths) option, which accepts a single path or a comma-separated list of paths.

By using the --reference-paths x,z option, only variants relative to paths x and z will be shown in the resulting VCF.
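A rough sketch of how such a comma-separated value could be handled is shown below. Both function names are hypothetical and this is not necessarily how rs-gfatovcf does it; trimming whitespace and skipping empty entries are assumptions of mine.

```rust
/// Hypothetical helper: splits the value of --reference-paths into path
/// names, trimming whitespace and skipping empty entries (an assumption).
fn parse_reference_paths(arg: &str) -> Vec<String> {
    arg.split(',')
        .map(str::trim)
        .filter(|name| !name.is_empty())
        .map(String::from)
        .collect()
}

/// Hypothetical helper: keeps only the graph paths that were requested,
/// preserving the order in which they appear in the graph.
fn select_paths<'a>(all_paths: &'a [String], requested: &[String]) -> Vec<&'a String> {
    all_paths
        .iter()
        .filter(|path| requested.contains(*path))
        .collect()
}
```

For a graph with paths x, y and z, parsing "x,z" and passing the result to select_paths would keep x and z and drop y.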

I then used bedtools jaccard to compare the VCFs from the COVID-19 pangenome, but unfortunately the resulting Jaccard index was quite low, meaning the two VCFs don't really match. After more testing I discovered that nested bubbles (also known as superbubbles) are not detected correctly, and that may be the reason the VCFs are so different. To solve this issue, the idea is to create a new bubble detection algorithm based on this paper, which should be able to detect superbubbles and also achieve greater accuracy overall.
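For reference, the Jaccard index is the number of base pairs shared by the two sets divided by the number of base pairs in their union, so 1 means a perfect match and 0 means no overlap at all. The sketch below is a simplified version of that statistic and not bedtools' actual implementation: it assumes each variant has already been reduced to a half-open interval on a single sequence, and all names in it are made up.

```rust
/// Half-open interval [start, end) on a single sequence (an assumption:
/// real VCF records would first have to be converted to this form).
type Interval = (u64, u64);

/// Sorts the intervals and merges the ones that overlap or touch.
fn merge(mut intervals: Vec<Interval>) -> Vec<Interval> {
    intervals.sort();
    let mut merged: Vec<Interval> = Vec::new();
    for (start, end) in intervals {
        if let Some(last) = merged.last_mut() {
            if start <= last.1 {
                last.1 = last.1.max(end);
                continue;
            }
        }
        merged.push((start, end));
    }
    merged
}

/// Total number of positions covered by already merged intervals.
fn total_length(intervals: &[Interval]) -> u64 {
    intervals.iter().map(|(start, end)| end - start).sum()
}

/// Shared positions divided by positions covered by either set.
fn jaccard(a: Vec<Interval>, b: Vec<Interval>) -> f64 {
    let (a, b) = (merge(a), merge(b));
    let (mut i, mut j, mut shared) = (0, 0, 0u64);
    // Walk both sorted lists at once, summing the overlapping parts.
    while i < a.len() && j < b.len() {
        let start = a[i].0.max(b[j].0);
        let end = a[i].1.min(b[j].1);
        if start < end {
            shared += end - start;
        }
        // Advance whichever interval finishes first.
        if a[i].1 < b[j].1 {
            i += 1;
        } else {
            j += 1;
        }
    }
    let union = total_length(&a) + total_length(&b) - shared;
    if union == 0 {
        0.0
    } else {
        shared as f64 / union as f64
    }
}
```

For example, the sets [0, 10) and [5, 15) share 5 positions out of 15 covered in total, so their index is 1/3.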
