This "improvement phase" also gave me the chance to make a few changes I had been wanting to do for a while. One of them was finding a better way to store Variants, which were previously represented in variant_identification.rs as a struct with 9 fields, all of them Strings. This was wrong for two reasons:
- Position and Quality should be integers instead of Strings
- Some fields are optional as per the VCF format, but in my implementation they are all required
I fixed all of these issues in this commit. As you can see, multiple fields are now optional and are correctly written as a "." in the VCF file when they are missing; Position and Quality are also i32 now. I also did a smaller piece of work to create a Bubble structure in bubble_detection.rs. Both files now present the data structure first, followed by all the related functions, which should make the code even easier to understand.
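To make the idea concrete, here is a minimal sketch of what such a struct could look like. The field names, the choice of which fields are optional, and the `to_vcf_line` and `or_dot` helpers are all assumptions made for illustration; the real definition lives in variant_identification.rs and may differ.

```rust
/// Hypothetical sketch of a VCF record with optional fields.
/// Field names and the choice of which fields are optional are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct Variant {
    pub chromosome: String,
    pub position: i32,
    pub id: Option<String>,
    pub reference: String,
    pub alternate: Option<String>,
    pub quality: Option<i32>,
    pub filter: Option<String>,
    pub info: Option<String>,
    pub format: Option<String>,
}

/// Render an optional field, using "." (the VCF missing-value marker) for None.
fn or_dot<T: ToString>(field: &Option<T>) -> String {
    field
        .as_ref()
        .map_or_else(|| ".".to_string(), |value| value.to_string())
}

impl Variant {
    /// Build one tab-separated VCF data line from the record.
    pub fn to_vcf_line(&self) -> String {
        [
            self.chromosome.clone(),
            self.position.to_string(),
            or_dot(&self.id),
            self.reference.clone(),
            or_dot(&self.alternate),
            or_dot(&self.quality),
            or_dot(&self.filter),
            or_dot(&self.info),
            or_dot(&self.format),
        ]
        .join("\t")
    }
}
```

Using `Option` means a missing value cannot be confused with an empty string, and the "." only appears at the moment the record is serialized.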
I then went back to VCF validation, which, as you may remember from last week, didn't go as planned. The main idea was to compare the VCF returned by my tool to the one returned by vg deconstruct. Since vg deconstruct returns the VCF relative to a specific path, I had to add the ability to choose which reference path(s) should be used to detect variants. This can now be done via the -p (or --reference-paths) option, which accepts a single path or a comma-separated list of paths.
With the --reference-paths x,z option, only variants relative to paths x and z are shown in the resulting VCF.
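As a rough illustration of how a comma-separated value like this can be handled, the sketch below splits the argument into a set of path names and keeps only the records whose path is in that set. The function names, the `(path, position)` tuple used as a stand-in for a variant, and the trimming of whitespace are assumptions, not the tool's actual code.

```rust
use std::collections::HashSet;

/// Split a value such as "x,z" into a set of path names.
/// Trimming whitespace and skipping empty entries are assumptions.
pub fn parse_reference_paths(arg: &str) -> HashSet<String> {
    arg.split(',')
        .map(str::trim)
        .filter(|name| !name.is_empty())
        .map(String::from)
        .collect()
}

/// Keep only the records whose reference path was requested.
/// Each record is a hypothetical (path name, position) pair.
pub fn filter_by_reference<'a>(
    records: &'a [(String, i32)],
    wanted: &HashSet<String>,
) -> Vec<&'a (String, i32)> {
    records
        .iter()
        .filter(|record| wanted.contains(&record.0))
        .collect()
}
```

A set is used here so that the membership check stays cheap even when many paths are requested.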
I then used bedtools jaccard to compare the VCFs from the COVID-19 pangenome, but unfortunately the resulting Jaccard score was quite low, meaning the two files don't really match. After more testing I discovered that nested bubbles (also known as superbubbles) are not detected correctly, and that may be the reason the VCFs are so different. To solve this issue, the idea is to create a new bubble detection algorithm based on this paper, which will be able to detect superbubbles and also achieve greater accuracy overall.
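For readers unfamiliar with the metric, the sketch below shows the general idea of a Jaccard index over genomic intervals: the number of positions covered by both sets divided by the number of positions covered by either set. It is a simplified illustration, not bedtools' implementation; it assumes half-open `(start, end)` intervals on a single sequence and expands them position by position, which is only practical for small genomes.

```rust
use std::collections::HashSet;

/// Expand half-open (start, end) intervals into the set of covered positions.
fn covered_positions(intervals: &[(u64, u64)]) -> HashSet<u64> {
    intervals
        .iter()
        .flat_map(|&(start, end)| start..end)
        .collect()
}

/// Jaccard index: shared positions divided by positions covered by either set.
/// Returns 0.0 when both inputs are empty (an arbitrary choice for this sketch).
pub fn jaccard(a: &[(u64, u64)], b: &[(u64, u64)]) -> f64 {
    let positions_a = covered_positions(a);
    let positions_b = covered_positions(b);
    let union = positions_a.union(&positions_b).count();
    if union == 0 {
        return 0.0;
    }
    let intersection = positions_a.intersection(&positions_b).count();
    intersection as f64 / union as f64
}
```

A value of 1.0 means the two sets cover exactly the same positions, while a value close to 0 means they barely overlap.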