Open discovery and exchange for all
Biological data are being generated at an increasingly fast pace, most visibly in DNA sequence databases. In the CBS DNA barcoding project at the CBS-KNAW Fungal Biodiversity Centre in the Netherlands alone, researchers have generated around 200,000 sequences of two genes, ITS (internal transcribed spacer) and LSU (large subunit), over the last five years for the classification and identification of fungal species.
Before these sequences can be published, they have to be checked and validated. In practice, researchers found sequence validation to be the most severe bottleneck of the DNA barcoding project. Despite the sophistication of currently available search routines, the sheer number of sequences makes it practically impossible to identify large groups of similar sequences quickly and correctly. The problem is even more pronounced for much larger databases such as GenBank, where approximately six million fungal DNA sequences are currently available for download. There is therefore a need for clustering tools that extract knowledge automatically and so enable the curation and validation of large-scale databases.
This project aims to build a validation tool for biological data that is efficient in terms of time and memory and can handle large-scale datasets with high accuracy. With the eScience Research Engineers’ expertise in Efficient Computing and Big Data Analytics, the project can deliver a complete software tool to end users and researchers that handles large-scale datasets on a normal desktop computer or on a cloud-based infrastructure.
This tool will not only be used to validate and curate massive numbers of DNA sequences, but can also be applied in any field of data partitioning that relies on similarity functions, such as protein family detection or metagenomics.
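To make the idea of partitioning data with a similarity function concrete, here is a minimal Python sketch. It is not the project's algorithm: the greedy threshold strategy, the k-mer Jaccard score, and all names (`kmer_set`, `jaccard_similarity`, `greedy_cluster`) are assumptions chosen for illustration, and a real tool would most likely use an alignment-based similarity and a more scalable clustering scheme.

```python
from typing import Callable, FrozenSet, List


def kmer_set(seq: str, k: int = 3) -> FrozenSet[str]:
    """Return the set of overlapping k-mers in seq (toy feature, assumed here)."""
    return frozenset(seq[i:i + k] for i in range(len(seq) - k + 1))


def jaccard_similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of two k-mer sets; a stand-in for a real alignment score."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka and not kb:
        return 1.0  # two sequences shorter than k are treated as identical
    return len(ka & kb) / len(ka | kb)


def greedy_cluster(
    items: List[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[List[str]]:
    """Assign each item to the first cluster whose representative is similar enough.

    The first member of each cluster acts as its representative. An item that
    matches no representative starts a new cluster.
    """
    clusters: List[List[str]] = []
    for item in items:
        for cluster in clusters:
            if similarity(item, cluster[0]) >= threshold:
                cluster.append(item)
                break
        else:
            clusters.append([item])
    return clusters


# Hypothetical toy sequences, not real barcode data.
sequences = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC"]
clusters = greedy_cluster(sequences, jaccard_similarity, threshold=0.5)
```

Because the similarity function is passed in as a parameter, the same clustering routine could be reused with a different score, for example one suited to protein sequences. Sequences that end up alone in a cluster, or in a cluster that disagrees with their label, would be natural candidates for manual curation.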