[My goal in this series of posts is to share some of the eDNA knowledge I’ve accumulated over the past few years. Hopefully, it will be useful to students or others just starting out. Here, I walk through the first steps of creating a DNA reference database for Arthropod metabarcoding: downloading the raw sequences…]
Raw DNA sequences in public databases are the ore that is mined and then refined into metabarcoding reference database gold: the final trimmed and labeled amplicon sequences used to attach species names to unknown DNA sequences. Making a good DNA sequence reference database for a diverse group like Arthropods is a challenge. Given that there are many ways to do anything in bioinformatics, this post describes one way to download and format raw Arthropod reference sequences for metabarcoding. A variety of reference database curation software exists. This solution mainly uses the CRABS software and was implemented on a Linux cluster.
For amplicons in the COI mitochondrial region, it is prudent to get data from both the Barcode of Life Database (BOLD) and one of the large genetic repositories like GenBank or EMBL. I lump GenBank and EMBL together because these sources regularly share sequences and overlap extensively. BOLD also shares data with GenBank and the other repositories, but there are many additional COI sequences hiding in BOLD that are not found in other databases. Each of these data sources has idiosyncrasies when it comes to the downloading process. Here, I use a combination of BOLD, EMBL, and GenBank (MIDORI2) data. If a quicker solution is desired, you could probably omit EMBL. I’ll start with BOLD, which primarily contains COI sequences because COI is the official DNA barcode region for animals.
BOLD
Advantages: BOLD is a high-quality ore, one with accurate taxonomic labels backed by actual reference specimens.
Disadvantages: BOLD has a fickle search feature that doesn’t recognize all of the taxonomic groups that work when searching in NCBI/GenBank. For example, the web-based search returns zero sequences for some orders like Mantophasmatodea, but it works when you use the one family within that order, Mantophasmatidae. One could try downloading all of Arthropoda at once, but 1) on the BOLD website this produces a very large file that you then usually need to upload to a cluster for processing; and 2) I have had strange results when attempting to download all of Arthropoda from BOLD straight to the cluster via other programs (e.g., only ~4000 sequences returned).
How to do it: The way I have found to get around this issue is to use CRABS to download sequences order-by-order, check for cases with download troubles (e.g., zero or few sequences), and then redo those cases family-by-family. The end result is a list of taxonomic names covering all of Arthropoda that seem to work for BOLD. I then loop through this list using a BASH script to download the sequences into individual fasta files. This hodgepodge list of order and family names is shown below.
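The download-and-check loop described above can be sketched roughly as follows. This is a minimal sketch, not the exact script used here. It assumes a text file with one taxon name per line and one output fasta file per taxon. The function names are hypothetical, and the threshold of 10 sequences is an arbitrary choice for flagging "few" sequences. `download_taxon` is a deliberate placeholder: the BOLD download syntax differs between CRABS releases, so fill in the command that matches your installed version (check `crabs -h`).

```shell
#!/usr/bin/env bash
# Sketch: download each taxon to its own fasta file, then flag taxa whose
# file is missing or holds suspiciously few sequences.

# Count fasta records (header lines starting with ">") in a file.
# Prints 0 if the file does not exist or is empty.
count_seqs() {
    if [ -f "$1" ]; then
        grep -c '^>' "$1" || true
    else
        echo 0
    fi
}

# HYPOTHETICAL placeholder: replace the body with the BOLD download command
# for your CRABS version. As written, it only reports what it would do.
download_taxon() {
    local taxon="$1" outfile="$2"
    echo "would download ${taxon} into ${outfile}" >&2
}

# Loop over a list of taxon names (one per line) and download each one.
download_all() {
    local list="$1" outdir="$2" taxon
    mkdir -p "$outdir"
    while IFS= read -r taxon || [ -n "$taxon" ]; do
        [ -n "$taxon" ] || continue   # skip blank lines
        download_taxon "$taxon" "${outdir}/${taxon}.fasta"
    done < "$list"
}

# Print "taxon<TAB>count" for every taxon with fewer than min_seqs records.
# min_seqs defaults to 10 (an assumed cutoff; adjust to taste).
flag_low_taxa() {
    local list="$1" outdir="$2" min_seqs="${3:-10}" taxon n
    while IFS= read -r taxon || [ -n "$taxon" ]; do
        [ -n "$taxon" ] || continue
        n=$(count_seqs "${outdir}/${taxon}.fasta")
        if [ "$n" -lt "$min_seqs" ]; then
            printf '%s\t%s\n' "$taxon" "$n"
        fi
    done < "$list"
}

# Example usage (file and directory names are assumptions):
#   download_all  orders.txt bold_raw
#   flag_low_taxa orders.txt bold_raw 10 > retry_by_family.txt
```

Any order that shows up in the flagged output is a candidate for swapping out for its family names in the list before rerunning the loop.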