Forging a metabarcoding reference database for Arthropods. Part 1: mining the ore

[My goal is this series of posts is to help share some of the eDNA knowledge I’ve accumulated over the past few years. Hopefully, it will be useful to students or others just starting out. Here, I walk through the first steps of creating a DNA reference database for Arthropod metabarcoding: downloading the raw sequences…]

Raw DNA sequences in public databases are the ore that is mined and then refined into metabarcoding reference database gold: the final trimmed and labeled amplicon sequences used to attach species names to unknown DNA sequences. Making a good DNA sequence reference database for a diverse group like Arthropods is a challenge. Given that there are many ways to do anything in bioinformatics, this posts describes one way to download and format raw Arthropod reference sequences for metabarcoding. A variety of reference database curation software exists. This solution mainly makes use of CRABS software and was implemented on a Linux cluster.

For amplicons in the COI mitochondrial region, it is prudent to get data from both the Barcode of Life Database (BOLD) AND one of the large genetic repositories like GenBank or EMBL. I lump GenBank and EMBL as these sources regularly share sequences and overlap extensively. BOLD also shares data with GenBank etc. but there are many additional COI sequences hiding in BOLD that are not found in other databases. Each of these data sources has idiosyncrasies when it comes to the downloading process. Here, I use a combination of BOLD, EMBL, and GenBank (MIDORI2) data. If a quicker solution is desired, you could probably omit EMBL. I’ll start with BOLD, which primarily contains COI sequences as this is the official DNA barcode region for animals.

BOLD

Advantages: BOLD is a high quality ore, one with accurate taxonomic labels backed by actual reference specimens. Disadvantages: BOLD has a fickle search feature that doesn’t recognize all of the taxonomic groups that work when searching in NCBI/GenBank. For example, the web-based search returns zero sequences for some orders like Mantophasmatodea, but it works when you use the one family within that order, Mantophasmatidae. One could try downloading all Arthropoda at once, but 1) on the BOLD website it produces a very large file that you then usually need to upload to a cluster for processing; 2) I have had strange results when attempting to download all of Arthropoda from BOLD via other programs straight to the cluster (e.g., returning only ~4000 sequences). How to do it: The way I have found to conquer this issue is to use CRABS to download sequences order-by-order, check for cases with download troubles (e.g., zero or few sequences), and then do those cases family-by-family. The end result is a list of taxonomic names covering all of Arthropoda that seem to work for BOLD. I then loop through this list using a BASH script to download the sequences into individual fasta files. This hodgepodge list of order and family names is shown below.

artnames=("Entomobryomorpha" "Neelipleona" "Poduromorpha" "Symphypleona" "Rhabdura" "Dicellurata" "Acerentomata" "Eosentomata" "Sinentomata"  "Zygentoma" "Archaeognatha" "Lepidoptera" "Trichoptera" "Coleoptera" "Diptera" "Hymenoptera" "Mecoptera" "Megaloptera" "Neuroptera" "Raphidioptera" "Siphonaptera" "Strepsiptera" "Hemiptera" "Psocodea" "Thysanoptera" "Dermaptera" "Dictyoptera" "Embioptera" "Grylloblattidae" "Mantophasmatodea" "Austrophasmatidae" "Mantophasmatidae" "Orthoptera" "Phasmatodea" "Plecoptera" "Zoraptera" "Ephemeroptera" "Odonata" "Anomopoda" "Ctenopoda" "Haplopoda" "Onychopoda" "Cyclestherida" "Laevicaudata" "Spinicaudata" "Notostraca" "Anostraca" "Brachypoda" "Copepoda" "Tantulocarida" "Deoterthridae" "Microdajidae" "Amphionidacea" "Decapoda" "Euphausiacea" "Stomatopoda" "Amphipoda" "Cumacea" "Ingolfiellida" "Isopoda" "Mictacea" "Mysidacea" "Spelaeogriphacea" "Tanaidacea" "Thermosbaenacea" "Anaspidacea" "Bathynellacea" "Leptostraca" "Dendrogastrida" "Laurida" "Cryptophialida" "Lithoglyptida" "Chthamalophilidae" "Clistosaccidae" "Lernaeodiscidae" "Mycetomorphidae" "Peltogastridae" "Polyascidae" "Polysaccidae" "Sacculinidae" "Sylonidae" "Thompsoniidae" "Iblomorpha" "Balanomorpha" "Calanticomorpha" "Pollicipedomorpha" "Scalpellomorpha" "Verrucomorpha" "Facetotecta" "Hansenocaris" "Branchiura" "Cephalobaenida" "Porocephalida" "Raillietiellida" "Reighardiida" "Mystacocaridida" "Halocyprida" "Myodocopida" "Platycopida" "Podocopida" "Nectiopoda" "Sarcoptiformes" "Trombidiformes" "Opilioacariformes" "Opilioacaridae" "Holothyrida" "Ixodida" "Mesostigmata" "Araneae" "Opiliones" "Amblypygi" "Palpigradi" "Pseudoscorpiones" "Ricinulei" "Schizomida" "Scorpiones" "Solifugae" "Uropygi" "Xiphosura" "Limulidae" "Pantopoda" "Chilopoda" "Diplopoda" "Pauropoda" "Symphyla" "Apioceridae" "Apsilocephalidae" "Asilidae" "Bombyliidae" "Evocoidae" "Hilarimorphidae" "Mydidae" "Mythicomyiidae" "Ocoidae" "Scenopinidae" "Therevidae" "Ironomyiidae" "Lonchopteridae" "Phoridae" "Platypezidae" "Sciadoceridae" "Pipunculidae" "Syrphidae" "Australimyzidae" "Braulidae" "Canacidae" "Carnidae" "Chloropidae" "Cryptochetidae" "Inbiomyiidae" "Milichiidae" "Tethinidae" "Conopidae" "Diopsidae" "Megamerinidae" "Nothybidae" "Psilidae" "Somatiidae" "Strongylophthalmyiidae" "Syringogastridae" "Tanypezidae" "Camillidae" "Campichoetidae" "Curtonotidae" "Diastatidae" "Drosophilidae" "Ephydridae" "Celyphidae" "Chamaemyiidae" "Lauxaniidae" "Cypselosomatidae" "Micropezidae" "Neriidae" "Pseudopomyzidae" "Acartophthalmidae" "Agromyzidae" "Anthomyzidae" "Asteiidae" "Aulacigastridae" "Clusiidae" "Fergusoninidae" "Marginidae" "Nannodastiidae" "Neminidae" "Neurochaetidae" "Odiniidae" "Opomyzidae" "Paraleucopidae" "Periscelididae" "Teratomyzidae" "Xenasteiidae" "Coelopidae" "Dryomyzidae" "Helcomyzidae" "Helosciomyzidae" "Heterocheilidae" "Natalimyzidae" "Sciomyzidae" "Sepsidae" "Chyromyidae" "Heleomyzidae" "Mormotomyiidae" "Sphaeroceridae" "Ctenostylidae" "Lonchaeidae" "Pallopteridae" "Piophilidae" "Platystomatidae" "Pyrgotidae" "Richardiidae" "Tephritidae" "Ulidiidae" "Glossinidae" "Hippoboscidae" "Nycteribiidae" "Streblidae" "Anthomyiidae" "Fanniidae" "Muscidae" "Scathophagidae" "Calliphoridae" "Mystacinobiidae" "Oestridae" "Polleniidae" "Rhiniidae" "Rhinophoridae" "Sarcophagidae" "Tachinidae" "Ulurumyiidae" "Atelestidae" "Brachystomatidae" "Dolichopodidae" "Empididae" "Hybotidae" "Acroceridae" "Nemestrinidae" "Stratiomyidae" "Xylomyidae" "Athericidae" "Austroleptidae" "Oreoleptidae" "Pelecorhynchidae" "Rhagionidae" "Tabanidae" "Vermileonidae" "Pantophthalmidae" "Xylophagaidae" "Axymyiidae" "Bibionidae" "Hesperinidae" "Pleciidae" "Pachyneuridae" "Bolitophilidae" "Cecidomyiidae" "Diadocidiidae" "Ditomyiidae" "Keroplatidae" "Mycetophilidae" "Rangomaramidae" "Sciaridae" "Blephariceridae" "Deuterophlebiidae" "Nymphomyiidae" "Ceratopogonidae" "Chironomidae" "Simuliidae" "Thaumaleidae" "Chaoboridae" "Culicidae" "Dixidae" "Psychodidae" "Canthyloscelidae" "Scatopsidae" "Anisopodidae" "Perissommatidae" "Synneuridae" "Trichoceridae" "Ptychopteridae" "Tanyderidae" "Cylindrotomidae" "Limoniidae" "Pediciidae" "Tipulidae")

Here is the Bash script to loop through each taxonomic group and download sequences using CRABS:

for i in "${artnames[@]}"; 
do crabs db_download --source bold --database "$i" --output "bold_${i}.fasta" --keep_original no 
done

This results in a series of fasta files containing sequences for each taxonomic group in the artnames list above. Some are still 0 bytes, but it is because there are truly no sequences available for those groups. (Note: Ultimately, I will transition to an exclusively family-by-family approach as it was sometimes hard to tell which orders that CRABS had trouble downloading, e.g., the flies, for which CRABS only downloaded a small fraction of sequences when queried at the order level. Family lists could be obtained using the databases created by programs such as ncbitax2lin.)

EMBL

Advantages: All invertebrate sequences on EMBL can be downloaded in one complete, versioned batch. So, it comes with none of the problems associated with search queries. This is an advantage over GenBank queries, where missing or idiosyncratic data fields (e.g., COI vs. CO1) make it so that you are never quite sure that you got all sequences. Disadvantages: The drawback is that EMBL files are quite large and have a strange text file (.dat) format. CRABS is a convenient way to download the EMBL .dat files and format them into fasta files that are easy to work with and have a smaller file size. But the database is still huge, even for just one group like invertebrates. How to do it: To chop up the EMBL database and download it in smaller batches I use a Bash “for loop” and subsetting code inspired by this comment thread on the CRABS issue page (see below). It would also work to download the full invertebrate file all at once, but I find the loop method convenient as it produces smaller files and it saves the intermediate steps just in case the job gets interrupted. Downloading and processing EMBL is no small feat. It took me over 24 h of downloading and processing time.

Here is the Bash code to download EMBL invertebrate sequences in batches:

# make a list of INV suffixes
suf=("1." "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19")

# loop through the suffixes to download EMBL in manageable batches
for j in "${suf[@]}"; do

crabs db_download --source embl --database "INV_${j}*" --output "embl_inv_${j}.fasta" --keep_original no

done

GenBank via MIDORI2

Advantages: MIDORI2 is a curated database of selected genes scraped from GenBank (COI, 12S, and 16S are the most useful for metabarcoding). Because it only includes the trimmed gene of interest (e.g., the ~ 650 base pair COI gene) this results in a smaller file size. The sequences also undergo a quality checking procedure so erroneously labeled sequences are less likely to occur compared with direct GenBank downloads. MIDORI2 has the advantage of being released in citable versions that are easily downloaded as a single file and not dependent on search queries. MIDORI2 also contains data for all Eukaryotes, so it is an easy way to include non-Arthropods into your database, which is useful to make contaminant and non-target species “visible”. Disadvantages: MIDORI2 publishes updates only a few times per year, so it isn’t 100% up to date compared with direct GenBank downloads. I haven’t tested whether it is as complete as EMBL or raw GenBank downloads. It is still a relatively large file at ~ 1 GB after importing into CRABS. How to do it: You can download MIDORI2 straight to the cluster using wget and unzip it as illustrated below. Optionally, you can use a program like SeqKit (grep command) to subset to just include Arthropods if desired. From there, I use a sed command to fix the header before importing it into CRABS format:

# download the MIDORI2 COI fasta file
wget https://www.reference-midori.info/download/Databases/GenBank260_2024-04-15/RAW/uniq/MIDORI2_UNIQ_NUC_GB260_CO1_RAW.fasta.gz

# unzip it
gunzip *.fasta.gz

# convert periods into spaces so that crabs can read the accession numbers
sed 's/\./ /g' MIDORI2_UNIQ_NUC_GB260_CO1_RAW.fasta > midori2.fix.fasta" 

# import sequences into crabs                                      
crabs db_import --input midori2.fix.fasta \
--output midori2.fix.crb.fasta --seq_header accession \
--delim ' '

Wrapping up

After all of this data mining, you now possess the raw ore necessary to make reference database gold: that is, fasta files of raw sequences from BOLD, EMBL and GenBank. This ore must next be processed and refined, and then combined and molded into one final polished ingot of a reference database. These processing and refinement steps will be discussed in a future post.

Read about Bald Eagle recovery

Reporter Kyle Bagenstose wrote a nice piece about the recovery of Bald Eagles in New Jersey and Pennsylvania. I was interviewed and quoted in the piece after being referred by a colleague at Rutgers. This allowed me to meet Kyle and hear about his cool plans for more conservation-oriented regional news stories. Good to hear about positive (or any) conservation stories in the news! Burlington County Times, Bucks County Courier.

Here’s an animation of the Bald Eagle recovery based on BBS data:

Bald Eagle.BBSindex_states_1966-2015.gif