I’ll be presenting a poster at the Neobiota conference in Portugal. My topic will be some habitat modeling work conducted with European colleagues over the past few months. The purpose of the work is to inform invasion risk of the Chinese Pond Mussel (Sinanodonta woodiana) in North America. The species was first discovered to occur in North America in the Delaware Watershed, New Jersey, USA in 2010. (See picture caption for more info.) This project is made possible through a grant from the National Fish and Wildlife Foundation. Here is a link to the poster.
Forging a metabarcoding reference database for Arthropods. Part 1: mining the ore
[My goal is this series of posts is to help share some of the eDNA knowledge I’ve accumulated over the past few years. Hopefully, it will be useful to students or others just starting out. Here, I walk through the first steps of creating a DNA reference database for Arthropod metabarcoding: downloading the raw sequences…]
Raw DNA sequences in public databases are the ore that is mined and then refined into metabarcoding reference database gold: the final trimmed and labeled amplicon sequences used to attach species names to unknown DNA sequences. Making a good DNA sequence reference database for a diverse group like Arthropods is a challenge. Given that there are many ways to do anything in bioinformatics, this posts describes one way to download and format raw Arthropod reference sequences for metabarcoding. A variety of reference database curation software exists. This solution mainly makes use of CRABS software and was implemented on a Linux cluster.
For amplicons in the COI mitochondrial region, it is prudent to get data from both the Barcode of Life Database (BOLD) AND one of the large genetic repositories like GenBank or EMBL. I lump GenBank and EMBL as these sources regularly share sequences and overlap extensively. BOLD also shares data with GenBank etc. but there are many additional COI sequences hiding in BOLD that are not found in other databases. Each of these data sources has idiosyncrasies when it comes to the downloading process. Here, I use a combination of BOLD, EMBL, and GenBank (MIDORI2) data. If a quicker solution is desired, you could probably omit EMBL. I’ll start with BOLD, which primarily contains COI sequences as this is the official DNA barcode region for animals.
BOLD
Advantages: BOLD is a high quality ore, one with accurate taxonomic labels backed by actual reference specimens. Disadvantages: BOLD has a fickle search feature that doesn’t recognize all of the taxonomic groups that work when searching in NCBI/GenBank. For example, the web-based search returns zero sequences for some orders like Mantophasmatodea, but it works when you use the one family within that order, Mantophasmatidae. One could try downloading all Arthropoda at once, but 1) on the BOLD website it produces a very large file that you then usually need to upload to a cluster for processing; 2) I have had strange results when attempting to download all of Arthropoda from BOLD via other programs straight to the cluster (e.g., returning only ~4000 sequences). How to do it: The way I have found to conquer this issue is to use CRABS to download sequences order-by-order, check for cases with download troubles (e.g., zero or few sequences), and then do those cases family-by-family. The end result is a list of taxonomic names covering all of Arthropoda that seem to work for BOLD. I then loop through this list using a BASH script to download the sequences into individual fasta files. This hodgepodge list of order and family names is shown below.
artnames=("Entomobryomorpha" "Neelipleona" "Poduromorpha" "Symphypleona" "Rhabdura" "Dicellurata" "Acerentomata" "Eosentomata" "Sinentomata" "Zygentoma" "Archaeognatha" "Lepidoptera" "Trichoptera" "Coleoptera" "Diptera" "Hymenoptera" "Mecoptera" "Megaloptera" "Neuroptera" "Raphidioptera" "Siphonaptera" "Strepsiptera" "Hemiptera" "Psocodea" "Thysanoptera" "Dermaptera" "Dictyoptera" "Embioptera" "Grylloblattidae" "Mantophasmatodea" "Austrophasmatidae" "Mantophasmatidae" "Orthoptera" "Phasmatodea" "Plecoptera" "Zoraptera" "Ephemeroptera" "Odonata" "Anomopoda" "Ctenopoda" "Haplopoda" "Onychopoda" "Cyclestherida" "Laevicaudata" "Spinicaudata" "Notostraca" "Anostraca" "Brachypoda" "Copepoda" "Tantulocarida" "Deoterthridae" "Microdajidae" "Amphionidacea" "Decapoda" "Euphausiacea" "Stomatopoda" "Amphipoda" "Cumacea" "Ingolfiellida" "Isopoda" "Mictacea" "Mysidacea" "Spelaeogriphacea" "Tanaidacea" "Thermosbaenacea" "Anaspidacea" "Bathynellacea" "Leptostraca" "Dendrogastrida" "Laurida" "Cryptophialida" "Lithoglyptida" "Chthamalophilidae" "Clistosaccidae" "Lernaeodiscidae" "Mycetomorphidae" "Peltogastridae" "Polyascidae" "Polysaccidae" "Sacculinidae" "Sylonidae" "Thompsoniidae" "Iblomorpha" "Balanomorpha" "Calanticomorpha" "Pollicipedomorpha" "Scalpellomorpha" "Verrucomorpha" "Facetotecta" "Hansenocaris" "Branchiura" "Cephalobaenida" "Porocephalida" "Raillietiellida" "Reighardiida" "Mystacocaridida" "Halocyprida" "Myodocopida" "Platycopida" "Podocopida" "Nectiopoda" "Sarcoptiformes" "Trombidiformes" "Opilioacariformes" "Opilioacaridae" "Holothyrida" "Ixodida" "Mesostigmata" "Araneae" "Opiliones" "Amblypygi" "Palpigradi" "Pseudoscorpiones" "Ricinulei" "Schizomida" "Scorpiones" "Solifugae" "Uropygi" "Xiphosura" "Limulidae" "Pantopoda" "Chilopoda" "Diplopoda" "Pauropoda" "Symphyla" "Apioceridae" "Apsilocephalidae" "Asilidae" "Bombyliidae" "Evocoidae" "Hilarimorphidae" "Mydidae" "Mythicomyiidae" "Ocoidae" "Scenopinidae" "Therevidae" "Ironomyiidae" "Lonchopteridae" "Phoridae" "Platypezidae" "Sciadoceridae" "Pipunculidae" "Syrphidae" "Australimyzidae" "Braulidae" "Canacidae" "Carnidae" "Chloropidae" "Cryptochetidae" "Inbiomyiidae" "Milichiidae" "Tethinidae" "Conopidae" "Diopsidae" "Megamerinidae" "Nothybidae" "Psilidae" "Somatiidae" "Strongylophthalmyiidae" "Syringogastridae" "Tanypezidae" "Camillidae" "Campichoetidae" "Curtonotidae" "Diastatidae" "Drosophilidae" "Ephydridae" "Celyphidae" "Chamaemyiidae" "Lauxaniidae" "Cypselosomatidae" "Micropezidae" "Neriidae" "Pseudopomyzidae" "Acartophthalmidae" "Agromyzidae" "Anthomyzidae" "Asteiidae" "Aulacigastridae" "Clusiidae" "Fergusoninidae" "Marginidae" "Nannodastiidae" "Neminidae" "Neurochaetidae" "Odiniidae" "Opomyzidae" "Paraleucopidae" "Periscelididae" "Teratomyzidae" "Xenasteiidae" "Coelopidae" "Dryomyzidae" "Helcomyzidae" "Helosciomyzidae" "Heterocheilidae" "Natalimyzidae" "Sciomyzidae" "Sepsidae" "Chyromyidae" "Heleomyzidae" "Mormotomyiidae" "Sphaeroceridae" "Ctenostylidae" "Lonchaeidae" "Pallopteridae" "Piophilidae" "Platystomatidae" "Pyrgotidae" "Richardiidae" "Tephritidae" "Ulidiidae" "Glossinidae" "Hippoboscidae" "Nycteribiidae" "Streblidae" "Anthomyiidae" "Fanniidae" "Muscidae" "Scathophagidae" "Calliphoridae" "Mystacinobiidae" "Oestridae" "Polleniidae" "Rhiniidae" "Rhinophoridae" "Sarcophagidae" "Tachinidae" "Ulurumyiidae" "Atelestidae" "Brachystomatidae" "Dolichopodidae" "Empididae" "Hybotidae" "Acroceridae" "Nemestrinidae" "Stratiomyidae" "Xylomyidae" "Athericidae" "Austroleptidae" "Oreoleptidae" "Pelecorhynchidae" "Rhagionidae" "Tabanidae" "Vermileonidae" "Pantophthalmidae" "Xylophagaidae" "Axymyiidae" "Bibionidae" "Hesperinidae" "Pleciidae" "Pachyneuridae" "Bolitophilidae" "Cecidomyiidae" "Diadocidiidae" "Ditomyiidae" "Keroplatidae" "Mycetophilidae" "Rangomaramidae" "Sciaridae" "Blephariceridae" "Deuterophlebiidae" "Nymphomyiidae" "Ceratopogonidae" "Chironomidae" "Simuliidae" "Thaumaleidae" "Chaoboridae" "Culicidae" "Dixidae" "Psychodidae" "Canthyloscelidae" "Scatopsidae" "Anisopodidae" "Perissommatidae" "Synneuridae" "Trichoceridae" "Ptychopteridae" "Tanyderidae" "Cylindrotomidae" "Limoniidae" "Pediciidae" "Tipulidae")
Here is the Bash script to loop through each taxonomic group and download sequences using CRABS:
for i in "${artnames[@]}";
do crabs db_download --source bold --database "$i" --output "bold_${i}.fasta" --keep_original no
done
This results in a series of fasta files containing sequences for each taxonomic group in the artnames list above. Some are still 0 bytes, but it is because there are truly no sequences available for those groups. (Note: Ultimately, I will transition to an exclusively family-by-family approach as it was sometimes hard to tell which orders that CRABS had trouble downloading, e.g., the flies, for which CRABS only downloaded a small fraction of sequences when queried at the order level. Family lists could be obtained using the databases created by programs such as ncbitax2lin.)
EMBL
Advantages: All invertebrate sequences on EMBL can be downloaded in one complete, versioned batch. So, it comes with none of the problems associated with search queries. This is an advantage over GenBank queries, where missing or idiosyncratic data fields (e.g., COI vs. CO1) make it so that you are never quite sure that you got all sequences. Disadvantages: The drawback is that EMBL files are quite large and have a strange text file (.dat) format. CRABS is a convenient way to download the EMBL .dat files and format them into fasta files that are easy to work with and have a smaller file size. But the database is still huge, even for just one group like invertebrates. How to do it: To chop up the EMBL database and download it in smaller batches I use a Bash “for loop” and subsetting code inspired by this comment thread on the CRABS issue page (see below). It would also work to download the full invertebrate file all at once, but I find the loop method convenient as it produces smaller files and it saves the intermediate steps just in case the job gets interrupted. Downloading and processing EMBL is no small feat. It took me over 24 h of downloading and processing time.
Here is the Bash code to download EMBL invertebrate sequences in batches:
# make a list of INV suffixes
suf=("1." "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19")
# loop through the suffixes to download EMBL in manageable batches
for j in "${suf[@]}"; do
crabs db_download --source embl --database "INV_${j}*" --output "embl_inv_${j}.fasta" --keep_original no
done
GenBank via MIDORI2
Advantages: MIDORI2 is a curated database of selected genes scraped from GenBank (COI, 12S, and 16S are the most useful for metabarcoding). Because it only includes the trimmed gene of interest (e.g., the ~ 650 base pair COI gene) this results in a smaller file size. The sequences also undergo a quality checking procedure so erroneously labeled sequences are less likely to occur compared with direct GenBank downloads. MIDORI2 has the advantage of being released in citable versions that are easily downloaded as a single file and not dependent on search queries. MIDORI2 also contains data for all Eukaryotes, so it is an easy way to include non-Arthropods into your database, which is useful to make contaminant and non-target species “visible”. Disadvantages: MIDORI2 publishes updates only a few times per year, so it isn’t 100% up to date compared with direct GenBank downloads. I haven’t tested whether it is as complete as EMBL or raw GenBank downloads. It is still a relatively large file at ~ 1 GB after importing into CRABS. How to do it: You can download MIDORI2 straight to the cluster using wget and unzip it as illustrated below. Optionally, you can use a program like SeqKit (grep command) to subset to just include Arthropods if desired. From there, I use a sed command to fix the header before importing it into CRABS format:
# download the MIDORI2 COI fasta file
wget https://www.reference-midori.info/download/Databases/GenBank260_2024-04-15/RAW/uniq/MIDORI2_UNIQ_NUC_GB260_CO1_RAW.fasta.gz
# unzip it
gunzip *.fasta.gz
# convert periods into spaces so that crabs can read the accession numbers
sed 's/\./ /g' MIDORI2_UNIQ_NUC_GB260_CO1_RAW.fasta > midori2.fix.fasta"
# import sequences into crabs
crabs db_import --input midori2.fix.fasta \
--output midori2.fix.crb.fasta --seq_header accession \
--delim ' '
Wrapping up
After all of this data mining, you now possess the raw ore necessary to make reference database gold: that is, fasta files of raw sequences from BOLD, EMBL and GenBank. This ore must next be processed and refined, and then combined and molded into one final polished ingot of a reference database. These processing and refinement steps will be discussed in a future post.
New paper! Repurposing landfills for grassland biodiversity
Landfills are a common landscape feature in the Anthropocene. Despite their unfortunate environmental impacts, they can also make great sites for restoring grassland ecosystems, inspiring visitors and providing homes for grassland-associated plants and animals. In a new commentary in Ecological Restoration, I discuss this silver lining in the context of landfills in the Mid-Atlantic US, and specifically a recent re-restoration of an important landfill grassland in Croton, NY. The site, which was championed by both Roger Tory Peterson and Pete Seeger, was restored in the 1990s and again in recent years (see Webster, 2021, same issue). The map in the paper was based the EPA’s LMOP database. You can explore the data and get the code to make the map here: https://github.com/mikeallen-eco/NEUS_landfills
Allen, M. C. (2021). Landfills as islands of grassland biodiversity: placing a next-generation habitat restoration plan in context. Ecological Restoration, 39:284-287. (link)
Following the eDNA trail of the Spotted Lanternfly
All organisms leave behind a trail of shed DNA in the environment. Such environmental DNA (eDNA) has proven a powerful and convenient way to sample for aquatic organisms, but efforts to move eDNA onto land are increasingly gaining steam (e.g., see here). Terrestrial animals like insects also leave behind a DNA trail on vegetation, especially sap-sucking insects like the invasive Spotted Lanternfly (pictured above top-middle on a grape vine).
How easy is insect DNA to detect on foliage? Can it be useful for monitoring? How long does it last on vegetation? The answers to these questions are “very!”, “yes!”, and “about 1 week!”. We recently published 2 open access papers on the topic related to invasive insects in the journal Environmental DNA. These describe: 1) a head-to-head comparison in which we show eDNA surveys outperform visual surveys for Spotted Lanternfly in vineyards; and 2) a series of experiments showing Brown-marmorated Stink Bug DNA can persist on foliage in a detectable form for >1 week in the absence of rainfall.
The Spotted Lanternfly paper: http://doi.org/10.1002/edn3.2
Stinkbug DNA degradation paper: http://doi.org/10.1002/edn3.229
How will we monitor the ecological impacts of offshore wind in North America?
Thousands of wind turbines are coming to the Atlantic continental shelf in the next few decades. What will the ecological monitoring at these installations look like? How will we know what effect they have on the marine ecosystem?
Well… I’ve now finished my 2-month gap period researching and preparing a report on this subject for the NJ Climate Change Alliance. The report, prepared with my colleague Matt Campo from the Bloustein School of Planning and Public Policy (and reviewed by many very smart people!) was just released. You can download it here. It details the policy and methodological aspects of ecological monitoring at offshore wind farms in the eastern U.S. (so far) and in three countries in Europe with three decades of offshore wind experience.
The report: https://doi.org/doi:10.7282/t3-wn1p-cz80
I’m now starting my second postdoc in the Lockwood Lab researching the intersection between spatial habitat modeling of coastal birds and coastal resilience planning.
Finished Environmental DNA postdoc; working on offshore energy
I officially finished up my eDNA postdoc in the Lockwood Lab last month and will be starting a second one (in coastal conservation; same lab) in mid-May. In the meantime, I’ve become involved in a short project related to ecological monitoring efforts at offshore wind turbines.
The US continental shelf is poised to increase from its current 6 offshore wind turbines (Block Island Wind Farm) to more than 6000 in a very short time period, on the order of one decade. That number could quadruple again by 2050. I’ve joined a 2-month project with the Rutgers Bloustein School of Planning and Public Policy to research options to guide and inform environmental monitoring approaches that will accompany this offshore energy boom. It will be a win for carbon emissions reduction and a real (but achievable) challenge to make sure it is done in an environmentally sound manner. Read more about the coming offshore wind rush here and here.
Starting a new Postdoc: environmental DNA
On August 1st I’ll be starting a 1+ year postdoc with Dr. Julie Lockwood’s lab at Rutgers University. There I’ll be studying detection and sampling design issues associated with eDNA surveys for invasive and native animals. I’ll also be continuing my work on various conservation issues.
Read about Bald Eagle recovery
Reporter Kyle Bagenstose wrote a nice piece about the recovery of Bald Eagles in New Jersey and Pennsylvania. I was interviewed and quoted in the piece after being referred by a colleague at Rutgers. This allowed me to meet Kyle and hear about his cool plans for more conservation-oriented regional news stories. Good to hear about positive (or any) conservation stories in the news! Burlington County Times, Bucks County Courier.
Here’s an animation of the Bald Eagle recovery based on BBS data:
A fun local conference (and nicely designed poster)
To Maine for the USDA GradCAP program meeting
I spent 3 days in Portland, Maine last week with a great group of grad students and mentors from the USDA, Manomet, Maine DEP, Gulf of Maine Research Institute, and UMaine. We spent a day thinking and learning about climate adaptation in agriculture, forestry and aquaculture, and drafting a paper about what is needed to enable new leaders to tackle these challenges. Again, I definitely recommend applying for this program if you are at a land grant University in the Northeast and your work deals with climate. Learned a lot, good opportunities, and relatively low time commitment. Here is a link to more info: https://www.climatehubs.oce.usda.gov/hubs/northeast/topic/gradcap And here is a link to the info sheet for the webinar I gave as part of the program: https://www.climatehubs.oce.usda.gov/sites/default/files/allen_final_web508.pdf
Here’s a news snippet with more info.