Evan Fields

This stream offers two broad projects focused on improving current detection efforts at SecureBio. The first is to characterize when AI-bio or general AI tools are actually useful for large-scale metagenomic detection, including tradeoffs between compute cost, sequencing cost, model type, model size, and pipeline stage. The second is to explore genomic language models as novelty detectors—for example, using perplexity-style metrics to flag surprising sequences—and to evaluate whether this approach can complement traditional bioinformatics systems in a cost-effective, sensitive, and interpretable way.



Stream overview

I'm excited about two broad projects, both focused on being directly useful for current detection efforts:

Map out the cost, feasibility, and promise of applying AI-bio (or just 'regular' AI) tools to large-scale metagenomic detection. Many people (including me) have a general intuition that AI will enable new capabilities in analyzing genomic data. But the metagenomic datasets we generate at SecureBio are very large; for example, we currently generate ~50 billion read pairs per week. Even doing a single forward pass of a moderately sized model for each read could be tremendously expensive. I'm not aware of any serious work characterizing which kinds of AI tools are promising for which kinds or stages of metagenomic pathogen detection.

A sort of "phase diagram" characterizing the tradeoffs between compute cost, sequencing cost, tool types, etc. would be very valuable. Bonus points for extrapolating that diagram into the future or exploring how it changes as a function of model size, model capability, and sequencing cost. The actual math here is mostly arithmetic, but useful arithmetic! It's also arithmetic that requires a fairly deep understanding of both how metagenomic pathogen detection works and how different AI tools work.
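To make the kind of arithmetic involved concrete, here is a minimal Python sketch of one cell of such a diagram: the weekly compute cost of a single forward pass per read pair. Only the ~50 billion read pairs per week comes from the description above. Everything else is an assumption for illustration: the `Scenario` fields, the rule of thumb that a dense transformer forward pass costs roughly 2 × parameters × tokens FLOPs (ignoring attention and I/O overhead), and every numeric value in the example sweep (tokens per read pair, accelerator throughput, utilization, price).

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    read_pairs_per_week: float    # ~50e9 is the figure quoted in the text
    tokens_per_read_pair: float   # ASSUMPTION: depends on read length and tokenizer
    model_params: float           # ASSUMPTION: dense model parameter count
    peak_flops_per_sec: float     # ASSUMPTION: peak throughput of one accelerator
    utilization: float            # ASSUMPTION: fraction of peak actually achieved
    dollars_per_gpu_hour: float   # ASSUMPTION: rental price


def weekly_forward_pass_cost(s: Scenario) -> dict:
    """Rough weekly cost of one forward pass per read pair.

    Uses the rule of thumb FLOPs ~= 2 * parameters * tokens for a dense
    transformer forward pass; attention and I/O overheads are ignored.
    """
    tokens = s.read_pairs_per_week * s.tokens_per_read_pair
    flops = 2.0 * s.model_params * tokens
    gpu_hours = flops / (s.peak_flops_per_sec * s.utilization) / 3600.0
    return {
        "flops": flops,
        "gpu_hours": gpu_hours,
        "dollars": gpu_hours * s.dollars_per_gpu_hour,
    }


if __name__ == "__main__":
    # Placeholder inputs, not measurements: sweep model size, hold the rest fixed.
    for params in (1e7, 1e8, 1e9, 1e10):
        s = Scenario(50e9, 300, params, 1e15, 0.3, 2.0)
        cost = weekly_forward_pass_cost(s)
        print(f"{params:.0e} params: {cost['gpu_hours']:,.0f} GPU-hours, "
              f"${cost['dollars']:,.0f} per week")
```

Because the estimate is linear in model size and read count, sweeping either axis is trivial; the harder part of the project is choosing defensible inputs and deciding which pipeline stages could run the model on only a subset of reads.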

It'd be harder but even more valuable to form a theory of which kinds of tools (general purpose LLMs, protein folding models, genomic language models, multi-track biological foundation models, ...) are in principle most promising for different parts of a metagenomic pathogen detection pipeline.

Many pandemic detection schemes are, broadly, some kind of novelty detection: is this genomic sequence something we haven't seen before, or at least haven't seen before here? At SecureBio we've built a few such systems, but mostly using traditional bioinformatics approaches. It's also possible to use a genomic language model as a novelty detection system, e.g. with a perplexity-based metric: how much does this sequence surprise the model? We haven't deeply explored this detection modality.
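As a rough illustration of what a perplexity-based score looks like, the Python sketch below swaps in a tiny order-k Markov model over A/C/G/T for the genomic language model. This is not a SecureBio system and not a real genomic language model; `KmerMarkovModel`, `flag_novel`, and the threshold are hypothetical names and values chosen for the example, and reads are assumed to contain only A, C, G, and T. Perplexity here is the exponential of the mean negative log-likelihood per predicted base, so higher values mean the read surprised the model more.

```python
import math
from collections import Counter, defaultdict


class KmerMarkovModel:
    """Order-k Markov model with add-one smoothing.

    A deliberately tiny stand-in for a genomic language model.
    """

    ALPHABET = "ACGT"

    def __init__(self, k: int = 3):
        self.k = k
        self.counts = defaultdict(Counter)  # context k-mer -> next-base counts

    def fit(self, sequences):
        for seq in sequences:
            for i in range(self.k, len(seq)):
                self.counts[seq[i - self.k:i]][seq[i]] += 1
        return self

    def log_prob(self, context: str, base: str) -> float:
        seen = self.counts.get(context, {})  # .get avoids creating new entries
        total = sum(seen.values())
        return math.log((seen.get(base, 0) + 1) / (total + len(self.ALPHABET)))

    def perplexity(self, seq: str) -> float:
        n = len(seq) - self.k
        if n <= 0:
            raise ValueError("sequence must be longer than the context size k")
        nll = -sum(self.log_prob(seq[i - self.k:i], seq[i])
                   for i in range(self.k, len(seq)))
        return math.exp(nll / n)


def flag_novel(reads, model, threshold):
    """Return (read, perplexity) pairs whose perplexity exceeds the threshold."""
    scored = ((read, model.perplexity(read)) for read in reads)
    return [(read, ppl) for read, ppl in scored if ppl > threshold]


if __name__ == "__main__":
    model = KmerMarkovModel(k=3).fit(["ACGT" * 50])  # toy "previously seen" data
    reads = ["ACGT" * 10, "AAAAAAAAAACCCCCCCCCC"]
    for read, ppl in flag_novel(reads, model, threshold=2.0):  # arbitrary cutoff
        print(f"flagged {read} (perplexity {ppl:.2f})")
```

With a real genomic language model the per-token log-probabilities would come from the model's output distribution instead of k-mer counts, but the scoring and thresholding logic has the same shape.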

The project I'm imagining starts with setting up such a detection system, then working to tune and characterize:

What kinds of sequences does the system flag? Are there biological patterns evident? Can we discover anything new and unexpected?
How does the system respond to reads flagged by other detection systems? Can it corroborate other flags?
What's the signal-to-noise ratio? How can we improve sensitivity and specificity?
How cost-effective is the system? How can we enrich inputs without losing too much sensitivity?
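One way to start on the sensitivity and specificity question is to sweep the flagging threshold against proxy labels, for instance whether a read was also flagged by another detection system. The Python sketch below assumes such labels exist as booleans; the function names and the toy scores are illustrative only, and proxy labels of this kind are not ground truth.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when reads with score >= threshold are flagged.

    `labels` are proxy booleans (True = should be flagged), e.g. agreement
    with another detection system; they are an assumption, not ground truth.
    """
    tp = fp = tn = fn = 0
    for score, is_positive in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_positive:
            tp += 1
        elif flagged:
            fp += 1
        elif is_positive:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


def sweep(scores, labels, thresholds):
    """Return (threshold, sensitivity, specificity) for each threshold."""
    return [(t, *sensitivity_specificity(scores, labels, t)) for t in thresholds]


if __name__ == "__main__":
    toy_scores = [1, 2, 3, 4]                 # made-up novelty scores
    toy_labels = [False, False, True, True]   # made-up proxy labels
    for t, sens, spec in sweep(toy_scores, toy_labels, [1.5, 2.5, 3.5]):
        print(f"threshold {t}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

The same sweep, run on enriched versus unenriched inputs, gives a direct view of how much sensitivity an enrichment step costs.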

This is very much an exploratory project: I don't have a cleanly labeled set of exactly the reads that should or should not be flagged; the question is to explore whether and how this kind of detection scheme provides a valuable complement to more traditional approaches.

---

Note that by the time MATS starts, one of these projects will likely be underway in a parallel fellowship.

Mentors

Evan Fields

SecureBio

Senior research scientist

Boston


Biorisk

Pathogen Detection

Evan is a research scientist at SecureBio Detection, where he works on computational pipelines and detection methods for uncovering threats in deep metagenomic sequencing data. Prior to transitioning his career into biosecurity, he was the VP of data science and engineering at Zoba, a startup providing optimization services to the shared mobility industry. He holds a PhD in operations research and, in his free time, devotes lots of thought cycles to sourdough pizza.

Mentorship style

By default, we'll mostly collaborate via a standing weekly meeting (~1 hour), wherein we'll discuss recent progress and next directions. I'm available via Slack for quick back-and-forth on ideas, sanity checks, and unblocking (data access, etc.), but will rely on the fellow to manage their own implementations, code review, debugging, etc.

Fellows we are looking for

Essential:

Interested in biosecurity on its own merits, not just as a flavor of AI safety
Willing to do a "direct work first and foremost, maybe if we're lucky there's a publishable research output" project
Interest and ability to own a fairly self-directed exploratory project: acting as both product manager and research fellow
Quantitative "white-box" thinking about AI-bio models: not just what they do, but how they do it, and what this implies about how they can be used for bio-surveillance
Ability to write, evaluate, tweak, and debug AI pipelines (i.e. the structure around a model) in a language of your choice

Preferred:

Prior experience with any of biosecurity, public health, biosurveillance, metagenomics, virology, wastewater epidemiology, nucleic acid sequencing, genomic language models, protein language models, etc.
Capable of fine-tuning a genomic language model and/or training heads
Experience working with large volumes of data, biological or otherwise, and therefore comfortable with things like streaming algorithms, minimizing data movement costs, and pipeline efficiency

Not a good fit:

Fellows most interested in interpretability and/or published research

Project selection

I'll determine which of the two broad project ideas we run with based on SecureBio Detection's needs, which fellows are matched with me, etc. Within that broad project, I'll guide based on what I think is helpful, interesting, and relevant to SecureBio Detection, and I expect the fellow to have both the autonomy and the responsibility to pick concrete work directions.