The Chan Zuckerberg Initiative is committed to leveraging our core principles of collaboration, open science and diversity, equity and inclusion to realize a better future for all. In doing so, we build tools with dedicated technology teams and fund technology development to accelerate progress in science and education. From software development to research analysis, CZI’s “Tech Talks” series features members of our technology teams working to develop tools that drive innovation and create impact.

In this edition of Tech Talks, we speak to Katrina Kalantar, a computational biologist on the CZI Infectious Disease team working closely with scientists, engineers, and the Chan Zuckerberg ID (CZ ID) product team to help develop tools that address challenges associated with infectious diseases. Having studied Computer Science in undergrad, Katrina always had a knack for science. However, it was during her PhD in Bioengineering that she became passionate about infectious diseases. During her program, she joined a project piloting the use of metagenomic next-generation sequencing (mNGS) to identify unknown causes of meningitis in pediatric patients in Bangladesh. Through this work, Katrina became inspired to build technologies that local communities can implement to improve public health.

Katrina offered us an even deeper look at the CZ ID technology she helps build.

One thing I’ve realized about working at CZI is that solving technical problems is really only part of the work. Making those solutions available, interpretable, and easy to use helps researchers readily realize the impacts of the work.

What does it mean to be a computational biologist, especially at CZI?

My team is responsible for collaborating with scientists and providing the research backbone to support the ongoing development of tools. I’m responsible for understanding the questions that scientists are seeking to answer and the technical tools that can help answer them, and for working with the team to include these in our products.

At CZI in particular, this means staying in close contact with scientists externally, while also working closely with many different cross-functional partners including engineers, designers, and product managers. The role is extremely collaborative!

Tell us about the CZI technology you are currently working on.

I’m currently working on Chan Zuckerberg ID (CZ ID), an open source cloud-based pipeline and analysis service for metagenomic pathogen detection. The CZ ID pipeline is composed of numerous steps that process raw sequencing data and enable scientists to identify trends regarding what pathogens, such as viruses, bacteria, and parasites, might be present in their data.

The entire process, from sample to result, extends beyond the software technology. The wet lab portion consists of several laboratory steps required to generate the raw data that goes into CZ ID. First you must obtain a sample, whether it is from an organism, a patient, or the environment, and extract the DNA. Next, the sample must be prepared for sequencing and loaded onto the sequencing machine. For many new labs these steps can be complex, but the Chan Zuckerberg Biohub has supported training for grantees around the world, so that scientists in their home countries can use this technology to answer many different scientific questions that matter to their communities.

Once a sample is sequenced, the data comes out as files containing millions of individual DNA sequences. Each of these sequences comes from an organism, and by determining the identity of each sequence, we can build a picture of what pathogens might have been present in the original sample.

My work focuses on the dry lab side and begins as the CZ ID pipeline takes the raw data and starts identifying which organism each individual DNA sequence belongs to. The pipeline itself is composed of many steps that process raw sequencing data, first performing quality control and then mapping the sequences to huge databases of all known organisms (from viruses to tapeworms) to identify the organisms from which the sequences originated.
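
As a rough illustration of the two stages described above (quality control, then matching each sequence against reference organisms), here is a minimal Python sketch. It is not CZ ID's implementation: the real pipeline relies on established open source bioinformatics tools and very large reference databases. The two-entry `REFERENCES` dictionary, the organism names, the thresholds, and the exact k-mer voting scheme are all assumptions made up for this example.

```python
from collections import Counter

# Hypothetical toy "database": organism name -> short reference sequence.
# Real reference databases hold sequences for all known organisms.
REFERENCES = {
    "virus_A": "ATGCGTACGTTAGCCGATAGGCTA",
    "bacterium_B": "TTGACCGGTAACCTGAGTCCATGA",
}

K = 8  # k-mer length used for matching (arbitrary choice for this sketch)


def mean_quality(quality_string):
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding)."""
    return sum(ord(c) - 33 for c in quality_string) / len(quality_string)


def passes_qc(sequence, quality_string, min_length=12, min_mean_quality=20):
    """Quality control: keep reads that are long enough and well sequenced."""
    if len(sequence) < min_length or len(sequence) != len(quality_string):
        return False
    return mean_quality(quality_string) >= min_mean_quality


def kmers(sequence, k=K):
    """Set of all substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_index(references, k=K):
    """Map each k-mer to the set of reference organisms that contain it."""
    index = {}
    for organism, genome in references.items():
        for kmer in kmers(genome, k):
            index.setdefault(kmer, set()).add(organism)
    return index


def assign_read(sequence, index, k=K):
    """Return the organism sharing the most k-mers with the read, or None."""
    votes = Counter()
    for kmer in kmers(sequence, k):
        for organism in index.get(kmer, ()):
            votes[organism] += 1
    return votes.most_common(1)[0][0] if votes else None


def summarize(reads, references=REFERENCES):
    """Count reads per organism, plus reads that failed QC or matched nothing."""
    index = build_index(references)
    counts = Counter()
    for sequence, quality in reads:
        if not passes_qc(sequence, quality):
            counts["failed_qc"] += 1
            continue
        counts[assign_read(sequence, index) or "unassigned"] += 1
    return counts


# Four made-up reads as (sequence, quality string) pairs.
EXAMPLE_READS = [
    ("GTACGTTAGCCGATAG", "I" * 16),  # high quality, drawn from virus_A
    ("CCGGTAACCTGAGTCC", "I" * 16),  # high quality, drawn from bacterium_B
    ("GTACGTTAGCCGATAG", "#" * 16),  # very low quality, removed by QC
    ("AAAAAAAAAAAAAAAA", "I" * 16),  # matches nothing in the toy database
]

if __name__ == "__main__":
    for label, n in summarize(EXAMPLE_READS).items():
        print(label, n)
```

The per-organism tallies returned by `summarize` are a small-scale analogue of the kind of summary a scientist would explore for a sample. Exact k-mer matching is used here only because it is compact; alignment-based tools tolerate sequencing errors and divergence between a read and its closest reference, which this sketch does not.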

The CZ ID web application makes it possible for scientists to easily explore and understand their data and to rapidly identify trends regarding what pathogens may be present in their data. This information can be used to improve knowledge of infectious diseases, and in turn, public health.

What problem are you looking to solve?

Our team aims to enable infectious disease researchers and public health departments to independently determine the identity, origin and spread of infectious disease. This allows our partners to combat endemic diseases, outbreaks, and sources of undiagnosed illness in their communities.

We do this in two ways. Firstly, our capacity building efforts include training scientists across the world on how to collect raw samples, prepare them and interpret results. Secondly, through technology and tool building, we are focused on lowering the barrier to analysis and making data generated from cutting-edge technologies broadly accessible.

One of the biggest ways that the work we’re doing can scale is in providing an unbiased view of the pathogens circulating in a region or population. This type of information can help clinics, public health departments and other partners prioritize resources in a way that effectively limits the spread of infectious diseases. For example, our partners in Cambodia set up a cohort to sequence samples from patients with febrile illnesses, meaning illnesses marked by fever. The mNGS results identified a number of organisms that weren’t previously thought to exist in the region. They could then use this information when planning local infectious disease interventions.

Why does this problem resonate with you?

I’ve always been interested in the intersection of technology and biology, and more specifically, human health. I first realized this during undergrad when I learned that there was a field that combined the technical aspects of computer science with biological applications.

The challenges particular to identifying and tracking infectious disease transmission interest me because of the immediacy of their impact. Infections can become life-threatening in a matter of days. Additionally, collaborating with my CZI team, CZ Biohub colleagues and partners around the world keeps me on my toes, as the challenges in each region are unique. There is always another interesting question to answer.

How have you been involved in advancing the technology on your team?

When I joined the team, there was really just one pipeline for metagenomics. However, when the pandemic came, everyone in the research community for infectious diseases switched their priority to COVID-19, and our team responded by supporting the addition of workflows to analyze COVID-19 genomes. Since July 2020, over 50,000 SARS-CoV-2 genomes have been built in CZ ID by scientists from over 30 countries. It has also been really interesting to see how the technology that we added to support COVID genomes also works for other viruses. In fact, since making this feature available for other viruses, we’ve seen scientists build ~1,000 additional virus genomes to understand other viruses circulating in their regions, and we know there is potential for many more genomes to be analyzed. Being able to use what we’ve already built for COVID-19 for other viruses has been a new direction for the CZ ID team.

Chan Zuckerberg ID is an open source software platform that helps scientists worldwide identify pathogens in metagenomic sequencing data.

Can you share a little bit about the technologies used to build Chan Zuckerberg ID?

The CZ ID metagenomics pipeline relies heavily on open source bioinformatics tools for data quality control, assembly, and alignment that all together make it possible to identify the organism of origin for each individual sequence in a sample containing millions of sequences. The team’s goal is to make these tools easy to use for laboratories where the necessary resources may be lacking.

Metagenomics and sequencing have been around for a little while, but this technology is emerging because of the accessibility provided by the portability of certain sequencers. Being able to apply it to new problem spaces where it hasn’t quite been accessible will make a meaningful impact.

How are you seeing this tool make a difference in science?

This tool is enabling scientists with limited computational experience to leverage more modern technologies in their own infectious disease research and work. By lowering the barrier to analysis, scientists with a variety of infectious disease research questions are able to generate impact in their particular domains — from identification of novel viruses to routine surveillance. We’re seeing that scientists globally have many different research questions and by putting the tools in their hands, they are each moving the needle for their communities in a meaningful way.

Another example is from the pilot study in Bangladesh that I joined during my PhD program. The Bangladesh team, like most laboratories, could typically identify a pathogen in only ~50% of meningitis cases using all of the available clinical diagnostics, which severely limited the ability to treat patients. By applying mNGS to samples from patients with meningitis of unknown origin, we found that Chikungunya virus, which was not previously known to cause meningitis, was in fact causing meningitis in children. The scientists could then develop a low-cost test to search for Chikungunya in other patient samples more broadly.

There are lots of stories on the public health impact in local communities that have adopted the use of mNGS into their infectious disease research. Plus, as I previously mentioned, the tools we added to support COVID genomes have provided valuable insights to researchers throughout the pandemic and continue to extend those impacts for other viruses.

How can this technology make a difference one year from now, 10 years from now, or even 50-100 years from now?

To date, CZ ID has been used by scientists in over 70 countries to process over 100,000 metagenomic samples. One year from now, we can envision that labs in even more countries will be using this technology to process samples more regularly. The increase in sample processing using CZ ID will continue to shape local public health responses, as it has during the COVID-19 pandemic. The most immediate impacts are at a local scale: a reduced burden of undiagnosed infections, more targeted public health responses, and a better general understanding of the relationship between pathogens and their hosts.

10 years from now, it is possible that this technology will unlock targeted allocation of resources to mitigate infectious disease challenges earlier. You can also imagine that as more laboratories around the world integrate unbiased pathogen detection tools into their workflows, they will be able to identify potential infectious disease threats earlier.

On the 50-100 year horizon, groups around the world will continue to leverage information about how pathogens move to predict and respond to challenges.
