

What is genome annotation?
Genome annotation is the process of finding and designating locations of individual genes and other features on raw DNA sequences, called assemblies. Annotation gives meaning to a given sequence and makes it much easier for researchers to view and analyze its contents. To visualize what annotation adds to our understanding of the sequence, you can compare the raw sequence (in FASTA format) with the GenBank or Graphics formats, both of which contain annotations. In both instances note the placement of individual genes and other features on the sequence. When a group of researchers assembles a genome, they may also annotate it at the same time, using processes they establish themselves. In the past, an assembly with annotation was known as a build. These days, the term build is rarely used, as the genome assembly process and its annotation process are often completely uncoupled. They can be conducted at different times by different parties. For example, the Genome Reference Consortium (GRC) maintains and updates the human reference assembly. GRC releases assembly (sequence) updates and deposits these to the International Nucleotide Sequence Database Collaboration (INSDC) without annotation. GRC prepared the latest major assembly update (major release designated as GRCh38) in December 2013 and has since followed with several minor updates (patches). In further processing of an assembly update, the NCBI staff creates a RefSeq version of the submitted INSDC assembly. Following that, NCBI annotates the RefSeq version of the assembly. Each annotation release has its own designation and time stamp. For example, the latest (as of August 2023) NCBI annotation release is designated as GCF_000001405.40-RS_2023_03. In addition to the human reference genome, NCBI staff annotate numerous eukaryotic genomes via the powerful Eukaryotic Genome Annotation Pipeline.
Visit the Eukaryotic Genome Annotation at NCBI page to start exploring extensive documentation on the annotation process, and to follow the progress of individual genome annotation. NCBI staff have also developed the Prokaryotic Genome Annotation Pipeline that is available as a service to GenBank submitters and also as a stand-alone software package.
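As a rough illustration of how the release designation quoted above could be taken apart, the following sketch splits GCF_000001405.40-RS_2023_03 into an assembly accession, an assembly version, and a time stamp. The pattern is inferred only from that single example, and reading the trailing digits as year and month is an assumption; the names `AnnotationRelease` and `parse_release` are hypothetical and not part of any NCBI tool.

```python
import re
from typing import NamedTuple


class AnnotationRelease(NamedTuple):
    accession: str  # assembly accession without its version, e.g. "GCF_000001405"
    version: int    # assembly version, the number after the dot
    year: int       # assumed: first part of the time stamp
    month: int      # assumed: second part of the time stamp


# Assumed pattern, inferred from the one designation quoted in the text.
_RELEASE_RE = re.compile(r"^(GCF_\d+)\.(\d+)-RS_(\d{4})_(\d{2})$")


def parse_release(designation: str) -> AnnotationRelease:
    """Split a release designation into its apparent components."""
    match = _RELEASE_RE.match(designation)
    if match is None:
        raise ValueError(f"unrecognized release designation: {designation!r}")
    accession, version, year, month = match.groups()
    return AnnotationRelease(accession, int(version), int(year), int(month))


release = parse_release("GCF_000001405.40-RS_2023_03")
print(release)
```

A designation that does not fit the assumed pattern raises `ValueError` rather than being guessed at, which seems the safer choice given that the pattern rests on one example.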
An Introduction to Genome Annotation
Affiliations.
- 1 Eccles Institute of Human Genetics, University of Utah, Salt Lake City, Utah.
- 2 USTAR Center for Genetic Discovery, University of Utah, Salt Lake City, Utah.
- PMID: 26678385
- DOI: 10.1002/0471250953.bi0401s52
Genome projects have evolved from large international undertakings to tractable endeavors for a single lab. Accurate genome annotation is critical for successful genomic, genetic, and molecular biology experiments. These annotations can be generated using a number of approaches and available software tools. This unit describes methods for genome annotation and a number of software tools commonly used in gene annotation.
Copyright © 2015 John Wiley & Sons, Inc.
Volume 9 Supplement 5
Proceedings of the 10th Bio-Ontologies Special Interest Group Workshop 2007. Ten years past and looking to the future
- Proceedings
- Open access
- Published: 29 April 2008
Gene Ontology annotations: what they mean and where they come from
- David P Hill 1 ,
- Barry Smith 2 ,
- Monica S McAndrews-Hill 1 &
- Judith A Blake 1
BMC Bioinformatics volume 9, Article number: S2 (2008)
To address the challenges of information integration and retrieval, the computational genomics community increasingly has come to rely on the methodology of creating annotations of scientific literature using terms from controlled structured vocabularies such as the Gene Ontology (GO). Here we address the question of what such annotations signify and of how they are created by working biologists. Our goal is to promote a better understanding of how the results of experiments are captured in annotations, in the hope that this will lead both to better representations of biological reality through annotation and ontology development and to more informed use of GO resources by experimental scientists.
The PubMed literature database contains over 15 million citations and it is beyond the ability of anyone to comprehend information in such amounts without computational help. One avenue to which bioinformaticians have turned is the discipline of ontology that allows experimental data to be stored in such a way that it constitutes a formal, structured representation of the reality captured by the underlying biological science. An ontology of a given domain represents types and the relations between them, and is designed to support computational reasoning about the instances of these types. From the perspective of the biologist, the development of bio-ontologies has enabled and facilitated the analysis of very large datasets. This utility comes not from the ontologies per se, but from the use to which they are put during the curation process that results in ‘annotations’.
The principal use of an ontology such as the GO [ 1 ] is for the creation of annotations by the curators of model organism databases [e.g., [ 2 – 4 ]] and at genome annotation centers [ 5 ] who are striving to capture, in a form accessible by computational algorithms, information about the contributions of gene products to biological systems as reported in the scientific literature. Because such annotations are so integral to the use of bio-ontologies, it is important to understand how the curatorial process proceeds. We demonstrate here how the GO annotation paradigm illustrates important aspects of this process.
To help in understanding this work, we provide a glossary of the terms that are most important to our discussion:
An annotation is the statement of a connection between a type of gene product and the types designated by terms in an ontology such as the GO. This statement is created on the basis of observations of the instances of such types made in experiments and of the inferences drawn from such observations. For present purposes we are interested in the annotations prepared by model organism databases to a type called ‘gene’, a term which is seen as encompassing all gene-product types. For the purpose of this discussion, we do not need to address the distinction between gene and gene product.
An instance is a particular entity in spatio-temporal reality, which instantiates a type (for example, a type of gene product molecule, a type of cellular component). In the cases discussed here, the instances would be actual molecules or cellular components that can be physically identified or isolated or associated biological processes that can be physically observed.
A type (aka “universal”) is a general kind instantiated by an open-ended totality of instances that share certain qualities and propensities in common. An example is the type nucleus, whose instances are the membrane-bound organelles containing the genetic material present in instances of the type eukaryotic cell.
A level of granularity is a collection of instances (and of corresponding types) characterized by the fact that they form units (‘grains’), such as molecules, cells, organisms in the organization of biological reality. Successive levels of granularity form a hierarchy by virtue of the fact that grains at smaller scales are parts of grains at successively larger scales.
A gene product instance is a molecule (usually an RNA or protein molecule) generated by the expression of a nucleic acid sequence that plays some role in the biology of an organism. For example, an instance of the Shh gene product would be a molecule of the protein produced by the Shh gene.
A molecular function instance is the enduring potential of a gene product instance to perform actions, such as catalysis or binding, on the molecular level of granularity. A molecule of the Adh1 gene product sitting in a test tube has the potential to catalyze the reaction that converts an alcohol into an aldehyde or a ketone. It is assumed that in the correct context, this catalysis event would occur. The potential of this molecule describes its molecular function.
A biological process instance (aka “occurrence”) is a change or complex of changes on the level of granularity of the cell or organism that is mediated by one or more gene products. For example, the development of an arm in a given embryo would be an instance of the biological process limb development .
A cellular component instance is a part of a cell or its extracellular environment where a gene product may be located. For example, a cellular component instance intrinsic to internal side of plasma membrane is that part of a specific cell that comprises the lipid bilayer of the plasma membrane and the cytoplasmic area adjacent to the internal lipid layer where a gene product would project.
For each of the instance terms in the above, there is a corresponding type term defined in the obvious way; thus a molecular function type is a type of molecular function instance, and so on.
Curation is the creation of annotations on the basis of the data (for example data about gene products) contained in experimental reports, primarily as contained in the scientific literature published on the basis of the observation of corresponding instances.
An evidence code is a three-letter designation used by curators during the annotation process that describes the type of experimental support linking gene product types with types from the GO Molecular Function, Cellular Component and Biological Process ontologies. For example, the evidence code IDA (Inferred from Direct Assay) is used when an experimenter has devised an assay that measures the execution of a given molecular function and the experimental results show that instances of the gene product serve as agents in such executions. An assay is designed to detect, either directly or indirectly, those occurrences that are the executions of a given molecular function type. Thereby the assay identifies instances of that function type. The code IGI (Inferred From Genetic Interaction) is used when an inference is drawn, from genetic experiments using instances of more than one gene product type, to the effect that molecules of one of these types are responsible for the execution of a specified molecular function.
The Gene Ontology Consortium (GOC) uses two further evidence codes to describe experimental support for an annotation: IMP (Inferred from Mutant Phenotype) and IPI (Inferred from Physical Interaction). The consortium uses other evidence codes to describe inferences used in annotations that are not supported by direct experimental evidence, but these will not be considered in this discussion (http://www.geneontology.org/GO.evidence.shtml). Here we give examples of the process of annotation supported by experimental evidence using the IDA and IMP evidence codes. We use these examples to illustrate how an annotation helps us understand the underlying biological methods that were used to support the inferences between the types that the annotation represents. With this knowledge in hand, we can then generate new inferences or filter the information for specific needs.
The curator perspective
A GO annotation represents a link between a gene product type and a molecular function, biological process, or cellular component type (a link, in other words, between the gene product and what that product is capable of doing, what biological processes it contributes to, and where in the cell it is capable of functioning in the natural life of an organism). Formally, a GO annotation consists of a row of 15 columns. For the purpose of this discussion, there are 4 primary fields: i) the public database ID for the gene or gene product being annotated; ii) the GO:ID for the ontology term being associated with the gene product; iii) an evidence code; and iv) the reference/citation for the source of the information that supports the particular annotation (Figure 1). Curators from the GOC have agreed to use standard practices when annotating gene products; these practices are enforced by e-mail exchanges, quality control reports, face-to-face meetings and regular conference calls.

Anatomy of an Annotation. Annotations are provided to the Gene Ontology Consortium as tab-delimited files with 15 fields. Four fields indicate the gene product being annotated, the ontology terms used in the association, the type of evidence supporting the annotation and the reference where the original evidence was presented. The three annotations described in this manuscript are shown.
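A minimal sketch of reading one such 15-field row and keeping only the four primary fields might look as follows. The column positions are an assumption based on the 15-column GO annotation file layout (object ID in column 2, GO ID in column 5, reference in column 6, evidence code in column 7) and should be checked against the current GO file format documentation. The sample row uses placeholder values throughout and is not one of the annotations shown in Figure 1.

```python
from typing import NamedTuple


class CoreAnnotation(NamedTuple):
    gene_product_id: str
    go_id: str
    evidence_code: str
    reference: str


# Assumed 0-based positions of the four primary fields in a 15-column row.
OBJECT_ID_COL = 1
GO_ID_COL = 4
REFERENCE_COL = 5
EVIDENCE_COL = 6
EXPECTED_FIELDS = 15


def parse_annotation_row(line: str) -> CoreAnnotation:
    """Extract the four primary fields from one tab-delimited annotation row."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(
            f"expected {EXPECTED_FIELDS} tab-delimited fields, got {len(fields)}"
        )
    return CoreAnnotation(
        gene_product_id=fields[OBJECT_ID_COL],
        go_id=fields[GO_ID_COL],
        evidence_code=fields[EVIDENCE_COL],
        reference=fields[REFERENCE_COL],
    )


# Placeholder row: every identifier below is invented for illustration.
sample_fields = [
    "EXAMPLE_DB", "EX:0000001", "Shh", "", "GO:0000000", "PMID:0000000",
    "IMP", "", "P", "", "", "gene", "taxon:0", "20000101", "EXAMPLE_DB",
]
parsed = parse_annotation_row("\t".join(sample_fields))
print(parsed)
```

Rejecting rows with the wrong number of fields mirrors the quality-control concern that an annotation have a correct formal structure.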
Additional details of these practices and of the annotation structure and GO-defined annotation processes are available at the GO website [ http://www.geneontology.org/GO.annotation.shtml ]. Briefly, the annotation process unfolds in a series of steps. First, specific experiments, documented in the biomedical literature, are identified as relevant to the curation-process responsibilities of a given curator. Second, the curator applies expert knowledge to the documentation of the results of each selected experiment. This process entails determining which gene products are being studied in the experiment, the nature of the experiment itself, and of the molecular functions, biological processes and cellular components that the experiment identifies as being correlated with the gene product. The curator then creates an annotation which captures the appropriate relationships between the corresponding ontology types.
Finally, annotation quality control processes are employed to ensure that the annotation has a correct formal structure, to evaluate annotation consistency among curators and curatorial groups, and to harvest the knowledge emerging from the activity of annotation for the contributions it might make to the refinement and extension of the GO itself, and increasingly also to other ontologies.
Step 1: Identification of relevant experimental data: The main goal of the GO annotation effort is to create genome-specific annotations supported by evidence obtained in experiments performed in the organism being annotated. However, many annotations are inferred from experiments performed in other organisms, or they are inferred not from experiments at all but rather from knowledge about sequence features for the gene in question. Such information, too, is captured in the GO annotations by means of corresponding evidence codes. It is thus important for the user of such annotations to understand what these codes reflect: either that an annotation is based on experimental evidence supporting the assertion, or that an annotation is a prediction based on structural similarity. The difference between experimentally verified and computationally derived GO annotations can be identified in the annotation file. This complexity, if not taken into account by the user, can confound data analyses and undermine the goal of hypothesis generation on the basis of GO annotation sets. With an understanding of the kinds of evidence that underlie a given GO annotation and of how that annotation is meant to represent the real world, the user can intelligently filter annotation files and retrieve those annotation sets that reflect the kinds of experiments and of predictions that are of maximal relevance.
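The kind of filtering described here can be sketched in a few lines. The set of experimental codes below contains only the four codes named in this text (IDA, IGI, IMP, IPI); the GOC defines further codes, so a real analysis would need the full list from the GO evidence code documentation. The sample records are invented, and ISS (a similarity-based GO evidence code not discussed in this text) stands in for a computational prediction.

```python
# Experimental evidence codes named in this text; not the complete GOC list.
EXPERIMENTAL_CODES = frozenset({"IDA", "IGI", "IMP", "IPI"})


def split_by_evidence(annotations):
    """Separate (gene, term, evidence_code) records by kind of support."""
    experimental, other = [], []
    for annotation in annotations:
        _gene, _term, code = annotation
        if code in EXPERIMENTAL_CODES:
            experimental.append(annotation)
        else:
            other.append(annotation)
    return experimental, other


# Invented records for illustration only.
records = [
    ("geneA", "term 1", "IDA"),
    ("geneB", "term 2", "ISS"),
    ("geneC", "term 3", "IMP"),
]
experimental, other = split_by_evidence(records)
print(experimental)
print(other)
```

Keeping the non-experimental records in a second list, rather than discarding them, leaves the user free to decide which predictions are relevant to a given analysis.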
Step 2: Identification of the appropriate ontology annotation term: The decision as to what GO term to use in an annotation depends on several factors. The experiment itself will bring some limit on the resolution of what can be understood from its results. For example, cell fractionation might localize molecules of a protein to the nucleus of a cell, but immunolocalization experiments might localize molecules of the same type of protein to the nucleolus of a cell. As a result, the same gene may have annotations to different terms in the same ontology because annotations are based on different experiments. Efforts are made to ensure annotation consistency through regular annotation consistency checks. Where inconsistencies are identified, the GOC takes steps to resolve them by working with the curators involved and where necessary with domain specialists. The limitations of experimental methods may lead curators to use their own scientific expertise and background knowledge when selecting a term. It is important to keep in mind that the choice of a GO term is sometimes an inference made by the annotator on the basis of his or her previous knowledge. An example would be the case in which a mutation in a housekeeping gene causes a defect in a very broad process such as limb morphogenesis. A curator who has background knowledge about the function of this gene as being involved in basic cell physiology may be confident that the defect in morphogenesis is a by-product of unhealthy cells, and that the gene product is not involved in morphogenesis per se. The task of establishing which sub-processes are parts of and which lie outside a given process is challenging not only to ontology developers and curators but also to laboratory biologists. One method to address this issue is to define each process with a discrete beginning and end. GO ontology developers use this method whenever possible when defining process types. This allows annotators to best capture the knowledge based on the GO type defined.
The GOC has now adopted a policy, already being realized by the MGI group, of creating annotations that are ‘contextual’. This means that terms from other ontologies such as the cell type (CL) [6] and other OBO Foundry ontologies [7], and from the mouse anatomical dictionary [8], are used in conjunction with GO terms in the annotations. As a result, the annotation can more accurately describe the biological reality that needs to be captured.
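The nucleus and nucleolus example suggests one way to tell annotations that differ only in resolution from annotations that point to unrelated terms. The sketch below is an illustration under stated assumptions, not the GOC's consistency-check procedure: the `TOY_PART_OF` table is a hand-made fragment (the root term "cell" and the single-parent structure are simplifications), and the function names are hypothetical.

```python
# Toy fragment: each term maps to the one term it is part of (an assumption;
# a real ontology allows several parents and several relation types).
TOY_PART_OF = {
    "nucleolus": "nucleus",
    "nucleus": "cell",
    "plasma membrane": "cell",
}


def lineage(term):
    """Return the term followed by every term it is (transitively) part of."""
    chain = [term]
    while chain[-1] in TOY_PART_OF:
        chain.append(TOY_PART_OF[chain[-1]])
    return chain


def differ_only_in_resolution(term_a, term_b):
    """True when one term lies on the part_of lineage of the other."""
    return term_a in lineage(term_b) or term_b in lineage(term_a)


print(differ_only_in_resolution("nucleus", "nucleolus"))
print(differ_only_in_resolution("nucleolus", "plasma membrane"))
```

Under this toy model, a fractionation result (nucleus) and an immunolocalization result (nucleolus) for the same gene product are compatible, whereas nucleolus and plasma membrane would be flagged for a curator to review.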
Molecular function annotation
In the simplest biological situation, molecules of a given type are associated with a single molecular function type. A specific molecule m is an instance of a molecule type M (represented for example in the UniProt database), and its propensity to act in a certain way is an instance of the molecular function type F (represented by a corresponding GO term). So, a molecule of the gene product type Adh1, alcohol dehydrogenase 1 (class I), has as its function an instance of the molecular function type alcohol dehydrogenase activity. This means that such a molecule has the potential to execute this function in a given context. The term ‘activity’, in this sense, is meant as it is used in a biochemical context, and is more appropriately read as meaning ‘potential activity’. Note that although the same string, “alcohol dehydrogenase”, is used both in the gene name and in the molecular function, the string itself refers to different entities: in the former to the molecule type; in the latter to the type of function that molecule has the propensity to execute. This ambiguity is rooted in the tendency to name molecules based on the functions they execute, and it is important to understand this distinction since the name of a molecule and the molecular function to which the molecule is attributed may not necessarily agree, for instance because the molecule may execute multiple functions.
If we say that instances of a given gene product type have a potential to execute a given function, this does not mean that every instance of this type will in fact execute this function. Thus molecules of the mouse gene product type Zp2 are found in the oocyte and have the propensity to bind molecules of the gene product type Acr during fertilization [ 9 ]. If, however, an oocyte is never fertilized, the molecules still exist and they still have the propensity to execute the binding function, but the function is never executed.
The experimental evidence used to test whether a given molecular function type F exists comes in the form of an ‘assay’ for the execution of that function type in molecules of some specific type M. If instances of F are identified in such an assay, this justifies a corresponding molecular function annotation asserting an association between M and F. As an example, Figure 2 shows results of an assay for the molecular function retinol dehydrogenase activity taken from a study by Zhang et al. [10]. (Throughout this paper we will denote types using italics.) The molecular function type retinol dehydrogenase activity is defined in the molecular function ontology by the reaction: retinol + NAD⁺ → retinal + NADH + H⁺. Instances of gene product molecules annotated to this term have the potential to execute this catalytic activity. In this experiment, a cell protein extract was incubated with two substrates, all-trans-retinol (open circles) or 9-cis-retinol (filled circles), and the cofactor NAD⁺ for 10 minutes and the amount of retinal generated was measured. The graph shows the rate of accumulation of product (retinal) with respect to the concentration of substrate (retinoid) used. The results show that the reaction defined by the GO molecular function type retinol dehydrogenase activity has indeed been instantiated: the execution of this function has occurred. The observed occurrences of retinol being converted to retinal are evidence for the existence of instances of this molecular function type. In this experiment, the instances of the function type are identified through observation of actual executions. We assert that some molecules in this extract have molecular functions of type retinol dehydrogenase activity because occurrences of executions of instances of this type have been directly measured.

Molecular Function Annotation Data. This graph is reproduced from Zhang et al. [10]. The graph shows the concentration of retinoid used as substrate along the X axis and the retinol dehydrogenase activity along the Y axis. Open circles refer to all-trans-retinol as a substrate and closed circles refer to 9-cis-retinol as a substrate. The enzyme samples were taken from a crude extract of cells transfected with a cDNA encoding the Rdh1 gene. [Used by permission]
Biological process annotation
A molecular function instance is the enduring potential of a gene product instance to act in a certain way. A biological process instance is the execution of one or more such molecular function instances working together to accomplish a certain biological objective. A biological process instance is at the cellular or organismal level of granularity what the execution of a function is at the level of the molecule. There is a relationship between molecular functions and biological processes. At this time this relationship is not represented explicitly in GO. From a gene annotation perspective, we are interested in going beyond the instance-instance relations at the cell- or organism-level, and in gaining the ability to infer type-type relations which link gene product types at the molecular level of granularity to process types at the level of the cell or organism. We are interested in the fact that molecules of a given gene product type can be associated with instances of a molecular function type (known or unknown) whose execution contributes to the occurrence of a biological process of a given type. Inferences about such type-type relations can be made because experiments are designed to test what transpires when specified biological conditions are satisfied in typical circumstances – circumstances in which, as a result of the efforts of the experimenter, disturbing events do not interfere. Experiments are designed to be reproducible and predictive, describing the instances that one would expect to find in biological systems meeting the defined conditions. If future experiments show that preceding experiments did not describe the intended typical situation, then the conclusions from the preceding experiments are questioned and may be reanalyzed and reinterpreted, or even rejected entirely, and the corresponding annotations then need to be amended accordingly.
Annotations in this way sometimes point to errors in the type-type relationships described in the ontology. An example is the recent removal of the type serotonin secretion as an is_a child of neurotransmitter secretion from the GO Biological Process ontology. This modification was made as a result of an annotation from a paper showing that serotonin can be secreted by cells of the immune system where it does not act as a neurotransmitter.
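The practical effect of removing such an is_a link can be illustrated with a toy ontology fragment. The sketch assumes that an annotation to a term is read as also applying to every is_a ancestor of that term, which is a common convention in ontology-based analyses but is not spelled out in this text. The parent term "secretion" and the exact edges are hypothetical and do not reproduce the real GO graph.

```python
def ancestors(term, is_a):
    """Collect every term reachable from `term` by following is_a links."""
    found = set()
    stack = [term]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


# Hypothetical fragment before the change: term -> set of is_a parents.
before = {
    "serotonin secretion": {"neurotransmitter secretion", "secretion"},
    "neurotransmitter secretion": {"secretion"},
}

# The same fragment after removing the questioned is_a link.
after = {term: set(parents) for term, parents in before.items()}
after["serotonin secretion"].discard("neurotransmitter secretion")

print(sorted(ancestors("serotonin secretion", before)))
print(sorted(ancestors("serotonin secretion", after)))
```

Before the change, a gene product annotated to serotonin secretion in an immune cell would be inferred to take part in neurotransmitter secretion as well; after the change, that inference is no longer drawn.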
Associations between gene products and biological processes, too, can be detected experimentally. When instances of biological process type P are detected, either by direct observation or by experimental assay, as being associated with instances of a given gene product type M , then this justifies the assertion of that sort of association between M and P which is called a biological process annotation.
For those species of organisms where the tools of genetic study can be successfully applied, the association of gene product types with biological process types is usually achieved through the study of the perturbations of biological processes following genetic mutation. Curators use the IMP evidence code for these annotations. Figure 3 shows an example of a mutational analysis done by Washington-Smoak et al on the effects of a mutation of the Shh gene on mouse heart development [ 11 ]. The left panel shows an image of a heart with normal copies of the gene (WT) at 16.5 days of embryogenesis; the right panel shows a heart with defective copies of the gene at 16.5 days of embryogenesis. The figure clearly illustrates that the development of the outflow tracts of the heart is defective in the embryo with the defective gene. The GO Biological Process ontology defines the type heart development as: ‘the process whose specific outcome is the progression of the heart over time, from its formation to the mature structure. The heart is a hollow, muscular organ, which, by contracting rhythmically, keeps up the circulation of the blood.’

Biological Process Annotation Data. This figure is reproduced from Washington Smoak et al. [11]. The figure shows micrographs of hearts in 16.5dpc mouse embryos. The figure on the left shows an animal with two functional copies of the Shh gene and the figure on the right shows an animal with no functional copies. Ao and Pa indicate the aorta and the pulmonary artery respectively. The ? indicates an aberrant outflow tract. Reprinted from Developmental Biology, 283, Washington Smoak et al., Sonic hedgehog is required for cardiac outflow tract and neural crest development, 357-72, Copyright 2005, with permission from Elsevier.

Cellular Component Annotation. This figure is reproduced from MacPhee et al. [12]. The figure shows micrographs that are the results of an immunofluorescence localization of the ATP1A1 protein. The illuminated areas show the location of the protein along the plasma membrane. Reprinted from Developmental Biology, 222, MacPhee et al., Differential involvement of Na(+),K(+)-ATPase isozymes in preimplantation development of the mouse, 486-498, Copyright 2000, with permission from Elsevier.
Based on the mutational study reported in Washington-Smoak et al , an MGI curator has made an annotation linking heart development and the Shh gene using the IMP evidence code (Fig. 1 ). This annotation rests on the identification in the normal animal of a molecule of the product of the Shh gene with a molecular function whose execution contributes to an occurrence of the biological process heart development . We know that the biological process heart development exists because we observe it in the normal animal. We know that a molecule of SHH contributes to this process because when we take away all instances of the gene product of the Shh gene in an animal, the process of heart development is disturbed. The annotation thus affirms that a molecule of SHH protein has the potential to execute a molecular function that contributes to an instance of the type heart development in the Biological Process ontology. We also generalize that the execution of the molecular function of a molecule of SHH in a given mouse will in some way contribute to the development of that mouse's heart. However, the results of any phenotypic assay are limited to the resolution of the phenotype itself. In the experiment described above, we have validated the biological process, but cannot make any direct inferences about the nature of the function executed. It is for this and other practical reasons that the molecular function and biological process ontologies were developed independently.
Cellular component annotation
In a large majority of cases, annotations linking gene product with cellular component types are made on the basis of a direct observation of an instance of the cellular component in a microscope, as for example in [ 12 ], which reports an experiment in which an antibody that recognizes gene products of the Atp1a1 gene is used to label the location of instances of such products in preimplantation mouse embryos (Figure 4 ). The fluorescent staining shows that the gene products are located at the plasma membrane of the cells of these embryos. In this case, the instances of the gene products are the molecules bound by the fluorescent antibodies, and the instance of the cellular component is the plasma membrane that is observed under the microscope. A curator has accordingly used the results of this experiment to make an annotation of the ATP1A1 gene product to the GO cellular component plasma membrane (Fig. 1 ). As with molecular functions and biological processes, there is also a relationship between molecular function and cellular component. It is straightforward to hypothesize that, if a molecule of a gene product is found in an instance of a given cellular component, then that gene product has the potential to execute its function in that cellular component as well. If the execution of the function is detected in the component, then we can make a generalization concerning the molecular function type and the cellular component type. We assume, based on the accumulated experimental data, that sufficient instances of the gene product will execute their functions in some instance of the cellular component type and that enough molecules will execute their function in such a way that these executions become biologically relevant. As with molecular function and biological process, experimental evidence for molecular function and cellular component annotations is often separable. Therefore, from a practical standpoint, these ontologies are also developed separately.
The development of an ontology for a given domain reflects a shared understanding of this domain on the part of domain scientists. This understanding, for biological systems, is the result of the accumulation of experimental results reflecting that iterative process of hypothesis generation and experimental testing for falsification which is the scientific method. The process of annotation brings new experimental results into relationship with the existing scientific knowledge that is captured in the ontology. There will necessarily be times when new results yield conflicts with the current version of the ontology. One of the strengths of the GO development paradigm is that development of the GO has been a task performed by biologist-curators who are experts in understanding specific experimental systems: as a result, the GO is continually being updated in response to new information. GO curators regularly request that new terms be added to the GO or suggest rearrangements to the GO structure, and the GO has an ontology development pipeline that addresses not only these requests but also submissions coming in from external users. By coordinating the development of the ontology with the creation of annotations rooted in the experimental literature, the validity of the types and relationships in the ontology is continually checked against the real-world instances observed in experiments. GO curators refer to this as annotation-driven ontology development. In addition, the GO community works with scientific experts for specific biological systems to evaluate and update GO representations for the corresponding parts of the ontology [ 13 ].
Conclusions
Gene Ontology annotations report connections between gene products and the biological types that are represented in the GO using GO evidence codes. The evidence codes record the process by which these connections are established and reflect either the experimental analysis of actual instances of gene products or inferential reasoning from such analysis. We believe that an understanding of the role of instances in the spatiotemporal reality upon which experiments are performed can provide for a more rigorous analysis of the knowledge that is conveyed by annotations to ontology terms. While each annotation rests ultimately on the observation of instances in the context of a scientific experiment, the annotation itself is not about such instances. Rather it is about the corresponding types. This is possible because annotations are derived by scientific curators from the published reports of scientific experiments that describe general cases, cases for which we have scientific evidence supporting the conclusion that the instances upon which the experiments are performed are typical instances of the corresponding types. If such evidence is called into question through further experimentation, then, as we saw, the corresponding annotations may need to be revised. The resultant tight coupling between ontology development and curation of experimental literature goes far towards ensuring that ontologies such as GO reflect the most sophisticated understanding of the relevant biology that is available to scientists. One area of future work would be to find ways to computationally identify inconsistencies in the type-type relations in the ontology based on inconsistencies of annotations to the types.
It is obvious to us that our cumulative biological knowledge should represent how instances relate to one another in reality, and that any development of bio-ontologies and of relationships between such ontologies should take into account information of the sort that is captured in annotations. While we are still at an early stage in the process of creating truly adequate and algorithmically processable representations of biological reality, we believe that the GO methodology of allowing ontology development and creation of annotations to influence each other mutually represents an evolutionary path forward, in which both annotations and ontology are being enhanced in both quality and reach.
References
1. Gene Ontology Consortium: The Gene Ontology (GO) project in 2006. Nucleic Acids Res 2006, 34: D322–D326. 10.1093/nar/gkj021
2. Blake JA, Eppig JA, Bult CJ, Kadin JA, Richardson JE, Mouse Genome Database Group: The Mouse Genome Database (MGD): updates and enhancements. Nucleic Acids Res 2006, 34: D562–7. 10.1093/nar/gkj085
3. Cherry JM, Ball C, Weng S, Juvik G, Schmidt R, Adler C, Dunn B, Dwight S, Riles L, Mortimer RK, Botstein D: Genetic and physical maps of Saccharomyces cerevisiae. Nature 1997, 387 (6632 Suppl): 67–73.
4. Grumbling G, Strelets V, The FlyBase Consortium: FlyBase: anatomical data, images and queries. Nucleic Acids Res 2006, 34: D484–D488. 10.1093/nar/gkj068
5. Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 2004, 32: D262–D266. 10.1093/nar/gkh021
6. Bard J, Rhee S, Ashburner M: An Ontology for Cell Types. Genome Biol 2005, 6: R21. 10.1186/gb-2005-6-2-r21
7. Smith B, Ashburner M, Rosse C, et al.: The OBO Foundry: Coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 2007, 25: 1251–1255. 10.1038/nbt1346
8. Hayamizu TF, Mangan M, Corradi JP, Kadin JA, Ringwald M: The Adult Mouse Anatomical Dictionary: a tool for annotating and integrating data. Genome Biol 2005, 6 (3): R29. Epub 2005 Feb 15. 10.1186/gb-2005-6-3-r29
9. Howes E, Pascall JC, Engel W, Jones R: Interactions between mouse ZP2 glycoprotein and proacrosin; a mechanism for secondary binding of sperm to the zona pellucida during fertilization. J Cell Sci 2001, 114 (Pt 22): 4127–36.
10. Zhang M, Chen W, Smith SM, Napoli JL: Molecular characterization of a mouse short chain dehydrogenase/reductase active with all-trans-retinol in intact cells, mRDH1. J Biol Chem 2001, 276 (47): 44083–90. 10.1074/jbc.M105748200
11. Washington Smoak I, Byrd NA, Abu-Issa R, Goddeeris MM, Anderson R, Morris J, Yamamura K, Klingensmith J, Meyers EN: Sonic hedgehog is required for cardiac outflow tract and neural crest cell development. Dev Biol 2005, 283: 357–72. 10.1016/j.ydbio.2005.04.029
12. MacPhee DJ, Jones DH, Barr KJ, Betts DH, Watson AJ, Kidder GM: Differential involvement of Na(+),K(+)-ATPase isozymes in preimplantation development of the mouse. Dev Biol 2000, 222: 486–498. 10.1006/dbio.2000.9708
13. Diehl AD, Lee JA, Scheuermann RH, Blake JA: Ontology development for biological systems: immunology. Bioinformatics 2007, 23: 913–5. 10.1093/bioinformatics/btm029
Acknowledgements
The authors would like to thank Cynthia Smith, Terry Hayamizu and Waclaw Kusnierczyk, and the reviewers, for their help with this manuscript. This work was supported by NIH grants HG02273 (DPH, JB) and HG00330 (JB, MM) and by the NIH Roadmap for Medical Research, Grant 1 U 54 HG004028 (BS).
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 5, 2008: Proceedings of the 10th Bio-Ontologies Special Interest Group Workshop 2007. Ten years past and looking to the future. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S5.
Author information
Authors and affiliations.
The Jackson Laboratory, Bar Harbor, ME, USA
David P Hill, Monica S McAndrews-Hill & Judith A Blake
Department of Philosophy and Center of Excellence in Bioinformatics and Life Sciences, University at Buffalo, NY, USA
Barry Smith
Corresponding author
Correspondence to Judith A Blake.
Additional information
Competing interests.
The authors declare that they have no competing interests.
Authors' contributions
All authors contributed equally to this effort through discussion, writing, and revision of the manuscript.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article.
Hill, D.P., Smith, B., McAndrews-Hill, M.S. et al. Gene Ontology annotations: what they mean and where they come from. BMC Bioinformatics 9 (Suppl 5), S2 (2008). https://doi.org/10.1186/1471-2105-9-S5-S2
Published: 29 April 2008
DOI: https://doi.org/10.1186/1471-2105-9-S5-S2
- Gene Ontology
- Molecular Function
- Ontology Development
- Evidence Code
- Gene Ontology Consortium
BMC Bioinformatics
ISSN: 1471-2105

Bioinformatics in Rice Research, pp 163–177
Gene Identification and Structure Annotation
- Puja Sashankar
- Santhosh N Hegde
- N. Sathyanarayana
- First Online: 25 September 2021
Rice (O. sativa L.) is one of the most important food crops worldwide. Because demand keeps rising, many efforts are under way to enhance its productivity, the latest being the series of rice genome sequencing projects. Rapid developments in next-generation sequencing (NGS) have extended these efforts to hundreds to thousands of rice varieties, enabling researchers to unlock the hidden potential of the vast and diverse rice germplasm. One important objective of these projects is to characterize gene models accurately, which is essential for the in-depth study of gene function and its various applications. Bioinformatics plays a major role in identifying gene structure and biological function through various algorithms and software. Hence, this chapter aims to elucidate the approach to identifying, characterizing, and determining the function of different types of rice genes.
- Annotations
- Gene Prediction
- Coding genes
Abbreviations
- CDS: Coding DNA Sequence
- CGSNL: Committee on Gene Symbolization Nomenclature and Linkage
- EST: Expressed Sequence Tags
- FL-cDNA: Full-length complementary DNA
- IRGSP: The International Rice Genome Sequencing Project
- MSU: Michigan State University
- NGS: Next-Generation Sequencing
- NIAS: National Institute of Agrobiological Sciences
- OMAP: Oryza Map Alignment Project
- ORF: Open Reading Frame
- RAP: Rice Annotation Project
- RGAP: Rice Genome Annotation Project
- RGKbase: Rice Genome Knowledgebase
- RMD: Rice Mutant Database
- rRNA: ribosomal RNA
- snoRNA: small nucleolar RNA
- TE: Transposon Element
- tRNA: transfer RNA
Conflict of Interest
Author information
Authors and affiliations
Molecular Biology and Biotechnology Laboratory, Department of Botany, Sikkim University, Gangtok, Sikkim, India
Puja Sashankar & N. Sathyanarayana
Center for Functional Genomics and Bioinformatics, University of Trans-Disciplinary Health Sciences and Technology (TDU), Bengaluru, Karnataka, India
Santhosh N Hegde
Corresponding author
Correspondence to N. Sathyanarayana.
Editor information
Editors and affiliations.
Crop Improvement Division, ICAR-National Rice Research Institute, Cuttack, Odisha, India
Dr. Manoj Kumar Gupta
Dr. Lambodar Behera
Copyright information
© 2021 The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter.
Sashankar, P., Hegde, S.N., Sathyanarayana, N. (2021). Gene Identification and Structure Annotation. In: Gupta, M.K., Behera, L. (eds) Bioinformatics in Rice Research. Springer, Singapore. https://doi.org/10.1007/978-981-16-3993-7_8
DOI: https://doi.org/10.1007/978-981-16-3993-7_8
Published: 25 September 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-3992-0
Online ISBN: 978-981-16-3993-7
eBook Packages: Biomedical and Life Sciences (R0)
Annotating Genes, Genomes, and Variants
Martin Morgan February 4, 2014
Gene annotation
Data packages
Organism-level ('org') packages contain mappings between a central identifier (e.g., Entrez gene ids) and other identifiers (e.g. GenBank or Uniprot accession number, RefSeq id, etc.). The name of an org package is always of the form org.<Sp>.<id>.db (e.g. org.Sc.sgd.db) where <Sp> is a 2-letter abbreviation of the organism (e.g. Sc for Saccharomyces cerevisiae) and <id> is an abbreviation (in lower-case) describing the type of central identifier (e.g. sgd for gene identifiers assigned by the Saccharomyces Genome Database, or eg for Entrez gene ids). The “How to use the '.db' annotation packages” vignette in the AnnotationDbi package (org packages are only one type of “.db” annotation packages) is a key reference. The '.db' and most other Bioconductor annotation packages are updated every 6 months.
Annotation packages usually contain an object named after the package itself. These objects are collectively called AnnotationDb objects, with more specific classes named OrgDb, ChipDb or TranscriptDb objects. Methods that can be applied to these objects include columns(), keys(), keytypes() and select(). Common operations for retrieving annotations are summarized in the table.
Exercise: This exercise illustrates basic use of the select() interface to annotation packages.
- What is the name of the org package for Homo sapiens ? Load it. Display the OrgDb object for the org.Hs.eg.db package. Use the columns() method to discover which sorts of annotations can be extracted from it.
- Use the keys() method to extract ENSEMBL identifiers and then pass those keys in to the select() method in such a way that you extract the SYMBOL (gene symbol) and GENENAME information for each. Use the following ENSEMBL ids.
Solution: The OrgDb object is named org.Hs.eg.db.
Internet resources
A short summary of selected Bioconductor packages enabling web-based queries is in the following table.
Using biomaRt
The biomaRt package offers access to the online biomart resource. This consists of several database resources, referred to as 'marts'. Each mart allows access to multiple data sets; the biomaRt package provides methods for mart and data set discovery, and a standard method getBM() to retrieve data.
- Load the biomaRt package and list the available marts. Choose the ensembl mart and list the datasets for that mart. Set up a mart to use the ensembl mart and the hsapiens_gene_ensembl dataset.
- A biomaRt dataset can be accessed via getBM() . In addition to the mart to be accessed, this function takes filters and attributes as arguments. Use filterOptions() and listAttributes() to discover values for these arguments. Call getBM() using filters and attributes of your choosing.
As an optional exercise, annotate the genes that are differentially expressed in the DESeq2 laboratory, e.g., find the GENENAME associated with the five most differentially expressed genes. Do these make biological sense? Can you merge() the annotation results with the 'top table' results to provide a statistically and biologically informative summary?
Genome annotation
A variety of packages and classes are available for representing large genomes. These include:
- 'TxDb.*' For transcript and other genome / coordinate annotation.
- BSgenome For whole-genome representation. See available.packages() for pre-packaged genomes, and the vignette 'How to forge a BSgenome data package' in the
- Homo.sapiens For integrating 'TxDb.*' and 'org.*' packages.
- 'SNPlocs.*' For model organism SNP locations derived from dbSNP.
- FaFile() ( Rsamtools ) for accessing indexed FASTA files.
- 'SIFT.*', 'PolyPhen', 'ensemblVEP' Variant effect scores.
Transcript annotation packages
Genome-centric packages are very useful for annotations involving genomic coordinates. It is straightforward, for instance, to discover the coordinates of coding sequences in regions of interest, and from these retrieve corresponding DNA or protein coding sequences. Other operations that are easy to perform with genome-centric annotations include defining regions of interest for counting aligned reads in RNA-seq experiments and retrieving DNA sequences underlying regions of interest in ChIP-seq analysis, e.g., for motif characterization.
This exercise uses annotation resources to go from a gene symbol 'BRCA1' through to the genomic coordinates of each transcript associated with the gene, and finally to the DNA sequences of the transcripts.
- Use the org.Hs.eg.db package to map from the gene symbol 'BRCA1' to its Entrez identifier. Do this using the select command.
- Use the TxDb.Hsapiens.UCSC.hg19.knownGene package to retrieve the transcript names ( TXNAME ) corresponding to the BRCA1 Entrez identifier. (The 'org*' packages are based on information from NCBI, where Entrez identifiers are labeled ENTREZID; the 'TxDb*' package we are using is from UCSC, where Entrez identifiers are labeled GENEID).
- Use the cdsBy() function to retrieve the genomic coordinates of all coding sequences grouped by transcript, and select the transcripts corresponding to the identifiers we're interested in. The coding sequences are returned as a GRangesList, where each element of the list is a GRanges object representing the exons in the coding sequence. As a sanity check, ensure that the sum of the widths of the exons in each coding sequence is evenly divisible by 3 (the R 'modulus' operator %% returns the remainder of the division of one number by another, and might be helpful in this case).
- Visualize the transcripts in genomic coordinates using the Gviz package to construct an AnnotationTrack, and plot it using plotTracks().
- Use the BSgenome.Hsapiens.UCSC.hg19 package and extractTranscriptSeqs() function to extract the DNA sequence of each transcript.
Retrieve the Entrez identifier corresponding to the BRCA1 gene symbol
Map from Entrez gene identifier to transcript name
Retrieve all coding sequences grouped by transcript, and select those matching the transcript ids of interest, verifying that each coding sequence width is a multiple of 3
Visualize the BRCA1 transcripts using Gviz
Extract the coding sequences of each transcript
Intron coordinates can be identified by first calculating the range of the genome (from the start of the first exon to the end of the last exon) covered by each transcript, and then taking the (algebraic) set difference between this and the genomic coordinates covered by each exon
Retrieve the intronic sequences with getSeq() (these are not assembled, the way that extractTranscriptSeqs() assembles exon sequences into mature transcripts); note that introns start and end with the appropriate acceptor and donor site sequences.
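The coordinate arithmetic behind the frame check and the intron calculation above can be sketched independently of Bioconductor. The following Python sketch is illustrative only: the function names and the example coordinates are hypothetical, intervals are assumed to be 1-based and inclusive (as in GRanges), and it is not the code used in the original solution. It treats introns as the gaps left when the exons are removed from the transcript's overall range, which is equivalent to the set difference described above.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end); an assumption of this sketch


def merge_exons(exons: List[Interval]) -> List[Interval]:
    """Sort exons and merge any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def introns(exons: List[Interval]) -> List[Interval]:
    """Transcript range minus exons: the gaps between consecutive exons."""
    merged = merge_exons(exons)
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(merged, merged[1:])]


def cds_in_frame(cds_exons: List[Interval]) -> bool:
    """True if the summed exon widths are a multiple of 3."""
    return sum(end - start + 1 for start, end in cds_exons) % 3 == 0


# Hypothetical coordinates, for illustration only.
example_exons = [(100, 200), (300, 400), (500, 650)]
print(introns(example_exons))
print(cds_in_frame([(1, 30), (101, 160)]))
```

Merging before taking gaps keeps the result well defined even if exon records overlap; a single-exon transcript yields no introns.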
rtracklayer
The rtracklayer package allows us to query the UCSC genome browser, as well as providing import() and export() functions for common annotation file formats like GFF, GTF, and BED.
Here we use rtracklayer to retrieve estrogen receptor binding sites identified across cell lines in the ENCODE project. We focus on binding sites in the vicinity of a particularly interesting region of interest.
- Define our region of interest by creating a GRanges instance with appropriate genomic coordinates. Our region corresponds to 10Mb up- and down-stream of a particular gene.
- Create a session for the UCSC genome browser
- Query the UCSC genome browser for ENCODE estrogen receptor ERalpha_a transcription marks; identifying the appropriate track, table, and transcription factor requires biological knowledge and detective work.
- Visualize the location of the binding sites and their scores; annotate the mid-point of the region of interest.
Define the region of interest
Create a session
Query the UCSC for a particular track, table, and transcription factor, in our region of interest
Visualize the result
Follow the Variants work flow.
AnnotationHub
- Coordinate mapping using chain files and liftOver()
- grasp2db package (under development)
Database access (advanced)
- Bioconductor annotation packages are based on sqlite databases.
- dplyr provides convenient access via src_sqlite , tbl , filter , arrange , summarize , etc.
Exercise: use AnnotationDbi::dbfile() on a TxDb object gene table to filter just those genes on chr3. Pipe this select to makeGRangesFromDataFrame to get a GRanges instance.
Web resources (advanced)
'Easy' to write web-based services, e.g., to query ENSEMBL REST API
- Is the service alive?
- Symbol -> ENSG identifier mapping
- Range-based query
- Available taxa

The NCBI Eukaryotic Genome Annotation Pipeline
The NCBI Eukaryotic Genome Annotation Pipeline provides content for various NCBI resources including Nucleotide, Protein, BLAST, Gene and the Genome Data Viewer genome browser.
This page provides an overview of the annotation process. Please refer to the Eukaryotic Genome Annotation chapter of the NCBI Handbook for algorithmic details.
The pipeline uses a modular framework for the execution of all annotation tasks, from the fetching of raw and curated data from public repositories (sequence and Assembly databases) to the alignment of sequences and the prediction of genes, to the submission of the accessioned annotation products to public databases. Core components of the pipeline are alignment programs (Splign and ProSplign) and an HMM-based gene prediction program (Gnomon) developed at NCBI.
Important features of the pipeline include:
- flexibility and speed
- higher weight given to curated evidence than non-curated evidence
- utilization of RNA-Seq for gene prediction
- production of models that compensate for assembly issues
- tracking of gene loci from one annotation to the next
- ability to co-annotate multiple assemblies for the same organism
The products of an annotation run (chromosome, scaffolds and model transcripts and proteins) are labeled with an Annotation Name. There are two formats for the Annotation Name, which is used throughout NCBI as a way to uniquely identify annotation products originating from the same annotation run.
- the combination of the organism name and Annotation Release number (e.g. NCBI Pongo abelii Annotation Release 103)
- the combination of the RefSeq assembly accession and the year and month in which the annotation was started (e.g. NCBI GCF_016801865.2-RS_2022_12)
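As a minimal sketch of how the two Annotation Name formats might be told apart programmatically, the Python snippet below parses each into its parts. The regular expressions are assumptions inferred only from the two examples given here (an optional "NCBI " prefix, a GCF_ accession with a version suffix, and a year and month after "-RS_"); they are not an official NCBI grammar, and the function name is hypothetical.

```python
import re
from typing import Optional

# Patterns inferred from the examples in the text; assumptions, not an NCBI specification.
ACCESSION_RE = re.compile(
    r"^(?:NCBI )?(?P<accession>GCF_\d{9}\.\d+)-RS_(?P<year>\d{4})_(?P<month>\d{2})$"
)
RELEASE_RE = re.compile(
    r"^(?:NCBI )?(?P<organism>.+) Annotation Release (?P<release>[\d.]+)$"
)


def parse_annotation_name(name: str) -> Optional[dict]:
    """Return the parts of an Annotation Name, or None if neither format matches."""
    m = ACCESSION_RE.match(name)
    if m:
        return {
            "kind": "accession",
            "accession": m.group("accession"),
            "year": int(m.group("year")),
            "month": int(m.group("month")),
        }
    m = RELEASE_RE.match(name)
    if m:
        return {
            "kind": "release",
            "organism": m.group("organism"),
            "release": m.group("release"),
        }
    return None


print(parse_annotation_name("NCBI Pongo abelii Annotation Release 103"))
print(parse_annotation_name("NCBI GCF_016801865.2-RS_2022_12"))
```

The release number is kept as a string because the sketch makes no assumption about whether it is always a plain integer.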
On this page: Source of genome assemblies; Transcript alignments; Transcriptomics long read alignments; RNA-Seq read alignments; Protein alignments; Model prediction; Curated RefSeq genomic sequence alignments; Choosing the best models for a gene; Protein naming and determination of locus type; Gene Ontology; Assignment of GeneIDs; Annotation of small RNAs; Annotation of transcription start sites (TSS); Special considerations; Annotation of multiple assemblies; Re-annotation; Annotation quality; Annotation products; Data availability.
Please see The Eukaryotic Genome Annotation chapter in the NCBI Handbook for more details about the algorithms.
The figure below provides an overview of the annotation process. The genomic sequences are masked (grey) and transcripts (blue), proteins (green) and RNA-Seq reads and, if available in SRA, long reads transcriptomes and Cap Analysis Gene Expression (CAGE) data (orange) are aligned to the genome. If available for the organism being annotated, curated RefSeq genomic sequences are also aligned (pink). Gene model prediction based on transcript and protein alignments is then performed (brown). The best models are selected among the RefSeq and the predicted models, named and accessioned (purple). Finally, the annotation products are formatted and deployed to public resources (yellow).

Source of genome assemblies
The RefSeq assemblies that are annotated by NCBI are copies of the genome assemblies that are public in INSDC (DDBJ, ENA and GenBank). Unplaced scaffolds with length below 1000 bases may not be included in the RefSeq copy of the assembly if the INSDC assembly contains more than 300,000 unplaced scaffolds and more than 25,000 of them are below 1000 bases. Both RefSeq and GenBank assemblies are further described in the Assembly resource.
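The scaffold rule above combines two thresholds, which a short predicate makes explicit. This Python sketch is illustrative only: the function and parameter names are hypothetical, and since the text says short scaffolds "may not be included", the result indicates eligibility for exclusion rather than a guaranteed outcome.

```python
from typing import Sequence


def short_scaffolds_may_be_excluded(
    unplaced_lengths: Sequence[int],
    min_length: int = 1000,        # scaffolds below this length count as short
    max_unplaced: int = 300_000,   # total unplaced scaffolds threshold from the text
    max_short: int = 25_000,       # short unplaced scaffolds threshold from the text
) -> bool:
    """True if both conditions for dropping short unplaced scaffolds hold."""
    n_short = sum(1 for length in unplaced_lengths if length < min_length)
    return len(unplaced_lengths) > max_unplaced and n_short > max_short


# Hypothetical assembly: 310,000 unplaced scaffolds, 30,000 of them short.
lengths = [500] * 30_000 + [5_000] * 280_000
print(short_scaffolds_may_be_excluded(lengths))
```

Both conditions must hold at once: an assembly with many short scaffolds but fewer than 300,000 unplaced scaffolds in total keeps all of them.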
Masking is done using RepeatMasker or WindowMasker. Human and mouse are masked with RepeatMasker using their respective Dfam libraries, while genomes from other species are masked with WindowMasker.
Transcript alignments
The set of transcripts selected for alignment to the genome varies by species, and may include transcripts from other organisms. This set generally includes:
- Known RefSeq transcripts: Coding and non-coding RefSeq transcripts with NM_ or NR_ prefixes, respectively, are generated by NCBI staff based on automatic processes, manual curation, or data from collaborating groups (see more details here)
- GenBank transcripts from the taxonomically relevant GenBank divisions, and the Third-Party Annotation (TPA), High-throughput cDNA (HTC) and Transcriptome Shotgun Assembly (TSA) divisions
- ESTs from dbEST
Sequences highly likely to be mitochondrial or to have cloning vector or IS element contamination, and sequences identified as low quality by RefSeq curation staff are screened out.
RefSeq transcripts and non-RefSeq transcripts that pass the contamination screen are aligned locally to the genome using BLAST to identify the location(s) at which transcripts align. Global re-alignment at these locations is performed with Splign to refine the identification of splice sites. Alignments are then ranked and filtered based on customizable criteria (such as coverage, identity, rank). Typically, only the best-placed (rank 1) alignment for a given query is selected for use in the downstream steps.
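The rank-then-filter selection described above can be sketched as follows. This is a simplified illustration, not the pipeline's implementation: the ranking score (coverage multiplied by identity), the threshold values, and all names are assumptions made for the example, since the actual criteria are described only as customizable.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Alignment:
    query: str       # transcript accession
    locus: str       # genomic location label
    coverage: float  # fraction of the query aligned, 0..1
    identity: float  # fraction of matching bases, 0..1


def best_placed(
    alignments: Iterable[Alignment],
    min_coverage: float = 0.5,   # hypothetical threshold
    min_identity: float = 0.95,  # hypothetical threshold
) -> Dict[str, Alignment]:
    """Keep the rank 1 alignment per query if it passes the filters."""
    rank1: Dict[str, Alignment] = {}
    for aln in alignments:
        current = rank1.get(aln.query)
        # Assumed ranking score: coverage * identity, higher is better.
        if current is None or aln.coverage * aln.identity > current.coverage * current.identity:
            rank1[aln.query] = aln
    return {
        query: aln
        for query, aln in rank1.items()
        if aln.coverage >= min_coverage and aln.identity >= min_identity
    }


candidates = [
    Alignment("tx1", "chr1:1000-5000", 0.99, 0.98),
    Alignment("tx1", "chr5:200-900", 0.60, 0.97),
    Alignment("tx2", "chr2:300-700", 0.30, 0.99),
]
print(best_placed(candidates))
```

Ranking before filtering means a query whose best placement fails the thresholds contributes nothing downstream, rather than falling back to a lower-ranked placement.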
Transcriptomics reads from SRA generated using long read sequencing technologies such as PacBio or Oxford Nanopore are aligned to the genome using Minimap2 . Each transcript's best-placed (rank 1) alignment is selected for use in the downstream steps, if above 85% identity.
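The selection of a best-placed alignment per query, with the 85% identity cutoff mentioned for long reads, can be sketched as follows. This is an illustration under stated assumptions: the `Alignment` record and the ranking key (coverage first, then identity) are hypothetical, since the text only says that the ranking criteria are customizable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    query: str       # transcript or read identifier
    location: str    # where on the genome the query aligned
    identity: float  # fraction of identical bases, 0..1
    coverage: float  # fraction of the query covered, 0..1


def best_placed(alignments, min_identity=0.85):
    """Keep one alignment per query: the top-ranked one above min_identity."""
    best = {}
    for aln in alignments:
        # The text requires identity strictly above the cutoff.
        if aln.identity <= min_identity:
            continue
        current = best.get(aln.query)
        # Assumed ranking: higher coverage wins, identity breaks ties.
        if current is None or (aln.coverage, aln.identity) > (current.coverage, current.identity):
            best[aln.query] = aln
    return best
```

A query whose alignments all fall at or below the cutoff is simply absent from the result.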
RNA-Seq reads for the species or closely related species are aligned to the genome. When a very large number of samples and reads (multiple billions) are available in SRA , projects with samples spanning the widest range of tissues and developmental stages are chosen over others, with a preference for untreated or non-diseased samples. RNA-Seq reads are aligned to the genome with STAR . To address the short length, redundancy and abundance of the reads, alignments with the same splice structure and the same or similar start and end points are collapsed into a single representative alignment. Information is recorded about the samples and number of reads represented by each alignment, so the level of support can be used to filter alignments and evaluate gene predictions. Alignments representing very rare introns likely to be background noise are filtered out.
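The collapsing step described above can be illustrated with a small sketch. It assumes each alignment is a dictionary with hypothetical keys (`chrom`, `strand`, `start`, `end`, `introns`, `sample`) and groups spliced alignments by their exact intron chain only; the pipeline's handling of "similar" start and end points is not specified in the text and is not modeled here.

```python
from collections import defaultdict


def collapse_alignments(alignments):
    """Collapse alignments sharing a splice structure into one representative.

    Each alignment is a dict with keys chrom, strand, start, end,
    introns (a sequence of (start, end) pairs) and sample.
    """
    groups = defaultdict(list)
    for aln in alignments:
        introns = tuple(aln["introns"])
        if introns:
            key = (aln["chrom"], aln["strand"], introns)
        else:
            # Unspliced alignments are only merged when they are identical.
            key = (aln["chrom"], aln["strand"], ("unspliced", aln["start"], aln["end"]))
        groups[key].append(aln)

    collapsed = []
    for members in groups.values():
        first = members[0]
        collapsed.append({
            "chrom": first["chrom"],
            "strand": first["strand"],
            "introns": tuple(first["introns"]),
            "start": min(m["start"] for m in members),
            "end": max(m["end"] for m in members),
            # Support information kept for later filtering.
            "read_count": len(members),
            "samples": sorted({m["sample"] for m in members}),
        })
    return collapsed
```

Keeping `read_count` and `samples` on the representative mirrors the text's point that the level of support is recorded so it can be used to filter alignments and evaluate gene predictions.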
For each SRA run aligned to the genome, RNA-seq read coverage graphs in UCSC BigWig format are generated and made available for download on the FTP site (see link below). The number of reads mapped to annotated genes is also counted using Subread featureCounts software, and the gene expression counts files are made available for download on the FTP site. Additionally, a file containing information about all of the SRA runs used is provided.
The set of proteins selected for alignment to the genome varies by species, and may include proteins from other organisms. This set generally includes:
- Known RefSeq proteins
- GenBank proteins derived from cDNAs from the taxonomically relevant GenBank divisions
Highly repetitive sequences are removed from the set. Proteins are aligned locally to the genome with BLAST and re-aligned globally using ProSplign . Alignments are then ranked and filtered based on customizable criteria.
Protein, transcript, transcriptomics and RNA-Seq read alignments are passed to Gnomon for gene prediction. Gnomon first chains together non-conflicting alignments into putative models. In a second step, Gnomon extends predictions missing a start or a stop codon or internal exon(s) using an HMM-based algorithm. Gnomon additionally creates pure ab initio predictions where open reading frames of sufficient length but with no supporting alignment are detected.
This first set of predictions is further refined by alignment against a subset of the nr (non-redundant) database of protein sequences. The additional alignments are added to the initial alignments, and the chaining and ab initio extension steps are repeated. The results constitute the set of Gnomon predictions.
Gnomon predictions may include deletions or insertions of Ns with respect to the genomic sequence. These differences are introduced to compensate for frameshifts or stop codons in the literal translation of the genome, when the aligning protein provides evidence of an intact ORF.
For some organisms, a set of genomic sequences is curated ( RefSeq accessions with NG_ prefixes). These sequences represent non-transcribed pseudogenes, manually annotated gene clusters that are difficult to annotate via automated methods, or human RefSeqGene records. They are aligned to the genome, and their best placement is identified.
The final set of annotated features comprises, in order of preference, pre-existing RefSeq sequences and a subset of well-supported Gnomon -predicted models. It is built by evaluating together at each locus the known RefSeq transcripts, the features projected from curated RefSeq genomic alignments and the models predicted by Gnomon .
1. Models based on known and curated RefSeq
RefSeq transcripts are given precedence over overlapping Gnomon models with the same splice pattern. Alignments of known same-species RefSeq transcripts or curated genomic sequences are used directly to annotate the gene, RNA and CDS features on the genome. Since the RefSeq sequence may not align perfectly or completely to the genomic sequence, a consequence of this rule is that the annotated product may differ from the conceptual translation of the genome. Differences between the RefSeq transcripts and the genome are provided in a note on the RefSeq genomic record (scaffold or chromosome).
2. Models based on Gnomon predictions
Gnomon predictions are included in the final set of annotations if they do not share all splice sites with a RefSeq transcript and if they meet certain quality thresholds including:
- Only fully- or partially-supported Gnomon predictions, or pure ab initio Gnomon predictions with high coverage hits to UniProtKB/SwissProt proteins are selected
- When multiple fully-supported transcript variants are predicted for a gene, only the Gnomon predictions supported in their entirety by a single long alignment (e.g. a full-length mRNA) or by RNA-Seq reads from a single BioSample are selected
- Poorly-supported Gnomon predictions conflicting with better-supported models annotated on the opposite strand are excluded from the final set of models
- Gnomon predictions with high homology to transposable or retro-transposable elements are excluded from the final set of models
3. Integrating RefSeq and Gnomon annotations
As a result of the model selection process, a gene may be represented by multiple splice variants, with some of them known RefSeq and others model RefSeq (originating from Gnomon predictions).
Gnomon predictions selected for the final annotation set are assigned model RefSeq accessions with XM_ or XR_ prefixes for transcripts and XP_ prefixes for proteins to distinguish them from known RefSeq with NM_/NR_ and NP_ prefixes. Model RefSeq can be searched in Entrez with the query “srcdb_refseq_model[properties]” while known RefSeq sequences can be obtained with the query “srcdb_refseq_known[properties]”.
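The accession prefixes above map directly to a lookup table. The sketch below is illustrative: the function names are hypothetical, and the organism filter added to the Entrez query is an assumption for the example, while the prefixes and the two `[properties]` terms come from the text.

```python
# Prefix -> (RefSeq category, molecule type), as described in the text.
REFSEQ_PREFIXES = {
    "NM_": ("known", "coding transcript"),
    "NR_": ("known", "non-coding transcript"),
    "NP_": ("known", "protein"),
    "XM_": ("model", "coding transcript"),
    "XR_": ("model", "non-coding transcript"),
    "XP_": ("model", "protein"),
}


def classify_accession(accession):
    """Return (category, molecule type), or None for other prefixes."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


def entrez_query(organism, model=True):
    """Build an Entrez query string for model or known RefSeq records.

    Combining the property term with an [Organism] filter is an
    illustrative choice, not something the text prescribes.
    """
    prop = "srcdb_refseq_model" if model else "srcdb_refseq_known"
    return f'"{organism}"[Organism] AND {prop}[properties]'
```

For example, `entrez_query("Pongo abelii", model=False)` builds a query string restricted to known RefSeq records for that species.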
- Genes represented by known or curated RefSeq sequences inherit the Gene symbol, name and locus type (e.g. coding, pseudogene...) of the RefSeq sequence.
- Genes represented by predicted models are named based on homology to SwissProt proteins.
- Most Gnomon models with insertions, deletions or frameshifts are labeled pseudogenes.
- Gnomon models with insertions or deletions relative to the genome may be considered coding if they have a strong unique hit to the SwissProt database or appear to be orthologs of known protein-coding genes. Titles for these models are prefixed with “PREDICTED: LOW QUALITY PROTEIN” to indicate that these models and the underlying assembly sequences may contain defects.
- Gnomon models that appear to be single-exon retrocopies of protein-coding genes may be annotated as pseudogenes.
- When multiple assemblies are annotated , a partial or imperfect model may be called coding because a complete model exists at the corresponding locus on one of the other annotated assemblies.
Gene Ontology (GO) terms for all annotated proteins were computed using InterProScan , a tool that identifies protein domains and families. The GO terms were then collated by gene, and the resulting GO annotations are made available for download from the FTP site (see link below) in the GAF (GO Annotation File) format .
Genes in the final set of models are assigned GeneIDs in NCBI's Gene database.
- A gene represented by a known RefSeq transcript will receive the GeneID of the RefSeq transcript.
- All alternative splice forms of a gene get the same GeneID.
- As much as possible, GeneIDs are carried forward from one annotation run to the next, using the mapping of the new assembly to the previous one if the assembly was updated.
- Gene features mapped to equivalent locations of co-annotated assemblies are assigned the same GeneIDs.
- miRNAs are imported from miRBase , accessioned with NR_ prefixes and placed using Splign .
- tRNAs are predicted with tRNAscan-SE .
- Starting with software version 8.0, rRNAs, snoRNAs and snRNAs are annotated by searching eukaryotic RFAM HMMs against the genome with Infernal's cmsearch .
Starting with software release 9.0, Cap Analysis Gene Expression (CAGE) data available in SRA for the species are aligned to the genome with Splign and used for annotating transcription start sites.
When multiple assemblies of good quality are available for a given organism, annotation of all is done in coordination. To ensure that matching regions across assemblies are annotated the same way, assemblies are aligned to each other before the annotation.
- Assembly-assembly alignment results are used to rank the transcript and the curated genomic alignments: for a given query sequence, alignments to corresponding regions of two assemblies receive the same rank.
- Corresponding loci of multiple assemblies are assigned the same GeneID and locus type.
Organisms are periodically re-annotated when new evidence is available (e.g. RNA-Seq) or when a new assembly is released. Special attention is given to tracking of models and genes from one release of the annotation to the next. Previous and current models annotated at overlapping genomic locations are identified and the locus type and GeneID of the previous models are taken into consideration when assigning GeneIDs to the new models. If the assembly was updated between the two rounds of annotation, the assemblies are aligned to each other and the alignments used to match previous and current models in mapped regions.
The quality of the annotation is assessed prior to publishing, based on the intrinsic characteristics of the annotated models and on the expectations for the species. Indicators of a low quality annotation may disqualify a genome from being included in RefSeq. These indicators are: high count of coding genes that lack near-full coverage by alignments of experimental evidence, high count of partial coding genes (lacking a start or stop codon, or internal exons), high count of low-quality genes with suspected frameshifts or premature stop codons, low BUSCO completeness score (see below), and, for vertebrates, low count of genes with orthologs to a reference species.
BUSCO run in "protein" mode provides an estimate of the completeness of the gene set. The BUSCO models (single-copy marker genes) for the most fitting lineage based on NCBI Taxonomy are searched against the longest protein for each annotated coding gene. Results are reported in BUSCO notation (C:complete [S:single-copy, D:duplicated], F:fragmented, M:missing, n:number of genes used).
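A summary line in the BUSCO notation described above can be parsed with a regular expression. This is a sketch that assumes the common one-line form with percentages, such as `C:98.5%[S:97.0%,D:1.5%],F:0.5%,M:1.0%,n:3354`; the numbers in that example are invented for illustration and do not describe any real annotation.

```python
import re

# One-line BUSCO notation; the exact layout is an assumption of this sketch.
BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(summary):
    """Return percentages for C, S, D, F, M and the gene count n."""
    match = BUSCO_RE.search(summary.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a BUSCO summary: {summary!r}")
    result = {key: float(value) for key, value in match.groupdict().items() if key != "n"}
    result["n"] = int(match.group("n"))
    return result
```

Since complete genes are either single-copy or duplicated, the S and D values should sum to C, which gives a quick sanity check on a parsed line.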
- The scaffolds and chromosomes of the assembled genomes, with the annotation products as features.
- The individual products (transcripts and proteins)
- Sequence records for predicted models, scaffolds and chromosomes contain the Annotation Name, which uniquely identifies the annotation. Examples:
The sequence records for scaffolds, chromosomes and predicted transcripts and proteins for NCBI Pongo abelii Annotation Release 103 contain the following comment:
##Genome-Annotation-Data-START##
Annotation Provider :: NCBI
Annotation Status :: Full annotation
Annotation Name :: Pongo abelii Annotation Release 103
Annotation Version :: 103
Annotation Pipeline :: NCBI eukaryotic genome annotation pipeline
Annotation Software Version :: 8.0
Annotation Method :: Best-placed RefSeq; Gnomon
Features Annotated :: Gene; mRNA; CDS; ncRNA
##Genome-Annotation-Data-END##
The sequence records for scaffolds, chromosomes and predicted transcripts and proteins for NCBI GCF_016801865.2-RS_2022_12 contain the following comment:
##Genome-Annotation-Data-START##
Annotation Provider :: NCBI RefSeq
Annotation Status :: Full annotation
Annotation Name :: GCF_016801865.2-RS_2022_12
Annotation Pipeline :: NCBI eukaryotic genome annotation pipeline
Annotation Software Version :: 10.1
Annotation Method :: Gnomon; cmsearch; tRNAscan-SE
Features Annotated :: Gene; mRNA; CDS; ncRNA
##Genome-Annotation-Data-END##
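The comment blocks shown above are simple `key :: value` records, so the Annotation Name and software version can be extracted with a few lines of code. The sketch below assumes one field per line between the START and END markers; the function name is hypothetical, and the example text reuses the values quoted above.

```python
EXAMPLE = """\
##Genome-Annotation-Data-START##
Annotation Provider :: NCBI RefSeq
Annotation Status :: Full annotation
Annotation Name :: GCF_016801865.2-RS_2022_12
Annotation Pipeline :: NCBI eukaryotic genome annotation pipeline
Annotation Software Version :: 10.1
Annotation Method :: Gnomon; cmsearch; tRNAscan-SE
Features Annotated :: Gene; mRNA; CDS; ncRNA
##Genome-Annotation-Data-END##
"""


def parse_annotation_comment(text):
    """Return the fields of a Genome-Annotation-Data block as a dict."""
    fields = {}
    inside = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("##Genome-Annotation-Data-START##"):
            inside = True
        elif line.startswith("##Genome-Annotation-Data-END##"):
            break
        elif inside and "::" in line:
            # Split on the first '::' only, values may contain ';' lists.
            key, value = line.split("::", 1)
            fields[key.strip()] = value.strip()
    return fields
```

Calling `parse_annotation_comment(EXAMPLE)["Annotation Name"]` returns the identifier that uniquely names the annotation.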
The data produced by the annotation pipeline is available in various resources:
- Genome Data Viewer
- BUSCO : Manni M et al. Molecular biology and evolution 2021, 38 (10):4647-4654
- InterProScan : Jones P et al. Bioinformatics 2014, 30 (9):1236-1240
- Minimap2 : Li H. Bioinformatics 2018, 34 (18):3094-3100
- miRBase : Griffiths-Jones S. Nucleic Acids Research 2004, 32 (Database Issue):D109-11
- RefSeq : Pruitt KD et al. Nucleic Acids Research 2014, 42 (Database issue):D756-63
- RepeatMasker : Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 1996–2004. http://www.repeatmasker.org
- Rfam : Nawrocki, EP et al. Nucleic Acids Research 2015, 43 (Database issue):D130-7
- Splign : Kapustin Y, Souvorov A, Tatusova T, Lipman D. Biology Direct 2008, 3 :20
- STAR : Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR. Bioinformatics 2013, 29 (1): 15–21
- Subread featureCounts : Liao, Y, Smyth GK, Shi, W. Bioinformatics 2014, 30 (7):923-930
- tRNAscan-SE : Lowe, TM and Eddy, SR. Nucleic Acids Research 1997, 25 : 955-964
- WindowMasker : Morgulis A, Gertz EM, Schäffer AA, Agarwala R. Bioinformatics 2006 2 :134-41
Last updated: 2023-11-14T21:17:35Z
What is Gene Annotation in Bioinformatics?
Posted by Biolyse | Nov 3, 2018 | Bioinformatics
Over the years, scientists and researchers have made tremendous efforts, through various inventions and innovations, to make life better. Bioinformatics, as an interdisciplinary field, has created numerous opportunities for scientific advancement. One considerable milestone in bioinformatics concerns the most basic level of life: genes. Previously, the ability to identify and distinguish genes was limited, which hindered scientific manipulation and diagnostic procedures. With a clear understanding of what a sequenced genome contains, we can make real progress in managing various conditions and in maintaining a healthy population. Gene annotation has brought this within reach.
What is gene annotation?
In molecular biology, the genome is an organism's basic genetic material and typically consists of DNA. The genome includes genes (coding regions) and non-coding regions; of most interest to us are the coding regions, as they actively influence basic life processes. Genes contain the biological information required to build and maintain an organism. Gene annotation can be defined simply as the process of making a nucleotide sequence meaningful. However, it is a much more complex process, encompassing several procedures and a broad range of activities.
Gene annotation involves taking the raw DNA sequence produced by genome-sequencing projects and adding the layers of analysis and interpretation needed to extract biologically significant information and place it in context. Bioinformatics provides software to perform these complex procedures. The first gene annotation software system was developed in 1995 at The Institute for Genomic Research, and it was used in the project that sequenced and analyzed the genome of the bacterium Haemophilus influenzae.
As a process for identifying gene locations and coding regions, gene annotation gives us insight into what these genes do in the body by establishing their structure and relating it to the functions of different proteins. Currently, the process is largely automated, and the National Center for Biomedical Ontology has a database of records to enable comparison.
How is gene annotation performed?
Gene annotation can be either manual or electronic, with the aid of tools developed by a range of organizations. The downsides of the manual technique are that it is time-consuming and its throughput is low. However, it remains useful for predictive purposes and thus serves a complementary function. There are three main steps in the process of gene annotation:
Identification of the non-coding regions of the genome, that is, the portions that do not code for proteins. This is vital to limit the range of analysis and focus on the essential components, since there is little point in doing tedious work on portions that give little or no biological information.
Gene prediction, also referred to as gene finding. This process identifies the regions of genomic DNA that encode genes and gives an overview of the amino acid sequences they encode and the roles of these elements. It can be done with empirical methods or ab initio methods.
Establishing a connection between the identified elements and the biological information at hand. This is how biological functions are linked to the sequence data.
Homology-based tools, for example BLAST, have hugely simplified gene annotation, which can now be done without much of the hassle of manual methods that require human expertise.
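To make the ab initio side of gene prediction concrete, here is a minimal open reading frame (ORF) scanner. It is a deliberately simplified sketch, not a real gene finder: it scans only the forward strand, ignores introns, and the `min_codons` threshold is an arbitrary illustrative parameter.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=30):
    """Return (start, end, frame) for forward-strand ORFs.

    An ORF here runs from an ATG to the next in-frame stop codon.
    start is 0-based, end is exclusive and includes the stop codon.
    min_codons counts the codons before the stop, including the ATG.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

Real gene prediction tools add statistical models of coding sequence and splice sites on top of this idea, and combine it with the empirical evidence from alignments.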
Modalities of gene annotation
Genomics is a broad field and can be subdivided into structural genomics, functional genomics and comparative genomics. Similarly, gene annotation is a two-phase process comprising structural gene annotation and functional gene annotation.
Structural annotation
Structural annotation is the initial step in gene annotation and involves identifying the elements of the genome from their physical and compositional characteristics. Features such as coding regions, gene structures, ORFs and their locations, as well as regulatory motifs, are the crucial information derived from this step, and they drive the identification and distinction of genes. The accuracy of this step can be evaluated with two parameters: sensitivity and specificity. Sensitivity is the percentage of correct signals predicted among all true signals, while specificity is the proportion of correct signals among all those predicted.
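The two accuracy measures defined above can be computed from a set of predicted features and a set of true features. The sketch below treats each feature as a hashable item such as an exon's (start, end) pair; the function name is hypothetical. Specificity as defined here (correct predictions over all predictions) is the quantity that other fields call precision.

```python
def prediction_accuracy(predicted, actual):
    """Return (sensitivity, specificity) for a set of predicted features.

    sensitivity = correct predictions / all true features
    specificity = correct predictions / all predicted features
    """
    predicted, actual = set(predicted), set(actual)
    correct = len(predicted & actual)
    sensitivity = correct / len(actual) if actual else 0.0
    specificity = correct / len(predicted) if predicted else 0.0
    return sensitivity, specificity
```

With three predicted exons of which two are real, and four real exons in total, sensitivity is 2/4 and specificity is 2/3.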
Functional annotation
Functional annotation is the process of relating biological functions to the genetic elements identified in the structural annotation step. Biochemical function, physiological function, regulation, interactions and expression are some of the aspects typically considered in DNA annotation.
The above steps can involve biological experiments as well as in silico analysis. A newer method that seeks to improve genome annotation, proteogenomics, is currently in use; it utilizes information from expressed proteins, obtained from mass spectrometry.
Essential components
Gene annotation is a purposeful process, and some of the vital features that we seek to extract include CDS, mRNA, pseudogenes, promoter and poly-A signals, and ncRNA, among others. Such elements are small and identifying them can be laborious. Scientists have developed software to aid the process; notable tools in frequent use are ORF detectors, promoter detectors and start/stop codon identifiers. Automation of this process has enhanced accuracy, and large discrepancies now exist between automated results and manually conducted procedures, as gene sequencing is a dynamic topic.
After a successful gene annotation process, it is expected that the obtained information should be published, stored in the database and shared for research purposes.
Gene annotation is a new and exceedingly promising field; much remains to be uncovered, and many potentially beneficial areas remain to be explored. Fortunately, many groups have invested in gene annotation, and new developments arise daily. Ongoing gene annotation projects include Ensembl, GENCODE and GeneRIF, among others. New literature on this topic is published daily, so it is prudent to keep up to date.
DNA annotation reveals much of the information contained in genomes; a complete gene annotation is therefore descriptive of an organism's makeup and remains a milestone achievement.
- Open access
- Published: 09 November 2023
Transcriptional and epigenetic regulators of human CD8+ T cell function identified through orthogonal CRISPR screens
- Sean R. McCutcheon 1 , 2 ,
- Adam M. Swartz 3 ,
- Michael C. Brown 4 ,
- Alejandro Barrera ORCID: orcid.org/0000-0001-9244-9822 2 , 5 ,
- Christian McRoberts Amador 2 , 6 ,
- Keith Siklenka 2 , 5 ,
- Lucas Humayun ORCID: orcid.org/0000-0003-4624-2550 1 ,
- Maria A. ter Weele 1 , 2 ,
- James M. Isaacs 7 ,
- Timothy E. Reddy ORCID: orcid.org/0000-0002-7629-061X 1 , 2 , 5 ,
- Andrew S. Allen 2 , 5 ,
- Smita K. Nair ORCID: orcid.org/0000-0001-7019-1912 3 , 7 , 8 ,
- Scott J. Antonia 7 &
- Charles A. Gersbach ORCID: orcid.org/0000-0003-1478-4013 1 , 2 , 3
Nature Genetics (2023)
- Functional genomics
- Immunotherapy
Clinical response to adoptive T cell therapies is associated with the transcriptional and epigenetic state of the cell product. Thus, discovery of regulators of T cell gene networks and their corresponding phenotypes has potential to improve T cell therapies. Here we developed pooled, epigenetic CRISPR screening approaches to systematically profile the effects of activating or repressing 120 transcriptional and epigenetic regulators on human CD8+ T cell state. We found that BATF3 overexpression promoted specific features of memory T cells and attenuated gene programs associated with cytotoxicity, regulatory T cell function, and exhaustion. Upon chronic antigen stimulation, BATF3 overexpression countered phenotypic and epigenetic signatures of T cell exhaustion. Moreover, BATF3 enhanced the potency of CAR T cells in both in vitro and in vivo tumor models and programmed a transcriptional profile that correlates with positive clinical response to adoptive T cell therapy. Finally, we performed CRISPR knockout screens that defined cofactors and downstream mediators of the BATF3 gene network.
Adoptive T cell therapy (ACT) holds tremendous potential for cancer treatment by redirecting T cells to cancer cells via expression of engineered receptors that recognize and bind to tumor-associated antigens. The potency and duration of T cell response are associated with defined T cell subsets, and cell products enriched in stem or memory T cells provide superior tumor control in animal models and in the clinic 1 , 2 , 3 , 4 , 5 . Consequently, precise regulation or programming of T cell state is a promising approach to improve the therapeutic potential of ACT.
T cell state and function are largely regulated by specific transcription factors (TFs) and epigenetic modifiers that process intrinsic and extrinsic signals into complex and exquisitely tuned gene expression programs. For example, TOX 6 , 7 , 8 , 9 , 10 and NFAT 11 program CD8+ T cell exhaustion in the context of chronic antigen exposure. Conversely, T cell function can be enhanced by rewiring transcriptional networks through either enforced expression or genetic deletion of specific TFs and epigenetic modifiers. Ectopic overexpression (OE) of specific TFs such as c-JUN 12 , BATF 13 and RUNX3 (ref. 14 ) or genetic deletion of NR4A 15 , FLI1 (ref. 16 ), members of the BAF chromatin remodeling complex 17 , 18 , and regulators of DNA methylation 19 , 20 can alter T cell state and improve T cell function through diverse mechanisms.
Large-scale CRISPR knockout (CRISPRko) 21 , 22 , 23 and open reading frame (ORF) OE 24 screens have further accelerated gene discovery. Compared to these screening modes, it has been more challenging to conduct gene activation or repression screens via epigenome editing in primary human T cells 25 . One study optimized lentiviral production to overcome limitations of delivering large CRISPR-based epigenome editors and then conducted proof-of-concept gene silencing or activation screens to define regulators of cytokine production 25 . However, there remains an expansive opportunity to discover modulators of other T cell states, as well as combinatorial perturbations to dissect gene interactions that control human T cell phenotypes.
In this Article, we developed an approach for CRISPR interference (CRISPRi) or CRISPR activation (CRISPRa) screens in primary human T cells and applied it to systematically profile the effects of 120 genes on human CD8+ T cell state. These screens and subsequent characterization revealed that overexpressing BATF3 supports specific features of memory T cells, counters T cell exhaustion and improves tumor control. We conducted pooled CRISPRko screens of all human transcription factor genes (TFome) with or without BATF3 OE to define cofactors and downstream targets of BATF3. More generally, we developed orthogonal CRISPR-based screening approaches to systematically discover regulators of gene networks and complex T cell phenotypes, which should accelerate efforts to engineer T cells with enhanced durability and therapeutic potential.
Developing an epigenetic screening platform in human T cells
Staphylococcus aureus Cas9 (SaCas9) has been extensively used for genome editing in vivo as its compact size (3,159 bp) relative to the conventional Streptococcus pyogenes Cas9 (SpCas9) enables packaging into adeno-associated virus 26 , 27 , 28 . However, SaCas9 has not been widely used for targeted gene regulation 29 , 30 or in the context of an epigenome editing screen. To facilitate delivery to human T cells, we rigorously characterized the activity of dSaCas9 as a repressor or activator using several promoter tiling guide RNA (gRNA) screens in both primary human T cells and the Jurkat cell line (Extended Data Figs. 1 – 3 , Supplementary Fig. 1 and Supplementary Note 1 ). Collectively, this work demonstrated that dSaCas9 can potently silence or activate target gene expression and informed gRNA design rules.
CRISPRi/a screens identify regulators of human T cell state
We sought to interrogate the effects of repressing or activating genes encoding TFs and epigenetic modifiers on T cell state. We designed a gRNA library targeting 120 TFs and epigenetic modifiers associated with T cell state (Supplementary Fig. 3 and Supplementary Note 2 ). To detect whether specific gene perturbations altered T cell state, we used CCR7 expression as our screen readout (Fig. 1a and Supplementary Fig. 4 ). CCR7 is a well-characterized T cell marker and is highly expressed in specific T cell subsets such as naive, stem-cell memory and central memory T cells 31 . We hypothesized it would enable us to capture more subtle changes in T cell state than phenotypic readouts such as proliferation or cytokine production.

a , Schematic of CRISPRi/a screens with TF gRNA library (lib). b , c , Significance ( P adj ) versus fold change in gRNA abundance between CCR7 HIGH and CCR7 LOW populations for CRISPRi ( b ) and CRISPRa ( c ) screens. gRNA enrichment was defined using a paired two-tailed DESeq2 test with Benjamini–Hochberg correction. d , Fold change of BATF3 and BATF CRISPRa gRNA hits for each donor (D1-D3). Blue lines represent BATF3 or BATF gRNAs and gray lines represent the distribution of 120 non-targeting (NT) control gRNAs. e , f , All BATF3 ( e ) and BATF ( f ) CRISPRa gRNAs in gRNA library relative to TSS, chromatin accessibility and ENCODE candidate cis -regulatory elements (cCREs). Blue and black lines represent gRNA hits and nonsignificant gRNAs, respectively. cCRE tracks are overlaid for visualization of promoter-like elements (red) and enhancer-like elements (blue).
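The Benjamini–Hochberg correction named in the caption above converts raw P values into adjusted P values that control the false discovery rate. The following is a generic sketch of that adjustment step only; it does not reproduce DESeq2's statistical test or the authors' analysis code, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A gRNA would then be called a hit when its adjusted value falls below the chosen significance threshold.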
The CRISPRi screen recovered many canonical regulators of memory T cells including FOXO1 (ref. 32 ), MYB 33 and BACH2 (ref. 34 )—all of which when silenced led to reduced expression of CCR7, indicative of T cell differentiation towards effector T cells (Fig. 1b and Supplementary Fig. 5a ). Interestingly, the most significant hit from the CRISPRi screen was the gene encoding the maintenance DNA methyltransferase DNMT1 . Genetic disruption of both TET2 and DNMT3A , which encode for proteins that regulate DNA methylation in opposite directions, can improve the therapeutic potential of T cells 19 , 20 . There was a single nontargeting (NT) gRNA (1/120) hit in the CRISPRi screen. The same NT gRNA emerged as a hit in multiple screens using CCR7 as the readout, suggesting a real off-target effect.
The CRISPRa screen also identified several TFs that have been implicated in CD8+ T cell differentiation and function such as EOMES 35 , BATF 13 and JUN 12 (Fig. 1c ). Importantly, gRNA enrichment was consistent across the three donors and not a function of the number of gRNAs targeting each gene (Fig. 1d and Supplementary Fig. 5 ). Multiple gRNAs targeting BATF and BATF3 were enriched in reciprocal directions across CRISPRi and CRISPRa screens, and BATF and BATF3 were among the top hits in gene-level analyses, highlighting the power of coupling loss- or gain-of-function perturbations (Supplementary Table 2 ). The BATF and BATF3 gRNA hits in the CRISPRa screen generally colocalized to regions upstream of the promoter and near the summits of accessible chromatin (Fig. 1e,f ).
scRNA-seq characterization of transcriptional regulators
We next characterized the transcriptomic effects of each candidate gene identified from our CRISPRi or CRISPRa screens using single-cell RNA sequencing (scRNA-seq). We cloned the union set of gRNA hits across CRISPRi/a screens (32 gRNAs) and 8 NT gRNAs into both CRISPRi and CRISPRa plasmids (Supplementary Table 3 ). We then followed the same workflow as the sort-based screens, but instead of sorting the cells based on CCR7 expression, we profiled the transcriptomes and gRNA identity of ~60,000 cells across three donors for each screen. We aggregated the cells based on gRNA assignment and compared the transcriptional profile of cells with the same gRNA to nonperturbed cells (Supplementary Fig. 6 and Supplementary Note 3 ).
First, we focused on CCR7 expression to validate the results from our CRISPRi/a screens (Fig. 2a,b ). Roughly half of the gRNA hits affected CCR7 expression, and the rank order was similar to the sort-based screens. For example, both assays informed that targeted silencing of DNMT1 or FOXO1 drastically reduced CCR7 expression levels, which was further confirmed through individual gRNA validations (Supplementary Fig. 7a,b ). The gRNA hits that did not validate in the scRNA-seq characterization were represented by fewer cells than validated gRNAs, reaffirming that higher gRNA coverage helps to resolve more subtle changes in gene expression 36 (Supplementary Fig. 7c and Supplementary Note 4 ). In addition to confirming gRNA effects on CCR7 expression, the true negative rates were high for both CRISPRi (96%) and CRISPRa (82%), demonstrating the specificity of these sort-based screens (Fig. 2a,b ).

a , b , Significance ( P adj ) versus average fold change of CCR7 expression for each gRNA compared to nonperturbed cells for CRISPRi ( a ) and CRISPRa ( b ) perturbations. Significant gRNA effects on CCR7 expression were defined using a two-tailed MAST test with Bonferroni correction. True positive (TP) and true negative (TN) rates are displayed above each volcano plot. c , Fold change in target gene expression for NT gRNAs and targeting gRNAs across CRISPRi ( n = 31 gRNAs) and CRISPRa ( n = 30 gRNAs) perturbations (mean values ± s.e.m.). A two-way ANOVA with Tukey’s post hoc test was used to compare groups. d , Dot plot with average expression and percentage of cells expressing target genes, memory markers and effector molecules for the indicated CRISPRi perturbations. Significant gRNA–gene links were defined using a two-tailed MAST test with Bonferroni correction. e , Number of DEGs ( P adj < 0.01) associated with each gRNA versus the gRNA effect on the target gene for both CRISPRi and CRISPRa perturbations. Significant gRNA–gene links were defined using a two-tailed MAST test with Bonferroni correction. f , g , Correlation of the union set of DEGs between the top two CRISPRi MYB gRNAs ( f ) and CRISPRa BATF3 gRNAs ( g ). Pearson’s correlation coefficient was calculated and then a two-tailed t -test was conducted to determine whether the relationship was significant. h , i , Representative enriched pathways for the top three CRISPRi ( h ) and CRISPRa gRNAs ( i ). Statistical significance was defined using a two-tailed Fisher’s exact test followed by Benjamini–Hochberg correction.
We next measured on-target gene silencing or activation. Of gRNAs assigned to at least five cells in each of the CRISPRa and CRISPRi screens, 56/61 gRNAs (92%) silenced or activated their gene target (Fig. 2c ). Given that CCR7 was selected as a surrogate marker for a memory T cell phenotype, we expected some perturbations to regulate subset-defining gene expression programs. Indeed, scRNA-seq revealed that silencing the top predicted positive regulators of memory ( DNMT1 , FOXO1 and MYB ) led to decreased expression of CCR7 and other memory-associated genes (such as IL7R , SELL , CD27 , CD28 and TCF7 ) and increased expression of effector-associated genes ( GZMA , GZMB and PRF1 ) (Fig. 2d ).
Finally, we examined all differentially expressed genes (DEGs) associated with each perturbation. Endogenous regulation of several TFs and epigenetic-modifying proteins had widespread transcriptional effects with six gene perturbations (four CRISPRi gene perturbations and two CRISPRa gene perturbations) altering expression of >1,000 genes (Fig. 2e ). Interestingly, MYB repression with two unique gRNAs led to widespread and concordant gene expression changes with 8,976 and 7,899 DEGs (Fig. 2e,f ). MYB silencing drove a transcriptional program with hallmark features of effector T cells, suggesting that MYB plays a key role in regulating T cell stemness in human CD8 + T cells as previously reported in mouse CD8 + T cells 33 (Extended Data Fig. 4a,b and Supplementary Note 5 ).
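The concordance check between two gRNAs targeting the same gene can be illustrated with a minimal sketch. This is not the analysis code used in the study: the fold-change values and DEG calls below are invented, and `pearson_r` is a hypothetical helper. The sketch takes the union set of DEGs from two gRNAs, keeps the genes measured for both, and computes Pearson's correlation coefficient together with the t statistic used to test whether the correlation differs from zero.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

# Hypothetical log2 fold changes (versus nonperturbed cells) for two gRNAs
# that target the same gene. Values are invented for illustration.
grna1 = {"GZMB": 1.8, "PRF1": 1.2, "CCR7": -2.1, "SELL": -1.5, "TCF7": -0.9}
grna2 = {"GZMB": 1.5, "PRF1": 0.9, "CCR7": -1.7, "IL7R": -1.1, "TCF7": -1.0}
deg1 = {"GZMB", "PRF1", "CCR7", "SELL"}  # genes called DEGs for gRNA 1
deg2 = {"GZMB", "CCR7", "IL7R", "TCF7"}  # genes called DEGs for gRNA 2

# Union set of DEGs, restricted to genes measured for both gRNAs.
union = sorted((deg1 | deg2) & set(grna1) & set(grna2))
r = pearson_r([grna1[g] for g in union], [grna2[g] for g in union])

# t statistic for the null hypothesis r = 0, with n - 2 degrees of freedom.
t_stat = r * math.sqrt((len(union) - 2) / (1 - r ** 2))
print(union, round(r, 3), round(t_stat, 2))
```

With real data the union set contains thousands of genes, so the t statistic would be compared against a t distribution with correspondingly many degrees of freedom.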
Endogenous activation of several TFs including NR1D1 , EOMES and BATF3 had pronounced effects on T cell state (Fig. 2e,i ). Perturbation-driven single-cell clustering revealed a distinct cell cluster with NR1D1 activation that was markedly enriched for exhaustion-associated genes compared to nonperturbed cells (Extended Data Fig. 4c–e and Supplementary Note 6 ). Notably, a pair of highly concordant BATF3 gRNAs had the strongest effects among CRISPRa perturbations with 3,056 and 1,402 DEGs (Fig. 2e,g ). Gene Ontology analyses revealed that BATF3-induced genes were enriched for DNA and messenger RNA metabolic processing, ribosomal biogenesis and cell-cycle pathways, suggesting an improvement in T cell fitness (Fig. 2i ).

BATF3 OE programs features of memory T cells
BATF3 promotes survival and memory formation in mouse CD8 + T cells. However, molecular and phenotypic effects of BATF3 in human CD8 + T cells have not been well defined 37 . Given that BATF3 ORF delivery led to higher expression of BATF3 than endogenous BATF3 activation (Extended Data Fig. 5a and Supplementary Note 7 ) and the compact size of the BATF3 ORF (only 381 bp), we used ectopic BATF3 expression for all subsequent assays and GFP OE as a negative control.
BATF3 OE markedly increased expression of IL7R, a surface marker associated with T cell survival, long-term persistence and positive clinical response to ACT 38 (Fig. 3a,b and Extended Data Fig. 5b ). We performed RNA-seq across CD8 + T cells from five donors to gain an unbiased view of the transcriptomic changes induced by BATF3 OE. Compared to control cells, there were over 1,100 DEGs distributed almost equally between upregulated and downregulated genes (Fig. 3c ). Gene Ontology analyses revealed that BATF3 OE increased expression of genes involved in metabolic pathways such as glycolysis and gluconeogenesis, DNA replication and translation (Fig. 3d and Supplementary Table 4 ).

a , Representative histogram of IL7R expression in CD8 + T cells with BATF3 OE or control GFP OE on day 8 post-transduction. b , Summary statistics of IL7R expression with or without BATF3 OE ( n = 3 donors with lines connecting the same donor, a two-tailed paired t -test ( P = 0.0004) was used to compare IL7R expression between groups). c , Differential gene expression analysis between CD8 + T cells with or without BATF3 OE on day 10 post transduction ( n = 5 donors). DEGs were defined using a paired two-tailed DESeq2 test with Benjamini–Hochberg correction. d , e , Selected enriched ( d ) and depleted ( e ) biological processes from BATF3 OE. Statistical significance was defined using a two-tailed Fisher’s exact test followed by Benjamini–Hochberg correction. f , Heatmap of DEGs ( P adj < 0.01, n = 5 donors) related to T cell exhaustion, regulatory function, cytotoxicity, transcriptional activity and glycolysis. g , Representative histograms of exhaustion markers (TIGIT, LAG3 and TIM3) on day 12 after acute or chronic stimulation across groups. h , Stacked bar chart with average percentage of CD8 + T cells positive for zero, one, two or three exhaustion markers (TIGIT, LAG3 and TIM3) on day 12 after chronic stimulation across groups ( n = 3 donors, mean values ± s.e.m.).
In contrast, BATF3 OE dampened T cell effector programs and downregulated activation markers, inflammatory cytokines and cytotoxic molecules (Fig. 3e,f ). Additionally, BATF3 OE reduced expression of several markers associated with FOXP3 + regulatory T cells (T regs ), which are associated with poor response to ACT 38 . A subset of CD8 + FOXP3 + LAG3 + T regs suppress T cell activity by secreting CC chemokine ligand 4 (CCL4) 39 . Interestingly, BATF3 OE reduced expression of FOXP3, LAG3 and CCL4 in CD8 + T cells (Fig. 3f and Extended Data Fig. 5c ).
In addition to LAG3, BATF3 OE silenced other canonical markers of T cell exhaustion including TIGIT, TIM3 and CISH (Fig. 3f and Extended Data Fig. 5c ). We speculated these effects might be amplified in the context of chronic antigen stimulation (Extended Data Fig. 6 ). As previously observed 40 , PD1 expression peaked after the initial stimulation and then tapered off over time, whereas TIGIT, LAG3 and TIM3 expression was maintained or increased after each subsequent round of stimulation. Notably, BATF3 OE attenuated PD1 induction and restricted TIGIT, LAG3 and TIM3 expression to closely resemble that of acutely stimulated cells despite three additional rounds of TCR stimulation (Fig. 3g and Extended Data Fig. 6b,c ). As terminally exhausted T cells often co-express multiple exhaustion-associated markers, we quantified the proportion of cells expressing each combination of TIGIT, LAG3 and TIM3. Only 13% of BATF3 OE T cells co-expressed all three markers compared to 65% and 59% of untreated and GFP-treated T cells, respectively (Fig. 3h ).
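The co-expression summary described above (cells positive for zero, one, two or three of TIGIT, LAG3 and TIM3) amounts to counting positive markers per cell. The following is a minimal sketch under stated assumptions: each cell is represented by hypothetical boolean gating calls, and `marker_count_fractions` is an illustrative name rather than a function from any flow cytometry package.

```python
from collections import Counter

MARKERS = ("TIGIT", "LAG3", "TIM3")

def marker_count_fractions(cells):
    """Fraction of cells positive for 0, 1, 2 or 3 of the markers.

    `cells` is a list of dicts mapping marker name -> bool (gated positive).
    """
    counts = Counter(sum(cell[m] for m in MARKERS) for cell in cells)
    total = len(cells)
    return {k: counts.get(k, 0) / total for k in range(len(MARKERS) + 1)}

# Toy data: four cells with hypothetical gating calls.
cells = [
    {"TIGIT": True, "LAG3": True, "TIM3": True},
    {"TIGIT": True, "LAG3": False, "TIM3": False},
    {"TIGIT": False, "LAG3": False, "TIM3": False},
    {"TIGIT": True, "LAG3": True, "TIM3": False},
]
fractions = marker_count_fractions(cells)
print(fractions)
```

The value stored under key 3 corresponds to the triple-positive fraction reported in the text for each treatment group.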
BATF3 OE remodels the epigenetic landscape
As an orthogonal method of inducing T cell exhaustion, we acutely or chronically stimulated HER2-targeted CAR T cells with or without BATF3 OE with HER2 + cancer cells (Fig. 4 , Supplementary Fig. 8 and Supplementary Note 8 ). We assessed chromatin remodeling by assay for transposase-accessible chromatin with sequencing (ATAC-seq) in response to BATF3 OE under acute or chronic stimulation. In both models, BATF3 OE extensively remodeled the chromatin with 5,104 and 22,201 differentially accessible regions compared to control T cells with 60% and 54% of these regions, respectively, being more accessible with BATF3 OE (Fig. 4a–c ). Most of these changes were in intronic or intergenic regions consistent with cis -regulatory or enhancer elements (Extended Data Fig. 7a,b ).

a , Number of ATAC-seq regions with increased or decreased accessibility in acutely ( n = 3 donors) or chronically stimulated CD8 + T cells ( n = 2 donors) with BATF3 OE on day 14 post-transduction. Differentially accessible (DA) regions were defined as P adj < 0.05 using a paired two-tailed DESeq2 test with Benjamini–Hochberg correction. b , c , Heatmap of DA regions between control and BATF3 OE T cells under acute ( b ) or chronic ( c ) stimulation with selected regions annotated with their nearest gene. d , Joint analysis of RNA-seq and ATAC-seq datasets in the context of acute stimulation. Number of DA regions near upregulated and downregulated genes. Dashed lines represent the number of unique DEGs associated with DA regions. e , f , Representative ATAC-seq tracks of IL7R ( e ) and TIGIT ( f ) loci after acute or chronic stimulation with overlaid rectangles indicating DA regions between control and BATF3 OE T cells in each context. g , h , TF DNA-binding motifs enriched in open (left) and closed (right) regions of chromatin in BATF3 OE T cells compared to control T cells after acute ( g ) and chronic ( h ) stimulation. HOMER computes P values from the cumulative hypergeometric distribution and does not adjust for multiple hypotheses. Bar plot in lower right corner illustrates BATF3’s effect on ETS1 expression based on RNA-seq ( n = 5 donors, mean values ± s.e.m.; statistical significance was determined using a paired two-tailed DESeq2 test between treatment groups).
To understand whether changes in chromatin accessibility corresponded to changes in gene expression, we jointly analyzed our ATAC-seq and RNA-seq data in the context of acute stimulation. We assigned each differentially accessible region to its closest gene to estimate genes that could be regulated in cis by these elements. There was an enrichment of regions with increased or decreased accessibility proximal to upregulated and downregulated genes, respectively, indicating that BATF3-driven epigenetic changes affected nearby gene transcription (Fig. 4d ). Approximately 25% of the genes that changed expression were associated with a corresponding differentially accessible region (297 out of 1,160 genes). For example, BATF3 OE increased accessibility at the IL7R promoter, intronic, 3′-untranslated region, and intergenic regions and decreased accessibility at the 5′-untranslated region, intronic and exonic regions of TIGIT (Fig. 4e,f ). Additionally, BATF3 OE partially counteracted the effect of chronic antigen stimulation at each of these loci (Fig. 4e,f ). Interestingly, BATF3 OE increased accessibility at regions near both memory ( TCF7 , MYB , IL7R , CCR7 and SELL ) and effector-associated genes ( EOMES and TBX21 ) (Fig. 4c ). This may represent a hybrid T cell phenotype or the presence of heterogeneous subpopulations of memory and effector T cells. Consistent with RNA-seq and flow data, there was reduced accessibility at exhaustion-associated loci such as TIGIT , CTLA4 and LAG3 with BATF3 OE.
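Assigning each differentially accessible region to its closest gene can be sketched as follows. The text does not specify how distance was measured, so this example assumes distance from the region midpoint to the gene's transcription start site; the coordinates, gene names and the `nearest_gene` helper are all hypothetical.

```python
def nearest_gene(region, genes):
    """Return the gene whose TSS is closest to the region midpoint.

    region: (chrom, start, end); genes: list of (name, chrom, tss).
    Returns None if no gene lies on the same chromosome.
    Assumption: distance is midpoint-to-TSS, which may differ from the
    convention used in the original analysis.
    """
    chrom, start, end = region
    mid = (start + end) // 2
    same_chrom = [(abs(tss - mid), name) for name, c, tss in genes if c == chrom]
    return min(same_chrom)[1] if same_chrom else None

# Hypothetical coordinates, for illustration only.
genes = [
    ("GENE_A", "chr1", 10_000),
    ("GENE_B", "chr1", 50_000),
    ("GENE_C", "chr2", 12_000),
]
regions = [
    ("chr1", 11_000, 11_500),
    ("chr1", 40_000, 41_000),
    ("chr2", 500, 900),
    ("chr3", 1, 100),
]
assignments = {r: nearest_gene(r, genes) for r in regions}

# Share of DEGs linked to at least one region, using the counts in the text.
share_linked = 297 / 1160
print(assignments, round(100 * share_linked, 1))
```

A linear scan is adequate for a toy example; for tens of thousands of regions a sorted TSS list with binary search, or an interval library, would be the more practical choice.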
Next, we conducted motif enrichment analyses to gain further insight into the transcriptional networks regulating control and BATF3 OE T cells under acute and chronic stimulation (Fig. 4g,h ). Compared to control T cells, AP-1 transcription factor family motifs were strongly enriched in both differentially open and closed regions with BATF3 OE under acute stimulation. In fact, 45% and 42% of differentially open and closed regions, respectively, harbored a BATF3 motif, suggesting direct BATF3 activity at these regions. This is consistent with the dual potential of BATF3 to silence or activate gene expression depending on its binding partners 41 . Interestingly, a TCF7 binding motif was uniquely enriched in differentially open regions with BATF3 OE. However, under chronic stimulation, AP-1 TF motifs were enriched with BATF3 OE only in differentially open regions. ETS family member motifs were enriched in closed regions, suggesting that BATF3 OE dampens the activity of these factors. Several ETS family members (for example ETV1 , ETV2 and ETV4 ) are not expressed at baseline in T cells, making it unlikely these genes contribute to the widespread epigenetic changes induced by chronic antigen stimulation. ETS1 , however, may represent an important node of the transcriptional network as it is highly expressed at baseline (>500 transcripts per million, TPM) and significantly repressed by BATF3 OE under acute stimulation (Fig. 4h ).
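Motif enrichment P values of the kind reported here come from the cumulative hypergeometric distribution. The sketch below illustrates that calculation only; it is not HOMER's implementation, and the counts are invented. It asks how likely it is to see at least k motif-containing regions among the target regions, given how many regions in the combined target and background set contain the motif.

```python
from math import comb

def hypergeom_upper_tail(k, n_target, n_motif, n_total):
    """P(X >= k) under the hypergeometric distribution.

    n_total : all regions considered (target + background)
    n_motif : regions among n_total that contain the motif
    n_target: number of target (for example, differentially open) regions
    k       : target regions that contain the motif
    """
    denom = comb(n_total, n_target)
    upper = min(n_target, n_motif)
    numer = sum(
        comb(n_motif, i) * comb(n_total - n_motif, n_target - i)
        for i in range(k, upper + 1)
    )
    return numer / denom

# Hypothetical counts: 20 regions in total, 5 with the motif,
# 4 target regions of which 2 contain the motif.
p = hypergeom_upper_tail(k=2, n_target=4, n_motif=5, n_total=20)
print(p)
```

Because such P values are not adjusted for multiple hypotheses, as noted in the figure legend, they are best read as a ranking of motifs rather than as calibrated significance levels.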
BATF3 OE enhances potency of CAR T cells
Given the profound transcriptional and epigenetic changes, we hypothesized that BATF3 OE might improve CD8 + T cell function. First, we observed that BATF3 OE increased killing of cultured human HER2 + cancer cells by HER2-targeted CAR T cells across donors and effector:target (E:T) ratios (Fig. 5a and Extended Data Fig. 8a,b ). Next, we evaluated whether BATF3 OE could improve in vivo control of solid tumors, given the challenge of T cell exhaustion in the solid tumor setting 42 , 43 . To simplify delivery of the CAR and BATF3 transgenes, we constructed all-in-one lentiviral vectors encoding a HER2 CAR coupled to either GFP or BATF3 expression. Strikingly, CAR T cells co-expressing BATF3 markedly enhanced tumor control at two subcurative doses (2.5 × 10⁵ and 5 × 10⁵ CAR + cells) compared to control CAR T cells in an orthotopic human HER2 + breast cancer model (Fig. 5b,c and Extended Data Fig. 8c–f ).

a , Tumor viability after co-culture at specified E:T ratios ( n = 3 donors). A two-way ANOVA with Dunnett’s post hoc test compared tumor viability at each E:T ratio: 1:8 ( P adj = 0.0243), 1:4 ( P adj = 0.0042) and 1:2 ( P adj = 0.0099). b , c , Tumor volumes of untreated ( n = 5) and treated mice with 5 × 10⁵ ( n = 1 donor, 5 mice per treatment) ( b ) or 2.5 × 10⁵ CAR T cells ( n = 1 donor, 4 mice per treatment) ( c ) with or without BATF3 OE. Two-way ANOVA with Tukey’s post hoc tests compared tumor volumes at each time point across treatments. Tumor volumes were never different between untreated and control CAR groups. Asterisks indicate significant differences between control and BATF3 OE CAR T cells. d – g , Percentage of CD8 + T cells ( d ) within each resected tumor on day 3 post-treatment and Ki-67 ( e ), TCF1 ( f ) and IFNγ ( g ) MFI of T cells ( n = 2 donors, 2 GFP and 3 BATF3 mice for donor 1, 3 mice per treatment for donor 2). Two-tailed Mann–Whitney tests compared percentage of CD8 + cells and marker MFI between groups ( P = 0.0065 for TCF1 and P = 0.0303 for IFNγ). h , i , Percentage ( h ) and total number ( i ) of CD8 + T cells within each resected tumor on day 19 post-treatment ( n = 2 donors, 4 mice per treatment for donor 1, 2 GFP and 3 BATF3 mice for donor 2). Two-tailed Mann–Whitney tests compared percentage ( P = 0.026) and total number of CD8 + cells between groups. j , k , TCF1 and ID3 MFI of T cells on day 19 ( n = 2 donors, 1 mouse per treatment for donor 1, 2 GFP and 3 BATF3 mice for donor 2). Two-tailed t -tests compared MFI between groups ( P = 0.037 for ID3). l , Significance ( P adj ) versus fold change between BATF3 OE and control CD8 + T cells for 144 genes associated with clinical outcome to CD19 CAR T cell therapy 38 . Mean values ± s.e.m. are plotted for a – k .
To explore the mechanism driving superior tumor control with BATF3 OE, we repeated the in vivo experiment with T cells from two different donors and phenotypically characterized the CAR T cells before treatment and after collecting tumor-infiltrating CAR T cells on day 3 and day 19 post-treatment (Fig. 5d–k , Extended Data Fig. 9 and Supplementary Fig. 9 ). Across both sets of experiments, there were no differences in CAR transduction rates (>70% for all groups) or the total number of CAR + T cells before intravenous injections between CAR constructs (Extended Data Fig. 8d,e ). Again, we observed superior tumor control with BATF3 OE CAR T cells across both donors (Extended Data Fig. 9a,b ). Consistent with the previous characterization (Fig. 3f–h ), input BATF3 OE cells tended to express lower levels of exhaustion markers including LAG3, TIGIT and TIM3 (Extended Data Fig. 9c ).
More striking differences between the two groups emerged at the day 3 post-treatment time point. In control and BATF3 OE cells, we detected equivalent proportions of CD8 + T cells within the tumor and circulating in peripheral blood, indicating that BATF3 OE was not improving tumor control by merely increasing T cell proliferation or tumor trafficking (Fig. 5d and Extended Data Fig. 9f ). Similarly, expression of the proliferative marker Ki-67 was equivalent between the groups (Fig. 5e ). Rather, we noticed that tumor-infiltrating CAR T cells with BATF3 OE expressed higher levels of both TCF1 and IFNγ (Fig. 5f,g ). This prompted us to revisit our gene expression and chromatin accessibility data. BATF3 OE did not increase expression of TCF7 (which encodes TCF1) under acute stimulation (Extended Data Fig. 5c ). However, there were seven differentially accessible sites near the TCF7 locus between control and BATF3 OE CAR T cells under chronic stimulation (Fig. 4c and Extended Data Fig. 7c ). Notably, 5/7 sites were more accessible in BATF3 OE cells including all three intragenic regions. These data suggest that BATF3 OE can partially counter heterochromatinization of the TCF7 locus during chronic antigen stimulation and retain higher levels of TCF1 expression.
As reflected in the tumor growth curves, we detected a higher proportion of tumor-infiltrating CAR T cells in the BATF3 OE group at the final day 19 time point, probably due to smaller tumor sizes, as the absolute number of T cells was similar between the two groups (Fig. 5h,i ). We did not detect any CAR T cells in peripheral blood for either group. We stained the tumor-infiltrating CAR T cells for TCF1, TBET, EOMES, GATA3, ID2, ID3 and IRF4. Interestingly, TCF1 was no longer differentially expressed, but ID3 (a downstream TF of TCF1 (ref. 44 )) was upregulated in the BATF3 OE group (Fig. 5j,k ). Therefore, BATF3 OE T cells may have gradually transitioned from transcriptional programs driven by TCF1 to ID3.
Given the enhanced tumor control conferred by BATF3 OE in CD8 + T cells, we investigated whether BATF3 OE programmed a transcriptional signature associated with clinical response to ACT. In fact, nonresponders to CD19-targeted CAR T cell therapy had a significantly higher proportion of CD8 + T cells in a cytotoxic or exhausted phenotype compared to responders in a recent clinical trial 38 . Using these datasets, we identified 147 DEGs between the infused CD8 + CAR T cell product of responders and nonresponders (Supplementary Fig. 10 ). Of these 147 DEGs, 144 genes were detected in our RNA-seq data. Strikingly, BATF3 OE silenced 35% (23/65) of genes associated with nonresponse and activated 20% (16/79) of genes associated with response (Fig. 5l ). Seven of the ten genes most strongly associated with clinical outcome were regulated in a favorable direction. Conversely, only 4.9% (7/144) of genes were regulated in a direction opposing positive clinical response, providing further evidence that BATF3 OE drives a transcriptional program associated with positive clinical outcomes.
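The proportions quoted above follow directly from the reported gene counts, and a short calculation makes the bookkeeping explicit. The counts are taken from the text; the variable names and the `pct` helper are only for illustration.

```python
# Counts as reported in the text above.
detected = 144                      # DEGs detected in the RNA-seq data
nonresponse_genes, nonresponse_silenced = 65, 23
response_genes, response_activated = 79, 16
opposing = 7                        # regulated against positive response

def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

# The two gene classes partition the detected set.
assert nonresponse_genes + response_genes == detected

favorable = nonresponse_silenced + response_activated
summary = {
    "nonresponse genes silenced": pct(nonresponse_silenced, nonresponse_genes),
    "response genes activated": pct(response_activated, response_genes),
    "favorable overall": pct(favorable, detected),
    "opposing overall": pct(opposing, detected),
}
print(summary)
```

Read this way, 39 of the 144 detected genes moved in the favorable direction and 7 moved in the opposing direction.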
CRISPRko screens reveal cofactors of BATF3
BATF3 is a compact AP-1 TF with only a basic DNA binding domain and a leucine zipper motif. Given that BATF3 lacks additional protein domains such as transactivation domains for gene activation, we speculated that BATF3 interacts with other TFs to impact gene expression and chromatin accessibility (Supplementary Note 9 ) 41 . Additionally, we reasoned that other TFs might compete with or inhibit BATF3 and that removing these factors would further amplify the effects of BATF3 OE. To identify these factors, we conducted parallel CRISPRko screens with or without BATF3 OE using a gRNA library targeting all 1,612 human TF genes 45 (TFome) (Fig. 6a ). We selected IL7R expression as the readout for these screens because BATF3 OE profoundly increases IL7R expression (Fig. 3a,b ), thus providing a proxy for BATF3 activity. IL7R is also expressed in 20–50% of CD8 + T cells at baseline, making it feasible to recover gene hits in both directions, unlike genes that are ubiquitously silenced or highly expressed.

a , Schematic of CRISPRko screens with TF KO gRNA library (lib). b , z scores of gRNAs for selected genes in mCherry (left) and BATF3 (right) screens. Enriched gRNAs ( P adj < 0.01) were defined using a paired two-tailed DESeq2 test with Benjamini–Hochberg correction. c , Each gene target in the mCherry (top) and BATF3 (bottom) screens ranked based on the MAGeCK 58 robust ranking aggregation (RRA) score in both IL7R LOW (left) and IL7R HIGH (right) populations. Dashed lines indicate FDR of 0.05. d , Scatter plot of z scores for each gRNA in CRISPRko screens with BATF3 versus without BATF3. Enriched gRNAs ( P adj < 0.01) were defined using a paired two-tailed DESeq2 test with Benjamini–Hochberg correction. e , Individual and combined effects of ZNF217 KO and BATF3 OE on IL7R expression ( n = 3 donors, mean values ± s.e.m.). A one-way, paired ANOVA test with Tukey’s post hoc test was used to compare the percentage of IL7R + cells between groups ( P adj = 0.041 for control versus ZNF217 KO, P adj = 0.008 for control versus BATF3 OE, and P adj = 0.049 for BATF3 OE versus BATF3 OE and ZNF217 KO). f , Scatter plot of transcriptomic effects of ZNF217 KO versus BATF3 OE relative to control T cells ( n = 3 donors). DEGs ( P adj < 0.05) were defined using a paired two-tailed DESeq2 test with Benjamini–Hochberg correction and labeled on the basis of whether the DEG was unique to a specific perturbation or shared across perturbations. g , Selected enriched biological processes from ZNF217 KO. Statistical significance was defined using a two-tailed Fisher’s exact test followed by Benjamini–Hochberg correction.
As expected, IL7R gRNAs were the most enriched gRNAs in the IL7R low population across both screens (Fig. 6b ). Notably, BATF3 gRNAs only emerged in the screen with BATF3 OE as BATF3 is lowly expressed at baseline (Fig. 6b ). BATF3 gRNAs indiscriminately target endogenous and exogenous BATF3, indicating that knocking out exogenous BATF3 nullified its effects. Further supporting the robustness of these screens, we recovered multiple gRNA hits for many genes and the baseline expression of target gene hits was significantly higher than non-hit genes (Extended Data Fig. 10a,b ).
By comparing gRNA- and gene-level enrichment between the two screens (Fig. 6c,d ), we could classify whether genes regulated IL7R in a BATF3-independent or BATF3-dependent manner. For example, FOXO1 and DNMT1 were among the strongest hits in the IL7R low population for both screens, indicating BATF3-independent effects. To identify potential cofactors of BATF3, we searched for genes encoding AP-1 or IRF TFs that were only enriched in the IL7R low population with BATF3 OE. Notably, BATF3 , JUNB , and IRF4 were the top genes meeting these criteria, confirming that BATF3 interacts with JUNB and IRF4 to mediate transcriptional control in CD8 + T cells (Fig. 6c and Extended Data Fig. 10c,d ) 46 . These screens also revealed upstream regulators of IL7R and candidate gene targets for further improving ACT (Fig. 6c ). The most enriched genes in the IL7R high population in the TF-knockout (KO) screen without BATF3 OE were ZNF217 , RUNX3 , FOXP1 , GATA3 , GFI1 , AHR , ETS1 , ZNF626 and FOXP3 . Fewer genes were enriched in the IL7R high population in the BATF3 OE screen, in part because baseline IL7R expression was higher. Furthermore, we speculated that some TFs whose effects were lost with BATF3 OE might be downstream targets of BATF3. Indeed, the RNA-seq results show that several TFs including FOXP1, ETS1 and FOXP3 were all downregulated by BATF3 OE (Supplementary Table 4 ).
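The classification logic described above, comparing hit lists between the screens with and without BATF3 OE, can be sketched as a simple set comparison. The gene sets below are a simplified subset of the IL7R low hits named in the text, and `classify_hits` and its labels are hypothetical; the actual analysis relied on gRNA- and gene-level enrichment statistics rather than binary hit calls.

```python
def classify_hits(hits_without_batf3, hits_with_batf3):
    """Label each hit gene by the screen(s) in which it was enriched."""
    labels = {}
    for gene in sorted(hits_without_batf3 | hits_with_batf3):
        if gene in hits_without_batf3 and gene in hits_with_batf3:
            labels[gene] = "BATF3-independent"
        elif gene in hits_with_batf3:
            labels[gene] = "BATF3-dependent"
        else:
            labels[gene] = "lost with BATF3 OE"
    return labels

# Simplified IL7R low hit sets based on the genes named in the text.
low_without = {"IL7R", "FOXO1", "DNMT1"}
low_with = {"IL7R", "FOXO1", "DNMT1", "BATF3", "JUNB", "IRF4"}

labels = classify_hits(low_without, low_with)
cofactor_candidates = [g for g, lab in labels.items() if lab == "BATF3-dependent"]
print(cofactor_candidates)
```

The third label covers genes enriched only in the screen without BATF3 OE, which the text treats as candidate downstream targets of BATF3.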
KO of three genes ( ZNF217 , GATA3 and AHR ) increased IL7R expression individually or in combination with BATF3 OE. ZNF217 was the top hit in both screens and has not previously been characterized in the context of T cell biology. GATA3 has been shown to promote CD8 + T cell dysfunction and targeted deletion of GATA3 improves tumor control 47 . Moreover, both GATA3 and AHR can activate FOXP3 expression in regulatory T cells, providing further evidence of a link between T cell dysfunction and T cell regulatory activity 48 , 49 , 50 .
Next, we measured the effects of knocking out IL7R , BATF3 , JUNB , IRF4 , ZNF217 and GATA3 with and without BATF3 OE (Extended Data Fig. 10e ). BATF3 OE alone increased the percentage of IL7R + cells by >40 percentage points compared to control CD8 + T cells (from ~33% to 77% IL7R + ) (Extended Data Fig. 10e ). Ablating BATF3 partially restored baseline IL7R levels, presumably due to incomplete nuclease activity across ectopic lentiviral copies of BATF3. IL7R induction by BATF3 was profoundly negated with either JUNB or IRF4 KOs (Extended Data Fig. 10e,f ). Conversely, GATA3 and ZNF217 KOs increased the percentage of IL7R + T cells (Extended Data Fig. 10e ). Finally, BATF3 OE and ZNF217 KO together led to a further increase in T cells expressing IL7R (Fig. 6e and Extended Data Fig. 10g ).
We next evaluated the transcriptional effects of ZNF217 or GATA3 KO relative to control T cells and BATF3 OE alone (Fig. 6f , Supplementary Fig. 11 and Supplementary Table 7 ). ZNF217 KO led to 644 DEGs relative to control T cells with many encoding TFs and surface markers implicated in T cell biology and function (Fig. 6f ). Further supporting a T cell-specific role for ZNF217, Gene Ontology analysis revealed that ZNF217 KO promoted positive regulation of T cell activation, proliferation, IL-2 production, and differentiation (Fig. 6g and Supplementary Table 7 ). Approximately 33% (225/644) of all DEGs with ZNF217 KO were shared with BATF3 OE with the vast majority (206/225) regulated in the same direction. Nevertheless, the majority of DEGs for each individual perturbation were unique, suggesting that ZNF217 KO and BATF3 OE can drive overlapping but also distinct transcriptional changes.
Discussion
In this study, we developed an epigenetic screening platform with dSaCas9 to systematically map regulators of primary human CD8 + T cells through complementary CRISPRi/a screens. Our CRISPRi/a screens identified many regulators of CD8 + T cell state, with a striking convergence on BATF3. BATF3 OE markedly enhanced the potency of CD8 + CAR T cells in both in vitro and in vivo tumor models. The compact size of BATF3 makes it particularly amenable to integration into current ACT manufacturing processes by including it in the same lentivirus that delivers the CAR or TCR to donor T cells. It will be important to carefully assess the safety of ACT with T cells engineered with gene modules such as BATF3. Although the progeny of a single TET2 null CAR T cell clone cured a patient with advanced refractory chronic lymphocytic leukemia 20 , a recent study highlighted that biallelic deletion of TET2 in combination with sustained expression of BATF3 can lead to antigen-independent clonal T cell expansion 51 . BATF3 OE alone does not induce adverse effects in T cells 52 , but the BATF–IRF axis can be oncogenic in the context of other genetic and epigenetic aberrations such as mutations, deletions, translocations and duplications 53 , 54 , 55 , 56 , 57 . We did not detect increased levels of MYC or Ki-67 expression in our RNA-seq data nor did we detect elevated numbers of T cells after nearly 3 weeks of in vivo surveillance in tumor-bearing mice. Nevertheless, future work could focus on transiently delivering transgenes, modulating transgene expression or integrating suicide switches to control the activity of T cells in vivo.
The combination of TF OE with a TFome KO screen to dissect cofactors and downstream factors highlights the power of orthogonal CRISPR screen technologies. Specifically, these results support a model where BATF3 heterodimerizes with JUNB and interacts with IRF4 to drive transcriptional programs in CD8 + T cells. We also identified factors such as ZNF217 for further investigation, as these genes have not previously been associated with controlling T cell state or AP-1 gene regulation. Overall, this work expands the toolkit of epigenome editors and our understanding of regulators of CD8 + T cell state and function. This catalog of genes could serve as a basis for engineering the next generation of cancer immunotherapies.
Methods
Ethics statement
All animal experiments were conducted with strict adherence to the guidelines for the care and use of laboratory animals of the National Institutes of Health.
All plasmids were cloned using Gibson assembly (NEB). The HER2 CAR constructs for in vivo tumor control studies were cloned by digesting an empty lentiviral vector (Addgene 79121) with MluI and amplifying HER2-CAR 59 and 2A-GFP or 2A-BATF3 (gblock, IDT) fragments with appropriate overhangs for Gibson assembly. The following plasmids were deposited to Addgene: pLV hU6-gRNA hUbC-dSaCas9-KRAB-T2A-Thy1.1 (Addgene 194278) and pLV hU6-gRNA hUbC-VP64-dSaCas9-VP64-T2A-Thy1.1 (Addgene 194279).
HEK293Ts and SKBR3s were maintained in Dulbecco’s modified Eagle medium (DMEM) GlutaMAX supplemented with 10% fetal bovine serum (FBS), 1 mM sodium pyruvate, 1× MEM non-essential amino acids, 10 mM HEPES, 100 U ml⁻¹ penicillin and 100 μg ml⁻¹ streptomycin. Jurkat lines were maintained in RPMI supplemented with 10% FBS, 100 U ml⁻¹ penicillin and 100 μg ml⁻¹ streptomycin. HCC1954s were maintained in DMEM/F12 supplemented with 10% FBS, 100 U ml⁻¹ penicillin and 100 μg ml⁻¹ streptomycin.
Isolation and culture of primary human T cells
Human CD8 + T cells were obtained from either pooled peripheral blood mononuclear cell donors (ZenBio) using negative selection human CD8 isolation kits (StemCell Technologies) or directly from vials containing isolated CD8 + T cells from individual donors (StemCell Technologies). For technology development experiments, T cells were cultured in Advanced RPMI (Thermo Fisher) supplemented with 10% FBS, 100 U ml⁻¹ penicillin and 100 μg ml⁻¹ streptomycin. For T cell reprogramming experiments, T cells were cultured in PRIME-XV T cell Expansion XSFM (FujiFilm) supplemented with 5% human platelet lysate (Compass Biomed), 100 U ml⁻¹ penicillin and 100 μg ml⁻¹ streptomycin. All media were supplemented with 100 U ml⁻¹ human IL-2 (Peprotech). T cells were activated with a 3:1 ratio of CD3/CD28 Dynabeads to T cells and maintained at 1–2 × 10⁶ cells ml⁻¹ unless otherwise indicated.
Lentivirus generation and transduction of primary human T cells
For all technology development experiments, lentivirus was produced as previously described 60 . For all T cell reprogramming experiments, a recently optimized transfection protocol was used (Supplementary Method 1) 25 . Lentiviral supernatant was centrifuged at 600 g for 10 min to remove cellular debris and concentrated to 50–100× the initial concentration using Lenti-X Concentrator (Takara Bio). T cells were transduced at 5–10% v/v of concentrated lentivirus at 24 h post-activation. For dual transduction experiments, T cells were serially transduced at 24 h and 48 h post activation.
Design of CD2 , B2M and IL2RA gRNA libraries
Saturation CD2 and B2M CRISPRi gRNA libraries were designed to tile a 1,050-bp window (−400 bp to 650 bp) around the transcription start site (TSS) of each target gene using CRISPick 61 . The IL2RA CRISPRa gRNA library was designed to tile a 5,000-bp window (−4,000 bp to 1,000 bp) around the TSS of IL2RA using ChopChop 62 . Each gRNA library was designed to target dSaCas9’s relaxed protospacer adjacent motif (PAM) variant: 5′-NNGRRN-3′. NT gRNAs were generated for each library to match the nucleotide composition of the targeting gRNAs. CD2 , B2M and IL2RA gRNA libraries are in Supplementary Table 1 .
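Tiling a window for candidate gRNA sites with the 5′-NNGRRN-3′ PAM can be illustrated with a small scanner. This is a sketch only and does not reproduce CRISPick or ChopChop: it assumes a 21-nt protospacer located immediately 5′ of the PAM (the spacer length is not stated in this section), ignores on- and off-target scoring, and uses a made-up input sequence. `find_sites` and `revcomp` are hypothetical helper names.

```python
import re

def revcomp(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_sites(seq, spacer_len=21):
    """List (strand, start, spacer, pam) for every 5'-NNGRRN-3' PAM.

    `start` is the 0-based protospacer start on the scanned strand; for the
    minus strand it refers to the reverse-complemented sequence.
    Assumption: spacer_len=21 is illustrative, not taken from the paper.
    """
    # Lookahead so that overlapping candidate sites are all reported.
    pattern = re.compile(
        rf"(?=([ACGT]{{{spacer_len}}})([ACGT]{{2}}G[AG]{{2}}[ACGT]))"
    )
    sites = []
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for m in pattern.finditer(s):
            sites.append((strand, m.start(), m.group(1), m.group(2)))
    return sites

# Made-up sequence: a 21-nt stretch followed by a valid PAM (TTGAAT).
demo = "A" * 21 + "TTGAAT"
sites = find_sites(demo)
print(sites)
```

To tile a window around a TSS, the same function would be applied to the extracted window sequence (for example −400 bp to +650 bp for the CRISPRi libraries), with coordinates shifted back to genomic positions.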
gRNA library cloning
Oligonucleotide pools containing variable gRNA sequences and constant regions for polymerase chain reaction (PCR) amplification were synthesized by Twist Bioscience. gRNA amplicons were gel extracted, PCR purified and input into 20 μl Gibson reactions (5:1 molar ratio of insert to backbone) with 200 ng of Esp3I digested and 1 × solid-phase reversible immobilization (SPRI)-selected (Beckman Coulter) plasmid backbone. Gibson reactions were purified using ethanol precipitation and transformed into Lucigen’s Endura ElectroCompetent Cells. Transformed cells were cultured overnight and plasmids were isolated using Qiagen Midi Kits.
CRISPRi tiling screens
CD8 + T cells from pooled peripheral blood mononuclear cell donors were transduced with all-in-one lentivirus encoding for dSaCas9–KRAB–2A–GFP and either CD2 ( n = 2 replicates) or B2M ( n = 3 replicates) gRNA libraries. Cells were expanded for 9 days and then stained for the target gene. Transduced cells in the lower and upper 10% tails of target gene expression were sorted for subsequent gRNA library construction and sequencing. All replicates were maintained and sorted at a minimum of 350× coverage.
Construction of CRISPRa Jurkat lines and IL2RA CRISPRa tiling screens
Polyclonal dSaCas9 VP64 and VP64 dSaCas9 VP64 Jurkat cell lines were generated by transducing Jurkat cells with lentivirus encoding for either dSaCas9 VP64 –2A–PuroR or VP64 dSaCas9 VP64 –2A–PuroR. Cells were selected for 5 days using 0.5 µg ml⁻¹ of puromycin. After selection, 1 × 10⁶ dSaCas9 VP64 and VP64 dSaCas9 VP64 Jurkat cells were plated and transduced in triplicate with the IL2RA gRNA library lentivirus at a low multiplicity of infection (MOI). Cells were expanded for 10 days, selected for Thy1.1 using a CD90.1 Positive Selection Kit (StemCell Technologies), and stained for Thy1.1 and IL2RA. Transduced cells in the lower and upper 10% tails of IL2RA expression were sorted for subsequent gRNA library construction and sequencing. All replicates were maintained and sorted at a minimum of 500× coverage.
TF and epi-modifier CRISPRi/a gRNA library construction
Genes were selected on the basis of motif enrichment in differentially accessible chromatin across T cell subsets 4, 63, 64 and a unified atlas of CD8 T cells in cancer and chronic infection 65. BACH2, TOX, TOX2, PRDM1, KLF2, BMI1, DNMT1, DNMT3A, DNMT3B, TET1 and TET2 were manually added to the gene list (the complete 121-member gene list is in Supplementary Table 2). The TSS for each gene was extracted using CRISPick and 1,000-bp windows were constructed around each TSS (−500 to +500 bp). After establishing an SaCas9 gRNA database with the strict PAM variant (NNGRRT) using guideScan 66, the genomic windows were input into the guidescan_guidequery function to generate the gRNA library. Any gRNA that aligned to another genomic site with fewer than four mismatches was removed from the library. The final gRNA library contained at least seven gRNAs for each of 120 of the 121 target genes (there were no PBX2-targeting gRNAs), with an average of 16 gRNAs per gene. A total of 120 NT gRNAs were included in the library for a total of 2,099 gRNAs (Supplementary Table 2).
TF and epi-modifier CRISPRi/a gRNA screens
CD8 + CCR7 + T cells were sorted and transduced with either CRISPRi ( n = 2 donors) or CRISPRa ( n = 3 donors) TF + epi-modifier gRNA libraries at a low MOI. Cells were expanded for 10 days and then stained for Thy1.1 (a marker to identify transduced cells) and CCR7 (a marker associated with T cell state). Transduced cells in the lower and upper 10% tails of CCR7 expression were sorted for subsequent gRNA library construction and sequencing. All replicates were maintained and sorted at a minimum of 300× coverage.
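The coverage figures quoted for each screen can be related to cell numbers with a short calculation. The sketch below assumes that "coverage" means the average number of transduced cells per gRNA in the library; that reading, and the function name, are assumptions for illustration rather than a definition given in the text.

```python
def min_cells_for_coverage(n_grnas: int, coverage: int) -> int:
    """Minimum cell number so that each gRNA is represented `coverage` times on average."""
    if n_grnas <= 0 or coverage <= 0:
        raise ValueError("n_grnas and coverage must be positive")
    return n_grnas * coverage


# TF and epi-modifier library: 2,099 gRNAs maintained at 300x coverage.
cells_needed = min_cells_for_coverage(2099, 300)
print(cells_needed)  # 629700
```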
Genomic DNA isolation, gRNA PCR and sequencing gRNA libraries
Genomic DNA was isolated using Qiagen’s DNeasy Blood and Tissue Kit. Genomic DNA was split across 100 μl PCR reactions (25 cycles at 98 °C for 10 s, 60 °C for 30 s, and 72 °C for 20 s) with Q5 2× Master Mix and up to 1 μg of genomic DNA per reaction. PCRs were pooled together for each sample and purified using double-sided SPRI bead selection at 0.6× and 1.8×. Libraries were run on a High Sensitivity D1000 tape (Agilent) to confirm amplicon size and quantified using Qubit’s dsDNA High Sensitivity assay. Libraries were diluted to 2 nM, pooled together at equal volumes, and sequenced using Illumina’s MiSeq Reagent Kit v2 (50 cycles). Primers are listed in Supplementary Table 5.
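Diluting a library to 2 nM requires converting the measured mass concentration and amplicon size into molarity. A minimal sketch of that conversion follows, assuming the commonly used average mass of 660 g mol⁻¹ per base pair of double-stranded DNA. The example concentration and amplicon length are hypothetical and are not values from this study.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of dsDNA (commonly used approximation)


def dsdna_nanomolar(conc_ng_per_ul: float, amplicon_bp: float) -> float:
    """Convert a dsDNA concentration in ng/ul to nM for a given amplicon length."""
    # ng/ul equals 1e-3 g/L; dividing by g/mol gives mol/L; multiply by 1e9 for nM.
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * amplicon_bp)


def dilution_factor(stock_nm: float, target_nm: float = 2.0) -> float:
    """Fold dilution needed to bring a stock library to the target molarity."""
    if stock_nm < target_nm:
        raise ValueError("stock is already below the target concentration")
    return stock_nm / target_nm


# Hypothetical library: 0.66 ng/ul with a 250-bp average amplicon size.
stock = dsdna_nanomolar(0.66, 250)   # approximately 4 nM
fold = dilution_factor(stock)        # approximately 2-fold dilution to reach 2 nM
```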
Processing gRNA sequencing and gRNA analysis for FACS-based screens
FASTQ files were aligned to custom indexes for each gRNA library (generated from the bowtie2-build function) using Bowtie 2 (ref. 67 ). Counts for each gRNA were extracted and used for further analysis in R. Individual gRNA enrichment was determined using the DESeq2 (ref. 68 ) package to compare gRNA abundance between groups for each screen. DESeq2 results for promoter tiling screens, CRISPRi/a TF screens and CRISPRko screens are presented in Supplementary Tables 1 , 2 and 7 .
Gene-level analysis for FACS-based TF CRISPRi and CRISPRa screens
DESeq2 P values were empirically transformed to cumulative probabilities using a midpoint linear interpolation of the 120 NT gRNA P values between 0 and 1. This transformation aligns the data with the null hypothesis that NT gRNA P values have a uniform distribution between 0 and 1. Within each gene, transformed P values were aggregated using a modified robust rank aggregation method to detect genes with nonuniform (non-null) gRNA P values. A gene-level P value was produced by comparison with 10 million gene-level null simulations of P values randomly sampled from a uniform distribution. NT gRNAs were randomly grouped into NT control ‘genes’ (NTCs) and analyzed in the same way. The number of gRNAs per NTC was sampled with replacement from the distribution of gRNAs per gene in the screen until all the NT gRNAs were used. Genes were selected as hits if their Benjamini–Hochberg false discovery rate (FDR) was less than 0.05. Gene-level aggregation was done in Python. Two effect sizes were computed for each gene by averaging gRNAs’ unshrunk DESeq2 log2FoldChange within the gene, weighted by each gRNA’s transformed one-sided P value. The larger (absolute value) effect size was chosen for each gene. Effect sizes were estimated in R. Gene-level effect sizes and P values are presented in Supplementary Table 2.
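The empirical transformation and the FDR step described above can be sketched as follows. This is one plausible reading of "midpoint linear interpolation", in which the i-th smallest of n NT P values is mapped to (i − 0.5)/n and values in between are interpolated linearly, with the end points fixed at (0, 0) and (1, 1). It is not the custom code released with the study, it does not implement the modified robust rank aggregation, and the function names are assumptions.

```python
import numpy as np


def empirical_transform(p, nt_p):
    """Map P values to cumulative probabilities under the NT-derived null.

    Assumed scheme: the i-th sorted NT P value maps to (i - 0.5) / n.
    NT P values exactly equal to 0 or 1 would create ties with the anchors.
    """
    nt_sorted = np.sort(np.asarray(nt_p, dtype=float))
    n = nt_sorted.size
    midpoints = (np.arange(1, n + 1) - 0.5) / n
    xp = np.concatenate(([0.0], nt_sorted, [1.0]))
    fp = np.concatenate(([0.0], midpoints, [1.0]))
    return np.interp(np.asarray(p, dtype=float), xp, fp)


def benjamini_hochberg(p):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working from the largest P value downward.
    adj_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
    adj = np.empty(m)
    adj[order] = np.clip(adj_sorted, 0.0, 1.0)
    return adj


# Hypothetical toy input: four NT P values and three targeting P values.
transformed = empirical_transform([0.1, 0.4, 0.5], [0.2, 0.4, 0.6, 0.8])
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
hits = fdr < 0.05
```

If the NT P values are already uniform, the transformation leaves targeting P values nearly unchanged; if they are skewed, it corrects targeting P values by the same amount.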
gRNA validations
For CD2 and B2M gRNA validations, CD8 + T cells were transduced in triplicate with each individual gRNA and followed the same timeline as the CRISPRi screens. For IL2RA gRNA validations, dSaCas9 VP64 and VP64 dSaCas9 VP64 Jurkat lines were transduced with each gRNA hit and followed the same timeline as the CRISPRa screen. Cells were stained with the respective antibody and measured using flow cytometry on day 9.
Flow cytometry and surface marker staining
An SH800 FACS Cell Sorter (Sony Biotechnology) was used for cell sorting and analysis unless otherwise indicated. For antibody staining of all surface markers except CCR7, cells were collected, spun down at 300 g for 5 min, resuspended in flow buffer (1× phosphate-buffered saline (PBS), 2 mM ethylenediaminetetraacetic acid and 0.5% bovine serum albumin) with the appropriate antibody dilutions and incubated for 30 min at 4 °C on a rocker. Antibody staining of CCR7 was carried out for 30 min at 37 °C. Cells were then washed with flow buffer, spun down at 300 g for 5 min and resuspended in flow buffer for cell sorting or analysis. Antibody details are presented in Supplementary Table 5. Fluorescence minus one (FMO) controls were used to set appropriate gates for all flow panels.
mRNA was isolated using Norgen’s Total RNA Purification Plus Kit. Reverse transcription was carried out by inputting an equal mass of mRNA for each sample into a 10 μl SuperScript Vilo cDNA Synthesis reaction. Two microliters of complementary DNA were used per PCR reaction with Perfecta SYBR Green Fastmix (Quanta BioSciences, 95072) using the CFX96 Real-Time PCR Detection System (Bio-Rad). All primers were designed using the National Center for Biotechnology Information’s Primer-BLAST tool, and amplicon products were verified by melt curve analysis. All RT–qPCR results are presented as log₂ fold change in RNA normalized to GAPDH expression unless otherwise indicated. Primers are listed in Supplementary Table 5.
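A log₂ fold change normalized to a reference gene is commonly derived from threshold cycle (Ct) values with the ΔΔCt approach. The sketch below assumes that approach and roughly 100% amplification efficiency. The text does not state the exact calculation used, so this is an illustration with hypothetical Ct values, not the authors' procedure.

```python
def log2_fold_change(ct_target_trt: float, ct_ref_trt: float,
                     ct_target_ctl: float, ct_ref_ctl: float) -> float:
    """log2 fold change of a target gene, normalized to a reference gene (e.g. GAPDH).

    Assumes the delta-delta-Ct method with perfect doubling per cycle.
    """
    delta_trt = ct_target_trt - ct_ref_trt   # normalize treated sample to reference
    delta_ctl = ct_target_ctl - ct_ref_ctl   # normalize control sample to reference
    # A lower Ct means more template, so the sign is flipped.
    return -(delta_trt - delta_ctl)


# Hypothetical Ct values: the target crosses threshold 3 cycles earlier after treatment.
lfc = log2_fold_change(22.0, 18.0, 25.0, 18.0)
print(lfc)  # 3.0
```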
A 40-gRNA library (Supplementary Table 3) containing all 32 gRNA hits from CRISPRi/a screens and 8 NT gRNAs was cloned into all-in-one CRISPRi and CRISPRa lentiviral plasmids. The experimental timeline for the scRNA-seq screens was identical to the cell sorting-based screens. CD8⁺CCR7⁺ T cells from three donors were transduced with CRISPRi and CRISPRa mini-TF gRNA libraries. T cells were expanded for 10 days and then stained and sorted for Thy1.1⁺ cells. Sorted cells were loaded into the Chromium X for a targeted recovery of 2 × 10⁴ cells per donor and treatment according to the Single Cell 5′-High-Throughput Reagent Kit v2 protocol (10x Genomics). SaCas9 gRNA sequences were captured by spiking in 2 μM of a custom primer into the reverse transcription master mix, as previously done for SpCas9 gRNA capture 36. The custom primer was designed to bind to the constant region of SaCas9’s gRNA scaffold. 5′-Gene Expression (GEX) and gRNA libraries were separated using double-sided SPRI bead selection in the initial cDNA clean-up step. 5′-GEX libraries were constructed according to the manufacturer’s protocol. gRNA libraries were constructed using two sequential PCRs (PCR 1: 10 cycles, PCR 2: 25 cycles). The PCR 1 product was purified using double-sided SPRI bead selection at 0.6× and 2×. Twenty percent of the purified PCR 1 product was input into PCR 2. The PCR 2 product was purified using double-sided SPRI bead selection at 0.6× and 1×. All libraries were run on a High Sensitivity D1000 tape to measure the average amplicon size and quantified using Qubit’s dsDNA High Sensitivity assay. Libraries were individually diluted to 20 nM, pooled together at desired ratios and sequenced on an Illumina NovaSeq S4 Full Flow Cell (200 cycles) with the following read allocation: Read 1, 26; i7 index, 10; Read 2, 90. All oligos used in this study are listed in Supplementary Table 5.
Processing and analyzing scRNA-seq
CellRanger v6.0.1 was used to process, demultiplex and generate UMI counts for each transcript and gRNA per cell barcode. UMI counts tables were extracted and used for subsequent analyses in R using the Seurat 69 v4.1.0 package. Low-quality cells with <200 detected genes, >20% mitochondrial reads or <5% ribosomal reads were discarded. DoubletFinder 70 was used to identify and remove predicted doublets. All remaining high-quality cells across donors for each treatment (CRISPRi or CRISPRa) were aggregated for further analyses. gRNAs were assigned to cells if they met the threshold (gRNA UMI >4). Cells were then grouped on the basis of gRNA identity. For differential gene expression analysis, we compared the transcriptomic profiles of cells sharing a gRNA to cells with only NT gRNAs using Seurat’s FindMarkers function to test for DEGs with the hurdle model implemented in model-based analysis of single-cell transcriptomics (MAST). All significant gRNA–gene links are listed in Supplementary Table 3 . Upregulated DEGs were input into EnrichR’s GO Biological Process 2021 database 71 for functional annotation.
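The cell-quality and gRNA-assignment thresholds above can be expressed as two small predicates. The sketch below is written in plain Python for illustration only. The analysis itself was carried out in R with Seurat, and the function and argument names here are hypothetical.

```python
def passes_qc(n_genes: int, pct_mito: float, pct_ribo: float) -> bool:
    """Keep a cell unless it has <200 genes, >20% mitochondrial or <5% ribosomal reads."""
    return n_genes >= 200 and pct_mito <= 20.0 and pct_ribo >= 5.0


def assign_grnas(grna_umis: dict, min_umi_exclusive: int = 4) -> list:
    """Return the gRNAs whose UMI count exceeds the threshold (gRNA UMI > 4)."""
    return sorted(g for g, umi in grna_umis.items() if umi > min_umi_exclusive)


# Hypothetical cell: 1,500 detected genes, 8% mitochondrial, 12% ribosomal reads.
keep = passes_qc(1500, 8.0, 12.0)
# Hypothetical gRNA UMI counts for the same cell barcode.
assigned = assign_grnas({"BATF3_g1": 12, "NT_3": 4, "TOX_g2": 1})
```

In this toy case the cell is retained and only `BATF3_g1` is assigned, because a count of exactly 4 UMIs does not exceed the threshold.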
RNA sequencing
RNA was isolated using Norgen’s Total RNA Purification Plus Kit and submitted to Azenta (formerly Genewiz) for standard RNA-seq with polyA selection. Reads were first trimmed using Trimmomatic 72 v0.32 to remove adapters and then aligned to GRCh38 using STAR v2.4.1a aligner. Gene counts were obtained with featureCounts 73 from the subread package (version 1.4.6-p4) using the comprehensive gene annotation in Gencode v22. Differential expression analysis was determined with DESeq2 (ref. 68 ) where gene counts are fitted into a negative binomial generalized linear model and a Wald test determines significant DEGs. DESeq2 results of RNA-seq analyses with BATF3 OE and ZNF217 or GATA3 KO are presented in Supplementary Tables 4 and 7 , respectively. Upregulated and downregulated DEGs were input into EnrichR’s GO Biological Processes 2021 database 71 for functional annotation.
scRNA-seq analysis of CD19 CAR T cell infusion product for responders and nonresponders
scRNA-seq data of the infused CD19 CAR T cell products from patients treated with tisagenlecleucel 38 were downloaded from GEO: GSE197268. Patient data in Matrix Market format were classified as responders (R) and nonresponders (NR) and processed with Seurat 74 4.2.0. For each patient, cells with less than 20% mitochondrial UMI counts, more than 20 GEX UMI counts, and in the bottom 95th percentile of GEX UMI counts were selected. GEX UMI counts were log-normalized for further analysis. Individual patient data were merged (merge function in Seurat) into a combined Seurat object, preserving the group identity in the cellular barcodes. GEX UMI counts were linearly scaled and centered (ScaleData function with default parameters) before finding the most variable genes (Seurat FindVariableFeatures) for principal component analysis. Clustering was performed using the first ten principal components to identify and select CD8⁺ T cells for subsequent analyses. MAST was used to identify DEGs between CD8⁺ T cells from responders and nonresponders. All DEGs between responders and nonresponders are presented in Supplementary Table 4.
A total of 5 × 10⁴ transduced CD8⁺ T cells were sorted for Omni ATAC-seq as previously described 75. Libraries were sequenced on an Illumina NextSeq 2000 with paired-end 50-bp reads. Read quality was assessed with FastQC and adapters were trimmed with Trimmomatic 72. Trimmed reads were aligned to the hg38 reference genome using Bowtie 76 (v1.0.0) with parameters -v 2 --best --strata -m 1. Reads mapping to the ENCODE hg38 blacklisted regions were removed using bedtools2 (ref. 77) intersect (v2.25.0). Duplicate reads were excluded using Picard MarkDuplicates (v1.130 (ref. 78)). Count-per-million-normalized bigWig files were generated for visualization using deeptools bamCoverage 79 (v3.0.1). Peak calling was performed using MACS2 narrowPeak 80 and filtered for P adj ≤ 0.001. Peak calls were merged across samples to make a union-peak set. A count matrix containing the number of reads in peaks for each sample was generated using featureCounts 73 (subread v1.4.6) and used for differential analysis in DESeq2 (ref. 68) (v.1.36). ChIPSeeker 81 was used to annotate the genomic regions and retrieve the nearest gene around each peak. The HOMER (v4.11) package 82 was used to find transcription factor binding motifs that contributed to changes in chromatin accessibility with BATF3 OE compared to control cells (Supplementary Method 2).
In vitro tumor killing assay
CD8⁺ T cells were transduced with lentiviruses encoding for a HER2–CAR–mCherry at 24 h post-activation and BATF3–2A–GFP or GFP at 48 h post-activation. After 12 days of expansion, CAR⁺GFP⁺ T cells were sorted and counted for the co-culture assay. Four hours before starting the co-culture, 2 × 10⁵ HER2⁺ SKBR3s were plated in a 24-well plate with cDMEM to allow the SKBR3s to adhere to the plate. After 4 h, cDMEM was discarded and mCherry⁺GFP⁺ T cells in cPRIME medium were added at the indicated E:T cell ratios. After 24 h of co-culture, cells were harvested by collecting the supernatant (containing T cells and dead tumor cells) and the adherent cells (which were detached from the plate using trypsin). Cells were spun down at 600 g for 5 min and then stained with a fixable viability dye and Annexin V to label dead and apoptotic cells according to the manufacturer’s protocol (Supplementary Method 3).
CD3/CD28 and tumor repeat stimulations
For chronic stimulation with CD3/CD28 dynabeads, cells were debeaded and counted, plated at 1–2.5 × 10⁵ T cells, and restimulated with fresh CD3/CD28 beads at a 3:1 bead-to-cell ratio in a 24-well plate every 3 days. On day 12, cells were stained and analyzed by flow cytometry for expression of exhaustion-associated markers. For tumor restimulation, 1 × 10⁵ HER2 CAR T cells were transferred to a new 24-well plate with 2 × 10⁵ SKBR3s (1:2 E:T ratio) every 3 days. T cells were recovered without antigen stimulation for 2 days after the final round of tumor stimulation before ATAC-seq on day 14. In both assays, T cells were restimulated on days 3, 6 and 9.
All experiments involving animals were conducted with strict adherence to the guidelines for the care and use of laboratory animals of the National Institutes of Health. All experiments were approved by the Institutional Animal Care and Use Committee at Duke University (protocol number A130-22-07). Six- to 8-week-old female immunodeficient NOD/SCID gamma (NSG) mice were obtained from Jackson Laboratory and then housed in 12-h light/dark cycles, at an ambient temperature (21 ± 3 °C) with relative humidity (50 ± 20%) and handled in pathogen-free conditions. Mice were euthanized before reaching a tumor volume of 2,000 mm³, the upper threshold defined by the Duke Institutional Animal Care and Use Committee.
In vivo tumor model
A total of 2.5 × 10⁶ HER2⁺ HCC1954 cells were implanted orthotopically into the mammary fat pad of NSG mice in 100 μl 50:50 (v:v) PBS:Matrigel. T cells were expanded for 9–11 days post-transduction before treatment. Transduction rates were measured on the day of treatment using flow cytometry. For all in vivo experiments, transduction rates exceeded 70% for both HER2–CAR–2A–GFP and HER2–CAR–2A–BATF3 constructs. T cells were resuspended at 50 × 10⁶ CAR⁺ cells ml⁻¹ in 1× PBS and serially diluted to the appropriate cell concentrations for 200-μl injections of either 10 × 10⁶, 2 × 10⁶, 5 × 10⁵, 2.5 × 10⁵ or 1 × 10⁵ HER2 CAR⁺ T cells. Then, 20–21 days after tumor implantation, and immediately before CAR T cell injections, mice were randomized into groups and tumors were measured. Tumor volumes were calculated on the basis of caliper measurements using the following formula: volume = ½ × (length × width²). CAR T cells were injected intravenously by tail vein. Tumors were measured every 4–6 days.
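The tumor volume formula and the dosing arithmetic above can be checked with a short calculation. The sketch below uses hypothetical caliper measurements, and the function names are assumptions made for illustration.

```python
def tumor_volume_mm3(length_mm: float, width_mm: float) -> float:
    """Tumor volume from caliper measurements: 0.5 * length * width^2."""
    return 0.5 * length_mm * width_mm ** 2


def cells_per_injection(cells_per_ml: float, volume_ul: float) -> float:
    """Number of cells delivered in an injection of the given volume."""
    return cells_per_ml * volume_ul / 1000.0


# Hypothetical measurement: 10 mm long and 8 mm wide.
volume = tumor_volume_mm3(10.0, 8.0)      # 320.0 mm^3
# Highest dose: 50e6 CAR+ cells per ml in a 200-ul injection gives 10e6 cells.
top_dose = cells_per_injection(50e6, 200.0)
```

The second call confirms that the 50 × 10⁶ cells ml⁻¹ stock corresponds to the highest dose of 10 × 10⁶ CAR T cells in 200 μl; lower doses follow from serial dilution of that stock.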
Flow cytometry analysis of input and tumor-infiltrating CAR T cells
Mice bearing HCC1954 tumors were euthanized at days 3 and 19 post CAR T cell delivery under deep isoflurane anesthesia via exsanguination, from which blood was collected. Blood was processed via red blood cell lysis buffer (Sigma) treatment followed by washing in PBS. Tumors were resected, minced and incubated in RPMI-1640 medium (Gibco) for 45 min in 100 mg ml⁻¹ Liberase (Sigma-Aldrich) and 10 mg ml⁻¹ DNase I (Roche). Single-cell suspensions for blood and tumor were filtered through a 70-μm cell strainer (Olympus Plastics), washed in PBS (Gibco), stained with Zombie NIR (1:250, BioLegend), washed in FACS buffer (2% FBS (Sigma) + PBS), and treated with 1:50 mouse Tru-stain Fc block (BioLegend). Cells were then stained for cell surface markers followed by intracellular staining using the Transcription Factor Staining Buffer Set (Invitrogen) per the manufacturer’s instructions. Antibodies are listed in Supplementary Table 5, and more details on the staining protocol are outlined in Supplementary Method 4. All data were collected on a Fortessa X 20 (Duke Cancer Institute Flow Cytometry Core) and analyzed using FlowJo v10.8.1. Blood/tumor from sham-infused mice and FMO controls were used to guide gating for CAR T cells and to confirm appropriate compensation, respectively.
TFome CRISPRko gRNA library construction
The Brunello genome-wide KO 83 library (four gRNAs per gene) was subset for 1,612 TFs 45 and IL7R . A total of 550 NT gRNAs were included in the library for a total of 7,000 gRNAs (Supplementary Table 6 ). This gRNA library was cloned into SpCas9 gRNA lentiviral plasmids with either mCherry or BATF3.
TFome CRISPRko screens and validations
A total of 20 × 10⁶ CD8⁺ T cells from two donors were activated with CD3/CD28 dynabeads at a 1:1 ratio. At 24 h post-activation, CD8⁺ T cells were split evenly and transduced in parallel with TFome CRISPRko gRNA libraries with mCherry or BATF3. At 48 h post-activation, cells were electroporated with Cas9 protein. Briefly, the cells were collected, spun down at 90 g for 10 min, resuspended in 100 μl of Lonza P3 Primary Cell buffer with 3.2 μg Cas9 (Thermo) per 10⁶ cells, and electroporated with the pulse code EH115. After electroporation, warm medium was immediately added to each cuvette and cells were recovered at 37 °C for 20 min before being transferred into a six-well plate. On day 3 post-transduction, cells were selected with 2 μg ml⁻¹ of puromycin for 3 days. On day 9 post-transduction, cells were stained for CD8, IL7R and a viability dye. Viable CD8⁺ T cells in the lower and upper 10% tails of IL7R expression were sorted for subsequent gRNA library construction and sequencing. All replicates were maintained and sorted at a minimum of 75× coverage. Subsequent individual gRNA validations were scaled down to 3.5 × 10⁵ cells per electroporation in an eight-well cuvette strip, but otherwise followed the same protocol and timeline as the CRISPRko screens.
TFome CRISPRko screen analyses
gRNA enrichment analysis was performed using DESeq2 as explained above. Gene-level enrichment analysis was performed using the MAGeCK v.0.5.9.4 (ref. 58) test module with the --paired and --control-sgrna parameters, pairing samples by donor and using NT gRNAs as controls, respectively. Results from gRNA- and gene-level analyses are presented in Supplementary Table 6.
Statistics and reproducibility
All statistical analysis methods are indicated in the figure legends (NS, not significant; * P < 0.05, ** P < 0.01, *** P < 0.001, **** P < 0.0001). Statistical analyses and data visualizations were performed in GraphPad Prism v.9.0.2, R v4.2.1 or Python v3.7.6. All experiments have been replicated with at least two biological replicates. For in vivo studies, mice were randomly assigned into treatment groups. In this study, no statistical method was used to predetermine sample size, no data were excluded from the analyses, experiments were not randomized, and investigators were not blinded to allocation during experiments and outcome assessment.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Data availability
All data associated with this study are present in the manuscript or its Supplementary Information files. GRCh38 reference genome was used for gRNA library designs and alignments. All CRISPR screening, scRNA-seq, RNA-seq and ATAC-seq data have been deposited in the Gene Expression Omnibus (GEO) under accession number GSE218988 .
Code availability
Publicly available software and packages were used in this study as indicated in Methods . A copy of the custom code used for gene-level analysis of the CRISPR screens is released on Zenodo 84 ( https://doi.org/10.5281/zenodo.8370763 ).
Xu, Y. et al. Closely related T-memory stem cells correlate with in vivo expansion of CAR.CD19-T cells and are preserved by IL-7 and IL-15. Blood 123 , 3750–3759 (2014).
Fraietta, J. A. et al. Determinants of response and resistance to CD19 chimeric antigen receptor (CAR) T cell therapy of chronic lymphocytic leukemia. Nat. Med. 24 , 563–571 (2018).
Klebanoff, C. A. et al. Central memory self/tumor-reactive CD8 + T cells confer superior antitumor immunity compared with effector memory T cells. Proc. Natl Acad. Sci. USA 102 , 9571–9576 (2005).
Krishna, S. et al. Stem-like CD8 T cells mediate response of adoptive cell immunotherapy against human cancer. Science 370 , 1328–1334 (2020).
Locke, F. L. et al. Tumor burden, inflammation, and product attributes determine outcomes of axicabtagene ciloleucel in large B-cell lymphoma. Blood Adv. 4 , 4898–4911 (2020).
Scott, A. C. et al. TOX is a critical regulator of tumour-specific T cell differentiation. Nature 571 , 270–274 (2019).
Alfei, F. et al. TOX reinforces the phenotype and longevity of exhausted T cells in chronic viral infection. Nature 571 , 265–269 (2019).
Khan, O. et al. TOX transcriptionally and epigenetically programs CD8 + T cell exhaustion. Nature 571 , 211–218 (2019).
Wang, X. et al. TOX promotes the exhaustion of antitumor CD8 + T cells by preventing PD1 degradation in hepatocellular carcinoma. J. Hepatol. 71 , 731–741 (2019).
Seo, H. et al. TOX and TOX2 transcription factors cooperate with NR4A transcription factors to impose CD8 + T cell exhaustion. Proc. Natl Acad. Sci. USA 116 , 12410–12415 (2019).
Martinez, G. J. et al. The transcription factor NFAT promotes exhaustion of activated CD8 + T cells. Immunity 42 , 265–278 (2015).
Lynn, R. C. et al. c-Jun overexpression in CAR T cells induces exhaustion resistance. Nature 576 , 293–300 (2019).
Seo, H. et al. BATF and IRF4 cooperate to counter exhaustion in tumor-infiltrating CAR T cells. Nat. Immunol. 22 , 983–995 (2021).
Tang, J. et al. Runx3-overexpression cooperates with ex vivo AKT inhibition to generate receptor-engineered T cells with better persistence, tumor-residency, and antitumor ability. J. Immunother. Cancer 11 , e006119 (2023).
Chen, J. et al. NR4A transcription factors limit CAR T cell function in solid tumours. Nature 567 , 530–534 (2019).
Chen, Z. et al. In vivo CD8 + T cell CRISPR screening reveals control by Fli1 in infection and cancer. Cell 184 , 1262–1280 (2021).
Belk, J. A. et al. Genome-wide CRISPR screens of T cell exhaustion identify chromatin remodeling factors that limit T cell persistence. Cancer Cell 40 , 768–786 (2022).
Guo, A. et al. cBAF complex components and MYC cooperate early in CD8 + T cell fate. Nature 607 , 135–141 (2022).
Prinzing, B. et al. Deleting DNMT3A in CAR T cells prevents exhaustion and enhances antitumor activity. Sci. Transl. Med. 13 , eabh0272 (2021).
Fraietta, J. A. et al. Disruption of TET2 promotes the therapeutic efficacy of CD19-targeted T cells. Nature 558 , 307–312 (2018).
Shifrut, E. et al. Genome-wide CRISPR screens in primary human T cells reveal key regulators of immune function. Cell 175 , 1958–1971 (2018).
Carnevale, J. et al. RASA2 ablation in T cells boosts antigen sensitivity and long-term function. Nature 609 , 174–182 (2022).
Freitas, K. A. et al. Enhanced T cell effector activity by targeting the Mediator kinase module. Science 378 , eabn5647 (2022).
Legut, M. et al. A genome-scale screen for synthetic drivers of T cell proliferation. Nature 603 , 728–735 (2022).
Schmidt, R. et al. CRISPR activation and interference screens decode stimulation responses in primary human T cells. Science 375 , eabj4008 (2022).
Ran, F. A. et al. In vivo genome editing using Staphylococcus aureus Cas9. Nature 520 , 186–191 (2015).
Nelson, C. E. et al. In vivo genome editing improves muscle function in a mouse model of Duchenne muscular dystrophy. Science 351 , 403–407 (2015).
Yin, C. et al. In vivo excision of HIV-1 provirus by saCas9 and multiplex single-guide RNAs in animal models. Mol. Ther. 25 , 1168–1186 (2017).
Matharu, N. et al. CRISPR-mediated activation of a promoter or enhancer rescues obesity caused by haploinsufficiency. Science 363 , eaau0629 (2019).
Thakore, P. I. et al. RNA-guided transcriptional silencing in vivo with S. aureus CRISPR-Cas9 repressors. Nat. Commun. 9 , 1674 (2018).
Henning, A. N., Roychoudhuri, R. & Restifo, N. P. Epigenetic control of CD8 + T cell differentiation. Nat. Rev. Immunol. 18 , 340–356 (2018).
Delpoux, A., Lai, C.-Y., Hedrick, S. M. & Doedens, A. L. FOXO1 opposition of CD8 + T cell effector programming confers early memory properties and phenotypic diversity. Proc. Natl Acad. Sci. USA 114 , 8865–8874 (2017).
Gautam, S. et al. The transcription factor c-Myb regulates CD8 + T cell stemness and antitumor immunity. Nat. Immunol. 20 , 337–349 (2019).
Roychoudhuri, R. et al. BACH2 regulates CD8 + T cell differentiation by controlling access of AP-1 factors to enhancers. Nat. Immunol. 17 , 851–860 (2016).
Pearce, E. L. et al. Control of effector CD8 + T cell function by the transcription factor eomesodermin. Science 302 , 1041–1043 (2003).
Mimitou, E. P. et al. Multiplexed detection of proteins, transcriptomes, clonotypes and CRISPR perturbations in single cells. Nat. Methods 16 , 409–412 (2019).
Ataide, M. A. et al. BATF3 programs CD8 + T cell memory. Nat. Immunol. 21 , 1397–1407 (2020).
Haradhvala, N. J. et al. Distinct cellular dynamics associated with response to CAR-T therapy for refractory B cell lymphoma. Nat. Med. 28 , 1848–1859 (2022).
Joosten, S. A. et al. Identification of a human CD8 + regulatory T cell subset that mediates suppression through the chemokine CC chemokine ligand 4. Proc. Natl Acad. Sci. USA 104 , 8029–8034 (2007).
Blaeschke, F. et al. Modular pooled discovery of synthetic knockin sequences to program durable cell therapies. Cell 186 , 4216–4234 (2023).
Murphy, T. L., Tussiwand, R. & Murphy, K. M. Specificity through cooperation: BATF–IRF interactions control immune-regulatory networks. Nat. Rev. Immunol. 13 , 499–509 (2013).
Majzner, R. G. & Mackall, C. L. Clinical lessons from the first leg of the CAR T cell journey. Nat. Med. 25 , 1341–1355 (2019).
Lim, W. A. & June, C. H. The principles of engineering immune cells to treat cancer. Cell 168 , 724–740 (2017).
Shan, Q. et al. Tcf1 preprograms the mobilization of glycolysis in central memory CD8 + T cells during recall responses. Nat. Immunol. 23 , 386–398 (2022).
Lambert, S. A. et al. The human transcription factors. Cell 172 , 650–665 (2018).
Chang, Y. K., Zuo, Z. & Stormo, G. D. Quantitative profiling of BATF family proteins/JUNB/IRF hetero-trimers using Spec-seq. BMC Mol. Biol. 19 , 5 (2018).
Singer, M. et al. A distinct gene module for dysfunction uncoupled from activation in tumor-infiltrating T cells. Cell 171 , 1221–1223 (2017).
Tindemans, I., Serafini, N., Di Santo, J. P. & Hendriks, R. W. GATA-3 function in innate and adaptive immunity. Immunity 41 , 191–206 (2014).
Wang, Y., Su, M. A. & Wan, Y. Y. An essential role of the transcription factor GATA-3 for the function of regulatory T cells. Immunity 35 , 337–348 (2011).
Gandhi, R. et al. Activation of the aryl hydrocarbon receptor induces human type 1 regulatory T cell-like and Foxp3 + regulatory T cells. Nat. Immunol. 11 , 846–853 (2010).
Jain, N. et al. TET2 guards against unchecked BATF3-induced CAR T cell expansion. Nature 615 , 315–322 (2023).
Weiser, C. et al. Ectopic expression of transcription factor BATF3 induces B-cell lymphomas in a murine B-cell transplantation model. Oncotarget 9 , 15942–15951 (2018).
Lollies, A. et al. An oncogenic axis of STAT-mediated BATF3 upregulation causing MYC activity in classical Hodgkin lymphoma and anaplastic large cell lymphoma. Leukemia 32 , 92–101 (2018).
Nakagawa, M. et al. Targeting the HTLV-I-regulated BATF3/IRF4 transcriptional network in adult T cell leukemia/lymphoma. Cancer Cell 34 , 286–297 (2018).
Liang, H. C. et al. Super-enhancer-based identification of a BATF3/IL-2R-module reveals vulnerabilities in anaplastic large cell lymphoma. Nat. Commun. 12 , 5577 (2021).
Girardi, T., Vicente, C., Cools, J. & De Keersmaecker, K. The genetics and molecular biology of T-ALL. Blood 129 , 1113–1123 (2017).
Lamant, L. et al. Gene-expression profiling of systemic anaplastic large-cell lymphoma reveals differences based on ALK status and two distinct morphologic ALK + subtypes. Blood 109 , 2156–2164 (2007).
Li, W. et al. MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. Genome Biol. 15 , 554 (2014).
Cho, J. H. et al. Engineering advanced logic and distributed computing in human CAR immune cells. Nat. Commun. 12 , 792 (2021).
Black, J. B. et al. Master regulators and cofactors of human neuronal cell fate specification identified by CRISPR gene activation screens. Cell Rep. 33 , 108460 (2020).
Sanson, K. R. et al. Optimized libraries for CRISPR–Cas9 genetic screens with multiple modalities. Nat. Commun. 9 , 5416 (2018).
Labun, K. et al. CHOPCHOP v3: expanding the CRISPR web toolbox beyond genome editing. Nucleic Acids Res. 47 , 171–174 (2019).
Philip, M. et al. Chromatin states define tumour-specific T cell dysfunction and reprogramming. Nature 545 , 452–456 (2017).
Galletti, G. et al. Two subsets of stem-like CD8 + memory T cell progenitors with distinct fate commitments in humans. Nat. Immunol. 21 , 1552–1562 (2020).
Pritykin, Y. et al. A unified atlas of CD8 T cell dysfunctional states in cancer and infection. Mol. Cell 81 , 2477–2493 (2021).
Perez, A. R. et al. GuideScan software for improved single and paired CRISPR guide RNA design. Nat. Biotechnol. 35 , 347–349 (2017).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9 , 357–359 (2012).
Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15 , 550 (2014).
Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36 , 411–420 (2018).
McGinnis, C. S., Murrow, L. M. & Gartner, Z. J. DoubletFinder-doublet detection in single-cell RNA sequencing data using artificial nearest neighbors. Cell Syst. 8 , 329–337 (2019).
Kuleshov, M. V. et al. Enrichr—a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 44 , 90–97 (2016).
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic—a flexible trimmer for Illumina sequence data. Bioinformatics 30 , 2114–2120 (2014).
Liao, Y., Smyth, G. K. & Shi, W. featureCounts—an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30 , 923–930 (2013).
Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184 , 3573–3587 (2021).
Corces, M. R. et al. An improved ATAC-seq protocol reduces background and enables interrogation of frozen tissues. Nat. Methods 14 , 959–962 (2017).
Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10 , R25 (2009).
Quinlan, A. R. & Hall, I. M. BEDTools—a flexible suite of utilities for comparing genomic features. Bioinformatics 26 , 841–842 (2010).
Picard. Broad Institute http://broadinstitute.github.io/picard/ (2017).
Ramírez, F., Dündar, F., Diehl, S., Grüning, B. A. & Manke, T. deepTools: a flexible platform for exploring deep-sequencing data. Nucleic Acids Res. 42 , W187–W191 (2014).
Zhang, Y. et al. Model-based analysis of ChIP–seq (MACS). Genome Biol. 9 , R137 (2008).
Yu, G., Wang, L.-G. & He, Q.-Y. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 31 , 2382–2383 (2015).
Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38 , 576–589 (2010).
Doench, J. G. et al. Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nat. Biotechnol. 34 , 184–191 (2016).
McCutcheon, S. R. Transcriptional and epigenetic regulators of human CD8 + T cell function identified through orthogonal CRISPR screens. Nat. Genet. https://doi.org/10.1101/2023.05.01.538906 (2023).
Acknowledgements
We thank all members of the Gersbach laboratory and G. E. Crawford for technical assistance and helpful discussions. We thank W. Wong for generously providing the HER2 CAR plasmid. Illustrative schematics (Figs. 1a and 6a, and Extended Data Figs. 1a,b, 3a and 6a) were created using BioRender. This work was supported by National Institutes of Health grants U01AI146356 (C.A.G.), UM1HG012053, UM1HG009428 and RM1HG011123 (T.E.R. and C.A.G.), National Science Foundation grant EFMA-1830957 (C.A.G.), an Allen Distinguished Investigator Award from the Paul G. Allen Frontiers Group to C.A.G., the Open Philanthropy Project, and the Duke-Coulter Translational Partnership.
Author information
Authors and affiliations.
Department of Biomedical Engineering, Duke University, Durham, NC, USA
Sean R. McCutcheon, Lucas Humayun, Maria A. ter Weele, Timothy E. Reddy & Charles A. Gersbach
Center for Advanced Genomic Technologies, Duke University, Durham, NC, USA
Sean R. McCutcheon, Alejandro Barrera, Christian McRoberts Amador, Keith Siklenka, Maria A. ter Weele, Timothy E. Reddy, Andrew S. Allen & Charles A. Gersbach
Department of Surgery, Duke University Medical Center, Durham, NC, USA
Adam M. Swartz, Smita K. Nair & Charles A. Gersbach
Department of Neurosurgery, Duke University School of Medicine, Durham, NC, USA
Michael C. Brown
Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, NC, USA
Alejandro Barrera, Keith Siklenka, Timothy E. Reddy & Andrew S. Allen
Department of Pharmacology and Cancer Biology, Durham, NC, USA
Christian McRoberts Amador
Duke Cancer Institute Center for Cancer Immunotherapy, Duke University School of Medicine, Durham, NC, USA
James M. Isaacs, Smita K. Nair & Scott J. Antonia
Department of Pathology, Duke University School of Medicine, Durham, NC, USA
Smita K. Nair
Contributions
S.R.M., A.M.S., M.C.B., C.M.A., J.M.I. and C.A.G. designed experiments. S.R.M., A.M.S., M.C.B., C.M.A., K.S. and L.H. performed the experiments. S.R.M., A.M.S. and M.C.B. performed the in vivo tumor experiments. S.R.M. and A.B. analyzed the scRNA-seq data. M.A.t.W. and A.S.A. contributed to screen analyses. T.E.R., S.N., S.A. and C.A.G. supervised the study. S.R.M. and C.A.G. wrote the manuscript with contributions by all authors.
Corresponding author
Correspondence to Charles A. Gersbach .
Ethics declarations
Competing interests.
S.R.M. and C.A.G. are named inventors on patent applications related to epigenome engineering technologies in primary human T cells. S.R.M. is a consultant for Tune Therapeutics. C.A.G. is a co-founder of Tune Therapeutics and Locus Biosciences, and is an advisor to Sarepta Therapeutics. The remaining authors declare no competing interests.
Peer review
Peer review information.
Nature Genetics thanks the anonymous reviewers for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 dSaCas9-based epigenetic screening platform.
(a) Schematic of CRISPRi lentiviral plasmid. (b) Schematic of CRISPRi screens in human CD8+ T cells. (c) Significance (P adj ) versus fold change in gRNA abundance between CD2 HIGH and CD2 LOW populations for CD2 CRISPRi screen. gRNA enrichment was defined using a paired two-tailed DESeq2 test with Benjamini-Hochberg correction. (d) CD2 gRNA fold change versus gRNA position relative to TSS. Dashed lines represent previously defined optimal CRISPRi window 32 . (e) CD2 gRNA fold change as a function of the final base pair of the PAM. x represents the number of gRNA hits and y represents the total number of gRNAs for each PAM variant. A one-way ANOVA with Dunnett’s post hoc test was used to compare fold change of gRNAs for each PAM variant to NNGRRT (mean values +/− SEM; T versus A (P adj = 0.0003), T versus C (P adj = 0.0399), and T versus G (P adj = 0.0088)). (f) CD2 gRNA activity plotted in rank order (n = 3 replicates of CD8+ T cells from pooled PBMC donors, mean values +/− SEM). A one-way ANOVA with Dunnett’s post hoc test was used to compare each gRNA to NT. Final base pair of PAM for each gRNA is indicated beneath gRNA label. (g) Relationship between CD2 gRNA activity and fold enrichment in screen (n = 18 CD2-targeting gRNAs (16 hits and 2 non-hits) and 1 non-targeting gRNA, mean values +/− SEM with Pearson’s correlation coefficient (r)). (h, i) Significance (P adj ) versus fold change in gRNA abundance between IL2RA HIGH and IL2RA LOW populations for the IL2RA CRISPRa Jurkat screens (n = 3 replicates) with (h) dSaCas9 VP64 and (i) VP64 dSaCas9 VP64 . gRNA enrichment was defined using a paired two-tailed DESeq2 test with Benjamini-Hochberg correction. (j) Normalized IL2RA MFI of dSaCas9 VP64 and VP64 dSaCas9 VP64 Jurkat lines transduced with indicated gRNAs (n = 2 replicates). A two-tailed paired ratio t-test (p = 0.0068) was used to compare gRNA activity between dSaCas9 VP64 and VP64 dSaCas9 VP64 Jurkat lines.
(k) Relative IL2RA mRNA expression of Jurkat CRISPRa lines transduced with indicated gRNA on day 9 post-transduction (n = 2, mean values +/− SEM).
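The captions repeatedly refer to P adj values obtained with Benjamini-Hochberg correction. As a rough illustration of what that adjustment does, here is a minimal Python sketch. It is not the DESeq2 implementation used in the study; the function name and the raw p-values are hypothetical and chosen only for demonstration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five gRNAs (illustrative only).
raw_p = [0.001, 0.008, 0.039, 0.041, 0.6]
p_adj = benjamini_hochberg(raw_p)
hits = [i for i, p in enumerate(p_adj) if p < 0.01]
```

With a threshold of P adj < 0.01, only the first of these hypothetical gRNAs would be called a hit.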
Extended Data Fig. 2 B2M promoter tiling CRISPRi screen in primary human CD8+ T cells.
(a) Significance (P adj ) versus fold change in gRNA abundance between B2M HIGH and B2M LOW populations for B2M CRISPRi screen. gRNA enrichment was defined using a paired two-tailed DESeq2 test with Benjamini-Hochberg correction. (b) B2M gRNA fold change versus gRNA position relative to TSS. Dashed lines represent previously defined optimal CRISPRi window 32 . (c) B2M gRNA fold change as a function of the final base pair of the PAM. x represents the number of gRNA hits and y represents the total number of gRNAs for each PAM variant. A global one-way ANOVA with Dunnett’s post hoc test was used to compare the fold change of gRNAs for each PAM variant to NNGRRT (mean values +/− SEM; T versus A (P adj = 0.002), T versus C (P adj < 0.0001), and T versus G (P adj = 0.0003)). (d) B2M gRNA activity plotted in rank order (n = 3 replicates of CD8+ T cells from pooled PBMC donors, mean values +/− SEM). A one-way ANOVA with Dunnett’s post hoc test was used to compare each gRNA to NT. Final base pair of PAM for each gRNA is indicated beneath gRNA label. (e) Relative B2M mRNA expression of CD8+ cells transduced with indicated gRNA on day 9 post-transduction (n = 3, mean values +/− SEM). A one-way ANOVA with Dunnett’s post hoc test was used to compare each gRNA to NT.
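The PAM panels group gRNAs by the final base of the PAM and compare each variant with NNGRRT. The grouping step alone might look like the Python sketch below. The function name, PAM sequences and fold-change values are hypothetical, and the ANOVA with Dunnett’s post hoc test reported in the caption is not reproduced here.

```python
from collections import defaultdict
from statistics import mean


def mean_fold_change_by_pam_base(grnas):
    """Group (PAM, log2 fold change) pairs by the last PAM base and average."""
    groups = defaultdict(list)
    for pam, log2_fc in grnas:
        groups[pam[-1].upper()].append(log2_fc)
    return {base: mean(values) for base, values in groups.items()}


# Hypothetical gRNAs: 6-nt PAMs of the form NNGRRN with made-up fold changes.
example_grnas = [
    ("AAGAAT", 2.0),
    ("CTGGGT", 4.0),
    ("TTGAGA", 1.0),
    ("GCGGAC", 0.5),
]
by_base = mean_fold_change_by_pam_base(example_grnas)
```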
Extended Data Fig. 3 dSaCas9 VP64 and VP64 dSaCas9 VP64 IL2RA promoter tiling CRISPRa screens in Jurkats.
(a) Schematic of dSaCas9 VP64 and VP64 dSaCas9 VP64 IL2RA promoter tiling CRISPRa screens in Jurkats. (b) UCSC genome browser track of IL2RA locus with statistical significance displayed for each gRNA in VP64 dSaCas9 VP64 CRISPRa screen. gRNA hits are annotated and labeled in blue. ATAC-seq and ENCODE candidate cis regulatory elements (cCREs) tracks are overlaid for visualization of chromatin accessibility and annotations. cCREs in red are defined as promoter-like elements and cCREs in blue are defined as enhancer-like elements. (c) Fold change in IL2RA gRNA abundance as a function of the final base pair of the PAM. x represents the number of gRNA hits and y represents the total number of gRNAs for each PAM variant. A global one-way ANOVA with Dunnett’s post hoc test was used to compare the fold change of gRNAs for each PAM variant to NNGRRT (mean values +/− SEM; T versus A (P adj = 0.0169), T versus C (P adj = 0.0131), and T versus G (P adj = 0.0079)). (d) Representative overlaid histograms of IL2RA expression for dSaCas9 VP64 and VP64 dSaCas9 VP64 Jurkat lines on day 9 post-transduction across gRNAs.
Extended Data Fig. 4 MYB silencing drives T cells towards an effector phenotype and NR1D1 activation induces an exhaustion phenotype.
(a) Statistical significance (P adj ) for each gene versus the fold change in gene expression in MYB CRISPRi-perturbed cells relative to non-perturbed cells. Only DEGs (P adj < 0.01, all labeled blue except MYB) are displayed. DEGs were defined using a two-tailed MAST test with Bonferroni correction. (b) Classification of annotated DEGs based on their functional role. (c) UMAP plot of CRISPRa scRNA-seq characterization with cells split by perturbation status: non-perturbed (top) and perturbed (bottom). Blue data points indicate cells with a NR1D1 gRNA. Cells were clustered using Seurat’s CalcPerturbSig function to mitigate confounding sources of variation such as the donor and phase of cell cycle. (d) Statistical significance (P adj ) for each gene versus the fold change in gene expression in NR1D1 CRISPRa-perturbed cells relative to non-perturbed cells. Only DEGs (P adj < 0.01, all labeled blue except NR1D1) are displayed. DEGs were defined using a two-tailed MAST test with Bonferroni correction. (e) Violin plot of exhaustion gene signature score across non-perturbed (n = 2,980 cells) and NR1D1-perturbed (n = 456 cells) in the CRISPRa scRNA-seq screen. Boxplots extend from the lower whisker (minimum value within 1.5 IQR of the first quartile) to the upper whisker (maximum value within 1.5 IQR of the third quartile). The boxed lines represent the first quartile, median, and third quartile. UCell gene signature scores are based on the Mann-Whitney U statistic.
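The boxplot convention described in (e), with whiskers ending at the most extreme values within 1.5 IQR of the quartiles, can be sketched in Python as follows. The quartile method (`inclusive`) and the example scores are assumptions for illustration; plotting libraries may compute quartiles slightly differently.

```python
from statistics import quantiles


def boxplot_summary(values):
    """Return quartiles and whisker ends using the 1.5 x IQR rule."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    lower_whisker = min(v for v in data if v >= lower_fence)
    upper_whisker = max(v for v in data if v <= upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_whisker": lower_whisker,
        "upper_whisker": upper_whisker,
    }


# Hypothetical signature scores with one extreme value.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

In this made-up example the value 100 lies beyond the upper fence, so the upper whisker ends at 9 rather than at the maximum.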
Extended Data Fig. 5 Kinetics of BATF3 expression and effects of BATF3 OE.
(a) Median BATF3 expression over time relative to baseline expression before T cell activation across groups (n = 3 donors). Fold change in BATF3 expression was calculated using the 2^−dCT method relative to baseline BATF3 expression. An internal housekeeping control was excluded because T cell stimulation dramatically alters expression of housekeeping genes such as GAPDH and TBP; the input mass of RNA into the reverse transcription reaction was the same for all samples. (b) An IL7R fluorescence minus one (FMO, left) control was used to set the IL7R+ gate. Representative IL7R expression of CD8+ T cells from a donor transduced with either GFP (middle) or BATF3 OE (right) on day 8 post-transduction. (c) Transcripts per million (TPM) of selected genes: BATF3 (P adj = 1e-7), CCR7 (P adj = 0.01), TCF7 (ns), TIGIT (P adj = 3e-18), TIM3 (P adj = 1e-10), CISH (P adj = 6e-11), LAG3 (P adj = 1e-14), FOXP3 (P adj = 5e-13), and CCL4 (P adj = 5e-6) with either GFP or BATF3 OE on day 10 post-transduction (n = 5 donors, mean values +/− SEM). P adj values were determined using a paired two-tailed DESeq2 test with Benjamini-Hochberg correction.
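The fold-change calculation in (a) compares each time point's CT directly with the baseline CT, without a housekeeping gene. A minimal Python sketch of that arithmetic, assuming dCT = CT(time point) − CT(baseline) and using hypothetical CT values, is:

```python
from statistics import median


def fold_change_vs_baseline(ct_sample, ct_baseline):
    """2^-dCT fold change, where dCT = CT(sample) - CT(baseline).

    No housekeeping gene is subtracted; this assumes equal RNA input mass
    across samples, as stated in the caption.
    """
    d_ct = ct_sample - ct_baseline
    return 2.0 ** (-d_ct)


# Hypothetical CT values for three donors at one time point (illustrative only).
baseline_ct = {"donor1": 28.0, "donor2": 27.0, "donor3": 29.0}
timepoint_ct = {"donor1": 25.0, "donor2": 25.0, "donor3": 25.0}

fold_changes = {
    donor: fold_change_vs_baseline(timepoint_ct[donor], baseline_ct[donor])
    for donor in baseline_ct
}
median_fold_change = median(fold_changes.values())
```

A lower CT at the time point than at baseline gives a fold change above 1, which corresponds to increased expression.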
Extended Data Fig. 6 BATF3 OE attenuates expression of T cell exhaustion markers.
(a) Schematic of acute (left) and chronic stimulation (right) with CD3/CD28 dynabeads. (b) Average percentage of positive cells (top panel: PD1 (p = 0.02), LAG3 (p = 0.06), TIGIT (p = 0.03), and TIM3 (ns)) and MFI (bottom panel: PD1 (p = 0.02), LAG3 (p = 0.03), TIGIT (p = 0.046), and TIM3 (ns)) of exhaustion markers on day 3 post-transduction with GFP or BATF3 OE (n = 3 individual donors, mean values +/− SEM, two-tailed paired t tests were used to determine statistical significance). (c) Time course of PD1, LAG3, TIGIT, and TIM3 expression post-transduction with GFP or BATF3 OE under acute or chronic stimulation (n = 3 individual donors, mean values +/− SEM).
Extended Data Fig. 7 BATF3 OE remodels epigenetic landscape of TCF7 locus.
Proportion of differentially accessible regions based on genomic feature classification with (a) acute stimulation and (b) chronic stimulation. (c) Representative ATAC-seq tracks of the TCF7 locus under acute and chronic stimulation with and without BATF3 OE.
Extended Data Fig. 8 BATF3 OE enhances in vitro and in vivo tumor control.
(a) Tumor viability after 24 hours of culture in T cell media, co-culture with CAR null T cells, or co-culture with CAR T cells at specified effector to target (E:T) ratios (n = 3 donors, mean values +/− SEM). (b) Tumor viability after 24 hours of co-culture with GFP CAR null , GFP CAR + , and BATF3 OE CAR + CD8 T cells at specified E:T ratios for each donor. (c) Tumor volume over time as a function of the dose of control HER2 CAR T cells (n = 5 mice per treatment, mean values +/− SEM). Mice were intravenously injected with CAR T cells on day 21. (d) Representative flow plots of CAR expression in CD8+ T cells with control and BATF3 OE CAR lentiviral plasmids on day 9 post-transduction (the same day that the mice were intravenously injected with CAR T cells). (e) Summary statistics of transduction rates and total CAR+ T cells with control and BATF3 OE CAR lentiviral plasmids on day 9 post-transduction (n = 3 donors, lines connect donors across treatments, paired two-tailed t tests were used to determine statistical significance). (f) Tumor volumes of individual mice treated with 5 × 10^5 (left panel, n = 5 mice per treatment group) or 2.5 × 10^5 (right panel, n = 4 mice per treatment group) CAR T cells with or without BATF3 overexpression. Thinner lines represent tumor volumes of individual mice and thicker lines represent mean tumor volumes +/− SEM for each treatment group.
Extended Data Fig. 9 Characterization of CAR T cells with or without BATF3 OE during in vivo tumor control experiment.
(a) Tumor volumes over time for untreated mice (n = 4 mice) and mice treated with 5 × 10^5 CAR T cells with or without BATF3 overexpression (n = 2 donors, 4 GFP and 3 BATF3 mice for donor 1, 3 mice per treatment for donor 2, mean values +/− SEM). Input CAR T cells and tumor infiltrating CAR T cells on day 3 and day 19 post-treatment were characterized using flow cytometry. (b) Same as (a) except stratified based on donor (4 GFP and 3 BATF3 mice for donor 1, 3 mice per treatment for donor 2, mean values +/− SEM). (c) Percentage of positive cells or (d) MFI for indicated markers of input CAR T cells across groups (n = 2 donors, mean values +/− SEM). (e) Histograms of TCF1 and LAG3 expression for input CAR T cells. (f) Percentage of CD8+ T cells in peripheral blood on day 3 post-treatment across groups (n = 2 donors, 2 GFP and 3 BATF3 mice for donor 1 and 3 mice per treatment for donor 2, mean values +/− SEM). A two-tailed Mann-Whitney test was used to compare the percentage of CD8+ cells between the two groups. (g) Percentage of positive cells for indicated markers of tumor infiltrating CAR T cells on day 3 post-treatment across groups (n = 2 donors, 2 GFP and 3 BATF3 mice for donor 1 and 3 mice per treatment for donor 2, mean values +/− SEM). (h) Percentage of positive cells or (i) MFI for indicated markers of tumor infiltrating CAR T cells on day 19 post-treatment across groups (n = 2 donors, 1 mouse per treatment for donor 1, 2 GFP and 3 BATF3 mice for donor 2, mean values +/− SEM). Two-tailed t tests were used to compare expression of each marker between groups (% LAG3+ (p = 0.046), % CD45RA+ (p = 0.048), IRF4 MFI (p = 0.01)).
Extended Data Fig. 10 CRISPR knockout screens reveal co-factors of BATF3 and targets for cancer immunotherapy.
(a) Number of gRNA hits (P adj < 0.01 as defined by a paired two-tailed DESeq2 test with Benjamini-Hochberg correction) per gene in the CRISPRko screen without BATF3 OE. Only genes with at least 1 enriched gRNA were included in this plot. (b) Boxplot of baseline expression of genes stratified based on whether they were hits in the CRISPRko screen without BATF3 OE (n = 1,573 nonsignificant genes and n = 34 significant genes; genes with an FDR < 0.01 based on MAGeCK gene-level analysis were classified as hits). Boxplots extend from the lower whisker (minimum value) to the upper whisker (maximum value). Lines represent the first quartile, median, and third quartile. A two-tailed t test was used to compare baseline expression of nonsignificant and significant gene hits. (c) z scores of gRNAs for JUNB and IRF4 in mCherry (left) and BATF3 (right) screens. Enriched gRNAs (P adj < 0.01, labeled blue) were defined using a paired two-tailed DESeq2 test with Benjamini-Hochberg correction. Non-targeting gRNAs are labeled gray. (d) Predicted functional protein association network of BATF3 using STRING. (e) Percentage IL7R+ (left) and relative IL7R MFI (right) in CD8+ T cells with mCherry or BATF3 across gRNAs. Relative IL7R MFI was calculated by dividing the IL7R MFI of each targeting gRNA by the IL7R MFI of the non-targeting gRNA for each donor within the treatment group (n = 3 donors, mean values +/− SEM). (f) Representative histograms of IL7R expression in CD8+ T cells with BATF3 overexpression in combination with JUNB or IRF4 gene knockouts. (g) Effect of ZNF217 knockout on IL7R expression in CD8+ T cells across three donors with BATF3 OE.
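The relative IL7R MFI in (e) is a per-donor ratio to the non-targeting gRNA within the same treatment group. A short Python sketch of that normalization follows, with hypothetical MFI values and the assumed label "NT" for the non-targeting control:

```python
from math import sqrt
from statistics import mean, stdev


def relative_mfi(mfi_by_grna, control="NT"):
    """Divide each targeting gRNA's MFI by the non-targeting gRNA's MFI."""
    control_mfi = mfi_by_grna[control]
    return {
        grna: mfi / control_mfi
        for grna, mfi in mfi_by_grna.items()
        if grna != control
    }


# Hypothetical IL7R MFI values for one treatment group (illustrative only).
donors = {
    "donor1": {"NT": 1000.0, "JUNB": 500.0, "IRF4": 250.0},
    "donor2": {"NT": 800.0, "JUNB": 400.0, "IRF4": 400.0},
    "donor3": {"NT": 1200.0, "JUNB": 300.0, "IRF4": 300.0},
}
per_donor = {donor: relative_mfi(values) for donor, values in donors.items()}

# Mean and SEM across donors for one gRNA.
junb = [per_donor[d]["JUNB"] for d in donors]
junb_mean = mean(junb)
junb_sem = stdev(junb) / sqrt(len(junb))
```

Normalizing within each donor before averaging removes donor-to-donor differences in baseline IL7R signal from the comparison.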
Supplementary information
Supplementary information.
Supplementary Figs. 1–11, Notes 1–9 and Methods 1–4.
Reporting Summary
Supplementary tables 1–7.
Supplementary Table 1. CD2 , B2M , IL2RA CRISPR tiling screening data. Supplementary Table 2. TF CRISPRi/a library and flow-based screening data. Supplementary Table 3. TF CRISPRi/a scRNA-seq data. Supplementary Table 4. BATF3 OE RNA-seq data. Supplementary Table 5. gRNAs, oligos and antibodies. Supplementary Table 6. CRISPR TFome KO screening data. Supplementary Table 7. ZNF217 and GATA3 KO RNA-seq data.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .
About this article
Cite this article.
McCutcheon, S.R., Swartz, A.M., Brown, M.C. et al. Transcriptional and epigenetic regulators of human CD8 + T cell function identified through orthogonal CRISPR screens. Nat Genet (2023). https://doi.org/10.1038/s41588-023-01554-0
Received : 18 July 2023
Accepted : 26 September 2023
Published : 09 November 2023
DOI : https://doi.org/10.1038/s41588-023-01554-0
In simple, operational terms, annotation may be defined as the part of genome analysis that is customarily performed before a genome sequence is deposited in GenBank and described in a published paper.
In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, [2] by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. [3]
Genome annotation is the process of finding and designating locations of individual genes and other features on raw DNA sequences, called assemblies. Annotation gives meaning to a given sequence and makes it much easier for researchers to view and analyze its contents.
Genome annotation is the process of deriving the structural and functional information of a protein or gene from a raw data set using different analysis, comparison, estimation, precision, and other mining techniques.
An annotation (irrespective of the context) is a note added by way of explanation or commentary. Since the 1980s, molecular biology and bioinformatics have created the need for DNA annotation. DNA annotation or genome annotation is the process of identifying the locations of genes and all of the coding regions in a genome and determining ...
Annotation is a means of retrieving information encoded within the multitude of different sequence patterns of the four nucleotides (i.e., A, T, C and G). The term genome annotation has evolved from the annotation of protein-coding genes to include the annotation of single nucleotides on thousands of individual genomes.
DNA annotation or genome annotation is the process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do. An annotation (irrespective of the context) is a note added by way of explanation or commentary. Once a genome is sequenced, it needs to be annotated to make sense of it.
Many components of the human genome were discovered long before its base pairs were read through astute experimental design. When the DNA was finally readable, these abstract concepts were mapped onto actual sequences, creating a multilayered annotation linking sequence to phenotype.
Gene finding is a chief aspect of nucleotide-level annotation. For complex genomes, the most successful methods use a combination of ab initio gene prediction and sequence comparison with...
Accurate genome annotation is critical for successful genomic, genetic, and molecular biology experiments. These annotations can be generated using a number of approaches and available software tools. This unit describes methods for genome annotation and a number of software tools commonly used in gene annotation.
First, automatic annotation uses a predefined set of 'marker genes' (i.e., genes that are specifically expressed in a known cell type) or reference single-cell data (i.e., an existing...
Genome annotation is the process of attaching biological information to sequences. It consists of two main steps: identifying elements on the genome, a process called gene prediction, and attaching biological information to these elements. Automatic annotation tools try to perform all of this by computer analysis, as opposed to manual ...
An annotation is the statement of a connection between a type of gene product and the types designated by terms in an ontology such as the GO. This statement is created on the basis of observations of the instances of such types made in experiments and of the inferences drawn from such observations.
Genome annotation has moved beyond merely identifying protein-coding genes to include the annotation of transposons, regulatory regions, pseudogenes and non-coding RNA genes. Another new...
Genome annotation is the process of attaching higher-level information to primary sequences. The whole process consists of starting with raw DNA sequences and giving a biological meaning to its content (Stein 2001). The first step in annotating a raw sequence will require the mapping of structural elements in the genome by comparing the latter against a library of already known sequences.
Genes can now be studied in a large context and considered like elements of an ensemble of homologous genes or implicated in the same physiological function. But building the link between raw DNA sequence and gene function is far from being an easy task [15]. 1.2. Annotation: definition and justification
A fundamental application of the Gene Ontology (GO) is in the creation of gene product annotations, evidence-based associations between GO definitions and experimental or sequence-based analysis. Currently, the GOC disseminates 126 million annotations covering >374 000 species including all the kingdoms of life.
Annotation is the identification of genes, their structures, and other miscellaneous features in the genome, as well as finding the biological functions of the identified features. Genome annotation is classified into structural and functional annotation. Structural annotation involves finding the locations of genes in the chromosomes, gene structure ...
Genes occupy a small fraction of most eukaryotic genomes (about 5% in the case of the human genome). 1 Identifying and mapping genes into a given genome sequence is usually referred to as annotating the genome.
This exercise uses annotation resources to go from a gene symbol 'BRCA1' through to the genomic coordinates of each transcript associated with the gene, and finally to the DNA sequences of the transcripts. ... Define our region of interest by creating a GRanges instance with appropriate genomic coordinates. Our region corresponds to 10Mb up ...
Please refer to the Eukaryotic Genome Annotation chapter of the NCBI Handbook for algorithmic details. The pipeline uses a modular framework for the execution of all annotation tasks from the fetching of raw and curated data from public repositories (sequence and Assembly databases) to the alignment of sequences and the prediction of genes, to ...
Gene annotation involves the process of taking the raw DNA sequence produced by the genome-sequencing projects and adding layers of analysis and interpretation necessary to extracting biologically significant information and placing such derived details into context.
Functional annotation of open reading frames in microbial genomes remains substantially incomplete. Enzymes constitute the most prevalent functional gene class in microbial genomes and can be ...
genes have eQTL variants with independent regulatory effects in the developing brain, confirming that they are functional variants. Overall, our hierarchical method generated an annotation of bulk eQTL data that allowed for the discovery of divergent cell type regulation in an organ with a complex mixture of cell types. Results
While context-type-specific regulation of genes is largely determined by cis-regulatory regions, attempts to identify cell-type specific eQTLs are complicated by the nested nature of cell types. We present a network-based model for hierarchical annotation of bulk-derived eQTLs to levels of a cell type tree using single cell chromatin accessibility data and no clustering of cells into discrete ...
Clinical response to adoptive T cell therapies is associated with the transcriptional and epigenetic state of the cell product. Thus, discovery of regulators of T cell gene networks and their ...