Inference of 3D Regulatory Interactions from 2D Genomic Data – Katherine Pollard


Katherine Pollard:
Thank you so much. It’s wonderful to be here to tell you guys about some work we’ve
been doing in my lab to understand how genes are regulated in 3D structures that form between
distal enhancers and promoters. And I wanted to start with some pictures that are a little
bit more elaborate than the usual line with blobs on it. These are from some colleagues
who’ve attempted to make videos or still graphics of the process of transcription.
And the purpose is to remind all of us that this is happening in a three dimensional space,
that it involves a lot of proteins, and it involves DNA and chromatin-forming complex
structures. And I would argue that while more complicated than the line drawings, these
are still falling short of the true complexity. So our goal is to try to understand what we
could about this process from ENCODE-like data, including actual ENCODE data. So the motivation is probably clear. This
has been spoken about already by Mathieu this morning and Tyler last night. But just to
recap and to maybe give you a little personal look into my motivation, the idea is that
there are many mutations in the non-coding portion of the genome. It’s an obvious hypothesis
that these might affect enhancers, as we’ve also heard about today in several other talks.
And so, if you have here in the line drawing version a mutation that’s associated with
a disease, for instance, or something I'm really interested in, which is a human-chimp
difference, a divergence between us and our closest relatives, you might want
to ask, could this variant actually be causal? Or is it some other linked variant nearby?
And if I could find the causal variant, how would I follow that up? I would want to know
the genes and the pathways that were being targeted. And so, it would be complicated
if there were many genes in this locus, which there are in most places in the genome where
we look at these non-coding mutations. So this is what we've been working on a lot in my
lab. We came to it from this evolutionary question of comparing humans and chimps, and from
our observation that the fastest-evolving regions of the human genome have the genetic
and epigenetic signatures of distal enhancers that function during development. But I think
it’s also very relevant to the disease question. We’ve been asking ourselves can we annotate
where the enhancers are, where the distal regulatory elements are. And I know many people
here have been thinking about that. And then more recently, we've been thinking that it
would be great if we could map those to the genes. For example, if that TC variant on the
right looks interesting because it falls in an enhancer in a relevant cell type, the
inference that it would be regulating the closest gene, gene C, is wrong in this map,
where it's actually looping over here to regulate gene A. So these problems are both hard. As everybody
here knows, finding enhancers and figuring out what genes they target are not easy problems.
Some of the standard things that are done, such as performing a few ChIP-seq experiments
and saying, "I found enhancers" (these are H3K27 acetylation peaks, for example), fail
to identify many of the experimentally validated enhancers and find many false positives. But
there is helpful information there. And the idea is that if we can combine lots of datasets
together, we can do better. Our hope was that we might also be able to improve on predicting
gene targets, because the commonly used practice of picking the closest gene, or even picking
all genes within a reasonable window on the genome, turns out, when you do the chromatin capture
experiment that measures these interactions, to be right only about 8 to 10 percent of the
time. So often, when we do a gene ontology enrichment or some sort of follow-up
functional study, we're actually pursuing the wrong functions, the wrong genes, and
the wrong pathway. So I'm going to
assume initially that we know the enhancers. And I’m going to focus on the question of
predicting the gene targets. So the question is, can we reconstruct 3D interactions between
the enhancers and promoters from the 2D genomic data? And so, here is a picture of a region
of the human genome, just to define the complexity of the problem a little more and to show you
some of the data. This is a browser-like shot. There are many genes here. These are active
enhancers and active promoters, in this case from a ChromHMM. And these are interactions
that were detected in a high-resolution Hi-C experiment, where chromatin capture measured
that some of these enhancers here — this one, E-1 and E-2, are looping over to this
promoter, not these intervening genes here. So could we predict this from all this up
here? Why we might want to do that has been motivated by several other speakers here,
but one huge motivation is that this experiment, to get it to this resolution of single promoters
and single enhancers, is incredibly expensive, millions of dollars to generate that data.
And this data is easier to generate in a short time period and with less money. Another motivation
that I think is maybe even more exciting in some sense than the financial motivation is
that if some of this data were predictive, we might actually learn something about how
chromatin loops form, that we might learn something about the mechanism. So the approach that we’ve been using is
supervised machine learning. What that means is that we need training data. We need some
examples of promoters and enhancers that are active in a cell type and are in physical
proximity to each other and some other enhancers and promoters that are active — have the
active marks at them, but are not physically interacting. And then we have feature data
from which we are going to try to learn a model. And once we learn a model, by holding
out some of the data we can evaluate how well we predict on that held-out data, a process
called cross-validation. If we could succeed in this, we could then
make predictions beyond our training data with some confidence in the accuracy. So we’re
fortunate to have some good training data here. This publication, which came out at the
very end of 2014, performed, as I mentioned, a Hi-C chromatin capture experiment at
1 kilobase resolution in several of the ENCODE cell lines. This is genome-wide and gives
us the resolution that we need to see individual promoters interacting with individual enhancers,
but it is exceedingly expensive to generate, millions of dollars. So by looking at active enhancers and active
promoters and labeling them as positives if they are interacting in the Hi-C and negatives
if they’re not, we have a training dataset. The features we used to try to predict the
interactions were of three general types: one, we looked at evolutionary conservation,
not of the sequence per se, but of the co-localization or the synteny of the enhancer and the promoter.
So if we look across evolutionary time, is there a conserved sequence for that enhancer
across species, and does it stay relatively close or at a similar distance to that gene?
This had been shown to be very predictive of eQTLs, expression quantitative trait loci.
It turned out not to be particularly predictive for us on this problem. Most of the data we used and most of what
was very predictive were functional genomics experiments, primarily ChIP-seq for transcription
factors, histone modifications, and various structural proteins. The key — and I’ll
jump a little bit ahead and will tell you more about this in a minute — is that we looked
at the enhancer and the promoter, which others have done; we heard about that from Mathieu
this morning. The really interesting thing, where we learned some really interesting biology
and really improved our predictions, was to look at the window in between the enhancer and
promoter and integrate the signal along that piece of looping chromatin. This is different
from what I had seen others do before. We tried it, frankly, on a bit of a whim, and it
turned out to be a really interesting and important thing to have done.
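To make the idea of "integrating the signal along the looping chromatin" concrete, here is a
minimal sketch, with made-up coordinates and a simulated signal track rather than real
ChIP-seq data, of summarizing a per-base signal separately over the enhancer, the intervening
window, and the promoter. This is an illustration of the idea, not the actual TargetFinder
feature code.

```python
# Illustrative sketch: summarize a signal track separately over the enhancer,
# the promoter, and the looping window in between (all coordinates and the
# signal track below are hypothetical).
import numpy as np

def region_features(signal, enhancer, promoter):
    """Mean signal in the enhancer, the promoter, and the window between them.

    signal   : 1D array of per-base signal for one chromosome
    enhancer : (start, end) in array coordinates
    promoter : (start, end) in array coordinates
    """
    e_start, e_end = enhancer
    p_start, p_end = promoter
    # The window is everything strictly between the two elements.
    w_start, w_end = min(e_end, p_end), max(e_start, p_start)
    return {
        "enhancer_mean": float(signal[e_start:e_end].mean()),
        "window_mean": float(signal[w_start:w_end].mean()),
        "promoter_mean": float(signal[p_start:p_end].mean()),
    }

rng = np.random.default_rng(1)
chrom_signal = rng.poisson(2.0, size=1_000_000).astype(float)  # fake coverage track
print(region_features(chrom_signal, enhancer=(10_000, 12_000), promoter=(480_000, 482_000)))
```

In a real pipeline each ChIP-seq or DNase dataset would contribute one such enhancer, window,
and promoter summary per candidate pair, giving the three-region feature layout described in
the talk.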
Finally, we looked at the sequences themselves: are the upstream transcription factors that
are predicted by motif analysis to bind the enhancer annotated to be involved in functions
similar to those of the potential target gene? And are there shared motifs at the enhancer
and promoter? There was also
some evidence from others that these would be useful features. There was some information
there, but most of the data turned out to be in these ChIP-seq experiments — as I’ll
show you in a minute — and specifically on the looping chromatin more than at the enhancer
and promoter. For those that are interested, I’ll tell
you about the computational algorithm. We decided to use decision trees. The motivation
was that we thought that these features might interact in complex combinations, which turned
out to be true. So you might want some event to happen or another event, but not some third
event. And we knew that it wasn’t — from the browser shot I showed previously, we knew
this was going to be complex and that we needed to be able to model these bullying combinations
and that decision trees might be a good way to do this. And by decision trees, I mean
approaches such as random forest and gradient boosting. We tried several different algorithms,
and within this sort of family of Ensembl methods there wasn’t a big difference in
performance across algorithms. So we did get a lot of benefit, however, from this Ensembl
approach, which is essentially that you build many imperfect classifiers by random permutations
of your data and then combine them to get a predictor that does better than your single
best predictor would. This is really important because, essentially, each classifier
over-fits some little part of your data through that random subsampling. And by learning
these different subsets of features that can sometimes, but not always, be important, you
actually learn a more thorough model than if you just took your single best classifier
on the full dataset. This gave us a real boost in performance.
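As an illustration of this kind of setup, and not the actual TargetFinder pipeline, here is a
minimal sketch in which a simulated feature matrix and simulated Hi-C-derived labels are fed
to a gradient-boosted tree ensemble and scored by cross-validation; all data and parameter
choices below are assumptions for the example.

```python
# Minimal sketch (not the actual TargetFinder pipeline): fit a gradient-boosted
# tree ensemble on labeled enhancer-promoter pairs and score it by cross-validation.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(0)

# Hypothetical feature matrix: one row per enhancer-promoter pair, one column per
# (dataset, region) feature, e.g. "CTCF signal in the window between E and P".
n_pairs, n_features = 2000, 60
X = rng.normal(size=(n_pairs, n_features))

# Hypothetical labels: 1 if the pair contacts in high-resolution Hi-C, else 0.
# Real data are heavily imbalanced, so we simulate ~10% positives.
y = (rng.random(n_pairs) < 0.10).astype(int)

# Boosted decision trees can capture Boolean-like feature combinations
# (event A and event B, but not event C).
model = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=0)

# Stratified k-fold cross-validation keeps the class balance in every fold and
# ensures no pair appears on both the training and test sides of a split.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("cross-validated ROC AUC: %.2f +/- %.2f" % (auc_scores.mean(), auc_scores.std()))
```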
So the results of doing this really surprised me. I knew this was a hard problem and I knew there'd be some information
here. I didn’t expect to see such good performance. But here’s a summary of what we found. And
so this is on three ENCODE cell lines where we had a lot of data to mine for features
and we had the high-resolution Hi-C from the experiments done at the end of last year
at the Broad. These pictures are probably familiar to everyone: the false positive rate
is on the horizontal axis and the true positive rate is on the vertical axis. Perfect performance would be
in the upper left-hand corner. Our algorithm outputs a score, and by thresholding that
score you trace out this curve: at a lenient threshold you make many predictions and find
most of your true positives, but with a higher false positive rate; at a stricter threshold
you have less power, finding fewer of your true positives, but also a much lower false
positive rate. And what you can see is that we do a great job by a number of different
measures. The area under this curve, the AUC, is one measure of performance: how high
above the random-guessing line we are. I think it's very important in bioinformatics
problems, where most of your dataset consists of negative examples, to not just report
an AUC but to also look at precision and recall. Because in a problem where most of the
genome, or most of the enhancer-promoter pairs, are not physically interacting, a predictor
that predicts no interaction most of the time would trivially have a very low false positive
rate, but it could still have very poor precision: most of its positive predictions would be wrong.
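For readers who want the quantities being discussed spelled out, here is an illustrative
sketch, using simulated labels and scores rather than our data, of reporting ROC AUC alongside
precision and recall, which matter more when negatives dominate.

```python
# Illustrative evaluation sketch with simulated scores (not the talk's data):
# report ROC AUC together with precision/recall summaries, which are more
# informative when the negative class dominates.
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             precision_score, recall_score)

rng = np.random.default_rng(2)
y_true = (rng.random(5000) < 0.05).astype(int)     # ~5% true interactions
scores = rng.normal(size=5000) + 2.0 * y_true      # hypothetical classifier scores

print("ROC AUC          :", round(roc_auc_score(y_true, scores), 3))
print("average precision:", round(average_precision_score(y_true, scores), 3))

y_pred = (scores > 1.5).astype(int)                # one choice of score threshold
print("precision at threshold:", round(precision_score(y_true, y_pred), 3))
print("recall at threshold   :", round(recall_score(y_true, y_pred), 3))
```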
So, pleasingly, we also had a very high precision, which I was very surprised to see. When Sean Whalen, the post-doc in my lab, showed
this to me, I thought well maybe we just encoded — I mean, this is a mistake, it can’t be
true. So, first of all: was there any bleeding between the cross-validation sets, or some
other technical issue? We resolved that none of those were going on. And then I said, "Well,
maybe these features are just encoding how far away the enhancer is from the promoter,
because we know that at very short ranges, like 10 to 20 kilobases away, there is a higher
chance that an enhancer is interacting with a promoter." So we looked, and it turns out
there was no dependence of performance on the distance between the promoter and the enhancer.
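A minimal sketch of this kind of sanity check, again with simulated distances and scores
standing in for the real data: compute the AUC separately within enhancer-promoter distance
bins and confirm that it does not simply track distance.

```python
# Sanity-check sketch (simulated data): does prediction accuracy depend on the
# enhancer-promoter distance? Compute ROC AUC within distance bins.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n = 10_000
distance = rng.uniform(10_000, 2_000_000, size=n)   # bp between E and P
y_true = (rng.random(n) < 0.10).astype(int)
scores = rng.normal(size=n) + 1.5 * y_true           # distance-independent skill

bins = np.array([10_000, 100_000, 500_000, 1_000_000, 2_000_000])
bin_idx = np.digitize(distance, bins[1:-1])          # assign each pair to a bin
for b in range(len(bins) - 1):
    mask = bin_idx == b
    if mask.any() and y_true[mask].min() != y_true[mask].max():  # need both classes
        auc = roc_auc_score(y_true[mask], scores[mask])
        print(f"{int(bins[b]):>9,d}-{int(bins[b+1]):>9,d} bp: AUC = {auc:.2f}")
```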
And if anything, we do a little bit better the further away the enhancer is from the
promoter, despite the fact that many of these are up to 2 million base pairs away from
the promoter that they regulate. So we weren't just encoding
distance here with these complex feature sets. So what was encoded in the feature set? As
I alluded to earlier, it turned out to be really important to look at the window between
the enhancer and the promoter: what proteins are decorating that looping chromatin? A
nice aspect of using ensemble methods is that there are now some very nice techniques for
assessing feature importance, in other words, how important each of the very different
datasets was for prediction accuracy, using techniques such as recursive feature
elimination, for example. So you can get a measure of the importance
of a feature. And here I’m making box plots where I’m showing the distribution in different
cell lines, plus a combined model in the four colors for the enhancer, the window in between,
and the promoter. So for marks at the promoter, in the window, or at the enhancer, how predictive
are they on the vertical axis? And what you see is that there is signal in all three regions.
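As an illustration, with invented feature names and simulated data, one way to get such a
region-level summary is to aggregate the per-feature importances of a fitted tree ensemble
according to whether each feature was measured at the enhancer, in the window, or at the
promoter.

```python
# Illustrative sketch: aggregate tree-ensemble feature importances by region
# (enhancer / window / promoter). Feature names and data are hypothetical.
import numpy as np
from collections import defaultdict
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(4)
regions = ["enhancer", "window", "promoter"]
marks = ["CTCF", "RAD21", "H3K27ac", "H3K9me3", "POLR2A"]
feature_names = [f"{r}_{m}" for r in regions for m in marks]

X = rng.normal(size=(1000, len(feature_names)))
y = (rng.random(1000) < 0.1).astype(int)

model = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)

by_region = defaultdict(float)
for name, imp in zip(feature_names, model.feature_importances_):
    by_region[name.split("_")[0]] += imp        # sum importances within each region

for region in regions:
    print(f"{region:>9}: total importance = {by_region[region]:.2f}")
```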
The promoter has a bit more information than the enhancer, but the window in between is
actually where the most information was, the most predictive information. So then I thought,
well maybe this was because there were just more proteins binding there, more signal.
But actually, this higher predictive accuracy is despite the fact that the signal
— which I’m just plotting here as the sort of density of peaks — is actually lower on
the looping chromatin. So there’s not a lot going on there, but what is going on there
is super important. So what is it? What’s binding there and
not binding there? What’s happening to the DNA? As I alluded to, this is a complicated
mixtures of things. There’s not a simple signature, but it’s a very consistent story
when we look at the sorts of things. So if an enhancer and a promoter are looping with
each other and we look not just adjacent to the promoter and the enhancer, but on the
intervening window, we see other enhancers; this makes sense in the sense of super enhancers
or because we know that enhancers tend to cluster together. So not right next to, but
nearby, the enhancer are often other enhancers, so this is a very active region, but we might
have expected that. But what I wasn’t necessarily expecting was that the loop has a lot of marks,
epigenetic marks and DNA methylation, et cetera: marks of heterochromatin. So this is first
of all telling you that maybe an intervening gene is not the target of that enhancer, that
it’s repressed. But in some cases there are actually little windows that aren’t
heterochromatinized that have active genes in them, but in between there’s this heterochromatin.
So there may actually be something physical or structural going on where it’s helpful
to compact the chromatin and bring the enhancer and promoter closer together. The biophysical
modeling literature has some sort of spring models and some other theories about heterochromatin
and how it helps these sorts of interactions that we’ve been reading about. So finally I said there were some active promoters
or some quote “active promoters” in the window, but frequently they are kind of false
signals because what we actually see when we look at the gene bodies of those genes
is it seems like the polymerase while loaded up at the promoters is not actively elongating
and making transcript. Now what about the false interactions, the cases where a promoter
and enhancer don’t interact? The window in between often has the cohesin complex on
it, including the zinc finger protein ZNF143 that we heard about from Mathieu this morning,
suggesting that there is a chromatin loop, a pinching-off by the cohesin complex of a real
chromatin interaction, but it's with an intervening promoter, not the one that you're considering.
And there is some evidence that these loops can actually act as insulators, as well. So
it's giving information that there may be a different target gene, and there may also
be a physical structure that prevents looping to a promoter further downstream.
And then, mirroring what we saw before, we saw marks of open chromatin and elongation,
active promoters and active gene bodies. Importantly, the meaning of these different
features was different depending on whether we saw them at a promoter, at an enhancer,
or in the window. So the cohesin complex, which is a negative predictor of an interaction
when it occurs in the window, is actually a positive predictor of an interaction if it's
flanking the enhancer and promoter, as we heard from Mathieu this morning. So it's
important to actually split up the
feature in terms of these different regions and to keep in mind that a protein can serve
a different function depending on where you see it physically along the DNA. We saw this
not just for cohesin, but for a number of other proteins, too. So what we started to see,
with cohesin as an example of this, is that there seemed to be complexes forming on
this looping chromatin, that we would see several factors co-occurring or being co-predictive.
So we actually just looked genome-wide and looked at the co-location. We made a map of
the co-location of the predictive features. So here are some of the top features for the
K562 cell line. And if there’s a dark color in this heat map, it means that they actually
on this looping chromatin occur at flanking or overlapping positions. And so here is the
cohesin complex and those proteins are co-located — co-localized, as you would expect. But
it’s not just known complexes. When we form a network out of these co-localizations we
see some interactions or co-occurrences of different features that weren’t previous
known. So an orange is our co-localization data, purple are known protein/protein interactions.
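Here is a small sketch of the kind of co-localization calculation implied by such a map: for
two sets of peak positions on the same chromosome, compute how often a peak of one factor
falls within a flanking distance of a peak of the other. The peak positions and the 1 kb
cutoff below are made up for illustration.

```python
# Illustrative sketch: how often do peaks of factor A fall within a flanking
# distance of peaks of factor B on the same chromosome? (Positions are made up.)
import numpy as np

def colocalized_fraction(peaks_a, peaks_b, flank=1000):
    """Fraction of A peaks with at least one B peak within `flank` bp."""
    peaks_b = np.sort(np.asarray(peaks_b))
    peaks_a = np.asarray(peaks_a)
    # Index of the nearest B peak at or to the right of each A peak.
    right = np.searchsorted(peaks_b, peaks_a)
    left = np.clip(right - 1, 0, len(peaks_b) - 1)
    right = np.clip(right, 0, len(peaks_b) - 1)
    nearest = np.minimum(np.abs(peaks_b[left] - peaks_a),
                         np.abs(peaks_b[right] - peaks_a))
    return float(np.mean(nearest <= flank))

rng = np.random.default_rng(5)
rad21_peaks = np.sort(rng.integers(0, 50_000_000, size=3000))
smc3_peaks = rad21_peaks + rng.integers(-500, 500, size=3000)  # mostly co-located
unrelated_peaks = rng.integers(0, 50_000_000, size=3000)

print("RAD21 near SMC3       :", colocalized_fraction(rad21_peaks, smc3_peaks))
print("RAD21 near random set :", colocalized_fraction(rad21_peaks, unrelated_peaks))
```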
And so, this suggests some potentially cooperative or interacting roles of some of these different
features that could be tested. And certainly from the perspective of prediction, we need
many of these variables in the model. No one of them alone is predictive. We need the combination
of something co-occurring or not co-occurring with something else. So the big question for us and various collaborators
and certainly for studying human accelerated regions and their role in human development,
would be: can we do this outside of the ENCODE cell lines? Could we do this without the rich
feature set? We put hundreds and hundreds of datasets into the machine learning algorithm;
could we have done that with fewer datasets? So first we assumed that we still had some
good training examples, some validated interactions and non-interactions to train on, and
just asked, well, what if ENCODE had only ChIPped five transcription factors, or 10, or 15,
or 20? How is the prediction accuracy affected? What's a minimal set of experiments? And,
pleasingly, we found that performance was totally flat down to as few as 16 datasets and
still near optimal with as few as eight.
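A minimal sketch of this kind of "how few datasets do we need" experiment, using simulated
data: rank features by importance and re-score the model using only the top k. The data,
feature counts, and thresholds are assumptions, not the published analysis.

```python
# Illustrative sketch (simulated data): how does cross-validated accuracy change
# as we keep only the top-k most important features?
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
n, p, informative = 2000, 40, 8
X = rng.normal(size=(n, p))
# Only the first `informative` features carry signal in this simulation.
logits = X[:, :informative] @ rng.normal(size=informative)
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

model = GradientBoostingClassifier(n_estimators=100, random_state=0)
ranking = np.argsort(model.fit(X, y).feature_importances_)[::-1]

# (In a rigorous version the ranking itself would be redone inside each CV fold.)
for k in (2, 4, 8, 16, p):
    cols = ranking[:k]
    auc = cross_val_score(model, X[:, cols], y, cv=5, scoring="roc_auc").mean()
    print(f"top {k:>2} features: AUC = {auc:.2f}")
```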
So you can't just use one or two features, as I mentioned; it's a complex combination of
things going on. And if you look across examples in the genome, take, say, the cohesin
complex and the non-interactions: many of them have cohesin on the looping chromatin, but
not all of them. But the ones that don't might have some other feature or a different
epigenetic mark.
And so you needed several of these features and it’s not a random eight, but there are
a good number of different sets of eight that give near-optimal performance. They are not
the same features you would use for predicting promoters and enhancers, however, in most
cases. There’s slightly different ones, but this does give some hope for moving into
other cell lines that you wouldn’t need the time and budget and team of an ENCODE
project. Now what if I didn’t have that high resolution,
Hi-C produced at the Broad for millions of dollars? So I know here that I can get
away with fewer features. Could I get away with less or no training data? So the worst
case scenario would be no training data. Let’s say that I built the model on an ENCODE cell
line and then I plugged in the ChIP-seq from my cardiomyocytes: can I make predictions?
Is the model the same across different cell types? So we tested that amongst the ENCODE
cell lines, which are from totally different lineages, and so it is sort of a worst-case
scenario to see if a model trained on one cell line could
predict on another. And we heard a little bit along these lines from Tyler last night.
He also went across species. So this is the measure F-max, a summary of predictive
accuracy: the harmonic mean of precision and recall, maximized over score thresholds. And
I already told you these numbers, where we had a good balance of precision and recall when
you train and test on data held out from the same cell line. And you can see performance
does degrade when you go to a different cell line. So it's very helpful to have some training examples.
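To make the F-max idea and the cross-cell-line test concrete, here is an illustrative sketch
with two simulated "cell lines": train on one, score on the other, and take the maximum F1
along the precision-recall curve. Everything below is a stand-in for the real data.

```python
# Illustrative sketch (simulated data): train on cell line A, test on cell line B,
# and summarize performance with F-max, the maximum F1 along the PR curve.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_recall_curve

def f_max(y_true, scores):
    precision, recall, _ = precision_recall_curve(y_true, scores)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return float(f1.max())

rng = np.random.default_rng(7)

def make_cell_line(n, shift):
    X = rng.normal(size=(n, 20)) + shift   # shift mimics cell-type differences
    y = (rng.random(n) < 1 / (1 + np.exp(-X[:, :5].sum(axis=1)))).astype(int)
    return X, y

X_a, y_a = make_cell_line(3000, shift=0.0)   # "training" cell line
X_b, y_b = make_cell_line(3000, shift=0.5)   # "new" cell line

model = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_a, y_a)
scores_b = model.predict_proba(X_b)[:, 1]
print("cross-cell-line F-max:", round(f_max(y_b, scores_b), 2))
```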
It doesn’t need to be genome-wide, high-resolution, Hi-C, but some good, unbiased training examples
in your given cell type. And we’re testing now if some Chia Pet, for example, might achieve
this, which would be a less expensive experiment. But there — this is not horrible. This is
still decent. So we basically expect about 35 percent precision and 55 percent recall
on a new cell line with only 10 ChIP-seq datasets and no training data. So that’s kind of
a worst case scenario. We think if you use a more closely-related cell line that it will
be better than this. And that if you have a little bit of training data or a few more
features, you can improve. So I thought this was a good place to start from. So, to summarize this TargetFinder project,
our problem was to predict these interactions from things that are marking the DNA. It improved
significantly upon using the closest gene, which is frequently wrong and makes many false
positive predictions. The summary of our performance is that we can get more than 90 percent of
known pairs at a low false positive rate and that if you did this on a different cell line
with less data, it could be maybe as bad as 55 percent. So that’s a worst case scenario
with very little data. And the great thing — the most important thing probably is that
the false positive rate is really low. So our precision was high, our false discovery
rate was very low. So in the last couple minutes, I just want
to mention: how do you find the enhancers? Because we've also used machine learning to
work on this problem. It's published work, and so I'll just briefly summarize
it. But some of these same machine learning techniques have been helpful there. So which
sequences function at these long-range enhancers? We’ve been particularly interested in development
because the bioinformatics tell us that many of the human accelerated regions function
in development. And we also have a number of collaborations in heart and brain development
at the Gladstone Institutes where I work. The other reason to think about development
is that there are many validated examples of enhancers, for example, from the VISTA
enhancer browser, and we'll hear from Len Pennacchio about that, I believe, in his talk tomorrow.
So these are pictures of mouse embryos where a candidate enhancer has been transiently
transfected into the single-cell embryo. And you can see staining in the tissues and at
the time points during development when that enhancer functions. It turns on a reporter
gene or it doesn’t. It was tested and there was no reporter gene expression. And so this
is a great proving ground then, or a good training data I should say, for doing a supervised
learning. So we again used genomic features: evolutionary conservation, in this case of
the sequence itself; functional genomics data, again, at the potential enhancer location;
and sequence motifs, meaning known binding sites, or position-specific weight matrices and
predicted binding sites, as well as simply enumerating all k-mers as a way to get at binding
sites for transcription factors that don't have a good motif model.
In this case, all three types of data were predictive. They did not predict identical sets; they predicted partially overlapping
sets of enhancers, but each predicted some enhancers that the other one did not. And
so it was helpful to combine them in a model. And the model that included all three types
of data was the best performing by far. So this was a little different from the chromatin
looping predictions, where we really got most of our power from the functional genomics
data. Here we used support vector machines, specifically a variant called multiple kernel
learning, which allows you to build a separate kernel, or predictor, for each type of data
and then take a weighted linear combination of those for the overall predictor (a sketch of
this idea follows the performance summary below). This was helpful because we knew we needed
all three types, and they're not on the same scale; they're very different types of data,
and so it can be hard to put them into a model together on a comparable or regularized scale.
But I think there may be room for improvement trying other algorithms here; we didn't do a
lot of experimenting with different algorithms. So briefly to summarize, here's our performance.
Again, we saw a very pleasing area under this curve, high power, a fairly low false positive
rate. And importantly, this significantly improves upon using a few different ChIP-seq
datasets. So in red, blue, and green are some of the typical enhancer marks, H3K4
monomethylation and H3K27 acetylation, as well as binding of the transcriptional coactivator p300. Each
of those by itself is somewhat predictive. If we combine them, taking the union of
all of them, we get a pretty high power, but an exceedingly high false positive rate. So
there was room for improving upon just doing some intersections and unions of datasets.
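Returning to the kernel-combination idea mentioned above: full multiple kernel learning also
learns the kernel weights, but a minimal illustration of the spirit of the approach is to
build one kernel per data type and hand a fixed weighted sum to an SVM as a precomputed
kernel. The data, the weights, and the feature blocks below are invented for the example.

```python
# Illustrative sketch of the kernel-combination idea behind multiple kernel
# learning: one RBF kernel per data type, combined with fixed weights and used
# as a precomputed kernel in an SVM. (Real MKL would also learn the weights.)
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(8)
n = 600
X_cons = rng.normal(size=(n, 5))       # hypothetical conservation features
X_chip = rng.normal(size=(n, 30))      # hypothetical functional genomics features
X_kmer = rng.poisson(1.0, size=(n, 50)).astype(float)  # hypothetical k-mer counts
y = (rng.random(n) < 1 / (1 + np.exp(-X_chip[:, :3].sum(axis=1)))).astype(int)

train, test = np.arange(0, 400), np.arange(400, n)
weights = {"cons": 0.2, "chip": 0.5, "kmer": 0.3}   # fixed here, not learned

def combined_kernel(rows, cols):
    return (weights["cons"] * rbf_kernel(X_cons[rows], X_cons[cols]) +
            weights["chip"] * rbf_kernel(X_chip[rows], X_chip[cols]) +
            weights["kmer"] * rbf_kernel(X_kmer[rows], X_kmer[cols]))

svm = SVC(kernel="precomputed").fit(combined_kernel(train, train), y[train])
accuracy = svm.score(combined_kernel(test, train), y[test])
print("held-out accuracy with the combined kernel:", round(accuracy, 2))
```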
And we think this is essentially the benefit of having some training data and using a
machine learning framework: you can improve upon these sorts of simple bioinformatic
combinations. Here the false discovery rate wasn't quite as good as in the looping work,
which is interesting because I actually thought this was an easier problem. But still we had
a pretty good recall at a pretty high precision. So we made predictions across the human genome.
This is for all of development, any tissue, about 84,000 predictions. They had many of
the bioinformatic features of enhancers. Importantly to me, we predicted, somewhat conservatively,
that about a third of human accelerated regions were active during development. I’m not
going to show you the results, but we did some of those mouse experiments ourselves.
Of the 30 that we tested at just one developmental stage, embryonic day 11.5, 25 were
active enhancers in vivo, and we think now that several others are active at other, later
time points. And another piece of data supporting
these predictions was work from Adam Siepel’s lab looking at fitness effects of the positions
across the human genome, the fitCons scores. And even though their scores were based on
ENCODE cell lines while ours were trained on embryos, I was really surprised
to see that we were actually doing almost as well as their method at predicting these
sites with fitness effects. I don’t understand exactly why because it’s totally different
cell types, but I thought that was interesting. So to conclude just where we’re going from
here: these mouse experiments are expensive. They're low throughput; we can just test
one enhancer at a time. But as many of you might be aware, this experiment can now be
done in a high-throughput manner by taking this vector and putting a barcode downstream
(or, as another variant, putting the enhancer itself downstream). And that way
for every enhancer that you’re testing, you have a transcribed sequence that tells
you that that enhancer is working. And therefore you can assay the activity of many different
enhancers by RNA-seq. And it’s possible now to synthesize thousands of these, clone
them, and then, in cell lines at least, to screen in the same cells thousands of
enhancers and specific mutations in enhancers
in parallel. And we are doing this in cell lines that are derived from induced pluripotent
stem cells. My office is right next to Shinya Yamanaka, who’s been referenced several
times today. I feel really honored to work with him and
Bruce Conklin and others at Gladstone who are real whizzes at reprogramming cells and
then differentiating them into different cells like neurons, and here, beating heart cells
in the dish that have a lot of the characteristics of the original tissues. And this is fantastic
especially for human-chimp comparisons, because we can never get the tissues and cell lines
to do direct comparisons with an ape, for various ethical reasons, even if we were able
to obtain, say, human embryos. Here we avoid those issues completely by reprogramming skin
cells and differentiating them into various developmental cell types. So here's our approach:
the computational work I've talked about today, the screening in iPS-derived cells, and then
we still have to go back to animal models for real functional studies. I'll end there, thank our collaborators,
especially Sean Whalen, who led the work on TargetFinder, and our funding sources. And
I’m happy to take questions. [applause] Female Speaker:
Hi. Do you think, with the looping studies, that the synteny would have been more predictive
if you’d used species further apart? Katherine Pollard:
We looked across all of mammalian evolution. If you go much further out, very few of the
enhancers are conserved, and so it becomes really difficult to do that. So we looked about
as far as we could in terms of being able to find a homologous promoter and enhancer. There
was some signal there, but not as much as we had thought there would be. Female Speaker:
And was the signal that was actually there, were those mostly developmental genes? Or
do you know? Katherine Pollard:
Yes, actually there is more conserved synteny in developmental loci, yeah. Male Speaker:
Great talk. Katherine Pollard:
Thank you. Male Speaker:
Can you tell me anything about the resolution you end up with for your predictions, compared
with the Hi-C? Because for the Hi-C it's like 1 kb, 4 kb, and there are often multiple
enhancers at adjacent sites. Katherine Pollard:
We achieve a resolution of a single promoter and a single enhancer, so a kb or less. Male Speaker:
Single — so you’re able to parse out within a given block called by Hi-C, those that are
most likely to be — Katherine Pollard:
Yes — Male Speaker:
[unintelligible] Katherine Pollard:
— that’s precisely it. Yeah, we can get below — like a regular Hi-C experiment might
be like 25 kb. We could — by using the ChIP-seq — Male Speaker:
Yeah. Katherine Pollard:
— peaks we can resolve it down to a single promoter and enhancer in silico. Female Speaker:
I was wondering if you could comment on how well your enhancer finder works in terms of
enhancers acting in a tissue- or cell-specific fashion. Katherine Pollard:
Yeah, how cell-type specific is it? So we tried — besides just predicting any tissue
in the developing embryo, we tried to then go on and predict the tissue. And that’s
a harder problem. The AUC is more like 60, 70 percent. It depends on the tissue, so heart
enhancers were very easy to predict. They had a specific GC content and a low evolutionary
conservation, interestingly, and some very specific motifs. Some of the other tissues
like limb or brain were a bit harder. And I think partially that’s because we didn’t
have quite the right ChIP-seq data. I should have emphasized we’re predicting in the
developing embryo and we’re using ENCODE, an epigenetic Roadmap and about everything
we could get our hands on basically, everything that’s ever been deposited, very little
of which is from a developmental cell type. So we looked at heart development because
we do have collaborators who are studying differentiation into cardiomyocytes. And prediction
did improve a little bit at getting specifically the embryonic heart by putting in the
datasets from the iPS-derived cardiomyocytes. But it still wasn't quite as good as just
the overall call of, okay, it's a developmental
enhancer. So I think there’s some room to still improve on tissue specificity. Male Speaker:
That was a great talk. Katherine Pollard:
Thank you. Male Speaker:
I have a question about the target finder. Katherine Pollard:
Yes. Male Speaker:
Could you talk about what kind of cross-validation you did? Katherine Pollard:
Yeah, cross-validation was incredibly important because we wouldn’t want to over-fit, and
we needed some measure of performance. And so we tried a number of things but the results
— obviously the AUC curves and the precision and recall values I was reporting were from
repeated 10-fold cross-validation. And it's ensemble learning, so within each step in the
random forest we're performing that, so it's very computationally intensive. But that makes
sure that there's no bleeding from the training data into the test data. It's very important
that you do that right, that you aren't, within your ensemble, having an example sometimes
on the training side and sometimes on the test side. That can give a very rosy but inaccurate
measure of performance. So, yeah, those were the cross-validation
error rates. Does that answer your question? Male Speaker:
Thank you. Katherine Pollard:
Yeah, great. Yeah, so maybe lunchtime. [applause] [end of transcript]
