Testimony to FDA About Dendreon's Provenge, March 29, 2007
Bo-Guang Zhen, Ph.D.
Division of Biostatistics, FDA
DR. ZHEN: Good morning. My name is Bo Zhen. I am a statistical reviewer for FDA, and I am going to present the statistical review and findings. First, I will give a quick review of the efficacy results, then bring up the issues in the survival analysis and the limitations of using post hoc analysis results. Then I will describe the challenges we face for this BLA from a statistical standpoint.
Here is the quick review. Data from two Phase III studies were submitted to support the license application; I call them Study 1 and Study 2. Both studies failed to meet the primary endpoint and also failed to demonstrate statistical significance for the other pre-specified endpoints. The key efficacy evidence was based on the difference in overall survival between the two arms, so the focus of this talk will be on survival.
Here is the review of the survival analysis. The sample size is relatively small for Study 1 and Study 2. The difference in median survival between the two arms is 4.5 months for Study 1 and 3.3 months for Study 2. However, there is a high level of variation: as you can see, the confidence intervals for median survival in the two arms overlap, and the lower bound of the confidence interval for the hazard ratio is 1.13, which is quite close to 1; one means there is no difference between the two groups. Also, the survival experience differs considerably between the two studies: the median survival of placebo patients in Study 1, 21.4 months, is comparable to the median survival of treated patients in Study 2. This difference could be due to differences in baseline characteristics between the two studies, or to variation, because the sample size is relatively small for both studies.
This slide shows some of the sensitivity analyses for Study 1. P equals 0.01 from the log-rank test, and this p-value was reduced to 0.002 using a Cox regression model after adjusting for a set of covariates. However, there are many ways to use a Cox regression model: you can select different sets of covariates, and you can also pick a different scale for a covariate. For example, you could use either the original scale or the log scale for PSA and for the number of bone metastases. As you can see, different models yield different hazard ratios and p-values. This model gives a p-value of 0.002, which could be one of the best-case scenarios; this one gives a p-value of 0.078, which is not statistically significant and could be one of the worst-case scenarios; and this one is 0.048. Another critical issue in using the Cox model is excluding patients from the model because of missing covariate data. For this model, 10 patients were excluded, and the next slide will show how bias can be introduced by excluding patients from the model.
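The point about the many possible Cox-model specifications can be made concrete with a small count. This is a hypothetical enumeration, not the actual model set from the review; PSA and bone metastases are mentioned in the testimony, while LDH and hemoglobin are assumed covariate names added only for illustration.

```python
from itertools import product

# Hypothetical candidate covariates (LDH and hemoglobin are illustrative).
covariates = ["PSA", "bone_metastases", "LDH", "hemoglobin"]

# Suppose each covariate can be omitted, entered on its original scale,
# or entered on a log scale. Every combination is a distinct model.
choices = ["omit", "raw", "log"]
specifications = list(product(choices, repeat=len(covariates)))

n_models = len(specifications)  # 3**4 = 81 distinct model specifications
```

With dozens of defensible specifications available, reporting only the most favorable p-value rather than the result of a single prespecified model can overstate the evidence.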
This slide shows that the sipuleucel-T-treated patients who were excluded from the model had a median survival of 19.4 months, compared to the rest of the treated patients in the model. In contrast, the placebo-treated patients excluded from the model had a median survival of 22.1 months, compared to the rest of the placebo-treated patients. This is how bias could make the p-value look smaller and make the treatment effect look better than it should.
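The exclusion bias described here can be sketched numerically. The survival times below are invented for illustration only; they merely echo the direction of the medians cited in the testimony (excluded treated patients living shorter, excluded placebo patients living longer, than their analyzed counterparts).

```python
from statistics import median

# Hypothetical survival times in months (illustrative, not trial data).
treated_in_model = [25.0, 25.9, 26.5, 28.0, 30.1]  # kept in the Cox model
treated_excluded = [18.0, 19.4, 20.8]              # dropped: missing covariates
placebo_in_model = [16.2, 17.0, 17.8, 18.5, 19.9]
placebo_excluded = [21.2, 22.1, 23.0]

# Medians of the analyzed subsets vs. the full groups.
treated_analyzed = median(treated_in_model)
treated_full = median(treated_in_model + treated_excluded)
placebo_analyzed = median(placebo_in_model)
placebo_full = median(placebo_in_model + placebo_excluded)

# Dropping short-lived treated patients and long-lived placebo patients
# widens the apparent between-arm gap relative to the full data.
gap_analyzed = treated_analyzed - placebo_analyzed
gap_full = treated_full - placebo_full
```

In this toy example the gap between arms in the analyzed subset (8.7 months) exceeds the gap in the full data (6.25 months), which is exactly the direction of bias that flatters the treatment.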
Here is the summary for Study 1. Exclusion of patients due to missing covariate data could lead to a biased estimate, and this bias could go in either direction: it could increase or decrease the estimated treatment effect. Although the p-values for treatment effect were greater than 0.05 in a few sensitivity analyses, the majority of the sensitivity analyses resulted in p-values of less than 0.05. So the sensitivity analyses supported the "statistically significant" findings for overall survival in Study 1. However, I used quotation marks here, meaning the so-called statistical significance is a p-value of less than 0.05 without adjustment for multiple comparisons. I will have more discussion of this later.
For Study 2, p equals 0.331 based on the log-rank test. Excluding patients in the Cox model could also lead to a biased estimate. Hypothesis tests for treatment effect in the Cox model resulted in p-values ranging from 0.023 to 0.642; however, in most analyses p is greater than 0.05, so the sensitivity analyses did not support "statistically significant" findings for Study 2. I also used quotation marks here. This graph summarizes the efficacy survival results. Some of you would like to look at the results on the log scale, but I used the arithmetic scale in order to be consistent with the other presentations.
So the sensitivity analyses supported the statistically significant findings for Study 1, but not for Study 2.
So it seems the difference in Study 1 is real. However, is this difference statistically significant? In other words, is this difference due to the treatment effect, or to chance alone? Here are the issues in the survival analysis. Overall survival as an endpoint was not defined in either study protocol, and a statistical analysis method for the primary comparison of overall survival was not prespecified. Because of these two reasons, the alpha level, which is the probability of making a false positive claim for treatment effect, was not allocated to a primary test for overall survival. We call this a post hoc analysis, and post hoc analysis makes it difficult to interpret the hypothesis test results.
To understand the limitations of post hoc analysis, we should first know what a well pre-specified analysis is. For this type of analysis, it is essential to: first, define the endpoint clearly; second, describe the statistical analysis methods and, if there is more than one method, state which one will be used for the primary comparison; and third, set the alpha level, which in general is 0.05 and is sometimes called the statistical significance level, and allocate the alpha level to each test if multiplicity adjustment is needed. Then one is able to say whether the difference is statistically significant based on the p-value from the primary comparison. Otherwise, it is difficult to interpret the p-values.
This slide has nothing to do with the submission, but it illustrates important statistical concepts. I use hypothetical cases to show the interpretation of p-values in studies with pre-specified analyses; hopefully, through these hypothetical cases, you will understand how difficult it is to interpret the p-value from a post hoc analysis. Three different designs are presented here. In Trial A, there is only one primary endpoint but three primary comparisons, two interim and one final. In order to control the alpha level, that is, the probability of making a false positive claim for treatment effect, at the 0.05 level, we need to split this level into several parts; this is one way to split it. If these are the p-values obtained from the hypothesis tests, they are not statistically significant, even though one of them is 0.01, because each is greater than its corresponding allocated value. Trials B and C have two primary endpoints with one primary comparison for each endpoint, and this is how they split the alpha level. If these are the p-values from the hypothesis tests, these trials are also not statistically significant. Therefore, if you want to control the probability of making a false positive claim for treatment effect at the 0.05 level, all of these trials should be considered failures.
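The alpha-splitting logic on the slide can be sketched as follows. The particular split (0.005, 0.005, 0.04) and the p-values below are hypothetical, chosen only to mirror the slide's lesson that even p = 0.01 can fail against its allocated share of alpha.

```python
# A minimal sketch of pre-specified alpha allocation (hypothetical numbers).
overall_alpha = 0.05

# One endpoint, three comparisons: two interim looks and one final analysis.
allocated_alphas = {"interim_1": 0.005, "interim_2": 0.005, "final": 0.04}
observed_pvalues = {"interim_1": 0.01, "interim_2": 0.20, "final": 0.06}

# The allocation must respect the overall alpha budget
# (small tolerance for floating-point rounding).
assert sum(allocated_alphas.values()) <= overall_alpha + 1e-12

# A comparison "wins" only if its p-value beats its own allocated alpha,
# not the nominal 0.05. Here p = 0.01 fails because its share is 0.005.
wins = {k: observed_pvalues[k] < allocated_alphas[k] for k in allocated_alphas}
trial_significant = any(wins.values())
```

Under this allocation no comparison succeeds, so the trial as designed is not significant, even though one observed p-value (0.01) is well below the nominal 0.05.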
The previous slide shows that obtaining a p-value of 0.01, or any p-value less than 0.05, may not always be considered statistically significant in a well pre-specified analysis. When a study fails to meet its primary endpoints, there is no alpha left for analyses of other endpoints. Literally, from a purely statistical point of view, the differences in other endpoints should not be considered statistically significant. Therefore, it is very difficult to interpret the hypothesis test result for overall survival in Study 1.
Because in a post hoc analysis one could keep conducting hypothesis tests for treatment effect on different endpoints, and/or on the same endpoint using different analysis methods, just as I showed you with the Cox regression models for Study 1, where different methods yield different p-values and hazard ratios, it is very easy to obtain a so-called statistically significant result even when there is no treatment effect. So if overall survival were one of many unspecified endpoints under testing, it would be very possible that a p-value of 0.01 was observed just by chance. However, survival is not one of many, many endpoints that can be randomly selected for testing; survival is a preferred endpoint for cancer trials. As Dendreon and Dr. Liu just mentioned, this endpoint is reliable and clinically meaningful. This is why we are here seeking advice from the advisory committee.
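The "many tests, small p-value by chance" point has a simple closed form. This is a generic statistical identity, not a calculation from the review: with k independent tests of true null hypotheses, the chance that at least one p-value falls below a threshold alpha is 1 - (1 - alpha)**k.

```python
def chance_of_at_least_one(k: int, alpha: float) -> float:
    """Probability that k independent tests of true null hypotheses
    produce at least one p-value below alpha."""
    return 1 - (1 - alpha) ** k

# Twenty unspecified endpoints or model variants, all with no true effect:
p_below_05 = chance_of_at_least_one(20, 0.05)  # roughly 0.64
p_below_01 = chance_of_at_least_one(20, 0.01)  # roughly 0.18
```

So across twenty post hoc looks at null data, a p-value as small as 0.01 appears somewhere almost one time in five, which is why an unallocated p = 0.01 is hard to interpret on its own.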
But here are the challenges in the survival analysis. Since the analysis was post hoc, it is difficult to interpret the p-value, which is 0.01 for Study 1. Even if someone judges this 0.01 to be statistically significant, that statistical significance is demonstrated only in Study 1, though there is a trend in Study 2. And the lower bound of the 95 percent confidence interval for the hazard ratio is 1.13, quite close to 1, so these results also may not be that robust. That is the end of my talk.
Thank you.
DR. MULÉ: Thanks, Dr. Zhen.