Chapter 21
IN THIS CHAPTER
Beginning with the basics of survival data
Generating life tables and trying the Kaplan-Meier method
Applying some handy guidelines for survival analysis
Using survival data for even more calculations
This chapter describes statistical techniques that deal with a special kind of numerical data called survival data or time-to-event data. These data reflect the interval from a particular starting point in time, such as the date a patient receives a certain diagnosis or undergoes a certain procedure, to the first or only occurrence of a particular kind of event that represents an endpoint. Because these techniques are often applied to situations where the endpoint event is death, we usually call the use of these techniques survival analysis, even when the endpoint is something less drastic (or final) than death. The endpoint can be undesirable, such as the relapse of a chronic illness symptom after its resolution, but it can also be desirable, such as remission of cancer or recovery from an acute condition. Throughout this chapter, we use terms and examples that imply that the endpoint is death, such as saying survival time instead of time to event. However, everything we say also applies to other kinds of endpoints.
You may wonder why you need a special kind of analysis for survival data in the first place. Why not just treat survival times as ordinary numerical variables? Why not summarize them as means, medians, standard deviations, and so on, and graph them as histograms and box-and-whiskers charts? Why not compare survival times between groups with t tests and ANOVAs? Why not use ordinary least-squares regression to explore how various factors influence survival time?
In this chapter, we explain how survival data aren’t like ordinary numerical data and why you need to use specific techniques to analyze them properly. We describe two ways to construct survival curves: the life-table and the Kaplan-Meier methods. We guide you in preparing and interpreting survival curves and show you how to glean useful information from these curves, such as median survival time and five-year survival rates.
To understand survival analysis, you first have to understand survival data. Survival times are intervals between a designated starting time point and the time point an event occurs. These intervals can have a specific type of missing data due to a phenomenon called censoring. Because survival data usually include censored data, they must be analyzed in a very specific way to avoid generating biased estimates that lead to incorrect conclusions.
The techniques described in this chapter for summarizing, graphing, and comparing survival data deal with the time interval from a defined starting point to the first occurrence of an endpoint event. The event can be designated as death or a relapse of a particular condition, such as a recurrence of cancer. Or you could designate the event to be surgical removal (called an explant) of a failed mechanical component, such as an artificial heart valve. If a patient’s heart valve was implanted on January 10 (beginning of time interval), but their body rejected it and the explant took place on January 30 (time of event), then the time interval from implant to explant is 30 – 10, or 20 days.
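The date arithmetic in the heart-valve example can be sketched in Python. This is only a minimal illustration: the function name days_between is hypothetical, and the year 2023 is an assumption, because the example gives only the month and day.

```python
from datetime import date


def days_between(start: date, end: date) -> int:
    """Return the number of days from the starting point to the event date."""
    if end < start:
        raise ValueError("the event date cannot precede the starting point")
    return (end - start).days


# The year is an assumption; the text gives only January 10 and January 30.
implant_date = date(2023, 1, 10)   # beginning of the time interval
explant_date = date(2023, 1, 30)   # time of the event

print(days_between(implant_date, explant_date))  # 20
```

Working from full calendar dates rather than day-of-month numbers keeps the subtraction correct when the interval crosses a month or year boundary.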
A person can die only once, so survival analysis can obviously be used for one-time events. But other endpoints can occur multiple times, such as having a stroke or having cancer go into remission. The techniques we describe in this chapter only analyze time to the first occurrence of the event. More advanced survival analysis methods are needed for models that can handle multiple occurrences of an event, and these are beyond the scope of this book.
If non-normality were the only problem with survival data, you’d be able to summarize survival times as medians and centiles instead of means and standard deviations. Also, you could compare survival between groups with nonparametric Mann-Whitney and Kruskal-Wallis tests instead of t tests and ANOVAs. But time-to-event data are susceptible to a specific type of missingness called censoring. Typical parametric and nonparametric regression methods are not equipped to deal with censoring, so we present survival analysis techniques in this chapter.
Survival data are defined as the time interval between a selected starting point and an endpoint that represents an event. But unfortunately, the time the event takes place can be missing in survival data. This can happen in two general ways:
You can describe these two situations in one general way. You know that every participant in the study either died on a certain date (in which case they have the event), or was alive up to some last-seen date when they stopped being observed, in which case they are censored.
Figure 21-1 shows the results of a small study of survival in cancer patients after a surgical procedure to remove a tumor. Ten patients were recruited to participate in the study and were enrolled at the time of their surgery. The recruitment period went from Jan. 1, 2010, to the end of Dec. 31, 2011 (meaning a two-year enrollment period). All participants were then followed until they died, or until the conclusion of the study, on Dec. 31, 2016, which added five years of additional observation time after the last enrollment. Each participant has a horizontal timeline that starts on the date of surgery and ends with either the date of death or the censoring date.

© John Wiley & Sons, Inc.
FIGURE 21-1: Survival of ten study participants following surgery for cancer.
In Figure 21-1, observe that each line ends with a code, and there’s a legend at the bottom. Six of the ten participants (#’s 1, 2, 4, 6, 9, and 10, labeled X) died during the course of the follow-up study. Two participants (#5 and #7, labeled L) were lost to follow-up (LFU) at some point during the study, and two participants (#3 and #8, labeled E) were still alive at the end of the study. So this study has four participants (the Ls and the Es) with censored survival times.
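One way to turn the end-of-line codes from Figure 21-1 into a 0/1 event indicator is sketched below. The codes per participant come from the description above; the names figure_codes and event_indicator are hypothetical and chosen only for this illustration.

```python
# End-of-line codes described for Figure 21-1, keyed by participant number.
# X = died, L = lost to follow-up, E = alive at the end of the study.
figure_codes = {1: "X", 2: "X", 3: "E", 4: "X", 5: "L",
                6: "X", 7: "L", 8: "E", 9: "X", 10: "X"}


def event_indicator(code: str) -> int:
    """Return 1 if the code marks a death, 0 if it marks a censored time."""
    if code == "X":
        return 1
    if code in ("L", "E"):
        return 0
    raise ValueError(f"unknown code: {code!r}")


events = {pid: event_indicator(code) for pid, code in figure_codes.items()}
n_died = sum(events.values())
n_censored = len(events) - n_died

print(n_died, n_censored)  # 6 4
```

Both kinds of censoring (L and E) map to the same value, 0, because in either case all you know is that the participant was still alive when last observed.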
So, how do you analyze survival data containing censoring? The following sections explain the correct ways to proceed as well as mistakes to avoid.
The first task when analyzing survival data is usually to describe how the hazard and survival rates vary with time. In this chapter, we show you how to estimate the hazard and survival rates, summarize them as tables, and display them as graphs. Most of the larger statistical packages (such as those described in Chapter 4) allow you to do the calculations we describe automatically, so you may never have to do them manually. But without first understanding how these methods work, it’s almost impossible to understand any other aspect of survival analysis, so we provide a demonstration for instructional purposes.
Exclusion and imputation don’t work to fix the missingness in censored data. You can see why in Figure 21-2, where we’ve slid the timelines for all the participants over to the left as if they all had their surgery on the same date. The time scale shows survival time in years after surgery instead of chronological time.

FIGURE 21-2: Survival times from the date of surgery.
If you exclude all participants who were censored in your analysis, you may be left with analyzable data on too few participants. In this example, only six participants are uncensored, so excluding the four censored participants would weaken the power of the analysis. Worse, it would also bias the results in subtle and unpredictable ways.
Using the last-seen date in place of the death date for a censored observation may seem like a legitimate use of LOCF imputation, but because the participant did not die during the observation period, it is not acceptable. It’s equivalent to assuming that all censored participants died immediately after the last-contact date. But this assumption isn’t reasonable, because it would not be unusual for them to live on many years. This assumption would also bias your results toward artificially shorter survival times.
To estimate survival and hazard rates in a population from a set of observed survival times, some of which are censored, you must combine the information from censored and uncensored observations properly. How is this done? Well, it’s not done by dividing the number of participants alive at a certain time point in the study by the total number of participants in the study, because this fails to account for censored observations.
Instead, think of the observation period in a study as a series of slices of time. Think about how each time a participant survives a slice of time and encounters the next one, they have a certain probability of surviving to the end of that slice and continuing on to encounter the next. The cumulative survival probability can then be obtained by successively multiplying all these individual time-slice survival probabilities together. For example, to survive three years, first the participant has to survive the first slice (Year 1), then survive the second slice (Year 2), and then survive the third slice (Year 3). The probability of surviving all three years is the product of the probabilities of surviving through Year 1, Year 2, and Year 3.
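The three-year example can be sketched in a few lines of Python. The yearly probabilities here are hypothetical values chosen only to show the multiplication; they are not taken from the study.

```python
import math

# Hypothetical yearly survival probabilities (not taken from the study).
slice_survival = [0.90, 0.80, 0.75]   # Year 1, Year 2, Year 3

# Cumulative survival is the product of the individual slice probabilities.
three_year_survival = math.prod(slice_survival)

print(f"{three_year_survival:.2f}")  # 0.54
```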
These calculations can be laid out systematically in a life table, which is also called an actuarial life table because of its early use by insurance companies. The calculations only involve addition, subtraction, multiplication, and division, so they can be done manually. They are easy to set up in a spreadsheet format, and there are many life-table templates available for Microsoft Excel and other spreadsheet programs that you can use.
To create a life table from your survival data, you should first break the entire range of survival times into convenient time slices. These can be months, quarters, or years, depending on the time scale of the event you’re studying. Also, you have to consider the time increments in which you want to report your results. You should arrange to have at least five slices or else your survival and hazard estimates will be too coarse to show any useful features. Having many skinny slices doesn’t disturb the calculations, but the life table will have many rows and may become unwieldy. For the survival times shown in Figure 21-2, a natural choice would be to use seven 1-year time slices.
Next, count how many participants experienced the event during each slice, and how many were censored, meaning they were last observed during this time slice and had not experienced the event. From Figure 21-2, you see that
Continue tabulating deaths and censored times for the fourth through seventh years, and enter these counts into the appropriate cells of a spreadsheet like the one shown in Figure 21-3.

FIGURE 21-3: A partially completed life table to analyze the survival times shown in Figure 21-2.
To fill in the table shown in Figure 21-3:
After you’ve entered all the counts, the spreadsheet will look like Figure 21-3. Then you perform the calculations shown in the Formula row at the top of the figure to generate the numbers in all the other cells of the table. (To see what it looks like when the table is completely filled in, take a sneak peek at Figure 21-4.)
Column B includes the number of participants known to be alive at the start of each year after surgery. This is equal to the number of participants alive at the start of the preceding year minus the number who died (Column C) or were censored (Column D) during the preceding year. Here’s the formula, written in terms of the column letters: B for any year = B – C – D from the preceding year.
Here’s how this process plays out in Figure 21-3:
Column E shows the number of participants at risk for dying during each year. You may guess that this is the number of participants alive at the start of the interval, but there’s one minor correction. If any were censored during that year, then they weren’t technically able to be observed for the entire year. Though they may die that year, if they are censored before then, the study will miss it. What if you don’t know exactly when during that year they became censored? If you don’t have the exact date, you can consider them being observed for half the time period (in this case, 0.5 years). So the number at risk can be estimated as the number alive at the start of the year, minus one-half of the number who became censored during that year, as indicated by the formula for Column E: E = B – D/2. (Note: To simplify the example, we are using years, but you could use months instead if you have exact censoring and death dates in your data to improve the accuracy of your analysis.)
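The formulas for Columns B and E can be sketched as a short Python function. The function name alive_and_at_risk is hypothetical, and the yearly counts are made up for illustration (ten participants, six deaths, four censored); they are not the counts shown in Figure 21-3.

```python
def alive_and_at_risk(n_start, died, censored):
    """Return Columns B (alive at start) and E (at risk) for each time slice."""
    alive_at_start = []
    at_risk = []
    alive = n_start
    for c, d in zip(died, censored):
        alive_at_start.append(alive)   # Column B
        at_risk.append(alive - d / 2)  # Column E = B - D/2
        alive = alive - c - d          # B for the next slice = B - C - D
    return alive_at_start, at_risk


# Made-up counts for seven 1-year slices (not the values in Figure 21-3).
died = [1, 0, 2, 1, 0, 1, 1]        # Column C
censored = [0, 0, 1, 1, 0, 0, 2]    # Column D

column_b, column_e = alive_and_at_risk(10, died, censored)
print(column_b)  # [10, 9, 9, 6, 4, 4, 3]
print(column_e)  # [10.0, 9.0, 8.5, 5.5, 4.0, 4.0, 2.0]
```

Notice that the at-risk count differs from the alive-at-start count only in the slices where someone was censored.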
Here’s how this formula works in Figure 21-3:
Column F shows the Probability of Dying during each interval, assuming the participant has survived up to the start of that interval. To calculate this, divide the Died column by the At Risk column. This represents the fraction of those who were at risk of dying at the beginning of the interval who actually died during the interval. Formula for Column F: F = C/E.
Here’s how this formula works in Figure 21-3:
Column G shows the Probability of Surviving during each interval for participants who have survived up to the start of that interval. Since surviving means not dying, the equation for this column is 1 – Probability of Dying, as indicated by the formula for Column G: G = 1 – F.
Here’s how this formula works out in Figure 21-3:
Column H shows the cumulative probability of surviving from the time of surgery all the way through the end of this time slice. To survive from the start time through the end of any given year (Year N), the participant must survive each of the years from Year 1 through Year N. Because each value in Column G is the probability of surviving a year given survival up to the start of that year, the probability of surviving all N of the years is the product of the individual years’ probabilities. So Column H is a running product of Column G. In other words, the value of Column H for Year N is the product of the first N values in Column G.
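The calculations for Columns F, G, and H can be sketched together in Python. The function name survival_columns is hypothetical, and the Died and At Risk values are made-up numbers for illustration, not the values in Figure 21-3 or Figure 21-4.

```python
def survival_columns(died, at_risk):
    """Return Columns F, G, and H for each time slice."""
    p_die, p_survive, cumulative = [], [], []
    running = 1.0
    for c, e in zip(died, at_risk):
        if e <= 0:
            raise ValueError("the at-risk count must be positive")
        f = c / e        # Column F = C / E
        g = 1 - f        # Column G = 1 - F
        running *= g     # Column H = running product of Column G
        p_die.append(f)
        p_survive.append(g)
        cumulative.append(running)
    return p_die, p_survive, cumulative


# Made-up values for seven 1-year slices (not the values in Figure 21-3).
died = [1, 0, 2, 1, 0, 1, 1]                  # Column C
at_risk = [10, 9, 8.5, 5.5, 4, 4, 2]          # Column E

col_f, col_g, col_h = survival_columns(died, at_risk)
print([round(h, 3) for h in col_h])
# [0.9, 0.9, 0.688, 0.563, 0.563, 0.422, 0.211]
```

In a slice with no deaths, Column G equals 1, so the cumulative survival in Column H stays flat for that slice.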
Here’s how to fill in Figure 21-3 (with the results shown in Figure 21-4):
Figure 21-4 shows the spreadsheet with the results of all the preceding calculations.

FIGURE 21-4: Completed life table to analyze the survival times shown in Figure 21-2.
Graphs of hazard rates and cumulative survival probabilities (Columns F and H from Figure 21-4, respectively) can be prepared from life-table results using Microsoft Excel or another spreadsheet or statistical program with graphing capabilities. Figure 21-5 illustrates the way these results are typically presented.

FIGURE 21-5: Hazard function (a) and survival function (b) results from life-table calculations.
Using very narrow time slices doesn’t hurt life-table calculations. In fact, you can define slices so narrow that each participant’s survival time falls within its own private little slice. Imagine you had N participants. Your life table would have N rows with data from one participant each. You could theoretically add rows for all the remaining time slices. These would not have any data in them, and since empty rows don’t affect the life-table calculations, you could just stick with your life table where each row has one participant’s data. And if you happen to have two or more participants with exactly the same survival or censoring time, it’s okay to put each one in their own row.
The life-table calculations work fine with only one participant per row and produce what’s called Kaplan-Meier (K-M) survival estimates. You can think of the K-M method as a very fine-grained life table. Or, you can see a life table as a grouped K-M calculation.
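The one-participant-per-row idea can be sketched as a small Python function. This is a minimal illustration under stated assumptions, not the worksheet from the figures: the function name kaplan_meier and the five survival times are hypothetical, and the sketch assumes that when a death and a censored time are tied, the death is processed first (a common convention).

```python
def kaplan_meier(times, events):
    """Return (time, cumulative survival) pairs, one per death.

    events[i] is 1 if times[i] is a death time and 0 if it is censored.
    """
    # One participant per row, ordered by time. At tied times, deaths are
    # sorted ahead of censored observations (an assumed convention).
    rows = sorted(zip(times, events), key=lambda row: (row[0], -row[1]))
    at_risk = len(rows)
    survival = 1.0
    curve = []
    for time, event in rows:
        if event == 1:
            survival *= 1 - 1 / at_risk   # one death among those at risk
            curve.append((time, survival))
        at_risk -= 1   # the participant leaves the risk set either way
    return curve


# Hypothetical survival times in years (not the data from the figures).
times = [1.0, 2.0, 2.5, 3.0, 4.0]
events = [1, 0, 1, 1, 0]     # 1 = died, 0 = censored

for time, survival in kaplan_meier(times, events):
    print(time, round(survival, 3))
# 1.0 0.8
# 2.5 0.533
# 3.0 0.267
```

A censored row doesn’t lower the curve, but it does shrink the number at risk, so later deaths produce bigger steps.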
A K-M worksheet for the survival times is shown in Figure 21-6. It is based on the one-participant-per-row idea and is laid out much like the usual life-table worksheet shown in Figure 21-4, but with a few differences in the raw data cells and minor differences in the calculations:

FIGURE 21-6: Kaplan-Meier calculations.
Figure 21-7 shows graphs of the K-M hazard and survival estimates from Figure 21-6. These charts were created using the R statistical software. Most software that performs survival analysis can create graphs similar to this. The K-M survival curve in Figure 21-7b has smaller steps than the life-table survival curve in Figure 21-5b, so it’s more fine-grained. This is because the step curve now decreases at every time point at which a participant died. You can tell from the figures where participant #1 died at 0.74 years, #9 died at 2.27 years, #4 died at 2.34 years, and so on.

FIGURE 21-7: Kaplan-Meier estimates of the hazard (a) and survival (b) functions.
Most of the larger statistical packages (see Chapter 4) can perform life-table and Kaplan-Meier calculations for you and directly generate survival curves. You have to identify two variables for the software: one with the survival time for each participant, and a binary variable coded 1 if the survival time represents time to death or the event, and 0 if it represents censored time. It sounds simple, but it’s surprisingly easy to mess up. Here are some pointers for setting up your data and interpreting the results properly.
Dates and times should be recorded to suitable precision. If your study timeline is years, it’s best to keep track of dates to the day. In a Phase I clinical trial (see Chapter 5), participants may be studied for events that happen in a span of a few days. In those cases, it’s important to record dates and times to the nearest hour or minute. You can even envision laboratory studies of intracellular events where time would have to be recorded with millisecond — or even microsecond — precision!
It can be surprisingly easy to miscode the event status indicator. If the name of the variable is Death, and is coded as 1 if the participant died during the observation period and 0 if they were censored, this seems intuitive. But analysts may want to identify all the censored observations in their data, so they may create a censored indicator named Censored, and code it as 1 if the participant is censored, and 0 if they are not. Because data may be used for different types of survival analyses, there could be other event indicators included in the data as well, also coded as 1 and 0.
The problem is that if you accidentally use your censored indicator instead of your event indicator when running your survival analysis, you will unknowingly flip your analysis, and you won’t get any warning or error message from the program. You’ll only get incorrect results. Worse, depending on how many censored and uncensored observations you have, the survival curve may also not hint at any errors. It may look like a perfectly reasonable survival curve for your data, even though it’s completely wrong.
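Because the software won’t warn you, one possible safeguard is to check the two indicators against each other and count the events before running the analysis. The sketch below is only one way to do this; the function name check_status_coding is hypothetical, and the example values follow the six deaths and four censored participants described for Figure 21-1.

```python
def check_status_coding(event, censored):
    """Check that two 0/1 indicators are complementary.

    Returns (number of events, number censored), or raises ValueError.
    """
    if len(event) != len(censored):
        raise ValueError("indicators must have the same length")
    for i, (e, c) in enumerate(zip(event, censored)):
        if e not in (0, 1) or c not in (0, 1):
            raise ValueError(f"row {i}: indicators must be coded 0 or 1")
        if e + c != 1:
            raise ValueError(f"row {i}: event and censored flags disagree")
    return sum(event), sum(censored)


# Participants 1 through 10, coded from the description of Figure 21-1.
death = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
censored = [1 - d for d in death]

n_events, n_censored = check_status_coding(death, censored)
print(n_events, n_censored)  # 6 4
```

If the printed number of events doesn’t match the number of deaths you know occurred in the study (for example, 4 instead of 6), that is a hint that the censored indicator was passed in place of the event indicator.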