Chapter 8
IN THIS CHAPTER
Understanding levels of measurement (nominal, ordinal, interval, and ratio)
Defining and entering different kinds of data into your research database
Making sure your data are accurate
Creating a data dictionary to describe the data in your database
Before you can analyze data, you have to collect it and get it into the computer in a form that’s suitable for analysis. Chapter 5 describes this process as a series of steps — figuring out what data you need and how they are structured, creating data entry forms and computer files to hold your data, and entering and validating your data.
In this chapter, we describe a crucially important component of that process, which is storing the data properly in your research database. Different kinds of data can be represented in the computer in different ways. At the most basic level, there are numerical values and classifications, and most of us can immediately tell the two apart — you don’t have to be a math genius to recognize “age” as numerical data, and “occupation” as categorical information.
So why are we devoting a whole chapter to describing, entering, and checking different types of data? It turns out that the topic of data storage is not quite as trivial as it may seem at first. You need to be aware of some important details or you may wind up collecting your data the wrong way and finding out too late that you can’t run the appropriate analysis. This chapter starts by explaining the different levels of measurement, and shows you how to define and store different types of data. It also suggests ways to check your data for errors, and explains how to formally describe your database so that others are able to work with it if you’re not available.
Around the middle of the 20th century, the idea of levels of measurement caught the attention of biological and social-science researchers and, in particular, psychologists. One classification scheme, which has become widely used (at least in statistics textbooks), recognizes four levels at which variables can be measured: nominal, ordinal, interval, and ratio:
Although you should be aware of the four levels of measurement described in the preceding section, you also need to be able to classify and deal with data in a more pragmatic way. The following sections describe various common types of data you’re likely to encounter in the course of clinical and other research. We point out some considerations you need to think through before you start collecting your data.
You should also be aware that most software has field-length limitations for text fields. Although commonly used statistical programs like Microsoft Excel, SPSS, SAS, R, and Python may allow for long data fields, this does not excuse you from designing your study so as to limit collection of free-text variables. Flip to Chapter 4 for an introduction to statistical software.
Every participant in your study should have a unique participant study identifier (typically called a study ID). The study ID is present in the participant’s data and is used for identifying the participant on study materials (for example, laboratory specimens sent for analysis). You may need to combine two variables to create a unique identifier. In a single-site study that is carried out at only one geographical location, the study ID can be a whole number that is two- to four-digits long. It doesn’t have to start at 1; it can start at 100 if you want all the ID numbers to be three-digits long without leading zeros. In multi-site studies that are carried out at several locations (such as different clinics or labs), the number often follows some logic. For example, it could have two parts, such as a site number and a local study ID number separated by a hyphen (for example, 03-104), which is where you need two variables to get a unique ID.
You may also want to include separate fields to hold prefixes (Mr., Mrs., Dr., and so on) and suffixes (Jr., III, PhD, and so forth).
Addresses should be stored in separate fields for street, city, state (or province), ZIP code (or comparable postal code).
Setting up your data collection forms and database tables for categorical data requires more thought than you may expect. You may assume you already know how to record and enter categorical data. You just type in the values — such as “United States,” “nurse,” or “Stage I” — right? Wrong! (But wouldn’t it be nice if it were that simple?) The following sections look at some of the issues you have to address when storing categorical values as research data.
The first issue you need to decide is how to code the categories. How are you going to store the values in the research database? Do you want to enter the type of care provider as nurse, physician, or social worker; or as N, P, or SW; or as 1 = nurse, 2 = physician, and 3 = social worker; or in some other manner? Most modern statistical software can analyze categorical data with any of these representations, but it is easiest for the analyst if you code the variables using numbers to represent the categories. Software like SPSS, SAS, and R lets you specify a connection between number and text (for example, attaching a label to 1 to make it display Nurse on statistical output) so you can store categories using a numerical code while also displaying what the code means on statistical output. In general, best practices are to set conventions and be consistent, and make sure the content and meaning of each variable is documented. You can also attach variable labels.
When a categorical variable has more than two levels (like the Type of Caregiver or Likert agreement scale examples we describe in the earlier section “Looking at Levels of Measurement”), data storage gets even more interesting. First, you have to ask yourself, “Is this variable a Choose only one or Choose all that apply variable?” The coding is completely different for these two kinds of multiple-choice variables.
You handle the Choose only one situation just as we describe for Type of Caregiver in the preceding section — you establish numeric code for each alternative. For the Likert scale example, if the item asked about patient satisfaction, you could have a categorical variable called PatSat, with five possible values: 1 for strongly disagree, 2 for somewhat disagree, 3 for neither agree nor disagree, 4 for somewhat agree, and 5 for strongly agree. And for the Type of Caregiver example, if only one kind of caregiver is allowed to be chosen from the three choices of nurse, physician, or social worker, you can have a categorical variable called CaregiverType with three possible values: 1 for nurse, 2 for physician, and 3 for social worker. Depending upon the study, you may also choose to add a 4 for other, and a 9 for unknown (9, 99, and 999 are codes conventionally reserved for unknown). If you find unexpected values, it is important to research and document what these mean to help future analysts encountering the same data.
But the situation is quite different if the variable is Choose all that apply. For the Type of Caregiver example, if the patient is being served by a team of caregivers, you have to set up your database differently. Define separate variables in the database (separate columns in Excel) — one for each possible category value. Imagine that you have three variables called Nurse, Physician, and SW (the SW stands for social worker). Each variable is a two-value category, also known as a two-state flag, and is populated as 1 for having the attribute and 0 for not having the attribute. So, if participant 101’s care team includes only a physician, participant 102’s care team includes a nurse and a physician, and participant 103’s care team includes a social worker and a physician, the information can be coded as shown in the following table.
Subject |
Nurse |
Physician |
SW |
|---|---|---|---|
101 |
0 |
1 |
0 |
102 |
1 |
1 |
0 |
103 |
0 |
1 |
1 |
If you have variables with more than two categories, missing values theoretically can be indicated by leaving the cell blank, but blanks are difficult to analyze in statistical software. Instead, categories should be set up for missing values so they can be part of the coding system (such as using a numerical code to indicate unknown, refused, or not applicable). The goal is to make sure that for every categorical variable, a numerical code is entered and the cell is not left blank.
Along the same lines, don’t group numerical data into intervals when recording it. If you know the age to the nearest year, don’t record Age in 10-year intervals (such as 20 to 29, 30 to 39, 40 to 49, and so on). You can always have the computer do that kind of grouping later, but you can never recover the age in years if all you record is the decade.
Some statistical programs let you store numbers in different formats. The program may refer to these different storage modes using arcane terms for short, long, or very long integers (whole numbers) or single-precision (short) or double-precision (long) floating point (fractional) numbers. Each type has its own limits, which may vary from one program to another or from one kind of computer to another. For example, a short integer may be able to represent only whole numbers within the range from
, whereas a double-precision floating-point number could easily handle a number like
. Excel has no trouble storing numerical data in any of these formats, so to make these choices, it is best to study the statistical program you will use to analyze the data. That way, you can make rules for storing the data in Excel that make it easy for you to analyze the data once it is imported into the statistical program.
mmHg) into one column of data. Excel won’t complain about it, but it will treat it as text because of the embedded “/”, rather than as numerical data. Instead, create two separate variables and enter each number into the appropriate variable.Now we’re going to tell you something that sounds like we’re contradicting the advice we just gave you (but, of course, we’re not!). Most statistical software (including Microsoft Excel) can represent dates and times as a single variable (an “instant” on a continuous timeline), so take advantage of that if you can. In Excel, you can enter the date and time as one variable (for example, 07/15/2020 08:23), not as a separate date variable and a time variable. This method is especially useful when dealing with events that take place over a short time interval (like events occurring during a surgical procedure). It is important to collect all potential start and end dates so any duration during the study can be calculated.
Some programs may store a date and time as a Julian Date, whose zero occurred at noon, Greenwich Mean Time, on Jan. 1, 4713 BC. (Nothing happened on that date; it’s purely a numerical convenience.)
If you impute a date, just create a new column with the imputed date, because you want to be cautious. Make sure to keep the original partial date for traceability. Any date imputation should be consistent with the study protocol, and not bias the results. Completely missing dates should be left blank, as statistical software treats blank cells as missing data.

Every research database, large or small, simple or complicated, should include a data dictionary that describes the variables contained in the database. It is a necessary part of study documentation that needs to be accessible to the research team. A data dictionary is usually set up as a table (often in Excel), where each row provides documentation for each variable in the database. For each variable, the dictionary should contain the following information (sometimes referred to as metadata, which means “data about data”):
Database programs like SQL and statistical programs like SAS often have a function that can output information like this about a data set, but it still needs to be curated by a human. It may be helpful to start your data dictionary with such output, but it is best to complete it in Excel. That way, you can add the human curation yourself to the Excel data dictionary, and other research team members can easily access the data dictionary to better understand the variables in the database.