Chapter 4

Counting on Statistical Software

IN THIS CHAPTER

Examining the evolution of statistical software

Surveying commercial, open source, and free options

Considering code-based versus non–code-based software

Storing data in the cloud

Before statistical software existed, complex regressions that were possible in theory were too laborious to carry out by hand on real datasets. It wasn’t until the 1960s, with the early development of the SAS suite of statistical software, that analysts were able to do these calculations routinely. As technology advanced, different types of software were developed, including open-source software and web-based software.

As you may imagine, all these choices led to competition and confusion among analysts, students, and organizations using this software. Organizations wonder which statistical packages to implement. Professors wonder which ones to teach, and students wonder which ones to learn. The purpose of this chapter is to help you make informed choices about statistical software. We describe the practical choices you have today among the statistical software available and provide guidance on them. We discuss choosing between:

Commercial software, such as SAS and SPSS
Open-source software, such as R and Python
Free software applications, such as G*Power and PS (Power and Sample Size Calculation)

We also provide guidance on how to choose between code-based and non–code-based software, and end by providing advice on cloud data storage.

Considering the Evolution of Statistical Software

SAS was one of the first widely used commercial statistical software packages, and it is still used today. SAS was developed originally in the 1960s and 1970s to run on mainframe computers. Around 2000, SAS was adapted to personal computers (known as PC SAS), adding a user-friendly graphical user interface (GUI). During the growth of SAS, other commercial statistical packages appeared, the most popular being IBM’s SPSS. SAS continues to be the go-to program for big data analysis, where analysts can easily access large datasets from servers. In contrast, SPSS, like PC SAS, continues to be used mainly on personal computers.

If you had taken a college statistics course in the year 2000, your course would likely have taught either SAS or SPSS. Professors would have made either SPSS or SAS available to you for free or for a nominal license fee from your college bookstore. If you take a college statistics course today, you may be in the same situation, or you may find yourself learning open-source statistical software. The most common options are R and Python. This software is free to the user and downloadable online because it is built by the user community, not a company.

As the Internet evolved, more options became available for statistical software. In addition to the existing stand-alone applications described earlier, specialized statistical apps were developed that only perform one or a small collection of specific statistical functions (such as G*Power and PS, which are for calculating sample sizes). Similarly, web-based online calculators were developed, which are typically programmed to do one particular function (such as calculate a chi-square statistic and p value from counts of data, as described in Chapter 12). Some web pages feature a collection of such calculators.
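To make concrete what such a single-purpose calculator does behind the scenes, here is a minimal sketch in Python of a chi-square calculation from a 2×2 table of counts. The function name, the table layout, and the example counts are our own illustrative assumptions, not taken from any particular website. The sketch uses the uncorrected Pearson chi-square test with 1 degree of freedom, so a calculator that applies a continuity correction would report somewhat different values.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for a 2x2 table of counts.

    Assumed table layout (hypothetical labels):
                 outcome+  outcome-
        group 1     a         b
        group 2     c         d
    All four row and column totals must be greater than zero.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Shortcut formula, algebraically equal to sum((observed - expected)^2 / expected)
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Made-up counts, for illustration only
chi2, p = chi_square_2x2(20, 30, 35, 15)
print(f"chi-square = {chi2:.3f}, p = {p:.4f}")
```

A web calculator wraps the same arithmetic in a form: you type in the four counts, and it displays the statistic and the p value.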

Comparing Commercial to Open-Source Software

Before 2010, if an organization performed statistical analysis as part of its core function, it needed to purchase commercial statistical software like SAS or SPSS. Advantages of implementing commercial software include the ability to perform many statistical functions, technical support from the software company, and the expectation that the software will remain in use in the future as the company continues to support and upgrade it.

However, organizations today are hesitant to adopt commercial software when they can instead use open-source software like R or Python. Admittedly, even though it is free of charge, there are many downsides to open-source software. First, you need to hire analysts who know how to use it so well that they can figure out what to do when there’s a problem because open-source software does not have tech support. Next, you need to hire a lot more analysts than you would with commercial software because a lot of their work will be in trying to customize the software for your use and keep it updated so that your organization runs smoothly.

So, why are new organizations today hesitant to adopt commercial software when open-source software has so many downsides? The main reason is that the old advantages of commercial software are no longer as compelling. SAS and SPSS are expensive programs, but they have much of the same functionality as open-source R and Python, which are free. In some cases, analysts prefer the open-source application to the commercial application because they can customize it more easily to their setting. Also, it is not clear that commercial software is innovating ahead of open-source software. Organizations do not want to get entangled with expensive commercial software that eventually starts to perform worse than free open-source alternatives!

As a result, many organizations use both commercial and open-source statistical software in integrated application pipelines. Therefore, it is important to be comfortable evaluating and using various commercial software, even if open-source options are becoming more popular.

Checking Out Commercial Software

In the following sections, we discuss the most popular commercial statistical software available currently.

SAS

SAS is the oldest commercial software currently available. It started out with two main components, Base SAS and SAS/STAT, that provided the most commonly used statistical calculations. Today, however, it has grown to include many additional components and sublanguages. SAS has always been so expensive that only organizations with a significant budget can afford to purchase and use it. However, because individual learners need to be able to practice SAS even if they cannot afford it, SAS developed a free, online version called SAS OnDemand for Academics (ODA) that is available at https://welcome.oda.sas.com.

Originally, SAS ran as a command-line program without a graphical user interface, or GUI, which came later in the 2000s when PC SAS was invented. In the original SAS, the user would gain access to datasets in SAS format that resided on a SAS server in the same environment. The user would write code files using SAS code and run these files against the SAS data. This action would produce a log file that explained how the code was executed and reported any errors. It would also produce output that provided the results of the statistical procedures.

Today, the experience of using SAS has been modernized. In PC SAS and SAS ODA, it is easy to view code, log, and output files in different windows and switch back and forth between them. It is also easier to import data into and out of the SAS environment and create integrated application pipelines involving the SAS environment. The new commercial cloud-based version of SAS called Viya is intended to be used with data stored in the cloud rather than on SAS servers (see the later section “Storing Data in the Cloud” for more).

SAS is entrenched in some industries, such as pharmaceutical, insurance, and banking, because SAS has historically been the only program powerful enough to handle the size of their datasets. Those settings traditionally used SAS servers for data storage. Now, this practice is being challenged because other analytic options may look more appealing than what SAS has to offer (see the section “Focusing on open-source and free software”). In addition, many companies are having trouble maintaining their old-fashioned SAS servers and want to move their data to cloud storage. These industries are looking for SAS users to help them modernize their operations.

Students often find that SAS is challenging to learn when compared to other statistical software, especially open-source software. Why learn legacy commercial software like SAS today, when it is so much harder to learn than other software? The answer is that SAS is still standard software in some domains, such as pharmaceutical research. This means that even if those organizations choose to eventually migrate away from SAS, they will need to hire SAS users to help with the migration.

SPSS

SPSS was invented more recently than SAS and runs in a fundamentally different way. SPSS does not expect you to have a data server the way SAS does. Instead, SPSS runs as a stand-alone program like PC SAS, and expects you to import data into it for analysis. Therefore, SAS is more likely to be used in a team environment, while SPSS tends to have individual users.

Like SAS, SPSS produces output, but unlike SAS, SPSS is typically operated through menu selections rather than by writing and running code. SPSS produces one long output file that includes all the output from each SPSS session. In the output file, SPSS includes the code it generates automatically from your menu selections. Therefore, as with SAS, it is possible to save SPSS code files and output files and rerun the same code later. SPSS is available from IBM’s website at www.ibm.com/products/spss-statistics/pricing.

Microsoft Excel

Microsoft Excel has been used in some domains for statistical calculations, but it is difficult to use with large datasets. Excel has built-in functions for summarizing data (such as calculating the means and standard deviations discussed in Chapter 9). It also has common probability distribution functions such as Student t (Chapter 11) and chi-square (Chapter 12). You can even do straight-line regression (Chapter 16), and more extensive analyses are available through add-ins.

These functions can come in handy when doing quick calculations or learning about statistics, but using Excel for statistical projects poses many challenges. Using a spreadsheet for statistics means your data are stored in the same place as your calculations, creating privacy concerns (and a mess!). So, while Excel can be helpful mathematically (especially when making extra calculations based on estimates in printed statistical output), it is not good practice to use it for extensive statistical projects.

Microsoft Excel is available in different formats, including both downloadable and web based. Purchase it from Microsoft at www.microsoft.com.

Online analytics platforms

A more modern approach to statistical software is to create an online platform known as an analytics suite that allows you to connect to data sources and conduct analytics online. Here are a few popular online platforms:

Tableau: Tableau is known for being able to provide real-time data-driven graphical displays online, and organizations may adopt Tableau to develop customized dashboards. It is available at www.tableau.com.
GraphPad: This online platform provides analytics support, such as curve-fitting, and provides a graphical suite called Prism. It is available at www.graphpad.com.

There are both advantages and disadvantages to using these online commercial platforms. One advantage is that online software tends to be sold by subscription, paid monthly or annually, which is usually cheaper than a traditional license. Another is that you get continuous upgrades because the software is web based. The main downside is that these platforms have a steep learning curve and require a lot of work to fully adopt, so you have to ask yourself whether adopting one makes sense for your project.

Focusing on Open-Source and Free Software

Open-source software refers to software that has been developed and supported by a user community. Open-source software is typically free of charge, but it still has a license that requires you to adhere to certain terms when using the software. In this section, we discuss open-source packages as well as other free statistical software.

Open-source software

The two most popular and extensive open-source statistical programs are R and Python.

R: R is statistical software that has been developed and is maintained by the R user community. It has two interfaces: R GUI, which looks similar to PC SAS and SPSS, and RStudio, which is an integrated development environment (IDE). Analysts prefer to use RStudio when developing graphical displays for the web, while R GUI is fine for most statistical work. To run R, you download and install the base application. Then, for specialized functions not included in the base application, you install additional R packages. As with PC SAS, in R, you import or connect to datasets, develop and save code files to run on those datasets, and produce output you can save. Base R, R packages, and documentation are available on the Comprehensive R Archive Network (CRAN) server at https://cran.r-project.org.
Python: Python is an open-source programming language that is often used to analyze data. As with R, Python is developed and maintained by its own user community and runs in a similar way. Although you still develop code that runs against datasets in the Python environment, the Python and R code are different. Instead of packages as in R, Python has libraries. Python is available at www.python.org/downloads.
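As a small illustration of what using a Python library looks like, the following sketch imports the statistics library, which ships with Python itself, and summarizes a short list of values. The blood pressure readings are hypothetical numbers we made up for the example. Third-party libraries are imported the same way, but have to be installed first.

```python
import statistics  # part of Python's standard library, so nothing extra to install

# Hypothetical systolic blood pressure readings (mmHg), made up for illustration
systolic = [118, 124, 131, 127, 142, 136, 121, 129]

mean_bp = statistics.mean(systolic)
sd_bp = statistics.stdev(systolic)  # sample standard deviation (n - 1 denominator)

print(f"n = {len(systolic)}, mean = {mean_bp:.1f}, SD = {sd_bp:.1f}")
```

Saving these lines in a file gives you a code file you can rerun later against the same or updated data.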

Students often wonder what the differences are between R and Python, and which one to learn. For most statistical work, their capabilities are largely equivalent, although scientific disciplines have leaned toward adopting R, and engineering disciplines have leaned toward Python. Many students find themselves easily learning both.

Other free statistical software

Other statistical software packages are free, but they are not technically open-source — meaning they were not developed by an open-source community, and they are not licensed the same way.

Software that performs many functions

This section provides examples of free software that, like SAS and R, performs many functions.

OpenStat and LazStats are free statistical programs developed by Dr. Bill Miller that use menus that resemble SPSS. Dr. Miller provides several excellent manuals and textbooks that support both programs. OpenStat and LazStats are available at https://openstat.info.
Epi Info was developed by the United States Centers for Disease Control and Prevention (CDC) to acquire, manage, analyze, and display the results of epidemiological research. What makes it different from other statistical software is that it contains modules for creating survey forms and collecting data. Epi Info is available at https://www.cdc.gov/epiinfo/index.html.

Software for calculating sample size

Biostatisticians frequently encounter the problem of estimating sample size. The following are two free applications we recommend for performing sample-size calculations:

G*Power: G*Power was developed at the Universität Düsseldorf and is used to estimate the sample size for many different types of tests. Throughout this book, when we discuss sample-size calculations, we give you advice on how to do them using G*Power. G*Power is available at www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psychologie-und-arbeitspsychologie/gpower. To use the program, you download it from this website and install it on your computer.
PS (Power and Sample Size Calculation): The PS program was developed by W.D. Dupont and W.D. Plummer at Vanderbilt University. Like G*Power, you download the application from its website and install it on your computer. The PS interface is similar to that of G*Power. PS is available at https://biostat.app.vumc.org/wiki/Main/PowerSampleSize.
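To give a sense of the kind of calculation these programs automate, here is a rough Python sketch of one textbook approximation: the number of participants needed per group to compare two means with a two-sided test. It relies on the normal approximation n = 2(z₁₋α/₂ + z_power)² / d², where d is the standardized effect size. The function name and default values are our own assumptions for illustration. Dedicated programs such as G*Power use more exact methods, so they may report a slightly larger number.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for comparing two means.

    Normal-approximation formula only; effect_size is the difference in
    means divided by the common standard deviation (must be nonzero).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    n = 2 * (z_alpha + z_power) ** 2 / effect_size ** 2
    return math.ceil(n)  # round up to a whole participant

print(n_per_group(0.5))  # medium effect, 5% alpha, 80% power -> 63
```

Notice how quickly the required sample size grows as the effect size shrinks: halving the effect size roughly quadruples the number of participants needed.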

Choosing Between Code-based and Non–Code-Based Methods

Most of the software mentioned up to this point in this chapter (including SAS, SPSS, R, and Python) uses code files that can be saved and rerun on data at a later date. These programs run in a fundamentally different way from programs such as Microsoft Excel, where you can run statistics on data, but no code files are produced and saved. Likewise, when you use web-based calculators, specialized apps like G*Power and PS for sample-size calculations, or online commercial platforms, no code files are produced and saved.

This is an important issue in statistics. When no code files are produced or saved, you have no record of the steps in your analysis. If you need to be able to reproduce your analysis, the only way to be sure of this is to use software that allows you to save the code so you can run it again.
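The following Python sketch shows the idea in miniature. Every step from raw data to result sits in one function in a saved file, so rerunning the file repeats the analysis exactly. The dataset, column names, and function name are hypothetical, and the data are embedded as text only to keep the example self-contained.

```python
import csv
import io
import statistics

# Hypothetical dataset embedded as text so the sketch is self-contained;
# in practice this would be a data file kept alongside the code file.
RAW_DATA = """id,weight_kg
1,70.2
2,81.5
3,64.8
4,77.1
"""

def run_analysis(raw_text):
    """Every step from raw data to result lives here, so it can be rerun."""
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    weights = [float(row["weight_kg"]) for row in rows]
    return {
        "n": len(weights),
        "mean_kg": round(statistics.mean(weights), 2),
        "sd_kg": round(statistics.stdev(weights), 2),
    }

first_run = run_analysis(RAW_DATA)
second_run = run_analysis(RAW_DATA)
print(first_run)
print("Identical on rerun:", first_run == second_run)
```

With menu-driven or web-based tools, the equivalent record would be a written list of every click and entry, which is much easier to get wrong.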

Storing Data in the Cloud

Cloud-based storage refers to storing large data files on a set of Internet servers designed specifically for large data storage. Unlike old-fashioned stand-alone servers in server rooms, cloud-based servers share storage space across the Internet, providing instantaneous access and back-up capabilities. If you want to get rid of an old-fashioned server in your server room (which could be a SAS server), you will have to contract with a cloud-based storage company to use its space for your data. Then, you will have to find a way to move your data from your server into your new cloud storage. You will also want to be confident that you can maintain a long-term relationship with this company, so you don’t have to move your data out anytime soon.

Although moving data to the cloud may be an onerous task, you may not have any choice, because physical storage space may be running out. Many new organizations start with cloud data storage for that reason. Once your data are stored in the cloud, they are more easily accessed using online analytics platforms such as SAS Viya and Tableau.