Statistics – Spring 2008
Lab #1 – Data Screening
The purpose of data screening is to:
(a) check whether the data have been entered correctly (for example, look for out-of-range values);
(b) check for missing values and decide how to deal with them;
(c) check for outliers and decide how to deal with them;
(d) check for normality and decide how to deal with non-normality.
1. Finding incorrectly entered data
; Your first step in data screening is to run “Frequencies”:
1. Select Analyze --> Descriptive Statistics --> Frequencies
2. Move all variables into the “Variable(s)” window.
3. Click OK.
; The output below covers only the four “system” variables in our dataset, because pasting the output for all
variables would take up too much space in this document.
; The “Statistics” box tells you the number of missing values for each variable. We will use this information
later when we are discussing missing values.
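For readers who want to see what the “Statistics” box is counting, here is a minimal Python sketch. It is not SPSS syntax and not part of the lab. The records, the variable names other than “system1”, and the helper `missing_summary` are made up for illustration, and `None` is assumed to stand for a missing response.

```python
# Hypothetical records; None marks a missing response (an assumption of this sketch).
records = [
    {"subject": 1, "system1": 4, "system2": 7},
    {"subject": 2, "system1": None, "system2": 3},
    {"subject": 3, "system1": 11, "system2": None},
    {"subject": 4, "system1": 2, "system2": None},
]


def missing_summary(rows, variables):
    """Return {variable: (n_valid, n_missing)} for each listed variable."""
    summary = {}
    for var in variables:
        # A value counts as missing when it is absent or None.
        n_missing = sum(1 for row in rows if row.get(var) is None)
        summary[var] = (len(rows) - n_missing, n_missing)
    return summary


summary = missing_summary(records, ["system1", "system2"])
for var, (n_valid, n_missing) in summary.items():
    print(f"{var}: valid={n_valid}, missing={n_missing}")
```

The valid and missing counts for each variable add up to the number of subjects, which is a quick check that no rows were dropped along the way.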
; Each variable is then presented as a frequency table. For example, below we see the output for “system1”.
According to the coding manual for the “Legal beliefs” survey, the valid responses for “system1” are 1 through
11. The output below shows one out-of-range value: “13”. (NOTE: your dataset will not contain a “13” because I
gave you the screened dataset; I added the “13” to this example to show you what an out-of-range value looks
like.) Since 13 is an invalid value, you need to find out why it was entered. Did the person entering the data
make a mistake? Or did the subject respond with “13” even though the question indicated that only the numbers
1 through 11 are valid? You can identify the source of the error by checking the hard copies of the data.
First, find which subject has the “13”: click the variable name (system1) to highlight the column, use the
“Find” function (Edit --> Find) to locate the value, and then scroll to the left to read the subject number.
Then hunt down the hard copy of the data for that subject number.
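The same check can be sketched outside SPSS. The Python example below is only an illustration under stated assumptions: the subject numbers and responses are invented, the helpers `frequency_table` and `out_of_range` are hypothetical names, and the valid range of 1 through 11 is taken from the coding manual described above.

```python
from collections import Counter

# Valid responses for "system1" per the coding manual: 1 through 11.
VALID_RANGE = range(1, 12)

# Hypothetical (subject number, system1 response) pairs; None marks a missing response.
responses = [(101, 3), (102, 13), (103, 11), (104, 3), (105, None), (106, 7)]


def frequency_table(pairs):
    """Count how often each non-missing value occurs."""
    return Counter(value for _, value in pairs if value is not None)


def out_of_range(pairs, valid):
    """Return (subject, value) pairs whose value is present but not a valid response."""
    return [(subj, value) for subj, value in pairs
            if value is not None and value not in valid]


freq = frequency_table(responses)
for value, count in sorted(freq.items()):
    print(f"{value:>3}: {count}")

# Subjects to look up in the hard copies of the data.
flagged = out_of_range(responses, VALID_RANGE)
print("Out of range:", flagged)
```

Missing responses are deliberately left out of the out-of-range list here, since they are handled separately in the missing values section.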
2. Missing Values
; Below, I describe in depth how to identify and deal with missing values.
; Why do missing values occur? Missing values are either random or non-random. Random missing values may
occur because the subject inadvertently did not answer some questions. For example, the study may be overly
complex and/or long, or the subject may be tired and/or not paying attention, and miss the question. Random
missing values may also occur through data entry mistakes. Non-random missing values may occur because
the subject purposefully did not answer some questions. For example, the question may be confusing, so many
subjects skip it. The question may also lack appropriate answer choices, such as
“no opinion” or “not applicable”, so the subject chooses not to answer. Finally, subjects may be
reluctant to answer some questions because of social desirability concerns about their content,
such as questions about sensitive topics like past crimes, sexual history, or prejudice or bias toward certain
groups.
; Why is missing data a problem? Missing values mean a reduced sample size and a loss of data. You conduct
research to measure empirical reality, so missing values thwart the purpose of research. Missing values may
also indicate bias in the data. If the missing values are non-random, then the study is not accurately measuring
the intended constructs. The results of your study might have been different if the missing data had not been missing.
; How do I identify missing values?