SPSS Syntax Part 4: Double Entry Comparison
When entering data, mistakes inevitably happen. The wrong number gets pressed, the selected cell is in the wrong row, or any number of other things go wrong. To catch these mistakes so they can be corrected, double entry of data is fairly common. Once entered twice, the two sets of data can be compared to see if there are any differences, and the items that have a difference can be checked a third time to ensure accuracy. But checking double entry manually, especially for a large database, is unrealistic. Instead, you can use SPSS syntax to do the comparison for you!
To have SPSS syntax do the comparison, we first need the first and second entry data in the same database. Typically I will have one database that has the actual variable names, then another database that has "_2ndentry" appended to the end of the variable names (except for the ID variable). You can use whatever naming procedure you prefer, as long as you can easily refer to the first and second entry of the same variable. Then, you just merge the two databases into one (using Data > Merge Files > Add Variables...), matching by the ID variable.
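If you'd rather do the renaming and merging in syntax than through the menus, something like the following sketch should work. The file names (first_entry.sav, second_entry.sav), the dataset name (second), and the ID variable name (id) are placeholders I've made up for illustration, so swap in your own:

```spss
* Open the second-entry file and rename its variables (all names are placeholders).
GET FILE='second_entry.sav'.
RENAME VARIABLES (variable1 variable2 = variable1_2ndentry variable2_2ndentry).
SORT CASES BY id.
DATASET NAME second.

* Open the first-entry file while the named dataset above stays open.
GET FILE='first_entry.sav'.
SORT CASES BY id.

* Add the second-entry variables to the first-entry cases, matching on id.
MATCH FILES /FILE=* /FILE='second' /BY id.
EXECUTE.
```

Both files need to be sorted by the ID variable before MATCH FILES can match on it, which is why each one gets a SORT CASES first.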
Once merged, you're ready to prepare the syntax for the comparison! Let's take a look at the framework of the SPSS syntax, then break down what each part does:
DO REPEAT var_1st = variable1 variable2
  /var_2nd = variable1_2ndentry variable2_2ndentry
  /var_test = variable1_test variable2_test.
RECODE var_1st var_2nd (MISSING = -888).
DO IF (var_1st = var_2nd).
  COMPUTE var_test = 0.
ELSE.
  COMPUTE var_test = 999.
END IF.
RECODE var_1st var_2nd (-888 = SYSMIS).
END REPEAT.
COUNT num_errors = variable1_test variable2_test (999).
EXECUTE.
Alright, there's a lot there, so let's break it down:
DO REPEAT var_1st = variable1 variable2
  /var_2nd = variable1_2ndentry variable2_2ndentry
  /var_test = variable1_test variable2_test.
END REPEAT.
DO REPEAT tells SPSS that you want to start a loop. Loops are a way to repeat the same commands over and over until a specified end is reached. In this case, we're telling SPSS to repeat everything between DO REPEAT and END REPEAT once for each item in the variable lists. You can include as many variables as you want in the lists (I just have two to keep the code short), but every list needs to have the same number of items.
On the first pass, SPSS swaps each placeholder for the first variable in its list. That is, "var_1st" stands for variable1, "var_2nd" stands for variable1_2ndentry, and "var_test" stands for a new variable (variable1_test) that will hold the result of the comparison.
On the second pass, SPSS goes down one spot in each list. "var_1st" then stands for variable2, "var_2nd" for variable2_2ndentry, and "var_test" for another new variable (variable2_test) that will hold the comparison for the second variable.
The framework above doesn't actually do a comparison by itself; it only sets up which variables go together. So, let's look at the comparison piece of the syntax:
RECODE var_1st var_2nd (MISSING = -888).
DO IF (var_1st = var_2nd).
  COMPUTE var_test = 0.
ELSE.
  COMPUTE var_test = 999.
END IF.
RECODE var_1st var_2nd (-888 = SYSMIS).
This chunk of code needs to be included within the loop (so before the END REPEAT). As part of the loop, it will do a comparison of the variables before moving on to the next set. First, we see that it gives missing values a value of -888, so blank cells will have some value rather than being blank. This matters because a comparison involving a missing value evaluates to missing rather than true or false, and when that happens SPSS skips the whole DO IF structure, so "var_test" would be left blank instead of being flagged. Then, it compares whatever values are being held by "var_1st" and "var_2nd." If there is a match (the first case), then the "var_test" variable will be given a value of 0. If there is a mismatch (caught by the ELSE; that is, if there is not a match then do the second thing), then the "var_test" variable gets a value of 999.
This is done by participant, so it does the comparison between the two variables for each participant and assigns the value appropriate for that participant.
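To make the placeholders concrete, here is roughly what the first pass of the loop amounts to once "var_1st," "var_2nd," and "var_test" have been swapped for the first name in each list. This is just the loop written out by hand as an illustration, not something you need to run separately:

```spss
* First pass of the loop, written out by hand with the real variable names.
RECODE variable1 variable1_2ndentry (MISSING = -888).
DO IF (variable1 = variable1_2ndentry).
  COMPUTE variable1_test = 0.
ELSE.
  COMPUTE variable1_test = 999.
END IF.
RECODE variable1 variable1_2ndentry (-888 = SYSMIS).
```

The second pass is the same block with variable2, variable2_2ndentry, and variable2_test. If you want SPSS to show you the commands it generates, ending the loop with END REPEAT PRINT. instead of END REPEAT. should list them in the output.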
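Finally, the chunk ends by converting -888 back into a blank cell (which SPSS calls SYSMIS, for system-missing). If -888 is a possible value within your data, then you will want to choose a different value to be assigned to blank cells within this block of syntax. Also keep in mind that MISSING covers user-defined missing values as well as blank cells, so any values you have declared as missing for these variables will come out of the loop as blank cells too.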
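If no safe placeholder value exists in your data, or you have user-defined missing codes you'd like to keep, one possible alternative is to skip the RECODE steps and test for missing values directly with the MISSING() function. This is a sketch of a different approach rather than the syntax described above, and it assumes numeric variables with the same example names:

```spss
* Start by assuming a mismatch, then clear the flag when the two entries agree.
* Two missing cells also count as agreement, as in the syntax above.
DO REPEAT var_1st = variable1 variable2
  /var_2nd = variable1_2ndentry variable2_2ndentry
  /var_test = variable1_test variable2_test.
COMPUTE var_test = 999.
IF (var_1st = var_2nd) var_test = 0.
IF (MISSING(var_1st) AND MISSING(var_2nd)) var_test = 0.
END REPEAT.
EXECUTE.
```

Because a comparison involving a missing value evaluates to missing, the first IF simply does nothing when only one entry is blank, and the 999 stays in place.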
Once the loop is finished, we have another bit of code that is optional but helpful:
COUNT num_errors = variable1_test variable2_test (999).
EXECUTE.
COUNT is an SPSS command that tallies the number of times a specified value occurs across variables for a participant (in this case, 999 as indicated in the parentheses at the end of the variable list). The tally is then placed into the specified variable (in this case, "num_errors"). The point of this count is to see how many errors were identified for a participant, so if there were no errors you know not to do any additional checking, and if there are a lot of errors for many participants then you know something might not have worked properly.
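Once "num_errors" exists, a quick way to use it is to summarize it and then list only the participants who need a third look. The sketch below assumes the ID variable is called id, which is a placeholder name:

```spss
* Distribution of the number of flagged variables per participant.
FREQUENCIES VARIABLES = num_errors.

* Show only the participants with at least one mismatch (id is a placeholder name).
TEMPORARY.
SELECT IF (num_errors > 0).
LIST VARIABLES = id num_errors variable1_test variable2_test.
```

TEMPORARY keeps the SELECT IF from permanently dropping anyone, since the selection only applies to the LIST that follows it.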
So that's the basics of using SPSS syntax to compare double entry! To end, I want to give a few tips:
- The same DO IF comparison works for string variables as well. However, the RECODE lines use a numeric placeholder (-888) and are meant for numeric variables, so string variables should go in their own loop without them. The comparison will also count a mismatch if anything is different about the strings (e.g., different capitalization, different use of punctuation). You can help prevent a high number of mismatches by standardizing how strings should be entered (e.g., all lowercase with no punctuation), but you'll probably still get some false mismatches because of this.
- Entering a long list of variables would obviously take a long time to do manually (and it raises the risk of typos). To save time, you can select the variable names column in Variable view in SPSS, then copy it and paste into the appropriate spot in the syntax.
- Please keep in mind that SPSS doesn't know if it's matching up the correct variables or not. It's important that you make sure that the variable lists are in the same order between first and second entry. That is, first items on the list will always be matched, second items will be matched, and so on by their location in the lists. If you suddenly get a lot of mismatches, see if the lists get mismatched at any point.
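For the string variables mentioned in the first tip, a separate loop along these lines might be a reasonable starting point. It leaves out the RECODE steps (which use a numeric placeholder) and compares uppercase copies of the two entries so that capitalization alone doesn't trigger a mismatch. The variable names (firstname, city, and their counterparts) are made up for illustration:

```spss
* Compare string entries while ignoring differences in capitalization.
DO REPEAT str_1st = firstname city
  /str_2nd = firstname_2ndentry city_2ndentry
  /str_test = firstname_test city_test.
COMPUTE str_test = 999.
IF (UPCASE(str_1st) = UPCASE(str_2nd)) str_test = 0.
END REPEAT.
EXECUTE.
```

Differences in punctuation or spelling will still be flagged with 999, so expect to review some of these by eye.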
Have questions about how to use this syntax? Want tips on how to do anything else in SPSS? Let me know in the comments!