In preparing a list of respondents for a research survey I’m conducting, I was faced with the task of reducing a very large list of potential respondents to a much smaller one. A number of approaches are possible for selecting a respondent subset. Because any of these approaches can be automated with a script, I chose to create a script around the method most capable of generating a representative subset: simple random sampling using a table of random numbers.
The random sampling process used within the script is based on the description given in Gay, Mills, and Airasian (2009). I adapted the directions given in chapter 5 under “Selecting a Sample,” pages 126-127.
To explain how the script works, let’s use an example of a list of email addresses. The first step is to generate a list of all the available email addresses of the desired population. This list can be in the form of a text file with each record on a separate line, or it can be exported as a CSV file from a program such as Microsoft Excel or OpenOffice.org Calc. The researcher then copies and pastes the list into the left text area in SRSG. By default, SRSG will generate a random subset of 400 items, but this can be manually changed, up to a suggested maximum of 10,000. Once the researcher has placed the original list into SRSG, he or she clicks the Randomize button to generate the subset. The subset is then displayed in the right text area. The results are random, de-duplicated, and unsorted. The result can then be copied and pasted back into the desired program.
Next, the script places the input records into a numbered array. The script counts the total number of records to determine how many digits of each entry in the random number table it needs for matching. So, if in our example the count of original records is 2,000, the script will use the last four digits of each random number it pulls from the random number table.
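The digit-counting step can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not the SRSG source; the function names digits_needed and candidate_index are hypothetical.

```python
def digits_needed(record_count: int) -> int:
    """Number of digits in the record count, e.g. 2000 has 4 digits."""
    return len(str(record_count))


def candidate_index(random_number: int, record_count: int) -> int:
    """Keep only the last N digits of a table entry, where N is the
    number of digits in the record count."""
    return random_number % (10 ** digits_needed(record_count))


# With 2,000 records, only the last four digits of each entry are used.
print(digits_needed(2000))            # 4
print(candidate_index(27335, 2000))   # 7335
```

Taking the remainder after dividing by 10,000 is one simple way to keep the last four digits; it also drops any leading zeros, so 60456 becomes 456.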
The script then uses the first random number it finds to identify an item from the input records. In this example, the random number found was 27335. Because our original input has a total count with four digits (i.e., a total count of 2,000), the script will use the last four digits of the random number, in this case 7335. It will try to find item 7335 in the input list, but since the list only has 2,000 items, the script abandons the current random number and tries the next one in the random number table, 73665. The last four digits of this next random number, 3665, also exceed the total count of input records, so this random number is abandoned and the next one is chosen. This continues until a suitable random number is found. In our example, that random number would be 61085, the last four digits of which are 1085. The script takes record 1085 out of the input data set and places it into the output data set. Once this is done, the next random number is chosen and the process repeats until the required number of items has been placed into the output data set. If the end of the random number table is reached, the pointer jumps to the start of the random number table and continues from there.
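The selection loop described above might look roughly like the following Python sketch. It is not the SRSG source, and it rests on several assumptions: records are numbered from 1, a record number that has already been selected is skipped (one way to get a de-duplicated result), and the names simple_random_sample, table, and start are hypothetical. The guard against an endless loop is also an addition for the sketch.

```python
from typing import List, Sequence


def simple_random_sample(records: Sequence[str], table: Sequence[int],
                         sample_size: int, start: int = 0) -> List[str]:
    """Pick sample_size records by walking a table of random numbers."""
    if sample_size > len(records):
        raise ValueError("sample size exceeds the number of records")
    # Assumption: record numbers run from 1 to len(records).
    modulus = 10 ** len(str(len(records)))  # keeps the last N digits
    chosen: List[int] = []  # record numbers, in the order selected
    seen = set()
    pointer = start
    misses = 0  # consecutive table entries that selected nothing
    while len(chosen) < sample_size:
        if misses >= len(table):
            raise RuntimeError("table cannot supply enough usable numbers")
        number = table[pointer] % modulus
        pointer = (pointer + 1) % len(table)  # wrap to the start at the end
        if 1 <= number <= len(records) and number not in seen:
            seen.add(number)
            chosen.append(number)
            misses = 0
        else:
            misses += 1  # out of range or already selected: abandon it
    return [records[n - 1] for n in chosen]


# The worked example: 2,000 addresses and the three table entries from the text.
addresses = [f"user{i}@example.org" for i in range(1, 2001)]
print(simple_random_sample(addresses, [27335, 73665, 61085], 1))
# ['user1085@example.org']
```

The first two entries reduce to 7335 and 3665, both larger than 2,000, so they are abandoned; 61085 reduces to 1085 and selects the 1,085th record.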
Because the input array is indexed without leading zeros, the script will strip any leading zeros from the random numbers it finds. This is mechanically different from the process described by Gay, Mills, and Airasian, but it has the same functional effect. In our 2,000-item example, Gay, Mills, and Airasian would take item number 456, say, and pad it with a leading zero to make it 0456. If the random number 60456 is chosen, their process would take the last four digits, 0456, and match them to that item. The script instead strips the leading zero from 0456 to get 456, which identifies the same item.
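A small Python sketch can illustrate why the two approaches agree. Both functions below are hypothetical illustrations, with table entries treated as 5-character strings; neither is taken from the SRSG source or from the textbook.

```python
def padded_match(entry: str, item_number: int, width: int) -> bool:
    """Textbook style: pad the item number with zeros, compare as text."""
    return entry[-width:] == str(item_number).zfill(width)


def stripped_match(entry: str, item_number: int, width: int) -> bool:
    """Script style: convert the last digits to a number (which drops
    leading zeros), compare as integers."""
    return int(entry[-width:]) == item_number


# Item 456 in a 2,000-item list, matched against table entry 60456.
print(padded_match("60456", 456, 4))    # True
print(stripped_match("60456", 456, 4))  # True
```

For any item number that fits in the chosen width, padding the item number and stripping the table entry give the same answer, so the choice is one of convenience.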
Once the output data set is complete, the array is formatted for output and displayed in the right text area.
As soon as the results are generated, a report is displayed in a pop-up window. The purpose of this report is to confirm and document the use of the random sampling script, and to provide a means of verifying the results without divulging any confidential information that might exist in the input or output data sets. The report indicates the number of original input records and output records. For both the input and output records, a one-way SHA-256 hash is generated. If verification is needed that a certain input or output was processed using the SRSG script, a SHA-256 hash can be generated from the input or output data and matched against the documented hash. The report also includes the pseudo-random seed, the first random number selected based on the seed, and the web browser used.
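To show how such a hash can be checked independently, here is a minimal Python sketch using the standard hashlib module. How SRSG serializes the records before hashing is not stated here, so joining them with newlines and encoding as UTF-8 is an assumption, and dataset_hash is a hypothetical name; a hash produced this way will only match the report if the script prepares its data the same way.

```python
import hashlib
from typing import Sequence


def dataset_hash(records: Sequence[str]) -> str:
    """SHA-256 hex digest of the records.

    Assumption: records are joined with newlines and encoded as UTF-8.
    """
    joined = "\n".join(records)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


sample = ["a@example.org", "b@example.org"]
digest = dataset_hash(sample)
print(len(digest))                     # 64 hexadecimal characters
print(digest == dataset_hash(sample))  # True: same data, same hash
```

Because the hash is one-way, the digest can be published in a report without revealing the addresses, while anyone holding the original data can recompute it and compare.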
The random number table used in SRSG was created by RAND Corporation and is available from RAND (RAND Corp., 2001). This random number table has 200,000 sets of 5-digit numbers, for a total of 1 million digits. RAND Corporation provides documentation that explains how the random numbers were generated and validated.
The SRSG script was tested in Firefox and Safari on Mac OS X. A set of 100,000 records took a minute or two to process on a laptop with a 2.2 GHz CPU and 2 GB of RAM. At 200,000 records, the web browser seemed to stall. The script was written and tested primarily to create an output of 400 records. Very large output sets may reveal shortcomings in the script, so they are not recommended. The intent of this script was to allow the researcher to create relatively small data sets from relatively large ones.
Gay, L. R., Mills, G. E., & Airasian, P. W. (2009). Educational research: Competencies for analysis and applications (9th ed.). Upper Saddle River, New Jersey: Prentice Hall.
RAND Corp. (2001). Datafile: A million random digits. Pittsburgh, PA: RAND Corp. Retrieved from http://www.rand.org/pubs/monograph_reports/MR1418/