Digital Journal

Simple Random Sample Generator for Statistics

While preparing a list of respondents for a research survey I’m conducting, I needed to reduce a very large list of potential respondents to a much smaller one. There are several ways to select a subset of respondents. Because any of them can be automated with a script, I chose to write a script around the method most likely to produce a representative subset: simple random sampling using a table of random numbers.

The script, the Simple Random Sample Generator (SRSG), is implemented in JavaScript and runs entirely within any modern web browser. No web server, database, installation, or configuration is required.

The random sampling process used within the script is based on the description given in Gay, Mills, and Airasian (2009). I adapted the directions given in chapter 5 under “Selecting a Sample,” pages 126-127.

Example usage

To explain how the script works, let’s use an example of a list of email addresses. The first step is to generate a list of all the available email addresses in the desired population. This list can be a text file with each record on a separate line, or it can be exported as a CSV file from a program such as Microsoft Excel or OpenOffice.org Calc. The researcher then copies and pastes the list into the left text area in SRSG. By default, SRSG will generate a random subset of 400 items, but this can be changed manually, up to a suggested maximum of 10,000. Once the original list is in SRSG, the researcher clicks the Randomize button to generate the subset, which is then displayed in the right text area. The results are random, de-duplicated, and unsorted. They can then be copied and pasted back into the desired program.

Internal process

First, the script generates a pseudo-random number (a seed) using the random number generator of the JavaScript language. This pseudo-random seed is the equivalent of blindly picking a starting point in a traditional, printed random number table. The seed is in the range of 0 to 199,999, one position for each five-digit number in the random number table included in SRSG. SRSG selects the five-digit number it finds at the position given by the seed. For example, if the seed is 121065, the script will jump to that position and select 27335 as the random number.
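The seed step can be sketched as follows. This is a minimal illustration, not the SRSG source: the names `randomTable` and `pickSeed` are hypothetical, and the five-entry table is a stand-in for the 200,000-entry RAND table.

```javascript
// Hypothetical stand-in for the 200,000-entry RAND table bundled with SRSG.
const randomTable = ["27335", "73665", "61085", "60456", "10097"];

// Pick a starting position: an integer from 0 to tableLength - 1.
// `rng` is injectable (an assumption of this sketch) so the function
// can be exercised with a fixed value instead of Math.random.
function pickSeed(tableLength, rng = Math.random) {
  return Math.floor(rng() * tableLength);
}

const seed = pickSeed(randomTable.length);
const firstNumber = randomTable[seed];
console.log(`seed ${seed} selects ${firstNumber}`);
```

With the full table, `pickSeed(200000)` would return a value from 0 to 199,999, matching the range described above.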

Next, the script places the input records into a numbered array. The script counts the total number of records to determine how many digits of each random number it needs for matching. So, if in our example the count of original records is 2,000, the script will use the last four digits of each random number it pulls from the random number table.
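A small sketch of the digit-count step, using hypothetical helper names (`digitsNeeded`, `lastDigits`) that are not taken from the SRSG source. It assumes table entries are stored as five-character strings and does not address record counts with more than five digits.

```javascript
// Number of digits in the record count, e.g. 2000 -> 4.
function digitsNeeded(recordCount) {
  return String(recordCount).length;
}

// Last n digits of a five-digit table entry, kept as a string
// so that any leading zero is still visible at this stage.
function lastDigits(tableEntry, n) {
  return tableEntry.slice(-n);
}

console.log(digitsNeeded(2000));      // 4
console.log(lastDigits("27335", 4));  // "7335"
```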

The script then uses the first random number it finds to identify an item in the input records. In this example, the random number found was 27335. Because the total count of input records has four digits (i.e., a total count of 2,000), the script will use the last four digits of the random number, in this case 7335. It will try to find item 7335 in the input list, but since the list has only 2,000 items, the script abandons the current random number and tries the next one in the random number table, 73665. The last four digits of this next random number also exceed the total count of input records, so it too is abandoned and the next random number is tried. This continues until a suitable random number is found. In our example, that random number would be 61085, the last four digits of which are 1085. The script takes record 1085 out of the input data set and places it into the output data set. The next random number is then chosen, and the process repeats until the required number of items has been placed into the output data set. If the end of the random number table is reached, the pointer jumps to the start of the table and continues from there.
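The selection loop described above might look roughly like this. It is a sketch under stated assumptions, not the SRSG code: `drawSample` is a hypothetical name, item numbers are assumed to be 1-based, and the loop stops after one full pass through the table, since a second pass would only revisit numbers that were already used or rejected.

```javascript
// Draw up to `sampleSize` records by walking the random number table
// from position `seed`, wrapping to the start when the end is reached.
function drawSample(records, table, seed, sampleSize) {
  const digits = String(records.length).length;
  const wanted = Math.min(sampleSize, records.length);
  const chosen = new Set(); // item numbers already moved to the output
  const output = [];

  for (let step = 0; step < table.length && output.length < wanted; step++) {
    const entry = table[(seed + step) % table.length]; // wrap around
    const itemNumber = parseInt(entry.slice(-digits), 10);

    // Abandon numbers that point outside the list (assumed 1-based).
    if (itemNumber < 1 || itemNumber > records.length) continue;
    // Abandon numbers whose record was already taken.
    if (chosen.has(itemNumber)) continue;

    chosen.add(itemNumber);
    output.push(records[itemNumber - 1]);
  }
  return output;
}

// 2,000 placeholder records named "record-1" ... "record-2000".
const records = Array.from({ length: 2000 }, (_, i) => `record-${i + 1}`);
const table = ["27335", "73665", "61085", "60456"];
console.log(drawSample(records, table, 0, 2)); // [ 'record-1085', 'record-456' ]
```

Here 7335 and 3665 are abandoned because they exceed 2,000, 1085 is accepted, and 0456 is read as item 456.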

Because the input array is indexed without leading zeros, the script strips any leading zeros from the random numbers it finds. This is mechanically different from the process described by Gay, Mills, and Airasian, but it has the identical functional effect. In our 2,000-item example, Gay, Mills, and Airasian would take item number 456, say, and pad it with a leading zero to make it 0456. If the random number 60456 is chosen, their process would take the last four digits, 0456, and match them to that padded item number.

In SRSG, the same item number, 456, would not be padded with an additional zero. When the same random number is chosen, 60456, the script first takes the last four digits, 0456, then it identifies the leading zero and strips it off, leaving 456. This approach was implemented primarily because of the way the JavaScript language stores and indexes data internally.
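The equivalence of the two approaches can be illustrated with a short sketch. Both function names are hypothetical; the first mimics the padded matching described in the textbook, and the second mimics the strip-the-zeros matching described for SRSG, here done with `parseInt`.

```javascript
// Textbook-style matching: pad the item number, compare as strings.
function matchesPadded(itemNumber, tableEntry, digits) {
  const padded = String(itemNumber).padStart(digits, "0");
  return padded === tableEntry.slice(-digits);
}

// SRSG-style matching: strip leading zeros, compare as numbers.
function matchesStripped(itemNumber, tableEntry, digits) {
  return itemNumber === parseInt(tableEntry.slice(-digits), 10);
}

console.log(matchesPadded(456, "60456", 4));   // true
console.log(matchesStripped(456, "60456", 4)); // true
console.log(matchesPadded(456, "27335", 4));   // false
console.log(matchesStripped(456, "27335", 4)); // false
```

Passing the radix 10 to `parseInt` makes the decimal interpretation explicit, so a string such as "0456" is read as 456.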

Once the output data set is complete, the array is formatted for output and displayed in the right text area.

Report

As soon as the results are generated, a report is displayed in a pop-up window. The purpose of this report is to confirm and document the use of the random sampling script, and to provide a means of verifying the results without divulging any confidential information that might exist in the input or output data sets. The report indicates the number of original input records and output records. For both the input and output records, a one-way SHA-256 hash is generated. If verification is needed that a certain input or output was processed using the SRSG script, a SHA-256 hash can be generated from the input or output data and compared with the documented hash. The report also includes the pseudo-random seed, the first random number selected based on the seed, and the web browser used.
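To verify a documented hash independently, something like the following could be used. This sketch assumes Node.js and its built-in `crypto` module, which is a swapped-in technique: SRSG itself runs in the browser and uses a separate JavaScript SHA-256 implementation. The helper name `sha256Hex` is hypothetical.

```javascript
const { createHash } = require("crypto");

// Return the SHA-256 digest of a string as lowercase hexadecimal.
function sha256Hex(text) {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Well-known test vector for the three-character string "abc".
console.log(sha256Hex("abc"));
// ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
```

The hash only matches if the text is byte-for-byte identical, so differences in line endings or trailing whitespace between the pasted data and the re-hashed data will produce a different value.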

Technical details

The random number table used in SRSG was created by RAND Corporation (RAND Corp., 2001). The table has 200,000 five-digit numbers, for a total of 1 million digits. RAND Corporation provides documentation that explains how the random numbers were generated and validated.

The JavaScript implementation of the SHA-256 one-way hash was developed by Chris Veness (2009).

The SRSG script was tested in Firefox and Safari on Mac OS X. A set of 100,000 records took a minute or two to process on a laptop with a 2.2 GHz CPU and 2 GB of RAM. At 200,000 records, the web browser seemed to stall. The script was written and tested primarily to create an output of 400 records. Very large output sets may reveal shortcomings in the script, so they are not recommended. The script is intended to let the researcher create relatively small data sets from relatively large ones.

References

Gay, L. R., Mills, G. E., & Airasian, P. W. (2009). Educational research: Competencies for analysis and applications (9th ed.). Upper Saddle River, New Jersey: Prentice Hall.

RAND Corp. (2001). Datafile: A million random digits. Pittsburgh, PA: RAND Corp. Retrieved from http://www.rand.org/pubs/monograph_reports/MR1418/

Veness, C. (2009). SHA-256 Cryptographic Hash Algorithm. Retrieved from http://www.movable-type.co.uk/scripts/sha256.html
