
The Relative Efficiency
of Two-Stage Testing Versus Traditional Multiple Choice Testing
Using Item Response Theory
in Licensure


Reed A. Castle


Presented to the Faculty of
The Graduate College at the University of Nebraska
In Partial Fulfillment of the Requirements
For the Degree of Doctor of Philosophy

Major: Psychological and Cultural Studies

Under the Supervision of Professor Barbara S. Plake

Lincoln, Nebraska

November, 1997


This research applied two-stage testing (TST) to a licensure examination. Using actual data from an allied health profession’s licensure examination (N = 4,611), various combinations of TST were compared to the fixed-length traditional multiple choice (TMC) exam and various “best” shorter tests. The variables evaluated were: (a) Routing test length (nt = 13 and 20), (b) Measurement test length (nt = 24 and 47), (c) the number of Measurement tests (2 and 3), (d) IRT model choice (a three parameter logistic model with all parameters estimated (3p-V) and a three parameter logistic model with a fixed c parameter (3p-F)), (e) the effects of intentional misrouting, and (f) the effects of shifting the cut score up one standard error (se) or down one se on the baseline exams while maintaining the correct cut score on the TST, also known as Boundary Theory analysis. To evaluate the relative efficiency of TSTs versus TMC examinations, three indices were used: (a) kappa (decision consistency) based on pass/fail status, (b) RMSE evaluated across ten levels of the theta continuum, and (c) BIAS evaluated across the same ten levels of the theta continuum as RMSE. Actual routing errors were also evaluated. The results of this study indicate that the 3p-F model was associated with decision consistency indices superior to those of the 3p-V model. Increasing the number of items in either the Routing or Measurement test led to higher kappa values. Although actual routing errors were high (13% to 31%), IRT appeared to rectify some of the negative effects of routing error in terms of decision consistency.

Chapter I


Licensure Testing and Development

Licensure testing has become increasingly popular in the latter part of the twentieth century (Fortune, 1985). For a variety of reasons it is often necessary to evaluate individuals’ competencies for entering a new career. These reasons may include, but are not limited to, (a) liability or other legal issues, (b) maintenance of minimum competencies necessary to perform in certain vocations, and (c) licensure regulations. It is believed that proficiency testing began as early as 2200 B.C., when employees of a Chinese emperor were tested to determine their fitness for serving the emperor (DuBois, 1970; Teng, 1943, cited in Gregory, 1992). In 1986, it was estimated that at least 800 professions were licensed in the United States (Schmitt, 1995).

The goal of licensure testing is to identify those individuals who possess minimum competencies related to a specific vocation. Licensure tests are typically mandatory evaluations of specific competency standards for all beginning practitioners in a given profession (Schmitt, 1995). That is, all licensed professionals must have passed a given licensure exam prior to practicing in that vocation. A critical issue in licensure test (LT) development is creating a test that minimizes classification errors between those who possess minimal competency and those who do not. A passing score is used as a cut point to determine who is considered minimally competent and who is not.

Licensure testing is considered high stakes. Test developers need high quality and high precision in assessment because the decision based on test scores (pass/fail) is critical to the public (i.e., the public does not want incompetent practitioners to be licensed, or competent practitioners to be denied licenses). Advances in measurement have made possible increased precision and more efficient assessment through adaptive testing methodology, but these advances can be costly in resource demands. The purpose of this study is to investigate the viability of a variation on licensure test development that retains some of the advantages of adaptive testing methodology while reducing some of its disadvantages.

Typically, there are five basic steps in the development of a test designed for licensure: (a) job analysis, (b) item writing, (c) examination development, (d) setting the passing score, and (e) scoring (estimating examinee ability). The fifth step, scoring, is discussed more thoroughly in the section on item response theory.

Job Analysis

The first step in developing a licensure test (LT) is a job analysis. Test developers must evaluate and determine what specific and essential content areas or domains exist in the licensed profession. Using surveys and/or interviews, test developers obtain information from professionals and experts currently practicing in the vocation to be licensed. The information gained from the experts is usually transformed into task statements, which identify the behaviors or practices used within a given profession. These task statements are then used to create the content areas for the exam. From the identified content areas, a table of specifications is developed. The table of specifications is a guide used to assign the number or proportion of questions per content area to the total exam. Often, subject matter experts (SMEs) determine the essential content domains to comprise the table of specifications and then decide on the number of questions that should represent a given content domain.

Item Writing

After the content areas and their representative proportions of questions have been determined from the job analysis, the next step in developing a LT is to write the test items. Frequently, SMEs are recruited to write items in one or more content areas; usually SMEs write exam items in their area(s) of expertise. An item writing workshop is sometimes needed to cover the main item writing principles so that the item writers have a better understanding of item writing practices. Typically, item writing is an iterative process that goes through many stages of item revision. During the item writing process, SMEs may be called upon to review fellow SMEs’ items; such peer review may address relevancy, difficulty, and readability. Finally, the test developer, in conjunction with the SMEs, reviews items to assure good format and consistency.

Examination Development & Item Selection

Assuming items have been written and pre-tested, the next step in developing a LT exam is to construct the test. Following the table of specifications, items are selected to represent the number of items per content area. Traditionally when developing a LT using item response theory, items are selected to maximize the quality of assessment (information) in the vicinity of the passing score.

Selecting items for a test is generally based on either an Optimal Item Selection (OIS) method or a Content Optimal Item Selection (COIS) method (Hambleton, Arrasmith, & Smith, 1987). Both of these item selection methods are designed to maximize information at the cut score, which is achieved by choosing items whose parameters minimize measurement error around the cut score. OIS builds the test to maximize information at the cut score without regard for content balancing, while COIS maximizes information while maintaining the content specifications, or table of specifications. COIS is the preferred method for selecting items when content specifications exist, as is often the case in licensure testing.
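The COIS idea can be sketched as a greedy selection: within each content area, keep the items that are most informative at the cut score, up to that area's quota from the table of specifications. The function and data structures below are hypothetical illustrations, not the procedure used in this study:

```python
def cois_select(pool, quotas, info_at_cut):
    """Greedy COIS sketch: within each content area, keep the items with
    the highest information at the cut score, up to the area's quota
    from the table of specifications.
    pool: list of (item_id, content_area); quotas: {area: n_items};
    info_at_cut: {item_id: information at the passing score}."""
    selected = []
    for area, n in quotas.items():
        candidates = [iid for iid, a in pool if a == area]
        # Sort this area's items by information at the cut score.
        candidates.sort(key=lambda iid: info_at_cut[iid], reverse=True)
        selected.extend(candidates[:n])
    return selected

pool = [(1, "A"), (2, "A"), (3, "B"), (4, "B")]
quotas = {"A": 1, "B": 1}
info_at_cut = {1: 0.2, 2: 0.9, 3: 0.5, 4: 0.1}
print(cois_select(pool, quotas, info_at_cut))  # [2, 3]
```

Dropping the quota loop and sorting the whole pool at once would yield the OIS variant, which ignores content balance.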

Selecting the Passing Score

The next step in the development of a LT is the setting of a passing score, or passing standard. The passing score is one of the most important issues in licensure testing, as it is used to determine who is minimally competent in a given field and who is not. One of the greatest concerns when developing a test that utilizes a passing standard is the minimization of classification errors (i.e., passing someone who is not competent or failing someone who is competent). Through appropriate item selection approaches it is possible to maximize a test’s precision around the passing score; maximizing information at the passing score should therefore serve to reduce classification errors.


Usually examinees must meet certain requirements (e.g., educational or experiential) prior to actually taking the exam. After fulfilling the requirements to become a candidate for the exam, candidates take the examination. Once the test has been administered, the candidate’s score on the assessment must be determined and compared to the passing standard to ascertain whether he/she passed the test. Scoring can be as simple as summing the number of correctly answered questions, or more complicated, as is the case with assessments based on item response theory. Regardless of the procedure used to determine the candidate’s score, typically with licensure tests it is the candidate’s performance relative to the passing score that is of critical importance, as that determines whether the candidate passes (and is therefore qualified for a license) or fails (and is therefore not granted a license).

Evaluating the Quality of Licensure Examinations

Decision Consistency. To help evaluate relative classification errors across exams, decision consistency is often used. Decision consistency (DC) can be viewed as the extent to which identical decisions are made on two or more different, but parallel, exams (Crocker & Algina, 1986). Usually, DC is based on a contingency table analysis of the pass or no pass decisions made from the examinees’ test scores. The rows and columns of the DC contingency table are the decisions (pass or no pass) on each of the two test forms. Given two exams, there are four possible outcomes: (a) pass time 1 & pass time 2, (b) pass time 1 & no pass time 2, (c) no pass time 1 & no pass time 2, and (d) no pass time 1 & pass time 2. A kappa coefficient can be used as an index of decision agreement across test forms.
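The two-by-two layout above maps directly onto Cohen's kappa: observed agreement corrected for the agreement expected by chance. A minimal sketch (the function name and the 1 = pass, 0 = fail coding are assumptions for illustration):

```python
def kappa(decisions_a, decisions_b):
    """Cohen's kappa for pass/fail decisions on two test forms.
    Each decision vector codes pass as 1 and fail as 0."""
    n = len(decisions_a)
    # Observed agreement: proportion of identical decisions.
    observed = sum(x == y for x, y in zip(decisions_a, decisions_b)) / n
    pass_a = sum(decisions_a) / n
    pass_b = sum(decisions_b) / n
    # Chance agreement: both pass by chance plus both fail by chance.
    chance = pass_a * pass_b + (1 - pass_a) * (1 - pass_b)
    return (observed - chance) / (1 - chance)

print(kappa([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0  (perfect agreement)
```

Kappa of 1 indicates perfect decision agreement, 0 indicates chance-level agreement, and negative values indicate systematic disagreement.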

Licensure Test Efficiency. When developing a LT, a test developer must consider what is cost effective and what is measurement efficient. Typically, the ideal situation is to construct a test that reduces measurement error to an acceptable level while not expending a large amount of resources. There is a trade-off between efficiency and cost effectiveness: if more resources are available, LTs can become more precise in measurement. Some factors that contribute to the cost effectiveness of LTs are: (a) administration mode (traditional pencil and paper versus computer), (b) the number of available items, or size of the item pool, (c) scoring mechanism, and (d) the number of candidates to be tested. Item Response Theory has played a pivotal role in the move toward more efficient testing practices.

Item Response Theory

Item Response Theory (IRT) is a relatively modern approach to measurement. An important feature that differentiates IRT from Classical Test Theory (CTT) is the property of invariance: when the model fits the data, item parameters and ability estimates are stable across examinee populations. In this sense IRT models are not sample dependent, while CTT models are.

There are three item parameters that are estimated when using a three parameter logistic IRT model. The parameters are: (a) item discrimination, denoted as the “a” parameter, (b) item difficulty, denoted as the “b” parameter, and (c) the pseudo guessing parameter, denoted as the “c” parameter. Item discrimination is an index of the item’s ability to discriminate among examinees (i.e., the ability of an item to differentiate more able examinees from less able examinees). The item difficulty parameter is an index of the difficulty of the item; b parameter values can range across the theta (θ) ability scale. Low or negative b values indicate relatively easy items, while higher positive b values indicate more difficult items. The c parameter is an index of the probability that a very low-ability examinee will correctly answer an item (possibly by guessing). Lower c values indicate a lower probability of a correct guess, while higher c values indicate a higher probability of answering the item correctly by guessing.
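These three parameters combine in the three parameter logistic item response function. The sketch below uses the conventional scaling constant 1.7 and illustrative parameter values (not items from this study):

```python
import math

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model:
    P(theta) = c + (1 - c) / (1 + exp(-1.7 * a * (theta - b)))."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

# An examinee whose theta equals the item difficulty b answers with
# probability c + (1 - c) / 2, halfway between guessing and certainty.
print(round(p_3pl(0.0, a=1.0, b=0.0, c=0.2), 3))  # 0.6
```

The probability rises monotonically with theta, from a floor of c for very low-ability examinees toward 1 for very high-ability examinees.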

The Item Information Function (IIF) incorporates all three of the item’s parameters to produce an index of the information or measurement precision for a given item at various values of examinee ability (theta). The Test Information Function (TIF) is the sum of all item information functions across the items on any given test. The TIF, when used in test development, provides information that indicates the range of theta values for which the test is most efficient and relatively free from error. Moreover, decisions about where information should be relatively maximized on the theta scale can be easily accomplished using a TIF.

Although IIFs indicate where an item performs most efficiently, two more specific indices will be used to select items. Thetamax is an index that considers all three item parameters to identify the specific point on the theta continuum where the item is most informative. Maximum information, or Maxinfo, indicates how much information an item provides at that point. Using these two indices, developing a LT that minimizes classification errors becomes a relatively less difficult task.
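Both indices can be recovered numerically from an item's parameters. A minimal grid-search sketch (the function name is hypothetical, and the parameter values are illustrative):

```python
import math

def item_info(theta, a, b, c):
    """3PL item information at theta."""
    p = c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))
    return (1.7 * a) ** 2 * (p - c) ** 2 * (1 - p) / ((1 - c) ** 2 * p)

def thetamax_maxinfo(a, b, c, lo=-4.0, hi=4.0, step=0.001):
    """Locate Thetamax (the theta where the item is most informative)
    and Maxinfo (the information at that point) by fine grid search."""
    best_theta, best_info = lo, 0.0
    t = lo
    while t <= hi:
        info = item_info(t, a, b, c)
        if info > best_info:
            best_theta, best_info = t, info
        t += step
    return best_theta, best_info

print(thetamax_maxinfo(1.0, 0.5, 0.0))
```

When c = 0, Thetamax coincides with the item difficulty b; a positive c shifts Thetamax somewhat above b.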

Scoring (Estimating Examinee Ability) Using Item Response Theory

When scoring a licensure test in order to estimate examinees’ proficiency levels, a test developer should address two primary issues: (a) the number of items used in estimating examinee ability, and (b) the choice of IRT model.

The first issue relates to the number of items used for estimating examinee ability. It is well known that as the number of psychometrically sound items increases, so does the precision of measurement. Therefore, when scoring a LT, the scoring should utilize as many items as necessary to attain the desired level of measurement precision. While adhering to the table of specifications, the ideal licensure test length is one that allows for accurate examinee proficiency estimates while reducing measurement error to a satisfactory level.

IRT model choice is the next issue that should be evaluated when deciding how to score a licensure test. Assuming the test data are unidimensional, it is necessary to decide what IRT model to employ when obtaining examinee ability estimates and item parameter estimates. There are three basic IRT scoring models, each with variations. Hambleton and Swaminathan (1985) provide seven considerations for selecting an IRT model: a) model data fit, b) sample size, c) quality of data, d) availability of resources, e) choice of estimation procedure, f) availability of computer programs, and g) the methods used to assess model fit.

The first consideration for selecting an IRT model is whether to choose an IRT model to fit the data or to choose/edit the data to fit the model. Among other factors, this consideration is dependent on the number of available pre-tested items. If there is a large number of items that have been calibrated with a large sample, then editing the data to fit the model is much easier.

Sample size is the second consideration. If the sample size is less than two hundred, test developers may be limited to a one parameter (the item difficulty or b parameter) logistic IRT model. Large sample sizes allow the luxury of choosing from more of the IRT models and their variations.

The third consideration is the quality of data. If the sample is such that there are few examinees in the low ability level, as may be the case in licensure testing, estimations of the pseudo guessing parameter may not provide any relevant information. In such cases it is recommended that the c parameter be fixed at a predetermined value (Hambleton & Swaminathan, 1985).

Next, the availability of resources must be evaluated. When determining the IRT model choice for a LT, the availability of resources is extremely important. Large sample sizes are needed for all 3 parameter logistic IRT models, and therefore computers, scanning equipment, and computer software needs must be evaluated. The other IRT models (1 and 2 parameter logistic IRT models) need fewer subjects to calibrate item parameters and examinee estimates and may not need scanning equipment, but they still require a computer and IRT calibration software.

The fifth consideration put forth by Hambleton and Swaminathan (1985) is the choice of estimation procedure. Two common IRT estimation procedures are maximum likelihood estimation and Bayesian estimation; Bayesian estimation allows the use of priors, or distributional assumptions, to begin the estimation process. The sixth consideration is the availability of computer programs. With the rapid developments in computer software, many IRT programs exist. A recently developed program (X-CALIBRE, by Assessment Systems Corporation) costs about $400.00 and runs on a Windows platform, which allows relatively easy operation. Other IRT estimation programs include, but are not limited to, BILOG and BILOG for Windows, MULTILOG, and various Rasch model (1 parameter IRT model) computer programs.
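As an illustration of maximum likelihood estimation under the 3PL model, the sketch below finds the theta that maximizes the log-likelihood of a 0/1 response pattern by grid search. Operational programs such as BILOG use far more sophisticated numerical methods; the function name and the item parameters here are illustrative assumptions:

```python
import math

def p_3pl(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def mle_theta(responses, items, lo=-4.0, hi=4.0, step=0.01):
    """Grid-search maximum likelihood theta estimate: evaluate the
    log-likelihood of the response pattern at each grid point and
    return the theta with the highest value."""
    best_theta, best_ll = lo, -float("inf")
    t = lo
    while t <= hi:
        ll = 0.0
        for u, (a, b, c) in zip(responses, items):
            p = p_3pl(t, a, b, c)
            ll += math.log(p) if u else math.log(1.0 - p)
        if ll > best_ll:
            best_theta, best_ll = t, ll
        t += step
    return best_theta

# Three illustrative items; the examinee answers the two easier ones.
items = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]
print(mle_theta([1, 1, 0], items))
```

A Bayesian variant would add a log-prior term (e.g., standard normal) to the log-likelihood before taking the maximum.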

The seventh and final consideration is evaluation of model fit. Model fit assessment can be accomplished in three basic ways: a) evaluating item parameter estimate invariance, b) assessing ability parameter estimate invariance, and c) using a goodness-of-fit statistic to assess the discrepancy between observed and expected scores (e.g., the Q1 statistic; Yen, 1981).

Test development using Test Information Functions is another consideration for IRT model selection. If a one parameter logistic model is used, only the item difficulty parameter is considered when constructing a test; item discrimination is totally neglected. A two parameter logistic model utilizes both item difficulty and item discrimination, thus using more information to construct the TIF. The three parameter logistic model utilizes item difficulty (b), item discrimination (a), and a guessing parameter (c). The item discrimination parameter is more robust to violations of model data fit than the pseudo guessing parameter. Large c parameter values tend to decrease the amount of information an item contributes to the total TIF, which in turn makes the test less efficient. If the sample size is sufficient and the test consists exclusively of multiple choice items scored right or wrong, either a three parameter logistic model or a modified three parameter logistic model is a logical choice.

When its assumptions are met, IRT has been shown to provide more accurate estimates of examinee proficiency than Classical Test Theory; however, the goal of a licensure test is to minimize classification errors, not necessarily to provide highly accurate proficiency estimation across the full ability continuum. Traditionally, LTs have minimized classification errors by building assessments with high levels of information in the vicinity of the cut score. Building a LT with information maximized immediately near the cut score, in conjunction with obtaining relatively accurate (low measurement error) examinee proficiency estimates, would reduce classification errors. By simultaneously reducing the standard error around the cut score and the standard error around the proficiency estimate, examinee classification should become more accurate and classification errors less frequent.

Adaptive Testing

With recent developments in technology and measurement theory, some licensure testing programs are considering moving toward adaptive testing. One of the greatest limitations in making the transition from traditional multiple choice testing procedures to adaptive testing procedures is resources. Pencil and paper tests are scored and administered more cheaply, while adaptive testing procedures that use computers often require more resources. When compared to traditional multiple choice testing techniques, computerized adaptive testing techniques require: a) relatively larger item banks, b) relatively more resources for computer hardware and software, and c) a mode or system for test delivery. In addition, in order to apply IRT, the construct being measured must be sufficiently unidimensional. For these reasons, several licensure testing programs have delayed the implementation of computer adaptive testing.

Developing a computerized adaptive LT is similar to developing a traditional pencil and paper test, but has an added step. Adaptive testing requires examinees to be matched with items or tests that are relatively similar to the examinee’s proficiency level; therefore, an additional step in the testing process is the individualization of the test to the examinee. In addition, when a LT is adaptive, a mechanism must be implemented to ensure the adaptive test adheres to the original table of specifications. Kingsbury and Zara (1991) indicated that the price for content balancing a computer adaptive test is paid in an increase in the number of required items when compared to adaptive exams that are not content balanced.

Adaptive testing techniques provide test developers with the opportunity to develop more efficient tests that have increased measurement precision. In a licensure application, the goal in using an adaptive strategy is to reduce measurement error while administering as few items as needed to determine a testing candidate’s competency in relation to the cut score. This goal is more easily achieved when resources are large; when they are not, the cost effectiveness of this approach becomes an issue.

Adaptive Testing Strategies

Two general forms of adaptive testing exist. The first is a format in which items are selected using a pre-structured item selection technique; two-stage testing falls into this category. The operative term is pre-structured: a two-stage test has its items pre-selected so that the various adaptive tests exist prior to administration. A two-stage adaptive exam has one general “Routing” test, followed by one of a set of longer “Measurement” tests. The Measurement tests differ in overall difficulty, so an examinee is assigned to the Measurement test whose difficulty level is most consistent with the proficiency estimate obtained from the Routing test. Earlier attempts at TST using CTT were not very successful (e.g., Cleary, Linn, & Rock, 1968; Lord, 1971); however, IRT provides for a more promising application (Kim & Plake, 1993).
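Once the Routing-test theta estimate is in hand, the routing step itself is a simple comparison against boundary values. A sketch with three hypothetical Measurement tests (the function name and cutpoints are illustrative assumptions):

```python
def route(routing_theta, cutpoints):
    """Assign an examinee to a Measurement test from the theta estimate
    obtained on the Routing test. cutpoints are the boundaries between
    adjacent Measurement tests, in ascending order. Returns the index
    of the assigned Measurement test (0 = easiest)."""
    index = 0
    for cut in cutpoints:
        if routing_theta >= cut:
            index += 1
    return index

# Three Measurement tests separated at theta = -0.5 and 0.5:
print(route(-1.2, [-0.5, 0.5]))  # 0  (easy form)
print(route(0.0, [-0.5, 0.5]))   # 1  (middle form)
print(route(0.9, [-0.5, 0.5]))   # 2  (hard form)
```

With two Measurement tests there is a single cutpoint; the routing error discussed later corresponds to an estimate falling on the wrong side of a cutpoint relative to the examinee's true ability.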

The second adaptive testing format is Computer Adaptive Testing (CAT). This format refines the examinee ability estimate after each item response. In the administration of a computer adaptive test, items are selected sequentially depending on the most recent ability estimate. The goal of CAT is to minimize testing time and decrease measurement error by constantly refining examinee ability estimates and matching those estimates with items of similar difficulty. The test is therefore “constructed” dynamically as the examinee takes it; these are not pre-structured assessments, as was the case in two-stage testing.

Adaptive Testing Using the Computer

With the rapid developments in computer technology and measurement theory, an adaptive testing strategy can accomplish an assessment in less time while providing more accurate and reliable scores. Estimating an examinee’s ability has become a relatively easier task with the advances in Item Response Theory in conjunction with advances in computers.

Green, Bock, Humphreys, Linn, and Reckase (1984) provide the following four components needed for developing computer adaptive tests: a) a pool of items from which to develop a test, b) a criterion for selecting items, c) a method for scoring the test, and d) a decision of when the test is finished, or a stopping rule. Computer adaptive testing is often applied to a testing process that makes item selection decisions after each item is answered; that is, a CAT can be viewed as a special case of TST that uses each item as a Routing test. Measurement precision is refined after each item is presented and answered. This method is more efficient than TST and therefore should provide results with higher accuracy. Depending on the purpose of the test, CAT can be an ideal method of testing, but there are a few limitations to this method.
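The four components map onto a very small simulation loop. The sketch below is a toy illustration only: it uses a fixed-length stopping rule and a crude up/down step update in place of the likelihood-based scoring an operational CAT would use, and every name and parameter value is assumed:

```python
import math

def item_info(theta, a, b, c):
    """3PL item information at theta."""
    p = c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))
    return (1.7 * a) ** 2 * (p - c) ** 2 * (1 - p) / ((1 - c) ** 2 * p)

def cat(answer, pool, max_items=10):
    """Toy CAT loop with the four components: an item pool, a
    maximum-information selection criterion, a crude up/down scoring
    update, and a fixed-length stopping rule.
    `answer(item) -> 0/1` stands in for administering the item."""
    theta, used = 0.0, set()
    for k in range(max_items):
        # Selection criterion: most informative unused item at theta.
        i = max((j for j in range(len(pool)) if j not in used),
                key=lambda j: item_info(theta, *pool[j]))
        used.add(i)
        correct = answer(pool[i])       # administer the item
        step = 1.0 / (k + 1)            # shrinking step size
        theta += step if correct else -step
    return theta

pool = [(1.0, b / 2.0, 0.2) for b in range(-4, 5)]  # 9 items, b in [-2, 2]
print(round(cat(lambda item: 1, pool, max_items=5), 2))  # 2.28
```

An examinee who answers every item correctly is driven upward on the theta scale, and one who misses every item is driven downward, mirroring how a real CAT homes in on ability.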

There are three limitations that should be considered when deliberating the use of a CAT. The first is that a computer is required to administer the exam. The second relates to the resources necessary to implement a CAT: the cost of computers, the cost of going on line or otherwise administering the test, and the cost of computer software must all be considered when contemplating a CAT testing program. The third, assuming the construct is sufficiently unidimensional, is the need for content balancing; depending on the item selection algorithm, a larger item bank may be necessary for a CAT to fulfill the content requirements of a licensure test. All of these limitations can be addressed effectively, but at the expense of a larger investment of resources and time.

Traditional Multiple Choice Versus Adaptive Testing

With the introduction of adaptive testing, traditional multiple choice (TMC) testing procedures may no longer be the most efficient mode for testing candidates. When the necessary assumptions are met, the only advantage that may remain for TMC tests is the lower cost associated with scoring and administering them.

When compared to adaptive testing (both CAT and TST), some limitations of TMC testing procedures are: a) TMC tests must be substantially longer in order to achieve similar measurement precision, b) TMC tests cannot match test items to an examinee’s ability, and c) total examination time is greater. Traditional multiple choice tests are fixed length, have broad coverage of the ability scale, are not adaptive, and are generally longer than adaptive tests.

Two-Stage Testing Versus Computerized Adaptive Testing

An IRT based TST may resolve some of the limitations of CAT testing. Using two-stage testing, the test developer has more control over test development and can therefore ensure compliance with the table of specifications. In addition, particularly with licensure testing, it is necessary to have items that relate directly to criteria developed from the job analysis. Because there might be a large number of content areas, a test developer using CAT may expend more resources developing extra items with different item difficulty or Thetamax values for a specific content area in order to guarantee that the area’s requirement is met. With the full control during test development afforded by TST, these requisite content areas can be included in the Routing test, which is common to all examinees. For this reason, two-stage testing may be a better application for licensure testing programs that do not have the resources to develop CAT testing programs.

Limitations of TST

One limitation of two-stage testing is the potential for routing error. Routing error results from assigning an examinee to an inappropriate Measurement test. This error could have major ramifications for examinees, including misclassification. Routing error can result from routing too high or routing too low: routing too high occurs when an examinee is assigned to a Measurement test that is above the individual’s ability, while routing too low assigns an examinee to a Measurement test that is below his or her actual ability. If a candidate is misrouted, the candidate may pass the test when he or she should have failed, or fail when he or she should have passed. Earlier studies of TST based on CTT (e.g., Lord, 1971) found substantial magnitudes of misclassification.

Another limitation of TST is related to selecting items. Because Routing tests are supposed to be as short as possible, and should be representative of the test specifications, tests with a large number of content areas may make a relatively short Routing test difficult to assemble. Deciding on the proportion of questions from any given content area to be used in the Routing test may be a difficult task.

TST and Licensure Tests

Given the tradeoff between cost effectiveness and measurement precision that often faces licensure test development, TST may provide an attractive alternative to CAT and TMC testing procedures. Because the Routing test typically contains relatively few items, it is often possible to score it by hand and, using a look-up table to obtain a theta (ability) estimate, assign the examinee to the longer Measurement test. Alternatively, the Routing tests could be scored on site using an optical scanner and a laptop computer. TST allows test developers to build fixed tests so that the items providing the most information can be used for the Routing and the various Measurement tests. Because of this, a test developer using a TST strategy might write just one essential item for a given domain and place it in the Routing exam, thus administering the item to every examinee. A TMC test forces the test developer to use more items than would be necessary to achieve the same level of precision on a TST. Developing a test using TST methods may help a contractor develop a set of tests that yields more reliable results than traditional multiple choice tests. Overall, the developer will likely use the same number of items or more across examinees, but individual candidates will likely take fewer items and receive more precise proficiency estimates than afforded by TMC.

Design Features of a TST for Licensure Testing

The optimal shape of the TIF will most likely differ for the Routing and Measurement tests. The Routing test should have a Test Information Function that covers a broad range of abilities. In licensure testing applications, however, the TIF of the Routing test should not be strictly uniform, because most licensure test candidates do not possess abilities in the extreme regions of the theta scale. In theory, an ideal Routing test for a pass/fail exam should be at least dome-like, so that more information is located at or near the passing standard while high information is still maintained across the rest of the theta scale.

The different Measurement tests, on the other hand, should be developed for differing levels of ability; therefore each Measurement test’s information will be maximized at a different point on the ability continuum. The different Measurement tests will reduce the error associated with candidates’ proficiency estimates, which should reduce pass/fail classification errors. Scoring based on the combined Routing and Measurement tests will have higher measurement precision than scoring based on the Measurement test alone, because more items contribute to the estimate. TST may offer test developers some of the advantages of CAT with a smaller investment of resources.

Purpose of the Study

The purpose of this study was to evaluate the effectiveness of TST methods, in terms of decision consistency and accuracy, using real data from a large licensure testing program. The goal was, using IRT, to compare the accuracy of several TST configurations with one another, with the original TMC test, and with two non-adaptive optimal exams. The two optimal exams were created to maximize information at the cut score and were approximately 52% and 74% of the original test length. The TMC version of this licensure program's assessment consisted of 138 usable items spanning 13 content categories. Two items did not fit the 3 parameter IRT model and were therefore discarded; BILOG could not obtain stable estimates of their IRT parameters, although they would otherwise have been used in the study. There were a total of three "baseline" exams (original 138 items, 52% optimal, and 74% optimal). Using this pool as the total source of items, various TST configurations were constructed and the accuracy of their decisions compared. The variables considered were: a) IRT model type, b) the length of the Routing test, c) the length of the Measurement tests, d) the number of Measurement tests, e) routing error, and f) the shift of the cut score one standard error above and below the original cut score. Each variable is described below:

Comparison of IRT model. Two IRT models were evaluated under all conditions. The first was a three parameter logistic model with the c parameter free to vary during estimation (3p-V). The second was a three parameter logistic model with the c parameter fixed at a common value (3p-F).

Comparison of Routing test length. Two Routing test lengths were evaluated: 13 items (9.4% of the total test) and 20 items (14.5% of the total test). The number of items was determined using a representative sampling technique over the content areas of the licensure exam used for this study: the total number of items within each content area was multiplied by a fixed percentage to arrive at the number of items selected from that area. For the first Routing test length, one item was selected from each content area.
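The content-representative sampling step can be sketched directly. The content-area names and counts below are hypothetical (the actual exam had 13 content areas and 138 items); only the multiply-and-round procedure mirrors the study:

```python
# Hypothetical item counts per content area.
content_counts = {"area_a": 20, "area_b": 15, "area_c": 10, "area_d": 8}

def items_per_area(counts, fraction):
    """Multiply each area's item count by a fixed fraction, rounding and
    keeping at least one item, to build a content-representative short test."""
    return {area: max(1, round(n * fraction)) for area, n in counts.items()}

# A fraction of roughly 9.4% mirrors the shorter Routing test in the study.
plan = items_per_area(content_counts, 0.094)
print(plan, "total:", sum(plan.values()))
```

The `max(1, ...)` floor reflects the study's first Routing test, in which every content area contributed at least one item.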

Comparison of Measurement tests. Two characteristics of the Measurement tests were evaluated. The first was length: 24 items (18.1% of the total test) and 47 items (33.3% of the total test) were considered, with items selected using the same sampling technique as for the Routing test. The second was the number of Measurement tests, with sets of 2 and 3 evaluated. Because of the large number of content areas, and the limited number of items within some of them, only 2 and 3 Measurement tests were considered.

Effect of Routing Error. To evaluate the effect of routing error, three routings were used: routing high, routing low, and correct routing. Candidates were intentionally misrouted to the Measurement test one level above or below the test assigned on the basis of their Routing test performance. Because misrouting can occur in only one direction when there are two Measurement tests, routing error was evaluated only in the 3 Measurement test condition. Decision consistency (kappa) relative to the three baseline examinations was used to evaluate routing error.

Standard Error of Cut Score. To evaluate the effect of measurement error in the placement of the cut score, the cut score was shifted one standard error up and down on the 138 item baseline exam to observe any trends in classification errors. A total percentage across the two error types was calculated.
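The cut-score shift can be illustrated by reclassifying candidates under the shifted cuts and counting decisions that change. The theta estimates, cut score, and standard error below are hypothetical:

```python
# Hypothetical theta estimates, cut score, and standard error (SE).
thetas = [-0.8, -0.3, -0.1, 0.05, 0.2, 0.6, 1.1]
cut, se = 0.0, 0.15

def classify(theta_list, cut_score):
    """Pass/fail classification at a given cut score."""
    return ["pass" if t >= cut_score else "fail" for t in theta_list]

baseline = classify(thetas, cut)
for shifted_cut, label in ((cut + se, "cut + 1 SE"), (cut - se, "cut - 1 SE")):
    shifted = classify(thetas, shifted_cut)
    flips = sum(b != s for b, s in zip(baseline, shifted))
    print(f"{label}: {flips} of {len(thetas)} classifications change "
          f"({100 * flips / len(thetas):.1f}%)")
```

Only candidates whose theta estimates lie within one SE of the cut can change status, so the flip percentage is a direct measure of how many decisions hinge on cut-score measurement error.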

Decision Consistency. Decision consistency was used to evaluate the relative consistency of pass/fail decisions across different test forms. For each comparison, kappa was used to quantify decision consistency.
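Cohen's kappa corrects the observed pass/fail agreement between two forms for the agreement expected by chance from each form's marginal pass rate, kappa = (p_o − p_e)/(1 − p_e). A minimal sketch with made-up decisions:

```python
def kappa(labels_a, labels_b):
    """Cohen's kappa for pass/fail decisions from two test forms."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the marginal pass rates of each form.
    pass_a = sum(a == "pass" for a in labels_a) / n
    pass_b = sum(b == "pass" for b in labels_b) / n
    p_chance = pass_a * pass_b + (1 - pass_a) * (1 - pass_b)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical decisions from a TST configuration and a baseline exam.
tst      = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
baseline = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]
print(f"kappa = {kappa(tst, baseline):.3f}")
```

A kappa of 1.0 indicates perfect decision consistency; 0.0 indicates agreement no better than chance.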

RMSE and BIAS. RMSE, the square root of the average squared difference between the TST and baseline ability estimates, was used to evaluate overall estimation error; BIAS, the average signed difference between the two sets of estimates, was used to evaluate systematic over- or underestimation. Both were calculated across ten levels of the ability continuum.
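Within each of the ten ability-level bins, both indices reduce to simple aggregates of the estimate differences. A sketch for one bin, with hypothetical (TST, baseline) theta pairs:

```python
import math

def rmse_and_bias(pairs):
    """RMSE and BIAS for (tst_theta, baseline_theta) pairs in one ability bin."""
    diffs = [t - b for t, b in pairs]
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    bias = sum(diffs) / len(diffs)
    return rmse, bias

# Hypothetical estimates for candidates falling in one ability-level bin.
bin_pairs = [(0.10, 0.00), (-0.05, 0.05), (0.20, 0.10), (0.00, -0.10)]
rmse, bias = rmse_and_bias(bin_pairs)
print(f"RMSE = {rmse:.3f}, BIAS = {bias:.3f}")
```

Because BIAS preserves the sign of each difference, positive and negative errors can cancel; RMSE cannot cancel, so a bin can show near-zero BIAS while still having substantial RMSE.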


TST may be well matched to licensure testing applications. Some of the potential strengths of TST for licensure testing are: a) shorter test lengths than TMC tests, b) easier control over the test than with CAT, c) fewer resources needed to develop and administer a TST than a CAT, and d) a better match between examinee ability and test difficulty than is possible with TMC.

Using real licensure test data, this study evaluated several TST configurations, varying the IRT model, the Routing test information distribution, the lengths of the Routing and Measurement tests, and the number of Measurement tests. The effect of routing error on decision consistency (kappa) was also studied, to determine how consequential the routing errors were for pass/fail decisions.



References

Angoff, W. and Huddleston, E. (1958). The multi-level experiment: A study of a two-level testing system for the College Board SAT. (Statistical Report No. SR-58-21). Princeton, NJ: Educational Testing Service.

Angoff, W. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational Measurement (2nd ed., pp. 508-600). Washington, DC: American Council on Education.

Betz, N. and Weiss, D. (1973). An empirical study of computer adaptive two-stage ability testing. (Research Report No. 73-4). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program.

Betz, N. and Weiss, D. (1974). Simulation studies of two-stage testing. (Research Report No. 74-4). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program.

Brennan, R. and Kane, M. (1977). An index of dependability for mastery tests. Journal of Educational Measurement, 14, 277-289.

Cleary, T., Linn, R., and Rock, D. (1968). An exploratory study of programmed tests. Educational and Psychological Measurement, 28, 345-360.

Cleary, T., Linn, R., and Rock, D. (1968). Reproduction of total test score through the use of sequential programmed tests. Journal of Educational Measurement, 5, 183-187.

Fortune, J. (1985). Understanding testing in occupational licensing. San Francisco: Jossey-Bass.

Gregory, R. (1992). Psychological testing: History, principles, and applications. Boston, MA: Allyn and Bacon.

Green, B., Bock, R., Humphreys, L., Linn, R., and Reckase, M. (1984). Technical guidelines for assessing computerized adaptive tests. Journal of Educational Measurement, 21, 347-360.

Haladyna, T. (1987). Three components in the establishment of a certification program. Evaluation and the Health Professions, 10(2), 139-172.

Haladyna, T. and Roid, G. (1987). A comparison of two approaches to criterion-referenced test construction. Journal of Educational Measurement, 20(3), 271-282.

Hambleton, R. and De Gruijter, D. (1983). Application of Item Response Models to criterion referenced test item selection. Journal of Educational Measurement, 20(4), 355-367.

Hambleton, R., Mills, C., and Simon, R. (1983). Determining the lengths for criterion referenced tests. Journal of Educational Measurement, 20(1), 27-38.

Hambleton, R. and Swaminathan, H. (1985). Item Response Theory: Principles and Applications. Boston, MA: Kluwer-Nijhoff Publishing.

Hambleton, R., Arrasmith, D., and Smith, L. (1987). Optimal item selection with credentialing examinations. Paper presented at the annual meeting of the American Educational Research Association, Washington, DC.

Hambleton, R. and Slater, S. (1997). Reliability of credentialing examinations and the impact of scoring models and standard-setting policies. Applied Measurement in Education, 10(1), 19-38.

Kim, H. (1993). Monte Carlo simulation comparison of two-stage testing and computer adaptive testing. Doctoral dissertation, University of Nebraska, Lincoln.

Kim, H. and Plake, B. (1993). Monte Carlo simulation comparison of two-stage testing and computerized adaptive testing. Paper presented at the annual meeting of the National Council on Measurement in Education, Atlanta, GA.

Kingsbury, G. and Zara, A. (1991). A comparison of procedures for content-sensitive item selection in computerized adaptive testing. Applied Measurement in Education, 4(3), 241-261.

Kingsbury, G. and Zara, A. (1989). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education, 2(4), 359-375.

Kingsbury, G. and Houser, R. (1993). Assessing the utility of item response models: Computerized adaptive testing. Educational Measurement: Issues and Practice, 12(1), 21-37, 39.

Larkin, R. and Weiss, D. (1975). An empirical comparison of two-stage and pyramidal adaptive testing. (Research Report No. 75-1). Minneapolis: University of Minnesota, Department of Psychology, Psychometric Methods Program.

Linn, R., Rock, D., and Cleary, T. (1969). The development and evaluation of several programmed testing methods. Educational and Psychological Measurement, 29, 129-146.

Livingston, S. (1972). Criterion-referenced applications of classical test theory. Journal of Educational Measurement, 9, 13-26.

Lord, F. (1971). A theoretical study of two-stage testing. Psychometrika, 36, 227-242.

Loyd, B. (1984). Efficiency and precision in two-stage adaptive testing. Paper presented at the annual meeting of the Eastern Educational Research Association, West Palm Beach, FL.

Mislevy, R. and Bock, R. D. (1995). BILOG for Windows. Mooresville, IN: Scientific Software, Inc.

Reckase, M. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4, 207-230.

Schmitt, K. (1995). What is licensure? In J. Impara (Ed.), Licensure testing: Purposes, procedures, and practices (pp. 1-xxx). Lincoln, NE: Buros Institute of Mental Measurements.

Subkoviak, M. (1976). Estimating reliability from a single administration of a criterion-referenced test. Journal of Educational Measurement, 15, 265-276.

Yen, W. (1981). Using simulation results to choose a latent trait model. Applied Psychological Measurement, 5, 245-262.
