Rasch models are measurement models that use dichotomous or ordinal data to construct a measure of the latent quantity of interest for each person interviewed. In our study, a linear measure of interest in the four dimensions was constructed from the 1–5 Likert-type scores that respondents assigned to the related items. Depending on the nature of the variables, different Rasch models are available: in the case of two ordered categories, the Dichotomous Rasch model applies (Rasch 1960), while the Rating Scale model (Andrich 1978) and the Partial Credit model (Masters 1982) are best suited to more than two ordered categories. We applied the Rating Scale model:
$$\ln \left( \frac{P\left( X_{ij} = k \right)}{P\left( X_{ij} = k - 1 \right)} \right) = \alpha_{i} - \beta_{j} - \tau_{k}, \qquad X_{ij} \in \left\{ 1,2, \ldots, K \right\},\quad i = 1,2, \ldots, N,\quad j = 1,2, \ldots, J,$$
(1)
where N is the number of persons; J is the number of items; \(X_{ij} \in \left\{ 1,2, \ldots, K \right\}\) is the response of person i to item j; \(\alpha_{i}\) is the so-called “ability” of the person (i.e. the degree of his/her positive attitude towards the aspect of interest); \(\beta_{j}\) is the so-called “difficulty” of the item (i.e. the degree of difficulty of endorsing the question posed), expressed on the same scale as the latent trait; and \(\tau_{k}\) is a “threshold” measuring the difficulty of endorsing category k, identical for every item. Higher values of \(\alpha_{i}\) mean that a person has a high probability of answering questions with high scores; conversely, lower values of \(\alpha_{i}\) mean a high probability of answering with low scores. Higher values of \(\beta_{j}\) mean that a person is unlikely to answer item j with high scores; conversely, lower values of \(\beta_{j}\) mean that a person is likely to answer item j with high scores. In this model \(\tau_{k}\), with \(k \in \left\{ 2, \ldots, K \right\}\), is called the Andrich Threshold (Linacre 2001) and is expected to satisfy an ascending order (see below).
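To make Eq. (1) concrete, the category probabilities it implies can be obtained by accumulating the adjacent-category logits. The following is a minimal sketch (not part of our estimation workflow, which relied on Winsteps), using purely hypothetical parameter values:

```python
import numpy as np

def rsm_category_probs(alpha, beta, tau):
    """Category probabilities P(X_ij = k), k = 1..K, implied by Eq. (1).
    `tau` holds the Andrich thresholds tau_2, ..., tau_K."""
    logits = alpha - beta - np.asarray(tau)           # log P(k)/P(k-1), k = 2..K
    cum = np.concatenate(([0.0], np.cumsum(logits)))  # log P(k)/P(1),  k = 1..K
    probs = np.exp(cum)
    return probs / probs.sum()                        # normalize to sum to 1

# Hypothetical person (alpha = 1.0) and item (beta = 0.2) on a 1-5 scale
# with ascending thresholds:
print(rsm_category_probs(1.0, 0.2, [-1.5, -0.5, 0.5, 1.5]))
```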
For instance, considering the latent dimension “Mountain attractiveness (MA)”, if respondent i assigned a high score to item MA01, he/she would be expected to show a more positive attitude \(\alpha_{i}\) than respondent g, who assigned a lower score to the same item: that is, person i has a higher attitude on the latent dimension MA than person g. Moreover, if MA02 was more difficult to endorse (i.e. had a higher value of \(\beta_{j}\)) than MA01, then the same persons would be expected to assign lower scores to MA02 than to MA01. In practice, persons with a higher estimated attitude \(\alpha_{i}\) are expected to assign higher scores to all items than persons with a less positive attitude, and items with a higher \(\beta_{j}\) value (i.e. a greater difficulty of endorsement) are expected to receive lower scores than items with a lower difficulty.
This reflects one of the fundamental properties of Rasch models, so-called “Specific Objectivity” (Rasch 1960, 1977): the requirement that the measures produced by a measurement model be independent of the distribution of item difficulties and person abilities. Indeed, if the original sample is split into two sets, one with the persons showing a less positive attitude and the other with those showing a more positive attitude, the difficulty parameters \(\beta_{j}\) estimated on the two subsamples are (statistically) the same. This means that, excluding one part of the sample, the attitude estimates for the persons in the reduced sample should be (statistically) the same as those obtained from the complete sample: if this does not happen, the data do not fit the model, possibly because of coding or response errors, such as miscoding in the data, random answers, or misinterpretation of the measurement scale (e.g. respondents interpreting the scale in a reversed manner). This property is also known as “Person-free Test Calibration” and, in practice, ensures that a random sample is not strictly necessary to estimate a Rasch model. Symmetrically, “Item-free Person Measurement” means that the estimates of person attitudes are, as far as possible, statistically independent of which items, and which distribution of item difficulties, are included in the test. In particular, the familiar statistical assumption of a normal (or any known) distribution of the model parameters is not required. To put it another way, if we estimate persons’ attitudes using different sets of items, these estimates tend to coincide within the usual statistical error margin.
Given the optimal theoretical properties of the Rasch model, the main problem in the analysis is to assess how well the data fit the model. Data were analyzed using Winsteps (www.winsteps.com), one of the most widely used software packages for Rasch analysis (Bond and Fox 2015). To verify the compatibility of the data with the model and compliance with its assumptions, we examined the correlations between the empirical observations \(X_{ij} \in \left\{ 1,2, \ldots, K \right\}\) and the Rasch measures obtained in a first run of the estimation program. These correlations were calculated both for items and for persons. They are expected to be positive since, under the assumptions of our model, persons with a more positive attitude tend to answer an item with higher scores, while persons with a less positive attitude tend to answer with lower scores. Therefore, a negative or very low item correlation would mean that the item does not work as intended, or that its codes were reversed, i.e. 1 means “more” (rather than less) and K means “less” (instead of more); in this case the item should be excluded, or its codes reversed. As for the person correlations, they measure the association between the answers \(X_{ij} \in \left\{ 1,2, \ldots, K \right\}\) of person i and the estimated “easiness to endorse” of the items, \(-\beta_{j}\) (putting a minus sign in front of the difficulty parameter \(\beta_{j}\) gives the easiness of the item). Since persons should tend to assign higher scores to items that are easier to endorse, and lower scores to less easy items, a positive correlation is expected if the data fit the model. A negative or very low correlation implies that the person answered the questions randomly, without sufficient reflection, or used the codes in a reversed manner; should this be the case, the unit should be excluded. Unfortunately, our dataset showed negative or very low correlations for a large part of the sample (from 25 to 40%, depending on the dimension investigated); these observations were therefore dropped to reduce bias in the estimated measures for items and persons. Far from manipulating the results, this statistical data treatment is essential to handle errors related to response bias or badly conceived items: in other words, Rasch analysis checks for good data quality before the final model estimate is obtained.
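Winsteps reports these point-measure correlations directly; purely as an illustration of the logic, the sketch below computes them for a generic response matrix X (N persons by J items), given hypothetical vectors of estimated person measures and item difficulties:

```python
import numpy as np

def point_measure_correlations(X, person_measures, item_difficulties):
    """Screening correlations described in the text.
    Item side: responses to item j vs. estimated person measures.
    Person side: responses of person i vs. item easiness (-beta_j)."""
    easiness = -np.asarray(item_difficulties)
    item_corr = np.array([np.corrcoef(X[:, j], person_measures)[0, 1]
                          for j in range(X.shape[1])])
    person_corr = np.array([np.corrcoef(X[i, :], easiness)[0, 1]
                            for i in range(X.shape[0])])
    return item_corr, person_corr

# Units with negative or very low person correlations would be dropped, e.g.:
# keep = person_corr > 0.05   # the 0.05 cutoff is purely illustrative
```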
A further reduction in the number of observations may derive from the analysis of extreme scores, i.e. respondents who answered every item using only the lowest (or the highest) score of the scale, 1 (or 5) in our case. Extreme scores imply extreme, but indefinitely located, measures (abilities). Methods to estimate measures corresponding to extreme scores are available, but they produce truncated variables; hence, we opted to exclude such respondents from further analysis. The presence of extreme scores implies that the items used to investigate the dimension of interest are either too easy (leading to maximum scores) or too difficult (leading to minimum scores), and the only way to solve this problem is to add new items to the dimension. However, this option was not available in our survey, and, depending on the dimension investigated, extreme scores ranged from 10 to 20% of the sample. Finally, additional exclusions are related to the analysis of fit indexes, discussed later.
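Extreme respondents can be identified mechanically before estimation; a minimal sketch, assuming a 1–5 scale:

```python
import numpy as np

def drop_extreme_scores(X, k_min=1, k_max=5):
    """Remove persons who used only the lowest or only the highest category:
    their measures are extreme but indefinitely located."""
    extreme = (X == k_min).all(axis=1) | (X == k_max).all(axis=1)
    return X[~extreme], ~extreme
```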
Once the preliminary analysis is completed, the remaining assumptions of the model should be checked to understand whether the categories coded 1, 2, etc. have an actual meaning and can therefore be interpreted. After the model is estimated, the average estimated attitude of the persons choosing each score, for each item, is expected to grow with the score 1, 2, etc. Indeed, the model assumes that a more positive individual attitude should result in a higher score; hence we should observe lower average attitudes for lower scores and higher average attitudes for higher scores. Violation of this assumption, e.g. an average attitude for score 2 higher than that for score 3, would mean that respondents inverted or used scores 2 and 3 incoherently, and the two should then be merged into a single category. The Andrich Thresholds \(\tau_{k}\) (Linacre 2001) provide further information on the coherent/incoherent use of the categories: in case of inverted use (i.e. thresholds not in ascending order of k), a common solution is to reduce the number of categories, merging adjacent ones with very similar average attitudes or Andrich Thresholds.
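This diagnostic (the “average measure” by category, as reported by Winsteps) can be sketched as follows: pool all responses falling in category k, average the corresponding estimated person measures, and check that the resulting sequence is ascending in k:

```python
import numpy as np

def average_measure_by_category(X, person_measures, K=5):
    """Mean estimated person measure over all responses X_ij = k,
    pooled across items; expected to increase with k."""
    theta = np.broadcast_to(np.asarray(person_measures)[:, None], X.shape)
    return np.array([theta[X == k].mean() if (X == k).any() else np.nan
                     for k in range(1, K + 1)])

# Disordering (a negative step) suggests merging the adjacent categories:
# avgs = average_measure_by_category(X, measures)
# disordered = np.diff(avgs) < 0
```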
Another important aspect of fit is the possible violation of the local independence hypothesis (Lord and Novick 1968) and multidimensionality (Linacre 2011). For the first problem, we can look at the correlations between the standardized residuals of pairs of items: if these are low (< 0.70), we may conclude that the local independence hypothesis is not violated; a correlation > 0.70 means that two items have almost the same meaning, and one of them must therefore be eliminated to satisfy local independence. Regarding the second problem, in a dataset fitting the Rasch model, variability depends both on the model and on residual variability due to randomness. The Rasch “Principal Component Analysis (PCA) of residuals” looks for patterns in the part of the data attributable to randomness. Any such pattern is the “unexpected” part of the data, which may be due, among other reasons (Smith 2002), to the presence of multiple dimensions in the data. In the Rasch PCA of residuals, we look for groups of items sharing the same patterns of unexpectedness. In particular, the matrix of item correlations based on residuals is decomposed to identify possible “contrasts” (the principal components) that may be affecting response patterns. As a rule, a contrast needs the strength (eigenvalue) of at least two items to stand above the noise level. If the largest PCA eigenvalue is around 2 or less, the latent measure under investigation may be considered unidimensional. If instead it is much greater than 2, we can look at the correlations between the measures obtained on the same set of persons by splitting the items into clusters according to their loadings and applying the model separately to each cluster: as suggested by Linacre (2011), if these correlations are near 1, we can consider the items part of a single dimension; otherwise, if the correlations are very low (< 0.30), we may split the items and exclude those that do not seem compatible with the dimension of interest.
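Conceptually, the Rasch PCA of residuals amounts to an eigen-decomposition of the inter-item correlation matrix of standardized residuals. A schematic sketch, assuming that the model-expected scores E and model variances W of each response are available from the estimated model:

```python
import numpy as np

def first_contrast_eigenvalue(X, E, W):
    """Eigenvalue of the first contrast in the Rasch PCA of residuals,
    expressed in 'item' units. X: observed responses; E: expected scores;
    W: model variances of the responses (all N x J arrays)."""
    Z = (X - E) / np.sqrt(W)            # standardized residuals
    R = np.corrcoef(Z, rowvar=False)    # item-by-item residual correlations
    return np.linalg.eigvalsh(R)[-1]    # eigvalsh returns ascending order

# A value around 2 or less supports unidimensionality; a much larger value
# suggests inspecting the item clusters defined by the loadings.
```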
Once these issues have been investigated and resolved, the analysis of fit statistics gives an estimate of the degree to which participants and items respond according to the expectations of the model. These fit statistics are a summary of all the residuals (the differences between what is observed and what the model expects) of each item for each person. They can take values between zero and infinity: values above 1 indicate greater variation than expected, values below 1 indicate less variation than expected, and values around 1 mean that the data fit the model adequately. The fit statistics are divided into two categories, a weighted one, called Infit, and an unweighted one, called Outfit. For suggested good-practice intervals see Bond and Fox (2015); in our analysis we followed the suggestions of Linacre (2011), i.e. 0.5–1.5 for items. Persons that do not fit should be removed from the model to increase the validity of the results; we decided to retain persons with Infit and Outfit below 3.
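Both statistics have a simple closed form in terms of the residuals: Outfit is the unweighted mean of the squared standardized residuals, while Infit weights the squared residuals by the model variance. A sketch (with X, E and W as above):

```python
import numpy as np

def infit_outfit(X, E, W, axis=0):
    """Mean-square fit statistics (axis=0: per item; axis=1: per person).
    Outfit: average squared standardized residual, sensitive to outliers.
    Infit: information-weighted version, sensitive to on-target misfit."""
    sq_res = (X - E) ** 2
    outfit = (sq_res / W).mean(axis=axis)
    infit = sq_res.sum(axis=axis) / W.sum(axis=axis)
    return infit, outfit
```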
Finally, we can examine the overall fit of the model, in particular the reliability and separation indexes for items and persons: values of reliability > 0.80 and separation > 3 indicate a good fit of the scale, telling how well this sample of respondents has spread out the items along the measure of the test, thus defining a meaningful dimension.
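For reference, separation G and reliability R express the same signal-to-noise information on different scales, linked by the standard conversion

$$G = \sqrt{\frac{R}{1 - R}}, \qquad R = \frac{G^{2}}{1 + G^{2}},$$

so that, for instance, a reliability of 0.80 corresponds to a separation of 2, and a reliability of 0.90 to a separation of 3.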