A Factorial Survey on the Justice of Earnings within the SOEP-Pretest 2008

In the 2008 Socio-Economic Panel Study (SOEP) Pretest, the factorial survey method was tested for the first time for use in the SOEP longitudinal study. In this paper, we describe the construction and application of the vignette module, which has its origins in the field of justice research and is used in particular in the measurement of income justice. We show that the factorial survey method is applicable in large-scale survey research when certain constraints are taken into account, and that respondents of varying ages and educational groups are able to deal sufficiently well with answering the questions. The results obtained suggest that older respondents tend to take fewer dimensions into consideration in forming their opinions. Further studies will be needed to determine whether this is evidence that the evaluation tasks were too complex for these respondents and should thus be interpreted as a method effect, or whether it represents a valid substantive result. The results of the study demonstrate convincingly that alongside occupation, education, and performance (factors relating directly to employment), familial aspects such as civil status, the partner's employment status, and number of children constitute important criteria for determining what constitutes a "fair" income. The factorial survey in the 2008 SOEP Pretest offers diverse analytical potential, both from a methodological point of view and in terms of the empirical results obtained. The positive experience with the 2008 SOEP Pretest suggests that the SOEP vignette module can be used effectively in a future wave of the main SOEP survey.


Introduction
Over the last decade there has been a marked increase in the number of studies in academic and non-academic attitude and decision research that use a comparatively new method: the factorial survey design. 1 The factorial survey is an experimental method that confronts respondents with hypothetical descriptions of objects or situations (vignettes). In these descriptions some attributes (dimensions) are experimentally varied. The respondents' task is to evaluate each of these descriptions normatively, or to indicate what they would recommend or how they would act in the presented situations. The aim is to identify those dimensions which affect the evaluation or the decision and to assess their relative impact. The issues addressed in these studies include attitudes towards the justice of income and wages (Alves and Rossi 1978; Hermkens and Boerman 1989; Jann 2003; Jasso 1994; Jasso and Meyersson Milgrom 2008; Jasso and Rossi 1977; Jasso and Webster 1997, 1999), just taxation (Liebig and Mau 2005) and just punishment (Berk and Rossi 1977; Miller et al. 1986). There are also studies on the measurement of norms and values (Beck and Opp 2001; Jasso and Opp 1997; Mäs et al. 2005) and the degree of life satisfaction (Kapteyn et al. 2008). Other studies simulate bargaining situations (Auspurg and Abraham 2007; Auspurg et al. 2009b) or deal with trust (Barrera and Buskens 2007). The common aim of these studies is to measure the evaluation of certain outcomes (e.g. income, grades, satisfaction, penalties) or of certain decision-making processes which depend strongly on the particular situation and the social context. The use of the factorial survey method is driven by the promise that it allows for a more differentiated measurement than classical item-based approaches in attitudinal research.
The main advantages of the factorial survey design in comparison to item-based measurement are: (1) the vignettes describe a situation more realistically, because in everyday life people judge, decide or evaluate on the basis of a bundle of information, and this is what factorial designs capture in their multidimensional descriptions; and (2) the design is experimental: respondents rate vignettes in which the dimensions vary independently of each other.
Despite the growing number of applications in attitudinal research, there is little empirical knowledge on the methodological implications and effects of the factorial survey design. This is especially true for the use of factorial surveys in population surveys. Most studies use homogeneous respondent populations, most often students, and are carried out in the lab or in comparable settings (e.g. the classroom). As the research design and the respondents' rating task are usually very complex, a number of methodological effects may occur and, as a consequence, may produce methodological artifacts. Against this background, we implemented a factorial survey module in the SOEP-Pretest 2008 to find out which practical and methodological problems are associated with this technique, especially when it is used in large-scale population surveys. The main focus is on acceptance by, and comprehensibility for, respondents and interviewers.
1 The factorial survey was established in the social sciences by Peter H. Rossi in his dissertation in 1951. It was used for the measurement of social status and prestige of households (Alves/Rossi 1978; Rossi 1979; Rossi/Nock 1982). Rossi's central goal was the development of a method of measurement that distinguishes between the relative relevance of several factors for social attitudes (Rossi/Anderson 1982: 15 et seq.; Rossi/Nock 1982).
The subject matter of the implemented factorial survey module is the justice of wages. Respondents were confronted with 25 descriptions of fictitious earners who differed in certain characteristics such as gender, age, education, occupation or level of individual effort. In each case respondents had to evaluate whether the presented gross income was just or unjust.
The paper is organized as follows. First, we give an overview of the construction of factorial survey designs (Chapter 2). Second, we describe the implementation of the instrument in the SOEP-Pretest 2008 as well as the respondent and vignette samples (Chapter 3). Third, we examine how well the factorial survey design works in practice: we analyze the direct feedback of respondents and interviewers as well as respondent behavior, using data on response times and the consistency of responses (Chapter 4). Fourth, we present some results with regard to the perceived justice of earnings (Chapter 5). In the last chapter we summarize our findings and stress the main methodological implications of this study.

The Factorial Survey Approach
Constructing the vignettes that describe persons, situations or objects is the most important step in designing factorial surveys. First, those characteristics or dimensions of persons or objects have to be identified which are hypothesized to affect response behavior. This step should be based on theoretical considerations (Alves 1982; Jasso 2006) and be carried out very carefully, as seemingly marginal specifications (such as the number of levels used) have a great impact on the conceptual design and analysis of factorial surveys. The main task in defining the dimensions (i.e. the characteristics of the fictitious earners) is to find those that are relevant for the evaluations (of just earnings).
We intend to construct vignettes which describe persons who work full time and earn a certain amount of gross income. The rating task is to evaluate whether the income of the described person is just or unjust. Qualitative dimensions (such as the sex of the person) have a naturally limited number of levels (male and female). In contrast, the range and number of levels of continuous dimensions (such as age) have to be defined. Age, for example, could be restricted to a range from 30 to 60 years with four levels (30, 40, 50 and 60 years) or seven levels (30, 35, 40, 45, 50, 55 and 60 years). It is important to note that the number of parameters which have to be estimated in the data analyses increases exponentially with the number of dimensions and levels (Alves 1982; Jasso 2006).
After the dimensions and levels have been specified, the vignette universe of all possible vignettes is generated by combining all attribute levels with each other (Cartesian product). In the case of three dimensions with five levels each the universe consists of 5*5*5 = 125 vignettes. 2 Usually the complete vignette population cannot be rated by a single respondent because it is far too large. The solution is to work with samples only (similar to matrix sampling; for details see Thomas et al. 2006). One may draw a unique sample for each respondent, or a few samples (so-called decks) that are each rated by a number of respondents in order to obtain multiple ratings of each vignette (Jasso 2007).
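The construction of the universe and of decks can be illustrated with a short sketch. The dimension names and levels below are placeholders chosen for illustration only (they are not the levels used in the study); the sketch merely assumes three dimensions with five levels each, as in the example above, and uses simple random sampling rather than a quota design.

```python
from itertools import product
import random

# Hypothetical dimensions and levels (illustrative only, not the study's levels).
dimensions = {
    "age": [25, 35, 45, 55, 65],
    "children": [0, 1, 2, 3, 4],
    "income": [1000, 2000, 3000, 4000, 5000],
}

# Vignette universe = Cartesian product of all attribute levels.
universe = [dict(zip(dimensions, combo)) for combo in product(*dimensions.values())]

# Draw a simple random sample and split it into decks of equal size.
rng = random.Random(2008)  # fixed seed so that the draw is reproducible
sample = rng.sample(universe, 20)
decks = [sample[i::4] for i in range(4)]  # 4 decks with 5 vignettes each
```

With these placeholder levels the universe contains 5*5*5 = 125 vignettes. Each deck would then be presented to several respondents, so that every sampled vignette receives multiple ratings.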
The vignette sample can be obtained by using a random design (Jasso 2006) or a quota design (Dülmer 2007; Kuhfeld 2005; Kuhfeld et al. 1994; Steiner and Atzmüller 2006). In both cases, the aim is to keep the correlations between different attributes low. In a full factorial design (the vignette universe) the dimensions are orthogonal to each other, so all main and interaction effects can be estimated. This property is weakened in a reduced sample because some effects will be confounded. Recent studies suggest that quota designs are more efficient than random samples because of their higher orthogonality and balance (that is, maximum variance of attribute levels). This is especially the case for small samples (Dülmer 2007; Steiner and Atzmüller 2006). 3
Different scales can be used for the evaluation task of each vignette. The main question is whether a metric scale is necessary or an ordinal scale is sufficient. In most vignette studies rating scales with up to 15 categories are used (Dülmer 2001; Mäs et al. 2005; Schulte 2002; Thurman et al. 1988), but there are also many applications using magnitude scaling (cf. Wallander 2009).
2 In this full factorial design the correlation between the dimensions is zero. Some combinations lead to unrealistic scenarios (for example a medical doctor without a university degree) and were excluded from the vignette universe. This is why the correlation of dimensions in the resulting vignette universe is not equal to zero.
3 These quota designs systematically draw vignettes from the universe with the overall goal of keeping all level combinations uncorrelated. This can be done with statistical software. In regular cases the algorithm finds the design with maximum efficiency. Besides low correlation, efficiency also means maximum variance of the dimensions. Alongside fractional factorial designs, which only maximize orthogonality, D-efficient designs are available. In D-efficient designs orthogonality loses ground because maximum variance of the attributes becomes the main target criterion. The D-efficient design should be the preferred approach, especially for vignette populations from which implausible combinations have been deleted.

A Factorial Survey in the SOEP Pretest 2008
The program of the annual SOEP questionnaire for the following wave is pretested each summer of the preceding year. The objective of this pretest is to test new modules and modifications of questions. For several years the SOEP-Pretest has gone far beyond the standard format of a pretest. Since 2002 the sample has comprised around 1,000 respondents and has been considered representative of the German resident population aged 16 and older (Siegel et al. 2009).
Within the SOEP there are two main differences between the pretest and the main survey. First, all interviews in the SOEP-Pretest are programmed as CAPI versions (in the main survey, by contrast, most interviews are based on paper-and-pencil questionnaires), which is why the SOEP-Pretest is well suited to testing experimental designs. 4 Second, whereas in the main survey all members of a household aged 16 and older are interviewed, the SOEP-Pretest is arranged in a much simpler fashion: there is one questionnaire, to be filled out by one member of a household. The pretest sample is not related to the main survey, meaning that these respondents are not part of the panel study. The interviews of the SOEP-Pretest 2008 were conducted between 1 and 31 August 2008. The whole questionnaire was planned to take 45 minutes, which matches the realized median. In total, 1,066 interviews were conducted.
Within the SOEP-Pretest the factorial survey module focuses on the justice evaluation of the wages of fictitious full-time employees (40 hours per week) who are described by ten dimensions. The respondents had to rate a total of 25 vignettes, of which the last contained two additional dimensions: the nationality of the earner and his or her duration of stay in Germany. In the following we concentrate on the results for the 24 vignettes with ten dimensions. 5

Vignette Dimensions and Levels
The ten dimensions presented in the vignettes were based on the evidence from earlier vignette studies on the justice of earnings (Alves 1982; Alves and Rossi 1978; Jann 2003; Jasso 1978; Jasso and Rossi 1977; Jasso and Webster 1997, 1999). These studies show that the dimensions age, gender, number of children, occupation and education have a significant influence on justice evaluations. Further dimensions that are commonly known to be relevant from justice research and related fields were added. These are the performance on the job and the marital status Schupp 2005, 2008a,b; Struck et al. 2006). As the size and economic situation of the company (Abraham and Hinz 2005a,b) are important for the actual income, we assume that these two dimensions are also relevant for the subjective justice evaluations. Table 1 gives an overview of the dimensions and the levels used to describe the fictitious earners.
4 Further topics in the SOEP-Pretest are: (1) daily moods: self-assessment of the respondents with regard to moods in a typical week, (2) questions referring to strength of character: a German translation of the Values in Action (VIA) Classification of Strengths concept, (3) new questions to measure (chronic) diseases.
5 All respondents had to rate a blind vignette with the help of the interviewer at the beginning. The content of this vignette was: "A 35 year old single man with vocational training works as a hairdresser in a small company which achieves substantial gains. His performance on the job is outstanding and he earns a gross income of 350 Euro per month. Is the gross income of this employee in your opinion just or unjust?".

Vignette Universe and Illogical Cases
The vignette universe is the combination of all attribute levels with each other. In the present study the combination of all dimensions and their levels amounts to 980,000 cases. Some combinations were excluded from the vignette universe because they describe cases which definitely cannot be found in the real world and are therefore illogical. This is true for certain combinations of income and occupation:
- Gross income of more than 3,800 Euro for manufacturing workers
- Gross income of more than 5,400 Euro for doorkeepers and engine drivers
- Gross income of more than 6,800 Euro for administrative associate professionals, hairdressers and social workers
- Gross income below 1,200 Euro for electrical engineers
- Gross income below 2,500 Euro for general managers or medical doctors
There are also combinations of vocational training and occupation which are definitely unrealistic:
- Electrical engineers without vocational training
- Physicians without a university degree
We drew the vignette sample with a quota design (D-efficient design), excluding the illogical cases mentioned above (Kuhfeld 2005; Kuhfeld et al. 1994). First, we drew 240 vignettes with a D-efficiency of over 90; second, we divided them into ten decks of 24 vignettes each. 7
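The exclusion rules listed above can be expressed as a simple filter over the vignette universe. The following is a minimal sketch, not the procedure actually used in the study: the income and education levels are hypothetical placeholders, only three of the ten dimensions are included, and the D-efficient sampling step is not reproduced.

```python
from itertools import product

# Hypothetical levels for illustration; the study's actual levels are given in its Table 1.
incomes = [500, 1200, 2500, 3800, 5400, 6800, 10000]
educations = ["no vocational training", "vocational training", "university degree"]
occupations = [
    "manufacturing worker", "doorkeeper", "engine driver",
    "administrative associate professional", "hairdresser", "social worker",
    "electrical engineer", "general manager", "medical doctor",
]

# Income bounds per occupation, taken from the exclusion rules in the text.
MAX_INCOME = {
    "manufacturing worker": 3800,
    "doorkeeper": 5400, "engine driver": 5400,
    "administrative associate professional": 6800,
    "hairdresser": 6800, "social worker": 6800,
}
MIN_INCOME = {"electrical engineer": 1200, "general manager": 2500, "medical doctor": 2500}


def is_plausible(occupation, education, income):
    """Return False for combinations that the text classifies as illogical."""
    if income > MAX_INCOME.get(occupation, float("inf")):
        return False
    if income < MIN_INCOME.get(occupation, 0):
        return False
    if occupation == "electrical engineer" and education == "no vocational training":
        return False
    if occupation == "medical doctor" and education != "university degree":
        return False
    return True


universe = list(product(occupations, educations, incomes))
plausible = [v for v in universe if is_plausible(*v)]
```

Removing cases in this way makes the remaining dimensions slightly correlated, which is one reason why a D-efficient design rather than a strictly orthogonal one is drawn from the reduced universe.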

Rating Task and Presentation of Vignettes
The rating task was a three-step procedure. First, the respondents had to evaluate whether the gross income of the person described in the vignette was just or unjust. If the respondent judged the income as just, he or she continued with the next vignette. If the respondent evaluated the income as unjust, he or she had to indicate in a second step whether the income was too high or too low. In the third step the respondent had to express the amount of perceived injustice on a metric scale from 1 (some injustice) to 100 (extreme injustice). A possible disadvantage of this procedure is that respondents are more familiar with conventional rating scales and may therefore find this kind of scale more difficult to use. The advantage of a metric scale is that respondents can differentiate their judgments more finely than, for example, on a five-point rating scale. Figure 1 presents a vignette with the rating steps.
The complete questionnaire within the SOEP-Pretest 2008 was designed as a computer-assisted personal interview (CAPI) 8 and the interviewer read the questions to the respondent. In the vignette module, however, the vignettes were presented to the respondent on a computer screen. The interviewer sat next to the respondent to answer any questions that arose during the evaluation task. An introduction screen additionally informed the respondent about what to do and how to use the scale. Afterwards the respondent judged an example vignette and was able to ask the interviewer for help if anything was unclear. After this blind vignette the respondents were randomly assigned to one of the ten decks with 24 vignettes. The vignettes were programmed in a fixed order, which means that respondents could not skip to the next vignette without giving a rating. 9 The respondents were therefore forced to rate every vignette.
7 The maximum D-efficiency in a symmetric design is 100. Often the best achievable efficiency is less than 100, so one has to choose the best from several alternative designs. A D-efficiency above 90 is deemed to be good.
8 We thank Andreas Stocker, TNS Infratest Sozialforschung, Munich, who took great care over the implementation and programming of the vignette module in the computer-assisted questionnaire.
9 This procedure is somewhat uncommon with regard to measuring the acceptance of a new module. But it is possible to reconstruct refusals from very short response durations (see Chapter 4.2.1).

Figure 1: Vignette with Ten Dimensions and Rating Task
A 45-year-old woman, married, with two children and a husband who has no income of his own. She has vocational training and works as a hairdresser in a large company which is threatened by bankruptcy. Her performance on the job is below average. She earns 1,200 Euro gross income per month before taxes.
Your rating:
F 1: From your point of view, is the gross income for this person just or unjust?
□ Gross income is just (carry on with the next person description)
□ Gross income is unjust (carry on with F 2)
F 2: Is the gross income unjustly too high or too low?
□ unjustly too high (carry on with F 3)
□ unjustly too low (carry on with F 3)
F 3: With regard to your personal feeling, which number between 1 and 100 describes most adequately the amount of injustice?

Respondents and Vignette Sample (SOEP Pretest 2008)
The realized sample of the SOEP-Pretest is based on a three-step probability sampling procedure according to the ADM design. The response rate reported by TNS Infratest Sozialforschung is about 50 percent (Siegel et al. 2009). The realized sample (N = 1,066) was weighted with regard to regional and demographic distributions. The weighted sample can be regarded as representative of the German population, although only unweighted data are used in this report. Table A1 in the Appendix gives an overview of the realized sample.
Respondents were assigned to one of the ten vignette decks randomly. The distribution of respondents to each deck is reported in Table 2. The number of realized respondents by deck ranges between 96 (decks 2 and 7) and 127 (deck 9). 10 The correlations between the dimensions in the whole vignette sample as well as in the single decks are very low (see Appendix A2), which means that the design is efficient in a statistical sense.

Methodological Results
The main research question focuses on methodological effects resulting from the higher-than-average complexity of a factorial survey and its application in population surveys. This is investigated using three sources of information: (1) respondent feedback, (2) interviewer impressions, and (3) response behavior. After a short description of the respondents' feedback to an open question, the more detailed analyses of the latter two aspects are presented. We attempt to analyze the differences in respondent behavior in detail. As factorial surveys require much more attention and concentration from respondents, age and educational effects are likely to occur. In the following analyses we therefore categorize respondents into three age groups (16 to 39 years, 40 to 65 years, and over 65 years) 11 and three educational groups (general educational level: lower (Hauptschule), middle (Realschule) and higher secondary school certificate (Abitur)).

Respondents Feedback and Interviewer Impressions
The questionnaire provided the opportunity to criticize and comment on the vignette module in an open question. A total of 191 respondents made a comment. It cannot be determined whether the other respondents had no criticism at all or did not want to answer the open question. Table 3 shows the most frequently mentioned comments. In 36 percent of the cases (that is, seven percent of the whole respondent sample) respondents declared the descriptions to be unrealistic in some cases. 35 percent of those who made a comment (six percent of all respondents) perceived the vignette part as too long. Only twelve percent (two percent of the whole sample) had problems comprehending the rating procedure. Nine percent of the 191 respondents who made comments had difficulties classifying the income as just or unjust.
Based on the interviewers' assessment we are able to take a closer look at the respondents' comprehension and willingness to participate in the vignette module. 12 As shown in Table 4, more than 80 percent of the respondents understood the vignette part well (categories: very good and good). By comparison, more than 90 percent fall into these categories for the whole questionnaire. This difference of ten percentage points indicates that the vignette module is more complex than other parts of the interview, but it can still be considered similar to other complex modules in the SOEP-Pretest 2008. The assessments of the interviewers are obviously subjective and should be interpreted with caution. Still, these impressions provide some valuable insights into the interview situation itself.
Figure 2 shows that some differences between the age groups occurred. More than 50 percent of the youngest respondents had a very good understanding of the vignettes. In comparison, only 30 percent of the oldest interviewees are in this group. 40 percent of the latter group comprehended the task well (category: good). Only ten percent of respondents over 65 years understood the vignettes worse than satisfying.
As shown in Figure 3, there are only a few differences between educational groups with regard to comprehension. In 50 percent of the cases respondents with the highest education level had a very good comprehension of the vignette module. In 40 percent of the cases the comprehension was still good. The middle group performed almost as well, with a total of 90 percent who understood the task at least well. Of those who have a lower secondary school certificate (Hauptschule), almost 80 percent understood the task well. The differences between educational levels can be considered smaller than the differences between age groups. It is notable that the differences between age and educational groups are similar for the whole questionnaire (analyses not displayed). This means that no vignette-specific comprehension problems occurred. 13
The respondents' willingness to answer, as perceived by the interviewer, is presented in Table 5. In over 80 percent of the cases the willingness to participate in the vignette module was good or very good, in comparison to almost 90 percent for the whole questionnaire. As Table 6 shows, the willingness to answer differed between age groups. The youngest group performed significantly better than the oldest group. The middle group performed almost as well as the youngest group. Table 7 reports only marginal differences between educational groups. For respondents with a lower secondary school certificate (Hauptschule), interviewers classified the willingness to answer as at least good in 78 percent of the cases, in comparison to 83 percent and 87 percent in the other groups. In sum, interviewers' impressions do not show dramatic differences between age or educational groups. This can be interpreted as a first hint that vignettes are applicable in population surveys.
12 The question given to the interviewer was: 'Please state precisely for the last question, or group of questions, in regard to the topic 'Income Justice', how you would evaluate the respondent's performance with respect to comprehension and willingness to reply.' (closed question, categories: very good, good, satisfying, adequate, inadequate, deficient).

Respondent Behavior
Respondent behavior provides valuable insight into the rating situation and allows conclusions to be drawn about the practicability of the vignette instrument. In the following, we take a closer look at response time, the use of the rating scale and the consistency of the judgments.

Response Time
The response time is only available for the whole vignette module. The analysis of this kind of process-produced data is problematic because important context information, such as interruptions during interviews, is often not recorded. Nevertheless the gathered data provide useful information, for instance with respect to factual refusal. The CAPI programming excluded the possibility of refusals or drop-outs during the module (compare part 4). A measured response time of 20 seconds for the complete module can be interpreted as a factual refusal. Approximately five percent needed less than three and a half minutes to complete this part of the questionnaire, which is an average of eight seconds per vignette. At the other extreme are two interviews with 137 and 139 minutes of processing time (on average five minutes per vignette). Such extreme values are an indicator of unmeasured breaks. Apart from these outliers the data seem analyzable. The respondents needed an average of thirteen and a half minutes to complete the vignettes (24 plus the example and the vignette with two further dimensions, see Footnote 5). The median is about twelve (12.4) minutes. Table 8 reports key statistics on processing time. It is also worth mentioning that the respondents started the vignette module on average after 25 minutes of questioning. Figure 4 shows the box plots of the processing time by respondents' age (left box plots) and education (right box plots). There are no dramatic differences between the groups; respondents with a higher education level and older respondents needed slightly more time to complete the module (the median for older respondents being one minute higher).
These results indicate that vignettes in a population survey can be evaluated in a tolerable amount of time. The median corresponds to roughly 30 seconds per vignette. The differences between educational and age groups are small. All respondents are able to process the vignette module in a similar span of time.
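The per-vignette figures quoted above follow from dividing the module time by the number of rated descriptions. The short calculation below is a sketch under the assumption that the recorded time covers 26 descriptions (24 vignettes, the example vignette and the extended vignette); the function name is chosen for illustration.

```python
# Rated descriptions per respondent: 24 vignettes + example vignette + extended vignette.
N_VIGNETTES = 24 + 1 + 1


def seconds_per_vignette(total_minutes, n=N_VIGNETTES):
    """Average time per vignette in seconds for a given module duration in minutes."""
    return total_minutes * 60 / n


fast = seconds_per_vignette(3.5)       # fastest five percent: about 8 seconds
median = seconds_per_vignette(12.4)    # median respondent: just under 30 seconds
mean = seconds_per_vignette(13.5)      # mean respondent: about 31 seconds
slow_minutes = seconds_per_vignette(138) / 60  # outliers (137 and 139 min): about 5 minutes
```

Under this assumption the median of 12.4 minutes corresponds to just under 30 seconds per vignette, which matches the rounded figure given in the text.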

Use of the Scale
How do respondents use the response scale? The scale ranges from -100 to -1 for the view that the fictitious earner in the vignette is underrewarded. The zero point of the scale marks a just income, and positive values from +1 to +100 reflect a situation in which the presented fictitious earner is overrewarded. Table 9 shows the frequencies of the categorized responses, distinguishing between underrewarded, just income and overrewarded. About 9,000 vignettes were rated as just, slightly fewer as underrewarded and about 8,000 vignettes as overrewarded. At first glance the ratings show a dominance of the "just" category. Figure 5 displays the distribution of the evaluations using the scale values instead of the categorized responses. As shown in the graph, the value zero, that is the category "just", strongly dominates the other scale values. The clustering at the ends of the distribution shows a ceiling effect, especially in the negative number range. In addition, some frequently chosen values stand out (-100, -50, 0, 50, 100). The respondents did not fully use the metric scale. In the following analyses this result is taken into account by using the categorical variable with three values (see Table 9).
[Figure 5: density of the justice rating on the scale from -100 to 100]
To determine in more detail how the respondents used the response scale we concentrate on two aspects. First of all, the clustering in the category "just" is remarkable. It might indicate that respondents wanted to speed up the procedure by skipping the second and third parts of the evaluation task (see part 1). Second, we analyze in more detail how many different values the respondents actually use to make their judgments (at most 24) and what kind of scaling they apply.
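The way the three rating steps combine into the signed scale described above, and how that scale collapses into three categories, can be sketched as follows. The function names and the string codes for the direction are assumptions made for illustration; they are not taken from the SOEP data set.

```python
def signed_rating(is_just, direction=None, amount=None):
    """Combine the three rating steps into one value on the -100..+100 scale.

    is_just:   answer to step 1 (True if the income was rated as just)
    direction: answer to step 2, "too low" or "too high" (hypothetical coding)
    amount:    answer to step 3, a number between 1 and 100
    """
    if is_just:
        return 0
    if direction not in ("too low", "too high"):
        raise ValueError("direction must be 'too low' or 'too high'")
    if amount is None or not 1 <= amount <= 100:
        raise ValueError("amount must lie between 1 and 100")
    # Underreward is coded negatively, overreward positively.
    return -amount if direction == "too low" else amount


def category(rating):
    """Collapse the signed rating into the three categories used in the analyses."""
    if rating < 0:
        return "underrewarded"
    if rating > 0:
        return "overrewarded"
    return "just"
```

For example, a respondent who rates an income as unjustly too low with an amount of 40 receives the value -40, which falls into the category "underrewarded".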
Table 10 shows that the respondents rated on average 8.3 vignettes (out of 24, that is, more than one third) as just, with marginal differences between educational groups. Table 11 shows a significant difference (p<.01) between age groups. The group of 16 to 39 year old respondents uses the "just" category more often than the group of 40 to 65 year old respondents. Respondents aged 66 years and older lie between the other two groups, with an average of 8.5 vignettes rated as just. One could assume that choosing the "just" category reflects the wish to speed up the evaluation task and goes hand in hand with fatigue effects, as respondents have to complete only one rating step instead of three. Both would imply a pattern in which more "just" ratings are found at the end of the module. However, the correlation between the position of the vignette within the module and the rating is low (see Figure 6). This means that there is no indication of more "just" ratings in later positions. The use of the category "just" therefore does not seem to reflect fatigue effects or the desire to speed up the evaluation task. Do the respondents really use the full 100-point scale to differentiate their judgments?
The average number of different values used is 8.53; the median is 8. As seen in Table 12, there are differences between educational groups. Respondents with a lower secondary school certificate (Hauptschule) use significantly fewer values (p<.05) for the income rating than respondents with a higher secondary school certificate (Abitur). This could indicate that people with higher education use the scale in a more fine-grained fashion. Similar results exist with regard to magnitude scales in methodological studies on conjoint analysis (Steenkamp and Wittink 1994; Teas 1987). The age groups do not differ significantly from each other (Table 13). This analysis provides only a first hint about how the respondents used the scale. In a next step we focus on the range of values and the distances between them. Table 14 shows which numbers were used. Nearly eight percent of the respondents used numbers at a distance of 25 in each of the 24 vignettes (25, 50, 75, 100). For two thirds a ten-point scale would have been appropriate, as they only used steps of ten. Together with persons who additionally used finer steps of five (which are covered by a 20-point scale), 90 percent of the respondents are covered. Only ten percent use additional numbers, and thus only this small group would be restricted in their ratings by a 20-point scale instead of a 100-point scale.
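The classification of respondents by the step size they used can be sketched as follows. This is one possible operationalization, not necessarily the one behind Table 14: a respondent is assigned the coarsest step (25, 10, 5 or 1) of which all of his or her non-zero ratings are multiples, and the function name is chosen for illustration.

```python
def scale_granularity(ratings):
    """Return the coarsest step (25, 10, 5 or 1) that covers all non-zero ratings.

    ratings: signed ratings of one respondent on the -100..+100 scale.
    Returns None if the respondent rated every vignette as just (all zeros).
    """
    used = {abs(r) for r in ratings if r != 0}
    if not used:
        return None
    for step in (25, 10, 5):
        if all(value % step == 0 for value in used):
            return step
    return 1  # at least one rating is not a multiple of five
```

Note that under this convention a respondent who only uses 50 and 100 is assigned the step 25, since those values are also multiples of 25; a step of 10 corresponds to a ten-point scale and a step of 5 to a 20-point scale.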

Consistency of Judgments
There are two recognized strategies to check the consistency of responses in factorial surveys. Both rely on the results of individual regression models in which the rating is the dependent variable and the vignette dimensions are the independent variables. The first strategy is to look at the model fit during different response sequences (in OLS models the model fit would be given by the R²). The questions are: Is there an adaptation phase during the first judgments, so that this sequence is not comparable to later ones? Is there a phase in which respondents judge most consistently? Is there evidence of fatigue effects at the end of the vignette module? To answer these questions we compare the consistency, in terms of R², across different phases of the vignette module.
The second strategy is to investigate how consistency depends on respondent attributes such as age or education level. 14 The design of vignettes is in fact more complex than that of item batteries (completing a rating can take up to three steps). We may therefore ask whether old and young as well as highly and less educated respondents are equally able to deliver consistent responses. To answer this question, we compare the model fits for each age and education group.
The following analyses are based on a multinomial logit model, which treats the scale level that was reached conservatively. 15 We transform the (metric) dependent variable into one with three outcomes: -1 (unjustly underpaid), 0 (just) and 1 (unjustly overpaid). For categorical regression models we measure the model fit by McFadden's Pseudo-R² (Long 1997; Long and Freese 2006). Unlike the R² in OLS models, the Pseudo-R² does not measure the share of explained variance, but it gives an indication of the goodness of fit of the model and, at the same time, of the consistency of respondent behavior. The ten dimensions are included in the model as independent variables (see Table 1). Figure 7 lists the Pseudo-R² values in six phases of the vignette module (multinomial logit models for each sequence, taking the clustered data structure into account). 16 Every sequence includes the judgments of all respondents; that is, the results stem from pooled regressions. The most consistent phase is the fifth (vignettes 17 to 20), with a Pseudo-R² of .4. In the first sequence the Pseudo-R² is far below .4 and is also the lowest value of all phases. In the middle part the Pseudo-R² is slightly below .4. Overall, the goodness of fit indicates only marginal differences between the phases of the vignette module. At first glance these results imply the absence of fatigue effects in the vignette part.

But the respondents could also have produced consistent ratings by fading out some dimensions. In a further step we therefore investigated the number of significant effects (at the .05 level) per sequence. A multinomial logit model that includes all ten vignette attributes has 15 coefficients for the independent variables (because marital status, vocational training, performance, firm size and economic situation of the firm are each split into dummy variables). The dependent variable has three categories. Excluding the constants, the multinomial logit model therefore estimates 15 x 2 = 30 coefficients, so that at most 30 coefficients can be significant. Figure 8 displays the number of significant coefficients (grey bars, left scale) together with the Pseudo-R² from Figure 7 (connected line, right scale). The figure contrasts strongly with the results above. In the first, third and sixth phase of the vignette module more than 20 coefficients are significant. In the second phase 20 coefficients are significant, but in the fourth and fifth sequence we find only 16 significant effects. Figure 8 clearly highlights the difference between consistency as measured by the Pseudo-R² and the number of significant coefficients. The highest Pseudo-R² and, at the same time, the lowest number of significant effects are found in the fifth sequence. The respondents seemingly reach a higher consistency by using a heuristic to simplify their ratings. One could object that these results are an artifact of the specific split into six parts, so the findings have to be confirmed by analyzing finer and coarser splits. The results indeed remain stable in alternative splits. 17

In the remainder of this analysis we turn to the second strategy, the analysis of differences between age and education groups. As the previous analyses have shown, both measures have to be taken into account: the Pseudo-R² on the one hand and the number of significant coefficients on the other. We estimate pooled multinomial logit models by age and education group and report the model fits in Table 15. 18 The results for the education groups show that respondents with Abitur achieved marginally higher Pseudo-R² values than the other two groups (.38 compared to .37 (Realschule) and .35 (Hauptschule)). The number of significant coefficients varied inversely with education level. The regression models for respondents with Abitur yield 16 significant coefficients; for the other groups, 17.6 (Realschule) and 17.9 (Hauptschule) can be reported. Common to both measures is that the differences between the groups are marginal.

14 However, most researchers use the consistency measured by the model fit to underline that their dimensions are adequate. This criterion is not sufficient (Auspurg et al. 2009a), because respondents may also produce consistent judgments when they fade out some dimensions in cases of overburdening or fatigue. Therefore, the effect sizes of the coefficients and their significance (with a standardized number of cases) are also important.

15 Alternatively, censored data can be estimated with Tobit regression models. In this case the zero leads to a modeling problem, which is the reason for choosing a logit model.

17 Pseudo-R² values for a split into two halves are .35 and .37; the numbers of significant coefficients are 20 and 18. Additional analyses with three groups also show that the number of significant dimensions decreases during the response process while the R² values increase. The different numbers of significant coefficients could also result from correlations of differing strength between the independent variables (due to differently 'efficient' vignette samples, compare Chapter 2 and 3.2). However, the correlation tables for the individual phases presented in the appendix show that this interpretation can be ruled out: the correlations and variances of the vignette variables differ only marginally between the phases (compare Tables A2).

18 One has to take into consideration that the number of observations differs between the education groups. To avoid distortions, we drew ten random samples of size N=245 from the respondents with the education levels Hauptschule and Realschule. We proceeded in the same way for the age groups. The table reports the respective means over these samples.
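The subsampling procedure described in footnote 18 can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the `statistic` callback, the seed and the toy data are hypothetical, and drawing without replacement is an assumption, since the footnote does not specify it.

```python
import random
from statistics import mean


def mean_over_subsamples(respondent_ids, statistic, n=245, draws=10, seed=0):
    """Average a statistic over repeated random subsamples of equal size.

    `statistic` is a hypothetical callable that maps a list of respondent
    ids to a number, for example the count of significant coefficients of
    a model estimated on those respondents only.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(draws):
        sample = rng.sample(respondent_ids, n)  # assumption: without replacement
        results.append(statistic(sample))
    return mean(results)


# Toy usage: 400 hypothetical respondent ids and a placeholder statistic.
ids = list(range(400))
share_even = mean_over_subsamples(ids, lambda s: sum(i % 2 == 0 for i in s) / len(s))
```

Averaging over several draws of a common size keeps the number of observations comparable across groups, which matters because the number of significant coefficients depends on sample size.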
For the three age groups we find that respondents in the middle group achieved the highest Pseudo-R² (.37) compared to the others (.36 in the youngest group and .35 in the oldest group). The differences are, however, marginal. The number of significant coefficients differs more clearly: the regression for the youngest group shows 22 significant effects, for the middle group 18.4 and for the oldest group 14.7. This could be a hint of a fatigue effect in the oldest group. However, further analyses controlling for the six phases indicate relatively constant respondent behavior regarding significant coefficients and model fit in this age group (not displayed). We thus find differences in the number of significant effects between age groups, but we cannot conclude that they reflect a fatigue effect rather than different justice evaluations. The analysis of the consistency of ratings shows that response quality cannot be measured by goodness-of-fit values such as the Pseudo-R² alone. Respondents who take fewer dimensions into account may achieve a Pseudo-R² as high as those who do not use such heuristics. An examination of the significant coefficients indicates that the best sequence (that is, the one with the broadest judgments) lies between the ninth and twelfth vignette.
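The fit measure used throughout this section can be sketched in Python. McFadden's Pseudo-R² is one minus the ratio of the log-likelihood of the fitted model to that of an intercept-only model. The sketch assumes that the metric rating is signed (negative for underpaid, zero for just, positive for overpaid), which the text does not state explicitly; the ratings and the fitted log-likelihood are hypothetical.

```python
import math
from collections import Counter


def recode(rating):
    """Collapse a signed metric rating to -1 (underpaid), 0 (just) or 1 (overpaid)."""
    return (rating > 0) - (rating < 0)


def null_log_likelihood(outcomes):
    """Log-likelihood of an intercept-only model, which predicts the observed shares."""
    n = len(outcomes)
    return sum(c * math.log(c / n) for c in Counter(outcomes).values())


def mcfadden_pseudo_r2(ll_model, ll_null):
    """McFadden's Pseudo-R²: 1 - lnL(fitted model) / lnL(intercept-only model)."""
    return 1.0 - ll_model / ll_null


ratings = [-40, -10, 0, 0, 25, 60]        # hypothetical signed ratings
outcomes = [recode(r) for r in ratings]   # three-category dependent variable
ll_null = null_log_likelihood(outcomes)
ll_model = 0.6 * ll_null                  # hypothetical fitted log-likelihood
pseudo_r2 = mcfadden_pseudo_r2(ll_model, ll_null)
```

Because the measure depends only on the two log-likelihoods, a model that fits a simplified judgment rule well can reach a high value even when few dimensions matter, which is why the number of significant coefficients is examined alongside it.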

Justice Attitudes
To show the potential of the factorial design for analyzing different research questions, we present in the following some substantive results on the justice attitudes of the participants in the SOEP-Pretest 2008. These results stem from a multinomial logit model as introduced in Chapter 4. The analyses are based on the 24 evaluations that each of the 1066 respondents made within the vignette module, resulting in 25,584 justice judgments. The logit model is estimated with robust standard errors that correct for the clustered data structure. The dependent variable is the categorical variable of justice evaluations (see Table 9). The independent variables are the vignette dimensions described in Table 1. To make the table easier to interpret, marginal effects are reported instead of beta coefficients. For example, in Table 18, the likelihood that the income is perceived as too low (underrewarded) increases by 2.35 percentage points if the described person has a university degree instead of no vocational training. Without vocational training, the likelihood of being overrewarded decreases by 6.14 percentage points.
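How a marginal effect of a dummy variable follows from multinomial logit coefficients can be illustrated with a small Python sketch. The coefficients below are hypothetical and are not taken from Table 18, and the paper does not state how its marginal effects were computed; the sketch shows a simple discrete change in predicted probabilities, with "just" assumed as the base category.

```python
import math


def probabilities(eta_under, eta_over):
    """Multinomial logit probabilities; 'just' is the base category (predictor 0)."""
    exps = [math.exp(eta_under), 1.0, math.exp(eta_over)]
    total = sum(exps)
    return [e / total for e in exps]  # order: under, just, over


# Hypothetical coefficients for a single dummy (university degree yes/no).
COEF = {
    "under": {"const": -0.5, "degree": 0.2},
    "over": {"const": -0.3, "degree": -0.4},
}


def predict(degree):
    """Predicted probabilities for degree = 0 or 1."""
    eta_under = COEF["under"]["const"] + COEF["under"]["degree"] * degree
    eta_over = COEF["over"]["const"] + COEF["over"]["degree"] * degree
    return probabilities(eta_under, eta_over)


def discrete_change():
    """Change in each outcome probability when the dummy switches from 0 to 1."""
    with_degree, without_degree = predict(1), predict(0)
    return [a - b for a, b in zip(with_degree, without_degree)]


effect = discrete_change()  # order: under, just, over
```

The changes across the three outcomes always sum to zero, so a higher probability of one rating necessarily comes at the expense of the others. Multiplying each change by 100 gives percentage points.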
We consider the variables one by one. The gender of the vignette person has a significant effect on the justice evaluation. The income of male earners is more often rated as too low or just than the income of female earners. Because the regression model controls for all other covariates, this effect can be attributed to the gender of the vignette person. This is evidence of the just gender wage gap known from earlier studies (Jann 2003; Jasso and Webster 1997, 1999).
The age of the vignette person is also a relevant criterion for the justice evaluation. Respondents award higher wages to older people, which can be seen as an indication of the seniority principle.
The dimension "vocational training" enters the model as a set of dummy variables. The reference category is a vignette person without vocational training. Vocational training and a university degree are relevant for income evaluations independently of the actual occupation; in other words, more training should lead to a higher income.
The same effect can be observed for occupational prestige, operationalized by the magnitude prestige score (MPS): people in occupations with higher prestige should earn more than people in occupations with lower prestige.
The gross income of the vignette person is the basis of the respondent's justice evaluation and has a strong effect. The higher the stated income, the more likely it is to be perceived as just or even too high (for each additional 1,000 euros of income, the likelihood of being rated as overrewarded increases by 9.45 percentage points). Performance on the job (included in the model as a set of dummy variables with the reference category "below average") also has the expected effect: higher performance should lead to higher pay. A person who performs below average instead of above average has a higher probability (13.7 percentage points) of being rated as overrewarded.
The economic situation of the firm has an impact, too. Compared to the reference category, employees in companies with high profits are more often in the "just" category. Vignette persons who work in companies threatened by bankruptcy are more likely to be in the overrewarded group. Company size has a similar but smaller effect: in medium and large companies the likelihood of being in the overrewarded category is lower than in small companies.
The last two dimensions relate to the family situation. Married sole earners are the reference group. Singles and double earners are more often in the overrewarded category. From the respondents' viewpoint, sole earners with a partner should earn more. The number of children is also a relevant predictor of the justice evaluation: the more children the vignette person has, the more this person should earn.
Factorial surveys not only provide the opportunity to analyze how respondents react to the variation of certain vignette dimensions; they also allow studying differences in response behavior between groups of respondents and the effects of respondent characteristics on the evaluations. As an example of this kind of analysis, we ask in a first step whether the occupation of a vignette person has a different effect on the justice evaluation depending on the age or the education of the respondent. Our representative sample is well suited to this kind of substantive question and offers more analytical potential than homogeneous student samples or other specific populations, which have so far been the standard in factorial surveys.
We categorize our respondents into the same age and educational groups as in the preceding analyses. For each age or educational group we calculate the mean justice evaluations for the ten occupations of the vignette persons (Figure 10). 19 Interestingly, the curves of the three age groups and the three educational groups do not differ in shape. There is a consensus, across age groups as well as education groups, about what people in the ten occupations should earn and, more importantly, that people in occupations with higher prestige should earn more than people in occupations with low prestige. It is noteworthy that social work professionals and locomotive engine drivers deviate slightly from this order: these two occupational groups should earn a bit more than their prestige scores suggest.

To indicate the further potential of factorial surveys, we present two more respondent-specific analyses. In the regression model, performance has a significant impact on the justice evaluation. To find out whether this is a consensual judgment, three groups besides the unemployed are considered. All employees in the SOEP-Pretest had to rate their own performance on the job, analogous to the three categories of the vignette dimension "performance on the job". Figure 11 shows that these groups of respondents differ in their income evaluations. Respondents with low performance on the job do not consider vignette performance in the same way as respondents who state that they perform well. 20 Another dimension with a significant effect in the model is the gender of the vignette person: both men and women assign men a higher just income, and the effect sizes are similar as well.

These examples indicate some of the possibilities for analysis. For multivariate models, other procedures are applicable that treat the level of measurement of the dependent variable less conservatively. Furthermore, additional substantive analyses of subgroup-specific judgment rules are conceivable, especially regarding interactions between vignette and respondent attributes.
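The group comparison behind Figure 10, mean justice evaluations per respondent group and vignette occupation, amounts to a simple grouped mean. The following Python sketch is only an illustration: the record layout, the function name and all values are hypothetical, and negative values are assumed to stand for "underpaid".

```python
from collections import defaultdict
from statistics import mean


def group_means(records):
    """Mean evaluation for every (respondent group, vignette occupation) cell.

    `records` is an iterable of (group, occupation, evaluation) tuples.
    """
    cells = defaultdict(list)
    for group, occupation, evaluation in records:
        cells[(group, occupation)].append(evaluation)
    return {cell: mean(values) for cell, values in cells.items()}


# Hypothetical evaluations (not real SOEP data).
records = [
    ("16-39", "social work professional", -10),
    ("16-39", "social work professional", -30),
    ("16-39", "locomotive engine driver", -20),
    ("66+", "social work professional", -20),
    ("66+", "locomotive engine driver", -40),
    ("66+", "locomotive engine driver", -10),
]
means = group_means(records)
```

Plotting these cell means by occupation, with one line per respondent group, would yield curves of the kind compared in the text.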

Conclusions
This research note describes the factorial survey and its implementation in the SOEP-Pretest 2008. The main research objective of this study was to investigate the applicability of factorial surveys in large population surveys. To this end, we created a vignette module as part of the CAPI questionnaire, with 24 descriptions of full-time employees. Respondents expressed their ratings in a three-step procedure. Afterwards, interviewers and respondents gave feedback on the difficulty and comprehensibility of the rating task. We analyzed these evaluations and the response behavior within the vignette module in order to gain detailed insights into the usability of this method.
To sum up the most important methodological results: (1) The factorial survey is a useful instrument for attitude measurement if researchers observe certain conditions, such as the creation of realistic vignettes. Respondents of all age and education groups are capable of rating the vignettes. The higher complexity compared to standard survey questions seems to be manageable for the vast majority of respondents. [...] shows that respondents use only a few values, clustering at round values (50, 100). There is therefore no need for large response scales exceeding the range commonly used in other SOEP item batteries (e.g. life satisfaction). A rating scale that is familiar to SOEP respondents can cover most ratings. (4) The analysis of the consistency of response behavior shows that the average respondent can deal with the rating tasks of factorial surveys.
The second objective of this investigation was to learn more about respondents' attitudes towards income justice. The results show that, besides occupation, vocational training and performance (factors directly related to employment), familial aspects such as marital status, the occupational status of the partner and the number of children are relevant criteria for justice evaluations, too. The factorial survey offers diverse analytical potential, both with respect to methodological research problems and with regard to substantive research questions. The positive experience with the SOEP-Pretest 2008 encourages the use of vignettes in the main survey.