Seminar 4: Measurement Validity in the Social Sciences.


Over the last three weeks we discussed how to construct a research question, compare cases and form concepts, as well as the importance of theoretical puzzles.

We are now confronted with the question of how to operationalise our concepts, develop indicators, obtain good data and make valid measurements.

Thinking about data and measurement is integral to all steps in a research project.

Data can be defined as “any form of systematic empirical observation that will enable you to answer your research question”. This link between concept formation, measurement and data collection is crucial for all researchers: qualitative and quantitative.

Measurement validity 

The process of linking concepts, indicators and operationalisation is the problem of measurement validity: making sure you measure what you think you are measuring.

In addition to valid measurement, as a researcher you’ll also be expected to present reliable evidence (I should see what you see if I am looking at the data in the same way) and replicable evidence (I should be able to replicate your results).

Validity, the central topic of this seminar, is about making sure the concepts you use are correctly expressed in the measurements you use.

Sometimes this is simple: daily calorie intake can be taken as a good indicator of diet. It is also comparable across the population.

But some concepts (in fact, a lot of concepts) in political and social science do not easily translate into comparable data or measures. Think about wealth, democracy, inequality, informal labour, unpaid work, competitiveness, economic freedom, structural reform.

The fuzziness of many social science concepts suggests that we need to be particularly careful about measurement validity in comparative research.

If a researcher cannot present valid measures of a core concept in their research project then it’s going to make communication with their supervisor very difficult.

Discuss: why is measurement important?

For example, what indicators can we use to measure the health of the economy? GDP, unemployment, employment rate, current account balance, happiness?

Operational definitions 

Last week we discussed concept formation. But when it comes to measurement we have to assume relative agreement on the systematised concept in order to operationalise it.

Hence, this week we are moving away from the process of interpreting the meanings and understandings associated with a concept to the process of developing measures.

In comparative case study research, this involves generating the operational definitions employed in classifying cases, and then developing scores for these cases.

For example, many international organizations have developed synthetic indicators to try to capture complex multidimensional economic concepts such as competitiveness.

But what do we mean when we say a country has “lost competitiveness”? Can we really use this concept for describing the economy of nation-states and/or regions?

Concept formation is a philosophical conflict over meaning. Measurement validity is about trying to find good indicators to operationalise a concept within a scientific community.

Let’s continue with the example of “competitiveness”. What indicators would we select to score countries on this measure? What would it look like if we saw it?

Now compare this with a concept such as “enterprise or industrial policy”.

Break into groups of 3 and discuss how you would operationalise and develop indicators for both of these concepts.

Measurement validity is not a philosophical debate. It is about making sure that we measure what we think we are measuring, using an adequate set of indicators to score cases.

Data collection 

If we cannot measure a concept directly using a given set of indicators, we should try to measure its observable consequences.

For example, how do we assess the theory that a comet hit the earth and wiped out the dinosaurs if we cannot observe the event itself? What are the observable implications of this hypothesis? Answering this question is the process of data collection.

Reliability and replicability are much more complicated in a discursive research setting because the data literally does not exist without the researcher’s interpretation.

But this makes systematic data collection all the more important: archival research, interviews and content analysis require ordering and reporting, such that your reader (or examiner) can study and examine the data you have used to make your argument.

Equivalence and contextual specificity 

The number of public databases and official statistics has increased exponentially over the past few years, because of advances in technology and communication systems.

These databases are invaluable for researchers but they were usually not designed for the question a researcher has in mind.

For example, countries measure unemployment in different ways, but the OECD uses a standardized rate, by imposing a uniform definition across all cases. Standardising indicators in comparative politics creates problems of equivalence and contextual specificity.
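To make the equivalence problem concrete, here is a minimal sketch of how the same labour market can produce two different unemployment rates depending on the definition applied. The counts and the two definitions ("narrow" and "broad") are hypothetical and chosen only for illustration; they are not the OECD's or any national agency's actual rules.

```python
# Hypothetical labour-market counts for one country (illustrative only).
counts = {
    "employed": 2_000_000,
    "searching_and_available": 150_000,  # no job, actively searching, available
    "discouraged": 50_000,               # no job, want one, stopped searching
}


def unemployment_rate(unemployed, employed):
    """Unemployed as a percentage of the labour force (employed + unemployed)."""
    labour_force = employed + unemployed
    return 100 * unemployed / labour_force


# Narrow definition (assumed): only active job-seekers count as unemployed.
narrow = unemployment_rate(counts["searching_and_available"], counts["employed"])

# Broad definition (assumed): discouraged workers are counted as well.
broad = unemployment_rate(
    counts["searching_and_available"] + counts["discouraged"],
    counts["employed"],
)

print(f"narrow definition: {narrow:.1f}%")
print(f"broad definition:  {broad:.1f}%")
```

Neither figure is wrong. They answer different questions, which is why a uniform definition buys comparability at the possible cost of contextual specificity.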

Think about this in the Irish case today: what is the impact of using ‘unemployed’ rather than ‘jobless’ in the measurement of the unemployment rate?

To take another example, GDP figures in Ireland are infamously unreliable. This is because aggregate productivity is skewed by a handful of firms engaged in transfer-pricing, whereby it “appears” that a given level of economic activity takes place in Ireland, whereas in fact it is an accounting exercise for tax avoidance purposes.

National income, therefore, is a better measure to capture the contextual specificity of economic activity in the Irish case, whereas GDP might be perfectly reasonable in most other advanced economies of Europe.
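A rough sketch of why the choice between GDP and national income matters more in some contexts than in others. It assumes the standard accounting identity that gross national income (GNI) equals GDP plus net primary income from abroad. The two economies and all figures are hypothetical, not actual Irish or European data.

```python
def gni(gdp, net_primary_income_from_abroad):
    """Gross national income = GDP + net primary income from abroad."""
    return gdp + net_primary_income_from_abroad


def gap_percent(gdp, net_primary_income_from_abroad):
    """How far GDP overstates GNI, as a percentage of GDP."""
    return 100 * (gdp - gni(gdp, net_primary_income_from_abroad)) / gdp


# Hypothetical figures in billions of euro (illustrative only).
economies = {
    # Large profit outflows from foreign-owned firms.
    "Economy A": {"gdp": 400.0, "net_income_abroad": -100.0},
    # Small net outflows.
    "Economy B": {"gdp": 400.0, "net_income_abroad": -4.0},
}

gaps = {
    name: gap_percent(e["gdp"], e["net_income_abroad"])
    for name, e in economies.items()
}

for name, gap in gaps.items():
    print(f"{name}: GDP exceeds GNI by {gap:.0f}% of GDP")
```

In the second economy the two indicators are close to interchangeable. In the first, the same GDP score means something quite different.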

Getting as close as possible to the production of data is always preferable, as it allows you to engage in a dialogue with the researchers and scientific community who produce the data. You can contact researchers, ask questions, get clarifications, and find out why they measure things the way they do.

How would you assess the validity and reliability of secondary data?

Survey data/interview data

Surveys are the most popular way to collect data. The obvious problem with survey data is that you have to assume that the respondents mean what they say when answering questions (income declarations) and that the question is properly understood (scales: agree, strongly agree, somewhat agree, neither agree nor disagree).

Most case-study researchers in political science will generate data via elite interviews or surveys, together with secondary sources: archives, research reports, focus groups, official reports, newspaper articles. Reliability often becomes the problem.

All of this is useful but does not get away from the measurement problem: making sure your indicators adequately reflect the concept you are trying to operationalize.

Figure 1 illustrates a four-step process to operationalize a contested concept such as “structural reform”, which I have defined as “cost competitiveness” and then used “unit labour costs” as a comparative measure, and scored cases from 1 to 100.

The overall objective of such a measure, in policy terms, is to improve economic growth.
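The scoring step described above could be sketched as follows. This is only one possible scoring rule (min-max rescaling), not necessarily the one behind Figure 1. It assumes that lower unit labour costs should receive a higher "cost competitiveness" score, and the country names and ULC values are hypothetical.

```python
def score_cases(ulc_by_case, low=1.0, high=100.0):
    """Rescale ULC values to scores between `low` and `high`.

    Assumption: the case with the lowest ULC gets the highest score.
    """
    min_ulc = min(ulc_by_case.values())
    max_ulc = max(ulc_by_case.values())
    if max_ulc == min_ulc:
        # All cases are identical, so they all receive the same score.
        return {case: high for case in ulc_by_case}
    span = max_ulc - min_ulc
    return {
        case: round(high - (ulc - min_ulc) / span * (high - low), 1)
        for case, ulc in ulc_by_case.items()
    }


# Hypothetical ULC index values (illustrative only).
ulc_index = {"Country A": 95.0, "Country B": 110.0, "Country C": 125.0}

scores = score_cases(ulc_index)
for case, score in scores.items():
    print(f"{case}: {score}")
```

The code is perfectly replicable, yet every choice in it (the indicator, the direction of the scale, the set of cases that fixes the minimum and maximum) is an operationalization decision that can be contested on validity grounds.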



Think about the problem with this! Measurement validity is specifically concerned with whether operationalization reflects the concept the researcher seeks to measure. This measure of structural reform is defined as cost competitiveness and constructed around a replicable set of indicators that all researchers can critique and engage with.

But does it really measure what it is supposed to measure? Whilst it might be externally valid, is it internally valid?

Are ULCs really a good measure and/or indicator of competitiveness? What are the policy implications of using ULCs as opposed to public infrastructure as a measure?


Downward and upward movement in Figure 1 (adapted from Adcock & Collier 2001) can be understood as a set of research tasks.

Measurement is valid if the scores derived from a given indicator can be meaningfully interpreted in terms of the systematized concept that the indicator seeks to operationalize.

Systematic error arises when the links between concept, indicator, and scores are poorly developed. This happens more often than you might think in social research.

For example, is counting newspaper articles a good measure of media bias? Is taxation as a percentage of GDP a good measure of whether a country is a low-tax or high-tax country? Probably not. Therefore, better measures need to be developed.

Discuss: is there a trade-off between precision and validity? Is there a trade-off between internal validity and external validity?

What should we be most concerned about?

It is important to note that the same score on an indicator may have different meanings in different contexts. Context matters in comparative politics, and the only way to avoid error is to engage in reflexive, careful reasoning at each stage of operationalization.


