Which Rating Scales Should I Use?

Summary: Different survey rating scales are bound to yield different results since we are mainly dealing with human perception. We need to be careful in how we use them.

2 minutes to read. By author Michaela Mora on June 20, 2022
Topics: Analysis Techniques, Market Research, Survey Design


Clients often request that we review surveys or analyze data collected via surveys they developed themselves. More often than not, I find rating scales of different sizes and directions within the same survey. When I ask why, I get answers such as “This is the one we have always used.”

Reliability

It seems this question type is often chosen based on preference or habit (e.g., legacy surveys). This is not surprising since there is no consensus on which scales work best. They all yield different results, which is disheartening in a way. Consequently, the search for reliability becomes a priority in the use of rating scales.

Reliability refers to the extent to which a scale produces consistent results if repeated measurements are made within the same study or across studies. Systematic errors that affect measurements in constant ways don’t necessarily affect reliability.

We assess reliability by determining the proportion of systematic variation in a scale, estimated through the correlation between the scores obtained from different administrations of the scale. If the correlation is high, the scale yields consistent results and is therefore considered reliable.

Validity

The validity of a scale refers to the extent to which differences in observed scale scores reflect true differences among the characteristics being measured rather than systematic or random errors. There are several categories of validity.

Content Validity

This is sometimes called “face validity” and refers to a subjective evaluation of how well the content of the items included in rating questions covers the domain we are studying. It is about the statements used to represent the phenomenon we are studying and helps in a common-sense interpretation of the scale scores.

Criterion Validity

This type of validity evaluates whether the measured items perform as expected in relationship with other meaningful criteria included in the study (e.g., behavioral, attitudinal measures, demographics, etc.).

Construct Validity

In construct validity, we address the question of what characteristic we are actually measuring. This is connected to the underlying theory we use to develop the items measured with the rating questions. In practice, we often don’t know what items should be included to describe the phenomenon we are trying to study (e.g., drivers of user experience, customer satisfaction, brand attitudes, barriers to purchases, etc.), so conducting exploratory qualitative research to develop relevant items and support construct validity is highly recommended. Otherwise, we are just guessing or working from biased assumptions that may miss key aspects of what we are trying to study.

Convergent Validity

This is the extent to which the items used in the rating questions correlate positively with other measures of the same construct, even if they are not measured with rating scales.

Reliability vs. Validity

Reliability and validity are related in ways that sometimes sound counter-intuitive to those not familiar with measurement scales and statistics.

  • Perfect validity implies perfect reliability: If the question items accurately reflect and measure the characteristics we are studying, we will get valid and reliable measurements every time. Conversely, if a measure is unreliable, it can’t be valid.
  • Perfect reliability may or may not imply perfect validity: Consistent and reliable measures may have systematic errors. We can get the same bad measurement every time! Consequently, reliability is necessary but not a sufficient condition for validity.

Scaling Technique Choice

In addition to issues concerning the reliability and validity of the items being included in a question, we need to consider:

  • The level of data (nominal, ordinal, interval, ratio) we need for the analytical approach selected in the study.
  • The characteristics of shown stimuli.
  • Data collection mode (e.g., online, phone, paper, in-person).
  • Cost of administration.

The most commonly used scaling technique in market research surveys is the itemized rating scale: a measurement scale with numbers and/or labels associated with each scale point, presented in a particular order.

The most popular types are:

  • Likert Scale: Respondents indicate their level of agreement or disagreement with a series of statements about the study subject, typically using 5 response categories from “strongly disagree” to “strongly agree.”
  • Semantic Differential Scale: A 7-point rating scale with endpoints associated with bipolar labels that have semantic meaning (e.g., bipolar adjectives like “friendly” and “unfriendly”).

Variations of the Likert and Semantic Differential scales abound. They have been adapted to different topics and extended to different numbers of scale points and labels. The debate on whether to use bipolar or unipolar scales and research trying to find definitive answers continues to this day.

What The Research Says

A lot of research has been dedicated to this subject. Unfortunately, there is no simple answer to the question of which rating scales we should use.

 

Research On Rating Scales
Source: International Journal of Social Research Methodology, Vol. 13, No.1 Feb. 2010, 17-27 (Hartley and Betts)

Advantages of Rating Questions

  • They are a familiar question format to internal stakeholders, researchers, and participants.
  • They are easy to implement.
  • They are used in many types of statistical analyses, and results are easy to report (e.g., means, Top2Box frequencies).
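As a small illustration of the two summaries mentioned above (means and Top2Box frequencies), here is a sketch with hypothetical 5-point Likert responses:

```python
# Hypothetical responses to one statement on a 5-point Likert scale
# (1 = strongly disagree, 5 = strongly agree).
responses = [5, 4, 4, 3, 5, 2, 4, 1, 5, 4]

# Mean score across respondents.
mean_score = sum(responses) / len(responses)

# Top2Box: share of respondents choosing the top two categories (4 or 5).
top2box = sum(1 for r in responses if r >= 4) / len(responses)

print(f"Mean: {mean_score:.2f}")    # 3.70
print(f"Top2Box: {top2box:.0%}")    # 70%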

Disadvantages of Rating Questions

  • Rating categories (scale points) are subject to different interpretations by respondents depending on cultural and educational backgrounds, language use, and individual experiences.
  • Scale points are not equidistant in respondents’ minds, making interpretation of frequencies and means difficult. 
  • Low discriminatory power in comparative analyses as respondents don’t have to make trade-offs.
  • Susceptible to acquiescence (e.g., driven by socially desirable answers) and satisficing behaviors (selecting answers that require the least effort while meeting minimal requirements, often seen in straight-lining – selecting the same scale point across all items).
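The straight-lining behavior described above can be flagged with a simple data-quality check. A minimal sketch, using a hypothetical respondent-by-item grid of ratings:

```python
# Hypothetical ratings: respondent ID -> ratings across a battery of items.
grid = {
    "r001": [3, 4, 2, 5, 3],
    "r002": [4, 4, 4, 4, 4],  # straight-liner: identical answer to every item
    "r003": [5, 3, 4, 4, 2],
}

def is_straight_liner(ratings):
    """True if every item in the battery received the identical scale point."""
    return len(set(ratings)) == 1

flagged = [rid for rid, ratings in grid.items() if is_straight_liner(ratings)]
print(flagged)  # ['r002']
```

Real data-cleaning rules are usually less blunt (e.g., flagging low response variance across several batteries rather than a single one), but the idea is the same.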

How to Avoid or Handle Rating Scales

This extensive body of research shows that different rating scales are bound to yield different results since we are mainly dealing with human perception. They mean different things to different people, and the values, words, and order in which we present them have an impact on how they are interpreted. What to do?

  • Whenever possible, favor question formats other than rating scales. For example, MaxDiff has been shown to discriminate better in preference and importance measurements.
  • If you still have to use rating scales, strive for consistency and use them with full knowledge of the bias they introduce in the data, particularly if you want to compare data from different rating scales or from different surveys. This is particularly relevant in tracking studies: a change in rating scale from one wave to another may show artificial significant differences driven mainly by the measurement error introduced by the change in scale.
  • Above all, triangulate the results with other data sources to understand how different scale points correlate with actual behavior, and ask why the person gives a particular rating. If possible, use a text analytics tool to get at the heart of what the scale really means for a respondent. The example below says it all.
Product Rating