PROMIS®

PROMIS measures are scored on the T-score metric. High scores mean more of the concept being measured.

PROMIS, Neuro-QoL, ASCQ-Me℠, and many of the NIH Toolbox® measures use a T-score metric in which 50 is the mean of a relevant reference population and 10 is the standard deviation (SD) of that population.

On the T-score metric:

• A score of 40 is one SD lower than the mean of the reference population.
• A score of 60 is one SD higher than the mean of the reference population.
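The arithmetic behind these two bullets can be sketched in a few lines of Python. This is only an illustration of the T-score metric described above; the function name `sd_units_from_mean` is a hypothetical helper, not part of any PROMIS scoring software.

```python
REFERENCE_MEAN = 50.0  # mean of the reference population on the T-score metric
REFERENCE_SD = 10.0    # standard deviation (SD) of the reference population


def sd_units_from_mean(t_score: float) -> float:
    """Return how many reference-population SDs a T-score lies from the mean.

    Negative values are below the mean, positive values are above it.
    """
    return (t_score - REFERENCE_MEAN) / REFERENCE_SD


for t in (40, 50, 60):
    print(t, sd_units_from_mean(t))
```

A score of 40 comes out at -1.0 and a score of 60 at 1.0, matching the bullets above.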

For PROMIS measures, higher scores equal more of the concept being measured (e.g., more Fatigue, more Physical Function). Thus a score of 60 is one standard deviation above the mean of the reference population. This could be a desirable or undesirable outcome, depending upon the concept being measured.

PROMIS scores have a mean of 50 and standard deviation (SD) of 10 in a referent population.

• Scores 0.5 – 1.0 SD worse than the mean = mild symptoms/impairment
• Scores 1.0 – 2.0 SD worse than the mean = moderate symptoms/impairment
• Scores 2.0 SD or more worse than the mean = severe symptoms/impairment
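As a rough sketch, the three bands above could be applied as follows. Several things here are assumptions rather than part of the guidance: the `higher_is_worse` flag (needed because "worse" means higher for symptoms such as Fatigue but lower for Physical Function), the decision to put a score that falls exactly on a boundary into the more severe band, and returning `None` for scores less than 0.5 SD worse than the mean, for which no label is listed above.

```python
from typing import Optional


def sd_worse_than_mean(t_score: float, higher_is_worse: bool) -> float:
    """SD units by which a T-score is worse than the reference mean of 50."""
    deviation = (t_score - 50.0) / 10.0
    # For symptom measures a higher score is worse; for function measures
    # a lower score is worse, so the sign is flipped.
    return deviation if higher_is_worse else -deviation


def severity_band(t_score: float, higher_is_worse: bool) -> Optional[str]:
    """Map a T-score to the mild / moderate / severe bands listed above.

    Assumption: boundary values (exactly 1.0 or 2.0 SD) go to the more
    severe band. Scores less than 0.5 SD worse than the mean return None.
    """
    worse = sd_worse_than_mean(t_score, higher_is_worse)
    if worse >= 2.0:
        return "severe"
    if worse >= 1.0:
        return "moderate"
    if worse >= 0.5:
        return "mild"
    return None


print(severity_band(62, higher_is_worse=True))   # a Fatigue score of 62
print(severity_band(43, higher_is_worse=False))  # a Physical Function score of 43
```

The Fatigue score of 62 is 1.2 SD worse than the mean (moderate), and the Physical Function score of 43 is 0.7 SD worse than the mean (mild).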

Methods
The cut points or thresholds for PROMIS Global Physical and Mental score categories of excellent, very good, good, fair, and poor were constructed by 1) creating groups based upon responses to Global01 “In general, would you say your health is excellent, very good, good, fair, or poor?”, 2) calculating mean scores for each group, and 3) identifying the midpoint between two adjacent means. For example, the mean Global Mental score for “Excellent” was 61 and the mean score for “Very Good” was 51. The midpoint between these scores is 56. Cut points are:

• Global Mental: 56, 48, 40, 29
• Global Physical: 58, 50, 42, 35
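The midpoint step and the resulting categories can be sketched as below. The cut points are the ones listed above; the function names are hypothetical, and the choice to place a score that exactly equals a cut point in the higher category is an assumption, since the method description does not say how ties are handled.

```python
from bisect import bisect_right

# Categories ordered from lowest to highest scores.
CATEGORIES = ["poor", "fair", "good", "very good", "excellent"]

# Cut points from the list above, sorted in ascending order.
CUT_POINTS = {
    "Global Mental": [29, 40, 48, 56],
    "Global Physical": [35, 42, 50, 58],
}


def midpoint(mean_a: float, mean_b: float) -> float:
    """Midpoint between the mean scores of two adjacent response groups."""
    return (mean_a + mean_b) / 2


def global_category(scale: str, t_score: float) -> str:
    """Return the category for a Global Mental or Global Physical T-score.

    Assumption: a score equal to a cut point falls in the higher category.
    """
    return CATEGORIES[bisect_right(CUT_POINTS[scale], t_score)]


print(midpoint(61, 51))                       # the Excellent / Very Good example
print(global_category("Global Mental", 52))
print(global_category("Global Physical", 33))
```

The midpoint of 61 and 51 reproduces the cut point of 56; a Global Mental score of 52 lands in "very good" and a Global Physical score of 33 in "poor".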

Pediatrics and Parent Proxy
Some pediatric and parent proxy measures use different thresholds.

PROMIS Global scores are interpreted as excellent, very good, good, fair, and poor.

How do PROMIS scores compare to a relevant reference population?

A unique aspect of PROMIS measures is their use of standardized scores that are centered on a relevant reference population. Such scores are called “normative” because their value represents how close or far away they are from a normative population. The word “norm” has different meanings for different contexts. Here, we are not talking about social “norms,” the behaviors we expect from others and ourselves in society. We are also not talking about “normal” per se, even though the term originated from its reference to a standard, bell-shaped distribution curve that labels everything in the vast middle as normal. We use the word norm without applying judgment as to the “normality” of any given score relative to the distribution of scores seen on the same measure in a large group of people. Sometimes we refer to the group as a “reference group” and similarly to “norms” as reference values, because they are points of reference from which to understand a given single score.

The meaning of the score is defined by how it compares to the scores of others in a referent population. For some PROMIS measures there are subpopulation norms.

Norms are based on the distribution of scores on a measure for a well-characterized and relevant population. For example, I am 5’6” tall. I might be interested in comparing my height to that of other people in the world, or maybe just in the United States, since that’s where I live. On the other hand, I am a woman, so average height without regard to gender doesn’t really matter to me; I would rather know my height relative to other women in the United States. According to Wikipedia, the average height for women in the United States is 5’4”, so I’m feeling pretty tall right now. I would not feel quite so tall if I were comparing myself to men in the United States (5’9”), and I would only be eye-to-eye with the average woman in the Netherlands (5’6”). This brings us, finally, to normative score comparisons for health outcomes measures.

Using a Norm to Center a Metric

Let’s start with a distinction between norms that are used for comparative purposes and norms that are used to anchor a scale. Many of the PROMIS measures (see elsewhere on this site) are centered on the mean score of a sample of individuals that, collectively, matched the US 2000 General Census with respect to important demographics (e.g., gender, age, race/ethnicity, education). The beauty of folding the norms into the metric is that it is easy to interpret a score relative to the population whose norm was used. With the PROMIS scores that are centered on the US Census population, the mean is 50 and standard deviation is 10. So, if my fatigue score is 60, I know that I’m not just feeling tired; my fatigue is one standard deviation above the general population (or at least a sample that matched it). But, as of this writing, I’m 62 years old. I’d like to know how my fatigue compares to people my age. That’s where sub-norms can be helpful.

Sub-Norms for Interpreting Scores

Sub-norms divide a relevant population into subgroups to aid interpretation of scores. Above, I compared my height to that of women in the US, not to all people in the world or even to all people in the US. I found it more relevant to compare my height to a sub-norm value—the mean height of women in the US.

It is theoretically possible to develop sub-norms for a measure based on any relevant population, though such data collection can be expensive. As HealthMeasures continues to be used and more data accumulate, however, it may become practical to develop many sub-norms for comparing and interpreting scores.

Fortunately, the initial norming sample for PROMIS was quite large and it was feasible to disaggregate by gender and age ranges in order to estimate sub-group norms. This was done in 2011 for comparing the fatigue and pain by age range in the general population to that of samples of individuals with disabilities. The gender and age range norms were calculated for all PROMIS measures developed in the first phase of PROMIS testing. These were never published but are provided here for users who are interested. Means, standard deviations, and frequencies by domain are reported in the tables below.

WARNING

The original PROMIS norming sample was not powered to develop subgroup norms. The user should pay particular attention to the size and characteristics of the sample used to develop each sub-norm. For example, much larger samples were used to calculate sub-norms for males and females than for the age ranges. More confidence is warranted for sub-norms estimated with larger sample sizes. Nevertheless, these sub-norms can be useful both in comparing samples and in interpreting scores. For reference, consider how they were used in papers by Cook et al. and Molton et al.

Gender and Age Range Sub-norms for Adult PROMIS Measures Centered on the US General Census 2000

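Female and Male are the gender sub-norms; the remaining columns are age ranges in years.

| Domain | Statistic | Female | Male | 18-34 | 35-44 | 45-54 | 55-64 | 65-74 | 75+ |
|---|---|---|---|---|---|---|---|---|---|
| Anger | N | 1865 | 1204 | 730 | 565 | 499 | 495 | 401 | 379 |
| Anger | Mean | 50.6 | 49.1 | 53.0 | 51.5 | 50.4 | 48.8 | 47.5 | 45.7 |
| Anger | SD | 10.2 | 9.6 | 10.7 | 10.3 | 9.5 | 9.7 | 8.7 | 7.9 |
| Anxiety | N | 1654 | 1069 | 659 | 496 | 417 | 442 | 365 | 345 |
| Anxiety | Mean | 50.9 | 48.6 | 52.4 | 50.9 | 50.1 | 49.3 | 48.1 | 46.9 |
| Anxiety | SD | 10.2 | 9.5 | 10.7 | 11.1 | 9.5 | 9.5 | 8.8 | 7.9 |
| Depression | N | 1269 | 890 | 496 | 366 | 359 | 373 | 290 | 276 |
| Depression | Mean | 50.9 | 48.7 | 52.3 | 50.6 | 50.8 | 49.5 | 48.4 | 46.5 |
| Depression | SD | 10.1 | 9.7 | 10.9 | 10.9 | 10.0 | 9.7 | 8.8 | 7.2 |
| Fatigue | N | 1884 | 1183 | 706 | 551 | 513 | 516 | 396 | 385 |
| Fatigue | Mean | 51.1 | 48.2 | 50.5 | 51.0 | 51.6 | 49.7 | 48.1 | 48.0 |
| Fatigue | SD | 10.1 | 9.6 | 9.7 | 10.7 | 10.1 | 10.8 | 9.3 | 8.3 |
| Pain Behavior | N | 1851 | 1199 | 699 | 561 | 507 | 507 | 402 | 374 |
| Pain Behavior | Mean | 50.7 | 49.0 | 47.6 | 50.0 | 52.2 | 51.3 | 50.1 | 49.7 |
| Pain Behavior | SD | 10.1 | 9.7 | 10.2 | 10.6 | 10.1 | 9.7 | 9.3 | 8.7 |
| Pain Interference | N | 1856 | 1180 | 712 | 548 | 499 | 488 | 406 | 383 |
| Pain Interference | Mean | 51.1 | 48.3 | 47.8 | 50.1 | 51.9 | 51.6 | 49.9 | 49.7 |
| Pain Interference | SD | 10.3 | 9.3 | 9.0 | 10.2 | 11.1 | 10.9 | 9.3 | 8.7 |
| Physical Function | N | 2044 | 1363 | 782 | 605 | 567 | 565 | 457 | 431 |
| Physical Function | Mean | 48.9 | 51.7 | 55.1 | 52.0 | 49.0 | 47.5 | 47.2 | 45.6 |
| Physical Function | SD | 10.0 | 9.7 | 8.4 | 9.8 | 10.4 | 10.4 | 9.0 | 8.5 |
| Global Mental Health | N | 3008 | 2206 | 1183 | 863 | 902 | 873 | 715 | 679 |
| Global Mental Health | Mean | 49.4 | 50.8 | 48.5 | 48.4 | 48.2 | 50.3 | 53.1 | 53.4 |
| Global Mental Health | SD | 10.0 | 10.0 | 9.7 | 10.4 | 10.3 | 10.5 | 8.8 | 8.4 |
| Global Physical Health | N | 3015 | 2212 | 1182 | 865 | 910 | 875 | 713 | 683 |
| Global Physical Health | Mean | 49.1 | 51.2 | 51.6 | 50.1 | 48.2 | 48.8 | 51.0 | 49.9 |
| Global Physical Health | SD | 10.1 | 9.8 | 8.4 | 9.8 | 10.9 | 11.3 | 9.9 | 9.2 |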
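One way to use these sub-norms is to express an individual score in SD units relative to a subgroup rather than the general population. The sketch below uses the Fatigue means and SDs from the table above; the dictionary and function names are hypothetical, and expressing the difference in subgroup SD units is one reasonable choice for illustration, not a prescribed PROMIS procedure.

```python
# Fatigue sub-norms copied from the table above: group -> (mean, SD).
FATIGUE_SUBNORMS = {
    "Female": (51.1, 10.1),
    "Male": (48.2, 9.6),
    "18-34": (50.5, 9.7),
    "35-44": (51.0, 10.7),
    "45-54": (51.6, 10.1),
    "55-64": (49.7, 10.8),
    "65-74": (48.1, 9.3),
    "75+": (48.0, 8.3),
}


def sd_units_from_subnorm(t_score: float, group: str) -> float:
    """SD units between a Fatigue T-score and the mean of the given subgroup."""
    mean, sd = FATIGUE_SUBNORMS[group]
    return (t_score - mean) / sd


# A Fatigue score of 60 is 1.0 SD above the general population mean of 50.
# Relative to the 55-64 age range it is slightly less than one subgroup SD
# above that group's mean.
print(round(sd_units_from_subnorm(60, "55-64"), 2))
print(round(sd_units_from_subnorm(60, "Female"), 2))
```

Given the warning above about sample sizes, a comparison like this is best read as approximate, especially for the smaller age-range groups.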

Reference Populations

A T-score is a standardized score, like z-scores and IQ scores. All standardized scores have a “middle” score; it is zero for z-scores, 100 for IQ scores, and 50 for T-scores. This middle score is the mean of a large sample that is representative of a relevant population—a reference population. The large sample used to represent the reference population is called the Centering Sample.

For some PROMIS measures, the reference population (and the centering sample) was a clinical population. This is the case for PROMIS Smoking measures, for which the reference population was daily smokers; the centering sample was a sample of daily smokers. For many PROMIS measures, the reference population was the 2000 General US Census. The centering sample was a large sample of individuals who represented the 2000 US General Census.

Centering Sample and Calibration Sample

It is helpful to remember that the middle score of a standard score range has to be defined. For measures that use a T-score metric, 50 is the mean and 10 is the standard deviation, but the scores do not start out that way. Scores are first estimated using an item response theory (IRT) model, and the IRT-calibrated scores are then transformed to the T-score metric using a linear transformation. Before that transformation, you have to decide which score on the IRT metric is going to be the middle score, that is, a score of 50. This is done by collecting scores from a large sample that represents the reference population and then calculating the mean for that sample. That sample is the centering sample, and its mean becomes the middle score (e.g., 50 for T-scores). The linear transformation then spaces all other scores along the continuum so they have the correct values relative to that middle score.
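The centering step can be sketched as follows. This is a simplified illustration, not the actual PROMIS scoring procedure: the IRT scores (called `theta` here) are made-up values, the function name is hypothetical, and using the sample standard deviation of the centering sample to set the spread of 10 is an assumption.

```python
from statistics import mean, stdev


def make_t_score_transform(centering_thetas):
    """Build a linear transformation from the IRT metric to the T-score metric.

    The mean of the centering sample maps to 50, and one SD of the
    centering sample maps to 10 T-score points.
    """
    center = mean(centering_thetas)   # becomes the middle score (T = 50)
    spread = stdev(centering_thetas)  # assumption: sample SD defines 10 points

    def to_t(theta: float) -> float:
        return 50.0 + 10.0 * (theta - center) / spread

    return to_t


# Made-up IRT scores standing in for a centering sample.
centering_thetas = [-1.0, -0.5, 0.0, 0.5, 1.0]
to_t = make_t_score_transform(centering_thetas)

t_scores = [to_t(theta) for theta in centering_thetas]
print(mean(t_scores), stdev(t_scores))  # approximately 50 and 10
```

Applying the transformation to the centering sample itself gives T-scores with a mean of 50 and an SD of 10, which is exactly what "centering" means here. Scores from any other sample are placed on the same continuum relative to that middle score.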

IMPORTANT: The Centering Sample and the Calibration Sample may not be the same sample.

The purpose of a calibration sample is to estimate item parameters (item characteristics such as difficulty and discrimination) using an item response theory model. Here’s where it can get confusing. Sometimes a single sample was used as both the calibration sample AND the centering sample. Other times one sample was used as the calibration sample and another was used as the centering sample.  Sometimes, a subset of the calibration sample served as the centering sample.

The Reference Population tables show the calibration and the centering samples for PROMIS. Most users will be particularly interested in the last column (Centering Sample). If you want to know what a score in the middle is (e.g., 50 for those scored on a T-score metric), go to the Centering Sample column. For example, if you go to the row for PROMIS-Cancer-Anxiety you will see that the item parameters were estimated (calibrated) using a hybrid of individuals with cancer and individuals from the general population. BUT, the centering sample was the general population. A score of 50 on this measure is comparable to the general population average level of anxiety.

What does the middle score mean?

When developing a measure with standard scores, an important consideration is what the middle score means. The scores of such measures are purposefully “centered” at the mean of a specific sample or subsample. PROMIS uses T-scores, so the middle score is always 50. Centering scores in this way allows quick interpretation of where an individual is on a symptom or outcome compared to others in the reference population. A score of 50 on PROMIS Fatigue, for example, is comparable to the U.S. “average”. T-scores have a standard deviation of 10, so a score of 60 would indicate fatigue that is one standard deviation higher than the U.S. average.

TIP: Failure to be specific about the reference population invites confusion.

This can all get very confusing because sometimes the calibration sample (the sample used to estimate item response theory parameters) and centering sample (the sample used to define the middle of the score range) were the same. But sometimes they were different. For example, a measure may be calibrated in a clinical sample but then centered in the general population. The mean of T=50 for that measure reflects the average in the general population, not the clinical sample.

| PROMIS Adult Item Bank/Scale | Calibration Sample | Centering Sample |
|---|---|---|
| Global Health | General population | General population |
| Emotional Distress – Anger | General population | General population |
| Emotional Distress – Anxiety | General population | General population |
| PROMIS-Cancer – Anxiety | Hybrid* | General population |
| Emotional Distress – Depression | General population | General population |
| PROMIS-Cancer – Depression | Hybrid* | General population |
| Cognitive Function (v2.0) | General population | General population |
| Psychosocial Illness Impact – Positive, Negative | Clinical sample (cancer) | Clinical sample (cancer) |
| Self-Efficacy – Manage Emotions, Meds/Treatment, Social Interactions, Daily Activities, and Symptoms | Clinical sample | Clinical sample |
| Alcohol – Alcohol Use, Positive and Negative Consequences, Positive and Negative Expectancies | General population subset + Clinical sample | General population subset + Clinical sample |
| Smoking – Coping Expectancies, Emotional/Sensory Expectancies, Negative Health Expectancies, Nicotine Dependence, Negative Psychosocial Expectancies, Social Motivations | Clinical sample | Daily smokers |
| Fatigue | General population | General population |
| PROMIS-Cancer – Fatigue | General population | General population |
| Pain – Behavior | General population + Clinical sample | General population |
| Pain – Interference | General population | General population |
| PROMIS-Cancer – Pain Interference | Hybrid* | General population |
| Pain Intensity | People with pain drawn from the general population and pain support groups | People with pain drawn from the general population and pain support groups |
| Physical Function | General population | General population |
| PROMIS-Cancer – Physical Function | Hybrid* | General population |
| Physical Function – Mobility | General population | General population |
| Physical Function – Upper Extremity | General population | General population |
| Physical Function for Samples with Mobility Aid Users | Clinical sample | Clinical sample |
| Sleep Disturbance, Sleep-Related Impairment | General population + Clinical sample | General population + Clinical sample |
| Sexual Function and Satisfaction** (v1.0): Global Satisfaction with Sex Life, Interest in Sexual Activity, Lubrication, Vaginal Discomfort, Erectile Function | Sexually active general population | Sexually active general population |
| Satisfaction with Participation in Discretionary Social Activities (v1.0) | General population | General population |
| Satisfaction with Participation in Social Roles and Activities (v2.0) | General population | General population |
| Ability to Participate in Social Roles and Activities | General population | General population |
| Companionship | General population | General population |
| Emotional, Informational, and Instrumental Support | General population | General population |
| Social Isolation | General population | General population |

* Items that did not show differential item functioning (DIF) between the general population and cancer patients used the general population parameters. Items with DIF used cancer-based parameters. None of the items in the fatigue item bank showed DIF, so all of them used general population parameters.

** Qualitative components were conducted with a clinical sample, but data from the sexually active general population were used for parameter estimation.

| PROMIS Pediatric Bank/Scale | Calibration Sample | Centering Sample |
|---|---|---|
| Global Health | General population | General population |
| Emotional Distress – Anger | General population | General population |
| Emotional Distress – Anxiety | General population | General population |
| Emotional Distress – Depressive Symptoms | General population | General population |
| Cognitive Function | General population | General population |
| Life Satisfaction | General population | General population |
| Meaning and Purpose | General population | General population |
| Psychological Stress Experiences | General population | General population |
| Positive Affect | General population | General population |
| Fatigue | General population | General population |
| Pain – Behavior | Clinical sample (children with painful conditions) | Clinical sample (children with painful conditions) |
| Pain – Interference | General population | General population |
| Pain – Quality | Clinical sample (children with painful conditions) | Clinical sample (children with painful conditions) |
| Physical Function – Mobility | General population | General population |
| Physical Function – Upper Extremity | General population | General population |
| Physical Activity | General population | General population |
| Physical Stress Experience | General population | General population |
| Strength Impact | General population | General population |
| Asthma Impact | Clinical sample | Clinical sample |
| Peer Relationships | General population | General population |

| PROMIS Parent Proxy Bank/Scale | Calibration Sample | Centering Sample |
|---|---|---|
| Global Health | General population | General population |
| Emotional Distress – Anger | General population | General population |
| Emotional Distress – Anxiety | General population | General population |
| Emotional Distress – Depression | General population | General population |
| Cognitive Function | General population | General population |
| Life Satisfaction | General population | General population |
| Meaning and Purpose | General population | General population |
| Psychological Stress Experiences | General population | General population |
| Positive Affect | General population | General population |
| Fatigue | General population | General population |
| Pain – Behavior | Clinical sample (children with painful conditions) | Clinical sample (children with painful conditions) |
| Pain – Interference | General population | General population |
| Pain – Quality | Clinical sample (children with painful conditions) | Clinical sample (children with painful conditions) |
| Physical Function – Mobility | General population | General population |
| Physical Function – Upper Extremity | General population | General population |
| Physical Activity | General population | General population |
| Physical Stress Experience | General population | General population |
| Strength Impact | General population | General population |
| Asthma Impact | Clinical sample | Clinical sample |
| Peer Relationships | General population | General population |