Steve Haley: Good afternoon, or good morning or good evening, wherever you are; it’s early afternoon here in Boston. It’s really quite a pleasure to be part of this initial web cast with our research and training center. Today I want to spend a little time giving you an introduction to the traditions, the possible advances in technology, and the bugaboo of all clinicians, the amount of time it takes to collect data, with respect to our work here at the research and training center.
Slide 2: Now the background information for this particular talk can be seen in 2 excellent presentations that are part of the SOS meeting. Dr. Lisa Iezzoni presented a keynote on setting the stage: outcomes research in post acute care, and Dr. John Ware presented a very stimulating talk on improving the utility of outcome measures. So I would suggest that if you have not seen those 2 videotapes, please do so; there’s also a text download at the website that’s on your screen. Now let me summarize for those of you who have not seen those presentations.
Slide 3: Dr. Iezzoni suggested that the current system of post acute care is really made up of a number of different and very independent silos or outcomes of care. However, we need a better info structure to examine the quality across post acute care settings, and that is not currently in place as we speak.
Slide 4: Dr. John Ware spoke on improving the utility of outcome measures. Dr. Ware presented his work on using more modern psychometric techniques, particularly item response theory, and providing the basis for making instruments more useful to clinicians and patients. He also suggested that we need to link items together from multiple instruments so that we can begin to develop, as he calls them, item pools. These are large banks of items from existing instruments as well as new items that will help standardize our assessment of key outcomes in rehab. And of course, for those of you who know, Dr. Ware suggested that computer adaptive testing, or CAT as I will call it, is a trend for the future in health status and functional assessment. CAT allows a test to be tailored to an individual so that, rather than using all items of a fixed form, a computer can help decide which items are most relevant for that individual based on previous responses of that individual. A CAT asks questions in much the same manner as an experienced clinician who avoids asking questions that are either irrelevant or redundant. In CAT programs there’s no need to administer all items to each individual, thus saving a lot of clinical time. This can provide a more practical means of collecting data. For those of you who would like to know more about CAT programs, I do advise you to look at Dr. Ware’s presentation.
Slide 5: What I want to do today is to talk briefly about 3 themes in rehab outcomes that we have tried to address in our research and training center: tradition, technology and time. There is a tradition in our outcomes that we use a variety of instruments, and many of these are setting or disease group specific, with very little standardization or ability to compare them. Our technology has been quite limited in collecting data. There are new advances both in psychometrics and in computer interfaces that we believe are going to change this. And of course time is a practical barrier, and it is an impetus to make changes in the way in which we collect rehab outcomes data.
Slide 6: Again about tradition, many of our measures are setting specific; they’re not directly comparable. Examples are the Functional Independence Measure (FIM), which is used in inpatient care, and the OASIS, which is used in home care. There are many similar types of items, but they’re different enough that scores can’t be easily compared across these instruments. There’s also a wide range of disease-specific instruments that are used and again, although the constructs are quite similar and there are many opportunities to try to look at data across these different instruments, it’s difficult to compare them.
Slide 7: The technology that we currently use, for the most part in most settings, is an old technology. It’s really paper/pencil or clinician-based report forms. They’re what we call fixed-form instruments. Now I have one here with fifteen-plus items, sometimes it’s many more, but they’re called fixed forms because every patient, every person gets the same number of items and the very same items. This can create a lack of comprehensiveness in the content that is measured, because we try to make these as short as possible and therefore we compromise on the comprehensiveness and breadth. There may be content domains of the greatest importance to the persons we serve that are missing. We often present items that are not relevant to individuals, and many of these instruments are limited to one setting or one patient group.
Slide 8: And time, practicality. Clinicians and patients have limited time for assessment. Long instruments are quite impractical now in the health care arena and have been discarded, and very short forms have been adopted even though there are severe limitations that are clearly recognized with these.
Slide 9: So we have proposed and begun work on what we call a new generation of outcome measures in rehab. Now we are building a family of these and I’m going to spend most of my time today, because it is limited, on one of them called the Activity Measure for Post Acute Care; we’ll call it the AM-PAC for short. The AM-PAC uses a strategy to develop what we call item pools, large sets of items that we can spread along the continuum, and many of these item pools can be 100 items or even more. Once we develop an item pool for a particular functional domain, we can develop short forms, which are fixed forms that are intermediate, perhaps, to the CAT. For those sites that can’t use a computer in their point-of-care service, short forms developed from an item pool can be preferable, and of course the CAT is developed directly from the item pool.
Slide 10: Now to give you some examples of our work, the Activity Measure for Post Acute Care has 3 domains: physical and mobility, personal care and instrumental, and applied cognition. The physical and mobility items include gross motor movements such as bending and moving, and also mobility items such as walking, running and use of a wheelchair. Because we’re not really limited by the number of items, we can have a broad range of content that we can put into these item pools. The personal care and instrumental domain includes items that focus on arm and hand activities: meal preparation, dressing and grooming, and using instruments around the house. Applied cognition involves items such as remembering a list of errands, money management, use of a telephone, explaining things to others and following instructions. So we believe that these 3 domains are important activity domains for assessment of rehab services.
Slide 11: Again by way of example, I have here on the screen a simulated ruler where there’s a series of items going from fairly easy to hard, moving from bottom to top. Now the particular item pool for physical movement activity is currently at 145 items. Now no one person would ever be administered all of those items, either on the short form or in the CAT model, but these items are available for use depending upon the ability level of that particular person.
Slide 12: Again, as I mentioned, we have developed some short forms and I provide an example here. These again are for sites that are not quite ready to do CAT; they are an alternative. We have a hospital short form and a community short form, so we try to tailor the items more to the settings rather than to the individuals. The 10 items that are part of the hospital short form are items that are most likely to be relevant to most persons who are in a hospital or a nursing facility. On the other hand, the 10 items on the community short form are more likely to be useful for assessment of patients who have returned home. Now what’s important here is that these items are linked together, so if a person goes from hospital to community and there are assessments with these items, the scores would be linked together on the same metric and we don’t have to switch instruments as we go across settings.
Slide 13: A CAT, that’s something even better. Instead of using a fixed form for everybody, it tailors items in a relatively narrow range of functioning in order to establish a precise estimate of where that person is along the continuum. In this particular example, the questions that are asked of an individual are all within or near the range at which the person is actually functioning. The computer, again based on the responses of the person, can begin to center in on exactly where that individual is in terms of functioning. This is a much more efficient and a much more precise way of estimating a person’s position along this continuum of items.
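Illustrative sketch (not from the talk): the tailoring described above can be imitated in a few lines of Python. The sketch assumes a one-parameter (Rasch) model, which the speaker later says the group has relied on, but the item names, the difficulty values, the grid-based scoring with a standard normal prior, and all function names are invented for illustration and are not the AM-PAC software.

```python
import math

# Hypothetical item pool: names echo the talk, difficulty values are invented.
POOL = {
    "rolling over in bed": -3.0,
    "sitting up": -2.0,
    "standing": -1.0,
    "walking indoors": 0.0,
    "climbing stairs": 1.0,
    "walking a mile": 2.0,
    "running": 3.0,
}

GRID = [g / 10.0 for g in range(-40, 41)]  # candidate ability values, -4.0 to 4.0


def p_able(theta, difficulty):
    """Rasch probability that a person at ability theta can do the item."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def estimate_ability(responses):
    """Posterior-mean ability over GRID with a standard normal prior.

    responses is a list of (difficulty, able) pairs, able being True or False.
    """
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)
        for difficulty, able in responses:
            p = p_able(theta, difficulty)
            w *= p if able else 1.0 - p
        weights.append(w)
    return sum(t * w for t, w in zip(GRID, weights)) / sum(weights)


def next_item(theta, pool, used):
    """Unused item whose difficulty is closest to the current estimate."""
    remaining = [name for name in pool if name not in used]
    return min(remaining, key=lambda name: abs(pool[name] - theta))


def run_cat(pool, answer, max_items=4):
    """Ask up to max_items tailored questions; answer(name) returns True or False."""
    theta, used, responses = 0.0, [], []
    for _ in range(min(max_items, len(pool))):
        name = next_item(theta, pool, used)
        used.append(name)
        responses.append((pool[name], answer(name)))
        theta = estimate_ability(responses)
    return theta, used


# Simulated respondent who can do every item easier than 1.5 on this scale.
score, asked = run_cat(POOL, lambda name: POOL[name] < 1.5)
print(asked, round(score, 2))
```

Under the Rasch model an item is most informative when its difficulty sits at the person’s current estimate, which is why the sketch simply picks the nearest unused item. A production CAT would use calibrated item parameters and typically a more refined scoring and selection rule.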
Slide 14: Now our experience with CAT to date is as follows. We have found that no more than 5-15 so-called tailored items are needed in order to estimate a score, and in no case more than 15% of an item pool. Let’s say an item pool is 100 items: 15 items max are needed per individual. Even for people who vary in their ability levels, no more than 15% is needed per individual in order to estimate their ability. We have found that CAT scores are very precise, almost as precise as if the person took all the items in the item pool, and CAT administrations can be completed, at least in our case, in an average of 2 minutes or less per individual. This is a tremendous time savings within the clinical environment.
Slide 15: This is a simple scatter plot of our AM-PAC physical functioning score. On the X-axis we have the score that was derived from the CAT, and in this case only the best 4-6 items were administered for each individual; on the Y-axis is the total score. Now the correlation here is .95, and of course a correlation of 1.0 would be the best you can do. This indicates to us that the CAT scores, even on 4-6 items, can be quite accurate. We have examined these correlations using 10 items or even 15, and we find them to approach one.
Slide 16: So let me summarize the 3 T’s for rehab outcomes. The tradition has been to use setting- and disease-specific instruments. These can now be synthesized into common item pools. We can begin to use more modern psychometric techniques and computer interfaces that will allow us to more clearly tailor instruments to individuals. And of course the big bonus, I would argue, is that we can have less patient and clinician burden with very little loss in accuracy and precision.
Slide 17: Now there are many more steps to take in this work. We certainly need to do more feasibility testing in different clinical settings and with different kinds of computer interfaces: web-based vs. laptop-based vs. PDA-based assessments. We’re currently doing some validation work looking at minimal important clinical differences with CAT, and more validation work needs to be done in terms of predictions and other uses of outcome measures. And we encourage you to get in touch with us, as we are constantly looking for additional clinical partners to use these systems and to provide us feedback.
Slide 18: So what does this mean for the future? Well, we believe it means more efficient outcomes systems; certainly a broader range of content can be provided to any one individual, with more relevant outcome domains for that person. We see this as opening up the ability to track patient outcomes across settings, because we can have enough content to spread all the way from early recovery in a facility to community reintegration and return to normal life; that is not possible with current fixed forms. And we see expanded use of outcome measures in the field because they’re more feasible and easier to use.
Slide 19: Now we have a number of publications that are on our website. I’m gonna just briefly mention a few of these for you. We have recently developed a couple of conceptual papers in a special supplement of Medical Care that came out in January of 2004, and those are available. And there’s also another article that’s going to be coming out in a few months in Archives of Physical Medicine and Rehabilitation.
Slide 20: We have also written some articles on how one would approach measuring rehab outcomes across settings, and those are published; there’s one that’s coming up in Topics in Stroke Rehabilitation that’s currently in press.
Slide 21: And there are 2 articles coming up, we think within a few months in the Archives of Physical Medicine and Rehabilitation on the development of our short form activity measures as well as our CAT measure and comparing it to the short forms. So I encourage you to look for those articles as they come out.
Slide 22: And we will also have them available on our CRE website. We also have a couple of emails one for the CRE and one for the research and training center. If you can’t find something on our website we’d be happy to guide you towards it.
Slide 23: Before I finish my formal remarks, I would like to express my appreciation to
our local Boston clinical network. The kind of work we’re doing is just simply not
possible without collaborators who are in clinical sites who are very committed to
research and are very committed to improving the way in which we measure outcomes.
These include: The Boston Medical Center, Spaulding Rehabilitation Hospital Network,
The Jewish Memorial Hospital and Rehabilitation Center, Healthsouth New England
Rehabilitation as well as Healthsouth Braintree, St. Joseph’s Hospital in Nashua, New England Baptist Hospital and Northeast Rehabilitation Health Network.
Slide 24: Again we express our appreciation to our funding source for the research and training center, the NIDRR, the National Institute on Disability and Rehabilitation Research. And we have also received supplementary funding from the National Center for Medical Rehabilitation Research as part of NIH.
Slide 25: So I thank you for your attention and I would like to remind you that you can
send email questions to firstname.lastname@example.org. Thank you.
Alan Jette: Thank you Steve. It will take a few minutes to begin to compile questions that
you may have of our speaker, Steve Haley. I would welcome you to write them now and
send them to email@example.com and then what we will do is we will have our colleague
Mary Slavin read the question and then Steve will answer them. I have a couple to get
started with, but before I ask the first question I do want to acknowledge Dr. Mary Slavin,
who will be asking the questions. Mary has organized, planned, and designed all the
web casts, today’s and the subsequent ones. She is the Director of our training and
dissemination core in our Rehabilitation Research and Training Center, and I want to
thank you, Mary, for your outstanding organizational job.
Let me start with a question that comes up a lot when we talk to clinical groups about not only CAT instruments, but outcome measures in general. What happens, Steve,
in many clinical settings when you have patients who are either not cognitively
competent or not able, for other reasons, to respond to these kinds of questions? Does that
mean that they get deleted from these types of assessments? What can be done?
Steve Haley: Thank you for that question. The issue of proxy, using either clinicians or
family, is a real important issue. And it’s an important issue whether we’re using fixed
forms or whether we’re using computer-adaptive testing. We have begun to do some
work to compare how people respond to questions, both clinicians, family members, and
patients, to get a better understanding of the possible differences, or errors, as you might
call them, among these three different respondents. There’s been a lot of literature in
many areas, gerontology and others, suggesting that there can be some systematic biases
across the different respondents. I don’t think the CAT or the work we’re doing is going
to make a huge difference with respect to how these errors are analyzed or to what extent
they actually exist. Some of our work has suggested that the general summary scores are
pretty close. Pat Andres has published an article in the American Journal of Physical
Medicine and Rehabilitation, suggesting that summary scores of patients and clinicians
are correlated about .8, in some cases .9. So there are some errors, but the CAT system is
not going to be any different than a fixed form and we’re going to have to understand
these differences across respondents and either ultimately adjust for them or tolerate them within some of our models.
Alan Jette: The second question that I have, Steve, is, if someone out there listening and viewing this web cast wants to get access to one of our CATs to begin using the CAT, how does he or she go about doing that?
Steve Haley: Well, the CAT systems are really in prototype form currently, so they’re not quite available yet for use. We would have some interest in speaking with individuals who would like to begin working with us and testing them and doing some validity studies. But I think we’re probably a year or two down the road from them being actually available and on the market. We will probably, before the CATs are ready, have available our short forms. There’s a publication coming out, as I mentioned, in just a month or so in the Archives of Physical Medicine and Rehabilitation. Once those are peer reviewed and out, we do expect there to be some interest in the short forms, and please contact myself, Mary Slavin, or our e-mail that I have given in order to find out how you can access those.
Alan Jette: Ok, we’re ready. We’re going to switch it over to Mary now.
Mary Slavin: Steve, a question from the audience. Sue Morrow from Blithedale Children’s Hospital wants to know: “Can this be used with the pediatric population?”
Steve Haley: Well, Sue, yes, it can be. Although our research and training center has not focused specifically on pediatrics, we have had a number of other studies to examine a pediatric CAT. We have actually developed two. One is based on the PEDI; we call it the PEDI-CAT. It has undergone some testing and we are, hopefully, about ready to embark on a further study of it. And we have also developed a CAT-like product for Pompe disease. Let me just mention why this is so important in pediatrics. With children, many sites want to examine outcomes from infancy all the way to 14, sometimes 18, and sometimes even 21 years of age. So the breadth of content needed for that is tremendous. No fixed form could ever approach a precise estimate of scoring with that kind of content breadth. So the CATs are really very, very useful because, no matter the chronological age, however the child or family member responds to questions will identify the area in which that child is functioning. So we think that the CATs can be really very, very helpful in pediatric settings. It’s an ideal place to do work in this area because it’s an area in which there is no one assessment that one can use. You start with a 0-3 assessment, then you have to go to a preschool assessment, then you often have to go to performance tests for a child. It’s very difficult to compare data across patient groups, particularly when you have a particular condition which has a variety of problems. The reason that we have moved in this direction for rare diseases and lysosomal storage diseases like Pompe disease is that there can be different onsets. The infant onset is very severe; the later childhood onset is much less severe. So we have children who have difficulty with head control that we’re trying to measure, versus children who are skiing: a tremendous gap, and a tremendous breadth of content that is needed. One can do that by developing item pools and being able to tailor the instrument to that child. So we think there is a large role for CATs to be developed in pediatric care.
Mary: We have a question from Sean Eaton at Ann Arbor Rehab and he wants to know: “What is the form of the response for the various domains? What are the rating scales?
Can you give us specific details?”
Steve: OK, well, the details will come out in some of the publications, but I can briefly give you some examples. Again, one of the advantages of some of these new
psychometric models is that you can have multiple response sets within these item pools. So for some items, for instance, we are looking at difficulty and we have a four-point scale that seems to work quite well. In other circumstances, we are looking at the amount of help a person needs, and again, we have a four- or five-point scale, depending upon the item. Now those two can be merged into a single continuum. Now you have to test and make sure that there’s good item fit and it makes sense to put those together. But in our
experience it has. So for certain items early on in the recovery it is very important to look at help versus difficulty; they can be combined, and they can make a very rich and meaningful item pool. So there are a lot of options with respect to response sets, and you don’t have to have the same one for all the items. You use the one that’s most relevant for that particular item. We have focused, I would say, more on difficulty once a person gets into the community, because for the kinds of questions we ask there, how difficult is it to, for instance, walk a mile, it doesn’t make a lot of sense to ask about assistance. On the other hand, in earlier recovery, for being able to get in and out of bed, assistance might be a very important question, so these models are robust enough to handle different types of response categories.
Mary: Steve, we have another question from Sachiko Komagata, at Temple University. And the question is: “The three domains described, physical mobility, personal care, and applied cognition, seem most closely related with function. Do you have any plans to measure social and disability level of outcomes?”
Steve: They are watching the Web cast, aren’t they? Well, as I mentioned, the AM-PAC,
the Activity Measure for Post-Acute Care, is one of a family of measures. We also have a companion measure, called the PM-PAC, the Participation Measure for Post-Acute Care, which does exactly what the questioner suggests, and that gets at what we call participation and others may call disability. The work has been spearheaded by Dr. John Ware and Barbara Gandek at the Health Assessment Lab, and let me give you a rundown of some of the domains that are part of that tool. They have role functioning; limitations in work or regular daily activities; community, civic, and social activities; information exchange; participation in home life; economic life; and social relations. It’s very much related to many of the domains that are part of the International Classification of Functioning, Disability and Health. In their work, although it’s in the preliminary stage of analysis right now, there are two components, maybe more, but I think two main ones that they are going to be able to center in on. One is community participation, which is limited to the kinds of activities that are clearly outside of the home, and the other is participation at home, which includes managing the home environment, reading, having
visitors in home, information exchange, either by e-mail or telephone, etc. So it appears
there are two major factors and there are many subfactors within participation. Now those
are being developed currently and are also going to be put into both, well they have
already been put into a short form format, but they will also be put into a CAT format in
the future as well. So stay tuned, that’s coming up.
Mary: We have a question from Nancy Baker at the University of Pittsburgh, and the
question is, “How would these different scores be used in research to compare and
contrast treatment effects? If each patient is doing a different set of questions, how are
these combined to look at change related to treatment?”
Steve: Nancy, hello, nice to hear from you. These scores are really based on the same
metric. And I know that at first it seems weird that you can take different items and
compare them, but you can. The reason is because we know the relationship of all these
items, and they are all on the same so-called ruler or continuum. And so these computer-
adaptive programs will estimate where a person is along this continuum with even one
item. And of course, the more items there are, the more accurate and the more it’s going
to approximate the full item pool. And this type of technology is very well worked out.
It’s been used in education for many years. People who recently have taken the Graduate Record Exam will know they no longer have booklets to fill out; they go to a computer.
And the computer quickly identifies your ability level in mathematics or reading or
whatever and will only give you items that are most relevant to the ability level that’s being estimated. You are all scored on the same scale, people are all compared on the same metric
and that’s going to be the same way it’s done in health care as well. So items are used
that are most relevant for that person to score an individual again on the same metric. So
it is possible, it’s been done for many years, we are just now waking up in healthcare and
applying it to the outcome measures that we use.
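Illustrative sketch (not from the talk): the point about different items yielding scores on one metric can be shown with a small Python example. It assumes a one-parameter (Rasch) model and two made-up item sets whose difficulties are expressed on a single calibrated scale; the difficulty values, the prior, and the function names are assumptions for illustration only.

```python
import math

GRID = [g / 10.0 for g in range(-40, 41)]  # candidate ability values, -4.0 to 4.0


def p_able(theta, difficulty):
    """Rasch probability that a person at ability theta can do the item."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def estimate_ability(responses):
    """Posterior-mean ability over GRID with a standard normal prior."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)
        for difficulty, able in responses:
            p = p_able(theta, difficulty)
            w *= p if able else 1.0 - p
        weights.append(w)
    return sum(t * w for t, w in zip(GRID, weights)) / sum(weights)


# Two hypothetical item sets from one calibrated pool (difficulties invented).
FORM_A = [-1.5, -0.5, 0.5, 1.5]
FORM_B = [-1.0, 0.0, 1.0, 2.0]


def simulate(form, ability):
    """Deterministic respondent: able to do every item easier than `ability`."""
    return [(difficulty, difficulty < ability) for difficulty in form]


score_a = estimate_ability(simulate(FORM_A, 0.7))
score_b = estimate_ability(simulate(FORM_B, 0.7))
print(round(score_a, 2), round(score_b, 2))
```

Both scores land on the same ability scale even though no item is shared, because every item’s difficulty was placed on that scale beforehand. With only four items each, the two estimates are close but not identical, which matches the remark that more items bring the estimate closer to the full item pool.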
Mary: We have a question from Gary Bedell at Tufts University. The question is: “Have
you examined the test retest reliability for the CAT?”
Steve: That is part of the validation work that continues to be needed. We have done
some preliminary work on our short forms in looking at test re-test, but we have not fully
implemented the CAT into a large scale assessment or validation so that’s yet to be done.
We do expect, however, that even if people get slightly different patterns of items
within a short interval of time, they will still get a similar estimate of performance.
That work needs to be done, and Gary, you can help
us if you want.
Mary: Another question from Gary. “Have you examined responsiveness related to
Steve: We have done it in a pediatric sample. And it’s been done by way of simulation, which means that we take a full data set that we already have from a fixed form, with multiple assessments, and we develop what we call simulations: we develop CAT scores based on the best items for each individual. And in the PEDI work that we have done so far, we have found a very, very small drop in responsiveness from a full item set to a CAT. We use these relative ratios, and if the standard is one, for instance, for using the full item set, the responsiveness drops down to a fraction, .95, in some cases .98. We see this in discriminant validity as well. So at least in simulated work based on full item sets, to date we have found a very small drop in the ability to look at change over time. Now we need to do that work, and we’re going to be doing some of that work in an R01 that has been funded. We’re in year 2 of that, and in year 3 we do have a responsiveness test that will be prospective using the AM-PAC and PM-PAC, so those results will be forthcoming.
Mary: We have another question this is from University of Pennsylvania: “Dr. Haley, some outcomes are not skills that are development based, for example standing before walking, how would CAT apply to measuring participation in life activities such as engaging in preferred activities or socially engaging with desired companions?”
Steve: Well, the participation areas are probably going to be much more difficult to model in some of these hierarchies, and it remains to be seen how well that is done. The psychometric models are pretty robust, though, and although we have relied primarily on a one-parameter model to date, there are 2- and 3-parameter models that can also be brought forward to better look at this data and develop better hierarchies that work within CAT programs. So I think there will be a lot of work with the multiple-parameter models in trying to understand and estimate CAT scores, particularly on participation. Now let me say this: there are some people who are so scattered in terms of their item responses along this continuum that you can’t really estimate their scores easily. It is not very many people, but sometimes this occurs, and we can set these CAT programs to go maybe 15-20 items and then simply stop and say we don’t have any idea where this person’s going to score; give them the full item set. There may be individuals like that, and it becomes very difficult to estimate their performance. In most of our work so far that has not been the case, but we do anticipate that happening, and we can set these computer adaptive testing models to stop when convergence doesn’t come after either a certain number of items or a certain amount of time. So the participation area will be harder, but we’re confident we can develop a good CAT program in this area.
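Illustrative sketch (not from the talk): a stopping rule of the kind described, stop when the score is estimated well enough or give up after a fixed number of items, might look like the following in Python. The sketch treats "convergence" as the posterior standard deviation falling below a target under a Rasch model with a standard normal prior; that definition, the thresholds, and the function names are assumptions, not the rule used in the speaker’s CAT programs.

```python
import math

GRID = [g / 10.0 for g in range(-40, 41)]  # candidate ability values, -4.0 to 4.0


def posterior_mean_sd(responses):
    """Posterior mean and standard deviation of ability over GRID.

    responses is a list of (difficulty, able) pairs under a Rasch model.
    """
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # standard normal prior
        for difficulty, able in responses:
            p = 1.0 / (1.0 + math.exp(-(theta - difficulty)))
            w *= p if able else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum(w * (t - mean) ** 2 for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def stopping_decision(responses, target_sd=0.7, max_items=15):
    """Decide whether the CAT should keep asking questions."""
    _, sd = posterior_mean_sd(responses)
    if sd <= target_sd:
        return "stop: score estimated"
    if len(responses) >= max_items:
        return "stop: no convergence, give the full item set"
    return "continue"


# Ten well-targeted items narrow the estimate; fifteen far-too-easy items do not.
well_targeted = [(0.0, True), (0.0, False)] * 5
too_easy = [(-4.0, True)] * 15
print(stopping_decision(well_targeted), "|", stopping_decision(too_easy))
```

A time limit, which the speaker also mentions, could be added as a third condition in the same function. A rule aimed specifically at scattered response patterns would likely also need a person-fit check, which this sketch does not attempt.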
Mary: We have another question from Shawn Eaton at Ann Arbor Rehab. “To what extent is the CAT instrument useful in setting accreditation standards for outcomes measurements or program evaluations?”
Steve: Well, it hasn’t been used yet in those arenas, but I anticipate it would have as good or better utility in program evaluation or in accreditation. Of course accreditation is trickier because, depending upon what system you’re in, one has to have a data set that’s able to be benchmarked, and I think we’re not very close to that at all. We believe that in program evaluation applications CATs will be very important, because people are much more likely to use these routinely, with less patient or clinician burden. So I think that will be one of the first applications, as well as research, and then perhaps down the line there will be development of systems that will also meet accreditation requirements.
Mary: We have another question from Chandresh Mehta at the VA System. The question is: “They’re very curious about what the CAT is going to look like and what kind of format it will take so that they can figure out how they might be able to use it.”
Steve: Well, it won’t be like the picture of the CAT on the screen earlier. The CAT will have multiple interfaces. It can be web-based, and there are CATs on the web right now that will show you exactly what it looks like; it could be on a computer screen, or in some cases it could be on a Palm Pilot type of device. Most of the CATs work this way. You start out with a general question that helps identify where a person is on a particular domain. Some people start at the midpoint, other people start in different places. Once a response is given, there is an initial estimate of ability based on that one item. Now what the user would see is not the calculations (that’s going on in the background); they would see the next question, because the CAT algorithm will search for the next best question in order to converge on an estimate. That question will then come up on the screen, whether it’s the Internet or the computer screen, and the person is asked and answers that question; then in the background there will be computations, a new question will be selected, and a response will be made. We can have touch-sensitive screens, there could be input from a keyboard; there will be a lot of options, and so it will look like questions coming up on the screen based on previous responses. Then at some point in time, depending upon how the CAT was set, the program will stop and there will be an estimate of a score that will be shown on the screen; it can be graphed, and you can print it out. In pediatrics we have begun to explore models of graphing it in a growth curve type of format, like a head circumference format. There are a lot of different ways in which the user can interact with the system. Data can then be sent to a central server, or it could be put into a database and used for comparison in the future. Or there could be a menu of concepts that would then become available, so that if you have measured physical functioning you could move to social functioning or applied cognition or some other aspect of your assessment; I think there could be a lot of choices. The other thing that I’ll mention about CAT that I think is quite unique and very important in rehabilitation is that we can filter items out for particular persons. For instance, if an individual answers a screening question that they use a wheelchair and they do not ambulate, then none of the ambulation items need to come up for that individual; they’re filtered out, taken away. On the other hand, if a person does not use a wheelchair and clearly is ambulatory, then the wheelchair questions can be filtered out. For community functioning there are many grooming and ADL tasks that we have found to be useful for the item pool but only relevant for certain genders; for instance, shaving the face is an important grooming item for men, but it’s not likely to come up for women. So as long as the gender is identified, certain items can be filtered out. This can be done for conditions, it can be condition specific, it can be setting specific; there are certain types of items that are much more relevant to give in a hospital setting than in a community setting. So some initial screening questions can be asked on the CAT, which could then filter out questions that are irrelevant as an initial start, and that will make, I think, assessment much more meaningful to certain individuals.
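Illustrative sketch (not from the talk): the screening-based filtering described above amounts to tagging each item with the screening answers that make it irrelevant and dropping those items before the CAT starts. The item wording, the flag names, and the `filter_items` helper below are all hypothetical.

```python
# Hypothetical items and flags; a real item pool would carry its own metadata.
ITEMS = [
    {"text": "Walking a mile", "skip_if": {"does_not_walk"}},
    {"text": "Climbing a flight of stairs", "skip_if": {"does_not_walk"}},
    {"text": "Propelling a wheelchair 50 feet", "skip_if": {"no_wheelchair"}},
    {"text": "Shaving the face", "skip_if": {"does_not_shave_face"}},
    {"text": "Getting in and out of bed", "skip_if": set()},
]


def filter_items(items, flags):
    """Keep items whose skip_if set shares no flag with the screening flags."""
    return [item["text"] for item in items if not (item["skip_if"] & flags)]


# Screening answers are represented as a set of flags for each person.
wheelchair_user = filter_items(ITEMS, {"does_not_walk"})
walker = filter_items(ITEMS, {"no_wheelchair", "does_not_shave_face"})
print(wheelchair_user)
print(walker)
```

The same tagging approach would extend to condition-specific or setting-specific items by adding further flags, such as a hospital versus community setting flag.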