Lesson Learned? The Scientific Measure of Student Learning Outcomes in Higher Education
It has come to my attention that my recent blog post below may be construed as critical of faculty. This was certainly not my intent, and I wish to clarify. Having studied the assessment systems of hundreds of colleges and universities over the past eleven years, I can safely say that faculty are indeed experts at classroom assessment. This is how it is (from my observations) and how it should be (from the institution’s perspective). I believe that no one would disagree. It was within this context of faculty autonomy that my post addressed very briefly the use of Chalk & Wire’s tools to perhaps extend assessment beyond the classroom walls, an alternative approach to the study of learning. I meant no disrespect to faculty, who labor long and hard to do right by students, colleagues, and the rest of us as potential beneficiaries of that learning.
Thelma Seyferth, Ph.D.
Back in 2010, we predicted that the trend in higher education assessment practices would be toward improving the scientific measurement of learning outcomes assessment data. Here, we look at how U.S. colleges and universities that use Chalk & Wire’s assessment technology have changed what they do, based on a growing understanding of learning outcomes-based assessment. We look at several aspects of valid measurement that have been attended to and a few of those that remain to be addressed, keeping in mind that as the number of schools and programs doing rigorous scientific inquiry of learning outcomes grows, so does the support network for those who are “not quite there yet”.
Ten years ago, few universities were using standards (or student learning outcomes, as they are now commonly called) as the basis for evaluating student work. Learning outcomes were then, and still are, a curricular matter embedded in syllabi as “course objectives”. Ingrained in faculty thought as the basis for coursework, whether stated or unstated, learning objectives materialize as task-specific lists of what students should know and be able to do at the end of the course, and are of little concern to anyone outside the classroom.
Students’ ability to navigate course requirements is measured through grades for individual tasks and ultimately the course itself. The development of skills is virtually ignored. Faculty know what is typical for students in their own courses, and comparing each new group of students to that norm is like falling off a log. The accuracy of the gradebook is never questioned. No one ever takes the time to ask, “Are you sure about this?” It is within this context of faculty autonomy that we look at the evolution of standards-based assessment and how it continues to trend, albeit at a glacial pace, toward valid scientific measurement practice.
As accrediting agencies began to require that departments and programs base their instruction on professional skills and dispositions, various sets of “professional” standards appeared. Exactly how to measure the knowledge and skills described in these new standards was, and continues to be, purposefully left unsaid in deference to faculty autonomy. “Surely,” everyone thought, “faculty expertise at designing course-related tasks and assigning grades for coursework will transfer easily to the evaluation of skill development in relation to new professional standards.”
Unfortunately, the almost ubiquitous reaction from faculty has been resistance to what they perceive as redundant assessment. Although more recently there have been administrators and faculty who “get it”, by and large, the experience from a support perspective has been very much like trying to help someone cross the street who doesn’t want to go.
Recent Positive Shifts Towards Learning Outcomes Measurement and Assessment for Learning Improvement
In reality, we have made significant inroads into helping faculty and administrators gain an understanding of some aspects of standards-based assessment. Four areas that have seen paradigm shifts toward valid scientific measurement are:
- inter-rater reliability,
- the use of assessor pools,
- irrelevant associations, and
- significance testing for learning growth.
Inter-rater reliability involves the comparison of scores given to a single work sample by multiple assessors. Reliability is expressed here as percent agreement, which answers the question, “To what percentage of scoring criteria did multiple assessors assign the same score?” Formal reliability testing carries its own sources of measurement error, including grading bias and the Hawthorne effect, which can be reduced by using informal rather than formal reliability tests.
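As a rough illustration of the percent agreement idea, the sketch below counts the scoring criteria on which every assessor gave the same score. The function name, the rubric scale, and the sample scores are hypothetical and are not drawn from Chalk & Wire's tools.

```python
def percent_agreement(ratings):
    """Return the percentage of criteria on which all assessors agree.

    ratings: one list of scores per assessor, with scores in the same
    criterion order for every assessor (an assumed input layout).
    """
    if len(ratings) < 2:
        raise ValueError("At least two assessors are required")
    n_criteria = len(ratings[0])
    if n_criteria == 0 or any(len(r) != n_criteria for r in ratings):
        raise ValueError("Every assessor must score the same non-empty set of criteria")
    # zip(*ratings) groups the scores criterion by criterion
    agreed = sum(1 for scores in zip(*ratings) if len(set(scores)) == 1)
    return 100.0 * agreed / n_criteria


# Hypothetical rubric scores (1-4) for one work sample with five criteria
assessor_1 = [3, 4, 2, 3, 4]
assessor_2 = [3, 4, 3, 3, 4]
print(f"{percent_agreement([assessor_1, assessor_2]):.0f}% agreement")
```

Exact-match agreement is the simplest choice; a program might instead count scores within one rubric level as agreeing, which would change only the comparison inside the sum.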
Use of Assessor Pools
To use informal reliability testing, the assessment process must be set up to send student submissions to more than one assessor, preferably at random via an assessor pool. In this way the assessment is completed as a regular part of the instructional process and is not an add-on, as in formal reliability testing. This practice reduces the Hawthorne effect that may be an issue if assessors know they are involved in a special reliability assessment. Using an assessor pool also eliminates the pressure on faculty to assign a grade for the work, which is a common source of bias when faculty score the work of their own students.
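One way to picture random routing through an assessor pool is the sketch below. It is only an illustration under stated assumptions: the function, its parameters, and the option to skip a student's own instructor are hypothetical and do not describe how Chalk & Wire implements pools.

```python
import random


def assign_assessors(submissions, pool, per_submission=2,
                     own_instructor=None, seed=None):
    """Randomly route each submission to several assessors from a pool.

    own_instructor (optional, hypothetical): maps a submission to the
    instructor who teaches that student, so they can be left out.
    """
    rng = random.Random(seed)  # a seed makes the draw repeatable
    own_instructor = own_instructor or {}
    assignments = {}
    for submission in submissions:
        eligible = [a for a in pool if a != own_instructor.get(submission)]
        if len(eligible) < per_submission:
            raise ValueError(f"Not enough eligible assessors for {submission}")
        # sample() draws without replacement, so assessors are distinct
        assignments[submission] = rng.sample(eligible, per_submission)
    return assignments


# Hypothetical pool and submissions
pool = ["Assessor A", "Assessor B", "Assessor C", "Assessor D"]
routing = assign_assessors(
    ["essay-1", "essay-2", "essay-3"],
    pool,
    own_instructor={"essay-1": "Assessor A"},
)
for submission, assessors in routing.items():
    print(submission, "->", assessors)
```

Because every submission already has more than one score, percent agreement can be computed from routine assessment data without staging a separate reliability exercise.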
The irrelevant association of scoring criteria to learning outcomes was an issue a decade ago, as systems were set up with every scoring criterion in multiple rubrics linked to every outcome in an outcome set. The resultant “big bucket” phenomenon made it impossible to distinguish specific skills, as all the scores were reported as being associated with all the outcomes. Thankfully, this issue is currently addressed early in Chalk & Wire’s discovery and pre-implementation training for administrators and has all but been eliminated. Lesson learned.
Scientific Measurement and Assessment of Learning – Progress Over Time, Regression Coefficients and Correlation Matrices
Administrators and faculty appear to be developing a much deeper understanding of the measurement of learning progress-over-time (POT). This is set up as pre- and post-assessments, followed by tests for the significance of any change in one or more score distributions. POT reports are most useful when used to examine generic skills across programs (e.g., student writing skills across content areas). The broader the scope, the better. Frequency effect size can be calculated from chi-square analyses and is often used in accreditation reporting as a measure of instructional impact. Another lesson learned.
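To make the arithmetic concrete, here is a minimal sketch that computes a chi-square statistic and an effect size from pre- and post-assessment score frequencies. Several things here are assumptions: the post does not say which effect size statistic is reported, so Cramér's V is used as one common choice; the pre and post counts are treated as two independent groups; and the counts are invented for illustration.

```python
import math


def chi_square_and_cramers_v(table):
    """Chi-square statistic and Cramér's V for a table of frequencies.

    table: list of rows, e.g. [pre_counts, post_counts], where each row
    holds the number of students at each rubric score level.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("Every row and every score level needs at least one count")
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Cramér's V rescales chi-square by sample size and table dimensions
    k = min(len(table), len(table[0])) - 1
    cramers_v = math.sqrt(chi2 / (n * k))
    return chi2, cramers_v


# Hypothetical counts of students at rubric levels 1-4
pre_counts = [12, 18, 8, 2]
post_counts = [4, 10, 16, 10]
chi2, v = chi_square_and_cramers_v([pre_counts, post_counts])
print(f"chi-square = {chi2:.2f}, Cramér's V = {v:.2f}")
```

The sketch stops at the statistic and the effect size. A significance test also needs a p-value from the chi-square distribution, which a statistics package such as SciPy (`scipy.stats.chi2_contingency`) can supply.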
There remain other dimensions of valid assessment that are of greater importance when working in a more evolved system. For example, correlation matrices can be used to show the internal consistency of scoring criteria, an indication that the scores are consistent and therefore more likely to be accurate. Analysis of latent traits using regression coefficients is useful when you have developed your own student learning objectives and need to show that you have successfully identified multiple contextual factors that may influence student assessment data.
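The correlation matrix mentioned above can be sketched as follows, using Pearson correlations between every pair of scoring criteria. The criterion names and scores are hypothetical, and Pearson's r is an assumption; an ordinal measure such as Spearman's rank correlation may suit rubric scores better.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("Correlation is undefined when all scores are identical")
    return cov / (spread_x * spread_y)


def correlation_matrix(scores_by_criterion):
    """Nested dict of correlations between every pair of criteria."""
    names = list(scores_by_criterion)
    return {
        a: {b: pearson(scores_by_criterion[a], scores_by_criterion[b])
            for b in names}
        for a in names
    }


# Hypothetical rubric scores for six students on three criteria
scores = {
    "thesis": [2, 3, 3, 4, 1, 4],
    "evidence": [2, 3, 4, 4, 1, 3],
    "mechanics": [3, 2, 3, 4, 2, 4],
}
matrix = correlation_matrix(scores)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Criteria that measure related aspects of the same skill would be expected to correlate positively; a criterion that correlates weakly or negatively with the rest is a candidate for review.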
We’ve come a long way in our journey toward the rigorous scientific measurement of student learning outcomes. Although we still have some way to go, when we started, we weren’t even sure we would get this far. Lesson learned.