The original data were modified as follows. The essays were put through an anonymization process that replaced references to persons, places, and locations with tokens. For example, if a student wrote that “her name was Jada, attended school at Jackson High School in Canton, OH”, this would be replaced with “her name was @person, attended @place high school in @location”. A few essays were eliminated because the anonymization process could not scrub the text sufficiently to remove information that might lead to the identification of an individual writer.
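The token replacement described above can be illustrated with a minimal sketch. The source does not describe how entities were actually detected, so the lookup table, the function name `anonymize`, and the whole-word matching rule below are assumptions made purely for illustration; they are not the procedure used in the study.

```python
import re


def anonymize(text, entities):
    """Replace each known entity string in `text` with its token.

    `entities` is a hypothetical mapping from surface strings to tokens,
    e.g. {"Jada": "@person"}. How entities were really identified is not
    stated in the source.
    """
    if not entities:
        return text
    # Try longer strings first so "Canton, OH" wins over a shorter overlap.
    ordered = sorted(entities, key=len, reverse=True)
    alternation = "|".join(re.escape(surface) for surface in ordered)
    # Lookarounds keep "Jackson" from matching inside "Jacksonville".
    pattern = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")
    return pattern.sub(lambda match: entities[match.group(0)], text)


if __name__ == "__main__":
    lookup = {"Jada": "@person", "Jackson": "@place", "Canton, OH": "@location"}
    print(anonymize("her name was Jada, attended Jackson High School in Canton, OH", lookup))
```

A fixed lookup table like this one would miss any name not listed in advance, which is one way an essay could remain insufficiently scrubbed.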
To assess the suitability of the anonymized data for evaluating automated essay scoring systems, a small internal study was completed with the LightSIDE engine to determine how much the results might differ between the original and anonymized data. LightSIDE is an open-source scoring engine developed at Carnegie Mellon University and was included along with the commercial vendors in the first study. In that study the engine demonstrated high agreement with human ratings, although it had no NLP capabilities. The analysis was performed because it was suspected that the anonymized data might be harder to model than the original data, since it would contain less specific information. However, the LightSIDE model showed only a slight drop in quadratic weighted kappa across the data sets, from .763 to .759, which, based on a t-test, was not statistically significant (p = .15). While the data anonymization process therefore seems not to have substantially impeded the ability of machine-learning based systems to model human scores on this data set, it may have made it more difficult for participants to develop features related to deeper aspects of writing ability. Since content words were replaced with meaningless symbols in the process, the grammatical structures and meaning relationships within each essay were certainly made less accessible, even to human readers.…
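For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of quadratic weighted kappa in its standard textbook form. The function name, argument names, and handling of edge cases are assumptions of this sketch; it is not the code used in the study.

```python
from collections import Counter


def quadratic_weighted_kappa(human, machine, min_rating=None, max_rating=None):
    """Agreement between two integer score lists; 1.0 is perfect agreement."""
    if len(human) != len(machine) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    if min_rating is None:
        min_rating = min(min(human), min(machine))
    if max_rating is None:
        max_rating = max(max(human), max(machine))
    num_levels = max_rating - min_rating + 1
    if num_levels == 1:
        return 1.0  # only one possible score, so the raters cannot disagree

    total = len(human)
    observed = Counter(zip(human, machine))  # joint counts of score pairs
    hist_human = Counter(human)              # marginal counts per rater
    hist_machine = Counter(machine)

    numerator = 0.0
    denominator = 0.0
    for i in range(min_rating, max_rating + 1):
        for j in range(min_rating, max_rating + 1):
            # Penalty grows with the squared distance between the scores.
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            # Count expected by chance if the raters were independent.
            expected = hist_human[i] * hist_machine[j] / total
            numerator += weight * observed[(i, j)]
            denominator += weight * expected
    if denominator == 0:
        return 1.0  # both raters gave one identical constant score
    return 1.0 - numerator / denominator
```

Passing the rubric's full score range through `min_rating` and `max_rating` keeps the weights comparable across samples in which the extreme scores happen not to occur.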
Six of the essay sets were transcribed from their original paper-form administration in order to prepare them for processing by automated essay scoring engines, which require the essays to be in ASCII format. This process involved retrieving the scanned copies of essays from the state or a vendor serving the state, randomly selecting a sample of essays for inclusion in the study, and then sending the selected documents out for transcription.
Both the scanning and transcription steps had the potential to introduce errors into the data that would have been minimized had the essays been typed directly into the computer by the student, the normal procedure for automated essay scoring. Essays were scanned on high-quality digital scanners, but occasionally student writing was illegible because the original paper document was written with an instrument that was too light to reproduce well, was smudged, or included handwriting that was undecipherable. In such cases, or if the essay could not be scored by human raters (i.e., the essay was off-topic or inappropriate as determined by human raters), the essay was eliminated from the analyses. Transcribers were instructed to be as faithful to the written document as possible, while taking into account the capabilities a computer would have offered had one been used. For example, more than a few students used a print style in which all letters were capitalized. To address this challenge, we instructed the transcribers to capitalize according to conventional practice. This modification may have corrected some capitalization errors that students would otherwise have made, but it limited the over-identification of capitalization errors that the automated essay scoring engines might otherwise have made.
The first transcription company serviced four prompts from three states and included 11,496 essays. To assess the potential impact of transcription errors, a random sample of 588 essays was re-transcribed and compared on the basis of error rates for punctuation, capitalization, misspellings, and skipped data. Accuracy was calculated on the basis of the number of characters and the number of words, with an average rate of 98.12%. The second transcription company was evaluated using similar metrics. From a pool of 6,006 essays, a random sample of 300 essays was selected for re-transcription. Accuracy for this set of essays was calculated to be 99.82%.
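The character- and word-based comparison of two transcriptions can be sketched as follows. The source does not give the exact accuracy formula, so the definition used here (matched units divided by the length of the first transcription, with alignment by Python's `difflib.SequenceMatcher`) and the function names are assumptions of this sketch.

```python
from difflib import SequenceMatcher


def match_rate(reference, candidate):
    """Share of `reference` units that also appear, in order, in `candidate`."""
    if not reference:
        return 1.0 if not candidate else 0.0
    # autojunk=False stops long essays from having frequent characters ignored.
    matcher = SequenceMatcher(None, reference, candidate, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)


def transcription_accuracy(first, second):
    """Return (character_rate, word_rate) for two transcriptions of one essay."""
    char_rate = match_rate(first, second)
    # Splitting on whitespace treats each token as one unit for the word rate.
    word_rate = match_rate(first.split(), second.split())
    return char_rate, word_rate


if __name__ == "__main__":
    chars, words = transcription_accuracy("the cat sat", "the cat sit")
    print(f"character rate: {chars:.4f}, word rate: {words:.4f}")
```

Averaging the two rates over every essay in the re-transcribed sample would give a single figure comparable in spirit to the percentages reported above, though not necessarily computed the same way.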
Two of the essay sets were provided in ASCII format by their respective states. The 10th grade students in those states had typed their responses directly into the computer using web-based software that emulated a basic word processor. While the test employed digital technology, the conditions for testing were similar to those in states where the essays had been transcribed.…