The original data were modified as follows. The essays were put through an anonymization process that replaced references to persons, places, and locations with tokens. For example, if a student wrote “her name was Jada, attended school at Jackson High School in Canton, OH”, this would be replaced with “her name was @person, attended @place high school in @location”. A few essays were eliminated because the anonymization process could not scrub the text sufficiently to remove information that might lead to the identification of an individual writer.
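The token replacement described above can be illustrated with a minimal sketch. It assumes a hand-built lookup table of entity strings (the hypothetical `ENTITIES` dictionary); the actual anonymization procedure is not specified here and presumably relied on automatic entity recognition, so the function below is illustrative only and its output need not match the real process word for word.

```python
import re


def anonymize(text, entities):
    """Replace each known entity string in `text` with its token."""
    if not entities:
        return text
    # Try longer entity strings first so that a multi-word entity is
    # preferred over a shorter entry starting at the same position.
    ordered = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(e) for e in ordered))
    return pattern.sub(lambda m: entities[m.group(0)], text)


# Hypothetical lookup table, assumed for illustration only.
ENTITIES = {
    "Jada": "@person",
    "Jackson High School": "@place",
    "Canton, OH": "@location",
}

essay = "her name was Jada, attended school at Jackson High School in Canton, OH"
anonymized = anonymize(essay, ENTITIES)
print(anonymized)
```

A plain string lookup of this kind only catches entities that are already listed and matches them without regard to word boundaries, which hints at why some essays could not be scrubbed reliably and had to be removed.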
To assess the suitability of the anonymized data for evaluating automated essay scoring systems, a small internal study was completed with the LightSIDE engine to determine how much scoring performance differed between the original and the anonymized data. LightSIDE is an open-source scoring engine developed at Carnegie Mellon University and was included along with the commercial vendors in the first study. In that study the engine demonstrated high agreement with human ratings, although it had no NLP capabilities. The analysis was performed because it was suspected that the anonymized data might be harder to model than the original data, since it would contain less specific information. However, the LightSIDE model showed only a slight drop in agreement with human scores, with quadratic weighted kappa across the data sets falling from .763 to .759; based on a t-test, this difference was not statistically significant (p = .15). While the data anonymization process therefore seems not to have substantially impeded the ability of machine-learning based systems to model human scores on this data set, it may have made it more difficult for participants to develop features related to deeper aspects of writing ability. Since content words were replaced with meaningless symbols in the process, the grammatical structures and meaning relationships within each essay were certainly made less accessible, even to human readers.
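For readers unfamiliar with the agreement measure used in this comparison, the following is a minimal sketch of quadratic weighted kappa in its standard textbook form, written in plain Python. The function name, its arguments, and the toy ratings are assumptions made for illustration; this is not the code used in the study, and the toy ratings are not study data.

```python
def quadratic_weighted_kappa(human, machine, min_rating, max_rating):
    """Agreement between two integer rating lists, penalizing large gaps more."""
    n_ratings = max_rating - min_rating + 1
    n_items = len(human)

    # Observed confusion matrix: rows are human ratings, columns machine ratings.
    observed = [[0] * n_ratings for _ in range(n_ratings)]
    for h, m in zip(human, machine):
        observed[h - min_rating][m - min_rating] += 1

    # Marginal rating counts for each rater.
    hist_h = [sum(row) for row in observed]
    hist_m = [sum(observed[i][j] for i in range(n_ratings))
              for j in range(n_ratings)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_ratings):
        for j in range(n_ratings):
            # Quadratic disagreement weight: 0 on the diagonal, 1 at the extremes.
            weight = (i - j) ** 2 / (n_ratings - 1) ** 2
            # Count expected in this cell if the two raters were independent.
            expected = hist_h[i] * hist_m[j] / n_items
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    # Undefined (division by zero) when the scale has a single rating value
    # or when both raters assign every item the same single rating.
    return 1.0 - weighted_observed / weighted_expected


# Toy ratings on a 1-4 scale, invented for illustration.
perfect = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4)
reversed_order = quadratic_weighted_kappa([1, 2, 3, 4], [4, 3, 2, 1], 1, 4)
print(perfect, reversed_order)
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. Because the penalty grows with the square of the score difference, a machine score two points away from the human score costs four times as much as a score one point away.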