Coded Datasets

Coded datasets that can be used to develop quantitative ethnography (QE) examples.


Chat data from the virtual internship Nephrotex. In this virtual internship, students worked in teams to design a biomedical device for dialysis machines. Discourse is in the content column. Important metadata columns include: uniqueuserid (a unique ID for each student), chatgroup (the team the student was in), roomName (the activity students were completing), site_x (the school the student attended), OutcomeScore (the score for a task toward the end of the internship), and OutcomeBins (high (1) or low (0) outcome designation).


Transcripts from rounds of debates featuring candidates for the Democratic Party presidential nomination. Discourse is in the Discourse column. Important metadata columns include: DATE (the date on which the debate occurred), GENDER (the gender of the candidate), THIRD.ROUND (whether the candidate made it to the third round of debates), and SPEAKER (the name of the candidate).

Game of Thrones

Scripts from the first seven seasons of HBO’s television series Game of Thrones. Discourse is in the Text column. Important metadata columns include: Season (the season of the show), Episode (the episode number), Scene (the scene number), Speaker (the speaker of the line), House (the family or house the speaker belongs to), Major.Char (whether the speaker is a major character: 1 for yes, 0 for no), and Gender (the gender of the speaker). Note that it could be useful to remove non-major characters from the units when making a model.
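
One way to drop non-major characters before building a model is to filter on the Major.Char column. The sketch below is only an illustration: it uses Python's standard csv module, the sample rows are invented placeholders rather than lines from the actual dataset, and the function name keep_major_characters is hypothetical. It also assumes the column is stored as the text values 1 and 0, as described above.

```python
import csv
import io

# Placeholder rows for illustration only; these are NOT taken from the real
# dataset. Only the column names follow the description above.
SAMPLE_CSV = """Season,Episode,Scene,Speaker,House,Major.Char,Gender,Text
1,1,1,Speaker A,House X,1,placeholder,placeholder line one
1,1,1,Speaker B,House Y,0,placeholder,placeholder line two
1,1,2,Speaker A,House X,1,placeholder,placeholder line three
"""


def keep_major_characters(rows):
    """Return only the rows whose Major.Char value marks a major character (1)."""
    return [row for row in rows if row["Major.Char"].strip() == "1"]


# Read every line of the script into a list of dictionaries keyed by column name.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Keep only lines spoken by major characters.
major_rows = keep_major_characters(rows)

print(len(rows), "rows read;", len(major_rows), "rows kept")
```

With the real file, the io.StringIO(SAMPLE_CSV) argument would be replaced by an open file handle for the downloaded CSV; the filtering step itself stays the same.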


Lines from two of Shakespeare’s most famous plays: Hamlet and Romeo and Juliet. Discourse is in the Line column. Important metadata columns include: Play (the name of the play), Act (the act number), Scene (the scene number), and Speaker (the speaker of the line).