Assignment 3 (optional): Exploring Word Vector Models for a Science Fiction Corpus S26
FINAL Overview
Assignment 3 invites you to compare different abstract representations of our corpus of about 1000 texts by science fiction writers from Project Gutenberg. This assignment builds on concepts and tools we’ve discussed in class, particularly exploratory data analysis (EDA) using other methods. It asks you to compare differently trained models and to synthesize your findings in a web-facing written essay with supporting evidence.
- Format: Individual or pairs (maximum 2 people)
- Length: Approximately 1500-2000 words (about an 8- to 10-minute read), plus visuals
- Due Date: Tuesday, 12 May 2026, 11:59pm
This assignment is optional. You can receive up to five (5) points of extra credit.
The Corpus
A corpus of 1000 texts by science fiction writers has been scraped from Project Gutenberg, and the front and end matter have been removed. These are the same 1000 texts we used for topic models. They are available in Drive. To save space, they have not been included in posit.cloud.
Five Core Elements
This assignment has five core components:
- Corpus Research: Researching the entire corpus is next to impossible given how many authors and texts it contains, so come up with a few “angles” on science fiction that interest you. These will help you define your starting words. If you want to look closely at particular clusters or analogies, you can use the free and open source tool AntConc, which supports searches like Voyant’s but can handle 1000 files. This is useful for “tracing back” to the individual texts that use specific words.
- Exploratory Analysis: You will carry out exploratory research using the methods outlined in a posit.cloud notebook (Word Vectors and SciFi authors). You do not have to train the word vector models, as they have been pretrained. Here are the models contained in posit.cloud:
| Model Name | Dimensions | Window | Best For |
|---|---|---|---|
| model_100d_w4 | 100 | 4 | Tight context - focuses on immediate, local word relationships and syntactic patterns |
| model_100d_w6 | 100 | 6 | Balanced baseline - good for general semantic exploration with moderate context |
| model_100d_w9 | 100 | 9 | Broader context - explores thematic and discourse-level associations with modest dimensionality |
| model_200d_w6 | 200 | 6 | Richer representations - captures more nuanced semantics while maintaining focused context |
| model_200d_w10 | 200 | 10 | Thematic focus - strong semantic content with emphasis on topic-level and broader conceptual relationships |
| model_300d_w8 | 300 | 8 | Maximum richness - captures complex semantic and thematic relationships across broader context |
The notebook proposes a number of different kinds of analysis: clustering, finding words closest to a single term or closest to multiple terms, semantic “subtraction” (finding words closest to the difference between two terms), analogies, as well as other forms of vector “math” (Vector Averaging, Orthogonal Projection, Centroid Comparison). Given that this is an extra credit assignment, you can do as many of these as you would like.
- Written Synthesis: Assemble your evidence, analysis, and visuals in a web-published essay in the form of a post that tells a coherent story about your findings. Make sure that your visuals contribute to the telling of your story. In this assignment, visuals will likely be lists of words in tables.
- Integration of Course Materials: Optionally, you can return to the article by Underwood, “The Dangers of Distant Reading,” and refer to it in your assignment where appropriate. Referencing other readings or resources (podcasts, articles) from this course in your essay is also optional. You may draw on external sources as appropriate.
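The vector operations listed under Exploratory Analysis (closest words, semantic “subtraction”, analogies, vector averaging, orthogonal projection, centroid comparison) can be sketched in a few lines. This is a minimal illustration in Python with NumPy, not the notebook's own code. The three-dimensional vectors and all function names are hypothetical and chosen only so the arithmetic is easy to follow. The pretrained models use 100 to 300 dimensions, and the notebook may expose these operations under different names.

```python
import numpy as np

# Hypothetical toy vectors, invented for illustration only.
# They are NOT taken from the course models.
vectors = {
    "ship":   np.array([0.9, 0.1, 0.0]),
    "rocket": np.array([0.8, 0.2, 0.1]),
    "planet": np.array([0.1, 0.9, 0.0]),
    "moon":   np.array([0.2, 0.8, 0.1]),
    "robot":  np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(query, k=3, exclude=()):
    """Return the k words whose vectors are most similar to the query vector."""
    scores = {w: cosine(query, v) for w, v in vectors.items() if w not in exclude}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def average(words):
    """Vector averaging: the centroid of several word vectors."""
    return np.mean([vectors[w] for w in words], axis=0)

def reject(a, b):
    """Orthogonal projection: remove from a the component that points along b."""
    return a - (np.dot(a, b) / np.dot(b, b)) * b

# Closest words to a single term
closest_to_ship = nearest(vectors["ship"], k=1, exclude=("ship",))

# Semantic "subtraction": words closest to the difference between two terms
ship_minus_planet = nearest(vectors["ship"] - vectors["planet"], k=2)

# Analogy: ship is to rocket as planet is to ?
analogy_query = vectors["rocket"] - vectors["ship"] + vectors["planet"]
analogy_answer = nearest(analogy_query, k=1, exclude=("ship", "rocket", "planet"))

# Centroid comparison: how similar are two groups of words on average?
centroid_similarity = cosine(average(["ship", "rocket"]), average(["planet", "moon"]))
```

With these toy values, the closest word to “ship” is “rocket” and the analogy resolves to “moon”. On the real models the same queries will return longer and noisier lists, which is where the interpretive work of the essay begins.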
Guiding Questions
As you write up your findings, consider these questions (but don’t feel obligated to answer all of them):
Background & Expectations of the Corpus:
- What time periods are represented in the corpus (think about the topic modeling and time notebook), and how might this affect language and themes?
- What do you expect to be the dominant themes in sci-fi literature (e.g., space, technology, dystopia)?
- Which words do you expect to appear in similar contexts (e.g., “ship”, “planet”, “crew”)?
- What conceptual differences might exist between early vs later sci-fi texts?
- Do you expect differences between human-centered vs machine-centered vocabulary?
- What biases might exist in a corpus drawn from public domain texts?
Computational Insights:
- What do you think it means for the word vector model to learn meaning from context?
- Pick some clusters of words from some of the models and identify what they are about. What is their relationship to science fiction?
- Which words are closest to a target word? Do these close words match your expectations?
- If you make the same queries across different iterations (iter), how stable are the results?
- What role does negative sampling play in distinguishing words?
- Does the window size affect what counts as context?
- What differences do you see with higher vs lower vector dimensions?
- What kinds of relationships does the model capture most clearly?
- How much of your human mind do you feel you bring to this kind of analysis?
- If we had included the translations of Wells’ novels in the model training, what do you think would have happened?
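To make the window-size question above concrete, here is a small sketch of how a symmetric context window decides which words count as “context” for a target word. It is a simplification written for illustration: the function name `context_pairs` and the example sentence are hypothetical, and real word2vec training adds further steps (such as subsampling and varying the effective window) that are left out here.

```python
def context_pairs(tokens, window):
    """List (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

# Hypothetical example sentence, not drawn from the corpus
sentence = "the crew of the ship landed on the red planet".split()

# Context words seen for "ship" under a narrow and a wide window
narrow = [c for t, c in context_pairs(sentence, 2) if t == "ship"]
wide = [c for t, c in context_pairs(sentence, 4) if t == "ship"]
```

With a window of 2, “ship” only sees “of”, “the”, “landed”, and “on”. With a window of 4 it also sees “crew” and “red”. Wider windows therefore pull in more topical neighbors, while narrow windows stay closer to immediate grammatical surroundings.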
Trends & Surprises:
- What word relationships surprised you the most?
- Are there strange or unintuitive nearest neighbors?
- Do certain sci-fi terms dominate the semantic space?
- Did you ever find yourself in a “corner” of semantic space where the relationships were not clear, or seemed very specific to some texts?
- When you switched models, did the explanation in the “Best For” column of the table above hold true?
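One simple way to put a number on how much results change when you switch models, or when you rerun the same query, is to measure the overlap between two nearest-neighbor lists. The sketch below uses Jaccard overlap, which is one possible measure among several and is not prescribed by the notebook. The two neighbor lists are invented placeholders, not output from the course models. Replace them with the lists you actually obtain.

```python
def jaccard(a, b):
    """Share of words that two lists have in common (0.0 = none, 1.0 = identical sets)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty lists are treated as identical
    return len(a & b) / len(a | b)

# Placeholder neighbor lists for one target word, invented for illustration.
# Substitute the top-N neighbors returned by two different models.
neighbors_narrow_window = ["vessel", "craft", "rocket", "cruiser", "hull"]
neighbors_wide_window = ["vessel", "rocket", "voyage", "crew", "orbit"]

overlap = jaccard(neighbors_narrow_window, neighbors_wide_window)
only_narrow = sorted(set(neighbors_narrow_window) - set(neighbors_wide_window))
only_wide = sorted(set(neighbors_wide_window) - set(neighbors_narrow_window))
```

The words that appear in only one list are often more interesting for your essay than the overlap score itself, since they show what each model “sees” that the other does not.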
Thought Questions:
- What does it mean for a machine to learn meaning without definitions?
- How does word2vec’s “reading” differ from human interpretation?
- What kinds of meaning does the model capture well? Which does it capture least well?
- How does changing parameters change what the computer “sees”?
- What kind of evidence would you call vector relationships?
- Is word2vec analysis an interesting method to you? Compare it to others we have looked at.
Assessment
Your work will be assessed according to the following criteria located here:
Marking your assignment as done
It is fine to publish your assignment iteratively, but when you finish the final version of your assignment, write at the bottom of it “READY FOR GRADING”.
Good luck with your analysis!