Assignment 2: Comparing Stylo with TF-IDF for a Science Fiction Corpus S26
FINAL Overview
Assignment 2 invites you to compare how two methods classify a corpus of texts by science fiction writers from Project Gutenberg: the Stylo package in R and a TF-IDF (term frequency-inverse document frequency) approach. This assignment builds on concepts and tools we’ve discussed in class, particularly exploratory data analysis (EDA) using the two methods. It asks you to compare the two approaches and to synthesize your findings in a web-facing written essay with supporting visualizations.
- Format: Individual or pairs (maximum 2 people)
- Length: Approximately 1500-2000 words (about an 8- to 10-minute read), plus visuals
- Due Date: Tuesday, 28 April 2026, 11:59pm
The Corpus
A corpus of 18 texts by science fiction writers has been curated for you. These texts were all taken from Project Gutenberg.
| Author | Title | Publication Year | PG ID |
|---|---|---|---|
| Leigh Brackett | The Blue Behemoth | 1943 | 62349 |
| Leigh Brackett | Enchantress of Venus | 1949 | 64043 |
| Leigh Brackett | Black Amazon of Mars | 1951 | 32664 |
| Philip K. Dick | Second Variety | 1953 | 32032 |
| Philip K. Dick | The Defenders | 1953 | 28767 |
| Philip K. Dick | The Variable Man | 1953 | 32154 |
| Robert Bloch & Henry Kuttner | The Black Kiss | 1937 | 76435 |
| Henry Kuttner | The Ego Machine | 1952 | 32108 |
| Henry Kuttner | Thunder in the Void | 1942 | 68253 |
| Andre Norton | Plagueship | 1956 | 16921 |
| Andre Norton | Star Hunter | 1961 | 19090 |
| Andre Norton | Voodoo Planet | 1959 | 18846 |
| H.G. Wells | The Island of Doctor Moreau | 1896 | 159 |
| H.G. Wells | The Salvaging of Civilization | 1921 | 33889 |
| H.G. Wells | The War of the Worlds | 1898 | 36 |
| Marion Zimmer Bradley | Falcons of Narabedla | 1957 | 50566 |
| Marion Zimmer Bradley | Jackie Sees a Star | 1954 | 74144 |
| Marion Zimmer Bradley | The Door Through Space | 1961 | 19726 |
Bonus: If you would like to try something a little extra, keep these 18 texts, but use your favorite LLM to generate a few short novellas (2000 words minimum) in the style of one of the authors on the list. Add these to the corpus and include them in your assets folder in GitHub for me to see. In your assignment post, include the prompt you used and the conversation with the LLM.
Five Core Elements
This assignment has five core components:
- Corpus Research: You will be provided with a corpus and asked to work only with those texts, so no corpus choice is required. The corpus is in a folder in our class drive marked “Assignment 2 corpus.” Nonetheless, you will need to do some research into the authors and the texts included. Three convenient places to do research are Project Gutenberg (summaries), Wikipedia, and the Internet Speculative Fiction Database. The last of these is especially helpful for the crowd-created tags it contains for individual works of fiction. As in Assignment 1, research into the texts and contexts will be essential for your discussion. We took some collective notes about the authors in class. They can be found in Drive.
- Exploratory Analysis: You will do exploratory research with both methods, comparing how each one treats the same texts. You will use a notebook in posit.cloud to carry out the research with Stylo. The TF-IDF results are pre-computed for you across a range of parameters (the 100, 300, 500, 2000, and 3000 most frequent words).
- Written Synthesis: Assemble your evidence, analysis, and visuals in a web-published essay in the form of a post that tells a coherent story about your findings. Make sure that your visuals contribute to the telling of your story.
- Integration of Course Materials: Return to the article by Underwood, “The Dangers of Distant Reading.” Refer to it in your assignment where appropriate.
Referencing other readings or resources (podcasts, articles) from this course in your essay is optional. You may also draw on external sources as appropriate.
Instructions for the RMarkdown notebook (posit.cloud) and TF-IDF visualization
RMarkdown
In our class posit.cloud space you will find a notebook entitled “Stylometry with Science Fiction Authors from PG”. It contains steps for running unsupervised clustering on the 18+ texts of the corpus using the stylo package in R. NB: If you are also using LLM-generated texts, you will need to upload those texts into the corpus folder. If you add more than one at a time, you will need to compress them into a zip file.
You should try the Cluster Analysis (CA) across different numbers of most frequent words (MFWs), as well as the Bootstrap Consensus Tree (BCT). Include the visual that best helps you explain your findings. Also pay attention to the wordlist.txt file that is created.
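To build some intuition for why the number of MFWs matters, it can help to look at the kind of distance a stylometric cluster analysis is built on. The sketch below is a minimal, standard-library Python illustration of Burrows's Delta, a distance measure commonly used in stylometry. It is not the code that the class notebook or the stylo package runs: the tokenizer, the function names (`tokenize`, `burrows_delta`), and the use of the population standard deviation are simplifying assumptions made for illustration only.

```python
import re
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev


def tokenize(text):
    """Lowercase a text and split it into word tokens (a simplified tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def burrows_delta(texts, n_mfw):
    """Return {(name_a, name_b): delta} for every pair of texts.

    texts maps a text's name to its raw string (assumed non-empty);
    n_mfw is the number of most frequent words (MFWs) to keep.
    """
    tokens = {name: tokenize(raw) for name, raw in texts.items()}

    # Rank words by total count across the whole corpus and keep the top n_mfw.
    corpus_counts = Counter()
    for toks in tokens.values():
        corpus_counts.update(toks)
    vocab = [word for word, _ in corpus_counts.most_common(n_mfw)]

    # Relative frequency of each MFW in each text.
    rel = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        rel[name] = [counts[word] / len(toks) for word in vocab]

    # Z-score each word's frequencies across texts so every word counts equally.
    names = list(texts)
    z = {name: [] for name in names}
    for i in range(len(vocab)):
        column = [rel[name][i] for name in names]
        mu, sd = mean(column), pstdev(column)
        for name in names:
            z[name].append((rel[name][i] - mu) / sd if sd > 0 else 0.0)

    # Delta is the mean absolute difference between two texts' z-scores.
    return {
        (a, b): mean(abs(x - y) for x, y in zip(z[a], z[b]))
        for a, b in combinations(names, 2)
    }
```

Calling `burrows_delta(texts, 100)` and then `burrows_delta(texts, 2000)` on the same dictionary of texts would show how the pairwise distances shift as rarer, more content-bearing words enter the word list. A hierarchical clustering then groups the texts whose distances are smallest.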
TF-IDF
I have pre-computed the TF-IDF results for you; they are in Drive. The Assignment 2 folder in Drive contains five folders, corresponding to 100, 300, 500, 2000, and 3000 MFWs. In case you want to run the code yourself (an advanced step), note that the corpus is included only in the folder for 100 MFWs.
When you download the folders, open either light.html or dark.html to view the visualization in the browser. You can use the interactive buttons to look at different information from the data. You do not need to rerun the code. You should have everything you need in the interactive visuals of the analyses at 100, 300, 500, 2000, and 3000 MFWs.
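If you are curious about what a TF-IDF score is before opening the visuals, the following is a minimal sketch in standard-library Python of the textbook weighting, restricted to the top MFWs. It is an illustration under stated assumptions, not the code used to produce the pre-computed results, which may use a different tokenizer, weighting variant, or normalization. The names `tokenize` and `tfidf` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens (a simplified tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf(texts, n_mfw):
    """Return {text_name: {word: weight}} for the n_mfw most frequent words.

    Uses the textbook formula: weight = tf * log(N / df), where tf is the
    word's relative frequency in the text, N is the number of texts, and
    df is the number of texts that contain the word.
    """
    tokens = {name: tokenize(raw) for name, raw in texts.items()}

    # Keep only the n_mfw most frequent words across the whole corpus.
    corpus_counts = Counter()
    for toks in tokens.values():
        corpus_counts.update(toks)
    vocab = [word for word, _ in corpus_counts.most_common(n_mfw)]

    # Document frequency: in how many texts does each word occur?
    token_sets = {name: set(toks) for name, toks in tokens.items()}
    doc_freq = {w: sum(1 for s in token_sets.values() if w in s) for w in vocab}

    n_docs = len(texts)
    weights = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        weights[name] = {
            w: (counts[w] / len(toks)) * math.log(n_docs / doc_freq[w])
            for w in vocab
        }
    return weights
```

Under this formula, a word that occurs in every text (as most function words do) receives a weight of zero, because log(N / N) = 0. This is one reason a TF-IDF view tends to foreground words that are distinctive to particular texts rather than the very common words.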
Guiding Questions
As you write up your findings, consider these questions (but don’t feel obligated to answer all of them):
Background & Expectations of the Corpus:
- What did you know about science fiction / speculative fiction before working with this corpus?
- Are all of the texts in this corpus the same length?
- Do any of the writers write in very different ways?
- Did you close-read any of the texts included in the corpus?
- What kind of thematic difference exists in the corpus?
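One quick way to approach the question about text length is to count words per text. The snippet below is a small sketch in standard-library Python; the function names (`word_count`, `length_report`) and the local folder name `corpus` in the commented usage are hypothetical, and the simple tokenizer is an assumption.

```python
import re


def word_count(text):
    """Count word tokens in a string (runs of letters and apostrophes)."""
    return len(re.findall(r"[A-Za-z']+", text))


def length_report(texts):
    """Return (name, word_count) pairs sorted from shortest to longest text."""
    counts = {name: word_count(raw) for name, raw in texts.items()}
    return sorted(counts.items(), key=lambda item: item[1])


# Hypothetical usage, assuming the corpus sits in a local folder named "corpus":
# from pathlib import Path
# texts = {p.stem: p.read_text(encoding="utf-8") for p in Path("corpus").glob("*.txt")}
# for name, n in length_report(texts):
#     print(f"{n:>8}  {name}")
```

Large differences in length are worth noting in your write-up, since word frequencies estimated from a short text are based on far fewer observations than those from a long one.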
Methodological Expectations:
- How can you compare the two methods (Stylo’s hierarchical clustering and TF-IDF PCA clustering) in terms of what each purports to analyze?
- What do you think about the idea of function words or most distinctive words being the basis for distant reading?
- What kinds of words are likely driving each method’s results?
Computational Insights:
- What did the computational analysis of the corpus teach you about science fiction / speculative fiction that you did not already know?
- How does the wordlist generated by Stylo compare to the loadings that you can visualize in the TF-IDF PCA visualization?
- Would you expect the loadings to be the same across different numbers of MFWs?
- Were there words that appeared only in the loadings or only in the wordlist from Stylo?
- Was the gender of the writers linked to any stable kind of clustering? What about the theme of the book?
- Identify one case where the two methods disagree, even if slightly. What might explain this difference?
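For the questions about the wordlist and the loadings, a simple set comparison can make the overlap concrete. The sketch below is standard-library Python with hypothetical function names (`read_wordlist`, `compare_word_sets`). It assumes that wordlist.txt holds one word per line and that any header lines begin with `#`; check this against your own file. It also assumes that you copy the top loading words by hand from the TF-IDF PCA visualization.

```python
def read_wordlist(lines):
    """Return the words from a wordlist, skipping blank lines and '#' comment lines.

    Assumption: one word per line, with optional header lines that start with '#'.
    """
    words = []
    for line in lines:
        word = line.strip()
        if word and not word.startswith("#"):
            words.append(word)
    return words


def compare_word_sets(stylo_words, loading_words):
    """Report which words the two lists share and which are unique to each."""
    stylo_set, loading_set = set(stylo_words), set(loading_words)
    return {
        "shared": sorted(stylo_set & loading_set),
        "only_stylo": sorted(stylo_set - loading_set),
        "only_tfidf": sorted(loading_set - stylo_set),
    }


# Hypothetical usage:
# with open("wordlist.txt", encoding="utf-8") as handle:
#     stylo_words = read_wordlist(handle)[:100]
# loading_words = ["mars", "ship"]  # copied by hand from the visualization
# print(compare_word_sets(stylo_words, loading_words))
```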
Comparative Insights:
- How do the clusters that you get in hierarchical analysis compare to the proximity of the points you get in the TF-IDF PCA visualization?
- What is similar about them? What is different?
- Typically one would say that Stylo is good for detecting authorship and similarity in style of writing, and that TF-IDF is good for detecting content similarity. Did you find this to be true?
- If any of the texts were co-authored, did this make a difference for the clustering?
Trends & Surprises:
- Were there any surprises that you found when working with both methods?
- Did your analysis make you want to read any of these novels more deeply?
- If you took on the bonus part of the assignment to generate examples in the style of “Author X”, what happened to the clustering?
- What did the LLM understand by style?
Thought Questions:
- Are style and content truly separable?
- Do you think that understanding science fiction computationally will lead to new, AI-generated forms of sci-fi?
- What did you think about these novels, most of which date from the 1950s?
- How do you think they spoke to a post-WW2, Cold War American readership?
- What kinds of new literacies are required when we work with code to “read like a computer”?
Assessment
Your work will be assessed according to the following criteria located here:
Marking your assignment as done
It is fine to publish your assignment iteratively, but when you finish the final version of your assignment, write at the bottom of it “READY FOR GRADING”.
Good luck with your analysis!