One Challenge, Not Two Problems: Regular Expressions for Researching a Single-Author Corpus Project by Robert W. Williams, Ph.D., Political Science, Formerly Bennett College, Japanese Association for Digital Humanities, JADH 2021 Virtual Conference.
2021. • DRAFT Version Only
Abstract In this presentation I discuss how Digital Humanities tools and techniques, here a concordancer employing regular expression (regex) searching protocols, can be used to analyze a collection of W.E.B. Du Bois's works (230 texts). This corpus is neither comprehensive nor representative of his more than 2000 published writings. My particular research goal focuses on exploring the corpus for the ways by which Du Bois expresses the idea of the "unknowable", whether by specific word or related synonyms and phrases. I sketch the research workflow, which involves ways to locate words in the corpus via regex-oriented concordancing, and then requires reading the co-text closely in order to disambiguate the same word or phrase into its different conceptual meanings. I also outline several findings that point to two forms of the "unknowable" in Du Bois's thinking: the lack of direct knowledge of others' experiences and the impossibility of ever knowing some things. ◈
Acknowledgments My personal thanks to . . . • Japanese Association for Digital Humanities • Historiographical Institute, The University of Tokyo • Two anonymous reviewers • Dr. Yuta Hashimoto

The Project

My project aligns with an interpretive approach to the study of ideas: I seek to comprehend, as much as possible, how humans express their understandings of the world in the meaningful words they write. a)  For many years I have been researching the ideas of W.E.B. Du Bois (1868-1963), an African American civil rights activist and scholar. b)  I have been studying Du Bois's philosophy of social science in general and, in particular, his idea of the “unknowable” and its significance for research and activism.
In the world of Du Bois scholarship his idea of “unknowable” is understudied and limited to a few texts (Balfour 2011; Bromell 2011 and 2013; Gooding-Williams 2009; C.E. Mitchell 1997; Monteiro 2008). Scholarly research has not traced the concept across his texts. a)  As I have argued in my scholarly work (e.g., R.W. Williams 2016 & 2018), the unknowable is a significant theme in his own research and activism. b)  For Du Bois, the unknowable was not merely a lack of personal or general knowledge about something. Rather, it involved a profound lack of knowledge because data was (and is) epistemologically inaccessible. c)  This inaccessibility is due, for example, to our inability to know about events because of irretrievable information, and to our inability to know others directly because we cannot experience what they are feeling.
My earlier research techniques involved reading, even re-reading, one individual text at a time for specific words and phrases that might evoke Du Bois's concept of the unknowable. a)  Nonetheless, I wish to find more instances of the unknowable in a collection of Du Bois's texts—a corpus—accessed in one moment as a totality. b)  This research task of reading a corpus at one time requires special tools and techniques.
My current project has two interrelated goals: a)  The first goal: find Du Bois's concept of the “unknowable” as well as its variant expressions amid the myriad words of his writings. b)  From any such matches, the second goal is to discern how Du Bois defines and applies the concept.
I created a 230-document corpus of Du Bois's writings. a)  This single-author corpus contains over 3 million words encompassing many published essays and newspaper articles, 19 books, and a few unpublished writings. b)  This corpus is neither comprehensive nor fully representative of his over 2000 published writings (Aptheker 1973). c)  Moreover, the corpus is not representative of the large number of his unpublished writings, including correspondence and drafts of printed works, as well as complete and incomplete manuscripts. These unpublished materials, housed in university archives, number in the several tens of thousands of items.
My interpretive project requires three vital research components.
First research requirement for the project: a)  I need an efficient and effective way to access a large number of documents within and across a corpus, something that could scarcely be undertaken without computers. b)  This is the realm that DH calls distant reading, which involves techniques (e.g., topic modeling, n-grams, word frequencies) that focus on the corpus of texts as an ensemble (Jockers 2013; Rockwell & Sinclair 2016; Underwood 2019). But I also need access to individual texts, something which is typically considered part of the province of close reading techniques.
Second research requirement for the project: a)  I need a way to locate the synonymous words by which Du Bois expressed the concept of the “unknowable.” b)  I must be able to access in the corpus the richness of Du Bois's nuanced, aesthetic, and theoretical expressions. c)  At the core of this second research element is the word/concept distinction. A concept is expressed by a word(s), but authors also convey the idea via other words and phrases, including metaphorical devices (Grondelaers et al. 2007; Skinner 2002; also Gunnell 2011; Hampsher-Monk et al. 1998).
Third research requirement for the project: a)  I need a way to examine sufficient textual details in order to disambiguate the uses of words and phrases that any computational technique presents as output. I prefer to disambiguate by reading more closely into the word's co-text: the sentences, paragraphs, and even the document itself. b)  This accords with my vocation as a political theorist: to minutely examine the intricacies of Du Bois's multiple expressions of the "unknowable", even the unique ones, the hapax legomena. Digital humanities techniques that reduce the lexicon to summaries or to statistically derived measures, as fruitful as they are for other projects, do not suffice for mine.
Accordingly, a)  a concordancer (here AntConc: Anthony 2020) is my software tool of choice for researching a corpus and its component documents (Sinclair 1991 & 2003; Stubbs 2015; Tognini-Bonelli 2010). b)  In order to locate the words salient to interpreting the “unknowable” concept, my technique of choice involves regular expressions (regexes), a notational system permitting us to match patterns of characters, such as a word or even words near each other, within larger spans of text. [Sources are listed in the "Regular Expressions" unit within the "References" section below.]
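To illustrate what a concordancer does with a regex, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not AntConc's implementation. The function name kwic, the width parameter, and the sample string (a fragment of the Darkwater passage quoted in the findings) are assumptions made for illustration.

```python
import re

def kwic(text, pattern, width=30):
    """Return (left, node, right) tuples for every regex match in text."""
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]   # co-text before the node
        right = text[m.end():m.end() + width]               # co-text after the node
        lines.append((left, m.group(0), right))
    return lines

# Fragment of the Darkwater (1920) passage quoted in this paper.
sample = ("in the last analysis only the sufferer knows his sufferings "
          "and that no state can be strong which excludes from its "
          "expressed wisdom the knowledge possessed by mothers")

hits = kwic(sample, r"\bknow\w*")
for left, node, right in hits:
    # Right-align the left co-text so the node words line up in a column.
    print(f"{left:>30} | {node} | {right}")
```

Each output line places the node word in the middle with its co-text on either side, which is the layout a researcher then reads closely.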

Work Flow

My project follows an iterative workflow within my interpretive process, which includes these steps:
Step A. Search for the node word “unknowable” or “un-knowable”, and “unknown”:
a)  “Un(-)knowable” is scarce in the corpus (DARK 1920: Ch.X; LCH 1943; LHA 1956; BF1-OM 1957; BF2-MB 1959; BF3-WC 1961). b) “Unknown” as unknowable in principle is somewhat more prevalent compared with “un(-)knowable” (e.g., TPN 1899, HAPS 1904).
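Step A can be sketched in Python as follows. The pattern is one plausible way to cover the three node words and is not necessarily the one used in the project. The dictionary is a hypothetical two-item stand-in for the corpus: the first string is taken from the “Postscript” quoted later in this paper, and the second is invented.

```python
import re

# An optional hyphen covers "unknowable" and "un-knowable";
# "unknown" is matched by a separate branch.
STEP_A = re.compile(r"\bun-?knowable\b|\bunknown\b", re.IGNORECASE)

# Hypothetical mini-corpus keyed by text code (values are short stand-ins).
corpus = {
    "BF1-OM": "in order to make unknown and unknowable history relate an ordered tale",
    "demo": "The Un-knowable is not the same as the merely unknowing.",
}

hits = {code: STEP_A.findall(text) for code, text in corpus.items()}
for code, found in hits.items():
    print(code, found)
```

The word boundaries keep “unknowing” out of the results, so only the intended node words are returned for close reading.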
Step B. Within the co-text presented in the concordance lines and the etexts, by which words did Du Bois define, apply, or adapt “unknowable”? a)  For example, the following co-occurring words offer us potentially fruitful node words with which to search (DARK 1920: Ch.VI; LCH 1943; BF1-OM 1957): limitations, limit, science, scientific, sciences, scientifically, scientist, knowledge, reasonable, reason, reasoning, logic, logical, fiction, history, historical, facts, imagination, own, only. b)  How might Du Bois relate any or all of those words to unknowability? Our regex searching will continue in and across other texts via Step C.
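Gathering co-occurring words can also be approximated computationally before the close reading begins. The sketch below counts the words that fall within a fixed window around each node match. The function name cotext_words and the window size are assumptions, and the passage is the “Postscript” excerpt quoted later in this paper. A real run would probably filter out function words such as “to” and “the”.

```python
import re
from collections import Counter

def cotext_words(text, node_pattern, span=8):
    """Count words found within `span` tokens on either side of each node match."""
    tokens = re.findall(r"[A-Za-z'-]+", text.lower())
    node = re.compile(node_pattern)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if node.fullmatch(tok):
            # Tokens before and after the node, excluding the node itself.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

passage = ("I have used fiction to interpret those historical facts which "
           "otherwise would not be clear. Beyond this I have in some cases "
           "resorted to pure imagination in order to make unknown and "
           "unknowable history relate an ordered tale to the reader.")

counts = cotext_words(passage, r"un-?knowable")
print(counts.most_common())
```

Words such as “imagination” and “history” surface as candidate node words, while words outside the window, such as “fiction”, would need a wider span or a reading of the full text.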
Step C. Craft regexes to search for node words derived from the co-text. a)  For example,
b)  We can even search for words in the co-text that are not variants of “know”, but which relate to it, such as “science” or “scientific”, “reason”, “logic”, and so forth. For example:
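One possible form such regexes could take is sketched below in Python rather than in AntConc's search box. Both patterns are illustrative assumptions and not necessarily the ones used in the project. The sample sentence is invented.

```python
import re

# Word family of "know": know, knows, knowing, known, knowledge, knowable, knew,
# with an optional "un" or "un-" prefix (unknown, un-knowable, unknowing).
KNOW_FAMILY = re.compile(
    r"\b(?:un-?)?kn(?:ow(?:s|n|ing|ledge|able)?|ew)\b", re.IGNORECASE)

# Related node words from the co-text: the "science", "reason", and "logic" families.
RELATED = re.compile(
    r"\b(?:scien(?:ces?|tific(?:ally)?|tists?)|reason(?:s|ing|able)?|logic(?:al)?)\b",
    re.IGNORECASE)

# Invented sentence, used only to exercise the patterns.
sample = ("Science has its limitations: reasonable men knew that some facts "
          "are unknowable, whatever the scientist may know by logic.")

know_hits = KNOW_FAMILY.findall(sample)
related_hits = RELATED.findall(sample)
print(know_hits)
print(related_hits)
```

The leading word boundary means that a word such as “acknowledge” is not matched by the first pattern, which reduces the number of lines to disambiguate by hand.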
Step D. When reading through the concordance lines and the full-text for the node words, we also locate multi-word phrases that potentially convey or apply the idea, a)  including “will/can never know”; “can never be known”; “none will ever know”; “These facts are gone forever.” (BF1-OM 1957: “Postscript”); and “only the man himself. . .knows his own condition” (DARK 1920: Ch.VI). b)  For multi-word phrases we can use proximity-oriented regexes to locate nearby words within the documents, including the same words in reverse order. For example:
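A proximity-oriented regex of this kind might look like the following Python sketch. The gap of up to three intervening words and the names GAP, NEVER, KNOW, and PROXIMITY are assumptions. The first three sample strings come from phrases and quotations in this paper, and the last two are invented.

```python
import re

GAP = r"(?:\W+\w+){0,3}?\W+"        # up to three intervening words
NEVER = r"\b(?:never|none)\b"
KNOW = r"\bkn(?:ow[ns]?|ew)\b"      # know, knows, known, knew

# Either order: "never ... know" or "known ... never".
PROXIMITY = re.compile(f"{NEVER}{GAP}{KNOW}|{KNOW}{GAP}{NEVER}", re.IGNORECASE)

samples = [
    "The world will never know the exact number of slaves transported to America.",
    "It can never be known.",
    "None will ever know.",
    "What is known today was never a secret.",   # invented: reverse order
    "I know the road well.",                      # invented: no negation nearby
]

matched = [bool(PROXIMITY.search(s)) for s in samples]
for s in samples:
    m = PROXIMITY.search(s)
    print(m.group(0) if m else "(no match)", "<-", s)
```

The fourth sample matches even though it does not express unknowability, which shows why the matched lines still have to be disambiguated by reading the co-text.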
For the purpose of disambiguating the meanings of the same matched word, closer readings of the co-text in the actual documents themselves are required for my interpretive research process.

Discussion of Findings

The four findings pertain to both of the intertwined research goals. a) The first goal: locating Du Bois's concept of the “unknowable” and its variant expressions. b) The second goal: examining Du Bois's definition and application of the concept.
First finding: a)  When Du Bois used the specific term “unknowable”, in the few places where it did appear, it was often related to knowledge in general or to historical research in general (LCH 1943; LHA 1956). [Figure 1 below.] b)  For example, in Du Bois's “Postscript” to his novel The Ordeal of Mansart we find:

The basis of this book is documented and verifiable fact, but the book is not history. On the contrary, I have used fiction to interpret those historical facts which otherwise would not be clear. Beyond this I have in some cases resorted to pure imagination in order to make unknown and unknowable history relate an ordered tale to the reader. [BF1-OM 1957]

Figure 1: Regex matching the variants
c)  Only in a relatively small number of cases did Du Bois use “unknown” to refer to unknowability (e.g., TPN 1899: missing population data passim).
Second finding: a)  When Du Bois utilized specific multi-word phrases to express unknowability, such as “will/can never know” or “none will ever know,” Du Bois was focusing on particular instances where specific pieces of information were unrecoverable. b)  For example, in “The Development of a People”:
The world will never know the exact number of slaves transported to America. [TDOP 1904: ¶27]
c)  On the other hand, via disambiguation we find that many cases of “do not know” did not refer to the unknowable in principle. [Figure 2 below.]
Figure 2: Regex matching variants of “do not know”
d)  A notable exception is Du Bois's “The Individual and Social Conscience”.

Here in this my neighbor stand things I do not know, experiences I have never felt, depths whose darkness is beyond me, and heights hidden by the clouds; or, perhaps, rather, differences in ways of thinking, and dreaming, and feeling which I guess at rather than know; strange twistings of soul that curve between the grotesque and the awful. [IASC 1905: ¶3]

In that quotation “I do not know” refers to unknowability in principle: I cannot know what you are actually experiencing; I can only know about it, and only sympathetically, because I too am a human with feelings.
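The variants of “do not know” discussed in this finding could be gathered with a pattern along the following lines. The pattern is an assumption for illustration. The first sample is from the passage quoted above, and the other samples are invented. The regex only collects candidates; deciding which of them express unknowability in principle still requires reading the co-text.

```python
import re

# Auxiliary + negation + "know": full forms and contractions,
# with either a straight or a curly apostrophe.
DO_NOT_KNOW = re.compile(
    r"\b(?:(?:do|does|did)\s+not|(?:do|does|did)n['’]t)\s+know\b",
    re.IGNORECASE)

samples = [
    "Here in this my neighbor stand things I do not know, experiences I have never felt",
    "She did not know the hour.",   # invented
    "They don't know yet.",         # invented
    "I do know.",                   # invented: no negation
]

found = [DO_NOT_KNOW.findall(s) for s in samples]
for s, hits in zip(samples, found):
    print(hits, "<-", s)
```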
Third finding: a)  When Du Bois was indicating that the unknowable involved no direct knowledge of another's thoughts, experiences, and feelings, then words like “alone”, “only”, and “own” usually occurred in the near vicinity of the lemma “know”. b)  For example, in Darkwater, we read:
But remember the foundation of the argument,—that in the last analysis only the sufferer knows his sufferings and that no state can be strong which excludes from its expressed wisdom the knowledge possessed by mothers, wives, and daughters. [DARK 1920: ¶27 (Ch.VI)]
c)  Reading the concordance lines located some interesting cases. [Figure 3 below.]
Figure 3: Regex matching variants of “only | alone | own”
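An alternative to a single proximity regex is to test for co-occurrence sentence by sentence, as in the Python sketch below. The sentence splitter is deliberately naive, and the names KNOW, EXCLUSIVE, and candidate_sentences are assumptions. The first sentence is shortened from the Darkwater quotation, and the other two are invented.

```python
import re

KNOW = re.compile(r"\bkn(?:ow(?:s|n|ing|ledge)?|ew)\b", re.IGNORECASE)
EXCLUSIVE = re.compile(r"\b(?:only|alone|own)\b", re.IGNORECASE)

def candidate_sentences(text):
    """Keep sentences in which a form of "know" co-occurs with only/alone/own."""
    # Naive split on sentence-final punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if KNOW.search(s) and EXCLUSIVE.search(s)]

text = ("In the last analysis only the sufferer knows his sufferings. "
        "The town was known for its mills. "
        "He walked alone along the river.")

found = candidate_sentences(text)
print(found)
```

The word boundaries prevent “town” from counting as “own”, so the second sentence is dropped even though it contains “known”.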
Fourth finding: a)  When Du Bois wrote the past-tense form “knew” as part of the phrases “knew not” and “knew nothing”, it did not invoke unknowability in principle. Persons did come to know or could have known in and through other circumstances. b)  One exception was the adverb “never” associated with “knew”. Here “never knew” did tend to implicate unknowability in principle. For example, in Du Bois's autobiography, we read:
I heard the Negro folksong first in Great Barrington, sung by the Hampton Singers. But that was second-hand, sung by youth who never knew slavery. [While in Tennessee matriculating at Fisk University] I now heard the Negro songs by those who made them and in the land of their American birth. It was in the village into which my country school district filtered of Saturdays and Sundays. [A68 1968: p.120]
c)  Reading the concordance lines helped to disambiguate the cases. [Figure 4 below.]
Figure 4: Regex matching variants of “never knew”
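The distinction drawn in this finding can be mirrored in a regex with named groups, so that “never knew” is sorted apart from “knew not” and “knew nothing” before close reading. This is a sketch under assumptions: the group names and the function sort_knew are hypothetical, and only the first clause of the sample text comes from the Autobiography passage quoted above.

```python
import re

KNEW = re.compile(
    r"\b(?P<never>never\s+knew)\b|\b(?P<other>knew\s+(?:not|nothing))\b",
    re.IGNORECASE)

def sort_knew(text):
    """Split matches into 'never knew' versus 'knew not' / 'knew nothing'."""
    groups = {"never": [], "other": []}
    for m in KNEW.finditer(text):
        # lastgroup is the name of the branch that matched.
        groups[m.lastgroup].append(m.group(0))
    return groups

text = ("sung by youth who never knew slavery. "
        "He knew not the way, and she knew nothing of it.")

result = sort_knew(text)
print(result)
```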

In Closing

Ultimately, such concordancer-mediated techniques navigate between distant and closer forms of reading. a)  They help us to study how authors articulate their ideas in multifarious ways within the individual texts of a corpus. b)  As regards my research on Du Bois, concordancing has helped me locate more instances of the idea of the “unknowable” than I had discovered via my initial or even second readings of the texts. Concordancing has proven its worth to me.


A.  Works Written or Edited by W.E.B. Du Bois

A68. 1968. The Autobiography of W.E.B. Du Bois. NY: International Publishers.

BF1-OM. 1957. The Ordeal of Mansart. NY: Mainstream Publishers.

BF2-MB. 1959. Mansart Builds a School. NY: Mainstream Publishers.

BF3-WC. 1961. Worlds of Color. NY: Mainstream Publishers.

DARK. 1920. Darkwater. NY: Harcourt, Brace and Howe.

HAPS. 1904. “Heredity and the Public Schools.” Pp.45-52 in Pamphlets and Leaflets. Herbert Aptheker (Ed.). White Plains, NY: Kraus-Thomson Organization, 1986.

IASC. 1905. “The Individual and Social Conscience” [Originally Untitled]. Pp.53-55 in Religious Education Association, The Aims of Religious Education. The Proceedings of the Third Annual Convention..., 1905. Chicago: Executive Office of the R.E.A.

LCH. 1943. “Letter from W.E.B. Du Bois to American Philosophical Association, December 13, 1943.” W.E.B. Du Bois Papers. Special Collections & University Archives. University of Massachusetts Library. .

LHA. 1956. “Letter to Herbert Aptheker, January 10, 1956.” Pp.394-396 in The Correspondence of W.E.B. Du Bois: Vol. III: Selections, 1944-1963. Herbert Aptheker (Ed.). Amherst: University of Massachusetts Press, 1978.

TDOP. 1904. “The Development of a People.” International Journal of Ethics, 14:3 (April): 292-311.

TPN. 1899. The Philadelphia Negro. Philadelphia: Ginn.

B.  Works Written or Edited by Others

Anthony, Laurence. 2020. AntConc 3.5.9. [Computer Software, 64-bit]. Tokyo, Japan: Waseda University. URL:

Aptheker, Herbert. 1973. Annotated Bibliography of the Published Writings of W.E.B. Du Bois. Millwood, NY: Kraus-Thomson.

Balfour, Lawrie. 2011. Democracy's Reconstruction: Thinking Politically with W.E.B. Du Bois. Oxford: Oxford University Press.

Bromell, Nick. 2011. "W.E.B. Du Bois and the Enlargement of Democratic Theory." Raritan, 30:4 (Spring): 140-161.

Bromell, Nick. 2013. The Time is Always Now: Black Thought and the Transformation of US Democracy. NY: Oxford University Press.

Gooding-Williams, Robert. 2009. In the Shadow of Du Bois: Afro-Modern Political Thought in America. Cambridge, MA: Harvard University Press.

Grondelaers, Stefan, Speelman, Dirk, & Geeraerts, Dirk. 2007. "Lexical Variation and Change". Ch.37 [pp.988-1011] in Dirk Geeraerts & Hubert Cuyckens (Eds.), Oxford Handbook of Cognitive Linguistics. London, UK: Oxford U.P.

Gunnell, John. 2011. "Interpretation and the Autonomy of Concepts". Ch.6 in J.G. Gunnell, Political Theory and Social Science: Cutting Against the Grain. NY: Palgrave Macmillan.

Hampsher-Monk, Iain; Karin Tilmans; & Frank van Vree. 1998. "A Comparative Perspective on Conceptual History​—An Introduction". Pp.1-9 in Iain Hampsher-Monk, Karin Tilmans, & Frank van Vree (Eds.), History of Concepts: Comparative Perspectives. n.l.: Amsterdam University Press.

Jin, Jay. 2017. "Problems of Scale in 'Close' and 'Distant' Reading." Philological Quarterly, 96:1: pp.105-129.

Jockers, Matthew L. 2013. Macroanalysis: Digital Methods & Literary History. Urbana: University of Illinois Press.

Mitchell, Charles E. 1997. Individualism and Its Discontents: Appropriations of Emerson, 1880-1950. Amherst: University of Massachusetts Press.

Monteiro, Anthony. 2008. "W.E.B. Du Bois and the Study of Black Humanity: A Rediscovery." Journal of Black Studies, 38:4, (March): 600-621.

Richter, Melvin. 1995. The History of Political and Social Concepts: A Critical Introduction. NY: Oxford U.P.

Rockwell, Geoffrey & Stéfan Sinclair. 2016. Hermeneutica: Computer-Assisted Interpretation in the Humanities. Cambridge, MA: The MIT Press.

Sinclair, John. 1991. Corpus, Concordances, Collocation. Oxford: Oxford U.P.

Sinclair, John. 2003. Reading Concordances: An Introduction. London: Pearson.

Sinclair, John. 2004. Trust the Text: Language, Corpus and Discourse. London: Routledge.

Skinner, Quentin. 2002. "The idea of a cultural lexicon." Pp.158-174 in Q. Skinner, Visions of Politics, Volume 1: Regarding Method. Cambridge, UK: Cambridge U.P.

Stefanowitsch, Anatol. 2020. Corpus Linguistics: A Guide to Methodology. Berlin: Language Science Press. URL: .

Stubbs, Michael. 2015. "Computer-Assisted Methods of Analyzing Textual and Intertextual Competence". Pp.486-504 [Ch.23] in Deborah Tannen, Heidi E. Hamilton, & Deborah Schiffrin (Eds.), The Handbook of Discourse Analysis, Second Edition. West Sussex, UK: John Wiley & Sons.

Tognini-Bonelli, Elena. 2010. "Theoretical overview of the evolution of corpus linguistics". Pp.14-27 in Anne O’Keeffe & Michael McCarthy (Eds.), The Routledge Handbook of Corpus Linguistics. London: Routledge.

Underwood, Ted. 2019. Distant Horizons: Digital Evidence and Literary Change. Chicago: University of Chicago Press.

Williams, Robert W. 2016. "W.E.B. Du Bois on Scientific Knowledge and Its Limits." Paper presented at the Symposium Celebrating the 120th Anniversary of the Atlanta Sociological Laboratory and the Work of W.E.B. Du Bois, Clark Atlanta University. 25 February 2016. URL: .

Williams, Robert W. 2018. “A Democracy of Differences: Knowledge and the Unknowable in Du Bois's Theory of Democratic Governance.” Pp.181-203 in Nick Bromell (Ed.), A Political Companion to W.E.B. Du Bois. Lexington: University Press of Kentucky.

C.  Regular Expressions: Regex Resources

Friedl, Jeffrey E.F. 2006. Mastering Regular Expressions, 3rd Edition. Sebastopol, CA: O'Reilly.

Goyvaerts, Jan & Steven Levithan. 2012. Regular Expressions Cookbook, 2nd Edition. Sebastopol, CA: O'Reilly.

© 2021 Dr. Robert W. Williams