Exploring The Shakespeare and Company Project Datasets


Datasets

1. Hyekyung’s Dataset Documentation

2. SCData for members

3. SCData for books

4. SCData for events


Observation of The Shakespeare and Company Project datasets

Shakespeare and Company is an iconic English-language bookstore in Paris, well known for its distinctive character and for its appearances in popular culture, including films and TV series. The bookshop is often thought to have opened in the mid-twentieth century, but it was originally opened in 1919 and run by Sylvia Beach. Considering that the artists and authors of the Lost Generation gathered and moved around in interwar Paris, it is not difficult to presume that Shakespeare and Company served the modernists as a literary hub, and Beach’s meticulous records of her lending library operation substantiate this. Beach managed not only the bookshop but also the lending library, where modernist artists and authors as well as general readers borrowed or purchased books. She also operated a publishing house, which published James Joyce’s Ulysses. For these reasons, exploring the Shakespeare and Company Project datasets, which are based on Beach’s records, is an interesting and intellectually rewarding journey for the study of modernism, Digital Humanities, and reading culture during the interwar period. I observed all three datasets but mainly dealt with the dataset about the books that circulated within the Shakespeare and Company lending library community.

What do you see as the most important or significant aspect of this dataset for scholars, researchers, and/or teachers?
The Shakespeare and Company Project dataset files contain plentiful information about who the lending library members were, which books circulated, and the related events, particularly subscriptions and borrows. Although the three datasets are built independently, the data in each are connected. Moreover, through the URIs linked to the project website, they provide rich detail about how the lending library operated, about its members (Anglophone and Francophone modernist artists and authors as well as general readers), and about their activities, including how each individual is connected to or interacts with other members, books, and events. For instance, if the audience wants to know when James Joyce was a member of the lending library and which and how many books he borrowed in July 1925, they can find the answer by following the URI in James Joyce’s row in the member dataset. Joyce borrowed four books on July 11th, 1925, and the books concern Oscar Wilde’s life and criticism of his works. From this, the audience can begin to trace how Joyce’s reading practices impacted his writing practices, his literary development, and finally his works of literature.
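The same kind of lookup can also be run directly against an events file rather than through the website. The following is only a minimal sketch under stated assumptions: the column names (event_type, start_date, member_names, item_title) are my guesses at how an events export might be laid out, and the rows are invented placeholders, not records from the real dataset.

```python
import csv
import io

# Toy stand-in for an events file. Column names and every row are
# illustrative assumptions, not data from the Project.
SAMPLE_EVENTS = """event_type,start_date,member_names,item_title
Borrow,1925-07-11,James Joyce,Placeholder Title A
Borrow,1925-07-11,James Joyce,Placeholder Title B
Subscription,1925-07-01,Example Member,
Borrow,1925-08-02,James Joyce,Placeholder Title C
Borrow,1925-07-20,Example Member,Placeholder Title A
"""


def borrows_in_month(rows, member, year_month):
    """Return borrow events for `member` whose start_date begins with YYYY-MM."""
    return [
        row for row in rows
        if row["event_type"] == "Borrow"
        and member in row["member_names"]
        and row["start_date"].startswith(year_month)
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_EVENTS)))
joyce_july = borrows_in_month(rows, "James Joyce", "1925-07")
for event in joyce_july:
    print(event["start_date"], event["item_title"])
```

With a real export, the inline string would be replaced by an opened file, and the column names would need to be checked against the dataset documentation first.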

Another fascinating point about these datasets is that the audience can learn how and where certain books circulated. The book dataset shows basic information about when and by whom William Shakespeare’s Hamlet was borrowed or purchased. Furthermore, following the URI in the Hamlet row of that dataset leads to information about each individual borrower or purchaser and a map indicating the member’s address, from which the audience can track where Hamlet circulated. By finding out who shared the experience of reading the same book, the audience can identify particular reading communities.
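One way to look for such shared reading, sketched here under assumptions, is to group members by the titles they borrowed or purchased. The column names and the rows below are hypothetical placeholders chosen for illustration; only the title Hamlet comes from the discussion above.

```python
import csv
import io
from collections import defaultdict

# Toy events table. Column names and member names are illustrative assumptions.
SAMPLE_EVENTS = """event_type,member_names,item_title
Borrow,Member One,Hamlet
Purchase,Member Two,Hamlet
Borrow,Member Three,Placeholder Title
Borrow,Member One,Placeholder Title
Subscription,Member Four,
"""


def readers_by_title(rows):
    """Map each title to the set of members who borrowed or purchased it."""
    readers = defaultdict(set)
    for row in rows:
        # Subscriptions and other events carry no title, so they are skipped.
        if row["event_type"] in {"Borrow", "Purchase"} and row["item_title"]:
            readers[row["item_title"]].add(row["member_names"])
    return readers


rows = list(csv.DictReader(io.StringIO(SAMPLE_EVENTS)))
communities = readers_by_title(rows)
print(sorted(communities["Hamlet"]))
```

Members who appear together under several titles would be candidates for a reading community, although the grouping alone says nothing about whether they knew one another.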

In addition, the datasets show which texts, including periodicals and books, were popular during the interwar period, and how a given author’s work was received in the early twentieth century differently from now. According to the book dataset, setting aside the top three periodicals, Joyce’s A Portrait of the Artist as a Young Man ranks as the most popular text of that period by event count. It was borrowed fifty-six times and purchased five times from 1920 to 1941. The dataset alone cannot explain why Joyce’s novel was so popular; nevertheless, we can use it to track the literary taste of Joyce’s contemporaries. The book dataset also shows how many times T. S. Eliot’s The Waste Land was borrowed from the lending library in 1922, its first year of publication: twice. Compared with the event counts for Joyce’s A Portrait of the Artist as a Young Man or any of his other literary works, Eliot’s masterpiece does not seem to have been embraced by his contemporaries. Even though his volumes of collected poems, which include The Waste Land, circulated more than before, that does not necessarily mean that readers had come to accept this poem.
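A ranking of this kind can be reproduced with a short script. The sketch below assumes a book table with title, format, and event_count columns and a format value of "Periodical" for periodicals; these names, and all of the rows and counts, are hypothetical and serve only to show the sorting step.

```python
import csv
import io

# Toy book table. Column names, titles, and counts are invented for illustration.
SAMPLE_BOOKS = """title,format,event_count
Placeholder Periodical X,Periodical,120
Placeholder Novel A,Book,61
Placeholder Poem B,Book,2
Placeholder Periodical Y,Periodical,90
Placeholder Novel C,Book,30
"""


def top_titles(rows, n, exclude_format=None):
    """Return the n rows with the highest event_count, optionally skipping one format."""
    kept = [row for row in rows if row["format"] != exclude_format]
    # event_count is read as text from CSV, so convert it before sorting.
    return sorted(kept, key=lambda row: int(row["event_count"]), reverse=True)[:n]


rows = list(csv.DictReader(io.StringIO(SAMPLE_BOOKS)))
ranked = top_titles(rows, 2, exclude_format="Periodical")
for row in ranked:
    print(row["title"], row["event_count"])
```

Ranking by event count treats a borrow and a purchase as equal, which is one reason the result indicates circulation rather than esteem.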

For these reasons, the three datasets are primarily beneficial to scholars and researchers who specialize in modernist studies, because they offer a wealth of data from which researchers can ask and answer questions about the members of the Lost Generation and their literary works in interwar Paris. Moreover, these datasets include information about the ordinary reading culture of general readers, not only the high culture of the modernists. Therefore, they suit scholars and researchers who explore the culture of the early twentieth century as well as general audiences who are interested in Shakespeare and Company.


What do you see as the limits of this dataset? What is not included, and why, and what are the potential consequences of these omissions?
Although the datasets provide abundant information about the Shakespeare and Company lending library, some of the data are incomplete because of inconsistent or incorrect details in the archival sources. One of the most conspicuous limits is the books’ publication years. Although the project researchers recorded the year of first publication of each text in the year column of the book dataset, the column does not distinguish between a text’s actual first publication and its republication in the twentieth century. For example, when I filtered the dataset for early modern texts by year of publication, only two of Shakespeare’s six works, A Midsummer Night’s Dream and Love’s Labour’s Lost, appeared, as Pivot Table 1 shows (and even one of the titles is recorded incorrectly!); the other four are recorded as published in the early twentieth century. The filtered result is limited mainly because the data in the year column are incomplete. As a result, anyone who wants to know how early modern texts circulated among lending library members has to pick out every early modern author’s name from the dataset by hand, because filtering by year returns only incomplete information. A similar issue arises with modern texts. I filtered the data by the years 1919 to 1939 in order to see how many modernist literary works circulated, and found a work by Aeschylus, an ancient author, on the list. Without rigorous and thorough checking by each individual user, this limit might create obstacles for further research on which texts circulated in the lending library.
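The gap between filtering by year and filtering by author can be illustrated with a small sketch. It assumes a book table with title, author, and year columns; the titles and years below are invented placeholders arranged to mimic the pattern described above (two of six Shakespeare rows carrying an early modern year, an ancient author carrying an interwar year), not values copied from the dataset.

```python
import csv
import io

# Toy book table. Column names, titles, and years are illustrative assumptions.
SAMPLE_BOOKS = """title,author,year
Placeholder Play One,"Shakespeare, William",1600
Placeholder Play Two,"Shakespeare, William",1598
Placeholder Play Three,"Shakespeare, William",1904
Placeholder Play Four,"Shakespeare, William",1911
Placeholder Play Five,"Shakespeare, William",1908
Placeholder Play Six,"Shakespeare, William",1923
Placeholder Tragedy,Aeschylus,1922
Placeholder Novel,"Author, Example",1925
"""


def in_year_range(rows, start, end):
    """Return rows whose recorded year falls within start..end inclusive."""
    selected = []
    for row in rows:
        try:
            year = int(row["year"])
        except ValueError:
            continue  # skip rows with no usable year
        if start <= year <= end:
            selected.append(row)
    return selected


def by_author(rows, name):
    """Return rows whose author field contains `name`, ignoring case."""
    return [row for row in rows if name.lower() in row["author"].lower()]


rows = list(csv.DictReader(io.StringIO(SAMPLE_BOOKS)))
early_modern_by_year = in_year_range(rows, 1500, 1700)
shakespeare_by_author = by_author(rows, "Shakespeare")
interwar_by_year = in_year_range(rows, 1919, 1939)

print(len(early_modern_by_year), "of", len(shakespeare_by_author), "Shakespeare rows found by year")
print([row["author"] for row in interwar_by_year])
```

Filtering by author recovers every Shakespeare row, but only for authors the user already knows to search for, which is the extra manual work described above.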

Another limit is the omission of data about authors’ gender and texts’ genre in the book dataset. Most modernist artists and authors are famous even among the general public, and audiences can often infer an author’s gender from the first name, so identifying gender in the datasets might seem unnecessary. However, what about audiences unfamiliar with modernist artists and writers or with Anglophone and Francophone names? They may recognize big names such as Ernest Hemingway, James Joyce, and Virginia Woolf, but what about authors who are less well known or who are referred to only by their initials? Since Joshua Kotin and Rebecca Sutton Koeser say that this project is intended not only for academics but also for the public (2)[1], the author’s gender should have been provided for audiences who are rarely acquainted with modernist studies, who come from other regions, or who are interested in female artists and authors of the interwar period. Of course, individuals can follow the given URIs for further research, but not all URIs provide that information. Moreover, the absence of the author’s gender makes audiences do extra work to find the related details, which could stimulate some of them to dive into further research but could discourage others from taking a further interest in the project.

A similar problem is that there is no data about book genres. Compared to the HathiTrust dataset, which contains endless subgenre data, the Shakespeare and Company datasets provide no information about genre. One explanation may be that audiences are expected to be familiar with the genre of each text in the list. Then what about texts originally written in Asian languages? Confucius’s Ta Hio: the Great Learning, translated by Ezra Pound, and Xiaoxiao Sheng’s Chin P’ing Mei, translated by Bernard Miall, would have been, and may still be, unfamiliar texts.

Lastly, although the datasets include meticulous records of event counts in both the book and event datasets, there is no information about how many copies of particular books or periodicals were purchased from the lending library. With data about the number of copies, audiences might find some clues as to why Guy de Pourtales paid 150 francs to purchase Joyce’s Ulysses in 1922.


Is it possible to collect or include what has been overlooked? How do the creators of this dataset describe or understand its limitations or boundaries?
Kotin and Koeser mention the difficulty of obtaining exact and complete book data: only book titles are provided in the original sources. They say that “because of the challenge of identifying the specific editions that circulated in the lending library” (Kotin and Koeser 19) they decided to record the year of first publication. Nonetheless, their explanation does not make clear why the year column should be incomplete, in particular for early modern or later texts whose year of first publication can be researched easily. The omission of purchase data can be explained by the inconsistency of Beach’s recording habits. Kotin and Koeser reveal that Beach occasionally kept records of book purchases and consignments together on one membership card, which conflates two separate kinds of information and makes it impossible for the researchers to extract purchase data from the archival sources (6)[2].

In contrast to the issues caused by insufficient sources, the omission of the author’s gender and text genre calls for a different explanation. For experts in modernist studies, those data would be easy to identify, or too inconsequential to deal with; the researchers do not state anything about the author’s gender or text genre. The omissions I perceived in the datasets tell many stories. They reminded me of what Catherine D’Ignazio and Lauren F. Klein discuss in their chapter “What Gets Counted Counts.” The author’s gender and text genre could seem trivial or unnecessary to the researchers involved in the project because they are experts in modernist studies. Here, we can notice a parallel with the classification systems in datasets designed by “the matrix of domination”[3], that is, by dominant scholars who do not need any information about the author’s gender or text genre. On the other hand, the omissions tell a sad story about the limited budgets available to humanities studies. According to Kotin and Koeser, when the project team worked on transcribing Beach’s documents, the team “made the decision to partially transcribe and encode the logbooks for practical reasons: there wasn’t sufficient funding or time to support full transcription and encoding” (12). Although Kotin and Koeser do not mention why data such as the author’s gender or genre were excluded, we can consider a range of possible explanations, from the hidden authority or power that dominates the data to inadequate funding.


After Exploring Shakespeare and Company Project Datasets

Because of the huge amount of data, the Shakespeare and Company Project took six years to complete and publish the first version of the datasets in 2020. Even now, the datasets are not complete but still in progress. The archival records kept by Beach belong more to the field of business than to humanities studies. Nonetheless, through the Shakespeare and Company Project, data that might be regarded as insignificant in humanities studies have become available and accessible to scholars, researchers, and general audiences. The project’s datasets show how slow digitization yields abundant data for modernist studies and Digital Humanities through “gradually exploring and accumulating knowledge and understanding of a particular manuscript”[4], and how it brings unnoticed documents to light. However, as Andrew Prescott and Lorna Hughes point out, slow digitization comes with funding problems, and the Shakespeare and Company Project is not free from them. The project team resolved the problem by truncating some information.

The Shakespeare and Company Project is open to the general public: they can access all datasets and the project website. They can participate in correcting and developing the datasets as well. Kotin and Koeser say that

Between the first and third versions of the data sets, the Project team added information from the address books—addresses, first names, events. The team also improved and corrected hundreds of member and book records. Much of this work was aided by visitors to the Project website. The credits page acknowledges their assistance (22).

The datasets’ openness and the collaboration with the general public suggest a blurring of the boundary between academic and non-academic audiences. This collaboration also tempers, at least for a moment, the implications of “the matrix of domination” that could be perceived in the information omitted from the datasets.

However, I expected to see how female authors and their works contributed to the operation of the lending library, and the datasets do not offer that information without further research on my part. Nevertheless, the project is ongoing, and the data I was looking for may yet be added to the datasets.


Works Cited

D’Ignazio, Catherine and Lauren F. Klein. “Ch 4: What Gets Counted Counts,” Data Feminism, 2020.

Kotin, Joshua and Rebecca Sutton Koeser. “Shakespeare and Company Project Data Set,” Journal of Cultural Analytics, 2022, pp. 1-35.

Prescott, Andrew and Lorna Hughes. “Why Do We Digitize?: The Case for Slow Digitization,” Archive Journal, 2018.

  1. Kotin, Joshua and Rebecca Sutton Koeser. “Shakespeare and Company Project Data Set,” Journal of Cultural Analytics, 2022, pp. 1-35. 

  2. Kotin, Joshua and Rebecca Sutton Koeser. “Shakespeare and Company Project Data Set,” Journal of Cultural Analytics, 2022, pp. 1-35. 

  3. D’Ignazio, Catherine and Lauren F. Klein. “Ch 4: What Gets Counted Counts,” Data Feminism, 2020. 

  4. Prescott, Andrew and Lorna Hughes. “Why Do We Digitize?: The Case for Slow Digitization,” Archive Journal, 2018.