Historians now have at their fingertips large datasets that allow us to ask questions that previously wouldn’t have been possible or practical to pursue. We can text mine the 6 billion pages curated in the digital library of the HathiTrust. We can glimpse the lives of some 3.5 million individuals who lived in seventeenth- and eighteenth-century London. Such large datasets are exciting and allow us to ask interesting and novel questions. But what about the things that we miss when dealing with such large amounts of digitised historical data? What compromises are we making in the hope that trends will be discernible from the noise? Or, that errors will ‘average’ out? I want to explore some of these issues through my latest research using I-CeM (Integrated Census Microdata), a digitised version of the British censuses between 1851 and 1911 that contains records on over 180 million individuals.

My work only focuses on a small fraction of this dataset: the 1851 census and English farmers as a particular subgroup of respondents, so that’s who I’ll talk about here. I was excited to use I-CeM because theoretically I could examine all the extra information that the census recorded about farmers, including the size of their farms and the number of labourers they employed. These are important things to know because farm size and how many workers farmers employed (if any besides their own family) are key measures to track historical structural changes in agriculture.

The 1851 census is significant because it’s the earliest date in Britain that we have data on farm sizes and farm employment on a national scale. In the census, farmers were to be returned in the following format:

"Farmer of [317] acres, employing [12] labourers;" the actual number of acres, and of in and out-door labourers, on March 31st, being in all cases inserted.

Census of Great Britain, 1851: Population Tables. I. Numbers of the Inhabitants. Report and Summary Tables.

The census officials gave a few further examples of how different types of entries might be laid out:

Farmer of 110 acres (employing 4 labourers and 1 boy).

Farmer of 41 acres (employing 1 in and 1 out-door labourer, with a boy).

Freeholder, Farmer of 10 acres (employing no labourer).

Census of Great Britain, 1851: Population Tables. I. Numbers of the Inhabitants. Report and Summary Tables.
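To make the prescribed format concrete, here is a minimal Python sketch of how a transcribed occupation string might be split into acreage and labourer counts. The function name and the two patterns are illustrative assumptions, not anything taken from I-CeM or BBCE, and they only cover the simplest layouts quoted above.

```python
import re

# Assumed patterns for the officially prescribed wording only;
# real entries in the enumerators' books vary far more than this.
ACRES_RE = re.compile(r"(\d+)\s*acres", re.IGNORECASE)
LABOUR_RE = re.compile(r"employing\s+(no|\d+)\s+(?:labourers?|men)", re.IGNORECASE)


def parse_farmer_entry(occupation):
    """Return (acres, labourers); each is None when the string does not state it clearly."""
    acres_match = ACRES_RE.search(occupation)
    labour_match = LABOUR_RE.search(occupation)

    acres = int(acres_match.group(1)) if acres_match else None

    if labour_match is None:
        labourers = None  # not stated, or stated in a form the pattern cannot read
    elif labour_match.group(1).lower() == "no":
        labourers = 0     # 'employing no labourer'
    else:
        labourers = int(labour_match.group(1))

    return acres, labourers


examples = [
    "Farmer of 110 acres (employing 4 labourers and 1 boy)",
    "Farmer of 41 acres (employing 1 in and 1 out-door labourer, with a boy)",
    "Freeholder, Farmer of 10 acres (employing no labourer)",
    "Farmer",
]
for entry in examples:
    print(entry, "->", parse_farmer_entry(entry))
```

The second official example already defeats this simple pattern: the acreage is recovered but the labourer count comes back as None, because the in-door and out-door labourers are listed separately and the boy is not counted at all.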

Even among these examples there is quite a lot of ambiguity, which is magnified in the original census documents by complex real-world arrangements and the idiosyncrasies of householders and census enumerators. I’ll talk in a subsequent blog post about how these affect our understanding of nineteenth-century farming. For now, I want to focus on the types of things that we miss when only looking at the census through the lens of a .CSV file.

Enriching our data on farmers

I’ve been thinking a lot more about the nature of digitised datasets and their relationship to the original historical records while I’ve been enriching the I-CeM data on farmers. I-CeM is largely based on transcriptions produced for genealogical purposes, and so it was understandably sufficient to transcribe just a person’s occupation, leaving the extra details out. So what read ‘Farmer of 220 acres (employing 11 labourers)’ in the original census ends up being just ‘Farmer’ in I-CeM. Fortunately, many transcribers faithfully recorded the original (longer) entries, but large gaps remained. Many of the omissions have been infilled by a project at Cambridge University, which has resulted in BBCE (British Business Census of Entrepreneurs), a new dataset of employers (not just farmers) in Britain, 1851-1911.

But even with the additions made by BBCE, we’re still missing farm size and employment details for 30,000+ farmers from 1851. It’s for this group of farmers that I’m returning to the original census records to find the details on acreage and labour employment they supplied. Over the last few months I’ve checked original census records, otherwise known as CEBs (Census Enumerator Books), for about 15,000 farmers and updated 80% or 12,000 of them with new or revised information. You can see where and how many farmers’ details I’ve updated in the map below:

Two things have struck me while doing this. The first is trivial. There are endless ways to mis-transcribe the word ‘farmer’. Tanners, farriers, and joiners routinely masquerade as farmers in I-CeM. Less common (but appearing more often than you might like to think) are individuals who were formerly employed in occupations entirely unrelated to farming. So you have former laundresses, former nurses, and former chemists all appearing as farmers. Other highlights include a ‘suchman’ (seedsman), ‘ulied farmer’ (retired farmer), and a ‘srod farmder’ (iron founder).

The second issue is much more important and harder to correct and concerns how the census and subsequent digitised versions have handled the entries for wives of farmers. Here I mean any and all female spouses of male farmers and not just a woman labelled as a ‘farmer’s wife’, which was a term specific to the census and was intended to indicate that a wife worked on the farm.

The quality of data we have on farmers’ wives has suffered because the delimited text files that contain the digitised censuses don’t capture nuances in the arrangement of the records on the original census pages. This makes it hard to unpick, from the digitised version alone, certain types of errors or ambiguities that only make sense if you can see the layout of the original census enumerator book.

The Ditto

The first problem is the different interpretation of ditto marks found throughout the original census documents. Census officials used either the abbreviation (Do.) or the now widely used double inverted commas (“) to indicate an entry was the same as the one above. Some officials used either (Do.) or (“), while others used inverted commas to denote blank spaces.

Why are dittos so important to farmers’ wives?

The census (and society more broadly) conceived households as being governed by a ‘household head’. In a married couple, this was the husband. Only after the husband had died did the widowed wife become the household head from the perspective of the census. Within a typical farming household, comprising a married couple, children and perhaps one or two servants, all members of the household were identified by their relationship to the household head. This is where the use of ditto is significant.

While married men would be recorded as a ‘Farmer’ with a given number of acres and employees, their wives’ entries usually comprised at least one ditto mark. If they worked on the farm, the census stipulated that a spouse ought to be designated ‘farmer’s wife’ as we can see for Jane Wilkinson from Lancashire in the image below. Happily, we find that I-CeM accurately records Jane Wilkinson as a ‘Farmer’s wife’.

Isaac & Jane Wilkinson from the Township of Mearley, Lancashire (1851); RecIDs in I-CeM 13375046 & 13375047. John & Mary Hanson from the Township of Mearley, Lancashire (1851); RecIDs in I-CeM 13375050 & 13375051.

But if we turn to another farming couple on the same page of the census, something suspicious pops up. Mary Hanson is simply listed as a ‘Farmer’ in I-CeM but looking at the census page we can see that the enumerator has used two ditto marks under her husband John’s entry of ‘Farmer 44 acres’.

Should this be read as simply dittoing 'farmer', or should we count Mary Hanson as a 'farmer's wife' as Jane Wilkinson above? Or is the repetition of the entire entry 'Farmer 44 acres' suggestive of a subtle distinction between how this couple may have managed or perceived the management of their farm compared to Isaac & Jane Wilkinson? Is this akin to a farming partnership that you might find between other, more distant relatives in the census but which is indicated more clearly by the phrase 'joint farmer' or 'partner'? Marriage was a partnership of sorts, and the formal label 'farmer's wife' may have encapsulated the same kind of partnership that I'm suggesting could be inferred when a husband's entire entry was dittoed. Maybe I'm just overthinking things. But the important point is that we have no way of distinguishing these possibly different categories in the digitised versions of the census. Without returning to the original CEBs, we can't begin to categorise accurately the 1,000 wives of farmers who are just listed as 'farmers' in I-CeM.

We might also be including women who weren't clearly identified as working on the farm in the census but appear to be in I-CeM. For example, Sarah Sparrow, who lived in Norfolk in 1851, is listed as 'farming' in I-CeM. But, if we look at her entry in the original census, we can see double inverted commas below her husband's entry. Elsewhere on the page, double inverted commas don't appear to mean 'ditto'. They seem to indicate blank or unused spaces. Where information is clearly repeated from the line above, such as where individuals were born, we find 'Do', indicating 'ditto', instead.

Jeremiah & Sarah Sparrow from the parish of Tibenham, Norfolk (1851); RecIDs in I-CeM 6062617 & 6062618.
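The page-level reading described above can be written down as a rough rule of thumb. The Python sketch below assumes a hypothetical transcription that preserves the enumerator's marks (which I-CeM does not), and it assumes that if 'Do' appears anywhere on a page, double inverted commas on that page stand for blanks. The function names and the sample cell values are invented for illustration.

```python
DITTO_WORDS = {"do", "do."}
QUOTE_MARKS = {'"', "\u201c", "\u201d"}


def page_uses_do(rows):
    """True if any cell on the page is the abbreviation 'Do' or 'Do.'."""
    return any(cell.strip().lower() in DITTO_WORDS for row in rows for cell in row)


def resolve_dittos(rows):
    """Expand ditto marks column by column, working down one page.

    Assumption: on a page where the enumerator wrote 'Do', inverted commas
    are treated as blanks; on any other page they are treated as dittos.
    """
    quotes_are_blank = page_uses_do(rows)
    resolved = []
    for row in rows:
        new_row = []
        for col, cell in enumerate(row):
            text = cell.strip()
            above = resolved[-1][col] if resolved else ""
            if text.lower() in DITTO_WORDS:
                new_row.append(above)
            elif text in QUOTE_MARKS:
                new_row.append("" if quotes_are_blank else above)
            else:
                new_row.append(text)
        resolved.append(new_row)
    return resolved


# Invented pages: columns are (occupation, birthplace).
page_with_do = [["Farmer", "Norfolk"], ['"', "Do"]]
page_without_do = [["Farmer 44 acres", "Lancashire"], ['"', '"']]

print(resolve_dittos(page_with_do))
print(resolve_dittos(page_without_do))
```

A mechanical rule like this only decides whether a mark copies the line above. It cannot settle whether a dittoed 'Farmer' under a husband's entry was meant as 'farmer', 'farmer's wife', or something closer to a partnership.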

Data 'overspill'

The second problem, which I'm calling 'overspill', is where male farmers' occupation entries have spilled over into the space reserved for their spouses. We can see in the image above how the enumerator for Jeremiah and Sarah Sparrow avoided this by leaving a line between the married couple's entries. But most didn't. This is why you'll find thousands of wives with occupations like '2 labourers' or 'employing 10 men' in I-CeM.

For example, Elizabeth White from Woodland, Devon is recorded as 'Farming 100 acres' in I-CeM but we can see from the CEB below that these details actually belonged to her husband's entry, which in full reads 'Perpetual Curate of Woodland Farming 100 acres'. Elizabeth's entry, which has been added in a later hand and is squeezed between her husband's and one of their servants' entries, reads 'Farmers wife'. This also means that her husband, John, is missing these 100 acres from his own entry in I-CeM.

John & Elizabeth White from the parish of Woodland, Devon (1851); RecIDs in I-CeM 6814248 & 6814249.

In this example, the enumerator's large handwriting is probably to blame, as no doubt it was in many other instances. But was this always the case? Were there instances in which the entry of 'Farmer with [x] acres employing [x] labourers' was deliberately written across both husband and wife's occupations to indicate that this farm was managed and worked by both husband and wife? Should we be including these alongside those spouses who were more straightforwardly identified as a 'farmer's wife'? This is of course open to interpretation but digitised transcripts prevent us from examining these issues systematically across the dataset. We could of course identify some of these in I-CeM by looking for spouses like Elizabeth White whose occupations contain partial descriptions of a farm's acreage or workforce. But for all other farmers who have otherwise complete occupation strings in I-CeM, we have no way of knowing whether these were written in their own box, or across both husband and wife's boxes in the original census. What we're usually left with in I-CeM is a full entry for the husband and a blank one for the wife.
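The search for spouses like Elizabeth White could be sketched roughly as follows in Python. The field names (rec_id, relationship, occupation), the keyword list, and the sample records are all assumptions made for illustration. They are not I-CeM's actual variable names or data.

```python
import re

# Words suggesting a fragment of a farm description (acreage or workforce).
DETAIL_RE = re.compile(r"\b(?:acres?|labourers?|men|boys?|employing)\b", re.IGNORECASE)


def flag_possible_overspill(records):
    """Return IDs of wives whose occupation string mentions acreage or workforce."""
    flagged = []
    for record in records:
        is_wife = record["relationship"].strip().lower() == "wife"
        if is_wife and DETAIL_RE.search(record["occupation"]):
            flagged.append(record["rec_id"])
    return flagged


# Invented sample records, loosely modelled on the cases discussed above.
sample = [
    {"rec_id": "r1", "relationship": "Head", "occupation": "Farmer of 220 acres employing 11 labourers"},
    {"rec_id": "r2", "relationship": "Wife", "occupation": "Farming 100 acres"},
    {"rec_id": "r3", "relationship": "Wife", "occupation": "employing 10 men"},
    {"rec_id": "r4", "relationship": "Wife", "occupation": "Farmer's wife"},
    {"rec_id": "r5", "relationship": "Wife", "occupation": ""},
]

print(flag_possible_overspill(sample))  # ['r2', 'r3']
```

The last sample record shows the limit of this approach: a wife with a blank occupation is never flagged, even if the husband's entry was written across both boxes on the original page.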

Some concluding thoughts

What can the subtle differences in the layout of nineteenth-century British census records tell us about big data more broadly and its place in historical research? I'm certainly not the first person to point out that we need to think critically about the provenance of digital datasets. It's also not simply the case that original historical records are uncomplicated 'authentic' versions that we can return to in order to fix the 'artificial' digital versions. Generations of historians have shown the importance of being critical of primary sources: thinking about their provenance, context, and purpose.

What might catch us out with the census though is that it looks like it was born to be digitised. It's already tabulated. Surely fewer decisions need to be made about how and what to preserve than for prose sources? What could go wrong? I hope I've shown that there is still quite a lot of room for interpreting the layout and contents of a tabulated source like the census. More importantly though, when we use I-CeM we are hundreds of steps removed from the countless individual decisions that were made when the original documents were first transcribed. In the case of I-CeM, the transcripts were made long before it was even conceived of as a project. So this isn't I-CeM's fault per se and I should add that it's a brilliant resource that has opened up areas of research which were impossible to tackle before.