project umzi

umzi is an isiXhosa word meaning community.


The objective of this project is to determine whether a program can analyse the “completeness” of the references used in an academic dissertation.


When I started my master's degree I was made aware of how the published works of others needed to be woven into the text I was writing – rather than just being used as a source of quotes or a list of what I had read. Woven, to me, meant integrated, relevant, and referred to in different places in the text in different ways. We were also told not to rely on only one source but rather to use a range of sources, and that it was very important to reference the major work(s) relevant to our subject.

We also knew that our writing would be marked by an external examiner who would be looking at how the references had been used throughout the text, and that we could therefore pass or fail based on what they found.

So how you reference the community of existing writing is a very important part of writing a dissertation, not only technically (to prevent plagiarism) but also structurally (to provide internal cohesion).

this project

I like to give my personal projects names, and I called this project umzi (meaning community in isiXhosa) as I want to see how the community of others is referenced in a given dissertation.

the solution

phase one – identify the references

The first phase was to write a Python program that takes the text of a dissertation as input and outputs all the references from the reference section. This was relatively simple: parse the input text, find the section called references (normally located after the chapters and before the appendix), and then strip out the year, the author list, and the title of each article. From the author list you can get the first author, and from the title you can get its first four-letter word. The first author + year + first four-letter word gives the reference cite key.
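
To make the cite key idea concrete, here is a minimal sketch of how phase one could look. It is not the actual umzi program. It assumes the reference section starts at a line that reads just "References", stops at a line starting with "Appendix", has one APA-style entry per line ("Authors (Year). Title. Source"), and that "four-letter word" means a word of exactly four letters. The function names and the sample entry are invented for illustration.

```python
import re

# Assumed APA layout: "Authors (Year). Title. Source..."
REFERENCE_PATTERN = re.compile(
    r"^(?P<authors>.+?)\s*\((?P<year>\d{4})[a-z]?\)\.\s*(?P<title>[^.?!]+)"
)


def extract_reference_entries(text):
    """Return the non-empty lines between 'References' and 'Appendix'."""
    lines = text.splitlines()
    start = next(
        (i for i, line in enumerate(lines) if line.strip().lower() == "references"),
        None,
    )
    if start is None:
        return []
    entries = []
    for line in lines[start + 1:]:
        stripped = line.strip()
        if stripped.lower().startswith("appendix"):
            break  # assumed end of the reference section
        if stripped:
            entries.append(stripped)
    return entries


def make_cite_key(reference):
    """Build first author + year + first four-letter title word, lowercased."""
    match = REFERENCE_PATTERN.match(reference.strip())
    if match is None:
        return None  # entry does not follow the assumed layout
    first_author = match.group("authors").split(",")[0].strip()
    year = match.group("year")
    words = re.findall(r"[A-Za-z]+", match.group("title"))
    # Assumption: exactly four letters; empty if the title has no such word.
    four_letter = next((w for w in words if len(w) == 4), "")
    return f"{first_author}{year}{four_letter}".lower()


# Invented sample dissertation text, for illustration only.
sample = (
    "Chapter 1\n"
    "Some text.\n"
    "References\n"
    "Smith, J. A., & Jones, B. (2009). Healthy habits help older adults. "
    "Journal of Examples, 3(1), 12-19.\n"
    "\n"
    "Appendix A\n"
    "Extra material\n"
)
keys = [make_cite_key(entry) for entry in extract_reference_entries(sample)]
print(keys)
```

Real reference lists often wrap entries over several lines, so a working version would need to join those lines before matching.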

phase two – identify all the in-text references

The next phase was to identify all the references used in the actual text of the dissertation. When looking at sample text (using APA referencing) it was clear that there were at least two types of references (maybe three but I’ll get to that).

The first type took the form of an explanation of the relevant text with the reference at the end in brackets:

However, while it has been reported that older adults are aware of the benefits of physical activity, less than 30% adhere to the national prescribed guidelines (Marquez, Bustamante, Blissmer & Prohaska, 2009).

The second type made reference to the article and looked like this:

Guralnik, Ferrucci, Simonsick, Salive and Wallace (1995) claimed that good nutrition and a physically active lifestyle have known benefits for prolonging functional independence and reducing the risk of disability, institutionalisation and mortality among older adults.

And the third type was a variation of the first type where the writer had used a direct quote and so added the page number. This type of reference looks like this:

“Financial security, social networks and level of education are all factors in successful ageing, and reinforce the need for broad multifactorial modelling” (Marquez et al., 2009, p. 15).

It was clear that each of these scenarios needed to be treated differently.
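
As a rough illustration of treating the three scenarios differently, the sketch below uses one regular expression for bracketed references (with an optional page number) and another for references where the authors are part of the sentence. This is my own assumption about how the matching could be done, not the program's actual code. The labels "parenthetical", "narrative" and "quote" and the function name find_citations are hypothetical.

```python
import re

# Type one and type three: "(Authors, 2009)" or "(Authors, 2009, p. 15)".
PARENTHETICAL = re.compile(
    r"\((?P<authors>[^()\d]+?),\s*(?P<year>\d{4})"
    r"(?:,\s*pp?\.\s*(?P<page>\d+(?:[-–]\d+)?))?\)"
)

# Type two: "Author, Author and Author (1995)" or "Author et al. (1995)".
NAME = r"[A-Z][\w'-]+"
NARRATIVE = re.compile(
    rf"(?P<authors>{NAME}(?:(?:,\s*|,?\s+and\s+|,?\s*&\s*){NAME})*(?:\s+et al\.)?)"
    r"\s+\((?P<year>\d{4})\)"
)


def find_citations(text):
    """Return the in-text citations found in text, in order of appearance."""
    found = []
    for m in PARENTHETICAL.finditer(text):
        # A page number signals a direct quote (the third type).
        kind = "quote" if m.group("page") else "parenthetical"
        found.append({
            "kind": kind,
            "authors": m.group("authors").strip(),
            "year": m.group("year"),
            "page": m.group("page"),
            "pos": m.start(),
        })
    for m in NARRATIVE.finditer(text):
        found.append({
            "kind": "narrative",
            "authors": m.group("authors"),
            "year": m.group("year"),
            "page": None,
            "pos": m.start(),
        })
    return sorted(found, key=lambda c: c["pos"])


type_one = ("less than 30% adhere to the guidelines "
            "(Marquez, Bustamante, Blissmer & Prohaska, 2009).")
type_two = ("Guralnik, Ferrucci, Simonsick, Salive and Wallace (1995) "
            "claimed that good nutrition has known benefits.")
type_three = "a direct quote from the article (Marquez et al., 2009, p. 15)."

for sentence in (type_one, type_two, type_three):
    for citation in find_citations(sentence):
        print(citation["kind"], citation["authors"], citation["year"], citation["page"])
```

The narrative pattern is deliberately loose: a capitalised word followed by a comma just before an author name (for example "However, Smith (2009)") would be swallowed into the author list, so a real version would need extra filtering.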