The value is in the data, making it more accessible so we can learn from it… this is not just important for contemporary data sets but for historical data sets that haven’t even been digitised yet – they’re stored somewhere on tape or in an archive and few people know they exist
Tell us about your background and journey to your current role and who inspires you?
I did an undergraduate in Languages, Interpreting and Translation at Heriot-Watt University, so I come from a Humanities background and am qualified in interpreting and translating between different languages.
I then became interested in machine translation. This was at the end of the 90s, which was a fascinating time. Many professional translators were sceptical towards it; I suppose machine translation quality was really bad at that point. It was a time when the internet and search engines were being used more and more.
In particular, it was the way that the European Parliament used machine translation that caught my attention and got me curious. I then became interested in Natural Language Processing and did the MSc in Speech and Language Processing at Edinburgh. My PhD thesis was on identifying Anglicisms in French and German automatically using computers – I completed my PhD at the School of Informatics at the University of Edinburgh and the rest is history! I’ve been working at the School of Informatics for over 10 years, where I became a Turing Fellow affiliated with the Alan Turing Institute. My research focuses on text mining, which is about extracting information from text collections and linking it in order to find patterns in the data, such as people’s names, place and organisation names or names of particular diseases and drugs. Last year I became Chancellor’s Fellow at the Edinburgh Futures Institute and the School of Literatures, Languages and Cultures and I’m now leading the Language Technology Group at the University of Edinburgh.
At work, the people who inspire me most are my colleagues, senior colleagues with a lot of experience whom I’ve worked with for years on different text mining projects and junior staff who are very enthusiastic and engaged in the problems we are trying to tackle and approach things in new and interesting ways. But also domain experts whose data I’m analysing, and who bring the use case for the projects I’m involved in. They have interesting problems and come to me with data and then I try to analyse it!
Can you tell us what you are most passionate about in your work?
I think it has to be using technology to assist in a task – text mining is my area of expertise and the focus of my research. I’m passionate about applying this technology to data sets. I get excited seeing people’s eyes light up when my research methods can help to pull out something interesting or something they had not anticipated. They will often say ‘if you can do that, can you also do this?!’ and that’s where it usually becomes interesting because there are not always tools available to immediately automate a task and we then have to adapt existing methods or develop new ones.
That is what inspires me and gets me excited, being able to create technology to enable things that we cannot easily do manually – it doesn’t always work one hundred percent accurately, of course. Technology isn’t perfect, but the aim is to get closer to the truth and to something better than before. I guess that’s what progress is all about.
What are the biggest challenges in your field?
Processing historical archives with modern technology is particularly challenging. For example, I work with medical historians to study historical records on the Third Plague Pandemic at the end of the 19th and start of the 20th century. One big challenge is the quality of the data. I work with historical text in old books and reports that have been digitised using scanning and OCR [Optical Character Recognition]. This process creates errors in the text, caused by the recognition technology and also by the quality of the original documents, their old fonts and layout. We have big questions to answer: how do we mine these texts more accurately? How do we make them more accessible to researchers who need them for their work, so we can understand the past better? Also modern text mining tools tend to be developed for contemporary text so we need to adapt our methods to deal with historical text containing variation in language and names over time. This is not a solved problem.
I also work on mining electronic health records – radiology reports for brain imaging. This involves records particularly on different kinds of strokes and tumours. To be honest, one of the big challenges here is getting access to large data sets from the NHS. It is very important that the privacy of individual patients is safeguarded and that ethical implications are considered. We need to apply for ethics approval and obtain permission to use the data for research; but even once permission is granted, the processes involved in collecting datasets from different NHS boards into a data safe haven environment (which is used for safe research on healthcare data), anonymising the data and applying our technology are very lengthy. Much more needs to be done to speed these processes up because the potential benefits of data analytics, and in particular raw text analytics, in healthcare are huge.
Also, bias in data is a big challenge in my field. Machine learning is driven by the data it is trained on. Data sets have gender, racial and other forms of bias within them. This is why we need to be super careful, critical and analytical about what machine learning output is telling us.
These are all very topical issues at the moment. Expanding on that, can you tell us more about the specific data sets and analysis techniques you work with and, in relation to bias, what challenges remain around gender and data?
In terms of text mining, you can use different types of methods, machine learning or rule-based algorithms. Rule-based methods require you to write rules that help, for example, to extract the information from a collection of texts. This is laborious but can work well if significant effort is put into designing the rule set. This approach is often used in industry even in products which you’d expect use much more sophisticated artificial intelligence.
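As a rough illustration of the rule-based approach described above, here is a minimal sketch in Python. The rule set, the labels `DISEASE` and `YEAR`, the function name `extract_entities` and the sample sentence are all invented for this example; they are assumptions, not rules or code from any real project mentioned in this interview.

```python
import re

# Hypothetical hand-written rules: each label maps to one regular expression.
# A real rule set would be far larger and tuned to the collection at hand.
RULES = {
    "DISEASE": re.compile(r"\b(?:bubonic )?plague\b|\bcholera\b", re.IGNORECASE),
    "YEAR": re.compile(r"\b1[89]\d{2}\b"),
}


def extract_entities(text):
    """Return (label, matched_text) pairs in order of appearance in the text."""
    found = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group(0)))
    found.sort()  # order by character offset
    return [(label, span) for _, label, span in found]


# Invented sample sentence, used only to exercise the rules.
sample = "Bubonic plague reached Hong Kong in 1894."
print(extract_entities(sample))
```

Every new pattern has to be written and checked by hand, which is why this kind of approach is laborious but can be precise when the rules are designed carefully.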
Other methods I work with include traditional machine learning, where an annotated corpus or data set is used to train a model. But annotation is expensive and time-consuming to create, especially if it’s very domain-specific or specialist, because not just anyone can annotate text with particular types of stroke, for example. You need experts to do this. At the moment deep learning is resulting in state-of-the-art performance for many tasks in my research field. Deep learning is a particular type of machine learning which tries to mimic the neurological processes of the brain to go from an input to an output sequence. It also requires a lot of training data to perform well.
There is lots to say about the challenges remaining in terms of gender and data. One interesting area is bias in how we read, analyse and classify historical archives. I have been involved in research on analysing historical newspapers and textual archives. They can contain a lot of bias; often they were written by men for men. For example, in the reports about the Third Plague Pandemic, the words ‘woman’, ‘women’ or ‘she’ are mentioned very little and, when they are, the women are often described as being old, married or pregnant. The words ‘man’, ‘men’ or ‘he’ on the other hand are mentioned almost five times more frequently and in the context of adjectives like medical, young and sick. This is not really surprising given their context and the time period these texts were written in but we need to be aware of such biases as it suggests that a large percentage of the population was ignored in the reporting back then.
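A frequency comparison of this kind can be sketched in a few lines of Python. This is only an illustrative sketch under simple assumptions: the word lists mirror the terms quoted above, while the function name `count_gendered_terms`, the naive tokenisation and the sample text are invented here and are not the method used on the plague reports.

```python
import re
from collections import Counter

# Term lists taken from the words mentioned in the interview.
FEMALE_TERMS = {"woman", "women", "she"}
MALE_TERMS = {"man", "men", "he"}


def count_gendered_terms(text):
    """Return (female_count, male_count) for the given text."""
    # Naive tokenisation: lowercase the text and keep alphabetic runs only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    female = sum(counts[term] for term in FEMALE_TERMS)
    male = sum(counts[term] for term in MALE_TERMS)
    return female, male


# Invented sample text, used only to show the call.
sample = "He examined the men. She was not examined. The man recovered."
female, male = count_gendered_terms(sample)
print(female, male)
```

On real OCR output the counts would be affected by recognition errors and spelling variation, so raw numbers like these are a starting point for closer reading, not a finding in themselves.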
In terms of Natural Language Processing, there’s also the data that is made available by big companies like Google. This data, which is sought after because it is so large, is created by scraping the contents of webpages and it contains a lot of bias. It is important that we make sure that data used for training machine learning models is as bias-free as possible so that we do not skew their results to a particular sub-part of the population. Otherwise we need to be at least aware of the issues when using biased datasets for machine learning. A lot more education and training is needed in this area.
Do you find your work supportive of women?
I think, to some extent, yes. There are quite a few women working in Natural Language Processing, and I think this is because this field attracts people coming from linguistics. There are some strong role models working in my area, which made a big difference to me. In computer science there is now more and more support for women and a lot of initiatives to support female students and research staff coming through the pipeline. There is lots still to be done though. You know, the old, white, male professor ‘culture’ still remains and we need to fight against that, plus the core computer science subjects are studied by a lot fewer women than men. We need more women role models, starting much younger. At schools the division between boys and girls really bothers me. I have two young sons and one is in primary school. I was really surprised at how early boys and girls stop playing together there. The gendered divisions at such a young age were a real eye-opener to me. We need to be teaching gender equality from when our kids are small. And this is not just about creating more support for girls but also about how we raise boys.
Yes, there are different points in the ‘pipeline’ that need to be targeted and addressed in order to solve gender inequality. So, what would you recommend to women and girls who would like to be you and go into your line of work?
Basically, don’t be scared or shy, but go for it. If you are good at maths, sciences or computing and enjoy those subjects at school, then figure out ways to study them at university or find jobs which involve them. Contact people, even people much more senior, to ask for their suggestions and get their feedback on your ideas and plans. I wish that girls and women had more confidence to follow their passion and dreams and were not forced into stereotypical roles and career paths if they themselves would prefer to do something else. It sounds cliché but I’d say, go for it and follow your dreams!
What has been the best opportunity you’ve had in your career thus far?
I was hired as a Chancellor’s Fellow at the University of Edinburgh and it has been the best opportunity in terms of what the future holds for me. I’ve been a researcher for over 10 years, which has been wonderful and has provided me with a lot of research experience, but this new post has given me new responsibilities and a new perspective on what my work can involve. I’m Chancellor’s Fellow at the Edinburgh Futures Institute, which is a new institute at the University of Edinburgh that spans many different disciplines. I work very closely with literary scholars at the School of Literatures, Languages and Cultures and medical historians at the School of Social and Political Science. I love this interdisciplinary nature of research. So I’m very proud of being hired in my current position and I’m super excited about the coming years in this post.
What do you look forward to?
I’m excited about working with new scholars and on new data sets in the future. As I’ve mentioned, I’m also collaborating with neurologists and mental health research scientists. I am really interested in these areas. Analysing healthcare records is a huge part of the future, as well as solving the access issues and making sure we really maximise the health benefits hidden in these data sets.
Do you have a vision for what data and technological innovation will look like?
I’m not sure about a vision but I can say that there is a lot of value in data, making data sets more accessible for research, not just in academia but also, where possible at least, to the public, because then you can discover and learn from it. That is not just important for contemporary data sets but also for historical ones that haven’t even been digitised yet – they’re somewhere on a tape or in an archive and few people know they exist. Making these more accessible is my goal.
Do you have a hero or heroine?
My data analytics heroine is Florence Nightingale (1820-1910); she’s one of the first female data scientists. I’m really inspired by what she achieved. She came from a well-off family but did not want to be subordinate; she wanted to work, use her knowledge and make a difference in the world. She really understood the value of data. She studied data on war mortality, the reasons for and locations of death and disease during war, and found the main reason to be sanitation. Her well-known circular histogram visualisations are super interesting. Using these visualisations, she managed to get officials to take note of reports about sanitary conditions that might not have even been read otherwise. Her research made a real difference as it resulted in a lot of changes in medical practice and nursing back then, and they really decreased death rates in hospitals.
Do you have a fun fact about yourself?
When I was 10 years old, I joined a local rowing team and trained regularly on the river Elbe – I grew up in East Germany and this was before the Berlin Wall came down. I was a good rower and won a regional gold medal. My trainers suggested I consider joining the national East German rowing team (laughs). My parents were really against this as a career prospect (laughs). Then the wall came down. I became a teenager back then and started to pursue a lot of other interests. My life would have been very different if the wall hadn’t fallen and I had become an East German rower. Can you imagine?
It is funny how our lives could have taken different paths, isn’t it? So what do you do when the working day is over?
I love yoga and photography. If there is time, I take portrait and family photos for friends and family. A few years ago I created a landscape photography calendar for the area in Edinburgh where I live to support local charities. I also love pottering around in my garden and going for walks with my family.