In this episode we explore five ideas for working with small datasets as part of internal audits or performance audits.
- Extend the timeframe
- Combine with proprietary data
- Combine with open data
- Scenario analysis
- Natural language processing (free text)
Narrator: Welcome to The Assurance Show. This podcast is for internal auditors and performance auditors. We discuss risk- and data-focused ideas that are relevant to assurance professionals. Your hosts are Conor McGarrity and Yusuf Moolla.
Yusuf Moolla: Morning Conor.
Conor McGarrity: Hi Yusuf.
Yusuf: We're talking about the power of small data. We somehow consistently still hear the phrase "big data", even though it has no real definition and it's now quite dated, right? I don't really enjoy hearing the term big data anymore. It doesn't actually have any proper meaning. Small datasets can be very useful in audits, and we can work effectively with smaller data sets as well.
Conor: So, with all the conversation about big data over the past few years, the importance or relevance of little data has gotten washed away. And it's less valued than possibly it could be.
Yusuf: Yeah. So why this is important to auditors in particular is that, as you said, there has been a dilution of the level of value that could be obtained from small data. But as auditors, we often only have smaller data sets to work with. So as auditors, what can we do if we only have smaller datasets? Does that mean that the work that we can do is less valuable? Definitely not.
Does it mean that we need to somehow find bigger data in order to produce valuable results? And I think the answer to that is also no.
So as auditors, if we only have smaller data sets, what can we do with them to provide value, given that we don't have larger data sets to work with? And even if we did have larger data sets to work with, there's still significant value that we can obtain from using some of the smaller data, particularly where those datasets are already at the level of quality and integrity that makes them useful.
Conor: It's important to recognize that it's okay to start with those small data sets and that they are, like you've just described, very powerful.
Yusuf: If you have a bigger data set to work with, and you're comfortable with the quality of the dataset then you can potentially get a good result out of it. If you have a smaller dataset to work with and you're comfortable with the quality of the data, you can potentially get a good result with it. So, it's exactly the same thing. It doesn't really matter. However, bigger data sets do allow for easier analysis. And so, there's five things that we usually do when working with small data sets that we wanted to walk through today. And we'll talk about them each in turn. The different strategies will look different depending on the nature of the data that you have and how the different data sets can actually come together as opposed to one single dataset. We'll work through each of those five ideas in turn.
Conor: Okay, we're talking about five strategies to maximize the value of our smaller data sets. So, what's the primary consideration when we're thinking about how to maximize the use of our smaller data sets?
Yusuf: The first thing is, do you have data that you can work with over a longer timeframe? So if we have data for a week or a month or a year, because the data set is not very big, we are able to use a longer timeframe to get a different perspective. And it's actually easier when you're using a smaller data set to look at a longer timeframe.
So, for example, say you are usually going to be looking at one or two years' worth of data, and each of those years had 300 or 400 records. Let's say you had data that provided information on one transaction per day. Because you've got just one transaction per day, you aren't limited, by technology or by the time that it would take to do the analysis, to just that one year or two years. You can actually look over a longer period. We've successfully looked at smaller data sets over anywhere from 10 years upwards. The challenge that you get anywhere from five years to 10 years, and sometimes even less than that, is that you're not necessarily looking at processes that have been consistent over that period.
The data won't all look the same, and potentially what the data is representing wouldn't look the same. Even so, that's actually quite useful, because it allows you to look at what the changes might be through the data. So, you can use the data to see what those changes over time were in terms of data as output, but also what the changes over time were in terms of process.
In some cases, you might find that the process has gone backwards, and it is a little bit worse than it was earlier on.
But a change is not necessarily a negative thing. You may find that the process has actually become a bit shorter to be more efficient, or a bit longer to be more effective.
When you have a smaller dataset, the first thing to think about is, should I be looking at this over a longer time frame to get a different perspective?
Conor: I'm a little bit confused. Say our audit is looking at activity that occurred over six months; that's the time period for our audit. But we've only got a small quantity of data covering that six-month period. Is it useful or reasonable to go back further than that six-month period, even though that's prior to the period you're auditing?
Yusuf: Yes. And often you'll want to go back further because you want to be able to compare the period that you are auditing to a previous period.
If you are, for example, only looking at a six-month period, and that six-month period happens not to cover a year end, either calendar year end or financial year end, then you may not be getting the full picture. You often have seasonality in your process or in your data, where you want to go back a little bit further.
But also, if you're looking at, let's say your six-month period happens to be January to June, just to make it, you know, easy to understand, easy to follow. If your period were January to June, what did that look like over the last four years? So, what did January to June as a six-month period look like over the last four years?
And then also what did July to December look like over the last four years? So you can use that comparison to understand where you've come from and what this six months looks like and what it looks like relative to that, to be able to provide a view as to whether things are moving in a positive direction or not.
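The same-period comparison Yusuf describes can be sketched in a few lines. This is a minimal illustration only, assuming a hypothetical extract of dated transactions; the `records` values and the `half_year_totals` helper are invented for the example and are not from the episode.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (transaction date, amount). Invented for illustration.
records = [
    (date(2019, 2, 14), 120.0),
    (date(2019, 9, 3), 80.0),
    (date(2020, 3, 1), 150.0),
    (date(2020, 11, 20), 95.0),
    (date(2021, 5, 30), 210.0),
    (date(2021, 8, 8), 60.0),
]


def half_year_totals(rows):
    """Sum amounts by (year, half): half 1 is Jan-Jun, half 2 is Jul-Dec."""
    totals = defaultdict(float)
    for day, amount in rows:
        half = 1 if day.month <= 6 else 2
        totals[(day.year, half)] += amount
    return dict(totals)


totals = half_year_totals(records)

# Line up the same six-month window across years so the audited period
# can be read against its own history rather than in isolation.
for year in sorted({y for y, _ in totals}):
    jan_jun = totals.get((year, 1), 0.0)
    jul_dec = totals.get((year, 2), 0.0)
    print(year, "Jan-Jun:", jan_jun, "Jul-Dec:", jul_dec)
```

Grouping by half-year rather than by rolling window keeps seasonal periods aligned, which is the point of comparing January to June with earlier January to June periods.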
Conor: Okay. Fantastic. What's the second thing we should think about?
Yusuf: So, the first one was the easy part. Get a bit more data. The second one is really interesting, and this is probably where most of our time ends up getting spent.
And this is where we augment the data that we have by combining it with other proprietary data sets. So proprietary meaning data sets that aren't available in the open domain, that you have within the organization that you are auditing or within your own organization, if you're internal audit. Often if you're just looking at a particular domain, you may not get the full picture as to what is going on with a particular topic or subject or domain or audit area.
So you start combining your data with other proprietary datasets from adjacent subject matter. The easiest example within internal audit, one that we always talk about because it's been done so many times, is where we're doing a payroll audit and we bring in procurement data. So, augmenting the data that we have with other data that can provide us with a view of the original process, but also a view a bit more broadly across the organization.
That augmentation is really useful. Augmentation could also be where we combine master data with transactional data and then transactional data with audit logs. So how have the transactions changed, if at all? That augmentation opens up a whole range of possibilities.
If we were looking at, for example, something like sales data, and the sales data was reasonably small, we can combine sales data with marketing data to understand what happened before the sales actually occurred. So, what is leading up to those sales? And then if you want to go further down the track, you combine sales data with support data. So, complaints data and the like. And we'll talk about free text data in a minute.
Combining proprietary data sets does give you a much broader perspective than just the individual data for the individual subject matter that you're looking at.
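As a rough sketch of the payroll and procurement example, the snippet below joins two small extracts on a shared field. The choice of bank account as the join key is an assumption made for illustration (the episode does not say which field is used), and every field name and value here is hypothetical.

```python
# Hypothetical extracts; field names and values are invented for illustration.
payroll = [
    {"employee_id": "E01", "name": "A. Smith", "bank_account": "111-222"},
    {"employee_id": "E02", "name": "B. Jones", "bank_account": "333-444"},
]
procurement = [
    {"vendor_id": "V10", "vendor": "Acme Pty Ltd", "bank_account": "555-666"},
    {"vendor_id": "V11", "vendor": "Smith Consulting", "bank_account": "111-222"},
]


def join_on(left, right, key):
    """Inner join two lists of dicts on a shared key."""
    # Index the right-hand rows by key so each left row is matched in one lookup.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    matches = []
    for row in left:
        for other in index.get(row[key], []):
            matches.append({**row, **other})
    return matches


# Assumption: an employee and a vendor sharing a bank account is worth a look.
overlaps = join_on(payroll, procurement, "bank_account")
for row in overlaps:
    print(row["employee_id"], row["vendor_id"], row["bank_account"])
```

The same helper would apply to the other pairings mentioned, such as master data with transactions or sales with support data, given a suitable shared key.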
Conor: And of course, combining those data sets can give you further insights that might inform your future audits, for example, or even throw up some new topics or some new risks that need to be addressed, that have been uncovered.
Yusuf: If you find yourself in a situation where joining that data up gives you some insight into what you're doing, but then there are other insights that you just don't know what to do with, set them aside, potentially.
Conor: So that was augmenting your small data set with other proprietary data. We're up to number three. What's the third thing we need to think about?
Yusuf: So, the third thing is very similar to number two, but this is where we are augmenting by combining with open data sets. So, these are data sets that are available. In the performance audit world, this would be datasets that are available to the public. In the internal audit world, this might be datasets that are available either to the public or to particular industry bodies. So, there's a whole movement that's happening in terms of opening data up within an industry. That open data may comprise data from other organizations that may not be completely open to the public, but it will be open to organizations within the same industry.
In banking, for example, we have the open data initiative where data will be made available from various financial services institutions for each of them. So that will be a bit more open. It's not exactly the public domain.
For performance audit, we're talking things like statistical data, mapping data, etc. Mapping is quite an interesting one. When we are combining open data with the data that we have, if we have any data on geography, for example, we can start combining our data with mapping data to be able to provide all sorts of different perspectives.
That could show where, for example, we have no coverage or where we have too much coverage, or, in certain cases, where there may be large distances between two geographies that we cover. There are so many possibilities, all depending on the nature of the topic, but open data sets would be idea number three.
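One way to picture the coverage and distance idea is a great-circle distance check between the places being served and the places serving them. This is a sketch under stated assumptions: the site and community names, coordinates, and the 50 km threshold are all hypothetical, and the haversine formula is one common choice for distance rather than anything specified in the episode.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two latitude/longitude points, in km."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Hypothetical internal data (service sites) and open data (community centroids).
service_sites = {"Site A": (-27.47, 153.03), "Site B": (-28.02, 153.40)}
communities = {"Community X": (-27.56, 151.95), "Community Y": (-27.50, 153.00)}


def nearest_site(point, sites):
    """Return (site name, distance in km) for the site closest to the point."""
    distances = ((name, haversine_km(*point, *coords)) for name, coords in sites.items())
    return min(distances, key=lambda pair: pair[1])


THRESHOLD_KM = 50.0  # hypothetical cut-off for flagging a possible coverage gap

gaps = {}
for community, point in communities.items():
    site, distance = nearest_site(point, service_sites)
    if distance > THRESHOLD_KM:
        gaps[community] = (site, round(distance, 1))

print(gaps)
```

In practice the threshold and the definition of "coverage" would depend on the topic being audited, as Yusuf notes.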
Conor: And open data is still an area that's obviously ripe for massive opportunity, particularly in the performance audit world. You touched there on the use of geographic data. To my mind, that's particularly useful, to see how a cohort within a certain geographic boundary is being provided public services.
And I think that's really popular because everybody knows about the local area or the local community they live in. So, you can really use geographic data, or geodata, to provide some really powerful messages.
Yusuf: Yeah, that's right. And there'll be a separate episode about using open data for audits. So, we won't talk a lot more about that. Idea number four is a little bit different. So, the previous ones were get a bit more data, get data for a longer timeframe, combine it, etc.
This is where, when you have a smaller dataset, particularly if you're looking at things like performance indicators or risk indicators, you can explore your data by looking at various scenarios. Now that exploration would potentially result in your data exploding. So, let's say you have a thousand records. Probably easier, nice round number.
You have a thousand records and within your thousand records, you have five performance indicators. You assign different potential weightings to each of your performance indicators, so that you can see the results in different ways. You're then in a situation where your individual weightings, your what-if scenarios, would result in your data exploding significantly.
So, what happens in practice is that you say, I have five indicators. For each of the indicators, I'm going to assign the weightings -1, 0, 1 and 2. Four options each. So, you do that. And then each record will have four options for each of the five indicators. That gets you into a situation where the resulting dataset is actually a big dataset.
You have five to the power four, times a thousand, which is, what, five fives are 25, 25 25s are 625. So, you'd take your thousand records and you turn them into 625,000 records. It's significant, right? Significant. And I hope I got the math right. I may not have, but if I do have the math wrong, it's at the low end.
Conor: I think you're spot on.
Yusuf: And so, what happens then is that you have a much, much bigger data set to work with. It's quite useful when you're starting with a smaller data set, because you cannot do this with a very big data set. You will have results that are unwieldy. But it allows you to look at different scenarios and work out what different weightings you want to apply to the individual performance indicators before you start bringing them together.
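The weighting expansion can be sketched by enumerating every combination. Taking the setup as described (four weighting options for each of five indicators), the count of combinations is 4 to the power 5, which is 1,024, so a thousand records would expand to 1,024,000 rows. That is somewhat above the 625,000 quoted, in line with Yusuf's caveat that his figure may be at the low end. The records, indicator values and weighted-sum `score` function below are hypothetical, used only to show the mechanics.

```python
from itertools import product

WEIGHT_OPTIONS = (-1, 0, 1, 2)  # the four weightings mentioned in the episode
N_INDICATORS = 5

# Every way of assigning one of the four weightings to each of five indicators.
scenarios = list(product(WEIGHT_OPTIONS, repeat=N_INDICATORS))

# Hypothetical records, each with five indicator values (invented for illustration).
records = [
    {"id": 1, "indicators": (0.2, 0.5, 0.1, 0.9, 0.4)},
    {"id": 2, "indicators": (0.7, 0.3, 0.8, 0.2, 0.6)},
]


def score(indicators, weights):
    """Weighted sum of indicator values; an assumed scoring rule."""
    return sum(value * weight for value, weight in zip(indicators, weights))


# One output row per (record, scenario) pair: this is where the data "explodes".
expanded = [
    {"id": record["id"], "weights": weights, "score": score(record["indicators"], weights)}
    for record in records
    for weights in scenarios
]

print(len(scenarios), "scenarios,", len(expanded), "expanded rows")
```

Because the row count is the number of records multiplied by the number of scenarios, adding one more indicator or weighting option grows the output quickly, which is why this works best when the starting dataset is small.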
Conor: I've got a wry smile on my face here because, as you've encountered over the years, I've got a problem with personal discipline, whereby all these what-if scenarios create, in my mind, what's-possible scenarios, and then I get caught up, lost in the data, trying to play with it too much to answer all these problems at the same time with this ever-increasing data set.
Yusuf: We've seen situations where we've had to pull you back a little bit. And the easiest way to do that is to tell a machine to tell us what the optimal outcome might be.
And so we eventually will replace you with a machine, but until that happens, we've got idea number five. This is about taking semi-structured data and looking at what a small semi-structured dataset becomes when you start exploring the text within it. Semi-structured in the sense that the data contains fields that have free text data. An example is complaints data: customers call in to complain or to provide feedback to us, and we take a recording of that, or we manually capture the notes of what the customer has said.
When you start exploring that sort of data, you end up in a situation where your data becomes a lot larger. When you're using natural language processing on text data, when you're looking at phrases and keywords and what they look like across datasets, you end up with many thousands of columns. So, they're still shorter datasets, but a lot wider. They're not narrow anymore.
So text data is another one where, you know, a small data set could be your friend, because a large data set can be quite unwieldy. Using that smaller dataset and expanding the text data within it, to understand what the different topics are by looking at phrase analysis and keyword analysis, is another thing that you can do when you are faced with a small dataset, particularly when that small dataset has free text data within it.
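To show how a short free-text dataset becomes a wide one, here is a minimal keyword and phrase count using only the standard library. The complaint texts are invented, and the simple word and two-word-phrase counting is an assumed stand-in for whatever NLP tooling is used in practice.

```python
import re
from collections import Counter

# Hypothetical complaint notes, invented for illustration.
complaints = [
    "Card was charged twice and nobody called me back",
    "Waited on hold for an hour, then the call dropped",
    "Charged twice again this month",
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def terms(text):
    """Keywords (single words) plus phrases (adjacent word pairs)."""
    words = tokenize(text)
    phrases = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + phrases


counts = [Counter(terms(text)) for text in complaints]

# One column per distinct keyword or phrase seen anywhere in the data.
vocabulary = sorted(set().union(*counts))

# One row per complaint; each cell is how often that term appears in it.
matrix = [[count[term] for term in vocabulary] for count in counts]

print(len(matrix), "rows x", len(vocabulary), "columns")
```

Even three short notes produce far more columns than rows here; with real complaints data the column count can run into the thousands, which is the "shorter but wider" shape described above.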
Conor: To summarize the five things we talked about there, when considering the power of small data that you might have available to you:
The first one being the importance of being able to extend timeframes of the data you have available and looking back and building on the quantity of what you have by going back in time.
The second was how you can augment your data by combining what you have with other internal data sets.
The third thing is, as well as augmenting your small data set with other internal or proprietary data sets, looking outside the organization and seeing what other open data sets you might be able to combine.
Fourthly, exploration of data using scenario-based analysis, and how we can use what-if questions to actually understand what's possible.
Lastly, how we can use free text and how you can apply techniques such as NLP, if you're heading towards the more sophisticated end of the scale.
Yusuf: Don't hang your hat on big data. It's not always going to be, you know, a thing. If we stop hearing the words big data, we'll be really happy. But we're really glad, over the years, to have identified a range of small data sets to work with, and the various techniques that we can use to provide significant value on our audits even if we only have small data sets to work with.
Conor: And we probably need to talk offline about the fact that you mentioned you're thinking about replacing me with the machine. So we probably need to have a conversation about that.
Yusuf: We will do. I'll bring the bot to talk to us as well.
Conor: Lovely. Thanks, bye.
Narrator: If you enjoyed this podcast, please share with a friend and rate us in your podcast app. For immediate notification of new episodes, you can subscribe at assuranceshow.com - the link is in the show notes.