The Devil Is In The Data


Yesterday I wrote about tackling AI security. The core of an AI system is its data, which is why I would like to say ‘The Devil Is In The Data’. Just as the details of how a program behaves matter in Software Testing, the details of the data matter in ensuring AI security. Let’s look at this in a little more detail.

One potential cause of a non-secure AI system is unmonitored data. Constantly observing the data to see whether it helps the model reach its objectives is key. For example, suppose you would like to differentiate between a dog and a cat using images. An image that positively identifies a dog, without any ambiguity, is very valuable for the data set. The images need to capture the various features of a dog well enough that the model can distinguish a dog from a cat without confusion. Even more challenging would be distinguishing a dog from a wolf!
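One way to keep only unambiguous examples is to record how strongly annotators agree on each label and drop the borderline cases. A minimal sketch, assuming a hypothetical record format with an invented `agreement` score per image (the field names and threshold are illustrative, not a standard API):

```python
# Hypothetical filter: keep only images whose labels are unambiguous.
# Each record carries a label and an annotator-agreement score; both
# field names are invented for this illustration.

def keep_unambiguous(records, min_agreement=0.9):
    """Return only records whose annotators strongly agree on the label."""
    return [r for r in records if r["agreement"] >= min_agreement]

images = [
    {"file": "dog_01.jpg", "label": "dog", "agreement": 0.98},
    {"file": "husky_07.jpg", "label": "dog", "agreement": 0.55},  # wolf-like, ambiguous
    {"file": "cat_03.jpg", "label": "cat", "agreement": 0.95},
]

clean = keep_unambiguous(images)
print([r["file"] for r in clean])  # the ambiguous husky/wolf image is dropped
```

The wolf-like husky is exactly the kind of example that either needs a confident label or should be left out of the training set.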

As AI systems are used to identify humans, facial and body features need to be captured accurately so that errors don’t happen. It would be terrible to target the wrong set of people, so to speak.

This is why validating the data upfront, before we build the model, is so important – be it numbers, text, images, or anything else for that matter. It is an important responsibility of a testing professional, and it should be done as part of the validation process.
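What might that upfront validation look like in practice? A minimal sketch, assuming a toy record schema (the field names, expected classes, and checks are all illustrative assumptions to be adapted to your own data):

```python
# A sketch of upfront data validation before model building.
# The schema and checks are illustrative, not a standard API.

def validate_record(record, required_fields=("image_id", "label")):
    """Return a list of problems found in one record (empty list = valid)."""
    problems = []
    for field in required_fields:
        if field not in record or record[field] in (None, ""):
            problems.append(f"missing field: {field}")
    if record.get("label") not in (None, "", "dog", "cat"):
        problems.append(f"unexpected label: {record.get('label')!r}")
    return problems

def validate_dataset(records):
    """Map record index -> problems, for every record that fails a check."""
    report = {}
    for i, rec in enumerate(records):
        problems = validate_record(rec)
        if problems:
            report[i] = problems
    return report

data = [
    {"image_id": "a1", "label": "dog"},
    {"image_id": "a2", "label": ""},     # missing label value
    {"image_id": "a3", "label": "fox"},  # outside the expected classes
]
print(validate_dataset(data))
```

Running the whole data set through a report like this, before any training starts, is exactly the kind of task a testing professional can own.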

If we are getting the data from the Internet, or from a source to which anyone can contribute (open source?!), there is a danger of someone uploading data that poisons the data set we are looking at; if we are not watching, that will adversely affect our model’s ability. The upload need not even be intentional – it could be a random incident. But as far as we are concerned, we need to be watchful that our objectives are met. To me, cleaning the loads of data obtained from an external source looks exhausting and stressful. Taking only the data that we want, and building the model on that, seems the more pragmatic approach.
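The "take only what we want" idea can be sketched as an allowlist at the point of intake: instead of cleaning everything after the fact, admit only records whose origin and label match what we expect. The contributor names and label set below are invented for illustration:

```python
# A sketch of allowlist-based intake for data from an open source.
# Contributor names and expected labels are illustrative assumptions.

TRUSTED_CONTRIBUTORS = {"lab_a", "lab_b"}
EXPECTED_LABELS = {"dog", "cat"}

def admit(record):
    """Accept a record only if its origin and label pass the allowlist."""
    return (record.get("contributor") in TRUSTED_CONTRIBUTORS
            and record.get("label") in EXPECTED_LABELS)

incoming = [
    {"contributor": "lab_a", "label": "dog"},
    {"contributor": "anonymous", "label": "dog"},   # unknown source, rejected
    {"contributor": "lab_b", "label": "dinosaur"},  # unexpected label, rejected
]
accepted = [r for r in incoming if admit(r)]
```

An allowlist will reject some good data along with the bad, but for a security-minded pipeline that trade-off is usually the pragmatic one.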

I have to mention the noise in the model here. Noise from features that are unimportant to the model’s objective could side-track the results. It is important to cancel it out, so that only the features that matter for the results are considered. Likewise, any attempt by an adversary to add noise to the data should be prevented; for this, constant monitoring and threat modeling of the data sources should be done. Because – ‘The Devil Is In The Data’.
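One simple, concrete form of that monitoring is screening numeric features for suspicious outliers before training. A minimal sketch using the modified z-score (median absolute deviation), which is robust to the very extreme points it is trying to catch; the threshold and sample values are illustrative:

```python
import statistics

def screen_outliers(values, threshold=3.5):
    """Split values into (kept, flagged) using the modified z-score,
    based on the median absolute deviation (robust to extreme points)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    kept, flagged = [], []
    for v in values:
        score = 0.6745 * abs(v - med) / mad if mad else 0.0
        if score > threshold:
            flagged.append(v)   # possible noise or an injected value
        else:
            kept.append(v)
    return kept, flagged

kept, flagged = screen_outliers([1.0, 1.2, 0.9, 1.1, 999.0])
# flagged == [999.0]
```

A flagged value is not automatically malicious – it is a prompt to go look at the source, which is where the threat modeling comes in.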

It is a very interesting topic for me, although I am not a very big fan of the background mathematics involved in all this! Having said that, a dog has to do what a dog has to do, and there’s no looking away from it!

For interesting discussions on AI security and data, please feel free to reach out to me.
