The implications of global privacy on data science
The introduction of new privacy legislation, such as GDPR in Europe and the CCPA in California, is (rightfully) giving individuals the ability to dictate what personal data companies hold about them and how it is used. While much of the focus has been on companies making sure their marketing complies with the new rules, other less obvious but equally important data-driven activities are also affected by the changes. One key area for retailers and their analytics partners is the practice and application of data science to develop customer insights that can inform assortment, pricing and customer strategies. Let’s look at three examples where privacy legislation is having an impact:
- Don’t blame the computer
It’s not enough to just “let the computer decide”. With the new privacy legislation requiring that algorithm-based decisions are fair, the data scientist or owner of the algorithm needs to take responsibility for ensuring it delivers fair results.
To keep a model fair and unbiased, there are some key considerations when applying data science:
- Just because a model faithfully follows the modelling data doesn’t necessarily mean that it’s fair. The modelling data itself could be biased, e.g. it may over-represent a certain gender or ethnicity, in which case the model is just reinforcing prejudice. (If that is the case, it may be necessary to take a step back to understand whether there is any unconscious bias in the business processes that are generating the data in the first place.)
- Even if a model doesn’t use data like “gender” as an input variable, that doesn’t mean it’s gender neutral. It may produce a “gender proxy” based on other inputs it receives, such as combinations of products bought.
- Complex models are more challenging to assess. From a privacy perspective, it is often better to use a modelling technique which is interpretable, i.e. where the rationale for any given prediction is easy for humans to understand. This gives greater confidence that a model is fair and hasn’t inadvertently introduced bias. An interpretable model also means that explanations can be provided to individuals regarding the reasons why certain decisions were made (e.g. why a loan application was declined), which is also an important requirement of GDPR.
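One simple way to start checking for the kind of hidden bias described above is to compare how often a model makes a positive decision for each group, even when the group label was never a model input. The sketch below is illustrative only: the decisions, the group labels and the function names `selection_rates` and `disparity_ratio` are hypothetical, and a real fairness review would look at more than one metric.

```python
from collections import defaultdict


def selection_rates(decisions, groups):
    """Return the share of positive decisions for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for decision, group in zip(decisions, groups):
        totals[group] += 1
        positives[group] += int(decision)
    return {group: positives[group] / totals[group] for group in totals}


def disparity_ratio(rates):
    """Lowest selection rate divided by the highest (1.0 means parity)."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest > 0 else 1.0


# Hypothetical model outputs: 1 = offer sent, 0 = no offer.
# The group label is held back for auditing only; it is not a model input.
decisions = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
groups = ["F", "F", "F", "F", "F", "M", "M", "M", "M", "M"]

rates = selection_rates(decisions, groups)
ratio = disparity_ratio(rates)
print(rates, ratio)
```

A ratio well below 1.0 does not prove the model is unfair, but it is a signal that some combination of inputs may be acting as a proxy and deserves a closer look.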
- The right to be forgotten
Some legislation has given individuals the right to erase and rectify their data, which extends to any copies of individual-level data that may have been used to build models.
This has the following implications for the data scientist:
- As individuals are removed from modelling data, the model may change slightly. It’s important to re-calibrate, and to check the sensitivity to the input data.
- There can be complications if returning to a model at a later date, e.g. to extend the model to handle more varied scenarios, as the removal of individual data may mean that the original results are no longer fully repeatable. This can’t necessarily be avoided but adjustments will need to be made in order to compare the new approach to what was used in the past.
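To make the sensitivity check concrete, the sketch below refits a very simple model after an erasure request and measures how far the fitted slope moves. Everything here is an assumption for illustration: the customer records, the identifiers, and the choice of a one-variable least-squares line stand in for whatever model and data a real project would use.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def refit_without(records, erased_ids):
    """Refit the model on the records that remain after erasure."""
    kept = [xy for cid, xy in records.items() if cid not in erased_ids]
    xs = [x for x, _ in kept]
    ys = [y for _, y in kept]
    return fit_line(xs, ys)


# Hypothetical individual-level modelling data: customer id -> (x, y).
records = {
    "c1": (1.0, 2.0),
    "c2": (2.0, 4.1),
    "c3": (3.0, 5.9),
    "c4": (4.0, 8.2),
    "c5": (5.0, 9.8),
}

baseline = refit_without(records, set())
after_erasure = refit_without(records, {"c4"})  # customer c4 asked to be erased
slope_shift = abs(after_erasure[0] - baseline[0])
print(baseline, after_erasure, slope_shift)
```

Recording the baseline parameters alongside the model, rather than relying on being able to refit the original data later, is one way to keep a comparison point once individuals have been removed.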
- Keeping individual data private
As it becomes more important to keep privacy in mind from the outset of any project, there are interesting data science approaches that aim to analyse data without ever knowing the data attributable to any individual. For example, differential privacy introduces a random element so that the algorithm can never know any particular data point with certainty, but at the aggregate level the error shrinks, allowing for accurate high-level statistics. This is especially valuable in cases where consumers are asserting ownership of their own data and want services that are both personalised and private.
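As a rough illustration of that idea, the sketch below uses randomized response, a classic technique that satisfies a local form of differential privacy. It is not a description of any particular production system: the data, the seed, the 50% truth probability and the function names are all assumptions chosen to keep the example small.

```python
import random


def randomized_response(truth, rng, p_truth=0.5):
    """Report the true answer with probability p_truth, otherwise a coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(reports, p_truth=0.5):
    """Undo the known noise to estimate the real share of 'yes' answers."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 30% of 10,000 customers have the attribute.
true_answers = [i < 3000 for i in range(10000)]
reports = [randomized_response(answer, rng) for answer in true_answers]

estimate = estimate_true_rate(reports)
print(round(estimate, 3))
```

No single report reveals what an individual really answered, because any "yes" could have come from the coin flip, yet the estimate across thousands of reports lands close to the true 30%.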
So, the changes to global privacy laws are clearly influencing the world of data science and where it intersects with retail, but in a positive way. Not only does this introduce a required update to best practice when it comes to developing customer insights from data, but it also forces the data scientist to consider how they utilise data in predictive modelling and to ensure that their approaches are fair.