Can synthetic data help address bias in models? (My PoV: yes.)
– Real data is hard and costly to acquire, hard to anonymize (anonymization can itself introduce distortions), and almost always biased (can we list any studies that prove, or even claim, to have unbiased data at global scale?).
– Synthetic data does not face the challenges above.
– One can control the parameters under which synthetic data is generated to ensure it is near-realistic (a new word). It can be scaled, and it can be reasoned about (very important) to correct any issues, including unintended bias (yes, that may still happen).
– IMHO, using real-world data to train these synthetic data generators may not be sufficient, as it only pushes the ability to reason about the data to a second level (especially when the questions being reasoned about concern bias).
– Caveat: designing a near-realistic synthetic data generator is a hard problem. Consider generating near-realistic electronic health record (EHR) data, including Internet of Medical Things (IoMT) data.
– The question is: is it hard enough to necessitate a generator trained on real-world data?
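
To make the "control the parameters and reason about them" point concrete, here is a minimal sketch of a rule-based generator. Everything in it is an illustrative assumption and not a real EHR model: the names `GeneratorConfig`, `generate`, and `positive_rate_gap`, the two groups "A" and "B", and the default rates are all hypothetical. The only point is that the group mix and the per-group outcome rates are explicit knobs, so any bias in the output can be traced back to a parameter someone chose.

```python
import random
from dataclasses import dataclass


@dataclass
class GeneratorConfig:
    # Hypothetical knobs, chosen purely for illustration.
    group_b_share: float = 0.5    # fraction of records assigned to group "B"
    positive_rate_a: float = 0.3  # P(outcome = 1 | group "A")
    positive_rate_b: float = 0.3  # P(outcome = 1 | group "B")


def generate(config, n, seed=0):
    """Generate n synthetic records from explicit, inspectable parameters."""
    rng = random.Random(seed)  # seeded so a run can be reproduced and audited
    records = []
    for _ in range(n):
        group = "B" if rng.random() < config.group_b_share else "A"
        rate = config.positive_rate_b if group == "B" else config.positive_rate_a
        outcome = 1 if rng.random() < rate else 0
        records.append({"group": group, "outcome": outcome})
    return records


def positive_rate_gap(records):
    """Absolute difference in observed positive rate between groups A and B."""
    rates = {}
    for g in ("A", "B"):
        outcomes = [r["outcome"] for r in records if r["group"] == g]
        rates[g] = sum(outcomes) / len(outcomes) if outcomes else 0.0
    return abs(rates["A"] - rates["B"])


# Same generator, two configurations: one balanced, one deliberately skewed.
balanced = generate(GeneratorConfig(), n=20_000)
skewed = generate(GeneratorConfig(positive_rate_b=0.1), n=20_000)

print("gap (balanced):", round(positive_rate_gap(balanced), 3))
print("gap (skewed):  ", round(positive_rate_gap(skewed), 3))
```

In this toy setup the skew is visible in the audit metric and is explained by exactly one parameter (`positive_rate_b`). A generator trained on real-world data would not offer that one-step explanation, which is the "second level" concern raised above.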
This post was later published on LinkedIn here.