What Data for AI Training? The Example of Japan

The question of which data sources can be used to train AI systems is far from trivial. The answer can differ depending on the purpose and the jurisdiction. Social media platforms, for example, have long been accused of exploiting user data for ad targeting without a fair exchange of value.

A weakly sourced article, “Japan Goes All In: Copyright Doesn’t Apply To AI Training”, is circulating widely. It concerns the use of copyrighted works for training AI models in Japan.

In reality, the copyright reform in Japan dates back to 2018 and has since allowed text and data mining (TDM) for any use and from any source, including sources protected by copyright. It was drafted to enable the development of new technologies such as IoT, but it also makes explicit reference to AI. The reform was also discussed outside Japan at the time.

The use of copyrighted sources for training AI models has therefore been allowed since the reform, and it is not a new development. However, as in other jurisdictions, the reform leaves unresolved the question of copyright protection for output generated by AI.

This quick research was done with the help of my friend Jun Miyazaki, a technological innovation expert in Tokyo, who used Bing Chat, while I used ChatGPT with the AskYourPDF plugin.