For all the hype and attention the machine learning / artificial intelligence field gets, and in spite of some impressive results that have come out of it, such as ChatGPT, we still have very little understanding of exactly how these models work. We can train them, and we can use them, but we cannot answer any questions about what a particular AI model will or will not do, other than by testing it empirically. This is quite a big problem, especially for safety or alignment reasons, i.e. making sure that we don’t accidentally create an AI model that has both the capability and the inclination to rename itself Skynet and do Skynet-y things.
Thankfully we have some of the brightest minds of our generation working on this problem, and Scott Alexander has an excellent article summarizing recent findings on the topic (from Anthropic’s interpretability team): God Help Us, Let’s Try To Understand AI Monosemanticity. It starts from relatively basic concepts and does, I think, a great job of explaining both the problems and the results in a way that remains comprehensible even for someone (such as myself) with virtually no knowledge of the field. As a bonus, it even goes a bit into how the properties of artificial neural nets relate to those of biological neural nets (i.e. our brains), though there is very little credible research on that connection.
Go read it! If you need extra motivation, the subtitle of the post is “Inside every AI is a bigger AI, trying to get out”, and that is just too funny to ignore.