Great point!
Clearly, we have a long way to go. I think ChatGPT and other similar systems that are coming need a sandbox in which to execute and verify that proposed results or code work before delivering them. I can only imagine the resources needed for that. And as you suggested, diagnostics need to evaluate the specific symptoms, the sum of them, and various combinations of known symptoms to synthesize a diagnosis. It also needs to be based on real data. IBM's Watson failed in healthcare because much of the data upon which it relied was hypothetical. (Well, there were other reasons, too.)
Putting that all together, it seems to me there are three areas that need to be addressed for an AI system to be reliable: real data, Gestalt evaluations, and real-time testing. The latter two are programming and resource problems to be solved. The former--real data--is much more difficult, as Watson proved. Sure, there is a lot of specific "data" available for training. However, much of that data was created and captured by humans who, as it turns out, are neither consistent, accurate, nor complete in recording data. That was the problem in healthcare, where data is abundant but the quality--despite ICD-10 and CPT coding standards--is lacking.
Nonetheless, I remain optimistic.