Conclusion
Several broader conclusions emerge from this test case. The two models that drew from curated databases of experimental literature, NotebookLM and our custom-built tool, outperformed the LLMs trained on unfiltered web data. In particular, models relying on open web sources tended to mix established theories with highly speculative ones.
The evaluated LLMs (accessed in December 2024) also showed weaknesses in temporal and contextual understanding. For example, they often failed to recognize when a proposed hypothesis was later disproved. They also frequently omitted relevant papers when those papers did not explicitly include the exact language used in the initial query.
Our results broadly highlight the need for LLMs to better understand tables and images, as scientific papers rely heavily on these formats. While two of the models consistently referenced images, they often relied more on image captions than on visual analysis. Improving visual reasoning capability, including interpreting images, plots and scale bars, is a major direction for future improvement.

