TL;DR
Italy’s Minerva-3B, a European sovereign language model trained entirely from scratch on 2.5 trillion tokens, scored only 4.9% on the INVALSI Italian school-exam benchmark. The result challenges assumptions about how data scale, model size, and investment in native-language AI translate into language understanding.
Minerva was developed by Sapienza University of Rome’s NLP group, led by Roberto Navigli, using 128 GPUs on the CINECA Leonardo supercomputer, and is part of Italy’s national AI strategy funded through PNRR. The model has 3 billion parameters, was trained from scratch on a corpus that is approximately 50% Italian, and outperforms some multilingual models on Italian benchmarks.
Despite the impressive technical effort and significant investment, Minerva’s performance on the INVALSI test—an academic assessment for Italian students—was near chance, at just 4.9%. Researchers concluded that dataset size and parameter count are more critical for complex language tasks than pre-training data composition alone, suggesting that scale remains a limiting factor.
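Whether a score counts as "near chance" depends on the guess baseline for the benchmark's item format, which the text here does not state. The sketch below shows one way to compute that baseline for multiple-choice items and compare it with the reported 4.9%. The item counts and option counts are hypothetical placeholders, not INVALSI's actual format.

```python
from fractions import Fraction


def chance_accuracy(option_counts):
    """Expected accuracy of uniform random guessing.

    option_counts holds the number of answer options for each item.
    """
    if not option_counts:
        raise ValueError("need at least one item")
    total = sum(Fraction(1, k) for k in option_counts)
    return float(total / len(option_counts))


def accuracy(predictions, answers):
    """Fraction of items where the prediction matches the gold answer."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have equal length")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


# Hypothetical benchmark shape: 10 items with 4 options each (an assumption).
baseline = chance_accuracy([4] * 10)
reported = 0.049  # the 4.9% figure quoted in the article

print(f"chance baseline: {baseline:.3f}")
print(f"reported score:  {reported:.3f}")
print(f"gap to baseline: {reported - baseline:+.3f}")
```

Under a different item format, for example open-ended questions where guessing earns essentially nothing, the same 4.9% would sit against a different baseline, so the comparison should be redone with the benchmark's real structure.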
Minerva. The opposite path.
Italy spent years building a European sovereign LLM from scratch. Then Minerva-3B scored 4.9% on the INVALSI Italian school exam.
Where AMÁLIA layered Portuguese specialization onto a multilingual foundation, Minerva trained from scratch on 2.5 trillion tokens with approximately 50% Italian content. Where AMÁLIA’s weights are not yet public, Minerva published weights, training data, and code as fully open from day one. By every institutional measure, the Italian approach worked. But the empirical results contain a finding that press coverage has largely passed over, and its implications extend well beyond Italy.
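As a rough sense of scale, the two figures quoted above (2.5 trillion tokens, approximately 50% Italian) imply the following token budget. This is simple arithmetic on the article's numbers, not a description of the project's actual data pipeline, and the exact share is an approximation.

```python
TOTAL_TOKENS = 2.5e12   # training tokens, as quoted in the article
ITALIAN_SHARE = 0.50    # "approximately 50%", treated as exact for illustration


def split_tokens(total, share):
    """Split a token budget into (in_language, remainder) counts."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    in_language = total * share
    return in_language, total - in_language


italian, other = split_tokens(TOTAL_TOKENS, ITALIAN_SHARE)
print(f"Italian tokens: {italian / 1e12:.2f}T")
print(f"Other tokens:   {other / 1e12:.2f}T")
```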
Same problem. Opposite path.
European sovereign-LLM development has two primary architectural approaches. Italy chose training from scratch on a corpus with a substantial native-language share. Portugal chose continued pre-training of an existing multilingual model. Comparing the two shows what each commitment actually requires operationally.
The comparison is not “Italy did it better than Portugal.” Both projects respond to the same structural problem with different architectural strategies under different institutional and economic constraints. Italy’s national-AI investment is structurally larger by an order of magnitude — and Minerva is the visible artifact of that scale.
4.9% on INVALSI. The bitter lesson surfaces.
In June 2024, researchers evaluated Minerva-3B on the Italian school-exam benchmark. The result was unambiguous. This is not a critique of Minerva — it is a critique of the public discourse around what Minerva’s empirical results actually demonstrate.
350M to 7B. Four parameter scales, one architecture.
The Minerva model family covers four parameter tiers, each with specific training corpora. Each scale level reveals what the from-scratch path actually requires at different operating points.
Training-corpus details recoverable from the original comparison table: Italian + English; 100B English; ~50% English; + 200B code.
Three answers. Same question.
Minerva, AMÁLIA, and OpenEuroLLM represent the three operational answers to the European sovereign-LLM question. Each makes different architectural and institutional bets. The strategic discourse benefits from treating all three as data points in the same empirical experiment.
Three standards the movement should adopt.
The structural critique generalizes beyond Minerva. The European sovereign-LLM movement benefits from internalizing these lessons across every subsequent national project. Italy modeled the openness standard; the movement should adopt it as norm.
Minerva is one valid answer to the European sovereign-LLM question. AMÁLIA is another. OpenEuroLLM is potentially a third. The strategic discourse benefits from treating all three as data points in the same empirical experiment rather than as competing national-prestige projects. More analysis like this is needed. Not less.
Implications for European Sovereign-Language AI Strategies
This development challenges the assumption that large-scale native-language models trained from scratch automatically achieve deep language understanding. It indicates that the level of investment and model scale necessary to produce country-specific knowledge may be higher than previously thought. The results suggest that European efforts in sovereign AI need to account for the potentially higher costs and larger models required to reach meaningful performance levels, especially in complex academic or professional contexts.
European Sovereign-Language Model Development Approaches
The Italian Minerva project contrasts with the Portuguese AMÁLIA model: AMÁLIA layered Portuguese onto a multilingual foundation, whereas Minerva trained from scratch on a corpus that is approximately 50% Italian. Italy’s approach involved significant institutional coordination, including funding from the Italian government, use of the CINECA supercomputer, and open release of weights and data. While Minerva outperforms some multilingual models on Italian benchmarks, its poor result on the INVALSI academic test highlights the ongoing debate about the effectiveness of scale versus specialization in sovereign-LLM development.
“Minerva’s performance on the INVALSI test underscores that data scale and parameter count are more crucial than dataset composition for handling complex language tasks.”
— Thorsten Meyer
Unresolved Questions About Scale and Effectiveness
It is still unclear what specific model size or data investment would be necessary to achieve high performance on complex language tasks like academic assessments. The results raise questions about whether larger models or more targeted data are needed, and how these findings translate to other languages and domains. The ongoing development of Minerva and similar models will clarify whether the observed performance is a fundamental limit or a temporary artifact of current scale.
Next Steps in European Sovereign-Language AI Development
The Minerva team is continuing to refine its models, with upcoming iterations aimed at improving performance on complex tasks. Researchers and policymakers will likely reassess investment strategies, considering larger models or more specialized data. Further benchmarking across different languages and tasks will help determine the optimal scale and approach for sovereign-LLM projects in Europe.
Key Questions
Why did Minerva perform poorly on the INVALSI test?
Despite extensive training on Italian data, Minerva’s limited parameters and dataset size may have been insufficient for deep understanding of complex academic content, highlighting the importance of scale.
Does this mean training from scratch is ineffective?
Not necessarily. It suggests that training from scratch requires significant scale and resources to reach comparable performance levels, and that continued pre-training of an existing multilingual model might be more efficient in some cases.
What are the implications for European AI policy?
The results imply that European sovereign-LLMs may need to invest in larger models and more substantial native-language datasets to achieve meaningful performance, affecting future funding and development strategies.
Will future models improve on this performance?
Likely, as the Minerva team continues refining their approach and explores larger models, but it remains to be seen how much scale is truly necessary for complex language understanding.
Source: ThorstenMeyerAI.com