Electromigration is one of the biggest risks to avoid when overclocking a processor or graphics card. This physical phenomenon is largely responsible for a badly done overclock ending up breaking the processor, or making it need more voltage to reach the same frequency. It is what we usually call a degraded CPU.
Without a doubt, the best weapon against electromigration is to understand it: to know what causes it, what it causes in turn, and especially how to overclock in a way that avoids it. In this article, we will analyze it in depth. Let's get started!
Electromigration at the physical level
We are going to try to give a basic explanation of what happens in this process at an electronic level. We are not physicists, and we do not pretend to give an excessively advanced or perfect account. We just want you to get an idea of what electromigration entails.
Electromigration does not affect the transistors directly; it affects the metal parts of the CPU. We are talking about the chip's microscopic internal copper connections, which we could colloquially describe as the "wires" that connect the transistors. From time to time, one of the electrons passing through hits an atom of the metal and can displace it very slightly.
When this happens too many times, the connection weakens noticeably and its cross-section shrinks, which increases the current density passing through it. In the worst case, the chip stops working because too many of these connections are damaged and electricity can no longer be conducted properly inside it.
Let's summarize the phenomenon by thinking of a normal-size cable with two connections. If the copper inside weakens and its cross-section shrinks, the same current keeps flowing, but through a conductor that supports a lower maximum current, so once the damage starts it keeps getting worse. This is the big problem: the damage is irreversible and it feeds on itself.
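The cable analogy can be put into numbers with a minimal sketch. It assumes, purely for illustration, that each step erodes a slice of cross-section proportional to the current density at that moment; the function name, the wear rate and the units are all hypothetical and do not model a real chip.

```python
def simulate_thinning(current_a, area_um2, wear_rate, steps):
    """Track current density in a wire whose cross-section erodes over time.

    Assumption (illustrative only): each step removes an amount of area
    proportional to the present current density.
    """
    densities = []
    for _ in range(steps):
        density = current_a / area_um2   # J = I / A, with the current held constant
        densities.append(density)
        area_um2 -= wear_rate * density  # denser current erodes more metal
        if area_um2 <= 0:                # the connection has broken completely
            break
    return densities


history = simulate_thinning(current_a=0.001, area_um2=0.01, wear_rate=1e-4, steps=5)
for step, density in enumerate(history):
    print(f"step {step}: current density = {density:.6f}")
```

Every value in the printed list is higher than the previous one: the same current through less metal means more density, which in this toy model means faster wear.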
What are the causes of electromigration?
Temperature is one of the main causes of electromigration, and this also has a physical explanation. Silicon is a material with a negative temperature coefficient α, and to understand what that implies you need the formula for how electrical resistance varies with temperature: $R = R_0 (1 + \alpha \, \Delta T)$. When the temperature coefficient is negative, an increase in temperature reduces the resistance of the silicon, and therefore increases the flow of electrons, that is, the current. This makes electromigration worse.
In fact, the best proof of this idea is the use of liquid nitrogen for extreme overclocking: by taking the processor to temperatures below zero we get a negative ΔT that increases the resistance of the CPU's silicon, so much less current flows at the same voltage, and the voltage can be raised well above the safe limits without problems. Clearly, this is not something we can achieve in practice on a normal system.
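As a rough sketch of the resistance formula above, the snippet below evaluates it for a warm chip and for one cooled far below the reference temperature. The coefficient and the reference resistance are made-up illustrative values, not measurements of any real CPU.

```python
def resistance(r0, alpha, delta_t):
    """Linear model quoted in the article: R = R0 * (1 + alpha * delta_t)."""
    return r0 * (1 + alpha * delta_t)


R0 = 1.0        # reference resistance in arbitrary units (assumed)
ALPHA = -0.002  # negative coefficient per degree (illustrative, not measured)
VOLTAGE = 1.3   # fixed voltage used to compare currents

hot = resistance(R0, ALPHA, delta_t=50)     # 50 degrees above the reference
cold = resistance(R0, ALPHA, delta_t=-200)  # liquid-nitrogen territory

# At a fixed voltage I = V / R, so a lower resistance means more current.
print(f"hot:  R = {hot:.2f}, I = {VOLTAGE / hot:.2f}")
print(f"cold: R = {cold:.2f}, I = {VOLTAGE / cold:.2f}")
```

With a negative α, the hot case ends up with less resistance and more current than the cold case at the same voltage, which is the argument the text makes for sub-zero cooling.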
This brings us to the second major cause of electromigration: the voltage applied to the CPU. As it increases, so does the current, and therefore the flow of electrons within the processor, and the temperature rises as well, so it is undoubtedly the perfect cocktail for degradation. Later we will explain in more detail how far this can go, since there are clearly relatively safe voltage and temperature ranges.
And its consequences? How to detect processor degradation
We have seen an intuitive explanation of what happens in electromigration at an electrical level, as well as its causes, and it is quite easy to deduce the consequences for our processor and our daily experience.
The most obvious consequence would be for the chip to simply stop working. But killing a chip takes a very long time: enough connections inside the CPU have to break for it to stop working, which does not happen easily even in relatively severe cases.
In practice, you will begin to notice electromigration developing when the PC shuts down or the overclock crashes during a very demanding task, such as a render or a stress test. That is the first big indication that your CPU is degrading, as it implies that it now needs more voltage to force the electrons through all those broken or weakened connections.
Many people who suffer from this problem think it is only because the voltage they started with was too low, so they turn it up and are happy because the CPU works perfectly again. If you have been paying attention, you will know that all they have done is make it worse.
The other consequence follows directly from this: the achievable frequency margins shrink, and the frequency you can reach stably at the same voltage gets much worse. In short, more voltage is needed at the same frequency, and even with more voltage, the maximum achievable frequency can drop.
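One way to watch for this symptom is to keep a simple log of the lowest voltage that stayed stable at a given frequency and check whether it creeps upward. The sketch below is only an illustration: the log entries, the function name and the 0.025 V tolerance are all hypothetical.

```python
def voltage_drift(records, frequency_mhz):
    """Return how much the stable voltage rose at one frequency.

    records: (label, frequency_mhz, stable_vcore) tuples in chronological
    order, as logged by the user after each stability test.
    """
    volts = [vcore for _, freq, vcore in records if freq == frequency_mhz]
    if len(volts) < 2:
        return 0.0  # not enough data points to compare
    return volts[-1] - volts[0]


# Hypothetical log, for illustration only.
log = [
    ("month 0", 4200, 1.250),
    ("month 6", 4200, 1.265),
    ("month 12", 4200, 1.300),
]

drift = voltage_drift(log, 4200)
if drift > 0.025:  # tolerance chosen arbitrarily for this sketch
    print(f"Stable voltage rose by {drift:.3f} V: possible degradation")
```

If the drift keeps growing, the safer reaction according to the reasoning above is to lower the overclock rather than keep adding voltage.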
Particular cases
Electromigration is highly dependent on the particular chip we are dealing with and its manufacturing process. Depending on how it was manufactured, the tolerable voltage margins will be very different.
Disclaimer: as we will explain, these voltage ranges are not an "absolute truth" that applies to all cases. Don't overclock taking them as such.
For example, for first- and second-generation Ryzen (14nm and 12nm manufacturing processes), AMD itself puts a reasonable limit at around 1.425 volts, since at those levels good current margins are maintained. For the 7-nanometer process, as far as we know, AMD has not made an official statement, so we cannot make big claims. In any case, the general consensus among experts tends to be around 1.3 volts for heavy CPU loads. In lighter uses, such as most games, it can even approach 1.5V without problems, but this is an added risk that each user must decide whether to take on.
Note that, since AMD has not officially said anything about it, these "agreed margins" come from tests carried out by experts, based on how the boosting algorithm behaves.
What accounts for this difference? It has to do with the transistor density of the CPU. In a 7nm process, the minimum separation between transistors is much smaller, so more of them fit in a much smaller space, and therefore the internal connections that suffer electromigration are much more sensitive.
This density of transistors influences the power density of the chip. This is a very interesting metric: it is measured in W/mm² (power per unit area) and gives a good overview of how stressed the internal connections of the CPU are.
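To make the W/mm² figure concrete, here is a minimal sketch that divides power by die area. The two chips are hypothetical, chosen only to show that the same power concentrated in a smaller area gives a higher density.

```python
def power_density(power_w, die_area_mm2):
    """Power per unit area in W/mm^2."""
    if die_area_mm2 <= 0:
        raise ValueError("die area must be positive")
    return power_w / die_area_mm2


# Hypothetical chips: same power draw, different die area.
large_die = power_density(power_w=100, die_area_mm2=200)
small_die = power_density(power_w=100, die_area_mm2=80)

print(f"large die: {large_die:.2f} W/mm^2")
print(f"small die: {small_die:.2f} W/mm^2")
```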
Be careful: there is one thing you have to take into account. The safe voltage range depends on the unique characteristics of each piece of silicon. In other words, every chip is its own world. We all know this is true for achievable overclocking margins, but it is also true for safe voltage limits. For example, a 7nm Zen 2 chip overclocked at 1.325V and running a lot of heavy loads could end up degrading within a few months, simply because it is not a good sample.
Let's now move on to Intel, where the above principles obviously apply as well. The company has been using the same manufacturing process for several years, albeit with slight variations, so it is easier to talk about voltage limits. The general consensus is around 1.45 volts. However, these chips are said to be especially sensitive to temperature, so it has to be kept below 80 degrees if we want to run that voltage. With some CPUs and cooling systems this is impossible, so the practical limit is lower.
We remind you that all the voltages we have given are guide values that do not necessarily apply to every chip. In addition, we are only talking about the core voltage and not other voltages, such as the SoC voltage on AMD or the System Agent voltage on Intel.
The importance of context and of how the machine is used
Clearly, an "unsafe OC" on a computer that will mostly be used for gaming is not the same as one on a computer that will run heavy renders. This is not only because of the duration of the periods of high CPU load, which degrade the chip more the longer they last, but also because of the very nature of that load. A game does not usually stress the processor excessively, whereas a rendering application can use all the cores at the same time with complex and demanding instructions such as AVX.
Therefore, someone who is fully aware of what they are doing and only plans to game could overclock a little beyond the limits considered safe to obtain a much higher frequency, and would not suffer from electromigration. But it is something you have to know how to do very well, and it obviously carries risks. The best way to calibrate this is to watch the temperature: even if you push the voltage hard, if temperatures are reasonable and you are not going to apply extreme loads, you can rest easy.
AMD itself knows this well, which is why its boost algorithms push the CPU above the safe maximum voltage margins when the processor load is not excessive. The culprit is not the voltage directly, but the current density per unit of area inside the processor, which obviously increases if we raise the voltage but stays at much lower levels if the demand on the CPU is low.
Another tremendously relevant point, speaking of context, is that within the same architecture, CPUs with fewer cores leave much more margin for safe overclocking. For example, an Intel Core i5-10600K is much easier to overclock than an Intel Core i9-10900K: they share substrate, architecture, manufacturing process and core design, except that the latter has 4 more cores, so its overclocked power consumption is much higher, its current density per unit of area increases, and the margins for avoiding electromigration are smaller.
Another thing we want to tell you is that the automatic overclocking offered by motherboards is usually more dangerous than a manual one. All manufacturers include one-click OC options that can deliver quite interesting frequency increases, but in most cases they go far overboard with the voltage. In fact, we have seen many cases of people using these options on CPUs with very good coolers, yet the tremendous voltage forced by the board still drove temperatures excessively high, creating a real danger of electromigration.
So how do I properly overclock to avoid electromigration?
Getting a good overclock depends enormously on the tasks you do on your computer. If you do encoding, rendering, complex calculations, scientific work and so on, you will demand a lot from the CPU, and given what we have explained you must be very careful: use reduced voltages (below 1.3 volts on most platforms) or, if you approach the recommended maximums, keep temperatures under control. An arbitrary reference figure would be 75 degrees.
Heavy tasks for many hours? Low voltages and controlled temperatures.
If you are only going to play games, you have much more margin with the voltage and can go close to the recommended limits, always keeping an eye on temperatures.
If your cooling system cannot maintain good numbers, simply lower your expectations and set a safer voltage, at which a "high" temperature will not be a problem.
For gaming there is more leeway with the voltage, but if you approach the recommended maximums (the values for each CPU are given above), keep temperatures low.
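The guidance in this section can be condensed into a small rule-of-thumb check. This is only a sketch of the article's guide values (1.3 V for heavy loads, 75 degrees as the arbitrary reference temperature), not a guarantee of safety for any particular chip. The function name and the three verdict strings are hypothetical, the recommended maximum must be supplied by the caller for their platform, and only the core voltage is considered.

```python
def assess_overclock(workload, vcore, temp_c, recommended_max_v,
                     heavy_load_v=1.3, temp_limit_c=75):
    """Apply the article's rough guidance; guide values only.

    workload: "heavy" (rendering, encoding...) or "gaming".
    recommended_max_v: guide maximum core voltage for the platform.
    Returns "ok", "caution" or "reduce".
    """
    if workload not in ("heavy", "gaming"):
        raise ValueError("workload must be 'heavy' or 'gaming'")
    if vcore > recommended_max_v:
        return "reduce"   # beyond the guide maximum for this platform
    if temp_c > temp_limit_c:
        # Running hot: only a conservative voltage is considered tolerable.
        return "ok" if vcore <= heavy_load_v else "reduce"
    if workload == "heavy" and vcore > heavy_load_v:
        return "caution"  # near the maximum under heavy load: keep it cool
    return "ok"


# Example with the 1.425 V guide value the article quotes for 14nm/12nm Ryzen.
print(assess_overclock("heavy", 1.25, 70, recommended_max_v=1.425))
print(assess_overclock("heavy", 1.40, 70, recommended_max_v=1.425))
print(assess_overclock("gaming", 1.40, 85, recommended_max_v=1.425))
```

The order of the checks is a design choice for this sketch: the voltage ceiling is tested first, then temperature, and the workload only matters once both are within the guide values.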
We also want to make clear that on the latest AMD CPUs manual overclocking is not a great idea. The reason is basically that we can get better performance by working with the boost algorithm and options such as PBO (Precision Boost Overdrive). In practice, manual OC only makes sense if we specifically need higher multi-core performance, since single-core and overall performance may end up worse.
In the case of Intel, the fact that the same manufacturing process has been maturing for so many years means that interesting manual overclocking margins can be achieved, but always following the guidance we have just given.
How long can it take to appear? Should you be obsessed with temperatures?
One question you will all be asking yourselves is: how long will electromigration take to appear if I have not overclocked properly? There is no fixed answer, and depending on the case we can be talking about several months or several years.
Many people argue against OC by saying that "the CPU will last far fewer years." Those of you who have been paying attention will already have understood that it is a matter of how things are done, and that overclocking does not automatically lead to degradation that breaks the processor.
A badly overclocked CPU running above safe voltages under heavy load 24/7 may last only a couple of weeks. But with a normal overclock that follows our guidelines, even one with a somewhat higher voltage than desirable, a computer used for games and some occasional rendering could last several years. It all depends on the case, but we can talk about 2, 3, 4, 5 or 6 years. All this is before we begin to notice crashes with the overclock, that is, the first symptoms of electromigration, which can be left behind by lowering the OC to get more years out of the processor.
If you run excessive voltage and use the CPU for many power-hungry tasks, you can wear it out in 1 year. But that is doing things recklessly. A reasonable OC will take years to show any symptoms of electromigration.
If you have suffered degradation from electromigration, you have to understand that there is no miracle cure that will restore your CPU to its normal state. But that does not mean you have to throw it away: as long as it works, you can save it by simply lowering your overclock.
Oxide breakdown, where voltage does play the leading role
In this article we have talked about voltage as one of the causes of electromigration, but focusing more on what follows from raising it, such as higher temperatures or increased current, rather than on voltage being at fault per se. That is why we have told you that taking voltages beyond reasonable limits does not have to cause electromigration unless certain conditions are met.
However, there is a different phenomenon that is often confused with electromigration and that is affected by voltage. It's called oxide breakdown and it directly affects the transistors in the CPU, not the internal connections like electromigration. Furthermore, its effects are totally different.
Oxide breakdown is very easy to cause if you apply excessive voltage. Simply applying too much voltage (imagine, for example, 1.8V) for a short period of time, even at room temperature, will trigger this effect. Its consequences are catastrophic, because in a matter of seconds and without doing anything special, the CPU can be completely destroyed. That is the fundamental difference from electromigration, whose effects show up over a much longer term.
If you put too high a voltage and the CPU breaks almost instantly, an 'oxide breakdown' has occurred and not electromigration. Electromigration can go on for months or years without showing any clear effects or affecting your OC.
In the best case, oxide breakdown will not break the CPU outright, but the chip will stop working at certain voltages and its maximum OC frequency will be lost.
The threshold for oxide breakdown is usually considered to be above 1.5V on Intel's 14nm processes and above 1.8V on 3rd-generation Ryzen. As you can see, it will only happen in the most extreme cases.
Conclusions
All overclockers know that one of the biggest demons they face is processor degradation. This is caused by an electrical phenomenon called electromigration. When it happens, your CPU stops sustaining its current level of overclock and starts to require much higher voltage in order to function.
This effect begins to show when the processor is improperly overclocked and also used intensively. For this reason, it is very important to keep voltages adequate when overclocking and, if a high voltage is used, to monitor the temperatures and the way the processor is used. When overclocking with this problem in mind, hours-long renders are not the same as gaming: in the latter case we have much more margin to play with.
In this article, we have given a series of important ideas for understanding how to avoid electromigration depending on the CPU used, after introducing the physical fundamentals, causes, and consequences of the phenomenon. Among other key points, we have also talked about why, on the latest AMD CPUs, working with the boosting algorithm may be a better idea than overclocking, and we have introduced an effect that is often confused with electromigration but has very different causes: oxide breakdown. In short, an in-depth look at the greatest enemies of CPU overclocking.
We hope this article has helped you face the great enemy that is electromigration, and overclock with more skill and good judgment.