Caching in Earthfiles
Caching is at the heart of how Earthly works. This page will walk you through the key concepts of caching in Earthfiles.
Most commands in an Earthfile create a cache layer as part of the way they execute. You can think of a target in an Earthfile as a cake with multiple layers. If a layer's ingredients change, you need to redo the affected layer, plus any layer on top. Similarly, in an Earthfile target, if the input to a command is different (different ARG values, different source files being COPY'd, or the command itself is different), then Earthly can reuse the layers from a previous run up to that command, but it would have to re-execute that command and what follows after it.
If you happen to be familiar with Dockerfile layer caching, then layer caching in Earthly targets will be very familiar to you as it works the same way.
Earthly supports inheriting from other targets, copying artifacts that result from them, or simply issuing the build of another target. These various cross-references between targets form a build graph underneath. Thus, one target can influence whether another target is executed. For example, if a source file changes and that results in rebuilding an artifact in `target1`, and `target2` performs a `COPY` of that artifact, then at least part of `target2` will need to be re-executed too. Earthly deals with all of this automatically.
Because of how layer caching works, it is best to organize builds in a manner that best utilizes the cache. A common strategy is to download and install dependencies early on in the build. Since the list of dependencies doesn't change very often, this expensive operation will usually be cached. To achieve this, it is important to copy the minimal amount of source files (usually just the file that defines what the dependencies are) before issuing the command that installs the dependencies.
Here is a practical example:
```
COPY . .
RUN go mod download
RUN go build ...
```
In the above example, even an unrelated change to the project, such as a `git fetch` updating the contents of the `.git` directory, might cause slow commands like `go mod download` to be re-executed. Earthly uses `COPY` commands (among other things) to mark certain files as inputs to the build. If any file included in a `COPY` changes, then the build will re-execute from that `COPY` command onwards. For this reason, you want to be as specific as possible when including files in a `COPY` command. In some cases, you might even have to list files individually.
Here are some possible ways to improve the above example:
```
COPY go.mod go.sum ./*.go ./
RUN go mod download
RUN go build ...
```
The above is better, as it avoids reacting to changes in `.git` or to unrelated files, like `README.md`. However, this can be arranged even better, to avoid downloading all the dependencies on every change to a `.go` source file:
```
COPY go.mod go.sum ./
RUN go mod download
COPY ./*.go ./
RUN go build ...
```
In general, including the smallest possible set of input files at every step will result in the best cache performance.
Sometimes layer caching is not enough to properly express the best way to cache something. Cache mounts help complement layer caching, by allowing the contents of a directory to be reused across multiple builds. Cache mounts can be helpful in cases where the tool you're using to build within Earthly is able to leverage incremental caching on its own. Some package managers are able to do that for downloaded dependencies.
There are two ways to create a cache mount in an Earthfile: the `RUN --mount type=cache` flag and the `CACHE` command. Here is how they compare:

- Scope: `RUN --mount type=cache` only mounts the cache for that single `RUN` command, whereas `CACHE` mounts it for any `RUN` command that follows in the same target.
- Final image: With `RUN --mount type=cache`, the contents of the cache are NOT persisted in the final image. With `CACHE`, the contents of the cache are copied into the final image and, as a result, are also available to be read in targets inheriting from the original target.
- Performance: `RUN --mount type=cache` is very performant, as it does not require transferring contents at the end. `CACHE` can be slow in certain cases, if the contents are large, due to the need to copy the contents into the final image.
- Sharing between commands: `RUN --mount type=cache` is isolated to a single command, making it more difficult (but not impossible) to pass along files between steps via the cache. `CACHE` is available to all commands in the target, making it easier to pass along files between steps via the cache, and thus also easier to run into race conditions if a parallel build changes the contents of the cache in unexpected ways.
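To make the comparison concrete, here is an illustrative sketch (the target name, base image, and cache paths are assumptions, not from this page) showing both forms in one Go build:

```
build:
    FROM golang:1.21
    WORKDIR /app
    # Available to every RUN below in this target, and persisted in
    # the final image.
    CACHE /go/pkg/mod
    COPY go.mod go.sum ./
    RUN go mod download
    COPY ./*.go ./
    # Mounted only for this single RUN; not part of the final image.
    RUN --mount type=cache,target=/root/.cache/go-build go build -o app .
```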
Cache mounts, by default, are only available within the same target. So if two targets both declared `RUN --mount type=cache,target=/my-cache`, the contents would not be shared between them. If you would like to share the contents, you can use the `id` option. Setting an `id` makes the cache mount global, allowing any target to access the same contents, as long as they all use the same `id`: `RUN --mount type=cache,id=my-cache-id,target=/my-cache`.
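For example (a sketch; the target names, cache id, and commands are made up), two targets sharing the same global cache via a common `id`:

```
lint:
    FROM golang:1.21
    COPY ./*.go go.mod go.sum ./
    RUN --mount type=cache,id=go-mod-cache,target=/go/pkg/mod go vet ./...

test:
    FROM golang:1.21
    COPY ./*.go go.mod go.sum ./
    # Same id, so this target sees the same cache contents as lint.
    RUN --mount type=cache,id=go-mod-cache,target=/go/pkg/mod go test ./...
```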
Parallel builds using the same cache mount (or the same build where the mount is used in multiple targets) pose another aspect to be aware of: concurrent access to the cache mount. By default, sharing is set to `locked`, meaning that parallel executions wait for each other to complete, allowing access by one process at a time. While this is the safest option, it is also the slowest; keep in mind that overusing global cache mounts will limit your build parallelism significantly. The other possible options are `shared` (allows concurrent access) and `private` (if a parallel execution occurs, a new empty mount is created).
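The sharing mode is set alongside the other mount options. A sketch of the three modes (the script and cache path are hypothetical):

```
# Safest (and the default): one process at a time.
RUN --mount type=cache,sharing=locked,target=/my-cache ./build.sh

# Concurrent access allowed; the tool using the cache must tolerate it.
RUN --mount type=cache,sharing=shared,target=/my-cache ./build.sh

# Parallel executions each get a fresh, empty mount instead.
RUN --mount type=cache,sharing=private,target=/my-cache ./build.sh
```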
Cache mounts can be a versatile tool for controlling caching in ways that layer caching cannot. There are, however, important limitations to understand.
The most important limitation to be aware of is that reusing state from a previous run can be a source of build inconsistency. A test passing just because it starts off with the right contents in cache could later result in deploying a broken application to production.
Another limitation is that cache mounts are not great for passing files from one build step to another, because a parallel build could interfere with the cache between steps in ways that are difficult to debug. Be especially mindful that builds from different development branches might interact with each other unexpectedly in this situation. It is therefore best to avoid using cache mounts as a mechanism to pass along information; instead, extract the result of an operation out of the cache mount within the same operation, to ensure that the cache stays locked during this time.
Finally, another important limitation is the fact that cache mounts can grow in size indefinitely. While Earthly does garbage-collect layers and cache mounts on a least-recently-used basis, a cache mount that is used frequently could grow more than expected. In such situations, you should consider managing the lifecycle of the cache contents yourself, by removing unused files every few runs. A good place for such cleanup operations is at the end of the same layer (same `RUN` command) that uses the contents.
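For instance (a sketch with an assumed build script, cache path, and age threshold), stale files are pruned at the end of the same `RUN` command that uses the cache, so the mount stays locked for the whole operation:

```
build:
    FROM alpine
    COPY build.sh ./
    # Use the cache, then evict files not accessed in the last 30 days,
    # all within the same layer.
    RUN --mount type=cache,target=/cache \
        ./build.sh --cache-dir /cache && \
        find /cache -type f -atime +30 -delete
```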
In certain situations, you might want to disable caching, either for a specific command or for the entire build. This can be done by using the `--no-cache` flag. For example, `RUN --no-cache echo "Hello"` will always execute the `echo` command, even if the `RUN` command was executed before with the same arguments. To disable caching for an entire run, you can use `earthly --no-cache +my-target`.
Another way to disable caching is to use the `RUN --push` flag. This flag is useful when you want to perform an operation with external effects (e.g. deploying to production). By default, Earthly does not run `--push` commands unless the `--push` flag is also specified when invoking Earthly itself (`earthly --push +my-target`). `RUN --push` commands are never cached.
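A brief sketch (the target names and deployment script are hypothetical):

```
deploy:
    FROM alpine
    COPY deploy.sh ./
    COPY +build/app ./
    # Executes only when invoked as `earthly --push +deploy`,
    # and is never cached.
    RUN --push ./deploy.sh production
```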
Debugging caching issues can be tricky. Here are some common issues that you might face and how to resolve them.
If the configured cache size is too small, then Earthly might garbage-collect cached layers more often than you might expect. This can manifest in builds randomly not using cache for certain layers. Usually it is the biggest layers that suffer from this (and oftentimes the biggest layers are the most expensive to recreate). This problem is most common on Mac and Windows, where Docker uses a VM with limited disk size. To resolve this, either configure a larger cache size if you are running local builds, or launch a larger Satellite if you are using remote builds via Earthly Satellites. For more information see the managing cache page.
In Earthly, as in Dockerfiles, ARGs declared in an Earthfile also behave as environment variables within the target they are declared in. This means that if you declare an ARG and then use it in a `RUN` command, the `RUN` command will be invalidated if the ARG's value changes. This is sometimes not obvious, especially if you are not actually using the value of that ARG.

For this reason, it is best to declare ARGs as late as possible within the target they are used in, and to avoid declaring `--global` ARGs as much as possible. If an ARG is not yet declared, it will not influence the cache state of a layer, allowing for more cache hits. Limiting the scope of ARGs as much as possible will yield better cache performance.
Watch out especially for ARGs that change often, such as the built-in ARG `EARTHLY_GIT_HASH`. Declaring this ARG as late as possible in the build will cause fewer cache misses.
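As an illustration (the target layout is assumed), the expensive compile step is kept above the ARG declaration so that it is unaffected by hash changes:

```
build:
    FROM golang:1.21
    COPY ./*.go go.mod go.sum ./
    # Cached independently of the git hash.
    RUN go build -o app .
    # Declared as late as possible: only the layers below this point
    # are invalidated when the hash changes.
    ARG EARTHLY_GIT_HASH
    RUN echo "$EARTHLY_GIT_HASH" > version.txt
```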
Note that secrets, unlike ARGs, do NOT contribute to the cache state of a layer. This means that if you use a secret in a `RUN` command and the secret changes, the `RUN` command will not be invalidated.
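For example (the secret name and URL are made up), this layer stays cached even when the token is rotated:

```
fetch:
    FROM alpine
    RUN apk add --no-cache curl
    # The secret's value is not part of the cache key, so rotating
    # API_TOKEN does not invalidate this layer.
    RUN --secret API_TOKEN \
        curl -H "Authorization: Bearer $API_TOKEN" https://example.com/data
```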
Please note that remote caching via a registry tends to be very difficult to get right. When using explicit caching, by default, only the layers of the target being directly called are included (if any). From there, you need to add `SAVE IMAGE --cache-hint` across various other targets to add more layers to the remote cache. There is often a trade-off between the upload/download size and the actual time saved, and significant experimentation is necessary to get this right. Note also that main-branch builds might overlap with PR builds, so you might need to use different cache destinations for each. Usually, using remote runners is a better alternative to remote caching via a registry.
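Here is a sketch of how a dependencies target might be marked for inclusion in the remote cache (the target name and base image are assumptions):

```
deps:
    FROM golang:1.21
    COPY go.mod go.sum ./
    RUN go mod download
    # Mark this target's layers for upload to the remote cache.
    SAVE IMAGE --cache-hint
```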
If you have already optimized your cache by maximizing its size, declaring arguments as late as possible, and implementing the other recommendations provided here, but you still encounter performance bottlenecks due to computationally intensive tasks being evicted from the cache, consider employing `SAVE IMAGE` commands at strategic points. These images can serve as manual caches and can improve efficiency at the cost of simplicity. For additional details, refer to the Best Practices section.
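One possible shape of this (the registry, image name, and tag are placeholders) is to push an image snapshot of an expensive step, which later builds can start from instead of recomputing it:

```
deps:
    FROM golang:1.21
    COPY go.mod go.sum ./
    RUN go mod download
    # Push a snapshot of the expensive step; later builds can base
    # themselves on this image (e.g. via FROM) instead of redoing it.
    SAVE IMAGE --push registry.example.com/my-app/deps:latest
```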