Learn the Best Way: How to Detect a Memory Leak

The alert usually shows up at the worst time. A pod starts restarting. RSS keeps climbing. Garbage collection gets noisier. Latency drifts upward, then someone asks the dangerous question: is this a memory leak, or just normal memory behavior under load?

That question matters because the wrong first move wastes hours. Many teams jump straight into heap dumps and profilers before proving that memory is leaking. In production, the fastest path is triage. Start broad, confirm the pattern, narrow the scope, then choose the least disruptive tool that can answer the next question.

Recognizing the Symptoms of a Memory Leak

Some leaks announce themselves with crashes. Most don't. They show up as a pattern of slow degradation that keeps returning after deploys, traffic shifts, or routine restarts.

What rising memory really means

The obvious symptom is steady memory growth over time. On Linux, that often means RSS keeps rising across repeated workload cycles. In managed runtimes, heap usage may rise after garbage collection instead of returning near its earlier baseline. In Kubernetes, the symptom may be operational rather than diagnostic: restarts, evictions, or repeated OOMKilled events.

Other signals often arrive first:

  • Longer garbage collection pauses because the runtime has more live objects to scan.
  • Latency drift during request bursts or background processing.
  • Reduced node headroom that causes unrelated services to compete for memory.
  • Recovery after restart followed by the same climb again, which strongly suggests retained state rather than a one-off spike.

A leak also creates secondary failures. Queue workers slow down. Sidecars compete for memory. Kernel pressure increases. Teams sometimes label all of that "the app is slow" and miss the memory pattern entirely. Good incident review depends on monitoring the silent failures that don't page immediately.

Practical rule: If memory rises, plateaus, and the retained data stays in use, that isn't automatically a leak. If memory rises across repeated cycles and never settles, suspicion should go up fast.

The first triage question

A frequently missed question is whether the leak is even in the application. Microsoft's memory leak troubleshooting guidance puts the first step earlier than most app teams expect: prove that a leak exists at all with long-running trend checks before moving into user-mode or kernel-mode diagnostics. The reason is that apparent leak symptoms can come from cache growth, normal GC behavior, or a lower-level component issue such as a driver or service layer.

That changes the workflow. If browser tabs balloon but the backend stays flat, the client may be leaking. If the process looks stable but the host is under memory pressure, a service, driver, or container layer may be the culprit. If only one pod in a replica set drifts upward after a particular lifecycle path, the application becomes the prime suspect.

A useful mental model is simple:

  1. Host-level symptom asks whether the machine is under pressure.
  2. Process-level symptom asks which process owns the growth.
  3. Runtime-level symptom asks whether live objects keep accumulating.
  4. Kernel or platform symptom asks whether the problem sits below user space.

Teams that skip step one often spend too long staring at code that isn't responsible.

Establishing a Baseline with System-Level Tools

Before attaching a profiler, confirm the trend with ordinary operating system tools. This phase isn't glamorous, but it avoids false positives and tells whether the leak is worth deeper analysis.

Figure: A five-step flowchart illustrating the systematic process for identifying and analyzing system-level memory leaks in applications.

Start with trends, not snapshots

The core idea behind detecting a memory leak is trend analysis, not one screenshot from top. Microsoft's leak detection guidance recommends watching counters over a long window, with an example of sampling every 10 minutes and, for a true long-duration check, capturing data for 24 hours. That advice applies well beyond Windows because the principle is universal: memory pools and runtimes often need time to settle.

On Linux, the practical equivalent is to record the same fields repeatedly:

  • RSS for resident memory pressure.
  • VSZ or VIRT for virtual address space context.
  • Swap activity, if present.
  • Per-process memory over repeated workload cycles, not just at startup.

Here, top, htop, ps, /proc/[pid]/status, and pmap prove useful. They won't tell why objects are retained, but they will show whether a process is growing and whether that growth sits in heap-like anonymous mappings, loaded libraries, or something else.

For teams that also perform host maintenance, it's worth separating leak diagnosis from routine memory reclamation tasks. If the issue turns out to be cache behavior rather than leaked allocations, Server Scheduler helps clear RAM cache safely and gives useful operational context for those cleanup cases.

Useful Linux checks before profiling

A simple workflow works well under pressure.

  • Use top or htop first. Watch whether one process consistently climbs in RSS while others stay stable.
  • Use ps for repeatable sampling. This is better for logging output over time than watching a terminal manually.
  • Inspect /proc/[pid]/status. It gives a quick read on resident and virtual memory without attaching anything intrusive.
  • Use pmap when the shape matters. It helps distinguish heap growth from mapped files, libraries, and anonymous regions.
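The repeatable-sampling step can be sketched in a few lines of Python. This is a minimal sketch, not a monitoring agent: it assumes a Linux host where /proc/[pid]/status exposes the VmRSS, VmSize, and VmSwap fields in kB, and the function names parse_status_kib and sample_process are illustrative rather than part of any tool named above.

```python
import re
import time
from pathlib import Path


def parse_status_kib(status_text: str) -> dict:
    """Extract VmRSS, VmSize, and VmSwap (in KiB) from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        match = re.match(r"^(VmRSS|VmSize|VmSwap):\s+(\d+)\s+kB", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def sample_process(pid: int, interval_s: float, count: int) -> list:
    """Record (timestamp, fields) pairs for one process at a fixed interval."""
    samples = []
    for i in range(count):
        text = Path(f"/proc/{pid}/status").read_text()
        samples.append((time.time(), parse_status_kib(text)))
        if i < count - 1:
            time.sleep(interval_s)
    return samples


# Parsing a made-up status excerpt; real files contain many more fields.
example = "Name:\tdemo\nVmSize:\t  204800 kB\nVmRSS:\t   51200 kB\nVmSwap:\t       0 kB\n"
print(parse_status_kib(example))
```

Calling sample_process(pid, 600, 144) would approximate the 10-minute, 24-hour window mentioned earlier. Writing the samples to a log file is what makes the later trend check possible.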

Short-term observation is where many investigations go wrong. A service may warm up classes, JIT-compile code, or populate caches at startup. That can look alarming in the first hour and then flatten out. Without repeated samples, it's easy to call normal growth a leak.

One clean graph over several hours is more useful than a detailed profiler capture taken too early.

A standard server monitoring stack also helps here. Teams already collecting CPU, memory, disk, and per-process telemetry can review the leak in context, which is exactly why Linux server monitoring dashboards matter during incident response.

What not to conclude too early

A few patterns create noise:

| Signal | What it may mean | Why it's not enough alone |
|---|---|---|
| RSS rises after deploy | Warmup, caching, JIT, or a leak | Early growth can be normal |
| Heap rises during load test | Active working set increased | The key question is whether it falls back |
| Pod restarts stop the issue briefly | Process memory state was reset | Useful clue, not proof of root cause |
| High host memory use | App leak or system component issue | Ownership still needs to be isolated |

The baseline phase is successful when it answers two narrow questions: is memory growth sustained, and which layer appears to own it? Only then is it worth paying the overhead of deeper tooling.

Deep Dive with Language-Specific Profilers

Once system-level evidence shows a real trend, the next job is to identify what keeps accumulating. This is where profilers are useful, because they answer a different question than OS tools do: not "is memory going up?" but "which objects, allocations, or references explain the rise?"

What profilers are good at

Profilers come in two broad styles. Some instrument allocations heavily and provide detailed call stacks or retained-object graphs. Others sample with less overhead but less precision. The right choice depends on how fragile the workload is and whether the issue is reproducible outside production.

The strongest general technique is snapshot comparison. A widely used approach is to compare heap snapshots before and after the same action runs repeatedly, then look for a persistent positive delta in object counts or retained size. That turns leak hunting into a measurable analysis instead of intuition, as reflected in Google's patent describing correlated memory trend analysis.

That method is reliable because leaks usually show up as survivors. The same request path runs again and again, but some objects created during each cycle never disappear. Snapshot comparison reveals the pattern.

Tool selection by runtime

Different runtimes expose different evidence. A practical selection framework helps.

| Tool Category | Overhead | Granularity | Typical Use Case |
|---|---|---|---|
| OS process tools such as top, ps, pmap | Low | Process and mapping level | Confirm sustained growth and find the leaking process |
| C and C++ profilers such as Valgrind Massif | High | Allocation and heap detail | Reproducible native leaks in development or staging |
| Java diagnostics such as jmap, jcmd, JFR, MAT | Medium to high | Heap objects, references, GC context | JVM leaks, retained object graphs, dump analysis |
| Python tools such as tracemalloc | Medium | Allocation traceback by file and line | Tracing growing allocations in repeatable paths |
| Browser and JavaScript tools such as Chrome DevTools | Medium | Heap snapshots, retained size, allocation timelines | Frontend leaks and Node.js heap comparison |

A few trade-offs matter in practice:

  • Valgrind Massif is useful for native memory but can be too slow for realistic production traffic.
  • Java heap dumps are extremely informative, especially when opened in Eclipse MAT or similar analyzers, but dump capture can be disruptive on busy processes.
  • tracemalloc helps when Python allocations need to be tied back to specific lines, but it works best in controlled reproductions.
  • Chrome DevTools and Node heap snapshots are excellent for JavaScript leaks involving listeners, closures, detached DOM nodes, or retained buffers.

How snapshot comparison finds the leak

A repeatable loop is the heart of profiler work. Start from a baseline. Run the suspect action. If the runtime supports it, force garbage collection to reduce noise. Take another snapshot. Repeat the exact same user flow again. Then compare.

A clean comparison tends to expose one of these patterns:

  • Objects whose count increases every iteration
  • Collections whose retained size never returns
  • Listener or callback structures that grow with each reconnect or mount/unmount cycle
  • Caches or maps with no eviction path
  • Resource wrappers that stay referenced after request completion
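For a Python service, the baseline, repeat, collect, compare loop can be sketched with the standard library's tracemalloc module. The handle_request function and the _retained list are hypothetical stand-ins for a suspect code path. They exist only to give the comparison something to find.

```python
import gc
import tracemalloc

_retained = []  # hypothetical leak: grows on every call and is never cleared


def handle_request(payload_size: int = 10_000) -> None:
    """Hypothetical suspect action that accidentally keeps its buffer alive."""
    _retained.append(bytearray(payload_size))


def growth_after(action, iterations: int, top: int = 3):
    """Run the action repeatedly and return the largest positive allocation deltas."""
    tracemalloc.start()
    try:
        action()      # warm-up pass so one-time setup is not counted
        gc.collect()  # reduce noise from collectable garbage
        before = tracemalloc.take_snapshot()
        for _ in range(iterations):
            action()
        gc.collect()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # compare_to returns differences grouped by source line, largest first.
    diffs = after.compare_to(before, "lineno")
    return [d for d in diffs if d.size_diff > 0][:top]


for stat in growth_after(handle_request, iterations=50):
    print(stat)
```

The line that appends to _retained should dominate the output, because its size and count deltas scale with the iteration count. Lines whose deltas do not grow when iterations is doubled are more likely warm-up or noise.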

Useful test condition: run the same action enough times that startup behavior is no longer the dominant signal.

What doesn't work well is random clicking through a UI, vague synthetic traffic, or trying to reason from a single dump with no baseline. That creates too much noise. Leak detection gets easier when the workload is narrow, scripted, and repeated exactly.

Profiler output also needs context. A large object isn't necessarily the problem. The problem is often the small long-lived reference that keeps a whole tree alive. That is why retained paths and dominator analysis matter more than sorting by shallow size alone.

Modern Detection with eBPF and Container Insights

Traditional profilers are powerful, but they can be too heavy for a busy production system. That's where modern tracing and container-aware telemetry become useful.

Figure: A comparison chart showing traditional profiling methods versus modern eBPF and container-based memory leak detection techniques.

When profilers are too invasive

eBPF gives operators a lower-impact way to observe system behavior from the kernel side. In a memory investigation, that matters when attaching a full profiler would distort the workload, require a restart, or create operational risk during an incident.

Tools from the BCC and bpftrace ecosystem can help surface allocation activity, syscall behavior, and process-level patterns without modifying application code. They don't replace heap analysis when the question is "which object graph is retained," but they are useful when the first need is live visibility with minimal interference.

This is the practical split:

  • Use traditional profilers when a controlled reproduction exists and the team needs code-level detail.
  • Use eBPF-based tracing when the system is live, timing-sensitive, or too fragile for deep attachment.
  • Use both when production evidence must guide a later, cleaner lab reproduction.

Container memory changes the investigation

Containers complicate memory leak detection because visibility is split. The process may look one way from inside the container and another from the node. Cgroup limits, shared page accounting, and orchestrator restarts all affect what the team sees.

A few container-specific checks help avoid blind spots:

  • Use docker stats or equivalent runtime metrics to see whether one container climbs while siblings remain stable.
  • Check Kubernetes events for OOMKilled restarts and restart loops.
  • Review working set, RSS, and garbage collection behavior together instead of relying on one chart.
  • Correlate pod churn with memory growth. If one replica repeatedly resets while peers stay healthy, the leak may be path-dependent rather than universal.
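From inside a container, the cgroup view can be read directly. The sketch below assumes a cgroup v2 unified hierarchy mounted at /sys/fs/cgroup, where memory.current holds the usage in bytes and memory.max holds either a byte limit or the literal string max. Hosts still on cgroup v1 use different file names, so treat the path as an assumption to verify.

```python
from pathlib import Path
from typing import Optional

CGROUP_ROOT = Path("/sys/fs/cgroup")  # assumption: cgroup v2 unified hierarchy


def parse_limit(raw: str) -> Optional[int]:
    """memory.max holds a byte count, or the literal 'max' when unlimited."""
    value = raw.strip()
    return None if value == "max" else int(value)


def usage_fraction(current_raw: str, max_raw: str) -> Optional[float]:
    """Fraction of the cgroup limit in use, or None when no limit is set."""
    limit = parse_limit(max_raw)
    if limit is None or limit == 0:
        return None
    return int(current_raw.strip()) / limit


def read_container_usage(root: Path = CGROUP_ROOT) -> Optional[float]:
    """Read both files from the cgroup directory and compute the fraction."""
    return usage_fraction(
        (root / "memory.current").read_text(),
        (root / "memory.max").read_text(),
    )


# Example with literal file contents: 500 MiB used against a 1 GiB limit.
print(usage_fraction("524288000\n", "1073741824\n"))
```

Recording this fraction alongside process RSS makes it easier to tell whether the container limit or the node is the constraint being approached.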

Production guidance for cloud and server environments, such as Wiz's memory leak detection guidance, favors trend analysis on memory telemetry plus automated dump capture when thresholds are crossed. It specifically calls out repeated OOMKilled restarts, rising RSS, and increasing GC pause times in Kubernetes as early signs worth acting on.

Container fleets also need better observability hygiene than many teams expect. Host-only dashboards can hide per-container growth, while in-container tooling can miss node pressure and eviction context. Good container monitoring troubleshooting closes that gap by making both views available during triage.

Systematizing Detection with Monitoring and Alerting

A leak that required one painful investigation will usually come back in a slightly different form. The fix is not just better debugging. It is a monitoring setup that catches gradual memory growth early enough to preserve evidence and keep the service online.

Alert on memory slope

Hard thresholds still have value, but they are late signals for slow leaks. A process can stay under its memory limit for a long time while retained objects grow cycle by cycle. By the time usage crosses a static threshold, the service may already be close to an OOM kill or a forced restart, and the best inspection window is gone.

A better default is to alert on sustained growth over time. In practice, that means watching the rate of increase for RSS, heap usage, or container working set over a meaningful window, then filtering out short-lived spikes from batch work, cache warmups, or garbage collection. The exact threshold depends on the service. A JVM under regular GC pressure needs a different policy than a Go API with a stable allocation pattern.

The useful signals are usually:

  • RSS or working set trend, to catch growth visible to the OS and container runtime
  • Heap trend, to separate application retention from native memory growth
  • GC frequency and pause behavior, to show whether the runtime is working harder to recover memory
  • Workload level, such as request rate or queue depth, to avoid paging someone for traffic-driven growth
  • Restart and OOM events, to show whether the leak is being masked by orchestration
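A slope check over a window of samples can be sketched with an ordinary least-squares fit. The function names and the thresholds in the example are illustrative assumptions, and a real rule would be tuned per service as described above.

```python
def slope_per_hour(samples):
    """Least-squares slope of (seconds, bytes) samples, in bytes per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must differ")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t * 3600.0


def sustained_growth(samples, min_slope_per_hour, min_span_s):
    """True when the window is long enough and the fitted slope exceeds the floor."""
    span = samples[-1][0] - samples[0][0]
    return span >= min_span_s and slope_per_hour(samples) >= min_slope_per_hour


# Illustrative data: one sample every 10 minutes, growing 100 bytes each time.
steady = [(i * 600, 1000 + i * 100) for i in range(7)]
# A single spike in the middle of an otherwise flat series.
spike = [(0, 1000), (600, 1000), (1200, 5000), (1800, 1000), (2400, 1000)]

print(sustained_growth(steady, min_slope_per_hour=500, min_span_s=3600))
print(sustained_growth(spike, min_slope_per_hour=500, min_span_s=2400))
```

Fitting a line across the whole window, rather than comparing the first and last sample, is what lets a short-lived spike pass without paging anyone.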

For teams wiring this into dashboards and alerts, platforms such as Prometheus, Grafana, and other DevOps monitoring tools for infrastructure telemetry make rate-of-change rules and correlated views practical.

Capture evidence before the crash

The alert is only half the job. Its value comes from what the system does next.

If a service shows steady memory growth and is still healthy enough to answer traffic, capture evidence then. Heap dumps, process snapshots, allocation profiles, and a timestamped bundle of container and node metrics are far more useful before the process is killed than after. Teams often wait for the crash because it feels safer. In production, that usually means losing the state that explains the leak.

Selective automation works best here. Dump capture can be expensive in CPU, disk, and pause time, so it should trigger only when the trend is persistent and the service is inside a safe operating window. On busy systems, I prefer a two-stage policy: page on sustained slope first, then trigger evidence collection only if growth continues or the process approaches a defined guardrail.
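The two-stage policy can be written down as a small decision function. Everything here is a hypothetical sketch: the window counts, the 0.85 guardrail, and the safe_to_capture flag are placeholders for whatever the service's own alerting and capacity data supply.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    PAGE = "page"
    CAPTURE = "capture"


def decide(growing_windows: int, usage_fraction: float, paged: bool,
           safe_to_capture: bool, page_after: int = 3,
           capture_after: int = 6, guardrail: float = 0.85) -> Action:
    """Stage one pages on sustained slope; stage two captures evidence."""
    if growing_windows < page_after:
        return Action.NONE  # trend not persistent yet
    if not paged:
        return Action.PAGE  # stage one: tell a human first
    growth_continues = growing_windows >= capture_after
    near_guardrail = usage_fraction >= guardrail
    if safe_to_capture and (growth_continues or near_guardrail):
        return Action.CAPTURE  # stage two: collect dumps and metrics
    return Action.NONE


print(decide(growing_windows=3, usage_fraction=0.5, paged=False, safe_to_capture=True))
print(decide(growing_windows=4, usage_fraction=0.9, paged=True, safe_to_capture=True))
```

Keeping the safe-window check as an explicit input means an expensive dump is never triggered by the trend alone.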

That trade-off matters in containers. A heap dump large enough to help can also fill ephemeral storage or push an already constrained pod into eviction. Store dumps off-node when possible, and include pod metadata, cgroup limits, and restart history with the artifact. Without that context, teams end up with a dump file but no reliable picture of the conditions that produced it.

A Practical Workflow for Reproducing and Fixing Leaks

Once the production pattern is clear, the repair work should move into a controlled reproduction. Here, a broad investigation becomes a focused engineering task.

Build a controlled reproduction loop

A practical workflow starts with a baseline, runs the suspected workload under controlled conditions, and repeats the same action loop. It forces garbage collection when the runtime allows it, then compares before-and-after heap snapshots to identify objects whose retained size keeps increasing. This is the approach described in Browserless guidance on finding and preventing memory leaks.

That sounds simple, but discipline matters. The loop should exercise the exact lifecycle path that appears in production:

  • request burst handling
  • component mount and unmount
  • reconnect logic
  • background job retries
  • file or stream processing
  • cache fill and eviction paths

A weak test won't reveal the leak. Unrealistic inputs often flatten the very path that retains memory.

Fix the retention path, then verify stability

Most fixes fall into a few buckets. References live too long. Event listeners or callbacks accumulate. Caches don't evict. Connection pools or file handles aren't closed. Retry and timer logic keeps objects reachable after the work is done.
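As one example of the "caches don't evict" bucket, the sketch below bounds a cache by evicting the least recently used entry. BoundedCache is an illustrative class, not taken from any library discussed here, and an LRU policy is only one of several reasonable eviction choices.

```python
from collections import OrderedDict


class BoundedCache:
    """Small LRU cache: the oldest-used entry is dropped once the cap is hit."""

    def __init__(self, max_entries: int):
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self._max = max_entries
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # evict least recently used

    def __len__(self) -> int:
        return len(self._data)


cache = BoundedCache(max_entries=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" becomes the most recently used entry
cache.put("c", 3)  # evicts "b"
print(len(cache), cache.get("b"))
```

The point of the cap is that cache size stops depending on how long the process has been running.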

A good remediation pass usually asks:

  1. What object survives each cycle that shouldn't?
  2. What reference keeps it alive?
  3. Is the retention intentional, such as a cache, but unbounded?
  4. Does memory stabilize after the fix across repeated runs?

What doesn't work is stopping after a code change that "looks right." The fix is only real when the same reproduction loop stops showing growth. Memory must return toward its prior baseline after garbage collection, and repeated snapshots must stop showing a positive accumulation pattern.

That final verification is where many leak fixes fail. Teams patch one code path, but a second retention path remains. Repeat the loop until the trend is gone, not just reduced.
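The "gone, not just reduced" test can be made mechanical. The following sketch compares post-GC readings from the first and second halves of a reproduction run. The verify_fix name and the 5 percent tolerance are assumptions chosen for illustration.

```python
def verify_fix(post_gc_bytes, tolerance: float = 0.05) -> bool:
    """True when late post-GC readings are no higher than early ones, within tolerance."""
    if len(post_gc_bytes) < 4:
        raise ValueError("need at least four post-GC readings")
    half = len(post_gc_bytes) // 2
    early = sum(post_gc_bytes[:half]) / half
    late = sum(post_gc_bytes[half:]) / (len(post_gc_bytes) - half)
    return late <= early * (1 + tolerance)


# Illustrative readings taken after forced GC at the end of each loop iteration.
print(verify_fix([100, 102, 99, 101, 100, 101]))   # flat after the fix
print(verify_fix([100, 103, 106, 109, 112, 115]))  # slower growth, still leaking
```

A partial fix that only slows the climb still fails this check, which is the case the paragraph above warns about.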


Fivenines can help teams catch these patterns earlier by collecting per-server and per-container memory telemetry over time, then alerting on abnormal growth before a leak turns into a restart loop or an outage. For operators who want one place to watch Linux servers, containers, uptime checks, and scheduled task health, Fivenines is one monitoring option to evaluate alongside the rest of the workflow described above.