Python: Memory and Runtime Profiling

Performance optimization is a constant pursuit in software development, aimed at making applications more efficient and more responsive.
Python, a versatile and widely used programming language, offers a range of profiling tools to identify performance bottlenecks and optimize code execution.

Runtime profiling and memory profiling are two essential techniques for optimizing Python code performance.

Runtime Profiling

Runtime profiling is the process of measuring the execution time of different parts of a Python program to identify performance bottlenecks.
This can be done with a variety of tools, such as cProfile and pyinstrument. These tools generate a profile report that shows how much time each function takes to execute.

This helps quantify where the program spends most of its time: slow functions, expensive I/O, repeated computation, and so on.

The following time metrics are used in Python profiling (a short measurement sketch follows the equations further below):

Wall Clock Time

  1. Elapsed real time from start to finish of a function call or code block execution.
  2. Includes time spent waiting on I/O, network calls, database requests, etc.
  3. Measured using time.time() at the start and end of the section.
  4. Gives total duration from a user perspective.

CPU Time

  1. Time spent by the CPU actively executing instructions in a function.
  2. Excludes time waiting for I/O, network requests, etc.

Two types:

a. User CPU Time

  1. Time actively spent running a program’s code in user space.
  2. In Python this is the execution of Python bytecodes, numpy functions, etc.
  3. Measure of computation time.

b. System CPU Time

  1. Time spent executing system calls and kernel functions.
  2. Includes OS work such as disk I/O and networking calls. (System CPU time covers the kernel logic executed for I/O operations, but not the time spent waiting for the I/O to complete while the CPU is idle.)
  3. Shows time spent in kernel space on behalf of the process.

Comparing Times

  1. Wall clock time will generally be higher as it includes I/O wait time.
  2. High system CPU could indicate excessive syscalls.
  3. Optimizing computation-heavy code reduces User CPU time.
  4. Profiling helps compare where time is spent - computing vs I/O.

Relationship Between Wall Clock Time, CPU Time, User CPU Time, and System CPU Time

The following equation holds: wall clock time = CPU time + other time

Where “other time” is the time the program spent not executing on the CPU, for example waiting for I/O to complete, waiting on locks, or waiting to be scheduled while other processes run.

CPU time can be further broken down into user CPU time and system CPU time:
CPU time = User CPU time + System CPU time
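
To make the distinction concrete, here is a minimal sketch (not tied to any particular profiler) that measures the same function with time.time() for wall clock time and time.process_time() for CPU time; os.times() additionally splits the CPU time into its user and system parts. The exact numbers will vary by machine.

import os
import time

def busy_then_sleep():
    # CPU-bound work: shows up as user CPU time
    total = sum(i * i for i in range(2_000_000))
    # Waiting: shows up in wall clock time but barely in CPU time
    time.sleep(1)
    return total

wall_start = time.time()           # wall clock
cpu_start = time.process_time()    # user + system CPU time of this process

busy_then_sleep()

wall = time.time() - wall_start
cpu = time.process_time() - cpu_start
user, system = os.times()[:2]      # cumulative user / system CPU time of the process

print(f"wall clock time: {wall:.3f}s")        # roughly compute time + 1s of sleep
print(f"CPU time:        {cpu:.3f}s")         # roughly compute time only
print(f"other time:      {wall - cpu:.3f}s")  # mostly the sleep (I/O-style wait)
print(f"user: {user:.3f}s, system: {system:.3f}s")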

Profiling Tools for Python

The Python ecosystem boasts a diverse selection of profiling tools, each with its own strengths and applications. Some of the most popular options include:

  • cProfile: The standard library’s cProfile module is a deterministic (tracing) profiler: it records every function call and reports call counts, time spent in each function, and detailed call information. Being built in, it requires no extra installation.
  • Pyinstrument: A statistical profiler that provides insights into the performance characteristics of Python programs. It employs a sampling approach, periodically recording the call stack to estimate the time spent in each function. Sampling keeps the overhead low, making it well suited to profiling long-running programs.

Pyinstrument

Pyinstrument is a statistical profiler - it doesn’t track every function call that your program makes. Instead, it records the call stack every 1 ms.
That gives it some advantages over other profilers. Firstly, statistical profilers have much lower overhead than tracing profilers.

Pyinstrument records duration using ‘wall-clock’ time. When you’re writing a program that downloads data, reads files, and talks to databases, all that time is included in the tracked time by pyinstrument.
That’s important when debugging performance problems since Python is often used as a ‘glue’ language between other services. The problem might not be in your program, but you should still be able to find out why it’s slow.

The standard Python profilers, profile and cProfile, show you a big list of functions ordered by the time spent in each function. This is useful, but it can be difficult to interpret: it’s more helpful to know why those functions are called, and which parts of your own code were involved.

Following is the output from cProfile for a sample program (example_prog.py, listed further below):

        10000004 function calls in 4.165 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.165    4.165 {built-in method builtins.exec}
        1    0.098    0.098    4.165    4.165 example_prog.py:1(<module>)
        1    3.235    3.235    4.067    4.067 example_prog.py:1(leak_memory)
 10000000    0.832    0.000    0.832    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

It’s often hard to understand how your own code relates to these traces.

Following is the output from pyinstrument for the same sample program:

  _     ._   __/__   _ _  _  _ _/_   Recorded: 17:20:40  Samples:  3796
 /_//_/// /_\ / //_// / //_'/ //     Duration: 3.898     CPU time: 3.894
/   _/                      v4.6.1

Program: /envs/bin/pyinstrument example_prog.py

3.897 <module>  example_prog.py:1
├─ 3.820 leak_memory  example_prog.py:1
│  ├─ 3.159 [self]  example_prog.py
│  └─ 0.662 list.append  <built-in>
└─ 0.077 [self]  example_prog.py

To view this report with different options, run:
    pyinstrument --load-prev 2023-11-26T17-20-40 [options]

Pyinstrument records the entire stack, so tracking down expensive calls is much easier. It also hides library frames by default, letting you focus on how your app/module is affecting performance.

Install Pyinstrument

pip install pyinstrument

Let’s use a sample program:

# example_prog.py
def leak_memory():
  # Create a list and keep adding items to it without ever removing them
  leaky_list = []
  i = 0
  while i < 10000000:
    i += 1
    leaky_list.append(object())

if __name__ == "__main__":
  leak_memory()

Run using pyinstrument

pyinstrument example_prog.py 
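
Besides the command line, pyinstrument can also be driven from code via its Profiler class. A minimal sketch, assuming example_prog.py is importable (the file name and the 1 ms sampling interval are just illustrative):

# profile_with_api.py - pyinstrument used programmatically (sketch)
from pyinstrument import Profiler

from example_prog import leak_memory  # the sample program above

profiler = Profiler(interval=0.001)  # sample the call stack every 1 ms
profiler.start()

leak_memory()

profiler.stop()
profiler.print()  # print the call tree to the terminal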

Run using cProfile

python -m cProfile example_prog.py 
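
cProfile can likewise be used programmatically, which is handy when you only want to profile a single function. A minimal sketch that sorts the report by cumulative time using the standard pstats module (the file name is illustrative):

# profile_with_cprofile.py - cProfile used programmatically (sketch)
import cProfile
import pstats

from example_prog import leak_memory  # the sample program above

profiler = cProfile.Profile()
profiler.enable()

leak_memory()

profiler.disable()

# Sort by cumulative time and show the ten most expensive entries
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)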

Memory Profiling

Memory profiling helps find memory leaks and optimization opportunities by detailing memory usage over program execution.
By understanding memory usage patterns, identifying potential issues, and applying best practices, Python developers can create more efficient and reliable software.

Once you have a memory profile report, you can identify the objects that are taking up the most memory and focus on optimizing those objects. This can lead to significant memory savings, especially for programs that use large amounts of data.

Some common methods are:

  1. Tracing allocations with the tracemalloc module
  2. Reading process memory usage with the resource module
  3. Using the memory_profiler package to analyze memory usage line by line
  4. Monitoring in production using psutil or Memray

These methods help identify growing memory patterns, temporary allocations worth optimizing, caching opportunities, and leaks. A short sketch of the first two methods follows this list.
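
As a starting point for the first two methods, here is a minimal sketch combining tracemalloc (allocation tracking) with the resource module (process-level peak memory, available on Unix-like systems). The file name is illustrative:

# memory_usage.py - tracemalloc and resource used together (sketch)
import resource
import tracemalloc

from example_prog import leak_memory  # the sample program shown earlier

tracemalloc.start()

leak_memory()

current, peak = tracemalloc.get_traced_memory()  # bytes tracked by tracemalloc
print(f"current: {current / 1024 / 1024:.1f} MiB, peak: {peak / 1024 / 1024:.1f} MiB")

# Top 5 source lines by total allocated size
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)

# Peak resident set size of the whole process
# (ru_maxrss is in kilobytes on Linux and in bytes on macOS)
print("peak RSS:", resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)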

Memory leak

A memory leak in Python occurs when an object is no longer being used but remains in memory, causing the program’s memory usage to grow over time. This can eventually lead to performance issues, instability, and even crashes. Memory profiling is essential for identifying and resolving memory leaks in Python applications.

Detecting Memory Leaks

Several signs can indicate a memory leak in your Python program:

  1. Increasing Memory Usage: Continuously increasing memory usage, even when the program’s workload remains constant, is a strong indicator of a memory leak.
  2. Garbage Collector Inefficiency: If the garbage collector (GC) appears to be running frequently but failing to release memory, it suggests a memory leak is preventing proper memory management.
  3. Performance Degradation: As memory usage increases, the program’s performance may degrade, causing slowness, unresponsiveness, or even crashes.
  4. Unexplained Objects: Examining the objects in memory may reveal objects that are no longer needed but are still referenced, and therefore never freed, indicating a potential leak.
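
A simple way to confirm the first sign is to compare tracemalloc snapshots taken before and after a suspected workload: allocations that keep growing between snapshots point at the leaking lines. A minimal sketch with a deliberately leaky cache (all names are illustrative):

# leak_check.py - spotting growth with tracemalloc snapshots (sketch)
import tracemalloc

cache = []  # deliberately leaky: references are kept forever

def do_work():
    cache.append(bytearray(1024 * 1024))  # retains ~1 MiB per call

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

for _ in range(5):
    do_work()

snapshot = tracemalloc.take_snapshot()

# Lines whose allocated size grew the most since the baseline are leak candidates
for stat in snapshot.compare_to(baseline, "lineno")[:3]:
    print(stat)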

Tools:

Memray: Memray is a powerful memory profiling tool for Python that provides a comprehensive view of memory usage and helps identify and resolve memory leaks. It offers several advantages over traditional memory profiling tools, including:

  1. Native Profiling: Memray can track native C/C++ functions in addition to Python functions, providing a holistic view of memory usage in applications that use C extensions.
  2. Flamegraph Visualization: Memray generates flamegraphs, which are interactive visualizations that show how memory is being used at each level of the call stack. This makes it easy to identify the functions that are consuming the most memory.
  3. Per-Test Profiling: Memray can be integrated into pytest test suites, allowing you to measure memory usage for each test and identify any memory leaks that occur during testing.
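
The per-test profiling mentioned in point 3 is provided by the separate pytest-memray plugin (pip install pytest-memray). A minimal sketch of a test that declares a memory budget, assuming that plugin is installed (file and test names are illustrative):

# test_memory.py - run with: pytest --memray test_memory.py
import pytest

from example_prog import leak_memory  # the sample program shown earlier

@pytest.mark.limit_memory("100 MB")  # fail the test if it allocates more than this
def test_leak_memory_stays_within_budget():
    leak_memory()

Against the leaky sample program this test would fail, which is exactly the point: the plugin flags tests whose allocations exceed the declared budget.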

Install Memray

pip install memray

Run on a sample program

memray run example_prog.py 

The above command will generate a .bin file

You can use this .bin file now to generate reports.

Some example commands to generate reports:

Generate FlameGraph

python -m memray flamegraph <bin_file_path>

Get memory usage by running stats or tree

memray stats <bin_file_path>
Or 
memray tree <bin_file_path>

The above commands will generate outputs like the following (stats first, then tree):

📏 Total allocations:
	371

📦 Total memory allocated:
	920.897MB

📊 Histogram of allocation size:
	min: 1.000B
	------------------------------------------
	< 6.000B   :   4 ▇
	< 38.000B  :   0 
	< 242.000B :   6 ▇
	< 1.478KB  :  89 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇
	< 9.218KB  :  35 ▇▇▇▇▇▇
	< 57.492KB :  18 ▇▇▇
	< 358.589KB:  16 ▇▇▇
	< 2.184MB  : 171 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
	< 13.623MB :  16 ▇▇▇
	<=84.968MB :  16 ▇▇▇
	------------------------------------------
	max: 84.968MB

📂 Allocator type distribution:
	 MMAP: 156
	 REALLOC: 118
	 MALLOC: 76
	 CALLOC: 21

🥇 Top 5 largest allocating locations (by size):
	- leak_memory:example_prog.py:7 -> 920.696MB
	- get_data:<frozen importlib._bootstrap_external>:1131 -> 62.472KB
	- _compile_bytecode:<frozen importlib._bootstrap_external>:729 -> 32.352KB
	- _get_code_from_file:<frozen runpy>:259 -> 29.971KB
	- namedtuple:/Users/lib/python3.11/collections/__init__.py:434 -> 14.752KB

🥇 Top 5 largest allocating locations (by number of allocations):
	- leak_memory:example_prog.py:7 -> 258
	- _compile_bytecode:<frozen importlib._bootstrap_external>:729 -> 35
	- _get_code_from_file:<frozen runpy>:259 -> 21
	- namedtuple:/lib/python3.11/collections/__init__.py:434 -> 10
	- _fill_cache:<frozen importlib._bootstrap_external>:1659 -> 4

Allocation metadata
-------------------
Command line arguments: '/envs/bin/memray run example_prog.py'
Peak memory size: 241.032MB
Number of allocations: 698

Biggest 10 allocations:
-----------------------
📂 241.020MB (100.00 %) <ROOT>  
├── 📂 241.019MB (100.00 %) _run_tracker  /lib/python3.11/site-packages/memray/commands/run.py:66
│   ├── [[4 frames hidden in 2 file(s)]]
│   │   └── 📄 240.979MB (99.98 %) leak_memory  example_prog.py:7
│   ├── [[1 frames hidden in 1 file(s)]]
│   │   └── 📂 36.990KB (0.01 %) _find_and_load  <frozen importlib._bootstrap>:1178
│   │       ├── [[1 frames hidden in 1 file(s)]]
│   │       │   └── 📂 20.990KB (0.01 %) _load_unlocked  <frozen importlib._bootstrap>:690
│   │       │       ├── [[2 frames hidden in 1 file(s)]]
│   │       │       │   └── 📄 14.127KB (0.01 %) _compile_bytecode  <frozen importlib._bootstrap_external>:729
│   │       │       └── [[1 frames hidden in 1 file(s)]]
│   │       │           └── 📂 5.639KB (0.00 %) _call_with_frames_removed  <frozen importlib._bootstrap>:241
│   │       │               ├── 📄 2.264KB (0.00 %) <module>  /ib/python3.11/pkgutil.py:22
│   │       │               ├── 📄 1.688KB (0.00 %) <module>  /lib/python3.11/pkgutil.py:194
│   │       │               └── 📄 1.688KB (0.00 %) <module> /lib/python3.11/pkgutil.py:269
│   │       └── [[4 frames hidden in 2 file(s)]]
│   │           └── 📂 16.000KB (0.01 %) find_spec  <frozen importlib._bootstrap_external>:1616
│   │               ├── 📄 8.000KB (0.00 %) _fill_cache  <frozen importlib._bootstrap_external>:1667
│   │               └── [[1 frames hidden in 1 file(s)]]
│   │                   └── 📄 8.000KB (0.00 %) <setcomp>  <frozen importlib._bootstrap_external>:1684
│   └── [[2 frames hidden in 2 file(s)]]
│       └── 📄 3.072KB (0.00 %) __init__  <frozen zipimport>:94
└── 📄 1.111KB (0.00 %) _run_tracker  /lib/python3.11/site-packages/memray/commands/run.py:52

Generate a flamegraph to study memory allocation over time; this can help in debugging memory leak issues.

memray flamegraph <bin_file_path>

Live Debugging

Run memray with --live to debug the behavior of a long-running program:

memray run --live example_prog.py

The above command is useful to debug memory leaks in a long-running application, as you can observe the memory allocation changes in real time.