profile

Memory / storage profiler for materialized datakit datasets.

Profiles a materialized pandas.DataFrame (or a saved .pkl / .h5 file containing one), with first-class support for the nested object payloads typical of datakit outputs:

Object-dtype columns containing embedded pandas.DataFrame instances (recursed into, per inner column).
Object-dtype columns containing numpy.ndarray payloads.
meta columns (or any object cell) holding nested dict payloads (recursively profiled, key by key).
Arbitrary Python objects, sized via sys.getsizeof with cycle-safe recursion into containers (list/tuple/set/dict).
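The cycle-safe recursion described in the last item can be sketched as follows. This is a minimal illustration using only sys.getsizeof, not the profiler's actual code; the function name deep_sizeof and its _seen parameter are hypothetical.

```python
import sys


def deep_sizeof(obj, _seen=None):
    """Estimate bytes held by obj, following list/tuple/set/dict containers."""
    if _seen is None:
        _seen = set()
    # Track object identities so shared or self-referencing containers
    # are counted once and the recursion always terminates.
    if id(obj) in _seen:
        return 0
    _seen.add(id(obj))

    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += deep_sizeof(key, _seen) + deep_sizeof(value, _seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            size += deep_sizeof(item, _seen)
    return size


# A list that contains itself: the second visit contributes 0 bytes.
cyclic = [1, 2, 3]
cyclic.append(cyclic)
cyclic_size = deep_sizeof(cyclic)

# A nested dict payload similar in shape to a meta cell (values are made up).
nested = {"meta": {"fps": 30, "tags": ["a", "b"]}}
nested_size = deep_sizeof(nested)
```

Tracking identities in a set is what makes shared references count once; the trade-off is that the total depends on traversal order when the same object is reachable from several cells.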

The profiler relies on the pandas memory_usage(deep=True) API for contiguous-dtype columns and the index, and falls back to recursive sizing only for object payloads, where pandas reports only the pointer cost.
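The gap between pandas' accounting and the real payload size can be seen on a small frame. The sketch below is illustrative only: the column names and payload shapes are invented, and it assumes pandas and NumPy are installed. For object cells, deep=True adds each cell's own __sizeof__, which for a dict does not include the values it references.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one numeric column, one object column of dict payloads.
cells = [{"frames": np.zeros(1000, dtype="float64")} for _ in range(4)]
df = pd.DataFrame({
    "value": np.arange(4, dtype="float64"),
    "payload": pd.Series(cells, dtype=object),
})

shallow = df.memory_usage(index=False, deep=False)
deep = df.memory_usage(index=False, deep=True)

# Contiguous dtypes: shallow and deep agree (4 rows x 8 bytes).
float_shallow, float_deep = int(shallow["value"]), int(deep["value"])

# Object dtype: the shallow figure is just the array of pointers.
pointer_bytes = int(shallow["payload"])
pandas_deep_payload = int(deep["payload"])

# Walking the payloads explicitly recovers the array buffers themselves.
payload_bytes = sum(cell["frames"].nbytes for cell in df["payload"])
```

Here payload_bytes is 32000 (four arrays of 1000 float64 values), while the figures pandas reports for the payload column stay far below that.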

Output

profile_materialized() returns a MaterializedMemoryReport that can render:

summary(): concise human-readable summary string
verbose(): detailed human-readable breakdown
to_dict(): JSON-serialisable nested dict
to_json(path): write JSON file to disk
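Because the report classes documented below are dataclasses, one plausible way to obtain a JSON-serialisable nested dict is dataclasses.asdict. The sketch below shows that pattern on two hypothetical stand-in classes (MiniColumn and MiniReport); it is an assumption about the general approach, not the library's implementation.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class MiniColumn:  # hypothetical stand-in, not ColumnMemory
    column: str
    estimated_total_bytes: int
    value_type_counts: dict[str, int] = field(default_factory=dict)


@dataclass
class MiniReport:  # hypothetical stand-in, not MaterializedMemoryReport
    shape: tuple[int, int]
    estimated_total_bytes: int
    columns: list[MiniColumn]

    def to_dict(self) -> dict:
        # asdict recurses into nested dataclasses, lists, tuples and dicts.
        return asdict(self)


report = MiniReport(
    shape=(2, 1),
    estimated_total_bytes=4096,
    columns=[MiniColumn("meta", 4096, {"dict": 2})],
)
as_dict = report.to_dict()

# Round-trip through JSON; tuples come back as lists.
encoded = json.dumps(as_dict, indent=2)
decoded = json.loads(encoded)
```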

CLI usage:

mesofield datakit profile path/to/materialized.pkl --verbose

class mesofield.datakit.profile.CellMemory

Bases: object

CellMemory(row: str, column: str, source: str, feature: str, value_type: str, estimated_bytes: int)

__init__(row, column, source, feature, value_type, estimated_bytes)

Parameters:

row (str)
column (str)
source (str)
feature (str)
value_type (str)
estimated_bytes (int)

Return type:

None

class mesofield.datakit.profile.ColumnMemory

Bases: object

ColumnMemory(column: str, source: str, feature: str, dtype: str, n_total: int, n_non_null: int, n_null: int, pandas_deep_bytes: int, pointer_array_bytes: int, object_payload_bytes: int, estimated_total_bytes: int, avg_non_null_cell_bytes: int, max_non_null_cell_bytes: int, value_type_counts: dict[str, int] = <factory>, nested_dataframe_inner_bytes: dict[str, int] = <factory>, nested_dict_key_bytes: dict[str, int] = <factory>)

__init__(column, source, feature, dtype, n_total, n_non_null, n_null, pandas_deep_bytes, pointer_array_bytes, object_payload_bytes, estimated_total_bytes, avg_non_null_cell_bytes, max_non_null_cell_bytes, value_type_counts=<factory>, nested_dataframe_inner_bytes=<factory>, nested_dict_key_bytes=<factory>)

Parameters:

column (str)
source (str)
feature (str)
dtype (str)
n_total (int)
n_non_null (int)
n_null (int)
pandas_deep_bytes (int)
pointer_array_bytes (int)
object_payload_bytes (int)
estimated_total_bytes (int)
avg_non_null_cell_bytes (int)
max_non_null_cell_bytes (int)
value_type_counts (dict[str, int])
nested_dataframe_inner_bytes (dict[str, int])
nested_dict_key_bytes (dict[str, int])

Return type:

None

class mesofield.datakit.profile.MaterializedMemoryReport

Bases: object

MaterializedMemoryReport(source_path: str | None, shape: tuple[int, int], index_names: tuple[str, ...], column_levels: tuple[str, ...], index_bytes: int, columns_index_bytes: int, pandas_deep_total_bytes: int, estimated_total_bytes: int, columns: list[ColumnMemory], by_source_bytes: dict[str, int], by_source_feature_bytes: dict[str, int], largest_cells: list[CellMemory])

__init__(source_path, shape, index_names, column_levels, index_bytes, columns_index_bytes, pandas_deep_total_bytes, estimated_total_bytes, columns, by_source_bytes, by_source_feature_bytes, largest_cells)

Parameters:

source_path (str | None)
shape (tuple[int, int])
index_names (tuple[str, ...])
column_levels (tuple[str, ...])
index_bytes (int)
columns_index_bytes (int)
pandas_deep_total_bytes (int)
estimated_total_bytes (int)
columns (list[ColumnMemory])
by_source_bytes (dict[str, int])
by_source_feature_bytes (dict[str, int])
largest_cells (list[CellMemory])

Return type:

None

mesofield.datakit.profile.profile_materialized(target, *, top_n_cells=20, source_path=None)

Build a MaterializedMemoryReport from a DataFrame or saved file.

Parameters

target: A materialized pandas.DataFrame or a path-like pointing to a .pkl / .pickle file produced by datakit.
top_n_cells: How many of the largest individual object cells to keep in the report.
source_path: Optional override for the path recorded in the report (useful when passing an already-loaded DataFrame).

Parameters:

target (DataFrame | str | Path)
top_n_cells (int)
source_path (str | None)

Return type:

MaterializedMemoryReport