profile

Memory / storage profiler for materialized datakit datasets.

Profiles a materialized pandas.DataFrame (or a saved .pkl / .h5 file containing one), with first-class support for the nested object payloads typical of datakit outputs:

  • Object-dtype columns containing embedded pandas.DataFrame instances (recursed into, per inner column).

  • Object-dtype columns containing numpy.ndarray payloads.

  • meta columns (or any object cell) holding nested dict payloads — recursively profiled key by key.

  • Arbitrary Python objects, sized via sys.getsizeof with cycle-safe recursion into containers (list/tuple/set/dict).
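The last bullet can be illustrated with a minimal sketch of cycle-safe recursive sizing. This is not the profiler's actual implementation; the function name `deep_sizeof` and its `_seen` parameter are hypothetical, and only the four container types named above are followed.

```python
import sys


def deep_sizeof(obj, _seen=None):
    """Estimate bytes held by obj, following list/tuple/set/dict contents.

    Hypothetical helper for illustration; not part of mesofield.
    """
    if _seen is None:
        _seen = set()
    # Skipping ids already visited stops cycles and avoids
    # double-counting objects that are referenced more than once.
    if id(obj) in _seen:
        return 0
    _seen.add(id(obj))

    total = sys.getsizeof(obj)  # shallow size of the object itself
    if isinstance(obj, dict):
        for key, value in obj.items():
            total += deep_sizeof(key, _seen) + deep_sizeof(value, _seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            total += deep_sizeof(item, _seen)
    return total


payload = {"frames": [1.5, 2.5, 3.5], "tags": ("a", "b")}
payload["self"] = payload  # deliberate cycle: must not recurse forever

print(deep_sizeof(payload) > sys.getsizeof(payload))  # True
```

Tracking visited objects by `id()` is what makes the recursion terminate on self-referential containers, and it also means a shared object is charged only once.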

The profiler relies on the pandas memory_usage(deep=True) API for contiguous-dtype columns and the index. It falls back to recursive sizing only for object-dtype payloads, where the column's own buffer holds just pointers and pandas does not recurse into nested containers.
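To see why object columns need separate treatment, the following sketch compares what pandas reports for an object-dtype Series against the size of the data nested inside its cells. The column contents are made up for illustration, and the exact byte counts vary by platform and pandas version, so none are shown.

```python
import sys

import pandas as pd

# Two object cells, each a dict wrapping a 10,000-element list.
cells = [{"trace": list(range(10_000))} for _ in range(2)]
col = pd.Series(cells, dtype=object)

pointer_bytes = col.memory_usage(index=False, deep=False)
pandas_deep_bytes = col.memory_usage(index=False, deep=True)

# deep=False counts only the pointer array: one pointer per row.
print(pointer_bytes == len(col) * col.to_numpy().itemsize)  # True

# deep=True adds a per-cell size, but it does not follow the reference
# from each dict to the list it holds, so the list is left out.
per_cell_extra = pandas_deep_bytes - pointer_bytes
nested_list_bytes = sys.getsizeof(cells[0]["trace"])
print(per_cell_extra < nested_list_bytes)  # True
```

The gap between `per_cell_extra` and the nested list's size is the part a recursive sizing pass has to recover.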

Output

profile_materialized() returns a MaterializedMemoryReport with four output methods:

  • summary() — concise human-readable summary string

  • verbose() — detailed human-readable breakdown

  • to_dict() — JSON-serialisable nested dict

  • to_json(path) — write JSON file to disk

CLI usage:

mesofield datakit profile path/to/materialized.pkl --verbose
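
The same workflow from Python might look like the sketch below. It uses only the names documented on this page (profile_materialized, summary(), to_json(path), to_dict()); the wrapper function `profile_and_export` is hypothetical, and it is an assumption that to_json accepts a plain string path.

```python
def profile_and_export(pickle_path, json_path, top_n_cells=10):
    """Profile a saved dataset, print the summary, and write a JSON report.

    Hypothetical wrapper for illustration; not part of mesofield.
    """
    # Imported inside the function so that defining this helper does not
    # require mesofield to be importable.
    from mesofield.datakit.profile import profile_materialized

    report = profile_materialized(pickle_path, top_n_cells=top_n_cells)
    print(report.summary())    # concise human-readable summary
    report.to_json(json_path)  # assumption: a str path is accepted
    return report.to_dict()    # JSON-serialisable nested dict
```

Calling `profile_and_export("path/to/materialized.pkl", "profile.json")` would mirror the CLI invocation above without the --verbose output.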

class mesofield.datakit.profile.CellMemory

Bases: object

CellMemory(row: str, column: str, source: str, feature: str, value_type: str, estimated_bytes: int)

__init__(row, column, source, feature, value_type, estimated_bytes)
Parameters:
  • row (str)

  • column (str)

  • source (str)

  • feature (str)

  • value_type (str)

  • estimated_bytes (int)

Return type:

None

class mesofield.datakit.profile.ColumnMemory

Bases: object

ColumnMemory(column: str, source: str, feature: str, dtype: str, n_total: int, n_non_null: int, n_null: int, pandas_deep_bytes: int, pointer_array_bytes: int, object_payload_bytes: int, estimated_total_bytes: int, avg_non_null_cell_bytes: int, max_non_null_cell_bytes: int, value_type_counts: dict[str, int] = <factory>, nested_dataframe_inner_bytes: dict[str, int] = <factory>, nested_dict_key_bytes: dict[str, int] = <factory>)

__init__(column, source, feature, dtype, n_total, n_non_null, n_null, pandas_deep_bytes, pointer_array_bytes, object_payload_bytes, estimated_total_bytes, avg_non_null_cell_bytes, max_non_null_cell_bytes, value_type_counts=<factory>, nested_dataframe_inner_bytes=<factory>, nested_dict_key_bytes=<factory>)
Parameters:
  • column (str)

  • source (str)

  • feature (str)

  • dtype (str)

  • n_total (int)

  • n_non_null (int)

  • n_null (int)

  • pandas_deep_bytes (int)

  • pointer_array_bytes (int)

  • object_payload_bytes (int)

  • estimated_total_bytes (int)

  • avg_non_null_cell_bytes (int)

  • max_non_null_cell_bytes (int)

  • value_type_counts (dict[str, int])

  • nested_dataframe_inner_bytes (dict[str, int])

  • nested_dict_key_bytes (dict[str, int])

Return type:

None

class mesofield.datakit.profile.MaterializedMemoryReport

Bases: object

MaterializedMemoryReport(source_path: str | None, shape: tuple[int, int], index_names: tuple[str, ...], column_levels: tuple[str, ...], index_bytes: int, columns_index_bytes: int, pandas_deep_total_bytes: int, estimated_total_bytes: int, columns: list[ColumnMemory], by_source_bytes: dict[str, int], by_source_feature_bytes: dict[str, int], largest_cells: list[CellMemory])

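__init__(source_path, shape, index_names, column_levels, index_bytes, columns_index_bytes, pandas_deep_total_bytes, estimated_total_bytes, columns, by_source_bytes, by_source_feature_bytes, largest_cells)
Parameters:
  • source_path (str | None)

  • shape (tuple[int, int])

  • index_names (tuple[str, ...])

  • column_levels (tuple[str, ...])

  • index_bytes (int)

  • columns_index_bytes (int)

  • pandas_deep_total_bytes (int)

  • estimated_total_bytes (int)

  • columns (list[ColumnMemory])

  • by_source_bytes (dict[str, int])

  • by_source_feature_bytes (dict[str, int])

  • largest_cells (list[CellMemory])

Return type: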

None
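
The by_source_bytes and by_source_feature_bytes fields suggest per-column totals rolled up by source and by source/feature pair. How the profiler actually builds them is not documented here; the sketch below shows one plausible aggregation. `ColumnStub`, `roll_up`, the "source/feature" key format, and the sample source and feature names are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ColumnStub:
    """Hypothetical stand-in carrying four of ColumnMemory's fields."""
    column: str
    source: str
    feature: str
    estimated_total_bytes: int


def roll_up(columns):
    """Sum per-column byte estimates by source and by source/feature."""
    by_source = defaultdict(int)
    by_source_feature = defaultdict(int)
    for col in columns:
        by_source[col.source] += col.estimated_total_bytes
        # The "source/feature" key format is an assumption.
        key = f"{col.source}/{col.feature}"
        by_source_feature[key] += col.estimated_total_bytes
    return dict(by_source), dict(by_source_feature)


columns = [
    ColumnStub("trace", "camera", "dff", 4_000),
    ColumnStub("meta", "camera", "dff", 1_000),
    ColumnStub("speed", "encoder", "velocity", 500),
]
by_source, by_source_feature = roll_up(columns)
print(by_source)          # {'camera': 5000, 'encoder': 500}
print(by_source_feature)  # {'camera/dff': 5000, 'encoder/velocity': 500}
```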

mesofield.datakit.profile.profile_materialized(target, *, top_n_cells=20, source_path=None)

Build a MaterializedMemoryReport from a DataFrame or saved file.

Parameters

target

A materialized pandas.DataFrame or a path-like pointing to a .pkl / .pickle file produced by datakit.

top_n_cells

How many of the largest individual object cells to keep in the report.

source_path

Optional override for the path recorded in the report (useful when passing an already-loaded DataFrame).

Return type:

MaterializedMemoryReport
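
As an illustration of what top_n_cells controls, the sketch below keeps the N largest cells from a collection of per-cell size records using heapq.nlargest. This is one common way to do it, not necessarily how profile_materialized does; `CellStub` and `largest_cells` are hypothetical names, and the sample rows are invented.

```python
import heapq
from dataclasses import dataclass


@dataclass
class CellStub:
    """Hypothetical stand-in with the same fields as CellMemory."""
    row: str
    column: str
    source: str
    feature: str
    value_type: str
    estimated_bytes: int


def largest_cells(cells, top_n_cells=20):
    """Return the top_n_cells records with the highest estimated_bytes."""
    # nlargest accepts any iterable and returns results sorted
    # from largest to smallest.
    return heapq.nlargest(top_n_cells, cells, key=lambda c: c.estimated_bytes)


sizes = [120, 9_000, 64, 4_500]
cells = [
    CellStub(f"r{i}", "trace", "camera", "dff", "ndarray", size)
    for i, size in enumerate(sizes)
]
top = largest_cells(cells, top_n_cells=2)
print([c.estimated_bytes for c in top])  # [9000, 4500]
```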