profile
Memory / storage profiler for materialized datakit datasets.
Profiles a materialized pandas.DataFrame (or a saved .pkl / .h5
file containing one), with first-class support for the nested object
payloads typical of datakit outputs:
- Object-dtype columns containing embedded pandas.DataFrame instances (recursed into, per inner column).
- Object-dtype columns containing numpy.ndarray payloads.
- meta columns (or any object cell) holding nested dict payloads, recursively profiled key by key.
- Arbitrary Python objects, sized via sys.getsizeof with cycle-safe recursion into containers (list/tuple/set/dict).
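The last item, cycle-safe recursive sizing, can be illustrated with a minimal sketch. This is not the profiler's actual implementation: the function name `deep_sizeof` and the choice to track visited objects by `id()` are assumptions made for illustration only.

```python
import sys


def deep_sizeof(obj, _seen=None):
    """Estimate the size of obj plus everything it contains (hypothetical helper)."""
    if _seen is None:
        _seen = set()
    oid = id(obj)
    if oid in _seen:
        # Already counted: prevents infinite recursion on cycles and
        # double counting of objects shared between containers.
        return 0
    _seen.add(oid)

    size = sys.getsizeof(obj)  # shallow size of this object only
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += deep_sizeof(key, _seen) + deep_sizeof(value, _seen)
    elif isinstance(obj, (list, tuple, set, frozenset)):
        for item in obj:
            size += deep_sizeof(item, _seen)
    return size


# A list that contains itself is sized once and does not recurse forever.
cyclic = []
cyclic.append(cyclic)
cyclic_size = deep_sizeof(cyclic)

# A nested payload is larger than its shallow size.
payload = {"trace": [1.0, 2.0, 3.0], "tags": ("a", "b")}
payload_size = deep_sizeof(payload)
```

Tracking identities rather than values keeps the traversal safe for unhashable objects such as lists and dicts.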
The profiler leans on the pandas memory_usage(deep=True) API for
contiguous-dtype columns and the index, and only falls back to recursive
sizing for object payloads where pandas reports only the pointer cost.
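To see why object columns need extra work, the following sketch compares shallow and deep pandas accounting on a small frame. The column names and values are made up for illustration. For fixed-width dtypes the two figures agree; for object columns, `deep=True` adds each cell's own reported size, which for containers such as dict or list does not cover the objects they reference.

```python
import pandas as pd

# Hypothetical frame: one fixed-width column and one object column of dicts.
df = pd.DataFrame(
    {
        "x": [1, 2, 3],
        "meta": [{"a": list(range(100))}, {"b": "text"}, None],
    }
)

# Shallow accounting: object columns are counted as an array of pointers.
shallow = df.memory_usage(index=False, deep=False)

# Deep accounting: pandas also interrogates each object cell for its size.
deep = df.memory_usage(index=False, deep=True)

for column in df.columns:
    print(f"{column}: shallow={shallow[column]} bytes, deep={deep[column]} bytes")
```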
Output
profile_materialized() returns a MaterializedMemoryReport
that can render:
- summary(): concise human-readable summary string
- verbose(): detailed human-readable breakdown
- to_dict(): JSON-serialisable nested dict
- to_json(path): write a JSON file to disk
CLI usage:
mesofield datakit profile path/to/materialized.pkl --verbose
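The same report can presumably be produced from Python. The sketch below only uses the function signature and report methods documented on this page. The wrapper name `profile_frame` and the `"in-memory"` label are hypothetical, and the import is deferred so the function can be defined without mesofield installed.

```python
import pandas as pd


def profile_frame(df: pd.DataFrame, json_path: str, top_n_cells: int = 20) -> dict:
    """Profile an in-memory DataFrame and write the report to json_path."""
    # Imported lazily; calling this function requires mesofield to be installed.
    from mesofield.datakit.profile import profile_materialized

    report = profile_materialized(
        df,
        top_n_cells=top_n_cells,
        source_path="in-memory",  # hypothetical label recorded in the report
    )
    print(report.summary())    # concise human-readable overview
    report.to_json(json_path)  # full breakdown written to disk
    return report.to_dict()    # JSON-serialisable nested dict
```

Passing `source_path` is optional here; it only affects the path recorded in the report when the input is an already-loaded DataFrame.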
- class mesofield.datakit.profile.CellMemory
Bases: object
CellMemory(row: str, column: str, source: str, feature: str, value_type: str, estimated_bytes: int)
- class mesofield.datakit.profile.ColumnMemory
Bases: object
ColumnMemory(column: str, source: str, feature: str, dtype: str, n_total: int, n_non_null: int, n_null: int, pandas_deep_bytes: int, pointer_array_bytes: int, object_payload_bytes: int, estimated_total_bytes: int, avg_non_null_cell_bytes: int, max_non_null_cell_bytes: int, value_type_counts: dict[str, int] = <factory>, nested_dataframe_inner_bytes: dict[str, int] = <factory>, nested_dict_key_bytes: dict[str, int] = <factory>)
- __init__(column, source, feature, dtype, n_total, n_non_null, n_null, pandas_deep_bytes, pointer_array_bytes, object_payload_bytes, estimated_total_bytes, avg_non_null_cell_bytes, max_non_null_cell_bytes, value_type_counts=<factory>, nested_dataframe_inner_bytes=<factory>, nested_dict_key_bytes=<factory>)
- class mesofield.datakit.profile.MaterializedMemoryReport
Bases: object
MaterializedMemoryReport(source_path: str | None, shape: tuple[int, int], index_names: tuple[str, ...], column_levels: tuple[str, ...], index_bytes: int, columns_index_bytes: int, pandas_deep_total_bytes: int, estimated_total_bytes: int, columns: list[ColumnMemory], by_source_bytes: dict[str, int], by_source_feature_bytes: dict[str, int], largest_cells: list[CellMemory])
- __init__(source_path, shape, index_names, column_levels, index_bytes, columns_index_bytes, pandas_deep_total_bytes, estimated_total_bytes, columns, by_source_bytes, by_source_feature_bytes, largest_cells)
- Parameters:
source_path (str | None)
index_bytes (int)
columns_index_bytes (int)
pandas_deep_total_bytes (int)
estimated_total_bytes (int)
columns (list[ColumnMemory])
largest_cells (list[CellMemory])
- Return type:
None
- mesofield.datakit.profile.profile_materialized(target, *, top_n_cells=20, source_path=None)
Build a MaterializedMemoryReport from a DataFrame or saved file.
Parameters
- target
A materialized pandas.DataFrame or a path-like pointing to a .pkl/.pickle file produced by datakit.
- top_n_cells
How many of the largest individual object cells to keep in the report.
- source_path
Optional override for the path recorded in the report (useful when passing an already-loaded DataFrame).
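One way to picture what top_n_cells controls is a bounded "largest N" selection over per-cell size estimates. The sketch below is an assumption about the general technique, not the library's code: `CellSize` is a hypothetical stand-in that borrows a few field names from CellMemory, and `heapq.nlargest` is just one reasonable way to keep the top entries.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class CellSize:
    """Hypothetical stand-in for a per-cell size record."""
    row: str
    column: str
    estimated_bytes: int


def largest_cells(cells, top_n=20):
    """Return the top_n cells by estimated size, largest first."""
    return heapq.nlargest(top_n, cells, key=lambda cell: cell.estimated_bytes)


# Made-up sizes for illustration.
cells = [
    CellSize("r0", "meta", 120),
    CellSize("r1", "meta", 4096),
    CellSize("r2", "traces", 800),
]
top = largest_cells(cells, top_n=2)
```

Keeping only the top N entries bounds the size of the report even when the dataset has many object cells.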