datakit#
Datakit package entry point.
- class mesofield.datakit.Dataset[source]#
Bases: object
Discovery + materialization for a BIDS-style experiment hierarchy.
- classmethod from_directory(root, *, sources=None, prefer_processed=True, include_task_level=True)[source]#
Discover one or more experiment roots and build a Dataset.
Pass a single path or a sequence of paths; multiple roots are concatenated row-wise.
- property meta: dict#
Provenance metadata for the running datakit package.
The same dictionary is attached to the materialized DataFrame as
df.attrs["datakit"] so it persists through pickle round-trips.
- include(*, subject=None, session=None, task=None, source=None)[source]#
Keep only rows/sources matching the given filters (AND-combined).
Each keyword accepts either a single string or a sequence of strings;
None (the default) means “no constraint on this axis”. All provided filters are combined with logical AND, so adding a keyword narrows the result. Returns a new Dataset; the original is unchanged, so calls chain naturally.
Examples#
>>> ds.include(subject="STREHAB07")  # one subject
>>> ds.include(subject=["STREHAB07", "STREHAB08"])  # multiple subjects
>>> ds.include(session="ses-05", task="task-widefield")
>>> ds.include(source=["dataqueue", "treadmill"])  # drop other sources
>>> ds.include(subject="STREHAB07").include(task="task-movies")  # chain
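The AND-combined filter semantics described above can be sketched without datakit itself. The following is only an illustration under stated assumptions: `include_rows` and the list-of-dicts inventory are hypothetical stand-ins, not the library's implementation, and the `source` axis is left out for brevity.

```python
def _as_set(value):
    """None stays None (no constraint); a single string becomes a one-item set."""
    if value is None:
        return None
    return {value} if isinstance(value, str) else set(value)


def include_rows(rows, *, subject=None, session=None, task=None):
    """Return a NEW list holding the rows that satisfy every provided filter."""
    filters = {
        "subject": _as_set(subject),
        "session": _as_set(session),
        "task": _as_set(task),
    }
    return [
        row
        for row in rows
        # A row survives only if it passes each axis that has a constraint.
        if all(allowed is None or row[axis] in allowed
               for axis, allowed in filters.items())
    ]


# Hypothetical inventory rows, reusing the identifiers from the examples above.
inventory = [
    {"subject": "STREHAB07", "session": "ses-05", "task": "task-widefield"},
    {"subject": "STREHAB07", "session": "ses-05", "task": "task-movies"},
    {"subject": "STREHAB08", "session": "ses-05", "task": "task-widefield"},
]

one_subject = include_rows(inventory, subject="STREHAB07")
narrowed = include_rows(one_subject, task="task-movies")  # chaining narrows further
both_subjects = include_rows(
    inventory, subject=["STREHAB07", "STREHAB08"], task="task-widefield"
)
```

Because each call returns a new list, the original `inventory` is left untouched, which mirrors the "returns a new Dataset" behavior documented for `include()`.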
- exclude(*, subject=None, session=None, task=None, source=None)[source]#
Drop rows/sources matching the given filters.
Like include(), every keyword accepts a string or a sequence of strings, and combining keywords narrows what gets removed. Behavior depends on which axes are provided:
source only: drop those source columns globally.
row axes only (subject/session/task): drop matching rows.
both: NaN out only the listed source columns within matching rows; rows and other sources are preserved.
Examples#
>>> ds.exclude(subject="STREHAB07")  # drop a subject
>>> ds.exclude(source="psychopy")  # drop a source globally
>>> ds.exclude(session=["ses-01", "ses-02"])  # drop multiple sessions
>>> ds.exclude(subject="STREHAB07", source="pupil_dlc")
... # blank pupil_dlc only for STREHAB07; other rows/sources untouched
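The three behaviors of `exclude()` are easier to compare side by side. The sketch below is an assumption-laden stand-in, not datakit code: `exclude_sketch`, the `paths` dictionary, and the file names are hypothetical, only the `subject` row axis is modeled, and `None` plays the role of NaN.

```python
import copy


def _as_set(value):
    """None stays None (axis not provided); a single string becomes a one-item set."""
    if value is None:
        return None
    return {value} if isinstance(value, str) else set(value)


def exclude_sketch(rows, *, subject=None, source=None):
    """Mimic the three documented modes on a copy of the inventory."""
    rows = copy.deepcopy(rows)  # never mutate the caller's inventory
    subjects, sources = _as_set(subject), _as_set(source)
    if subjects is None and sources is None:
        return rows
    if subjects is None:
        # source only: drop those source columns everywhere
        for row in rows:
            for tag in sources:
                row["paths"].pop(tag, None)
        return rows
    if sources is None:
        # row axes only: drop the matching rows
        return [row for row in rows if row["subject"] not in subjects]
    # both: blank the listed sources inside matching rows, keep everything else
    for row in rows:
        if row["subject"] in subjects:
            for tag in sources:
                if tag in row["paths"]:
                    row["paths"][tag] = None
    return rows


# Hypothetical inventory; the file names are placeholders.
inventory = [
    {"subject": "STREHAB07", "paths": {"psychopy": "a.csv", "pupil_dlc": "a.h5"}},
    {"subject": "STREHAB08", "paths": {"psychopy": "b.csv", "pupil_dlc": "b.h5"}},
]

no_subject = exclude_sketch(inventory, subject="STREHAB07")
no_source = exclude_sketch(inventory, source="psychopy")
blanked = exclude_sketch(inventory, subject="STREHAB07", source="pupil_dlc")
```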
- head(n=3)[source]#
Return a new Dataset containing only the first n rows.
Convenience for quick tests; equivalent to slicing the inventory with .iloc[:n] while preserving sources and roots.
- select(subject, session, task=None)[source]#
Return a new Dataset containing exactly one inventory row.
Positional shorthand for include(subject=..., session=..., task=...), intended for the common “give me this one cell” use case. Unlike include(), all arguments must be single strings; for multi-value or partial filtering use include() directly:
>>> ds.select("STREHAB07", "ses-05", "task-widefield")  # one cell
>>> ds.include(subject="STREHAB07", session="ses-05")  # all tasks for that session
Raises KeyError if no row matches and ValueError if more than one row matches (e.g. when task is omitted on a task-level inventory and multiple tasks exist for the session).
- materialize(*, strict=True, return_errors=False, progress=False)[source]#
Build the materialized DataFrame.
With strict=True (default) the first error is raised with full (subject, session, task, source, path) context. With strict=False failed cells are blanked; pass return_errors=True to also receive the long-format error frame produced by validate().
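The difference between strict and non-strict materialization can be sketched as a loop that either re-raises with context or records the failure and moves on. Everything here is an assumption made for illustration: `materialize_sketch`, the tuple keys, and the use of `int` as a loader are hypothetical, and a plain list of dicts stands in for the long-format error frame.

```python
def materialize_sketch(cells, loader, *, strict=True):
    """Load every cell; raise on the first failure (strict) or blank it (non-strict)."""
    values, errors = {}, []
    for key, path in cells.items():
        try:
            values[key] = loader(path)
        except Exception as exc:
            if strict:
                # Attach the (subject, session, task, source, path) context.
                raise RuntimeError(f"failed to load {key} from {path!r}") from exc
            values[key] = None  # blank the failed cell
            errors.append({"cell": key, "path": path, "error": repr(exc)})
    return values, errors


# Hypothetical cells keyed by (subject, session, task, source); int() is the loader.
cells = {
    ("STREHAB07", "ses-05", "task-widefield", "good"): "3",
    ("STREHAB07", "ses-05", "task-widefield", "bad"): "not-a-number",
}

values, errors = materialize_sketch(cells, int, strict=False)

try:
    materialize_sketch(cells, int, strict=True)
    strict_raised = False
except RuntimeError:
    strict_raised = True
```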
- class mesofield.datakit.DataSource[source]#
Bases: object
Base class for a file-backed data source.
- requires: ClassVar[Tuple[str, ...]] = ()#
Tag names of upstream sources whose loaded streams should be made available via LoadContext.dependencies. Soft contract: a missing or failed dependency yields None in dependencies[tag]; sources are responsible for either degrading gracefully or raising.
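A source that honors this soft contract might look roughly like the following. The sketch does not import datakit: `FakeContext` and `TreadmillSketch` are hypothetical stand-ins for `LoadContext` and a `DataSource` subclass, and the returned dictionary and its `t0` key are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, ClassVar, Dict, Optional, Tuple


@dataclass
class FakeContext:
    """Stand-in for LoadContext that carries only the dependencies mapping."""
    dependencies: Dict[str, Optional[Any]] = field(default_factory=dict)


class TreadmillSketch:
    """Stand-in source that declares one upstream dependency."""
    requires: ClassVar[Tuple[str, ...]] = ("dataqueue",)

    def load(self, path, *, context=None):
        deps = context.dependencies if context is not None else {}
        queue = deps.get("dataqueue")  # None when missing or failed upstream
        if queue is None:
            # Degrade gracefully: return unaligned data instead of raising.
            return {"path": path, "aligned": False}
        return {"path": path, "aligned": True, "t0": queue["t0"]}


source = TreadmillSketch()
without_dep = source.load("treadmill.csv", context=FakeContext())
failed_dep = source.load("treadmill.csv", context=FakeContext({"dataqueue": None}))
with_dep = source.load("treadmill.csv", context=FakeContext({"dataqueue": {"t0": 12.5}}))
```

A source that cannot produce anything meaningful without its dependency would raise in the `queue is None` branch instead, which the contract also allows.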
- load(path, *, context=None)[source]#
Load data from the given path.
- Parameters:
path (Path)
context (LoadContext | None)
- Return type:
- class mesofield.datakit.LoadContext[source]#
Bases: object
Context object passed to every DataSource.load call.
Carries identity (subject/session/task), the inventory row for that cell (so sources can locate sibling files via path_for()), and any upstream sources that were loaded for the same cell as declared on DataSource.requires.
For backward compatibility with previous releases, when "dataqueue" is present in dependencies, the convenience attributes dataqueue_frame, dataqueue_meta, master_timeline, and experiment_window are populated from it. New sources should prefer reading from dependencies directly.
- require_path(tag)[source]#
Like path_for() but raises FileNotFoundError when missing.
- get_dependency(tag)[source]#
Return a previously-loaded dependency stream, or None if unavailable.
- Parameters:
tag (str)
- Return type:
LoadedStream | None
- require_dependency(tag)[source]#
Like get_dependency() but raises if the dependency is missing.
- Parameters:
tag (str)
- Return type:
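The relationship between the two accessors can be sketched as a lenient lookup plus a strict wrapper. `ContextSketch` is a hypothetical stand-in; in particular, the documentation does not say which exception `require_dependency()` raises, so the `LookupError` below is an assumption.

```python
class ContextSketch:
    """Stand-in for LoadContext exposing only the two dependency accessors."""

    def __init__(self, dependencies=None):
        self.dependencies = dict(dependencies or {})

    def get_dependency(self, tag):
        """Return the loaded stream for tag, or None if it is unavailable."""
        return self.dependencies.get(tag)

    def require_dependency(self, tag):
        """Same lookup, but an unavailable dependency is treated as an error."""
        stream = self.get_dependency(tag)
        if stream is None:
            # Assumed exception type; the real one is not documented here.
            raise LookupError(f"dependency {tag!r} is not available")
        return stream


context = ContextSketch({"dataqueue": "stream-object", "pupil_dlc": None})

optional = context.get_dependency("pupil_dlc")      # None, caller decides what to do
required = context.require_dependency("dataqueue")  # returns the stream

try:
    context.require_dependency("pupil_dlc")
    raised = False
except LookupError:
    raised = True
```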
- __init__(subject, session, task, inventory_row, dependencies=<factory>, master_timeline=None, experiment_window=None, dataqueue_frame=None, dataqueue_meta=None)#
- Parameters:
- Return type:
None
- class mesofield.datakit.LoadedStream[source]#
Bases: object
Hydrated data stream with timestamps and metadata.
- class mesofield.datakit.MaterializedMemoryReport[source]#
Bases: object
MaterializedMemoryReport(source_path: 'str | None', shape: 'tuple[int, int]', index_names: 'tuple[str, ...]', column_levels: 'tuple[str, ...]', index_bytes: 'int', columns_index_bytes: 'int', pandas_deep_total_bytes: 'int', estimated_total_bytes: 'int', columns: 'list[ColumnMemory]', by_source_bytes: 'dict[str, int]', by_source_feature_bytes: 'dict[str, int]', largest_cells: 'list[CellMemory]')
- __init__(source_path, shape, index_names, column_levels, index_bytes, columns_index_bytes, pandas_deep_total_bytes, estimated_total_bytes, columns, by_source_bytes, by_source_feature_bytes, largest_cells)#
- Parameters:
source_path (str | None)
index_bytes (int)
columns_index_bytes (int)
pandas_deep_total_bytes (int)
estimated_total_bytes (int)
columns (list[ColumnMemory])
largest_cells (list[CellMemory])
- Return type:
None
- mesofield.datakit.build_meta()[source]#
Return a snapshot of provenance metadata for the running package.
built_at is regenerated on every call so embedded copies record the moment a dataset was materialized rather than when the module was imported.
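The point about regenerating `built_at` on each call can be shown with a small stand-in. `build_meta_sketch`, its `version` key, and the ISO-8601 UTC timestamp format are assumptions; only the `built_at` field name comes from the text above.

```python
from datetime import datetime, timezone


def build_meta_sketch(version="0.0.0"):
    """Return a fresh provenance dict; the timestamp is computed per call."""
    return {
        "version": version,  # placeholder value, not a real release number
        "built_at": datetime.now(timezone.utc).isoformat(),
    }


first = build_meta_sketch()
second = build_meta_sketch()

# Each snapshot is an independent dict with its own timestamp.
first_time = datetime.fromisoformat(first["built_at"])
second_time = datetime.fromisoformat(second["built_at"])
```

Computing the timestamp inside the function, rather than once at import time as a module-level constant, is what lets an embedded copy record when the dataset was actually materialized.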
- mesofield.datakit.get_version()[source]#
Best-effort version string for the running datakit package.
- Return type:
- mesofield.datakit.inspect_sources(inventory_or_dataset, sources=None)[source]#
Return a per-source coverage summary for an inventory.
The returned DataFrame is indexed by source tag with columns present, total, missing, and coverage (fraction of rows with a non-null path). Accepts either a Dataset or a raw inventory DataFrame.
When sources is omitted, every registered tag found in the inventory’s columns is reported.
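The four summary columns follow directly from counting non-null paths per source. The sketch below uses plain dictionaries instead of a DataFrame; `coverage_summary`, the inventory rows, and the file names are hypothetical, and treating an empty inventory as 0.0 coverage is an assumption.

```python
def coverage_summary(rows, sources):
    """Per-source counts of present/missing paths and the covered fraction."""
    total = len(rows)
    summary = {}
    for tag in sources:
        present = sum(1 for row in rows if row.get(tag) is not None)
        summary[tag] = {
            "present": present,
            "total": total,
            "missing": total - present,
            # Fraction of rows with a non-null path; 0.0 for an empty inventory.
            "coverage": present / total if total else 0.0,
        }
    return summary


# Hypothetical inventory: one dict per row, None marks a missing path.
inventory = [
    {"dataqueue": "q1.csv", "treadmill": "t1.csv"},
    {"dataqueue": "q2.csv", "treadmill": None},
    {"dataqueue": "q3.csv", "treadmill": "t3.csv"},
    {"dataqueue": None, "treadmill": None},
]

report = coverage_summary(inventory, ["dataqueue", "treadmill"])
```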
- mesofield.datakit.load(root, *, sources=None, prefer_processed=True, include_task_level=True, progress=True, strict=True, return_errors=False)[source]#
One-shot discovery + materialization.
Equivalent to Dataset.from_directory(root, ...).materialize(...). Use Dataset.from_directory() directly when you need to filter (.include/.exclude) before materializing.
- mesofield.datakit.load_dataset(path, *, hdf_key='dataset')[source]#
Load a previously materialized dataset back into a DataFrame.
The consumer-side inverse of Dataset.save(). Reads a .pkl/.pickle (pandas pickle) or .h5/.hdf5 (HDF5) artefact and returns the materialized pandas.DataFrame, including the df.attrs provenance metadata embedded at save time.
Raises#
- FileNotFoundError
If path does not exist.
- ValueError
If the file extension is not a supported dataset format.
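The documented error cases suggest an extension-based dispatch, which can be sketched as follows. `detect_format` is a hypothetical helper that stops short of reading the file; checking existence before the extension, and matching suffixes case-sensitively, are assumptions rather than documented behavior.

```python
import tempfile
from pathlib import Path

PICKLE_SUFFIXES = {".pkl", ".pickle"}
HDF_SUFFIXES = {".h5", ".hdf5"}


def detect_format(path):
    """Classify a saved dataset by extension, raising as the docs describe."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(path)
    if path.suffix in PICKLE_SUFFIXES:
        return "pickle"
    if path.suffix in HDF_SUFFIXES:
        return "hdf5"
    raise ValueError(f"unsupported dataset format: {path.suffix!r}")


# Exercise the three outcomes against throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    saved = Path(tmp) / "dataset.pkl"
    saved.write_bytes(b"")
    other = Path(tmp) / "dataset.csv"
    other.write_text("")

    kind = detect_format(saved)
    outcomes = {}
    for name, candidate in [("missing", Path(tmp) / "missing.h5"),
                            ("unsupported", other)]:
        try:
            detect_format(candidate)
        except (FileNotFoundError, ValueError) as exc:
            outcomes[name] = type(exc).__name__
```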
- mesofield.datakit.load_path(tag, path)[source]#
Ad-hoc single-file load via the registered source for tag.
Builds a minimal LoadContext so sources without requires or sibling-path lookups can be exercised directly. Sources declaring dependencies will receive None for them in context.dependencies and must either degrade gracefully or raise.
- Parameters:
- Return type:
- mesofield.datakit.open_shell(target=None, *, hdf_key='dataset')[source]#
Open an interactive shell pre-loaded with a datakit object.
Returns a process exit code (0 on success).
- mesofield.datakit.profile_materialized(target, *, top_n_cells=20, source_path=None)[source]#
Build a MaterializedMemoryReport from a DataFrame or saved file.
Parameters#
- target
A materialized pandas.DataFrame or a path-like pointing to a .pkl/.pickle file produced by datakit.
- top_n_cells
How many of the largest individual object cells to keep in the report.
- source_path
Optional override for the path recorded in the report (useful when passing an already-loaded DataFrame).
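The role of `top_n_cells` can be illustrated with a generic "keep the N largest" pass. This is a rough sketch and not how datakit measures memory: `largest_cells_sketch` is hypothetical, the cells are a plain dictionary rather than a DataFrame, and `sys.getsizeof` reports a shallow, platform-dependent size.

```python
import heapq
import sys


def largest_cells_sketch(cells, top_n_cells=20):
    """Return (size_in_bytes, key) pairs for the largest cells, biggest first."""
    sized = ((sys.getsizeof(value), key) for key, value in cells.items())
    return heapq.nlargest(top_n_cells, sized)


# Hypothetical object cells of clearly different sizes.
cells = {
    "small": b"x" * 10,
    "large": b"x" * 1000,
    "medium": b"x" * 100,
}

top_two = largest_cells_sketch(cells, top_n_cells=2)
top_keys = [key for _size, key in top_two]
```

Capping the list keeps the report compact even when a DataFrame holds many object cells.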