datakit#

Datakit package entry point.

class mesofield.datakit.Dataset[source]#

Bases: object

Discovery + materialization for a BIDS-style experiment hierarchy.

__init__(inventory, *, sources=None, roots=())[source]#

Parameters:

inventory (DataFrame)
sources (Iterable[str] | None)
roots (Sequence[str | Path])

Return type:

None

classmethod from_directory(root, *, sources=None, prefer_processed=True, include_task_level=True)[source]#

Discover one or more experiment roots and build a Dataset.

Pass a single path or a sequence of paths; multiple roots are concatenated row-wise.

Parameters:

root (str | Path | Sequence[str | Path])
sources (Iterable[str] | None)
prefer_processed (bool)
include_task_level (bool | None)

Return type:

Dataset

property notes: list[str]#

Free-form notes attached to this dataset.

add_note(note)[source]#

Append a free-form note; persisted via df.attrs['datakit_notes'].

Parameters: note (str)
Return type: Dataset

property meta: dict#

Provenance metadata for the running datakit package.

The same dictionary is attached to the materialized DataFrame as df.attrs["datakit"] so it persists through pickle round-trips.

include(*, subject=None, session=None, task=None, source=None)[source]#

Keep only rows/sources matching the given filters (AND-combined).

Each keyword accepts either a single string or a sequence of strings; None (the default) means “no constraint on this axis”. All provided filters are combined with logical AND, so adding a keyword narrows the result. Returns a new Dataset — the original is unchanged, so calls chain naturally.

Examples#

>>> ds.include(subject="STREHAB07")              # one subject
>>> ds.include(subject=["STREHAB07", "STREHAB08"])  # multiple subjects
>>> ds.include(session="ses-05", task="task-widefield")
>>> ds.include(source=["dataqueue", "treadmill"])    # drop other sources
>>> ds.include(subject="STREHAB07").include(task="task-movies")  # chain

Parameters:

subject (str | Sequence[str] | None)
session (str | Sequence[str] | None)
task (str | Sequence[str] | None)
source (str | Sequence[str] | None)

Return type:

Dataset
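
The AND-combined filtering described above can be pictured with a small stand-alone sketch. It is not datakit's implementation: the toy inventory is a plain list of dicts, and the helper names `_as_set` and `include_rows` are hypothetical, chosen only to show how a single string and a sequence of strings are treated alike and how each extra keyword narrows the result.

```python
from typing import Optional, Sequence, Union

Filter = Optional[Union[str, Sequence[str]]]


def _as_set(value: Filter) -> Optional[frozenset]:
    """Normalize a filter: None stays None, a bare string becomes a one-item set."""
    if value is None:
        return None
    if isinstance(value, str):
        return frozenset([value])
    return frozenset(value)


def include_rows(rows, *, subject: Filter = None, session: Filter = None, task: Filter = None):
    """Return a new list holding the rows that satisfy every provided filter."""
    wanted = {
        "subject": _as_set(subject),
        "session": _as_set(session),
        "task": _as_set(task),
    }
    return [
        row
        for row in rows
        # An axis left at None places no constraint on the row.
        if all(allowed is None or row[axis] in allowed for axis, allowed in wanted.items())
    ]


# Toy inventory (hypothetical rows, reusing identifiers from the examples above).
inventory = [
    {"subject": "STREHAB07", "session": "ses-05", "task": "task-widefield"},
    {"subject": "STREHAB07", "session": "ses-05", "task": "task-movies"},
    {"subject": "STREHAB08", "session": "ses-05", "task": "task-widefield"},
]

one_subject = include_rows(inventory, subject="STREHAB07")
narrowed = include_rows(inventory, subject="STREHAB07", task="task-movies")
both_subjects = include_rows(inventory, subject=["STREHAB07", "STREHAB08"])
```

Because each call builds a new list, the input stays untouched, which mirrors the documented "returns a new Dataset" behavior that makes chaining safe.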

exclude(*, subject=None, session=None, task=None, source=None)[source]#

Drop rows/sources matching the given filters.

Like include(), every keyword accepts a string or a sequence of strings, and combining keywords narrows what gets removed. Behavior depends on which axes are provided:

source only — drop those source columns globally.
row axes only (subject/session/task) — drop matching rows.
both — NaN out only the listed source columns within matching rows; rows and other sources are preserved.

Examples#

>>> ds.exclude(subject="STREHAB07")                   # drop a subject
>>> ds.exclude(source="psychopy")                     # drop a source globally
>>> ds.exclude(session=["ses-01", "ses-02"])          # drop multiple sessions
>>> ds.exclude(subject="STREHAB07", source="pupil_dlc")
... # blank pupil_dlc only for STREHAB07; other rows/sources untouched

Parameters:

subject (str | Sequence[str] | None)
session (str | Sequence[str] | None)
task (str | Sequence[str] | None)
source (str | Sequence[str] | None)

Return type:

Dataset
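
The three cases listed above (source only, row axes only, both) can be sketched on a toy inventory. This is an illustration under stated assumptions rather than the library's code: rows are dicts whose extra keys stand in for source columns, `None` stands in for NaN, and `exclude_rows` is a hypothetical helper.

```python
import copy

ROW_AXES = ("subject", "session", "task")


def _as_set(value):
    """None stays None; a bare string becomes a one-item set."""
    if value is None:
        return None
    return frozenset([value]) if isinstance(value, str) else frozenset(value)


def exclude_rows(rows, *, subject=None, session=None, task=None, source=None):
    """Toy version of the three exclude() modes; always returns new rows."""
    row_filters = {
        axis: _as_set(value)
        for axis, value in zip(ROW_AXES, (subject, session, task))
        if value is not None
    }
    sources = _as_set(source)
    rows = copy.deepcopy(rows)  # never mutate the caller's inventory

    def matches(row):
        return all(row[axis] in allowed for axis, allowed in row_filters.items())

    if sources and not row_filters:
        # Source only: drop those source columns from every row.
        return [{k: v for k, v in row.items() if k not in sources} for row in rows]
    if row_filters and not sources:
        # Row axes only: drop the matching rows.
        return [row for row in rows if not matches(row)]
    if row_filters and sources:
        # Both: blank the listed sources inside matching rows only.
        for row in rows:
            if matches(row):
                for tag in sources:
                    if tag in row:
                        row[tag] = None
    return rows


inventory = [
    {"subject": "STREHAB07", "session": "ses-01", "task": "task-movies",
     "psychopy": "a.csv", "pupil_dlc": "a.h5"},
    {"subject": "STREHAB08", "session": "ses-01", "task": "task-movies",
     "psychopy": "b.csv", "pupil_dlc": "b.h5"},
]

no_psychopy = exclude_rows(inventory, source="psychopy")
no_sub07 = exclude_rows(inventory, subject="STREHAB07")
blanked = exclude_rows(inventory, subject="STREHAB07", source="pupil_dlc")
```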

head(n=3)[source]#

Return a new Dataset containing only the first n rows.

Convenience for quick tests; equivalent to slicing the inventory with .iloc[:n] while preserving sources and roots.

Parameters: n (int)
Return type: Dataset

select(subject, session, task=None)[source]#

Return a new Dataset containing exactly one inventory row.

Positional shorthand for include(subject=..., session=..., task=...) intended for the common “give me this one cell” use case. Unlike include(), all arguments must be single strings — for multi-value or partial filtering use include() directly:

>>> ds.select("STREHAB07", "ses-05", "task-widefield")  # one cell
>>> ds.include(subject="STREHAB07", session="ses-05")   # all tasks for that session

Raises KeyError if no row matches and ValueError if more than one row matches (e.g. when task is omitted on a task-level inventory and multiple tasks exist for the session).

Parameters:

subject (str)
session (str)
task (str | None)

Return type:

Dataset
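
The "exactly one row" contract can be illustrated with a stand-alone sketch. The helper `select_row` and the list-of-dicts inventory are hypothetical; only the KeyError / ValueError behavior is taken from the description above.

```python
def select_row(rows, subject, session, task=None):
    """Return the single matching row, mirroring the documented error contract."""
    matches = [
        row
        for row in rows
        if row["subject"] == subject
        and row["session"] == session
        and (task is None or row["task"] == task)
    ]
    if not matches:
        raise KeyError((subject, session, task))
    if len(matches) > 1:
        raise ValueError(f"{len(matches)} rows match {(subject, session, task)}")
    return matches[0]


inventory = [
    {"subject": "STREHAB07", "session": "ses-05", "task": "task-widefield"},
    {"subject": "STREHAB07", "session": "ses-05", "task": "task-movies"},
]

cell = select_row(inventory, "STREHAB07", "ses-05", "task-widefield")

# Omitting the task is ambiguous here because two tasks share the session.
try:
    select_row(inventory, "STREHAB07", "ses-05")
    ambiguous = False
except ValueError:
    ambiguous = True
```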

validate(*, progress=False)[source]#

Attempt to load every (cell, source) pair and report the status of each without raising.

Parameters: progress (bool)
Return type: DataFrame

materialize(*, strict=True, return_errors=False, progress=False)[source]#

Build the materialized DataFrame.

With strict=True (the default), the first error is raised with full (subject, session, task, source, path) context. With strict=False, failed cells are blanked; pass return_errors=True to also receive the long-format error frame produced by validate().

Parameters:

strict (bool)
return_errors (bool)
progress (bool)

Return type:

DataFrame | tuple[DataFrame, DataFrame]
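
The difference between strict and non-strict materialization can be sketched as follows. Everything here is a stand-in: `materialize_cells`, `toy_loader`, the dict-based cells, and the choice of RuntimeError are assumptions made for illustration, not datakit's actual types or exceptions.

```python
def materialize_cells(cells, loader, *, strict=True, return_errors=False):
    """Load every cell; raise on the first failure (strict) or blank it and record it."""
    values = {}
    errors = []
    for cell in cells:
        key = (cell["subject"], cell["session"], cell["task"], cell["source"])
        try:
            values[key] = loader(cell["path"])
        except Exception as exc:
            if strict:
                # Re-raise with the full identifying context attached.
                raise RuntimeError(f"failed to load {key} from {cell['path']}") from exc
            values[key] = None  # blank the failed cell
            errors.append({**cell, "error": repr(exc)})
    return (values, errors) if return_errors else values


def toy_loader(path):
    """Hypothetical loader that fails for any path starting with 'bad'."""
    if path.startswith("bad"):
        raise OSError(f"cannot read {path}")
    return f"loaded:{path}"


cells = [
    {"subject": "STREHAB07", "session": "ses-05", "task": "task-widefield",
     "source": "treadmill", "path": "good.csv"},
    {"subject": "STREHAB07", "session": "ses-05", "task": "task-widefield",
     "source": "psychopy", "path": "bad.csv"},
]

values, errors = materialize_cells(cells, toy_loader, strict=False, return_errors=True)
```

A common pattern is to run in non-strict mode first, inspect the error records, and switch to strict mode once the inventory is known to be clean.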

save(path, *, format=None, strict=True, progress=False, hdf_key='dataset')[source]#

Materialize and write to disk. Pickle by default; HDF5 via format="hdf5" or a .h5/.hdf5 suffix.

Parameters:

path (str | Path)
format (str | None)
strict (bool)
progress (bool)
hdf_key (str)

Return type:

Path
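
One way to picture the format rule is a small helper that prefers an explicit format and otherwise falls back to the file suffix. The name `infer_format` is hypothetical, and two details are assumptions not stated above: that an explicit format takes precedence over the suffix, and that suffix matching is case-insensitive.

```python
from pathlib import Path

HDF_SUFFIXES = {".h5", ".hdf5"}


def infer_format(path, format=None):
    """Return 'hdf5' or 'pickle' for a target path (illustrative only)."""
    if format is not None:
        return format  # assumption: an explicit choice wins over the suffix
    suffix = Path(path).suffix.lower()
    return "hdf5" if suffix in HDF_SUFFIXES else "pickle"


default_choice = infer_format("dataset.pkl")
by_suffix = infer_format("dataset.h5")
by_keyword = infer_format("dataset.bin", format="hdf5")
```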

class mesofield.datakit.DataSource[source]#

Bases: object

Base class for a file-backed data source.

requires: ClassVar[Tuple[str, ...]] = ()#

Tag names of upstream sources whose loaded streams should be made available via LoadContext.dependencies. Soft contract: a missing or failed dependency yields None in dependencies[tag]; sources are responsible for either degrading gracefully or raising.

load(path, *, context=None)[source]#

Load data from the given path.

Parameters:

path (Path)
context (LoadContext | None)

Return type:

LoadedStream

class mesofield.datakit.LoadContext[source]#

Bases: object

Context object passed to every DataSource.load call.

Carries identity (subject/session/task), the inventory row for that cell (so sources can locate sibling files via path_for()), and any upstream sources that were loaded for the same cell as declared on DataSource.requires.

For backward compatibility with previous releases, when "dataqueue" is present in dependencies, the convenience attributes dataqueue_frame, dataqueue_meta, master_timeline, and experiment_window are populated from it. New sources should prefer reading from dependencies directly.

path_for(tag)[source]#

Return the path stored in the inventory row for tag, or None.

Parameters: tag (str)
Return type: Path | None

require_path(tag)[source]#

Like path_for() but raises FileNotFoundError when missing.

Parameters: tag (str)
Return type: Path

get_dependency(tag)[source]#

Return a previously-loaded dependency stream, or None if unavailable.

Parameters: tag (str)
Return type: LoadedStream | None

require_dependency(tag)[source]#

Like get_dependency() but raises if the dependency is missing.

Parameters: tag (str)
Return type: LoadedStream
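
The soft dependency contract (a missing dependency is None, and each source decides whether to degrade or raise) can be sketched with stand-in classes. `ToyStream`, `ToyContext`, and `load_aligned` are hypothetical names that only mimic the documented shape of LoadedStream and LoadContext; the LookupError raised below is an assumption, since the exception type is not specified above.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping, Optional


@dataclass
class ToyStream:
    """Stand-in for LoadedStream: tag, timestamps, value, metadata."""
    tag: str
    t: list
    value: Any
    meta: dict


@dataclass
class ToyContext:
    """Stand-in for LoadContext, reduced to its dependency lookups."""
    dependencies: Mapping[str, Optional[ToyStream]] = field(default_factory=dict)

    def get_dependency(self, tag):
        # Missing and failed dependencies both surface as None.
        return self.dependencies.get(tag)

    def require_dependency(self, tag):
        stream = self.get_dependency(tag)
        if stream is None:
            raise LookupError(f"dependency {tag!r} is unavailable")
        return stream


def load_aligned(samples, context):
    """Toy source that shifts timestamps onto the dataqueue clock when it can."""
    dataqueue = context.get_dependency("dataqueue")
    offset = dataqueue.t[0] if dataqueue is not None else 0.0
    shifted = [sample + offset for sample in samples]
    return ToyStream("treadmill", shifted, samples, {"aligned": dataqueue is not None})


with_dq = ToyContext({"dataqueue": ToyStream("dataqueue", [10.0, 11.0], None, {})})
without_dq = ToyContext({"dataqueue": None})

aligned = load_aligned([0.0, 1.0], with_dq)
degraded = load_aligned([0.0, 1.0], without_dq)
```

A source that cannot produce meaningful output without its upstream stream would call the `require_dependency` variant instead and let the error propagate.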

__init__(subject, session, task, inventory_row, dependencies=<factory>, master_timeline=None, experiment_window=None, dataqueue_frame=None, dataqueue_meta=None)#

Parameters:

subject (str)
session (str)
task (str | None)
inventory_row (Mapping[str, Any])
dependencies (Mapping[str, LoadedStream | None])
master_timeline (ndarray | None)
experiment_window (tuple[float, float] | None)
dataqueue_frame (DataFrame | None)
dataqueue_meta (Mapping[str, Any] | None)

Return type:

None

class mesofield.datakit.LoadedStream[source]#

Bases: object

Hydrated data stream with timestamps and metadata.

__init__(tag, t, value, meta)#

Parameters:

tag (str)
t (ndarray)
value (object)
meta (dict)

Return type:

None

class mesofield.datakit.MaterializedMemoryReport[source]#

Bases: object

MaterializedMemoryReport(source_path: 'str | None', shape: 'tuple[int, int]', index_names: 'tuple[str, ...]', column_levels: 'tuple[str, ...]', index_bytes: 'int', columns_index_bytes: 'int', pandas_deep_total_bytes: 'int', estimated_total_bytes: 'int', columns: 'list[ColumnMemory]', by_source_bytes: 'dict[str, int]', by_source_feature_bytes: 'dict[str, int]', largest_cells: 'list[CellMemory]')

__init__(source_path, shape, index_names, column_levels, index_bytes, columns_index_bytes, pandas_deep_total_bytes, estimated_total_bytes, columns, by_source_bytes, by_source_feature_bytes, largest_cells)#

Parameters:

source_path (str | None)
shape (tuple[int, int])
index_names (tuple[str, ...])
column_levels (tuple[str, ...])
index_bytes (int)
columns_index_bytes (int)
pandas_deep_total_bytes (int)
estimated_total_bytes (int)
columns (list[ColumnMemory])
by_source_bytes (dict[str, int])
by_source_feature_bytes (dict[str, int])
largest_cells (list[CellMemory])

Return type:

None

mesofield.datakit.build_meta()[source]#

Return a snapshot of provenance metadata for the running package.

built_at is regenerated on every call so embedded copies record the moment a dataset was materialized rather than when the module was imported.

Return type: Dict[str, Any]

mesofield.datakit.get_version()[source]#

Best-effort version string for the running datakit package.

Return type: str

mesofield.datakit.inspect_sources(inventory_or_dataset, sources=None)[source]#

Return a per-source coverage summary for an inventory.

The returned DataFrame is indexed by source tag with columns present, total, missing, and coverage (fraction of rows with a non-null path). Accepts either a Dataset or a raw inventory DataFrame.

When sources is omitted, every registered tag found in the inventory’s columns is reported.

Parameters:

inventory_or_dataset (Dataset | DataFrame)
sources (Iterable[str] | None)

Return type:

DataFrame
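
The four reported columns follow directly from counting non-null paths per source. The sketch below shows that arithmetic on a toy inventory; `coverage_summary` is a hypothetical helper that returns a plain dict rather than the DataFrame the real function produces, and reporting 0.0 coverage for an empty inventory is an assumption.

```python
def coverage_summary(rows, sources):
    """Compute present / total / missing / coverage for each source tag."""
    total = len(rows)
    summary = {}
    for tag in sources:
        present = sum(1 for row in rows if row.get(tag) is not None)
        summary[tag] = {
            "present": present,
            "total": total,
            "missing": total - present,
            # Assumption: an empty inventory reports 0.0 rather than dividing by zero.
            "coverage": present / total if total else 0.0,
        }
    return summary


# Toy inventory: each row maps a source tag to a path, or None when absent.
inventory = [
    {"treadmill": "a.csv", "psychopy": "a.psy"},
    {"treadmill": "b.csv", "psychopy": None},
    {"treadmill": "c.csv", "psychopy": None},
    {"treadmill": None, "psychopy": None},
]

summary = coverage_summary(inventory, ["treadmill", "psychopy"])
```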

mesofield.datakit.load(root, *, sources=None, prefer_processed=True, include_task_level=True, progress=True, strict=True, return_errors=False)[source]#

One-shot discovery + materialization.

Equivalent to Dataset.from_directory(root, ...).materialize(...). Use Dataset.from_directory() directly when you need to filter (.include / .exclude) before materializing.

Parameters:

root (str | Path | Sequence[str | Path])
sources (Iterable[str] | None)
prefer_processed (bool)
include_task_level (bool | None)
progress (bool)
strict (bool)
return_errors (bool)

Return type:

DataFrame | tuple[DataFrame, DataFrame]
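
When filtering is needed before materializing, the two-step form might look like the sketch below. It uses only the calls and keywords documented on this page; the wrapper name `load_subject` is hypothetical, and the import is deferred so the snippet can be imported without mesofield installed (calling it does require the package and a real experiment root).

```python
def load_subject(root, subject):
    """Discover `root`, keep one subject, and materialize without raising."""
    # Deferred import: only needed when the function is actually called.
    from mesofield.datakit import Dataset

    dataset = Dataset.from_directory(root).include(subject=subject)
    # With return_errors=True the documented return value is a
    # (materialized frame, long-format error frame) pair.
    frame, errors = dataset.materialize(strict=False, return_errors=True)
    return frame, errors
```

Without the filtering step, the same result for a whole root is what `load(root, strict=False, return_errors=True)` is documented to return in one call.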

mesofield.datakit.load_dataset(path, *, hdf_key='dataset')[source]#

Load a previously materialized dataset back into a DataFrame.

The consumer-side inverse of Dataset.save(). Reads a .pkl / .pickle (pandas pickle) or .h5 / .hdf5 (HDF5) artefact and returns the materialized pandas.DataFrame — including the df.attrs provenance metadata embedded at save time.

Raises#

FileNotFoundError: If path does not exist.
ValueError: If the file extension is not a supported dataset format.

Parameters:

path (str | Path)
hdf_key (str)

Return type:

DataFrame
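
The two documented failure modes can be pictured as a pre-flight check that runs before any file is read. `resolve_reader` is a hypothetical helper, and checking existence before the extension is an assumption about ordering, not something stated above.

```python
import tempfile
from pathlib import Path

# Extension-to-format table taken from the supported suffixes listed above.
_READERS = {".pkl": "pickle", ".pickle": "pickle", ".h5": "hdf5", ".hdf5": "hdf5"}


def resolve_reader(path):
    """Return the format name for `path`, or raise like load_dataset is documented to."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(path)
    try:
        return _READERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"unsupported dataset format: {path.suffix!r}") from None


with tempfile.TemporaryDirectory() as tmp:
    saved = Path(tmp) / "dataset.pkl"
    saved.write_bytes(b"")  # empty placeholder; only the name matters here
    reader = resolve_reader(saved)

    other = Path(tmp) / "notes.txt"
    other.write_text("not a dataset")
    try:
        resolve_reader(other)
        unsupported = None
    except ValueError as exc:
        unsupported = str(exc)
```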

mesofield.datakit.load_path(tag, path)[source]#

Ad-hoc single-file load via the registered source for tag.

Builds a minimal LoadContext so sources without requires or sibling-path lookups can be exercised directly. Sources declaring dependencies will receive None for them in context.dependencies and must either degrade gracefully or raise.

Parameters:

tag (str)
path (str | Path)

Return type:

LoadedStream

mesofield.datakit.open_shell(target=None, *, hdf_key='dataset')[source]#

Open an interactive shell pre-loaded with a datakit object.

Returns a process exit code (0 on success).

Parameters:

target (str | Path | None)
hdf_key (str)

Return type:

int

mesofield.datakit.profile_materialized(target, *, top_n_cells=20, source_path=None)[source]#

Build a MaterializedMemoryReport from a DataFrame or saved file.

Parameters#

target: A materialized pandas.DataFrame or a path-like pointing to a .pkl / .pickle file produced by datakit.
top_n_cells: How many of the largest individual object cells to keep in the report.
source_path: Optional override for the path recorded in the report (useful when passing an already-loaded DataFrame).

Parameters:

target (DataFrame | str | Path)
top_n_cells (int)
source_path (str | None)

Return type:

MaterializedMemoryReport

Subpackages#

sources
