core#

Core orchestration for datakit.

Provides Dataset — the user-facing entry point that wraps a discovered file inventory and materializes it via the source registry.

Failure policy:

materialize(strict=True) (default) raises on the first error with (subject, session, task, source, path) context.
materialize(strict=False) continues past errors and (with return_errors=True) returns a long-format error DataFrame.
validate() runs every cell, never raises, and returns the same long-format DataFrame.
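
The failure policy can be illustrated with a small stand-alone sketch. It does not call datakit: `run_cells`, `fake_loader`, and the dictionary-based error rows are hypothetical names chosen for illustration, and the real error report is a DataFrame rather than a list of dicts.

```python
def run_cells(cells, load_cell, *, strict=True):
    """Load every cell; raise on the first error (strict) or collect errors."""
    values, errors = {}, []
    for cell in cells:
        subject, session, task, source, path = cell
        try:
            values[cell] = load_cell(path)
        except Exception as exc:
            if strict:
                # Strict mode: stop immediately and attach the full cell context.
                raise RuntimeError(
                    f"subject={subject} session={session} task={task} "
                    f"source={source} path={path}: {exc}"
                ) from exc
            # Lenient mode: blank the failed cell, record it, and keep going.
            values[cell] = None
            errors.append({
                "subject": subject, "session": session, "task": task,
                "source": source, "path": path, "error": str(exc),
            })
    return values, errors


def fake_loader(path):
    """Hypothetical loader that fails on files ending in '.bad'."""
    if path.endswith(".bad"):
        raise ValueError("unreadable file")
    return f"loaded:{path}"


cells = [
    ("S1", "ses-01", "task-a", "treadmill", "a.csv"),
    ("S1", "ses-01", "task-a", "pupil", "b.bad"),
]
values, errors = run_cells(cells, fake_loader, strict=False)
```

With `strict=False` the second cell is blanked and one error row is recorded; with `strict=True` the same input raises at that cell instead.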

class mesofield.datakit.core.Dataset[source]#

Bases: object

Discovery + materialization for a BIDS-style experiment hierarchy.

__init__(inventory, *, sources=None, roots=())[source]#

Parameters:

inventory (DataFrame)
sources (Iterable[str] | None)
roots (Sequence[str | Path])

Return type:

None

classmethod from_directory(root, *, sources=None, prefer_processed=True, include_task_level=True)[source]#

Discover one or more experiment roots and build a Dataset.

Pass a single path or a sequence of paths; multiple roots are concatenated row-wise.

Parameters:

root (str | Path | Sequence[str | Path])
sources (Iterable[str] | None)
prefer_processed (bool)
include_task_level (bool | None)

Return type:

Dataset

property notes: list[str]#

Free-form notes attached to this dataset.

add_note(note)[source]#

Append a free-form note; persisted via df.attrs['datakit_notes'].

Parameters:

note (str)

Return type:

Dataset

property meta: dict#

Provenance metadata for the running datakit package.

The same dictionary is attached to the materialized DataFrame as df.attrs["datakit"] so it persists through pickle round-trips.

include(*, subject=None, session=None, task=None, source=None)[source]#

Keep only rows/sources matching the given filters (AND-combined).

Each keyword accepts either a single string or a sequence of strings; None (the default) means “no constraint on this axis”. All provided filters are combined with logical AND, so adding a keyword narrows the result. Returns a new Dataset — the original is unchanged, so calls chain naturally.

Examples#

>>> ds.include(subject="STREHAB07")              # one subject
>>> ds.include(subject=["STREHAB07", "STREHAB08"])  # multiple subjects
>>> ds.include(session="ses-05", task="task-widefield")
>>> ds.include(source=["dataqueue", "treadmill"])    # drop other sources
>>> ds.include(subject="STREHAB07").include(task="task-movies")  # chain

Parameters:

subject (str | Sequence[str] | None)
session (str | Sequence[str] | None)
task (str | Sequence[str] | None)
source (str | Sequence[str] | None)

Return type:

Dataset
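
The AND-combined filtering described for include() can be sketched in plain Python. This is an illustration of the semantics only, not the library's implementation: `include_rows` and `as_set` are hypothetical helpers, the inventory is modeled as a list of dicts, and only the row axes (subject, session, task) are covered.

```python
def as_set(value):
    """Normalize None / str / sequence of str to None or a set of strings."""
    if value is None:
        return None
    if isinstance(value, str):
        return {value}
    return set(value)


def include_rows(rows, *, subject=None, session=None, task=None):
    """Return a new list holding the rows that match every provided filter."""
    filters = {
        "subject": as_set(subject),
        "session": as_set(session),
        "task": as_set(task),
    }
    return [
        row for row in rows
        # None means "no constraint on this axis"; all other filters must match.
        if all(allowed is None or row[axis] in allowed
               for axis, allowed in filters.items())
    ]


rows = [
    {"subject": "STREHAB07", "session": "ses-05", "task": "task-widefield"},
    {"subject": "STREHAB07", "session": "ses-05", "task": "task-movies"},
    {"subject": "STREHAB08", "session": "ses-05", "task": "task-widefield"},
]
one_subject = include_rows(rows, subject="STREHAB07")
combined = include_rows(rows, subject="STREHAB07", task="task-movies")
chained = include_rows(include_rows(rows, subject="STREHAB07"), task="task-movies")
```

Because every filter is ANDed, passing two keywords in one call and chaining two single-keyword calls select the same rows, and the input list is left unchanged.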

exclude(*, subject=None, session=None, task=None, source=None)[source]#

Drop rows/sources matching the given filters.

Like include(), every keyword accepts a string or a sequence of strings, and combining keywords narrows what gets removed. Behavior depends on which axes are provided:

source only — drop those source columns globally.
row axes only (subject/session/task) — drop matching rows.
both — NaN out only the listed source columns within matching rows; rows and other sources are preserved.

Examples#

>>> ds.exclude(subject="STREHAB07")                   # drop a subject
>>> ds.exclude(source="psychopy")                     # drop a source globally
>>> ds.exclude(session=["ses-01", "ses-02"])          # drop multiple sessions
>>> ds.exclude(subject="STREHAB07", source="pupil_dlc")
... # blank pupil_dlc only for STREHAB07; other rows/sources untouched

Parameters:

subject (str | Sequence[str] | None)
session (str | Sequence[str] | None)
task (str | Sequence[str] | None)
source (str | Sequence[str] | None)

Return type:

Dataset
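
The three exclude() modes can be sketched as follows. This is a stand-alone illustration under stated assumptions, not the library's code: `exclude_rows` is a hypothetical helper, the inventory is a list of dicts whose extra keys are source columns, and `None` stands in for the NaN that the real method writes.

```python
import copy


def as_set(value):
    """Normalize None / str / sequence of str to None or a set of strings."""
    if value is None:
        return None
    return {value} if isinstance(value, str) else set(value)


def exclude_rows(rows, *, subject=None, session=None, task=None, source=None):
    """Drop rows, drop source columns, or blank sources within matching rows."""
    given = (("subject", subject), ("session", session), ("task", task))
    row_filters = {axis: as_set(v) for axis, v in given if v is not None}
    sources = as_set(source)
    out = copy.deepcopy(rows)  # never mutate the caller's inventory

    def matches(row):
        return all(row[axis] in allowed for axis, allowed in row_filters.items())

    if sources and not row_filters:
        # Source only: remove those columns from every row.
        return [{k: v for k, v in row.items() if k not in sources} for row in out]
    if row_filters and not sources:
        # Row axes only: remove the matching rows.
        return [row for row in out if not matches(row)]
    if row_filters and sources:
        # Both: blank only the listed sources inside matching rows.
        for row in out:
            if matches(row):
                for tag in sources:
                    if tag in row:
                        row[tag] = None
    return out


rows = [
    {"subject": "STREHAB07", "session": "ses-01", "task": "task-a",
     "psychopy": "p1.csv", "pupil_dlc": "d1.h5"},
    {"subject": "STREHAB08", "session": "ses-01", "task": "task-a",
     "psychopy": "p2.csv", "pupil_dlc": "d2.h5"},
]
no_psychopy = exclude_rows(rows, source="psychopy")
no_subject = exclude_rows(rows, subject="STREHAB07")
blanked = exclude_rows(rows, subject="STREHAB07", source="pupil_dlc")
```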

head(n=3)[source]#

Return a new Dataset containing only the first n rows.

Convenience for quick tests; equivalent to slicing the inventory with .iloc[:n] while preserving sources and roots.

Parameters:

n (int)

Return type:

Dataset

select(subject, session, task=None)[source]#

Return a new Dataset containing exactly one inventory row.

Positional shorthand for include(subject=..., session=..., task=...) intended for the common “give me this one cell” use case. Unlike include(), all arguments must be single strings — for multi-value or partial filtering use include() directly:

>>> ds.select("STREHAB07", "ses-05", "task-widefield")  # one cell
>>> ds.include(subject="STREHAB07", session="ses-05")   # all tasks for that session

Raises KeyError if no row matches and ValueError if more than one row matches (e.g. when task is omitted on a task-level inventory and multiple tasks exist for the session).

Parameters:

subject (str)
session (str)
task (str | None)

Return type:

Dataset
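
The exactly-one-row contract of select() can be sketched in a few lines. `select_one` is a hypothetical stand-in operating on a list of dicts; it mirrors the documented KeyError / ValueError behavior but returns the row itself rather than a new Dataset.

```python
def select_one(rows, subject, session, task=None):
    """Return the single row matching the identifiers, or raise."""
    matches = [
        row for row in rows
        if row["subject"] == subject
        and row["session"] == session
        and (task is None or row["task"] == task)
    ]
    if not matches:
        raise KeyError((subject, session, task))
    if len(matches) > 1:
        # Typically happens when task is omitted and several tasks exist.
        raise ValueError(f"{len(matches)} rows match; pass task to disambiguate")
    return matches[0]


rows = [
    {"subject": "STREHAB07", "session": "ses-05", "task": "task-widefield"},
    {"subject": "STREHAB07", "session": "ses-05", "task": "task-movies"},
]
cell = select_one(rows, "STREHAB07", "ses-05", "task-widefield")
```

Omitting `task` here would match both rows and raise ValueError, which is the case the documentation warns about for task-level inventories.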

validate(*, progress=False)[source]#

Run every (cell, source); report status without raising.

Parameters:

progress (bool)

Return type:

DataFrame

materialize(*, strict=True, return_errors=False, progress=False)[source]#

Build the materialized DataFrame.

With strict=True (default) the first error is raised with full (subject, session, task, source, path) context. With strict=False failed cells are blanked; pass return_errors=True to also receive the long-format error frame produced by validate().

Parameters:

strict (bool)
return_errors (bool)
progress (bool)

Return type:

DataFrame | tuple[DataFrame, DataFrame]

save(path, *, format=None, strict=True, progress=False, hdf_key='dataset')[source]#

Materialize and write to disk. Pickle by default; HDF5 via format="hdf5" or a .h5/.hdf5 suffix.

Parameters:

path (str | Path)
format (str | None)
strict (bool)
progress (bool)
hdf_key (str)

Return type:

Path
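
The format rule for save() (pickle unless HDF5 is requested explicitly or implied by the suffix) could be resolved along these lines. `resolve_format` is a hypothetical helper, not a datakit function; treating an explicit `format` as taking precedence over the suffix, and matching suffixes case-insensitively, are assumptions of this sketch.

```python
from pathlib import Path

HDF_SUFFIXES = {".h5", ".hdf5"}


def resolve_format(path, format=None):
    """Pick 'pickle' or 'hdf5' following the rule described for save()."""
    if format is not None:
        # Assumption: an explicit format wins over the file suffix.
        return format.lower()
    if Path(path).suffix.lower() in HDF_SUFFIXES:
        return "hdf5"
    return "pickle"  # default when nothing indicates HDF5


chosen = [resolve_format(p) for p in ("out.pkl", "out.h5", "out.hdf5", "out")]
```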

class mesofield.datakit.core.LoadContext[source]#

Bases: object

Context object passed to every DataSource.load call.

Carries identity (subject/session/task), the inventory row for that cell (so sources can locate sibling files via path_for()), and any upstream sources that were loaded for the same cell as declared on DataSource.requires.

For backward compatibility with previous releases, when "dataqueue" is present in dependencies, the convenience attributes dataqueue_frame, dataqueue_meta, master_timeline, and experiment_window are populated from it. New sources should prefer reading from dependencies directly.

path_for(tag)[source]#

Return the path stored in the inventory row for tag, or None.

Parameters:

tag (str)

Return type:

Path | None

require_path(tag)[source]#

Like path_for() but raises FileNotFoundError when missing.

Parameters:

tag (str)

Return type:

Path

get_dependency(tag)[source]#

Return a previously-loaded dependency stream, or None if unavailable.

Parameters:

tag (str)

Return type:

LoadedStream | None

require_dependency(tag)[source]#

Like get_dependency() but raises if the dependency is missing.

Parameters:

tag (str)

Return type:

LoadedStream
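
The four accessors follow a "get or None" / "require or raise" pairing, which the sketch below mimics. `MiniContext` is a hypothetical, trimmed-down stand-in for LoadContext. Two details are assumptions: the documentation does not name the exception raised by require_dependency() (KeyError is used here), and this sketch only checks the inventory row, not whether the file exists on disk.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Mapping, Optional


@dataclass
class MiniContext:
    """Hypothetical stand-in holding an inventory row and loaded dependencies."""
    inventory_row: Mapping[str, Any]
    dependencies: Mapping[str, Any] = field(default_factory=dict)

    def path_for(self, tag: str) -> Optional[Path]:
        """Return the recorded path for tag, or None when absent or blank."""
        value = self.inventory_row.get(tag)
        return Path(value) if value is not None else None

    def require_path(self, tag: str) -> Path:
        """Like path_for(), but a missing path is an error."""
        path = self.path_for(tag)
        if path is None:
            raise FileNotFoundError(f"no path recorded for source {tag!r}")
        return path

    def get_dependency(self, tag: str) -> Any:
        """Return the loaded dependency for tag, or None if unavailable."""
        return self.dependencies.get(tag)

    def require_dependency(self, tag: str) -> Any:
        """Like get_dependency(), but a missing dependency is an error."""
        dep = self.get_dependency(tag)
        if dep is None:
            raise KeyError(f"dependency {tag!r} was not loaded")  # assumed type
        return dep


ctx = MiniContext(
    inventory_row={"treadmill": "sub-01/treadmill.csv", "pupil_dlc": None},
    dependencies={"dataqueue": None},
)
treadmill_path = ctx.path_for("treadmill")
```

A source that can work without a sibling file would call the `path_for` style accessor and branch on None, while one that cannot proceed would call the `require_` variant and let the error propagate.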

__init__(subject, session, task, inventory_row, dependencies=<factory>, master_timeline=None, experiment_window=None, dataqueue_frame=None, dataqueue_meta=None)#

Parameters:

subject (str)
session (str)
task (str | None)
inventory_row (Mapping[str, Any])
dependencies (Mapping[str, LoadedStream | None])
master_timeline (ndarray | None)
experiment_window (tuple[float, float] | None)
dataqueue_frame (DataFrame | None)
dataqueue_meta (Mapping[str, Any] | None)

Return type:

None

class mesofield.datakit.core.LoadedStream[source]#

Bases: object

Hydrated data stream with timestamps and metadata.

__init__(tag, t, value, meta)#

Parameters:

tag (str)
t (ndarray)
value (object)
meta (dict)

Return type:

None

mesofield.datakit.core.load(root, *, sources=None, prefer_processed=True, include_task_level=True, progress=True, strict=True, return_errors=False)[source]#

One-shot discovery + materialization.

Equivalent to Dataset.from_directory(root, ...).materialize(...). Use Dataset.from_directory() directly when you need to filter (.include / .exclude) before materializing.

Parameters:

root (str | Path | Sequence[str | Path])
sources (Iterable[str] | None)
prefer_processed (bool)
include_task_level (bool | None)
progress (bool)
strict (bool)
return_errors (bool)

Return type:

DataFrame | tuple[DataFrame, DataFrame]

mesofield.datakit.core.load_dataset(path, *, hdf_key='dataset')[source]#

Load a previously materialized dataset back into a DataFrame.

The consumer-side inverse of Dataset.save(). Reads a .pkl / .pickle (pandas pickle) or .h5 / .hdf5 (HDF5) artefact and returns the materialized pandas.DataFrame — including the df.attrs provenance metadata embedded at save time.

Raises#

FileNotFoundError: If path does not exist.
ValueError: If the file extension is not a supported dataset format.

Parameters:

path (str | Path)
hdf_key (str)

Return type:

DataFrame

mesofield.datakit.core.load_path(tag, path)[source]#

Ad-hoc single-file load via the registered source for tag.

Builds a minimal LoadContext so sources without requires or sibling-path lookups can be exercised directly. Sources declaring dependencies will receive None for them in context.dependencies and must either degrade gracefully or raise.

Parameters:

tag (str)
path (str | Path)

Return type:

LoadedStream

mesofield.datakit.core.inspect_sources(inventory_or_dataset, sources=None)[source]#

Return a per-source coverage summary for an inventory.

The returned DataFrame is indexed by source tag with columns present, total, missing, and coverage (fraction of rows with a non-null path). Accepts either a Dataset or a raw inventory DataFrame.

When sources is omitted, every registered tag found in the inventory’s columns is reported.

Parameters:

inventory_or_dataset (Dataset | DataFrame)
sources (Iterable[str] | None)

Return type:

DataFrame
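
The coverage columns described for inspect_sources() amount to simple counting, sketched here without pandas. `coverage_summary` is a hypothetical helper that returns a dict keyed by source tag instead of a DataFrame; reporting 0.0 coverage for an empty inventory is an assumption of the sketch.

```python
def coverage_summary(rows, sources):
    """Count non-null paths per source tag across all inventory rows."""
    total = len(rows)
    summary = {}
    for tag in sources:
        present = sum(1 for row in rows if row.get(tag) is not None)
        summary[tag] = {
            "present": present,
            "total": total,
            "missing": total - present,
            # Fraction of rows with a non-null path for this source.
            "coverage": present / total if total else 0.0,
        }
    return summary


rows = [
    {"treadmill": "a.csv", "pupil_dlc": "a.h5"},
    {"treadmill": "b.csv", "pupil_dlc": None},
    {"treadmill": "c.csv", "pupil_dlc": None},
    {"treadmill": "d.csv", "pupil_dlc": None},
]
summary = coverage_summary(rows, ["treadmill", "pupil_dlc"])
```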