Datasets

See Datasets for information about available datasets and instructions for retrieving NSRR data.

sleepecg.download_nsrr(db_slug, subfolder='', pattern='*', shallow=False, data_dir='.')

Recursively download files from NSRR.

Specify a subfolder and/or a filename pattern to filter results. Implemented according to the NSRR API specs.

Parameters:

  • db_slug (str) –

    Short identifier of a database, e.g. 'mesa'.

  • subfolder (str) –

    The folder at which to start the search, by default '' (i.e. the root folder).

  • pattern (str) –

    Glob-like pattern to select files (only applied to the basename, not the dirname), by default '*'.

  • shallow (bool) –

    If True, only download files in the given subfolder (i.e. no recursion), by default False.

  • data_dir (str | Path) –

    Directory where all datasets are stored, by default '.'.

Source code in sleepecg/io/nsrr.py
def download_nsrr(
    db_slug: str,
    subfolder: str = "",
    pattern: str = "*",
    shallow: bool = False,
    data_dir: str | Path = ".",
) -> None:
    """
    Recursively download files from [NSRR](https://sleepdata.org).

    Specify a subfolder and/or a filename pattern to filter results. Implemented according
    to the [NSRR API specs](https://github.com/nsrr/sleepdata.org/wiki/api-v1-datasets).

    Parameters
    ----------
    db_slug : str
        Short identifier of a database, e.g. `'mesa'`.
    subfolder : str, optional
        The folder at which to start the search, by default `''` (i.e. the root folder).
    pattern : str, optional
        Glob-like pattern to select files (only applied to the basename, not the dirname),
        by default `'*'`.
    shallow : bool, optional
        If `True`, only download files in the given subfolder (i.e. no recursion), by
        default `False`.
    data_dir : str | pathlib.Path, optional
        Directory where all datasets are stored, by default `'.'`.
    """
    db_dir = Path(data_dir) / db_slug

    download_url = _get_nsrr_url(db_slug)
    files_to_download = _list_nsrr(db_slug, subfolder, pattern, shallow)
    tqdm_description = f"Downloading {db_slug}/{subfolder or '.'}/{pattern}"

    for filepath, checksum in tqdm(files_to_download, desc=tqdm_description):
        target_filepath = db_dir / filepath
        url = download_url + filepath
        _download_nsrr_file(url, target_filepath, checksum)
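
Example: a minimal sketch that fetches a handful of MESA EDF files. The token, file pattern, and data directory are placeholders; a free NSRR account is required to obtain a token.

import sleepecg

sleepecg.set_nsrr_token("<your-token>")  # placeholder token
sleepecg.download_nsrr(
    "mesa",
    subfolder="polysomnography/edfs",
    pattern="mesa-sleep-000*.edf",  # placeholder pattern
    data_dir="./datasets",
)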

sleepecg.download_physionet(db_slug, requested_records, extensions, db_version='1.0.0', data_dir='.')

Download requested files from PhysioNet.

All files with the given extensions for record IDs in requested_records are downloaded from the PhysioNet database db_slug.

Parameters:

  • db_slug (str) –

    Short identifier of a database, e.g. 'mitdb'.

  • requested_records (list[str]) –

    Records with those IDs are downloaded.

  • extensions (Iterable[str]) –

    Files with those extensions are downloaded.

  • db_version (str) –

    Version of the database, by default '1.0.0'.

  • data_dir (str | Path) –

    Directory where all datasets are stored, by default '.'.

Source code in sleepecg/io/physionet.py
def download_physionet(
    db_slug: str,
    requested_records: list[str],
    extensions: Iterable[str],
    db_version: str = "1.0.0",
    data_dir: str | Path = ".",
) -> None:
    """
    Download requested files from PhysioNet.

    All files with `extensions` for record IDs in `requested_records` are downloaded from
    the PhysioNet database `db_slug`.

    Parameters
    ----------
    db_slug : str
        Short identifier of a database, e.g. `'mitdb'`.
    requested_records : list[str]
        Records with those IDs are downloaded.
    extensions : Iterable[str]
        Files with those extensions are downloaded.
    db_version : str, optional
        Version of the database, by default `'1.0.0'`.
    data_dir : str | pathlib.Path, optional
        Directory where all datasets are stored, by default `'.'`.
    """
    data_dir = Path(data_dir)
    checksums = _get_physionet_checksums(data_dir, db_slug, db_version)
    db_url = f"{_PHYSIONET_FILES_URL}/{db_slug}/{db_version}"

    for record_id in tqdm(requested_records, desc=f"Downloading {db_slug}"):
        for extension in extensions:
            if not extension.startswith("."):
                extension = "." + extension
            filepath = (data_dir / db_slug / record_id).with_suffix(extension)
            _download_file(
                f"{db_url}/{filepath.name}",
                filepath,
                checksums[filepath.name],
                checksum_type=_CHECKSUM_TYPE,
            )
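
Example: a minimal sketch that downloads two MITDB records (100 and 101 are valid MITDB record IDs; the extension list and data directory are illustrative).

import sleepecg

sleepecg.download_physionet(
    "mitdb",
    requested_records=["100", "101"],
    extensions=[".hea", ".dat", ".atr"],
    data_dir="./datasets",
)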

sleepecg.export_ecg_record(record, filename)

Export ECG record to CSV.

Parameters:

  • record (ECGRecord) –

    ECG record to export.

  • filename (str | Path) –

    File name to write to.

Source code in sleepecg/io/ecg_readers.py
def export_ecg_record(record: ECGRecord, filename: str | Path) -> None:
    """
    Export ECG record to CSV.

    Parameters
    ----------
    record : ECGRecord
        ECG record to export.
    filename : str | pathlib.Path
        File name to write to.
    """
    filename = Path(filename).with_suffix(".csv")

    rpeaks = np.zeros_like(record.ecg, dtype=int)
    rpeaks[record.annotation] = 1

    np.savetxt(
        filename,
        np.vstack((record.ecg, rpeaks)).T,
        fmt="%.3f,%d",
        header=f"# fs: {record.fs}Hz\necg,rpeak",
        comments="",
    )
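
Example: a minimal sketch with a synthetic record (all values are illustrative).

import numpy as np

from sleepecg import ECGRecord, export_ecg_record

record = ECGRecord(
    ecg=np.sin(np.linspace(0, 50, 2500)),  # 10 s of synthetic signal at 250 Hz
    fs=250,
    annotation=np.array([100, 350, 600]),  # fake beat indices
    id="demo",
)
export_ecg_record(record, "demo")  # the .csv suffix is appended automatically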

sleepecg.read_gudb(offline=False, data_dir=None)

Lazily read records from GUDB.

Required files are downloaded if not present in '<data_dir>/gudb'.

Parameters:

  • offline (bool) –

    If True, only local files will be used (i.e. no files will be downloaded), by default False.

  • data_dir (str | Path) –

    Directory where all datasets are stored. If None (default), the value will be taken from the configuration.

Yields:

  • ECGRecord

    Each element in the generator is of type ECGRecord and contains the ECG signal (.ecg), sampling frequency (.fs), annotated beat indices (.annotation), .lead, and .id.

Source code in sleepecg/io/ecg_readers.py
def read_gudb(
    offline: bool = False,
    data_dir: Optional[str | Path] = None,
) -> Iterator[ECGRecord]:
    """
    Lazily read records from [GUDB](https://berndporr.github.io/ECG-GUDB/).

    Required files are downloaded if not present in `'<data_dir>/gudb'`.

    Parameters
    ----------
    offline : bool, optional
        If `True`, only local files will be used (i.e. no files will be downloaded), by
        default `False`.
    data_dir : str | pathlib.Path, optional
        Directory where all datasets are stored. If `None` (default), the value will be
        taken from the configuration.

    Yields
    ------
    ECGRecord
        Each element in the generator is of type `ECGRecord` and contains the ECG signal
        (`.ecg`), sampling frequency (`.fs`), annotated beat indices (`.annotation`),
        `.lead`, and `.id`.
    """
    DB_URL = "https://berndporr.github.io/ECG-GUDB/experiment_data"
    EXPERIMENTS = ["sitting", "maths", "walking", "hand_bike", "jogging"]
    FS = 250

    if data_dir is None:
        data_dir = get_config("data_dir")

    db_dir = Path(data_dir).expanduser() / "gudb"

    for subject_id in tqdm(list(range(25)), desc="Reading GUDB"):
        for experiment in EXPERIMENTS:
            experiment_subdir = f"subject_{subject_id:02}/{experiment}"
            if not offline:
                for tsv_filename in (
                    "ECG.tsv",
                    "annotation_cs.tsv",
                    "annotation_cables.tsv",
                ):
                    ecg_file_url = f"{DB_URL}/{experiment_subdir}/{tsv_filename}"
                    target_filepath = db_dir / experiment_subdir / tsv_filename
                    try:
                        checksum = _GUDB_MD5[f"{experiment_subdir}/{tsv_filename}"]
                    except KeyError:
                        pass  # file not available
                    else:
                        _download_file(ecg_file_url, target_filepath, checksum, "md5")
            ecg_data = {
                lead: signal
                for lead, signal in zip(
                    ("chest", "II", "III"),
                    np.loadtxt(
                        db_dir / experiment_subdir / "ECG.tsv",
                        delimiter=" ",  # space-separated (contrary to what .tsv suggests)
                        usecols=(0, 1, 2),
                        unpack=True,
                    ),
                )
            }
            annotations_chest_file = db_dir / experiment_subdir / "annotation_cs.tsv"
            if annotations_chest_file.is_file():
                yield ECGRecord(
                    ecg=ecg_data["chest"].to_numpy(),
                    fs=FS,
                    annotation=np.loadtxt(annotations_chest_file, dtype=np.int32),
                    lead="chest",
                    id=f"{subject_id:02}_{experiment}",
                )
            annotations_cables_file = db_dir / experiment_subdir / "annotation_cables.tsv"
            if annotations_cables_file.is_file():
                annotations = np.loadtxt(annotations_cables_file, dtype=np.int32)
                for lead in ("II", "III"):
                    yield ECGRecord(
                        ecg=ecg_data[lead],
                        fs=FS,
                        annotation=annotations,
                        lead=lead,
                        id=f"{subject_id:02}_{experiment}",
                    )
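
Example: a minimal sketch that iterates over GUDB records (the data directory is a placeholder).

from sleepecg import read_gudb

# Records are yielded lazily; files are downloaded on demand unless offline=True.
for record in read_gudb(data_dir="./datasets"):
    print(record.id, record.lead, record.fs, len(record.annotation))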

sleepecg.read_ltdb(records_pattern='*', offline=False, data_dir=None)

Lazily read records from LTDB.

Parameters:

  • records_pattern (str) –

    Glob-like pattern to select record IDs, by default '*'.

  • offline (bool) –

    If True, only local files will be used (i.e. no files will be downloaded), by default False.

  • data_dir (str | Path) –

    Directory where all datasets are stored. If None (default), the value will be taken from the configuration.

Yields:

  • ECGRecord

    Each element in the generator is of type ECGRecord and contains the ECG signal (.ecg), sampling frequency (.fs), annotated beat indices (.annotation), .lead, and .id.

Source code in sleepecg/io/ecg_readers.py
def read_ltdb(
    records_pattern: str = "*",
    offline: bool = False,
    data_dir: Optional[str | Path] = None,
) -> Iterator[ECGRecord]:
    """
    Lazily read records from [LTDB](https://physionet.org/content/ltdb/).

    Parameters
    ----------
    records_pattern : str, optional
        Glob-like pattern to select record IDs, by default `'*'`.
    offline : bool, optional
        If `True`, only local files will be used (i.e. no files will be downloaded), by
        default `False`.
    data_dir : str | pathlib.Path, optional
        Directory where all datasets are stored. If `None` (default), the value will be
        taken from the configuration.

    Yields
    ------
    ECGRecord
        Each element in the generator is of type `ECGRecord` and contains the ECG signal
        (`.ecg`), sampling frequency (`.fs`), annotated beat indices (`.annotation`),
        `.lead`, and `.id`.
    """
    if data_dir is None:
        data_dir = get_config("data_dir")
    yield from _read_mitbih("ltdb", records_pattern, offline, data_dir)

sleepecg.read_mesa(records_pattern='*', heartbeats_source='annotation', offline=False, keep_edfs=False, data_dir=None)

Lazily read records from MESA.

Each MESA record consists of an .edf file containing raw polysomnography data and an .xml file containing annotated events. Since the entire MESA dataset requires about 385 GB of disk space, .edf files can be deleted after heartbeat times have been extracted. Heartbeat times are cached in an .npy file in <data_dir>/mesa/preprocessed/heartbeats.

Parameters:

  • records_pattern (str) –

    Glob-like pattern to select record IDs, by default '*'.

  • heartbeats_source ({'annotation', 'cached', 'ecg'}) –

    If 'annotation' (default), get heartbeat times from polysomnography/annotations-rpoints/<record_id>-rpoint.csv (not available for all records). If 'ecg', use sleepecg.detect_heartbeats() on the ECG contained in polysomnography/edfs/<record_id>.edf and cache the result in preprocessed/heartbeats/<record_id>.npy. If 'cached', get the cached heartbeats.

  • offline (bool) –

    If True, search for local files only instead of using the NSRR API, by default False.

  • keep_edfs (bool) –

    If False, remove .edf after heartbeat detection, by default False.

  • data_dir (str | Path) –

    Directory where all datasets are stored. If None (default), the value will be taken from the configuration.

Yields:

  • SleepRecord

    Each element in the generator is of type SleepRecord.

Source code in sleepecg/io/sleep_readers.py
def read_mesa(
    records_pattern: str = "*",
    heartbeats_source: str = "annotation",
    offline: bool = False,
    keep_edfs: bool = False,
    data_dir: Optional[str | Path] = None,
) -> Iterator[SleepRecord]:
    """
    Lazily read records from [MESA](https://sleepdata.org/datasets/mesa).

    Each MESA record consists of an `.edf` file containing raw polysomnography data and an
    `.xml` file containing annotated events. Since the entire MESA dataset requires about
    385 GB of disk space, `.edf` files can be deleted after heartbeat times have been
    extracted. Heartbeat times are cached in an `.npy` file in
    `<data_dir>/mesa/preprocessed/heartbeats`.

    Parameters
    ----------
    records_pattern : str, optional
        Glob-like pattern to select record IDs, by default `'*'`.
    heartbeats_source : {'annotation', 'cached', 'ecg'}, optional
        If `'annotation'` (default), get heartbeat times from
        `polysomnography/annotations-rpoints/<record_id>-rpoint.csv` (not available for all
        records). If `'ecg'`, use `sleepecg.detect_heartbeats()` on the ECG contained in
        `polysomnography/edfs/<record_id>.edf` and cache the result in
        `preprocessed/heartbeats/<record_id>.npy`. If `'cached'`, get the cached heartbeats.
    offline : bool, optional
        If `True`, search for local files only instead of using the NSRR API, by default
        `False`.
    keep_edfs : bool, optional
        If `False`, remove `.edf` after heartbeat detection, by default `False`.
    data_dir : str | pathlib.Path, optional
        Directory where all datasets are stored. If `None` (default), the value will be
        taken from the configuration.

    Yields
    ------
    SleepRecord
        Each element in the generator is of type `SleepRecord`.
    """
    from edfio import read_edf

    DB_SLUG = "mesa"
    ANNOTATION_DIRNAME = "polysomnography/annotations-events-nsrr"
    EDF_DIRNAME = "polysomnography/edfs"
    HEARTBEATS_DIRNAME = "preprocessed/heartbeats"
    RPOINTS_DIRNAME = "polysomnography/annotations-rpoints"

    GENDER_MAPPING = {0: Gender.FEMALE, 1: Gender.MALE}

    heartbeats_source_options = {"annotation", "cached", "ecg"}
    if heartbeats_source not in heartbeats_source_options:
        raise ValueError(
            f"Invalid value for parameter `heartbeats_source`: {heartbeats_source}, "
            f"possible options: {heartbeats_source_options}"
        )

    if data_dir is None:
        data_dir = get_config("data_dir")

    db_dir = Path(data_dir).expanduser() / DB_SLUG
    annotations_dir = db_dir / ANNOTATION_DIRNAME
    edf_dir = db_dir / EDF_DIRNAME
    heartbeats_dir = db_dir / HEARTBEATS_DIRNAME

    for directory in (annotations_dir, edf_dir, heartbeats_dir):
        directory.mkdir(parents=True, exist_ok=True)

    if not offline:
        download_url = _get_nsrr_url(DB_SLUG)

        subject_data_filename, subject_data_checksum = _list_nsrr(
            "mesa",
            "datasets",
            "mesa-sleep-dataset-*.csv",
            shallow=True,
        )[0]
        subject_data_filepath = db_dir / subject_data_filename
        _download_nsrr_file(
            download_url + subject_data_filename,
            target_filepath=subject_data_filepath,
            checksum=subject_data_checksum,
        )

        xml_files = _list_nsrr(
            DB_SLUG,
            ANNOTATION_DIRNAME,
            f"mesa-sleep-{records_pattern}-nsrr.xml",
            shallow=True,
        )
        checksums = dict(xml_files)
        requested_records = [Path(file).stem[:-5] for file, _ in xml_files]

        edf_files = _list_nsrr(
            DB_SLUG,
            EDF_DIRNAME,
            f"mesa-sleep-{records_pattern}.edf",
            shallow=True,
        )
        checksums.update(edf_files)

        rpoints_files = _list_nsrr(
            DB_SLUG,
            RPOINTS_DIRNAME,
            f"mesa-sleep-{records_pattern}-rpoint.csv",
            shallow=True,
        )
        checksums.update(rpoints_files)
    else:
        subject_data_filepath = next((db_dir / "datasets").glob("mesa-sleep-dataset-*.csv"))
        xml_paths = annotations_dir.glob(f"mesa-sleep-{records_pattern}-nsrr.xml")
        requested_records = sorted([file.stem[:-5] for file in xml_paths])

    subject_data_array = np.loadtxt(
        subject_data_filepath,
        delimiter=",",
        skiprows=1,
        usecols=[0, 3, 5],  # [mesaid, gender, age]
        dtype=int,
    )

    subject_data = {}
    for mesaid, gender, age in subject_data_array:
        subject_data[f"mesa-sleep-{mesaid:04}"] = SubjectData(
            gender=GENDER_MAPPING[gender],
            age=age,
        )

    for record_id in requested_records:
        heartbeats_file = heartbeats_dir / f"{record_id}.npy"
        if heartbeats_source == "annotation":
            rpoints_filename = f"{RPOINTS_DIRNAME}/{record_id}-rpoint.csv"
            rpoints_filepath = db_dir / rpoints_filename
            if not rpoints_filepath.is_file():
                if not offline and rpoints_filename in checksums:
                    _download_nsrr_file(
                        download_url + rpoints_filename,
                        rpoints_filepath,
                        checksums[rpoints_filename],
                    )
                else:
                    print(f"Skipping {record_id} due to missing heartbeat annotations.")
                    continue

            heartbeat_times = np.loadtxt(
                rpoints_filepath,
                delimiter=",",
                skiprows=1,
                usecols=18,  # column 18 ('seconds') contains the annotated heartbeat times
            )
            # for some reason some (39) records have unsorted annotations
            heartbeat_times.sort()
        elif heartbeats_source == "cached":
            if not heartbeats_file.is_file():
                print(f"Skipping {record_id} due to missing cached heartbeats.")
                continue
            heartbeat_times = np.load(heartbeats_file)
        elif heartbeats_source == "ecg":
            edf_filename = EDF_DIRNAME + f"/{record_id}.edf"
            edf_filepath = db_dir / edf_filename
            edf_was_available = edf_filepath.is_file()
            if not offline:
                _download_nsrr_file(
                    download_url + edf_filename,
                    edf_filepath,
                    checksums[edf_filename],
                )

            ecg = read_edf(edf_filepath).get_signal("EKG")
            heartbeat_indices = detect_heartbeats(ecg.data, ecg.sampling_frequency)
            heartbeat_times = heartbeat_indices / ecg.sampling_frequency
            np.save(heartbeats_file, heartbeat_times)

            if not edf_was_available and not keep_edfs:
                edf_filepath.unlink()

        xml_filename = ANNOTATION_DIRNAME + f"/{record_id}-nsrr.xml"
        xml_filepath = db_dir / xml_filename
        if not offline:
            _download_nsrr_file(
                download_url + xml_filename,
                xml_filepath,
                checksums[xml_filename],
            )

        parsed_xml = _parse_nsrr_xml(xml_filepath)

        yield SleepRecord(
            sleep_stages=parsed_xml.sleep_stages,
            sleep_stage_duration=parsed_xml.sleep_stage_duration,
            id=record_id,
            recording_start_time=parsed_xml.recording_start_time,
            heartbeat_times=heartbeat_times,
            subject_data=subject_data[record_id],
        )
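
Example: a minimal sketch reading a few MESA records from annotated R-points. The token, record pattern, and data directory are placeholders.

from sleepecg import read_mesa, set_nsrr_token

set_nsrr_token("<your-token>")  # not required with offline=True
for rec in read_mesa(records_pattern="000*", data_dir="./datasets"):
    print(rec.id, rec.subject_data.age, len(rec.heartbeat_times))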

sleepecg.read_mitdb(records_pattern='*', offline=False, data_dir=None)

Lazily read records from MITDB.

Parameters:

  • records_pattern (str) –

    Glob-like pattern to select record IDs, by default '*'.

  • offline (bool) –

    If True, only local files will be used (i.e. no files will be downloaded), by default False.

  • data_dir (str | Path) –

    Directory where all datasets are stored. If None (default), the value will be taken from the configuration.

Yields:

  • ECGRecord

    Each element in the generator is of type ECGRecord and contains the ECG signal (.ecg), sampling frequency (.fs), annotated beat indices (.annotation), .lead, and .id.

Source code in sleepecg/io/ecg_readers.py
def read_mitdb(
    records_pattern: str = "*",
    offline: bool = False,
    data_dir: Optional[str | Path] = None,
) -> Iterator[ECGRecord]:
    """
    Lazily read records from [MITDB](https://physionet.org/content/mitdb/).

    Parameters
    ----------
    records_pattern : str, optional
        Glob-like pattern to select record IDs, by default `'*'`.
    offline : bool, optional
        If `True`, only local files will be used (i.e. no files will be downloaded), by
        default `False`.
    data_dir : str | pathlib.Path, optional
        Directory where all datasets are stored. If `None` (default), the value will be
        taken from the configuration.

    Yields
    ------
    ECGRecord
        Each element in the generator is of type `ECGRecord` and contains the ECG signal
        (`.ecg`), sampling frequency (`.fs`), annotated beat indices (`.annotation`),
        `.lead`, and `.id`.
    """
    if data_dir is None:
        data_dir = get_config("data_dir")
    yield from _read_mitbih("mitdb", records_pattern, offline, data_dir)

sleepecg.read_shhs(records_pattern='*', heartbeats_source='annotation', offline=False, keep_edfs=False, data_dir=None)

Lazily read records from SHHS.

Each SHHS record consists of an .edf file containing raw polysomnography data and an .xml file containing annotated events. Since the entire SHHS dataset requires about 356 GB of disk space, .edf files can be deleted after heartbeat times have been extracted. Heartbeat times are cached in an .npy file in <data_dir>/shhs/preprocessed/heartbeats.

Parameters:

  • records_pattern (str) –

    Glob-like pattern to select record IDs, by default '*'.

  • heartbeats_source ({'annotation', 'cached', 'ecg'}) –

    If 'annotation' (default), get heartbeat times from polysomnography/annotations-rpoints/shhsX/<record_id>-rpoint.csv (not available for all records). If 'ecg', use sleepecg.detect_heartbeats() on the ECG contained in polysomnography/edfs/shhsX/<record_id>.edf and cache the result in preprocessed/heartbeats/shhsX/<record_id>.npy. If 'cached', get the cached heartbeats.

  • offline (bool) –

    If True, search for local files only instead of using the NSRR API, by default False.

  • keep_edfs (bool) –

    If False, remove .edf after heartbeat detection, by default False.

  • data_dir (str | Path) –

    Directory where all datasets are stored. If None (default), the value will be taken from the configuration.

Yields:

  • SleepRecord

    Each element in the generator is of type SleepRecord.

Source code in sleepecg/io/sleep_readers.py
def read_shhs(
    records_pattern: str = "*",
    heartbeats_source: str = "annotation",
    offline: bool = False,
    keep_edfs: bool = False,
    data_dir: Optional[str | Path] = None,
) -> Iterator[SleepRecord]:
    """
    Lazily read records from [SHHS](https://sleepdata.org/datasets/shhs).

    Each SHHS record consists of an `.edf` file containing raw polysomnography data and an
    `.xml` file containing annotated events. Since the entire SHHS dataset requires about
    356 GB of disk space, `.edf` files can be deleted after heartbeat times have been
    extracted. Heartbeat times are cached in an `.npy` file in
    `<data_dir>/shhs/preprocessed/heartbeats`.

    Parameters
    ----------
    records_pattern : str, optional
        Glob-like pattern to select record IDs, by default `'*'`.
    heartbeats_source : {'annotation', 'cached', 'ecg'}, optional
        If `'annotation'` (default), get heartbeat times from
        `polysomnography/annotations-rpoints/shhsX/<record_id>-rpoint.csv`
        (not available for all records). If `'ecg'`, use `sleepecg.detect_heartbeats()` on
        the ECG contained in `polysomnography/edfs/shhsX/<record_id>.edf` and cache the
        result in `preprocessed/heartbeats/shhsX/<record_id>.npy`. If `'cached'`, get the
        cached heartbeats.
    offline : bool, optional
        If `True`, search for local files only instead of using the NSRR API, by default
        `False`.
    keep_edfs : bool, optional
        If `False`, remove `.edf` after heartbeat detection, by default `False`.
    data_dir : str | pathlib.Path, optional
        Directory where all datasets are stored. If `None` (default), the value will be
        taken from the configuration.

    Yields
    ------
    SleepRecord
        Each element in the generator is of type `SleepRecord`.
    """
    from edfio import read_edf

    DB_SLUG = "shhs"
    ANNOTATION_DIRNAME = "polysomnography/annotations-events-nsrr"
    EDF_DIRNAME = "polysomnography/edfs"
    HEARTBEATS_DIRNAME = "preprocessed/heartbeats"
    RPOINTS_DIRNAME = "polysomnography/annotations-rpoints"

    # see shhs/datasets/shhs-data-dictionary-0.16.0-domains.csv lines 91+92
    GENDER_MAPPING = {"2": Gender.FEMALE, "1": Gender.MALE}

    heartbeats_source_options = {"annotation", "cached", "ecg"}
    if heartbeats_source not in heartbeats_source_options:
        raise ValueError(
            f"Invalid value for parameter `heartbeats_source`: {heartbeats_source}, "
            f"possible options: {heartbeats_source_options}"
        )

    if data_dir is None:
        data_dir = get_config("data_dir")

    data_dir = Path(data_dir).expanduser()
    db_dir = data_dir / DB_SLUG
    annotations_dir = db_dir / ANNOTATION_DIRNAME
    edf_dir = db_dir / EDF_DIRNAME
    heartbeats_dir = db_dir / HEARTBEATS_DIRNAME

    for directory in (annotations_dir, edf_dir, heartbeats_dir):
        directory.mkdir(parents=True, exist_ok=True)

    if not offline:
        download_url = _get_nsrr_url(DB_SLUG)

        download_nsrr(
            DB_SLUG,
            "datasets",
            "shhs?-dataset-*.csv",
            shallow=True,
            data_dir=data_dir,
        )

        xml_files = _list_nsrr(
            DB_SLUG,
            ANNOTATION_DIRNAME,
            f"{records_pattern}-nsrr.xml",
            shallow=False,
        )
        checksums = dict(xml_files)
        requested_records = [file[-27:-9] for file, _ in xml_files]

        edf_files = _list_nsrr(
            DB_SLUG,
            EDF_DIRNAME,
            f"{records_pattern}.edf",
            shallow=False,
        )
        checksums.update(edf_files)

        rpoints_files = _list_nsrr(
            DB_SLUG,
            RPOINTS_DIRNAME,
            f"{records_pattern}-rpoint.csv",
            shallow=False,
        )
        checksums.update(rpoints_files)
    else:
        xml_paths = sorted(annotations_dir.rglob(f"{records_pattern}-nsrr.xml"))
        requested_records = [str(file)[-27:-9] for file in xml_paths]

    subject_data = {}

    if any(r.startswith("shhs1") for r in requested_records):
        subject_data_file_shhs1 = next((db_dir / "datasets").glob("shhs1-dataset-*.csv"))
        with open(subject_data_file_shhs1, newline="", encoding="windows-1252") as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                record_id = f"shhs1-{row['nsrrid']}"
                subject_data[record_id] = SubjectData(
                    gender=GENDER_MAPPING.get(row["gender"]),
                    age=int(row["age_s1"]) if row["age_s1"] != "" else None,
                    weight=float(row["weight"]) if row["weight"] else None,
                )
    if any(r.startswith("shhs2") for r in requested_records):
        subject_data_file_shhs2 = next((db_dir / "datasets").glob("shhs2-dataset-*.csv"))
        with open(subject_data_file_shhs2, newline="", encoding="windows-1252") as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                record_id = f"shhs2-{row['nsrrid']}"
                subject_data[record_id] = SubjectData(
                    gender=GENDER_MAPPING.get(row["gender"]),
                    age=int(row["age_s2"]) if row["age_s2"] != "" else None,
                    weight=None,  # subject weight was not recorded in shhs2
                )

    for record_id in requested_records:
        heartbeats_file = heartbeats_dir / f"{record_id}.npy"
        if heartbeats_source == "annotation":
            rpoints_filename = f"{RPOINTS_DIRNAME}/{record_id}-rpoint.csv"
            rpoints_filepath = db_dir / rpoints_filename
            if not rpoints_filepath.is_file():
                if not offline and rpoints_filename in checksums:
                    _download_nsrr_file(
                        download_url + rpoints_filename,
                        rpoints_filepath,
                        checksums[rpoints_filename],
                    )
                else:
                    print(f"Skipping {record_id} due to missing heartbeat annotations.")
                    continue
            heartbeat_times = np.loadtxt(
                rpoints_filepath,
                delimiter=",",
                skiprows=1,
                usecols=19,  # column 19 ('seconds') contains the annotated heartbeat times
            )
        elif heartbeats_source == "cached":
            if not heartbeats_file.is_file():
                print(f"Skipping {record_id} due to missing cached heartbeats.")
                continue
        elif heartbeats_source == "ecg":
            edf_filename = EDF_DIRNAME + f"/{record_id}.edf"
            edf_filepath = db_dir / edf_filename
            edf_was_available = edf_filepath.is_file()
            if not offline:
                _download_nsrr_file(
                    download_url + edf_filename,
                    edf_filepath,
                    checksums[edf_filename],
                )

            ecg = read_edf(edf_filepath).get_signal("ECG")
            heartbeat_indices = detect_heartbeats(ecg.data, ecg.sampling_frequency)
            heartbeat_times = heartbeat_indices / ecg.sampling_frequency

            heartbeats_file.parent.mkdir(parents=True, exist_ok=True)
            np.save(heartbeats_file, heartbeat_times)

            if not edf_was_available and not keep_edfs:
                edf_filepath.unlink()

        xml_filename = ANNOTATION_DIRNAME + f"/{record_id}-nsrr.xml"
        xml_filepath = db_dir / xml_filename
        if not offline:
            _download_nsrr_file(
                download_url + xml_filename,
                xml_filepath,
                checksums[xml_filename],
            )

        parsed_xml = _parse_nsrr_xml(xml_filepath)

        yield SleepRecord(
            sleep_stages=parsed_xml.sleep_stages,
            sleep_stage_duration=parsed_xml.sleep_stage_duration,
            id=record_id[6:],  # remove subdirectory
            recording_start_time=parsed_xml.recording_start_time,
            heartbeat_times=heartbeat_times,
            subject_data=subject_data[record_id[6:]],  # remove subdirectory prefix
        )
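
Example: a minimal sketch that detects heartbeats from the ECG signal and caches them. The token, record pattern, and data directory are placeholders.

from sleepecg import read_shhs, set_nsrr_token

set_nsrr_token("<your-token>")  # not required with offline=True
for rec in read_shhs(
    records_pattern="shhs1-20000*",  # placeholder pattern
    heartbeats_source="ecg",
    data_dir="./datasets",
):
    print(rec.id, len(rec.heartbeat_times))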

sleepecg.read_slpdb(records_pattern='*', offline=False, data_dir=None)

Lazily read records from SLPDB.

Required files are downloaded from PhysioNet to <data_dir>/slpdb.

Parameters:

  • records_pattern (str) –

    Glob-like pattern to select record IDs, by default '*'.

  • offline (bool) –

    If True, search for local files only instead of downloading from PhysioNet, by default False.

  • data_dir (str | Path) –

    Directory where all datasets are stored. If None (default), the value will be taken from the configuration.

Yields:

  • SleepRecord

    Each element in the generator is of type SleepRecord.

Source code in sleepecg/io/sleep_readers.py
def read_slpdb(
    records_pattern: str = "*",
    offline: bool = False,
    data_dir: Optional[str | Path] = None,
) -> Iterator[SleepRecord]:
    """
    Lazily read records from [SLPDB](https://physionet.org/content/slpdb).

    Required files are downloaded from PhysioNet to `<data_dir>/slpdb`.

    Parameters
    ----------
    records_pattern : str, optional
        Glob-like pattern to select record IDs, by default `'*'`.
    offline : bool, optional
        If `True`, search for local files only instead of downloading from PhysioNet, by
        default `False`.
    data_dir : str | pathlib.Path, optional
        Directory where all datasets are stored. If `None` (default), the value will be
        taken from the configuration.

    Yields
    ------
    SleepRecord
        Each element in the generator is of type `SleepRecord`.
    """
    # https://physionet.org/content/slpdb/1.0.0/
    import wfdb

    DB_SLUG = "slpdb"

    STAGE_MAPPING = {
        "W": SleepStage.WAKE,
        "R": SleepStage.REM,
        "1": SleepStage.N1,
        "2": SleepStage.N2,
        "3": SleepStage.N3,
        "4": SleepStage.N3,
    }

    if data_dir is None:
        data_dir = get_config("data_dir")

    data_dir = Path(data_dir).expanduser()
    db_dir = data_dir / DB_SLUG

    requested_records = _list_physionet(
        data_dir=data_dir,
        db_slug=DB_SLUG,
        pattern=records_pattern,
    )

    if not offline:
        download_physionet(
            db_slug=DB_SLUG,
            requested_records=requested_records,
            extensions=[".hea", ".dat", ".st"],
            data_dir=data_dir,
        )

    for record_id in requested_records:
        record_file = str(db_dir / record_id)

        record = wfdb.rdrecord(record_file)
        start_time = record.base_time
        ecg = np.asarray(record.p_signal[:, record.sig_name.index("ECG")])
        fs = record.fs

        heartbeat_indices = detect_heartbeats(ecg, fs)
        heartbeat_times = heartbeat_indices / fs

        annot_st = wfdb.rdann(record_file, "st")

        # Some 30 second windows don't have a sleep stage annotation, so the annotation
        # array is initialized with `SleepStage.UNDEFINED` for every 30 second window.
        for sample_time, annotation in zip(annot_st.sample[::-1], annot_st.aux_note[::-1]):
            if annotation[0] in STAGE_MAPPING:
                number_of_sleep_stages = sample_time // (30 * fs) + 1
                break

        sleep_stages = np.full(number_of_sleep_stages, SleepStage.UNDEFINED)

        # Most annotations are at sample indices which are multiples of 30*fs. However,
        # annotations which would be at sample index 0, are at sample index 1. Integer
        # division is used when calculating the stage index to move these annotations to
        # sample index 0.
        for sample_time, annotation in zip(annot_st.sample, annot_st.aux_note):
            if annotation[0] in STAGE_MAPPING:
                sleep_stages[sample_time // (30 * fs)] = STAGE_MAPPING[annotation[0]]

        # Age and weight are given in the last line of the header file, which is contained
        # in record.comments[0] and looks like this:
        # '44 M 89 32-01-89' ('<age> <gender> <weight> <unspecified>')
        # For some records, age/weight is given as 'x'.
        age, _, weight, _ = record.comments[0].split()
        subject_data = SubjectData(
            gender=Gender.MALE,  # all slpdb subjects were male
            age=None if age == "x" else int(age),
            weight=None if weight == "x" else int(weight),
        )

        yield SleepRecord(
            sleep_stages=sleep_stages,
            sleep_stage_duration=30,
            id=record_id,
            recording_start_time=start_time,
            heartbeat_times=heartbeat_times,
            subject_data=subject_data,
        )
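
Example: a minimal sketch (the record pattern and data directory are placeholders).

from sleepecg import read_slpdb

for rec in read_slpdb(records_pattern="slp01*", data_dir="./datasets"):
    print(rec.id, rec.subject_data.age, rec.sleep_stages[:10])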

sleepecg.set_nsrr_token(token)

Set and verify the NSRR download token.

Implemented according to the NSRR API specs.

Parameters:

  • token (str) –

    NSRR download token.

Source code in sleepecg/io/nsrr.py
def set_nsrr_token(token: str) -> None:
    """
    Set and verify the [NSRR](https://sleepdata.org) download token.

    Implemented according to the
    [NSRR API specs](https://github.com/nsrr/sleepdata.org/wiki/api-v1-account).

    Parameters
    ----------
    token : str
        NSRR [download token](https://sleepdata.org/token).
    """
    response = requests.get(
        "https://sleepdata.org/api/v1/account/profile.json",
        params={"auth_token": token},
    )
    authenticated = response.json()["authenticated"]
    if authenticated:
        username = response.json()["username"]
        email = response.json()["email"]
        print(f"Authenticated at sleepdata.org as {username} ({email})")
        global _nsrr_token
        _nsrr_token = token
    else:
        raise RuntimeError("Authentication at sleepdata.org failed, verify token!")
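
Example: the token string is a placeholder; a personal token is available at https://sleepdata.org/token after signing up for an NSRR account.

import sleepecg

sleepecg.set_nsrr_token("<your-token>")
# prints "Authenticated at sleepdata.org as <username> (<email>)" on success
# and raises RuntimeError if the token is invalid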

sleepecg.ECGRecord dataclass

Store a single ECG record.

Attributes:

  • ecg (ndarray) –

    The ECG signal.

  • fs (float) –

    The sampling frequency in Hz.

  • annotation (ndarray) –

    Indices of annotated heartbeats.

  • lead (str, optional) –

    Which ECG lead the signal was recorded from, by default None.

  • id (str, optional) –

    The record ID, by default None.

Source code in sleepecg/io/ecg_readers.py
@dataclass
class ECGRecord:
    """
    Store a single ECG record.

    Attributes
    ----------
    ecg : np.ndarray
        The ECG signal.
    fs : float
        The sampling frequency in Hz.
    annotation : np.ndarray
        Indices of annotated heartbeats.
    lead : str, optional
        Which ECG lead the signal was recorded from, by default `None`.
    id : str, optional
        The record ID, by default `None`.
    """

    ecg: np.ndarray
    fs: float
    annotation: np.ndarray
    lead: Optional[str] = None
    id: Optional[str] = None

    def export(self, filename: str | Path) -> None:
        """
        Export ECG record to CSV.

        Parameters
        ----------
        filename : str | pathlib.Path
            File name to write to.
        """
        export_ecg_record(self, filename)

    def plot(self, **kwargs: np.ndarray) -> tuple["plt.Figure", "plt.Axes"]:
        """
        Plot ECG time series with optional markers.

        Parameters
        ----------
        **kwargs : np.ndarray
            Positions of annotations (i.e. heartbeats) in samples. If more than one marker
            sequence is given, the keywords will be used as labels in the plot legend.

        Returns
        -------
        fig : matplotlib.figure.Figure
            The figure.
        ax : matplotlib.axes.Axes
            The axes in the figure.
        """
        return plot_ecg(self.ecg, self.fs, title=self.id, beats=self.annotation, **kwargs)

export(filename)

Export ECG record to CSV.

Parameters:

  • filename (str | Path) –

    File name to write to.

Source code in sleepecg/io/ecg_readers.py
def export(self, filename: str | Path) -> None:
    """
    Export ECG record to CSV.

    Parameters
    ----------
    filename : str | pathlib.Path
        File name to write to.
    """
    export_ecg_record(self, filename)

plot(**kwargs)

Plot ECG time series with optional markers.

Parameters:

  • **kwargs (ndarray) –

    Positions of annotations (i.e. heartbeats) in samples. If more than one marker sequence is given, the keywords will be used as labels in the plot legend.

Returns:

  • fig (Figure) –

    The figure.

  • ax (Axes) –

    The axes in the figure.

Source code in sleepecg/io/ecg_readers.py
def plot(self, **kwargs: np.ndarray) -> tuple["plt.Figure", "plt.Axes"]:
    """
    Plot ECG time series with optional markers.

    Parameters
    ----------
    **kwargs : np.ndarray
        Positions of annotations (i.e. heartbeats) in samples. If more than one marker
        sequence is given, the keywords will be used as labels in the plot legend.

    Returns
    -------
    fig : matplotlib.figure.Figure
        The figure.
    ax : matplotlib.axes.Axes
        The axes in the figure.
    """
    return plot_ecg(self.ecg, self.fs, title=self.id, beats=self.annotation, **kwargs)
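
Example: a minimal end-to-end sketch for ECGRecord with synthetic data.

import numpy as np

from sleepecg import ECGRecord

record = ECGRecord(
    ecg=np.random.randn(2500),           # synthetic signal
    fs=250,
    annotation=np.arange(0, 2500, 250),  # fake beat indices
    lead="II",
    id="demo",
)
fig, ax = record.plot()  # ECG trace with annotated beats marked
record.export("demo")    # writes demo.csv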

sleepecg.Gender

Bases: IntEnum

Mapping of gender to integers.

Source code in sleepecg/io/sleep_readers.py
class Gender(IntEnum):
    """Mapping of gender to integers."""

    FEMALE = 0
    MALE = 1

sleepecg.SleepRecord dataclass

Store a single sleep record.

Attributes:

  • sleep_stages (ndarray, optional) –

    Sleep stages according to AASM guidelines, stored as integers as defined by SleepStage, by default None.

  • sleep_stage_duration (int, optional) –

    Duration of each sleep stage in seconds, by default None.

  • id (str, optional) –

    The record ID, by default None.

  • recording_start_time (datetime.time, optional) –

    Time at which the recording was started, by default None.

  • heartbeat_times (ndarray, optional) –

    Times of heartbeats relative to recording start in seconds, by default None.

  • subject_data (SubjectData, optional) –

    Dataclass containing subject data (such as gender or age), by default None.

Source code in sleepecg/io/sleep_readers.py
@dataclass
class SleepRecord:
    """
    Store a single sleep record.

    Attributes
    ----------
    sleep_stages : np.ndarray, optional
        Sleep stages according to AASM guidelines, stored as integers as defined by
        `SleepStage`, by default `None`.
    sleep_stage_duration : int, optional
        Duration of each sleep stage in seconds, by default `None`.
    id : str, optional
        The record ID, by default `None`.
    recording_start_time : datetime.time, optional
        Time at which the recording was started, by default `None`.
    heartbeat_times : np.ndarray, optional
        Times of heartbeats relative to recording start in seconds, by default `None`.
    subject_data : SubjectData, optional
        Dataclass containing subject data (such as gender or age), by default `None`.
    """

    sleep_stages: Optional[np.ndarray] = None
    sleep_stage_duration: Optional[int] = None
    id: Optional[str] = None
    recording_start_time: Optional[datetime.time] = None
    heartbeat_times: Optional[np.ndarray] = None
    subject_data: Optional[SubjectData] = None

sleepecg.SleepStage

Bases: IntEnum

Mapping of AASM sleep stages to integers.

To facilitate hypnogram plotting, values start with zero and increase with wakefulness.

Source code in sleepecg/io/sleep_readers.py
class SleepStage(IntEnum):
    """
    Mapping of AASM sleep stages to integers.

    To facilitate hypnogram plotting, values start with zero and increase with wakefulness.
    """

    UNDEFINED = 0
    N3 = 1
    N2 = 2
    N1 = 3
    REM = 4
    WAKE = 5
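
Example: because deeper sleep maps to smaller integers, plotting the stage values over time directly yields a conventional hypnogram shape.

from sleepecg import SleepStage

assert SleepStage.N3 < SleepStage.REM < SleepStage.WAKE
print(int(SleepStage.WAKE))  # 5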

sleepecg.SubjectData dataclass

Store data about a single subject.

Attributes:

  • gender (int, optional) –

    The subject's gender, stored as an integer as defined by Gender, by default None.

  • age (int, optional) –

    The subject's age in years, by default None.

  • weight (float, optional) –

    The subject's weight in kg, by default None.

Source code in sleepecg/io/sleep_readers.py
@dataclass
class SubjectData:
    """
    Store data about a single subject.

    Attributes
    ----------
    gender : int, optional
        The subject's gender, stored as an integer as defined by `Gender`, by default
        `None`.
    age : int, optional
        The subject's age in years, by default `None`.
    weight : float, optional
        The subject's weight in kg, by default `None`.
    """

    gender: Optional[int] = None
    age: Optional[int] = None
    weight: Optional[float] = None