DataLoading Module¶

The DataLoading module is the entry point for most single-cell RNA-seq analysis pipelines. It loads aligned data from 10X Genomics Cell Ranger output files into the pipeline's data context. The module currently offers very little flexibility in file types and formats, so feel free to extend it as needed!

Overview¶

This module uses Scanpy's read_10x_h5 function to load data from 10X HDF5 files and prepares it for downstream analysis. It handles the initial data setup, including making variable names unique and creating the first data layer for tracking data transformations throughout the pipeline.

Multimodal Data

Right now, the module supports only single-modality data. Any other modalities (e.g. Antibody Capture) are discarded. Future implementations will shift from AnnData to muon and MuData to resolve this.
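
To see which feature types a given file contains before the extra modalities are dropped, one option is to read it with Scanpy's `gex_only=False` and tally `adata.var["feature_types"]`. The sketch below is not part of the module: `summarize_feature_types` is a hypothetical helper, and the Scanpy call appears only in a comment so that the snippet runs on its own with stand-in values.

```python
from collections import Counter

def summarize_feature_types(feature_types):
    """Count how many features belong to each 10X feature type."""
    return Counter(str(ft) for ft in feature_types)

# With Scanpy installed, the input could come from a real file
# (assumption: the file holds more than one feature type):
#   import scanpy as sc
#   adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5", gex_only=False)
#   counts = summarize_feature_types(adata.var["feature_types"])

# Stand-in values for illustration only.
counts = summarize_feature_types(
    ["Gene Expression", "Gene Expression", "Antibody Capture"]
)
print(dict(counts))
```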

Parameters¶

Parameter	Type	Required	Default	Description
`file_path`	string	Yes	-	Path to the 10X h5 file (typically `sample_filtered_feature_bc_matrix.h5`)

Example Configuration¶

- name: data_loading
  type: DataLoading
  params:
    file_path: /path/to/filtered_feature_bc_matrix.h5

Input/Output¶

Inputs¶

None (this is typically the first module in a pipeline)

Outputs¶

data: An AnnData object containing the loaded single-cell data with:
.X: The main expression matrix
.var_names: Gene identifiers (made unique)
.obs_names: Cell barcodes
.layers['loaded_counts']: A copy of the original counts data

Functionality¶

The module performs the following operations:

File Validation: Checks that the specified file path exists and is accessible
Data Loading: Uses Scanpy's read_10x_h5 function to load the HDF5 file
Variable Name Processing: Ensures all gene names are unique using Scanpy's var_names_make_unique() function
Layer Creation: Creates a loaded_counts layer to preserve the original data before any transformations
Data Context Storage: Stores the loaded AnnData object in the pipeline's data context for use by subsequent modules
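
The steps above can be sketched roughly as follows. This is a minimal illustration, not the module's actual source: `load_10x_h5` is a hypothetical function name, the direct layer assignment stands in for the pipeline's `save_layer()` utility (whose signature is not documented here), and storing the result in the data context is left to the caller.

```python
from pathlib import Path

def load_10x_h5(file_path):
    """Load a 10X HDF5 file and keep a copy of the raw counts."""
    path = Path(file_path)
    # 1. File validation: fail early with a clear message.
    if not path.is_file():
        raise FileNotFoundError(f"10X HDF5 file not found: {path}")

    # Imported here so the path check above runs even without Scanpy.
    import scanpy as sc

    # 2. Data loading (by default only Gene Expression features are kept).
    adata = sc.read_10x_h5(path)
    # 3. Variable name processing: de-duplicate gene names in place.
    adata.var_names_make_unique()
    # 4. Layer creation: stand-in for the pipeline's save_layer() utility.
    adata.layers["loaded_counts"] = adata.X.copy()
    # 5. The caller would store the returned object in the data context.
    return adata
```

Copying `.X` rather than referencing it keeps the `loaded_counts` layer unchanged when later modules modify `.X` in place.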

Supported File Formats¶

Currently, the DataLoading module supports:

10X HDF5 files (.h5): The standard output format from Cell Ranger, typically named sample_filtered_feature_bc_matrix.h5

The module expects the standard 10X HDF5 structure containing:
- Expression matrix (genes × cells)
- Gene information (features)
- Cell barcode information

Usage Notes¶

File Path Requirements¶

The file_path parameter must point to a valid 10X HDF5 file
Relative paths are supported and resolved relative to the working directory
The file must be readable by the user running the pipeline (which should not be an issue when working inside the container)
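
These requirements could be checked up front with a small helper like the one below. It is a sketch using only the standard library; `resolve_input_path` is a hypothetical name, and the module's own validation may differ.

```python
import os
from pathlib import Path

def resolve_input_path(file_path):
    """Return an absolute path, or raise if the file is missing or unreadable."""
    # Relative paths are resolved against the current working directory.
    path = Path(file_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    if not os.access(path, os.R_OK):
        raise PermissionError(f"File is not readable: {path}")
    return path
```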

Data Structure¶

After loading, the AnnData object contains:
- Observations (.obs): Cell-level metadata (initially just cell barcodes)
- Variables (.var): Gene-level metadata (gene IDs, symbols, etc.)
- Expression Matrix (.X): Raw count data (typically sparse matrix format)
- Layers: The loaded_counts layer preserves the original data

Memory Considerations¶

Large datasets may require significant memory. Consider the following:
- 10X filtered matrices are typically manageable for most systems
- Raw matrices (containing all droplets) may require more memory
- Loading itself does not require large amounts of memory; however, the pipeline creates copies of the data in layers, which increases memory usage, especially if some layers are not sparse
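
To get a feel for why a non-sparse layer matters, the back-of-the-envelope estimate below compares one layer stored in CSR format with the same layer stored densely. All numbers are illustrative assumptions (matrix size, 5% non-zero entries, 4-byte values and indices), not measurements from the pipeline.

```python
def csr_bytes(n_rows, n_nonzero, value_bytes=4, index_bytes=4):
    """Approximate CSR size: values + column indices + row pointers."""
    return n_nonzero * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes

def dense_bytes(n_rows, n_cols, value_bytes=4):
    """Size of the same matrix stored densely."""
    return n_rows * n_cols * value_bytes

# Illustrative numbers only: 8,000 cells x 30,000 genes, 5% non-zero entries.
n_cells, n_genes = 8_000, 30_000
nnz = n_cells * n_genes * 5 // 100

sparse_mb = csr_bytes(n_cells, nnz) / 1e6
dense_mb = dense_bytes(n_cells, n_genes) / 1e6
print(f"sparse layer: ~{sparse_mb:.0f} MB, dense layer: ~{dense_mb:.0f} MB")
```

Under these assumed numbers the dense copy is about ten times larger than the sparse one, so a single non-sparse layer can dominate memory usage.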

Error Handling¶

Error handling in the module is minimal, but it logs a clear report identifying the source of any failure.

Integration with Other Modules¶

The DataLoading module is designed to work seamlessly with other pipeline modules, providing the data for most downstream operations:

Quality Control: The loaded data is immediately ready for QC metric calculation
Ambient RNA Removal: Can work with both filtered and raw count matrices
Preprocessing: All subsequent modules expect the AnnData structure created by this module

Example Usage¶

Basic Usage¶

modules:
  - name: load_data
    type: DataLoading
    params:
      file_path: ./data/filtered_feature_bc_matrix.h5

With Absolute Path¶

modules:
  - name: load_data
    type: DataLoading
    params:
      file_path: /home/user/scrnaseq_data/sample1/outs/filtered_feature_bc_matrix.h5

In a Complete Pipeline¶

pipeline:
  name: full_analysis
  output_dir: ./output

modules:
  - name: data_loading
    type: DataLoading
    params:
      file_path: ./data/filtered_feature_bc_matrix.h5

  - name: qc_metrics
    type: QCMetrics
    params:
      mito_pattern: "^MT-"

  # Additional modules...

Implementation Details¶

The module leverages several key components:

Scanpy Integration: Uses sc.read_10x_h5() for reliable 10X file parsing
Layer Management: Utilizes the pipeline's save_layer() utility function to track data versions
Error Logging: Comprehensive logging of all operations and errors
Data Validation: Ensures loaded data meets pipeline requirements

Output Verification¶

After successful execution, you can verify the data was loaded correctly:

# `adata` is the AnnData object produced by the DataLoading module
print(f"Loaded {adata.n_obs} cells and {adata.n_vars} genes")
print(f"Layers available: {list(adata.layers.keys())}")
print(f"Matrix type: {type(adata.X)}")

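Expected output (cell and gene counts vary by dataset, and the exact class path depends on the installed SciPy version):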

Loaded 8381 cells and 36601 genes
Layers available: ['loaded_counts']
Matrix type: <class 'scipy.sparse._matrix.csr_matrix'>
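
The same checks can be expressed as a small function that reports problems instead of printing values. This is a hedged sketch: `check_loaded` is a hypothetical helper, and it relies only on attributes described on this page (`n_obs`, `n_vars`, `layers`, `var_names`).

```python
def check_loaded(adata):
    """Return a list of problems found in a freshly loaded AnnData-like object."""
    problems = []
    if adata.n_obs == 0 or adata.n_vars == 0:
        problems.append("matrix is empty")
    if "loaded_counts" not in adata.layers:
        problems.append("missing 'loaded_counts' layer")
    names = list(adata.var_names)
    if len(set(names)) != len(names):
        problems.append("gene names are not unique")
    return problems
```

An empty list means the object matches what the DataLoading module is documented to produce.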

Troubleshooting¶

Common Issues¶

File Not Found Error

Error loading data: [Errno 2] No such file or directory: 'path/to/file.h5'

- Verify the file path is correct
- Check file permissions
- Ensure the file exists and isn't corrupted

Memory Errors

MemoryError: Unable to allocate array

- Monitor system memory usage
- Consider using a machine with more RAM for large datasets, especially if using more memory-hungry modules after this one

Invalid HDF5 Format

OSError: Unable to open file (file signature not found)

- Verify the file is a valid HDF5 file
- Check if the file download completed successfully
- Re-download the file if necessary
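
One quick way to spot a mislabeled download (for example, an HTML error page saved as `.h5`) is to check for the format signature, the eight bytes `\x89HDF\r\n\x1a\n` that normally open an HDF5 file. The sketch below uses only the standard library; `looks_like_hdf5` is a hypothetical helper, not part of the module, and it checks only offset 0 (the HDF5 format also permits the signature at later offsets when a user block is present).

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

def looks_like_hdf5(file_path):
    """Return True if the file starts with the 8-byte HDF5 signature."""
    with open(file_path, "rb") as handle:
        return handle.read(8) == HDF5_SIGNATURE
```

A file that passes this check can still be truncated further in, so a failed load after a passing check still points to an incomplete download.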

Performance Considerations¶

Loading Time: Varies depending on the file size, but generally a few seconds at most
Storage: Creates additional layers that increase memory footprint