Filtering System
The filtering system allows you to process only the data you need, reducing storage requirements.
Overview
The filtering system operates during the streaming pipeline, applying filters before data is inserted into the database.
Key features:
- Memory Efficient: Filters applied during streaming
- Early Rejection: Unwanted records discarded before database operations
- Statistics: Track filtering effectiveness
Filter Types
Taxonomy Filtering
Filter by NCBI taxonomy IDs to focus on specific organisms:
# Human data only (taxonomy ID 9606)
srake ingest --file archive.tar.gz \
--taxon-ids 9606
# Human, mouse, and zebrafish
srake ingest --file archive.tar.gz \
--taxon-ids 9606,10090,7955
# Exclude synthetic constructs (32630), SARS-CoV-2 (2697049), and E. coli (562)
srake ingest --file archive.tar.gz \
--exclude-taxon-ids 32630,2697049,562
# Human and mouse, excluding E. coli contamination
srake ingest --file archive.tar.gz \
--taxon-ids 9606,10090 \
--exclude-taxon-ids 562
Common Taxonomy IDs:
- 9606: Homo sapiens (human)
- 10090: Mus musculus (mouse)
- 7955: Danio rerio (zebrafish)
- 7227: Drosophila melanogaster
- 562: Escherichia coli
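When the list of organisms grows, it can be easier to keep the IDs in a small text file and build the comma-separated value from it. The following is a minimal sketch under stated assumptions: the helper name taxon_csv and the list-file layout (one ID per line, with optional # comments) are hypothetical and not part of srake. The command is printed rather than executed, and it uses only the --taxon-ids flag shown above.

```shell
# Hypothetical helper (not part of srake): turn a list file with one
# taxonomy ID per line into the comma-separated form used by --taxon-ids.
taxon_csv() {
  # Strip '#' comments and whitespace, drop empty lines, join with commas.
  sed -e 's/#.*//' -e 's/[[:space:]]//g' "$1" | grep -v '^$' | paste -sd, -
}

# Example list file, created in a temporary location for illustration.
list=$(mktemp)
cat > "$list" <<'EOF'
9606   # Homo sapiens
10090  # Mus musculus

7955   # Danio rerio
EOF

ids=$(taxon_csv "$list")
rm -f "$list"

# Print the command instead of running it, so the sketch works without srake.
echo "srake ingest --file archive.tar.gz --taxon-ids $ids"
```

Keeping the list in a file also lets the organism names sit next to the numeric IDs as comments, which makes the filter easier to review later.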
Organism Name Filtering
Filter by scientific names when you don’t know the taxonomy IDs:
# Single organism
srake ingest --file archive.tar.gz \
--organisms "homo sapiens"
# Multiple organisms
srake ingest --file archive.tar.gz \
--organisms "homo sapiens,mus musculus,rattus norvegicus"
Date Range Filtering
Filter by submission or publication dates:
# Records from 2024 only
srake ingest --file archive.tar.gz \
--date-from 2024-01-01 \
--date-to 2024-12-31
# From 1 October 2024 onward
srake ingest --file archive.tar.gz \
--date-from 2024-10-01
# Up to the end of 2020
srake ingest --file archive.tar.gz \
--date-to 2020-12-31
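For a rolling window such as "the 90 days before a given date", the start date can be computed instead of typed by hand. This is a sketch that assumes GNU date (the -d option behaves differently on BSD and macOS); the helper name days_before is hypothetical. The dates use the same YYYY-MM-DD form as the examples above, and the command is printed rather than executed.

```shell
# Hypothetical helper: print the date N days before a reference date
# as YYYY-MM-DD. Relies on GNU date; -u avoids daylight-saving shifts.
days_before() {
  date -u -d "$1 -$2 days" +%Y-%m-%d
}

date_to=2024-12-31
date_from=$(days_before "$date_to" 90)

# Print the command instead of running it, so the sketch works without srake.
echo "srake ingest --file archive.tar.gz --date-from $date_from --date-to $date_to"
```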
Platform Filtering
# Illumina data only
srake ingest --file archive.tar.gz \
--platforms ILLUMINA
# Long-read platforms
srake ingest --file archive.tar.gz \
--platforms OXFORD_NANOPORE,PACBIO_SMRT
# Multiple platforms
srake ingest --file archive.tar.gz \
--platforms ILLUMINA,ION_TORRENT
Available Platforms:
ILLUMINA
OXFORD_NANOPORE
PACBIO_SMRT
ION_TORRENT
LS454
ABI_SOLID
COMPLETE_GENOMICS
Library Strategy Filtering
Filter by experimental strategy:
# RNA sequencing only
srake ingest --file archive.tar.gz \
--strategies RNA-Seq
# Whole genome and whole exome sequencing
srake ingest --file archive.tar.gz \
--strategies WGS,WXS
# Epigenomics studies
srake ingest --file archive.tar.gz \
--strategies ChIP-Seq,ATAC-Seq,Bisulfite-Seq
# Multiple strategies
srake ingest --file archive.tar.gz \
--strategies RNA-Seq,WGS,ChIP-Seq
Common Strategies:
- RNA-Seq: RNA sequencing
- WGS: Whole Genome Sequencing
- WXS: Whole Exome Sequencing
- ChIP-Seq: Chromatin IP
- ATAC-Seq: Chromatin accessibility
- Bisulfite-Seq: DNA methylation
- Hi-C: Chromosome conformation
Quality Filtering
Filter by sequencing depth and quality:
# Minimum 10 million reads and 1 billion bases (1 Gb)
srake ingest --file archive.tar.gz \
--min-reads 10000000 \
--min-bases 1000000000
# Between 5M and 50M reads
srake ingest --file archive.tar.gz \
--min-reads 5000000 \
--max-reads 50000000
# Ultra-deep sequencing (10 billion bases or more)
srake ingest --file archive.tar.gz \
--min-bases 10000000000
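Thresholds with nine or ten zeros are easy to mistype. The sketch below expands shorthand such as 10M or 1G into the plain integers used in the examples above. The helper name to_count and the K/M/G suffix convention are assumptions for illustration; nothing in the source says srake accepts suffixes itself, so the expansion happens in the shell before the value is passed on.

```shell
# Hypothetical helper: expand K/M/G shorthand into a plain integer
# (decimal multipliers: K = 10^3, M = 10^6, G = 10^9).
to_count() {
  local value=$1 number suffix
  number=${value%[KMGkmg]}      # numeric part, suffix removed if present
  suffix=${value#"$number"}     # the suffix itself, or empty
  case $number in
    ''|*[!0-9]*) echo "to_count: invalid value '$value'" >&2; return 1 ;;
  esac
  case $suffix in
    '')  echo "$number" ;;
    K|k) echo $(( number * 1000 )) ;;
    M|m) echo $(( number * 1000000 )) ;;
    G|g) echo $(( number * 1000000000 )) ;;
  esac
}

min_reads=$(to_count 10M)
min_bases=$(to_count 1G)

# Print the command instead of running it, so the sketch works without srake.
echo "srake ingest --file archive.tar.gz --min-reads $min_reads --min-bases $min_bases"
```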
Complex Filter Combinations
Research-Specific Workflows
# Recent, deeply sequenced human RNA-Seq and WGS data (e.g. for cancer studies)
srake ingest --file archive.tar.gz \
--taxon-ids 9606 \
--strategies RNA-Seq,WGS \
--date-from 2023-01-01 \
--min-reads 20000000
# Population genomics data
srake ingest --file archive.tar.gz \
--taxon-ids 9606 \
--strategies WGS \
--platforms ILLUMINA \
--min-bases 30000000000
# Microbiome studies
srake ingest --file archive.tar.gz \
--strategies AMPLICON,WGS \
--platforms ILLUMINA,ION_TORRENT \
--exclude-taxon-ids 9606,10090
# Human and mouse Illumina RNA-Seq since 2022 (e.g. a starting point for single-cell studies)
srake ingest --file archive.tar.gz \
--taxon-ids 9606,10090 \
--strategies RNA-Seq \
--date-from 2022-01-01 \
--platforms ILLUMINA
Preview Mode
Test your filters without inserting data by passing the --stats-only flag.
Filter Processing
Filters are applied during the streaming pipeline, allowing efficient processing of large datasets without loading everything into memory.
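The idea of streaming with early rejection can be illustrated with a small awk pipeline. This is a conceptual sketch only, not srake's implementation: the function name stream_filter and the tab-separated record layout (accession, taxonomy ID, platform, read count) are hypothetical. awk handles one record at a time, so memory use does not grow with input size, and rejected records are dropped before the step that stands in for the database insert.

```shell
# Conceptual sketch only (not srake's implementation).
# Hypothetical record layout: accession <TAB> taxon_id <TAB> platform <TAB> read_count
stream_filter() {
  awk -F '\t' -v taxon="$1" -v min_reads="$2" '
    # Early rejection: count the record and skip it before any further work.
    $2 != taxon || $4 + 0 < min_reads { rejected++; next }
    # Accepted records are printed; this stands in for the database insert.
    { accepted++; print }
    # Filtering statistics go to stderr so they do not mix with the records.
    END { printf "accepted=%d rejected=%d\n", accepted, rejected > "/dev/stderr" }
  '
}

# Three sample records: one passes, one has the wrong taxon, one has too few reads.
sample=$(printf 'SRR1\t9606\tILLUMINA\t25000000\nSRR2\t562\tILLUMINA\t40000000\nSRR3\t9606\tILLUMINA\t900000')

printf '%s\n' "$sample" | stream_filter 9606 10000000
```

Because the statistics are written to stderr, the accepted records on stdout can be piped onward without extra parsing.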
Filter Configuration Files
For complex, reusable filter sets, use YAML configuration:
# filters.yaml
taxonomy:
  include: [9606, 10090]
  exclude: [562]
platforms:
  - ILLUMINA
  - OXFORD_NANOPORE
strategies:
  - RNA-Seq
  - WGS
date:
  from: "2024-01-01"
  to: "2024-12-31"
quality:
  min_reads: 10000000
  min_bases: 1000000000
# Use configuration file
srake ingest --file archive.tar.gz \
--filter-config filters.yaml
# Override specific settings
srake ingest --file archive.tar.gz \
--filter-config filters.yaml \
--date-from 2025-01-01
Best Practices
Tips for Effective Filtering
- Start with --stats-only to preview results
- Use taxonomy filters for targeted datasets
- Combine filters for precise data selection
- Save configurations for reproducible workflows
- Monitor statistics to verify filter effectiveness