Examples and Use Cases

Real-world examples demonstrating common workflows with SRAKE (SRA Knowledge Engine), including pipeline composition and automation patterns.

Research Workflows

Finding RNA-Seq Data for a Specific Organism

Find all human RNA-Seq experiments published in 2024:

# Build search index first
srake search index --build

# Search with advanced query syntax
srake search "organism:human AND library_strategy:RNA-Seq" --advanced \
  --date-from 2024-01-01 \
  --limit 100 \
  --format csv \
  --output human_rna_seq_2024.csv

# Search for specific disease studies
srake search "breast cancer AND organism:\"homo sapiens\" AND library_strategy:RNA-Seq" \
  --advanced \
  --spots-min 10000000 \
  --format json \
  --output breast_cancer_studies.json

Using Advanced Search Features

Leverage boolean operators and field-specific queries:

# Complex boolean queries
srake search "(cancer OR tumor) AND organism:human NOT cell_type:hela" --advanced

# Wildcard searches
srake search "RNA* AND platform:ILLUMINA" --advanced

# Range queries for high-coverage data
srake search "spots:[10000000 TO *] AND bases:[1000000000 TO *]" --advanced

# Fuzzy search for typo tolerance
srake search "transciptome" --fuzzy  # Finds "transcriptome"

Aggregation and Analytics

Analyze metadata distributions:

# Count studies by organism
srake search "RNA-Seq" --aggregate-by organism

# Get faceted results
srake search "cancer" --facets --format json | \
  jq '.facets.platform' | head -20

# Count total matching records
srake search "single cell" --count-only

# Group by library strategy
srake search --organism "homo sapiens" --aggregate-by library_strategy

Downloading Data for a Published Study

Download all FASTQ files for a study mentioned in a paper:

# Convert GEO accession from paper to SRA
srake convert GSE123456 --to SRP

# Get all runs for the study
srake runs SRP123456 --format json --output runs.json

# Download all runs in parallel
srake download SRP123456 \
  --type fastq \
  --source aws \
  --parallel 4 \
  --output ./fastq_files/
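
After the transfer finishes, it can be worth confirming that every run listed in runs.json actually arrived. A minimal sketch, assuming runs.json is a JSON array whose objects carry a run_accession field (the same shape the quality-control example later in this page relies on):

# Check each expected run against the output directory
jq -r '.[].run_accession' runs.json | while read -r run; do
  if ls ./fastq_files/"${run}"* >/dev/null 2>&1; then
    echo "$run OK"
  else
    echo "$run MISSING"
  fi
done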

Cross-referencing Multiple Studies

Compare samples across different studies:

# Get samples from multiple studies
srake samples SRP001 --format json > study1_samples.json
srake samples SRP002 --format json > study2_samples.json
srake samples SRP003 --format json > study3_samples.json

# Or in batch
for study in SRP001 SRP002 SRP003; do
  srake samples $study --detailed --format json > ${study}_samples.json
done
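
With the per-study JSON in hand, standard Unix tools can compare the sample sets. A sketch, assuming each file is a JSON array whose objects expose a sample_accession field (hypothetical here, by analogy with the run_accession and study_accession fields used elsewhere in these examples):

# Accessions present in both study1 and study2
jq -r '.[].sample_accession' study1_samples.json | sort > s1.txt
jq -r '.[].sample_accession' study2_samples.json | sort > s2.txt
comm -12 s1.txt s2.txt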

Unix Pipeline Integration

Composable Commands

SRAKE commands can be chained together using Unix pipes:

# Find experiments → Get runs → Download
srake search "CRISPR" --format tsv --no-header | \
  cut -f1 | \
  xargs -I {} srake runs {} --format tsv --no-header | \
  cut -f1 | \
  srake download --type fastq

# Convert accessions in bulk
cat geo_accessions.txt | \
  srake convert --to SRP --format tsv --no-header | \
  cut -f2 > sra_projects.txt

# Chain multiple conversions
echo "GSE123456" | \
  srake convert --to SRP | \
  grep SRP | \
  srake runs --format json

Stream Processing

Process large datasets without intermediate files:

# Real-time filtering and conversion
srake search "RNA-Seq" --format tsv --no-header | \
  awk '$3 > 1000000' | \
  cut -f1 | \
  while read acc; do
    srake convert $acc --to GSE --quiet
  done

# Parallel processing with xargs
srake search "mouse" --limit 100 --format tsv --no-header | \
  cut -f1 | \
  xargs -P 4 -I {} srake metadata {} --format json --quiet

Data Discovery

Finding Related Experiments

Discover all experiments related to a sample:

# Start with a sample accession
SAMPLE="SRS123456"

# Get all experiments for this sample
srake experiments $SAMPLE --detailed

# Get the parent study
srake studies $SAMPLE

# Get all other samples in the study
STUDY=$(srake studies $SAMPLE --format json | jq -r '.[0].study_accession')
srake samples $STUDY --detailed

Exploring Platform-Specific Data

Find all Oxford Nanopore sequencing data:

# Ingest only Nanopore data
srake ingest --auto \
  --platforms OXFORD_NANOPORE

# Search for specific applications
srake search "metagenome" \
  --platform OXFORD_NANOPORE \
  --limit 50

Batch Operations

Converting a List of Accessions

Convert multiple accessions from a publication supplementary table:

# Create accession list
cat > geo_accessions.txt << EOF
GSE111111
GSE222222
GSE333333
GSE444444
EOF

# Batch convert to SRA projects
srake convert --batch geo_accessions.txt \
  --to SRP \
  --format json \
  --output sra_projects.json

# Extract just the SRP IDs
cat sra_projects.json | jq -r '.[] | select(.error == null) | .targets[]' > srp_list.txt
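
The resulting list feeds naturally into the other subcommands. For example, a sketch that collects every run for the converted projects into a single TSV, reusing the --format tsv --no-header flags shown in the pipeline examples above:

# Fetch runs for each converted project
while read -r srp; do
  srake runs "$srp" --format tsv --no-header
done < srp_list.txt > all_runs.tsv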

Bulk Download with Filtering

Download only high-quality runs from multiple experiments:

# Get runs with quality metrics
srake runs SRP123456 --detailed --format json | \
  jq -r '.[] | select(.total_bases > 10000000000) | .run_accession' > \
  high_quality_runs.txt

# Download filtered runs
srake download --list high_quality_runs.txt \
  --source aws \
  --parallel 4 \
  --threads 2
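
Before committing to the transfer, it may help to gauge how much data the filter keeps. A sketch that sums the retained base counts with jq (note that total_bases counts sequenced bases, which only approximates on-disk file size):

# Total bases across the runs that pass the filter
srake runs SRP123456 --detailed --format json | \
  jq '[.[] | select(.total_bases > 10000000000) | .total_bases] | add'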

Export Examples

Exporting for R/Bioconductor

Export the database for use with the SRAdb package:

# Export with FTS3 for SRAdb compatibility
srake db export -o SRAmetadb.sqlite --fts-version 3

# Use in R
# library(SRAdb)
# sra_con <- dbConnect(SQLite(), "SRAmetadb.sqlite")
# getSRA(search_term="breast cancer", sra_con=sra_con)
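
A quick sanity check on the export before handing it to R, assuming the sqlite3 CLI is installed and the exported file exposes the sra table used in the legacy-pipeline example below:

# Confirm the export is readable and non-empty
sqlite3 SRAmetadb.sqlite "SELECT count(*) FROM sra;"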

Creating Versioned Exports

Maintain versioned exports for reproducibility:

#!/bin/bash
# export_versioned.sh

DATE=$(date +%Y%m%d)
VERSION="v1.0"

# Export with metadata
srake db export \
  -o "SRAmetadb_${VERSION}_${DATE}.sqlite" \
  --fts-version 5

# Create compatibility version
srake db export \
  -o "SRAmetadb_${VERSION}_${DATE}_compat.sqlite" \
  --fts-version 3

# Compress for storage
gzip -k "SRAmetadb_${VERSION}_${DATE}.sqlite"
gzip -k "SRAmetadb_${VERSION}_${DATE}_compat.sqlite"

# Create checksums
sha256sum SRAmetadb_*.sqlite* > checksums.txt
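
Downstream consumers can then verify the artifacts against the recorded checksums:

# Verify all exported files in place
sha256sum -c checksums.txt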

Integration with Legacy Pipelines

Export and convert for existing workflows:

# Export database
srake db export -o SRAmetadb.sqlite --fts-version 3

# Extract specific data for legacy tool
sqlite3 SRAmetadb.sqlite <<EOF
.headers on
.mode csv
.output legacy_data.csv
SELECT
  run_accession,
  experiment_accession,
  sample_accession,
  study_accession,
  library_strategy,
  platform,
  organism
FROM sra
WHERE library_strategy = 'RNA-Seq'
  AND platform = 'ILLUMINA';
EOF

# Use with legacy pipeline
./legacy_pipeline.sh --input legacy_data.csv

Automated Export Pipeline

Set up regular exports after data updates:

#!/bin/bash
# auto_export.sh

# Update database
srake ingest --auto --yes

# Rebuild search index
srake search index --rebuild

# Export both FTS versions
srake db export -o /exports/SRAmetadb_fts5.sqlite --fts-version 5
srake db export -o /exports/SRAmetadb_fts3.sqlite --fts-version 3

# Notify completion
echo "Export completed: $(date)" | mail -s "SRAmetadb Export" admin@example.com

Integration Examples

Building a Local Index

Create a searchable index of specific data types:

# 1. Ingest filtered data
srake ingest --auto \
  --organisms "homo sapiens,mus musculus" \
  --strategies "RNA-Seq,ChIP-Seq,ATAC-Seq" \
  --min-reads 10000000 \
  --date-from 2023-01-01

# 2. Export metadata for indexing
srake search "*" --limit 0 --format json > all_metadata.json

# 3. Start API server for queries
srake server --port 8080 &

# 4. Query via API
curl "http://localhost:8080/api/search?q=transcription+factor&limit=20"

Creating a Download Queue

Generate and process a download queue:

#!/bin/bash
# download_queue.sh

# Get all RNA-Seq runs from 2024
srake search "RNA-Seq" \
  --format json \
  --date-from 2024-01-01 | \
  jq -r '.results[].accession' > rna_seq_2024.txt

# Process in batches of 10
split -l 10 rna_seq_2024.txt batch_

# Download each batch
for batch in batch_*; do
  echo "Processing $batch"
  srake download --list $batch \
    --parallel 2 \
    --output ./downloads/
  sleep 60  # Pause between batches
done

Metadata Analysis Pipeline

Extract and analyze metadata for a research domain:

# Get all cancer-related studies
srake search "cancer" --format json --limit 1000 > cancer_studies.json

# Extract platform distribution
cat cancer_studies.json | \
  jq -r '.results[].platform' | \
  sort | uniq -c | sort -rn

# Get temporal distribution
cat cancer_studies.json | \
  jq -r '.results[].published' | \
  cut -d'-' -f1 | \
  sort | uniq -c
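
The same counts can be written out as TSV for plotting. A small sketch that produces year/count pairs (cancer_by_year.tsv is just an illustrative filename):

# Year<TAB>count, ready for R or gnuplot
cat cancer_studies.json | \
  jq -r '.results[].published' | \
  cut -d'-' -f1 | \
  sort | uniq -c | \
  awk '{print $2 "\t" $1}' > cancer_by_year.tsv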

Advanced Filtering

Multi-criteria Filtering

Complex filtering for specific research needs:

# Ingest single-cell RNA-seq from human brain
srake ingest --auto \
  --taxon-ids 9606 \
  --organisms "homo sapiens" \
  --strategies "RNA-Seq" \
  --min-reads 100000000 \
  --filter-verbose

# Search with additional criteria
srake search "brain OR neuron OR glia" \
  --format json | \
  jq '.results[] | select(.title | test("single.cell|sc.?RNA|10x"; "i"))'
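
To carry the matches into later steps, extract just their accessions. A sketch, assuming the results objects expose the same accession field used in the download-queue example:

# Save single-cell brain accessions for downstream commands
srake search "brain OR neuron OR glia" --format json | \
  jq -r '.results[] | select(.title | test("single.cell|sc.?RNA|10x"; "i")) | .accession' \
  > sc_brain_accessions.txt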

Quality Control Pipeline

Filter and validate high-quality datasets:

# Function to check data quality
check_quality() {
  local accession=$1

  # Get run information and evaluate each run's base count
  srake runs "$accession" --format json | \
    jq -r '.[] | "\(.run_accession):\(.total_bases):\(.total_spots)"' | \
    while IFS=: read -r run bases spots; do
      if [ "$bases" -gt 10000000000 ]; then
        echo "$run PASS"
      else
        echo "$run FAIL"
      fi
    done
}

# Check multiple studies
for study in SRP001 SRP002 SRP003; do
  echo "Checking $study"
  check_quality $study
done
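
The PASS/FAIL output is easy to turn into a download list. A sketch that keeps only the passing runs and hands them to the downloader:

# Collect passing runs and download them
for study in SRP001 SRP002 SRP003; do
  check_quality $study
done | awk '$2 == "PASS" {print $1}' > passing_runs.txt

srake download --list passing_runs.txt --parallel 2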

Performance Optimization

Parallel Processing

Maximize throughput with parallel operations:

# Parallel conversion
cat accessions.txt | \
  parallel -j 4 'srake convert {} --to GSE --format json > {}.json'

# Parallel metadata fetch
cat studies.txt | \
  parallel -j 8 'srake metadata {} --format json' > all_metadata.jsonl

# Parallel download with resource limits
nice -n 10 srake download --list large_dataset.txt \
  --parallel 4 \
  --threads 2 \
  --output /data/downloads/

Incremental Updates

Keep your database current with minimal overhead:

#!/bin/bash
# daily_update.sh

# Check last update (take everything after the first colon, trim whitespace)
LAST_UPDATE=$(srake db info | grep "Last update" | cut -d: -f2- | xargs)

# Ingest only new data
srake ingest --daily \
  --date-from "$LAST_UPDATE" \
  --no-progress

# Log update
echo "$(date): Updated from $LAST_UPDATE" >> update.log

Troubleshooting Examples

Debugging Failed Downloads

# Dry run to check URLs
srake download SRR123456 --dry-run --verbose

# Test different sources
for source in ftp aws gcp ncbi; do
  echo "Testing $source"
  srake download SRR123456 \
    --source $source \
    --dry-run
done

# Use verbose mode for debugging
srake download SRR123456 \
  --verbose \
  --retry 5

Handling Large Result Sets

# Paginate through large results
OFFSET=0
LIMIT=1000

while true; do
  # Fetch one page and save it
  srake search "human" \
    --offset $OFFSET \
    --limit $LIMIT \
    --format json > batch_$OFFSET.json

  # Count results in this page from the saved file
  COUNT=$(jq '.results | length' batch_$OFFSET.json)

  # Stop once a page comes back empty
  if [ "$COUNT" -eq 0 ]; then
    rm batch_$OFFSET.json
    break
  fi

  OFFSET=$((OFFSET + LIMIT))
done
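
Once pagination completes, the per-batch files can be merged back into one document. A sketch, assuming each batch file keeps the results array shape used above:

# Flatten all batches into a single results array
jq -s '[.[].results[]]' batch_*.json > all_results.json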

Automation Scripts

Daily Report Generator

#!/bin/bash
# daily_report.sh

DATE=$(date +%Y-%m-%d)
REPORT="report_$DATE.html"

cat > $REPORT << EOF
<html>
<head><title>SRA Daily Report - $DATE</title></head>
<body>
<h1>SRA Daily Report</h1>
<p>Generated: $(date)</p>
EOF

# Database statistics
echo "<h2>Database Statistics</h2><pre>" >> $REPORT
srake db info >> $REPORT
echo "</pre>" >> $REPORT

# New studies today
echo "<h2>New Studies</h2><pre>" >> $REPORT
srake search "*" --date-from $DATE --format table >> $REPORT
echo "</pre>" >> $REPORT

echo "</body></html>" >> $REPORT

# Email report
mail -s "SRA Daily Report $DATE" \
  -a $REPORT \
  team@example.com < /dev/null

Next Steps