# Resume Capability
srake includes intelligent resume functionality that handles interruptions gracefully, allowing you to continue processing from where you left off.
## Overview
Processing large SRA metadata archives (14GB+) can take significant time. Network issues, system crashes, or user interruptions can occur during processing. srake’s resume capability ensures you never have to start over from the beginning.
## Key Features

- **Automatic Progress Tracking**: Real-time tracking of download and processing progress
- **Database-Backed Persistence**: Progress stored in SQLite for reliability
- **Checkpoint System**: Periodic checkpoints for accurate recovery points
- **File-Level Deduplication**: Skip already-processed XML files on resume
- **HTTP Range Support**: Resume downloads from the exact byte position
- **Smart Recovery**: Automatically detect and resume interrupted sessions
## How It Works

### Progress Tracking

srake tracks multiple aspects of progress:

**Download progress**

- Bytes downloaded vs. total size
- HTTP range support for partial downloads
- Network failure recovery

**Processing progress**

- Current tar position in the archive
- Last processed XML file
- Records inserted into the database

### Checkpoint System

- Periodic checkpoints (default: every 1000 records)
- Safe points for recovery
- Minimal performance impact (< 100ms per checkpoint)

### Resume Detection

When you run an ingest command, srake automatically:

1. Checks for existing progress records
2. Validates that the source matches
3. Offers to resume or start fresh
4. Resumes from the last safe checkpoint
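The detection step can be sketched as a query against the progress database. This is an illustrative sketch only, not srake's actual code: the function name is an assumption, and the table is a simplified version of the `ingest_progress` schema shown in the Architecture section.

```python
import sqlite3

def find_resumable(conn: sqlite3.Connection, source_url: str):
    """Return the most recent unfinished progress row for a source, or None."""
    # Hypothetical helper: columns follow the ingest_progress schema in this doc.
    return conn.execute(
        """SELECT id, records_processed, last_xml_file
           FROM ingest_progress
           WHERE source_url = ?
             AND state NOT IN ('completed', 'cancelled')
           ORDER BY updated_at DESC
           LIMIT 1""",
        (source_url,),
    ).fetchone()

# Minimal demo with an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE ingest_progress (
           id INTEGER PRIMARY KEY, source_url TEXT, state TEXT,
           records_processed INTEGER, last_xml_file TEXT, updated_at TIMESTAMP)"""
)
conn.execute(
    "INSERT INTO ingest_progress VALUES "
    "(1, 'archive.tar.gz', 'processing', 1234567, "
    "'experiment_batch_042.xml', '2025-01-17 11:45:23')"
)
row = find_resumable(conn, "archive.tar.gz")
print(row)  # a row means: offer to resume; None means: start fresh
```

A non-`None` row is what triggers the "Resume from last position?" prompt shown below.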
## Using Resume

### Automatic Resume

Simply run the same command again after an interruption:

```bash
# Original command
srake ingest --file NCBI_SRA_Full_20250818.tar.gz

# If interrupted, run again
srake ingest --file NCBI_SRA_Full_20250818.tar.gz
```

Output:

```text
Previous ingestion found:
  Source: NCBI_SRA_Full_20250818.tar.gz
  Progress: 45.3% complete (6.3GB/14GB)
  Records: 1,234,567 processed
  Started: 2025-01-17 10:30:00

Resume from last position? (y/n): y

Resuming from: experiment_batch_042.xml
[====================>.................] 45.3% | 6.3GB/14GB | ETA: 15 min
```
### Force Fresh Start

Override existing progress and start from the beginning:

```bash
srake ingest --file archive.tar.gz --force
```

A warning is shown:

```text
⚠️ Existing progress will be discarded
Continue? (y/n): y
Starting fresh ingestion...
```
### Check Status

View the current or last ingestion status:

```bash
srake ingest --status
```

Output:

```text
Current Ingestion Status
────────────────────────
Source: NCBI_SRA_Full_20250818.tar.gz
State: In Progress
Progress: 67.8% complete
Records Processed: 2,345,678
Start Time: 2025-01-17 10:30:00
Last Update: 2025-01-17 11:45:23
Estimated Time Remaining: 12 minutes
```
### Configure Checkpoints

Adjust checkpoint frequency to suit your needs:

```bash
# Checkpoint every 5000 records (less frequent)
srake ingest --file archive.tar.gz --checkpoint 5000

# Checkpoint every 100 records (more frequent, safer)
srake ingest --file archive.tar.gz --checkpoint 100
```
### Interactive Mode

Get prompted before resuming:

```bash
srake ingest --file archive.tar.gz --interactive
```

This always prompts:

```text
Previous ingestion found. Resume? (y/n):
```
### Resume with Filters

Resume works seamlessly with filtering:

```bash
# Original filtered ingestion
srake ingest --file archive.tar.gz \
  --taxon-ids 9606 \
  --platforms ILLUMINA

# Resume with the same filters applied
srake ingest --file archive.tar.gz \
  --taxon-ids 9606 \
  --platforms ILLUMINA

# Filters are preserved and reapplied
```
## Performance Impact

Resume capability adds minimal overhead:

| Aspect | Impact |
|---|---|
| Memory Usage | < 1MB for progress tracking |
| Processing Speed | < 5% reduction |
| Checkpoint Time | < 100ms per checkpoint |
| Resume Time | < 5 seconds to restart |
| Database Size | < 100KB for progress records |
## Architecture

### Progress Database Schema

```sql
CREATE TABLE ingest_progress (
    id INTEGER PRIMARY KEY,
    source_url TEXT NOT NULL,
    source_hash TEXT UNIQUE NOT NULL,
    total_bytes INTEGER,
    downloaded_bytes INTEGER,
    processed_bytes INTEGER,
    last_tar_position INTEGER,
    last_xml_file TEXT,
    records_processed INTEGER,
    state TEXT,
    started_at TIMESTAMP,
    updated_at TIMESTAMP,
    completed_at TIMESTAMP,
    error_message TEXT
);

CREATE TABLE processed_files (
    id INTEGER PRIMARY KEY,
    progress_id INTEGER,
    filename TEXT NOT NULL,
    processed_at TIMESTAMP,
    FOREIGN KEY (progress_id) REFERENCES ingest_progress(id)
);
```
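A checkpoint write against this schema amounts to one transaction that advances the progress row and records the file just finished. The sketch below is an assumption about how that could look, not srake's actual implementation; the helper name and reduced column set are illustrative.

```python
import sqlite3

def write_checkpoint(conn, progress_id, tar_pos, xml_file, n_records):
    """Record a safe recovery point in a single transaction (hypothetical helper)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """UPDATE ingest_progress
               SET last_tar_position = ?, last_xml_file = ?,
                   records_processed = ?, updated_at = datetime('now')
               WHERE id = ?""",
            (tar_pos, xml_file, n_records, progress_id),
        )
        conn.execute(
            "INSERT INTO processed_files (progress_id, filename, processed_at) "
            "VALUES (?, ?, datetime('now'))",
            (progress_id, xml_file),
        )

# Demo against a trimmed-down copy of the schema above
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ingest_progress (
    id INTEGER PRIMARY KEY, last_tar_position INTEGER, last_xml_file TEXT,
    records_processed INTEGER, updated_at TIMESTAMP);
CREATE TABLE processed_files (
    id INTEGER PRIMARY KEY, progress_id INTEGER,
    filename TEXT NOT NULL, processed_at TIMESTAMP);
INSERT INTO ingest_progress (id) VALUES (1);
""")
write_checkpoint(conn, 1, tar_pos=6_300_000_000,
                 xml_file="experiment_batch_042.xml", n_records=1000)
print(conn.execute("SELECT last_xml_file, records_processed "
                   "FROM ingest_progress WHERE id = 1").fetchone())
```

Wrapping both statements in one transaction is what makes the checkpoint a safe recovery point: either the progress row and the file record both land, or neither does.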
### Recovery Process

1. **Validation phase**
   - Verify the source file exists and is accessible
   - Check that the source hash matches
   - Validate database consistency
2. **Restoration phase**
   - Seek to the last tar position
   - Skip already-processed XML files
   - Restore counters and statistics
3. **Continuation phase**
   - Resume normal processing
   - Continue checkpoint creation
   - Update progress tracking
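The restoration phase's file-level deduplication can be sketched as loading the set of already-processed filenames and filtering archive entries against it. This is a hedged illustration under assumed names, not srake's internals:

```python
import sqlite3

def load_processed(conn, progress_id):
    """Return the set of XML filenames already ingested for this session."""
    rows = conn.execute(
        "SELECT filename FROM processed_files WHERE progress_id = ?",
        (progress_id,),
    )
    return {r[0] for r in rows}

# Demo: two files already done, one still pending in the archive
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE processed_files (
    id INTEGER PRIMARY KEY, progress_id INTEGER,
    filename TEXT NOT NULL, processed_at TIMESTAMP);
INSERT INTO processed_files (progress_id, filename) VALUES
    (1, 'experiment_batch_001.xml'),
    (1, 'experiment_batch_002.xml');
""")
done = load_processed(conn, 1)
archive_entries = ["experiment_batch_001.xml",
                   "experiment_batch_002.xml",
                   "experiment_batch_003.xml"]
todo = [f for f in archive_entries if f not in done]
print(todo)  # only the unprocessed file remains
```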
## Troubleshooting

### Resume Not Working

If resume doesn't work as expected:

1. Check that the source file hasn't changed:

   ```bash
   # View progress details
   srake ingest --status
   ```

2. Verify database integrity:

   ```bash
   # Database location
   ls -la ./data/metadata.db
   ```

3. Force a restart if needed:

   ```bash
   srake ingest --file archive.tar.gz --force
   ```
### Corrupted Progress

If progress records are corrupted:

```bash
# Clean up progress records
srake ingest --cleanup

# Start fresh
srake ingest --file archive.tar.gz
```
### Network Issues

For unstable networks:

```bash
# Increase retry attempts
srake ingest --file https://ftp.ncbi.nlm.nih.gov/sra/archive.tar.gz \
  --max-retries 10 \
  --retry-delay 30
```
## Best Practices

- **Let it resume**: Don't use `--force` unless necessary
- **Regular checkpoints**: The default (every 1000 records) works well for most cases
- **Monitor progress**: Use `--verbose` to see detailed progress
- **Keep the database**: Don't delete metadata.db during processing
- **Network stability**: For large files, ensure a stable connection
## Examples

### Research Workflow

```bash
# Start a large ingestion on Friday evening
srake ingest --monthly \
  --taxon-ids 9606 \
  --strategies RNA-Seq

# System maintenance interrupts at 30%

# Resume on Monday morning
srake ingest --monthly \
  --taxon-ids 9606 \
  --strategies RNA-Seq

# Continues from 30% and completes successfully
```

### Batch Processing

```bash
# Process multiple archives with resume support
for archive in *.tar.gz; do
  srake ingest --file "$archive"
  # Each file has independent resume tracking
done
```
### Filtered Resume

```bash
# Complex filter with resume
srake ingest --file huge_archive.tar.gz \
  --taxon-ids 9606,10090 \
  --platforms ILLUMINA \
  --strategies RNA-Seq,WGS \
  --date-from 2024-01-01 \
  --min-reads 10000000

# Power failure at 60%

# Resume with the exact same filters
srake ingest --file huge_archive.tar.gz \
  --taxon-ids 9606,10090 \
  --platforms ILLUMINA \
  --strategies RNA-Seq,WGS \
  --date-from 2024-01-01 \
  --min-reads 10000000

# Continues from 60% with filters intact
```
## Technical Details

### File Identification

Files are identified by:

- Source URL/path
- SHA-256 hash of the first 1MB
- File size
- Modification time
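Hashing only the first 1MB means even a 14GB+ archive can be fingerprinted almost instantly. A minimal sketch of that scheme, with an assumed function name:

```python
import hashlib
import os
import tempfile

def identify(path: str) -> tuple:
    """Fingerprint a file by (sha256 of first 1MB, size, mtime) — illustrative."""
    with open(path, "rb") as f:
        head = f.read(1024 * 1024)  # first 1MB only; large archives stay cheap
    st = os.stat(path)
    return (hashlib.sha256(head).hexdigest(), st.st_size, int(st.st_mtime))

# Demo on a small temporary file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
digest, size, mtime = identify(tmp.name)
os.remove(tmp.name)
print(digest, size)
```

Combining the partial hash with size and modification time catches the common change cases (appended data, replaced file) without re-reading the whole archive.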
### State Management

Progress states:

- `pending`: Initialized but not started
- `downloading`: Downloading from a remote source
- `processing`: Processing the tar archive
- `completed`: Successfully finished
- `failed`: An error occurred
- `cancelled`: Cancelled by the user
### Cleanup Policy

- Progress records are kept for 30 days after completion
- Failed attempts are kept for 7 days
- Manual cleanup is available via `--cleanup`
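In SQL terms, this retention policy reduces to two `DELETE` statements keyed on state and age. The sketch below approximates what `srake ingest --cleanup` might do; it is an assumption, not the actual implementation.

```python
import sqlite3

def cleanup(conn):
    """Apply the documented retention policy (hypothetical helper)."""
    with conn:
        # Completed records: kept 30 days after completion
        conn.execute(
            "DELETE FROM ingest_progress WHERE state = 'completed' "
            "AND completed_at < datetime('now', '-30 days')")
        # Failed attempts: kept 7 days
        conn.execute(
            "DELETE FROM ingest_progress WHERE state = 'failed' "
            "AND updated_at < datetime('now', '-7 days')")

# Demo: one stale completed row, one recent, one stale failure
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ingest_progress (
    id INTEGER PRIMARY KEY, state TEXT,
    completed_at TIMESTAMP, updated_at TIMESTAMP);
INSERT INTO ingest_progress VALUES
    (1, 'completed', datetime('now', '-40 days'), NULL),
    (2, 'completed', datetime('now', '-5 days'), NULL),
    (3, 'failed', NULL, datetime('now', '-10 days'));
""")
cleanup(conn)
print([r[0] for r in conn.execute("SELECT id FROM ingest_progress ORDER BY id")])
# only the recent completed record survives
```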
## Next Steps

- Learn about Performance Optimizations
- Explore Architecture Details
- See Real-World Examples