SRAmetadb Export

Export your srake database to the classic SRAmetadb.sqlite format for compatibility with existing bioinformatics tools and workflows.

Overview

The export feature creates a SQLite database that matches the original SRAmetadb schema, allowing seamless integration with tools and pipelines that expect the traditional format. This bridges the gap between srake’s modern architecture and legacy systems.

Key Features

  • Dual FTS Support: Choose between FTS3 (legacy compatibility) and FTS5 (modern performance)
  • Complete Schema Mapping: Maps srake’s modern schema to classic SRAmetadb tables
  • Compression Support: Optional gzip compression for exported databases
  • Batch Processing: Configurable batch sizes for efficient memory usage
  • Progress Tracking: Real-time progress updates for large exports
  • Flexible Output: Export to any location with custom naming

Quick Start

Basic Export

Export your srake database to SRAmetadb format:

# Export with modern FTS5 (recommended)
srake db export -o SRAmetadb.sqlite

# Export with FTS3 for legacy tool compatibility
srake db export -o SRAmetadb.sqlite --fts-version 3

Export Options

# Export from specific database
srake db export --db /path/to/srake.db -o output.sqlite

# Export with compression
srake db export -o SRAmetadb.sqlite.gz --compress

# Large dataset with custom batch size
srake db export -o SRAmetadb.sqlite --batch-size 50000

# Force overwrite existing file
srake db export -o SRAmetadb.sqlite --force

Output Schema

The exported database contains all the standard SRAmetadb tables and structures:

Core Tables

  Table       Description
  ----------  -----------------------------------
  study       Research studies and projects
  experiment  Sequencing experiments
  sample      Biological samples
  run         Sequencing runs
  submission  Data submissions
  sra         Denormalized table joining all data
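
Because the sra table denormalizes the join across all of the tables above, simple aggregate queries can run directly against it. A minimal sketch (column names as used in the classic SRAmetadb schema):

# Runs per study, straight from the denormalized sra table
sqlite3 SRAmetadb.sqlite \
  "SELECT study_accession, COUNT(*) AS n_runs
   FROM sra
   GROUP BY study_accession
   ORDER BY n_runs DESC
   LIMIT 10"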

Additional Components

  • Full-text Search: sra_ft virtual table for text searching
  • Metadata: metaInfo table with version and creation information
  • Documentation: col_desc table with column descriptions
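
For example, the metaInfo and col_desc tables can be inspected directly to see what a given export contains (the exact columns may vary between export versions):

# Export metadata (schema version, creation date, etc.)
sqlite3 SRAmetadb.sqlite "SELECT * FROM metaInfo"

# Column documentation
sqlite3 SRAmetadb.sqlite ".schema col_desc"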

FTS Version Comparison

Build Note: FTS5 support requires building srake with the sqlite_fts5 build tag. Use make build or go build -tags "sqlite_fts5".

FTS5 (Default - Recommended)

srake db export -o modern.sqlite --fts-version 5

Advantages:

  • 🚀 Faster search performance
  • 💾 Smaller index size
  • 🌍 Better Unicode support
  • 🔍 Advanced search features
  • 📊 Built-in ranking functions (see the example below)

Best for:

  • New projects
  • Modern analysis pipelines
  • Performance-critical applications
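
As an example of the ranking functions mentioned above, an FTS5 export can order matches by relevance with SQLite's built-in bm25() function. A sketch, assuming sra_ft mirrors the columns of sra as in the examples later on this page:

# Rank matches by relevance (lower, i.e. more negative, scores rank higher)
sqlite3 modern.sqlite \
  "SELECT run_accession, bm25(sra_ft) AS score
   FROM sra_ft
   WHERE sra_ft MATCH 'cancer'
   ORDER BY score
   LIMIT 10"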

FTS3 (Legacy Compatibility)

srake db export -o legacy.sqlite --fts-version 3

Advantages:

  • ✅ 100% compatibility with older tools
  • 📦 Works with legacy R packages
  • 🔧 No modifications needed for existing scripts

Best for:

  • Existing pipelines requiring FTS3
  • Legacy bioinformatics tools
  • Older R/Bioconductor packages
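
Queries written against the original SRAmetadb.sqlite keep working unchanged against an FTS3 export, for example:

sqlite3 legacy.sqlite \
  "SELECT COUNT(*) FROM sra_ft WHERE sra_ft MATCH 'transcriptome'"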

Integration Examples

With R/Bioconductor

# Using SRAdb package
library(SRAdb)

# Load exported database
sqlfile <- "SRAmetadb.sqlite"
sra_con <- dbConnect(SQLite(), sqlfile)

# Query as usual
rs <- dbGetQuery(sra_con,
  "SELECT * FROM sra WHERE study_accession = 'SRP000001'")

# Full-text search (quote hyphenated terms as a phrase so they parse under FTS5)
results <- dbGetQuery(sra_con,
  "SELECT * FROM sra_ft WHERE sra_ft MATCH 'cancer AND \"RNA-Seq\"'")

With Python

import sqlite3
import pandas as pd

# Connect to exported database
conn = sqlite3.connect('SRAmetadb.sqlite')

# Query with pandas
query = """
SELECT study_accession, study_title, organism
FROM study
WHERE organism = 'Homo sapiens'
"""
df = pd.read_sql_query(query, conn)

# Full-text search
fts_query = """
SELECT * FROM sra_ft
WHERE sra_ft MATCH 'single cell'
LIMIT 100
"""
results = pd.read_sql_query(fts_query, conn)

Command Line

# Query exported database
sqlite3 SRAmetadb.sqlite "SELECT COUNT(*) FROM sra"

# Full-text search (hyphenated terms are quoted so FTS5 parses them as a phrase)
sqlite3 SRAmetadb.sqlite \
  "SELECT run_accession FROM sra_ft WHERE sra_ft MATCH '\"ChIP-Seq\"'"

# Export results to CSV
sqlite3 -header -csv SRAmetadb.sqlite \
  "SELECT * FROM experiment WHERE platform = 'ILLUMINA'" > results.csv

Advanced Usage

Automated Export Pipeline

Create a script for regular exports:

#!/bin/bash
# export_pipeline.sh
set -euo pipefail

# Set paths
SRAKE_DB="/data/srake/srake.db"
EXPORT_DIR="/data/exports"
DATE=$(date +%Y%m%d)

# Make sure the output directory exists
mkdir -p "$EXPORT_DIR"

# Export with both FTS versions
srake db export \
  --db "$SRAKE_DB" \
  -o "$EXPORT_DIR/SRAmetadb_fts5_$DATE.sqlite" \
  --fts-version 5 \
  --batch-size 50000

srake db export \
  --db "$SRAKE_DB" \
  -o "$EXPORT_DIR/SRAmetadb_fts3_$DATE.sqlite" \
  --fts-version 3 \
  --batch-size 50000

# Compress for archival
gzip "$EXPORT_DIR/SRAmetadb_fts5_$DATE.sqlite"
gzip "$EXPORT_DIR/SRAmetadb_fts3_$DATE.sqlite"

echo "Export completed: $DATE"

Performance Optimization

For large databases, optimize the export process:

# Use larger batch size for better performance
srake db export -o output.sqlite --batch-size 100000

# Export without progress bar for scripts
srake db export -o output.sqlite --no-progress --quiet

# Monitor memory usage during export
srake db export -o output.sqlite --debug
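
For a one-off check of peak memory, the export can also be wrapped in GNU time on Linux (a sketch; on macOS the equivalent flag is -l):

/usr/bin/time -v srake db export -o output.sqlite --batch-size 100000 2>&1 \
  | grep "Maximum resident set size"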

Validation

Verify the exported database:

# Check table counts
sqlite3 SRAmetadb.sqlite <<EOF
SELECT 'Studies:', COUNT(*) FROM study;
SELECT 'Experiments:', COUNT(*) FROM experiment;
SELECT 'Samples:', COUNT(*) FROM sample;
SELECT 'Runs:', COUNT(*) FROM run;
SELECT 'SRA Records:', COUNT(*) FROM sra;
EOF

# Test FTS functionality
sqlite3 SRAmetadb.sqlite \
  "SELECT COUNT(*) FROM sra_ft WHERE sra_ft MATCH 'test'"

# Verify schema
sqlite3 SRAmetadb.sqlite ".schema study"
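
SQLite's built-in integrity check is also worth running on the exported file before distributing it:

# Prints "ok" if the database file is structurally sound
sqlite3 SRAmetadb.sqlite "PRAGMA integrity_check"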

Schema Mapping

The export process handles the mapping between srake’s modern schema and the classic SRAmetadb format:

Field Transformations

  srake Field        SRAmetadb Field         Transformation
  -----------------  ----------------------  -----------------------------------
  JSON arrays        Pipe-delimited strings  ["A","B"] → "A|B"
  Nested metadata    Flattened columns       Extracted to individual fields
  Modern timestamps  Legacy date format      ISO 8601 → YYYY-MM-DD HH:MM:SS
  Missing fields     Default values          Populated with appropriate defaults
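
The JSON-array mapping, for instance, is simply a join on the pipe character; this jq one-liner illustrates the transformation (illustrative only, not srake's internal implementation):

echo '["A","B"]' | jq -r 'join("|")'
# prints: A|B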

Data Integrity

  • Relationships preserved: Foreign keys maintained between tables
  • Accessions unchanged: All SRA accessions remain identical
  • Metadata complete: All essential metadata included
  • URLs generated: SRA and Entrez links created automatically

Troubleshooting

Common Issues

Large Database Export

For databases over 10GB, use larger batch sizes and consider disabling progress:

srake db export -o output.sqlite --batch-size 100000 --no-progress

Memory Usage

The export process requires roughly twice the batch size worth of records in RAM. Reduce --batch-size if you run into memory issues.
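
If the default batch size (10000) is too large for the available RAM, a smaller value keeps memory bounded at the cost of a slower export:

srake db export -o output.sqlite --batch-size 5000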

Verification

Always verify the export completed successfully:

sqlite3 output.sqlite "SELECT COUNT(*) FROM sra"

Error Messages

  Error                Solution
  -------------------  ----------------------------------------------------------
  database not found   Ensure the source database exists and the path is correct
  output file exists   Use --force to overwrite or choose a different name
  invalid FTS version  Use either 3 or 5 for --fts-version
  out of memory        Reduce the --batch-size value

Best Practices

  1. Choose appropriate FTS version
    • Use FTS5 for new projects
    • Use FTS3 only when required for compatibility
  2. Optimize batch size
    • Start with the default (10000)
    • Increase for better performance if RAM allows
    • Decrease if encountering memory issues
  3. Regular exports
    • Automate exports after data updates
    • Keep versioned exports for reproducibility
    • Compress old exports to save space
  4. Validation
    • Always verify record counts after export
    • Test FTS functionality
    • Check that critical queries work as expected

Next Steps