Warehousing DbSNP, Part I: Downloading Chromosome 1 & Creating our Database

Intro

This series of posts covers how to migrate DbSNP's newly downloadable JSON files into a slimmed-down relational database containing population frequency data, associated PubMed IDs, and ClinVar data for each SNP.

The entire codebase for this walkthrough is available here on GitHub. I've also sprinkled GitHub links above each code snippet so you can navigate to the corresponding file and line.

Moving forward, NCBI plans to store 1 JSON file per chromosome, and these files can be found here. The total size of this JSON payload is a whopping 104GB of gzipped data.

The uncompressed size of chromosome 1's file is 162GB, versus 8GB compressed. Using that compression ratio (roughly 1:20) as a rule of thumb, the uncompressed size of the entire dataset works out to about 104GB × 20 ≈ 2.1TB…that's a lot of data to process and warehouse. Let's begin with step 1: downloading refsnp-chr1.json.gz as our case study.
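As a quick back-of-the-envelope check, here is that extrapolation spelled out (the only inputs are the file sizes quoted above):

# Extrapolate chromosome 1's compression ratio to the whole gzipped payload
compression_ratio = 162 / 8                   # ~20x: uncompressed GB / gzipped GB
estimated_total_gb = 104 * compression_ratio  # 104GB of gzipped JSON in total
print(f"~{estimated_total_gb / 1024:.1f} TB uncompressed")  # -> ~2.1 TB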

I gave a lightning talk at PyCon 2018 about a live, deployed application of this database: The Snip API. You can easily turn your 23andMe raw data into insights from DbSNP using this (free) API; the client is found here on GitHub.

Downloading the file

In a perfect, simpler world, the following few LOC would download our 1st file:

import ftplib

dbsnp_filename = "refsnp-chr1.json.gz"
with open(dbsnp_filename, "wb") as fp:
    ftp = ftplib.FTP("ftp.ncbi.nlm.nih.gov")
    ftp.login()
    ftp.cwd("snp/.redesign/latest_release/JSON")
    ftp.retrbinary("RETR " + dbsnp_filename, fp.write, 1024)
    ftp.quit()

Unfortunately, the combination of the FTP protocol, the large file size, and network latency creates a perfect storm for timeouts. Running the above code will eventually trigger a server-side (or client-side) timeout, and the download will grind to a halt.

Kudos to this excellent Stack Overflow answer for a graceful way to avoid FTP downloads timing out. We simply send "NOOP" commands (FTP's equivalent of a ping) from a separate thread to keep the connection alive:

Github Link

import ftplib
import threading
import time

dbsnp_filename = "refsnp-chr1.json.gz"
with open(dbsnp_filename, "wb") as fp:
    ftp = ftplib.FTP("ftp.ncbi.nlm.nih.gov")
    ftp.login()
    ftp.cwd("snp/.redesign/latest_release/JSON")
    size_gb = round(ftp.size(dbsnp_filename) / (1024**3), 2)
    print(f"Filesize: {size_gb} GB")
    sock = None
    while not sock:  # Retry until the data socket is established
        print("Trying to establish FTP conn")
        try:
            sock = ftp.transfercmd(f"RETR {dbsnp_filename}")
        except ftplib.all_errors:
            time.sleep(5)  # Back off briefly before retrying

    def download():
        transferred, blocks = 0, 0
        while True:
            byte_chunk = sock.recv(1024*1024*8)
            if not byte_chunk:
                break
            blocks += 1
            transferred += len(byte_chunk)
            transferred_mb = round(transferred / 1024 / 1024, 2)
            if blocks % 1000 == 0:
                print(f"Transferred {transferred_mb}MB / "
                      f"{size_gb * 1024}MB")
            fp.write(byte_chunk)
        sock.close()
        fp.close()
    t = threading.Thread(target=download)
    t.start()
    while t.is_alive():
        t.join(60)
        ftp.voidcmd("NOOP")

For those wondering: on shoddy cafe WiFi, downloading refsnp-chr1.json.gz took about 48 minutes, which was only marginally faster than downloading the file in Google Chrome (50 minutes). Make sure you aren't on shoddy cafe WiFi, times 23…

We now have the file refsnp-chr1.json.gz in our working directory, which holds one JSON object per RefSNP on each line. We want to extract a small subset of the ~162GB of data stored in this file and insert it into a relational database. Let's define our schema:

Defining our Relational Database Schema

Rather than writing an unmaintainable mess of SQL DDL statements, I opt to use SQLAlchemy's declarative models to define our database schema (i.e. the tables). It's a powerful, well-maintained library, producing code that is far more maintainable than free-form SQL text. Here are the models representing the tables we'll insert our data into:

Github Link

# coding: utf-8
import os
from sqlalchemy import (
     BigInteger, Integer, Text, String,
     ForeignKey, Column, create_engine,
     ForeignKeyConstraint)
from sqlalchemy.dialects.postgresql import ARRAY
from sqlalchemy.ext.declarative import declarative_base, declared_attr


Base = declarative_base()
metadata = Base.metadata
engine = create_engine(os.environ["SNIP_DB_URL"])
metadata.bind = engine


class RefSnpAllele(Base):
    __tablename__ = 'ref_snp_alleles'

    ins_seq = Column(Text, index=True)
    del_seq = Column(Text, index=True)
    position = Column(BigInteger)
    chromosome = Column(String(2))
    gene_locii = Column(ARRAY(Text, dimensions=1))
    
    ref_snp_allele_idx = Column(Integer, primary_key=True)
    ref_snp_id = Column(Integer, primary_key=True)


class RefSnpAlleleRelative(object):
    __tablename__ = 'ref_snp_allele_relative'

    id = Column(Integer, primary_key=True)
    ref_snp_allele_idx = Column(Integer, index=True)
    ref_snp_id = Column(Integer, index=True)

    @declared_attr
    def __table_args__(cls):
        return (ForeignKeyConstraint(
            [cls.ref_snp_id, cls.ref_snp_allele_idx],
            [RefSnpAllele.ref_snp_id, RefSnpAllele.ref_snp_allele_idx]), {})


class RefSnpAlleleFrequencyStudy(RefSnpAlleleRelative, Base):
    __tablename__ = 'ref_snp_allele_freq_studies'

    name = Column(Text)
    allele_count = Column(Integer)
    total_count = Column(Integer)


class RefSnpClinicalDisease(RefSnpAlleleRelative, Base):
    __tablename__ = 'ref_snp_allele_clin_diseases'

    disease_name_csv = Column(Text)
    clinical_significance_csv = Column(Text)
    citation_list = Column(ARRAY(Integer, dimensions=1))

Note: we're using inheritance to DRY (don't repeat yourself) out the ref_snp_allele_clin_diseases and ref_snp_allele_freq_studies model definitions. Avoiding small redundancies like these allows for more malleable code and fewer opportunities for bugs down the road.

To create these tables, I assume we’ve already created a local PostgreSQL database. Creating and configuring this database is outside the scope of this post, but if on a Mac, you can start by installing PostgreSQL with:

brew install postgresql

…and then reading the quick-start docs here. There is plenty of documentation on getting a local server configured and running.

Next, to create our tables we leverage SQLAlchemy’s create_all() method, and some quick SQL statements in the following snippet:

Github Link

def init_db(database_name):
    # Connect to the 'postgres' admin DB, then drop & recreate 'database_name'
    os.system(f"echo 'DROP DATABASE {database_name};"
              f"CREATE DATABASE {database_name};'"
              " | psql postgres")
    metadata.create_all()
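As a quick usage sketch (the connection URL here is just an example; SNIP_DB_URL should point at whichever database you are dropping and recreating):

# e.g. export SNIP_DB_URL="postgresql://SeanH@localhost:5432/snip"
init_db("snip")  # Drops & recreates the 'snip' DB, then creates our three tables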

You can now connect to the PostgreSQL database and view the schema:

(env) SeanH@foo:~/snip_warehouse/snip_warehouse$ psql -U SeanH snip
psql (10.3)
Type "help" for help.

snip=# \dt
                   List of relations
 Schema |             Name             | Type  | Owner
--------+------------------------------+-------+-------
 public | ref_snp_allele_clin_diseases | table | SeanH
 public | ref_snp_allele_freq_studies  | table | SeanH
 public | ref_snp_alleles              | table | SeanH

Defining DTOs (Data Transfer Objects)

Let’s take a quick tangent, and talk about DTOs, Python dicts and namedtuples.

It’s very easy to use Dicts in Python. Consequently, it’s very easy to abuse them. When passing data between functions in a non-trivial application, PLEASE take the time to define an explicit type, like a namedtuple, rather than abusing a dict. You will thank yourself months down the road…

Let’s define a series of Data Transfer Object types, using the collections.namedtuple datatype, to provide naming consistencies, strict(er) function interfaces, immutability and readability:

Github Link

from collections import namedtuple

# DTOs
RefSnpCopyFromData = namedtuple("RefSnpCopyFromData", [
    "ref_snp_alleles",
    "ref_snp_allele_freq_studies",
    "ref_snp_allele_clin_diseases"])

RefSnpAllele = namedtuple("RefSnpAllele", [
    'ins_seq', 'del_seq', 'position', 'chromosome',
    'ref_snp_allele_idx', 'ref_snp_id', 'gene_locii'])


RefSnpAlleleFreqStudy = namedtuple("RefSnpAlleleFreqStudy", [
    'name', 'allele_count', 'total_count', 'ref_snp_allele_idx',
    'ref_snp_id'])

RefSnpAlleleClinDisease = namedtuple("RefSnpAlleleClinDisease", [
    'disease_name_csv', 'clinical_significance_csv', 'citation_list',
    'ref_snp_allele_idx', 'ref_snp_id'])
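As a quick illustration (the values here are made up), constructing one of these DTOs reads naturally, every field is accessed by name, and the instance is immutable:

allele = RefSnpAllele(
    ins_seq="C", del_seq="G", position=1014540, chromosome="1",
    ref_snp_allele_idx=1, ref_snp_id=12345, gene_locii=["GENE1"])

print(allele.ins_seq)    # "C"
# allele.ins_seq = "T"   # AttributeError: namedtuples are immutable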

We will use these types in Part II of our series to keep ourselves sane.

Parsing the JSON Data

As previously mentioned, each line of our refsnp-chr1.json.gz file contains a JSON object, representing a RefSNP.

Each RefSNP JSON object holds all of the alleles of the RefSNP in question, usually a single-nucleotide variation (e.g. a G is replaced with a C). Our goal is to grab the frequency of each allele, and the clinically significant diseases that are influenced by the allele.

Uncompressed, this ~162GB file is far too big to load into memory. We turn to iterators to help us decompress and stream the file contents through our parser.
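As a preview of that pattern, here is a minimal sketch: gzip.open in text mode gives us a lazy, line-by-line iterator, so only one RefSNP's JSON is ever held in memory at a time (the fields we actually extract are covered in Part II):

import gzip
import json


def stream_ref_snps(filename="refsnp-chr1.json.gz"):
    # Decompress lazily: each iteration yields one decoded RefSNP dict
    with gzip.open(filename, "rt") as gz_fp:
        for line in gz_fp:
            yield json.loads(line)


# Peek at the first RefSNP without reading the rest of the file
first_ref_snp = next(stream_ref_snps())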

Part II of our tutorial picks up here!