Intro
This series of posts covers how to migrate DbSNP’s newly downloadable JSON files into a slimmed-down relational database containing population frequency data, associated PubMed IDs, and ClinVar data for each SNP.
The entire codebase of this walkthrough is available here on GitHub. I’ve also sprinkled in GitHub links above all code snippets, so you can navigate to each corresponding file & line.
Moving forward, NCBI plans to store 1 JSON file per chromosome, and these files can be found here. The total size of this JSON payload is a whopping 104GB of gzipped data.
The uncompressed size of chromosome 1’s file is 162GB, while the compressed size is 8GB. Using this compression ratio (roughly 1:20) as a rule of thumb, the uncompressed size of the entire dataset is about 2.1TB… that’s a lot of data to process and warehouse.
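If you want to sanity-check that estimate, the arithmetic is just the numbers quoted above:

compressed_chr1_gb = 8        # refsnp-chr1.json.gz on disk
uncompressed_chr1_gb = 162    # chromosome 1 after decompression
total_compressed_gb = 104     # every chromosome's file, gzipped

ratio = uncompressed_chr1_gb / compressed_chr1_gb      # ~20x
est_total_tb = total_compressed_gb * ratio / 1024      # roughly 2.1 TB
print(f"Estimated uncompressed dataset: {est_total_tb:.1f} TB")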
Let’s begin with step 1: downloading refsnp-chr1.json.gz as our case study.
I gave a lightning talk at PyCon 2018 about a live, deployed application of this database: The Snip API. You can easily turn your 23andMe raw data into insights from DbSNP using this (free) API; the client is found here on GitHub.
Downloading the file
In a perfect, simpler world, the following few LOC would download our 1st file:
import ftplib

dbsnp_filename = "refsnp-chr1.json.gz"
with open(dbsnp_filename, "wb") as fp:
    ftp = ftplib.FTP("ftp.ncbi.nlm.nih.gov")
    ftp.login()
    ftp.cwd("snp/.redesign/latest_release/JSON")
    ftp.retrbinary("RETR " + dbsnp_filename, fp.write, 1024)
    ftp.quit()
Unfortunately, the combination of the FTP protocol, the large file size, and network latency creates a perfect storm for server-side and client-side timeouts. Running the above code will eventually hit a timeout on one end or the other, and the download will come to a halt.
Kudos to this excellent Stack Overflow answer for a graceful way to keep FTP downloads from timing out. We just send simple “NOOP” commands, the FTP equivalent of a ‘PING’, from a separate thread to keep our connection alive:
import ftplib
import threading
import time

dbsnp_filename = "refsnp-chr1.json.gz"
with open(dbsnp_filename, "wb") as fp:
    ftp = ftplib.FTP("ftp.ncbi.nlm.nih.gov")
    ftp.login()
    ftp.cwd("snp/.redesign/latest_release/JSON")
    size_gb = round(ftp.size(dbsnp_filename) / (1024**3), 2)
    print(f"Filesize: {size_gb} GB")
    sock = None
    while not sock:  # Try to open the socket conn w/ server
        print("Trying to establish FTP conn")
        try:
            sock = ftp.transfercmd(f"RETR {dbsnp_filename}")
        except ftplib.all_errors:
            time.sleep(5)

    def download():
        transferred, blocks = 0, 0
        while True:
            byte_chunk = sock.recv(1024 * 1024 * 8)
            if not byte_chunk:  # Empty read -> EOF, download complete
                break
            blocks += 1
            transferred += len(byte_chunk)
            transferred_mb = round(transferred / 1024 / 1024, 2)
            if blocks % 1000 == 0:
                print(f"Transferred {transferred_mb}MB / "
                      f"{size_gb * 1024}MB")
            fp.write(byte_chunk)
        sock.close()

    t = threading.Thread(target=download)
    t.start()
    while t.is_alive():  # Ping the control conn so the server doesn't drop us
        t.join(60)
        ftp.voidcmd("NOOP")
For those wondering, on shoddy cafe WiFi, downloading refsnp-chr1.json.gz takes about 48 minutes, which was insignificantly less than the time it takes to download the file in Google Chrome (50 minutes). Keep in mind you’ll be repeating this for every chromosome’s file, so make sure you aren’t on shoddy cafe WiFi, times 23…
We now have the file refsnp-chr1.json.gz in our working directory, which holds one JSON object per RefSNP on each line. We want to extract a small subset of the ~160GB of data stored in this file and insert it into a relational database. Let’s define our schema:
Defining our Relational Database Schema
Rather than writing an unmaintainable mess of SQL DDL statements, I opt to use SQLAlchemy to define our database schema (i.e. the tables). It’s a powerful, well-maintained library, and it produces code that is far more maintainable than free-form SQL text. Here are the models representing the tables we’ll insert our data into:
# coding: utf-8
import os

from sqlalchemy import (
    BigInteger, Integer, Text, String,
    ForeignKey, Column, create_engine,
    ForeignKeyConstraint)
from sqlalchemy.dialects.postgresql import ARRAY
from sqlalchemy.ext.declarative import declarative_base, declared_attr

Base = declarative_base()
metadata = Base.metadata
engine = create_engine(os.environ["SNIP_DB_URL"])
metadata.bind = engine


class RefSnpAllele(Base):
    __tablename__ = 'ref_snp_alleles'
    ins_seq = Column(Text, index=True)
    del_seq = Column(Text, index=True)
    position = Column(BigInteger)
    chromosome = Column(String(2))
    gene_locii = Column(ARRAY(Text, dimensions=1))
    ref_snp_allele_idx = Column(Integer, primary_key=True)
    ref_snp_id = Column(Integer, primary_key=True)


class RefSnpAlleleRelative(object):
    # Mixin: shared columns + composite foreign key back to ref_snp_alleles
    __tablename__ = 'ref_snp_allele_relative'
    id = Column(Integer, primary_key=True)
    ref_snp_allele_idx = Column(Integer, index=True)
    ref_snp_id = Column(Integer, index=True)

    @declared_attr
    def __table_args__(cls):
        return (ForeignKeyConstraint(
            [cls.ref_snp_id, cls.ref_snp_allele_idx],
            [RefSnpAllele.ref_snp_id, RefSnpAllele.ref_snp_allele_idx]), {})


class RefSnpAlleleFrequencyStudy(RefSnpAlleleRelative, Base):
    __tablename__ = 'ref_snp_allele_freq_studies'
    name = Column(Text)
    allele_count = Column(Integer)
    total_count = Column(Integer)


class RefSnpClinicalDisease(RefSnpAlleleRelative, Base):
    __tablename__ = 'ref_snp_allele_clin_diseases'
    disease_name_csv = Column(Text)
    clinical_significance_csv = Column(Text)
    citation_list = Column(ARRAY(Integer, dimensions=1))
Note: We’re using inheritance to DRY (don’t repeat yourself) out the ref_snp_allele_clin_diseases and ref_snp_allele_freq_studies model definitions. Avoiding small redundancies like these allows for more malleable code, and less opportunity for bugs down the road.
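As an illustration of that malleability, suppose we later wanted another per-allele table (this one is purely hypothetical, not part of the schema above). Thanks to the mixin, it only needs to declare its own columns; the composite foreign key back to ref_snp_alleles comes along for free:

# Hypothetical example only -- not part of the actual warehouse schema.
class RefSnpAllelePubmedCitation(RefSnpAlleleRelative, Base):
    __tablename__ = 'ref_snp_allele_pubmed_citations'
    pubmed_id = Column(Integer)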
To create these tables, I assume we’ve already created a local PostgreSQL database. Creating and configuring this database is outside the scope of this post, but if you’re on a Mac, you can start by installing PostgreSQL with:
brew install postgresql
…and then reading the quick docs here. There is plenty of documentation out there on getting a local database configured.
Next, to create our tables we leverage SQLAlchemy’s create_all() method, along with some quick SQL statements, in the following snippet:
def init_db(database_name):
    # Connect to the 'postgres' admin DB, then drop and recreate 'database_name'
    os.system(f"echo 'DROP DATABASE {database_name};"
              f" CREATE DATABASE {database_name};'"
              " | psql postgres")
    metadata.create_all()
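For example, with the SNIP_DB_URL environment variable from the models module pointing at your local server (the URL below is just a placeholder; adjust the user, host, and database to your own setup), bootstrapping the schema is a single call:

# In the shell first, e.g.: export SNIP_DB_URL="postgresql://SeanH@localhost/snip"
init_db("snip")   # drops & recreates the 'snip' DB, then creates the three tables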
You can now connect to the PostgreSQL database and view the schema:
(env) SeanH@foo:~/snip_warehouse/snip_warehouse$ psql -U SeanH snip
psql (10.3)
Type "help" for help.

snip=# \dt
                List of relations
 Schema |             Name             | Type  | Owner
--------+------------------------------+-------+-------
 public | ref_snp_allele_clin_diseases | table | SeanH
 public | ref_snp_allele_freq_studies  | table | SeanH
 public | ref_snp_alleles              | table | SeanH
Defining DTOs (Data Transfer Objects)
Let’s take a quick tangent and talk about DTOs, Python dicts, and namedtuples.

It’s very easy to use dicts in Python. Consequently, it’s very easy to abuse them. When passing data between functions in a non-trivial application, PLEASE take the time to define an explicit type, like a namedtuple, rather than abusing a dict. You will thank yourself months down the road…

Let’s define a series of Data Transfer Object types, using the collections.namedtuple datatype, to provide naming consistency, strict(er) function interfaces, immutability, and readability:
from collections import namedtuple

# DTOs
RefSnpCopyFromData = namedtuple("RefSnpCopyFromData", [
    "ref_snp_alleles",
    "ref_snp_allele_freq_studies",
    "ref_snp_allele_clin_diseases"])

RefSnpAllele = namedtuple("RefSnpAllele", [
    'ins_seq', 'del_seq', 'position', 'chromosome',
    'ref_snp_allele_idx', 'ref_snp_id', 'gene_locii'])

RefSnpAlleleFreqStudy = namedtuple("RefSnpAlleleFreqStudy", [
    'name', 'allele_count', 'total_count', 'ref_snp_allele_idx',
    'ref_snp_id'])

RefSnpAlleleClinDisease = namedtuple("RefSnpAlleleClinDisease", [
    'disease_name_csv', 'clinical_significance_csv', 'citation_list',
    'ref_snp_allele_idx', 'ref_snp_id'])
We will use these types in part II of our series to keep ourselves sane.
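As a quick sanity check of what these buy us, here is a toy RefSnpAllele (the field values below are made up purely for illustration):

allele = RefSnpAllele(
    ins_seq="C", del_seq="G", position=12345, chromosome="1",
    ref_snp_allele_idx=0, ref_snp_id=99999, gene_locii=["SOME_GENE"])

allele.chromosome             # named, readable access at the call site: '1'
allele._replace(position=0)   # returns a modified *copy*; the original is untouched
# allele.position = 0         # AttributeError -- namedtuples are immutable
# RefSnpAllele(ins_seq="C")   # TypeError -- a missing field fails loudly, not silently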
Parsing the JSON Data
As previously mentioned, each line of our refsnp-chr1.json.gz file contains a JSON object representing a RefSNP.

Each RefSNP JSON object holds all of the alleles of the RefSNP in question, usually a single-nucleotide variation (e.g. a G is replaced with a C). Our goal is to grab the frequency of each allele, and the clinically significant diseases that are influenced by each allele.
Uncompressed, the ~160GB file is too big to load into memory. We turn to iterators to help us decompress and stream the file contents through our parser.
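Here’s a minimal sketch of that idea, just to make it concrete (stream_ref_snps is an illustrative helper, not code from the repo; the real parser in part II does the heavy lifting):

import gzip
import json

def stream_ref_snps(filename):
    """Lazily yield one parsed RefSNP JSON object per line,
    holding only a single record in memory at a time."""
    with gzip.open(filename, "rt") as gz_fp:  # "rt" decompresses to text lines
        for line in gz_fp:
            yield json.loads(line)

# Peek at the top-level keys of the first RefSNP record
for ref_snp in stream_ref_snps("refsnp-chr1.json.gz"):
    print(sorted(ref_snp))
    break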