Retrieve Nucleotide Data

This notebook will likely be broken into at least two
# !cd ../ && pip install -e '.[dev]'
import os

import numpy as np
import pandas as pd

import plotly.express as px

import hilbertcurve
from hilbertcurve.hilbertcurve import HilbertCurve

    # !conda install openpyxl -y
    # ! conda install h5py -y
# ! pip install hilbertcurve

Access marker records

Working with these records proved tricky. Ultimately I need the nucleotide data in a tensor, but after using tassel to save the data (ZeaGBSv27_publicSamples_imputedV5_AGPv4-181023) as a table (along with position list and taxa list) it’s too big to easily load (>30Gb). As a work around to easily access specific genomes, I split the table into a separate file for the header and each genome so that these files can be read piecemeal. See the Readme below for more details.


source

read_txt

 read_txt (path, **kwargs)

source

taxa_to_filename

 taxa_to_filename (taxa='05-397:250007467')
taxa_to_filename(taxa = '05-397:250007467')
'05-397__250007467'

source

find_AGPv4

 find_AGPv4 (taxa, **kwargs)

Search for existing marker sets __

Details
taxa should be the desired taxa or a regex fragment (stopping before the __). E.g. ‘B73’ or ‘B’
kwargs

source

get_AGPv4

 get_AGPv4 (taxa, **kwargs)

Retrieve an existing marker set

# def get_AGPv4( 
#     taxa,
#     table_directory = '../data/zma/panzea/genotypes/GBS/v27/ZeaGBSv27_publicSamples_imputedV5_AGPv4-181023_Table/'
# ):
#     with open(table_directory+taxa, 'r') as f:
#         data = f.read()    
#     data = data.split('\t')
#     return(data)
get_AGPv4('05-397__250007467')[0:4]
['05-397:250007467', 'T', 'T', 'A']

In addition to returning a specific taxa, the table’s headers can be retieved with “taxa”.

get_AGPv4(taxa = 'taxa')[0:4]
['Taxa', '52399', '54239', '54405']

Converting between site and chromosome/position requires the AGPv4_site dataframe. A given record contains the taxa as well as the nucleotides, so with that entry excluded the chromosome / position can be paired up.

len(get_AGPv4(taxa = 'taxa')), AGPv4_site.shape
(943456, (943455, 4))
ith_taxa = '05-397:250007467'
res = get_AGPv4(taxa_to_filename(taxa = ith_taxa))   # Retrieve record
temp = AGPv4_site.loc[:, ['Chromosome', 'Position']]  
temp[res[0]] = res[1:]                               # Add Col. with Nucleotides
temp.head()
Chromosome Position 05-397:250007467
0 1 52399 T
1 1 54239 T
2 1 54405 A
3 1 55917 N
4 1 55927 N

Look at SNP coverage

mask = (temp.Chromosome == 1)

temp_pos = temp.loc[mask, ['Position']]
temp_pos['Shift'] = 0
temp_pos.loc[1: , ['Shift']] = np.array(temp_pos.Position)[:-1]
temp_pos['Diff'] = temp_pos['Position'] - temp_pos['Shift']

temp_pos.loc[0, 'Diff'] = None
temp_pos
Position Shift Diff
0 52399 0 NaN
1 54239 52399 1840.0
2 54405 54239 166.0
3 55917 54405 1512.0
4 55927 55917 10.0
... ... ... ...
147145 306971046 306910117 60929.0
147146 306971061 306971046 15.0
147147 306971063 306971061 2.0
147148 306971073 306971063 10.0
147149 306971080 306971073 7.0

147150 rows × 3 columns

# px.histogram(temp_pos, x = 'Diff')

Encode a marker sequence into ATCG

res = get_AGPv4(taxa_to_filename(taxa = '05-397:250007467')) 
res = res[1:] # drop taxa

source

list_to_ACGT

 list_to_ACGT (in_seq, progress=False)
Type Default Details
in_seq This should be a list with strings corresponding to IUPAC codes e.g. [‘A’, ‘C’, ‘Y’]
progress bool False
res = list_to_ACGT(in_seq = res)
res = res[0:1000]
100%|███████████████████████████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 14.64it/s]

Hilbert Curve Transform

res.shape
(1000, 4)

source

calc_needed_hilbert_p

 calc_needed_hilbert_p (n_needed=1048576, max_p=20)

source

np_2d_to_hilbert

 np_2d_to_hilbert (in_seq)
Details
in_seq This should be a 2d numpy array with dimensions of [sequence, channels]

source

np_3d_to_hilbert

 np_3d_to_hilbert (in_seq)

This is the 3d version of np_2d_to_hilbert. The goal is to process all of the samples of an array in one go.

Details
in_seq This should be a 3d numpy array with dimensions of [samples, sequence, channels]
demo = np_2d_to_hilbert(
    in_seq = np.asarray([np.linspace(1, 100, num= 50),
                         np.linspace(100, 1, num= 50)]).T
)
100%|███████████████████████████████████████████████████████████████████████████████| 50/50 [00:00<00:00, 168852.82it/s]
px.imshow(demo[:,:,0])
px.imshow(demo[:,:,1])

Apply to subset of real marker data

Explictly convert a taxa

taxa_to_filename(taxa = '05-397:250007467')
'05-397__250007467'

Or search for a taxa

find_AGPv4(taxa = '05-397')
['05-397__250007467']

Retrieve the sequence data

res = get_AGPv4(taxa_to_filename(taxa = '05-397:250007467')) 
res = res[1:] # drop taxa
res[0:10]
['T', 'T', 'A', 'N', 'N', 'N', 'N', 'C', 'C', 'N']

Convert from characters to encoded nucleotide probabilities

res = list_to_ACGT(in_seq = res)
res = res[0:1000]
res
100%|███████████████████████████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 15.00it/s]
array([[0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.],
       ...,
       [0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.]])

Convert the sequence to a hilbert curve

# This will happen under the hood
# calc_needed_hilbert_p(n_needed=res.shape[0])
res_hilb = np_2d_to_hilbert(
    in_seq = res
)
100%|██████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 1209081.58it/s]
px.imshow( res[0:20, 0:1] )
px.imshow( res_hilb[:, :, 0] )
px.imshow( res_hilb[:, :, 1] )