This interface produces both data tables and new cmp.h5 files, and is meant to resemble SQL. At the heart of the tools is a small query language for extracting alignments and computing statistics over them. The three relevant clauses are what, where, and groupBy: what names the metrics or statistics to compute, where filters the alignments, and groupBy splits the alignments into groups before the what clause is evaluated.
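To make the roles of the three clauses concrete, here is a minimal Python sketch of the same idea applied to plain dictionaries. It is only an illustration: the `query` function and the sample records are hypothetical and are not part of cmph5tools, which operates on HDF5 files rather than in-memory lists.

```python
from collections import defaultdict
from statistics import median

# Hypothetical in-memory stand-in for alignments (not the cmp.h5 layout).
alignments = [
    {"Barcode": "bc_1--bc_1", "ReadLength": 500},
    {"Barcode": "bc_1--bc_1", "ReadLength": 620},
    {"Barcode": "bc_2--bc_2", "ReadLength": 480},
    {"Barcode": "bc_2--bc_2", "ReadLength": 150},
]


def query(rows, what, where=lambda row: True, group_by=lambda row: "all"):
    """Filter rows (where), bucket them (groupBy), then summarize (what)."""
    groups = defaultdict(list)
    for row in rows:
        if where(row):
            groups[group_by(row)].append(row)
    return {key: what(members) for key, members in groups.items()}


result = query(
    alignments,
    what=lambda rows: median(r["ReadLength"] for r in rows),
    where=lambda r: r["ReadLength"] >= 200,
    group_by=lambda r: r["Barcode"],
)
print(result)
```

The ordering matters: the filter runs first, so the 150-base read never reaches its group, and the statistic is computed once per surviving group.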
Take 50% of the reads:
$ cmph5tools.py select --where "SubSample(rate=.5)" \
> --outFile ss.cmp.h5 aligned_reads.cmp.h5
Filter by AverageBarcodeScore and group by Barcode:
$ cmph5tools.py select --where "AverageBarcodeScore >= 30" \
> --groupBy Barcode aligned_reads.cmp.h5
Grouped Statistics:
$ cmph5tools.py stats --what "Tbl(q = Percentile(ReadLength, 90), m = Median(Accuracy))" \
> --groupBy Barcode aligned_reads.cmp.h5 | tail
bc_88--bc_88 486.40 0.91
bc_89--bc_89 561.00 0.91
bc_9--bc_9 479.80 0.90
bc_90--bc_90 563.60 0.89
bc_91--bc_91 554.60 0.91
bc_92--bc_92 523.00 0.90
bc_93--bc_93 542.00 0.90
bc_94--bc_94 518.00 0.90
bc_95--bc_95 512.20 0.91
bc_96--bc_96 609.60 0.92
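The fractional values in the first statistics column (for example 486.40) suggest that the percentile is interpolated between neighbouring read lengths. The sketch below shows one common way to do that, linear interpolation, which is also NumPy's default. Whether cmph5tools uses exactly this method is an assumption; the `percentile` helper and the sample data are hypothetical.

```python
def percentile(values, ptile):
    """Linear-interpolation percentile; ptile is in the range 0..100."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sequence")
    # Fractional position of the requested percentile in the sorted data.
    rank = (len(ordered) - 1) * ptile / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


# Hypothetical read lengths for a single barcode group.
read_lengths = [400, 450, 500, 550, 600]
q90 = percentile(read_lengths, 90)
print(f"{q90:.2f}")
```

With five values the 90th percentile falls between the fourth and fifth sorted entries, which is why the result need not be one of the observed read lengths.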
Metrics and Statistics:
$ cmph5tools.py listMetrics
--- Metrics:
ByFactor[metric, factor, statistic]
_MoleculeReadStart
_MinSubreadLength
_MaxSubreadLength
_UnrolledReadLength
DefaultWhere
DefaultGroupBy
TemplateSpan
The number of template bases covered by the read
ReadLength
NErrors
ReadDuration
FrameRate
IPD
PulseWidth
Movie
Reference
RefIdentifier
HoleNumber
ReadStart
ReadEnd
TemplateStart
TemplateEnd
MoleculeId
MoleculeName
Strand
AlignmentIdx
Barcode
AverageBarcodeScore
MapQV
WhiteList
SubSample[rate, n]
boolean vector with true occuring at rate rate or nreads = n
--- Statistics:
Min
Max
Sum
Mean
Median
Count
Percentile[metric, ptile]
Round[metric, digits]
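The SubSample description above says it yields a boolean vector that is true either at a given rate or for a fixed number of reads. A minimal Python sketch of that behaviour follows. The function name `subsample_mask`, its `seed` parameter, and the choice to let `n` take precedence over `rate` are assumptions made for illustration, not the cmph5tools implementation.

```python
import random


def subsample_mask(n_reads, rate=None, n=None, seed=None):
    """Return a list of booleans marking which reads are kept."""
    rng = random.Random(seed)
    if n is not None:
        # Keep exactly n reads (or all of them if fewer are available).
        chosen = set(rng.sample(range(n_reads), min(n, n_reads)))
        return [i in chosen for i in range(n_reads)]
    if rate is None:
        raise ValueError("either rate or n must be given")
    # Keep each read independently with probability `rate`.
    return [rng.random() < rate for _ in range(n_reads)]


mask_by_count = subsample_mask(100, n=10, seed=1)
mask_by_rate = subsample_mask(100, rate=0.5, seed=1)
print(sum(mask_by_count), len(mask_by_rate))
```

Note the difference between the two modes: a fixed `n` gives an exact count, while a rate gives a count that varies from run to run around `rate * n_reads`.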
Filter by barcode and group by reference:
$ cmph5tools.py stats --what "Tbl(a=Accuracy,b=Barcode)" \
> --where "Barcode == 'bc_78--bc_78'" \
> --groupBy Reference aligned_reads.cmp.h5
Group a b
MET_600_t2_2 0.96 bc_78--bc_78
MET_600_t2_2 0.82 bc_78--bc_78
MET_600_t2_2 0.85 bc_78--bc_78
MET_600_t2_2 0.89 bc_78--bc_78
MET_600_t2_2 0.87 bc_78--bc_78
MET_600_t2_2 0.90 bc_78--bc_78
MET_600_t2_2 0.90 bc_78--bc_78
MET_600_t2_2 0.94 bc_78--bc_78
Count alignments:
$ cmph5tools.py stats --what "Count(Reference)" \
> --where "Barcode == 'bc_78--bc_78'" \
> --groupBy Reference aligned_reads.cmp.h5
Group Count(Reference)
MET_600_t2_2 8