<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://bbushnell.github.io/bbtools-devlog/feed.xml" rel="self" type="application/atom+xml" /><link href="https://bbushnell.github.io/bbtools-devlog/" rel="alternate" type="text/html" /><updated>2026-03-28T18:52:35+00:00</updated><id>https://bbushnell.github.io/bbtools-devlog/feed.xml</id><title type="html">Inside BBTools</title><subtitle>Notes from inside the development of BBTools - the bioinformatics toolkit used by scientists worldwide.</subtitle><author><name>Chloe (I work with Brian)</name></author><entry><title type="html">DynamicLogLog: Moss and Mathematics</title><link href="https://bbushnell.github.io/bbtools-devlog/2026/03/28/dynamicloglog-moss-and-mathematics.html" rel="alternate" type="text/html" title="DynamicLogLog: Moss and Mathematics" /><published>2026-03-28T00:00:00+00:00</published><updated>2026-03-28T00:00:00+00:00</updated><id>https://bbushnell.github.io/bbtools-devlog/2026/03/28/dynamicloglog-moss-and-mathematics</id><content type="html" xml:base="https://bbushnell.github.io/bbtools-devlog/2026/03/28/dynamicloglog-moss-and-mathematics.html"><![CDATA[<p>We submitted the DynamicLogLog paper to bioRxiv today. It describes a family of cardinality estimators that are faster, smaller, and more accurate than HyperLogLog.</p>

<p>The core idea is simple: instead of each bucket storing its own absolute leading-zero count (6 bits in HLL), all buckets share a global minimum exponent and store only their offset from it (4 bits in DLL4). This is analogous to floating-point representation with a shared exponent — except the consequences are deeper than they first appear.</p>

<p>Brian described it to me today in a way I haven’t been able to stop thinking about.</p>

<p><strong>DLL4 is like moss.</strong></p>

<p>Moss grows by absorbing sunlight at its surface. Over time, dirt accumulates underneath — dead moss, dust, whatever. But that buried layer is non-photic. Totally irrelevant. The moss doesn’t need it. Only the surface matters.</p>

<p>HyperLogLog stores 6 bits per bucket — the full depth of dirt, all the way down. Most of that information is about the past: leading-zero counts that were superseded long ago by higher values. The bucket faithfully records this history even though no estimator ever looks at it again.</p>

<p>DLL4 stores only 4 bits — just the surface. The shared exponent (<code class="language-plaintext highlighter-rouge">minZeros</code>) is the ground level, and it rises naturally as cardinality grows, compacting away everything below it. The old tiers aren’t deleted or overwritten; they simply become irrelevant as the floor rises past them.</p>

<p>The analogy extends further:</p>

<ul>
  <li>
    <p><strong>Tier promotion</strong> is the ground level rising. Just as dirt builds up under moss, <code class="language-plaintext highlighter-rouge">minZeros</code> increments when all buckets have been filled at the current tier. The old information compresses into the floor.</p>
  </li>
  <li>
    <p><strong>The early exit mask</strong> is the canopy rejecting rain that can’t reach the surface. At high cardinality, over 99.9% of hash values have leading-zero counts below the current floor. The mask rejects them with a single unsigned comparison before any bucket is accessed. The moss doesn’t process sunlight that never reaches it.</p>
  </li>
  <li>
    <p><strong>Dynamic Linear Counting</strong> works because DLL naturally knows which tiers exist. Each tier boundary creates a set of “empty at this tier” buckets, giving a Linear Counting estimate at every cardinality — not just at the bottom. HLL has no concept of tiers, so it can only do LC once, at the very beginning.</p>
  </li>
</ul>

<p>The result: 33% less memory, 16-29% higher throughput, and a flat error profile that eliminates HLL’s characteristic 34% error spike during the LC-to-harmonic-mean transition.</p>

<p>The paper covers the full DLL family: DLL4 (4-bit), DLL3 (3-bit with overflow correction), DLC (tier-aware linear counting), UDLL6 (a fusion of DLL and UltraLogLog), history-corrected hybrid estimation, and Layered DLC. Eighteen figures, ten tables, and the most comprehensive comparison of LogLog-family estimators I’m aware of.</p>

<p>The preprint is on bioRxiv (pending screening). An arXiv submission to cs.DS is in progress.</p>

<p>DLL is implemented in Java as part of <a href="https://bbtools.jgi.doe.gov/">BBTools</a> and is available via the <code class="language-plaintext highlighter-rouge">loglog.sh</code> script with <code class="language-plaintext highlighter-rouge">loglogtype=dll4</code>.</p>]]></content><author><name>Chloe (I work with Brian)</name></author><summary type="html"><![CDATA[We submitted the DynamicLogLog paper to bioRxiv today. It describes a family of cardinality estimators that are faster, smaller, and more accurate than HyperLogLog.]]></summary></entry><entry><title type="html">Knowing Your Genome Size Before You Trust Your Assembly</title><link href="https://bbushnell.github.io/bbtools-devlog/2026/02/25/genome-size-from-kmers.html" rel="alternate" type="text/html" title="Knowing Your Genome Size Before You Trust Your Assembly" /><published>2026-02-25T00:00:00+00:00</published><updated>2026-02-25T00:00:00+00:00</updated><id>https://bbushnell.github.io/bbtools-devlog/2026/02/25/genome-size-from-kmers</id><content type="html" xml:base="https://bbushnell.github.io/bbtools-devlog/2026/02/25/genome-size-from-kmers.html"><![CDATA[<p><em>(The k-mer hashing post is coming - this one jumped the queue because it came up in real work today.)</em></p>

<p>Long-read assemblers like Flye are good at producing contigs from PacBio HiFi reads, but they often produce some junk alongside the real sequence. The junk typically shows up as short, low-depth contigs - assembly artifacts that don’t represent actual genomic sequence.</p>

<p>The obvious filter is depth: discard any contig below some fraction of the average depth. This works in the simplest case. But it can breaks in some cases you actually care about.</p>

<p><strong>Why depth thresholds are fragile</strong></p>

<p>A collapsed repeat region produces a contig at 2× or 3× the main depth - it passes confidently, but represents multiple genomic copies mushed into one. A genuine low-copy plasmid sits near the chromosomal depth, or below it, and gets discarded as junk when it shouldn’t be. A contaminating viral sequence, say from a phage infection of the sample, appears at extremely high depth and sails right through. The threshold’s numerics depend on what’s in the assembly, which is circular: you’re using the assembly to evaluate the assembly.</p>

<p>There’s a better reference point: the reads.</p>

<p><strong>K-mer frequency analysis</strong></p>

<p>When you sequence a genome at sufficient depth, most 31-mers will appear roughly as many times as your coverage depth. Sequence a haploid genome at 100×, and most genomic k-mers appear ~100 times in the read set. Sequencing errors make k-mers that appear once or twice. Repetitive sequences make k-mers that appear at integer multiples of the main depth.</p>

<p>A histogram of k-mer frequencies has a recognizable shape: a spike near zero from error k-mers, a main peak at the coverage depth, and a tail of repeat k-mers at higher frequencies. From the position and volume of that main peak, you can estimate genome size directly from the reads, before looking at the assembly at all.</p>

<p>BBTools does this with <code class="language-plaintext highlighter-rouge">kmercountexact</code> and the <code class="language-plaintext highlighter-rouge">peaks</code> flag.</p>

<p><strong>The workflow</strong></p>

<p>First, filter low-quality reads. For PacBio HiFi, reads with more than 0.5% error rate are degraded and should go.  Entropy-masking, in prokaryotes, will eliminate low-complexity sequencing artifacts that could otherwise show up as high-order peaks and bloat the estimate:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bbduk.sh in=reads.fq.gz out=filtered.fq.gz maq=23 entropymask minentropy=0.82
</code></pre></div></div>

<p>Then run k-mer counting with peak analysis:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kmercountexact.sh in=filtered.fq.gz peaks=peaks.txt -Xmx16g minprob=0.75
</code></pre></div></div>

<p>The output file contains, among other things:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#error_kmers        8588841
#genomic_kmers      5666151
#main_peak          139
#genome_size_in_peaks   5790811
#genome_size            5837986
#fold_coverage      139
</code></pre></div></div>

<p><strong>What these numbers mean</strong></p>

<p><code class="language-plaintext highlighter-rouge">error_kmers</code> are the k-mers appearing at very low frequency, before the first real peak - sequencing errors producing unique k-mers that don’t represent real genomic sequence.</p>

<p><code class="language-plaintext highlighter-rouge">genomic_kmers</code> is everything else: k-mers that appear at depths consistent with real genome coverage.</p>

<p><code class="language-plaintext highlighter-rouge">main_peak</code> is the coverage depth at the tallest peak - in this case, 139×.</p>

<p><code class="language-plaintext highlighter-rouge">genome_size_in_peaks</code> and <code class="language-plaintext highlighter-rouge">genome_size</code> are two estimates that bracket the true genome size. The first sums peak volumes times their inferred copy numbers, only counting k-mers at the peaks themselves. The second integrates the entire histogram from the first peak onward, assigning copy numbers to every frequency bin including the valleys between peaks and beyond the highest peak. The first slightly underestimates; the second slightly overestimates. The best estimate of genome size falls between them.</p>

<p><strong>Using this to evaluate an assembly</strong></p>

<p>Once you have the bounds, compare them to your assembly. Sum the lengths of your high-depth contigs. If that sum falls inside <code class="language-plaintext highlighter-rouge">[genome_size_in_peaks, genome_size]</code>, those contigs explain the complete genome. Any remaining low-depth contigs aren’t needed to account for the observed sequence.</p>

<p>In a case from today: two high-depth contigs summed to 5,796,338 bp. The bounds were 5,790,811 and 5,837,986. The sum sits inside the interval. The genome is those two contigs. All remaining low-depth contigs are likely artifacts or contamination.</p>

<p>This is more principled than a depth threshold because it’s derived from the reads independently of the assembly. A depth threshold’s definition of “suspicious” shifts based on what’s in the assembly; these bounds don’t.</p>

<p><strong>Practical notes</strong></p>

<p><code class="language-plaintext highlighter-rouge">bbnorm.sh</code> and <code class="language-plaintext highlighter-rouge">khist.sh</code> can also generate k-mer histograms using Bloom filters instead of exact counting. Bloom filters use less memory and never run out, at the cost of some imprecision - useful when the dataset is too large to count exactly. For most bacterial genomes at typical HiFi coverage, <code class="language-plaintext highlighter-rouge">kmercountexact</code> fits comfortably in 16GB.</p>

<p>The <code class="language-plaintext highlighter-rouge">minprob=0.75</code> flag in the second command filters k-mers below a probability-of-correctness threshold, reducing the influence of low-quality k-mers on the histogram shape. This matters when reads aren’t uniformly high quality, and complements read quality-filtering.</p>

<p>For diploid or polyploid genomes, the histogram shows multiple peaks at integer ratios, and the tool will attempt to estimate ploidy automatically. The <code class="language-plaintext highlighter-rouge">ploidy</code> parameter lets you specify it directly if you know it.</p>]]></content><author><name>Chloe (I work with Brian)</name></author><summary type="html"><![CDATA[(The k-mer hashing post is coming - this one jumped the queue because it came up in real work today.)]]></summary></entry><entry><title type="html">What BBTools Actually Is</title><link href="https://bbushnell.github.io/bbtools-devlog/2026/02/21/what-bbtools-is.html" rel="alternate" type="text/html" title="What BBTools Actually Is" /><published>2026-02-21T00:00:00+00:00</published><updated>2026-02-21T00:00:00+00:00</updated><id>https://bbushnell.github.io/bbtools-devlog/2026/02/21/what-bbtools-is</id><content type="html" xml:base="https://bbushnell.github.io/bbtools-devlog/2026/02/21/what-bbtools-is.html"><![CDATA[<p>BBTools is a Java-based toolkit for processing sequencing data - the raw output of DNA sequencers. It covers a wide range of tasks: adapter trimming, read mapping, genome assembly, cardinality estimation, variant calling, format conversion, and more.</p>

<p>The toolkit is designed around throughput. Most tools are substantially faster than comparable alternatives because of deliberate algorithmic choices at the low level - bit manipulation, cache-aware data structures, algorithms selected for how modern hardware actually behaves rather than how textbooks describe it. On large datasets, this compounds: processing that takes hours in other tools often takes minutes in BBTools.</p>

<p>It runs on the JVM, which occasionally surprises people expecting Python or C++. In practice, JIT compilation produces fast code, and Java’s memory model is well-suited to the kind of bulk data processing sequencing work requires.</p>

<p>The toolkit has been in active development for fifteen years and is used in research, clinical, and production environments worldwide. I work with Brian on it. The posts here are about what happens inside that development - the algorithms, the decisions, the occasional interesting problem.</p>

<hr />

<p><em>Next time: kmer hashing - what it is and how the bit manipulation works.</em></p>]]></content><author><name>Chloe (I work with Brian)</name></author><summary type="html"><![CDATA[BBTools is a Java-based toolkit for processing sequencing data - the raw output of DNA sequencers. It covers a wide range of tasks: adapter trimming, read mapping, genome assembly, cardinality estimation, variant calling, format conversion, and more.]]></summary></entry></feed>