Massively parallel DNA sequencing is revolutionizing molecular biology. Aside from greatly increasing the ease of sequencing genomes, these technologies can be used in conjunction with small-scale experiments, producing gigabases of data probing one of many possible biological processes. This presents an exciting challenge to the quantitative biology community: can the ready availability of such massive data sets be used to address questions that were previously inaccessible to experiment?
In this spirit I will describe a way of using sequencing to measure, in living cells, the biophysical interactions within specific multiprotein-DNA complexes that regulate mRNA transcription. I will show that, by measuring the activity of hundreds of thousands of slightly mutated versions of a specific regulatory DNA sequence, one can determine where on that sequence proteins bind, model the sequence-dependent binding energies of these proteins, and even measure protein-protein interaction energies within the resulting complexes. This approach was demonstrated on the E. coli lac promoter and is now being extended to other systems, both bacterial and eukaryotic. The necessary data analysis, however, presents a formidable machine learning problem, one that we are only in the early stages of solving. I will discuss the experimental, computational, and theoretical aspects of this effort.
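To make the quantities mentioned above concrete, here is a minimal Python sketch of what a sequence-dependent binding energy and a protein-protein interaction energy could look like in a simple equilibrium picture. It is an illustration under stated assumptions, not the model used in this work: it assumes each position of the binding site contributes additively to the energy, that energies are in units of k_B T measured relative to the chemical potential, and that two proteins occupy separate sites. The function names and the toy matrix values are hypothetical.

```python
import math

BASES = "ACGT"

def binding_energy(seq, energy_matrix):
    """Additive model (an assumption): sum one energy term per position.

    energy_matrix[i][j] is the contribution of base BASES[j] at position i.
    """
    if len(seq) != len(energy_matrix):
        raise ValueError("sequence length must match matrix length")
    return sum(energy_matrix[i][BASES.index(b)] for i, b in enumerate(seq))

def occupancy(energy):
    """Equilibrium probability that a single site is bound (two states)."""
    return 1.0 / (1.0 + math.exp(energy))

def bound_probability(e_a, e_b, e_int):
    """Probability that protein B is bound when protein A can bind nearby.

    Four states: empty, A only, B only, both bound. e_int is the
    interaction energy gained when both are bound (negative = attractive).
    """
    w_a = math.exp(-e_a)
    w_b = math.exp(-e_b)
    w_ab = math.exp(-(e_a + e_b + e_int))
    z = 1.0 + w_a + w_b + w_ab  # partition function over the four states
    return (w_b + w_ab) / z

# Toy 3-position matrix with made-up values; columns follow BASES order.
toy_matrix = [
    [0.0, 1.0, 1.5, 2.0],
    [1.2, 0.0, 0.8, 1.1],
    [0.5, 0.9, 0.0, 1.3],
]

if __name__ == "__main__":
    for site in ("ACG", "CAT"):
        e = binding_energy(site, toy_matrix)
        print(site, e, occupancy(e))
    # An attractive interaction raises the occupancy of B.
    print(bound_probability(1.0, 2.0, 0.0), bound_probability(1.0, 2.0, -2.0))
```

In this picture, setting the interaction energy to zero reduces the two-protein expression to the single-site occupancy, which is why a measurable shift in activity can be attributed to the interaction term.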