BiQ Analyzer HT
max planck institut
informatik

BiQ Analyzer HT Documentation




1. System Requirements

BiQ Analyzer HT is a cross-platform application that runs on any operating system with a recent, properly configured Java Virtual Machine. The tool requires Java SE Runtime Environment 6 Update 1 or later.

For a typical run with up to several thousand sequences per analyzed reference, BiQ Analyzer HT needs at least 512 MB of RAM. The program can process much larger numbers of sequences, but in that case the Java heap space should be increased (see the Troubleshooting section below).


2. Installation

BiQ Analyzer HT is available in two forms: via Java Web Start (the "Launch" button on the program home page) or as an interactive installer built with IzPack (http://izpack.org/). The latter is supplied as an executable jar file (see Downloads). To install BiQ Analyzer HT into one of the system directories (e.g. the MS Windows Program Files directory), the jar file should be run with administrator privileges.


3. Configuration and Preparation of the Sequencing Data

For a basic analysis run BiQ Analyzer HT does not require any configuration. For the configuration of BiQ Analyzer 3.0 see Section 6.

In a typical experimental scenario, several target amplicons are amplified from the bisulfite-treated DNA of each sample under consideration. BiQ Analyzer HT assumes that the sequence reads obtained for each sample-amplicon combination are separated and stored in a single FASTA file. Because the number of available sequencing platforms continues to grow, platform-specific data preparation steps were not included in BiQ Analyzer HT, and users should rely on the software deployed with the sequencing machines and on custom scripts. The popular FLX system from Roche (454) is particularly suitable for targeted bisulfite sequencing due to its favorable average read length. A suite of command-line tools for preparing 454 sequence data for loading into BiQ Analyzer HT is available for download.

Alternatively, BiQ Analyzer HT supports loading mapped reads from genome-wide sequencing experiments. The reads should be stored in SAM/BAM files, one file per analyzed sample.


4. Analysis

BiQ Analyzer HT starts with a welcome panel that offers several ways to proceed. Brief guidance is given at the bottom of the panel. The "New project" button opens a dialogue for selecting an output directory for the new analysis project. Before selecting it, please verify that the location is writable and that there is enough space on the corresponding storage device. Alternatively, an existing project can be opened by pressing "Open project" and selecting the output directory of that project. The directory should contain the file "biqanalyzerht.xml", which is written to the output directory when the analysis project is saved.

[Figure: Start page]

After the directory has been selected, the BiQ Analyzer HT workspace is initialized. A summary of the newly created project is given in the corresponding tab of the main panel. There are two main options for an overview of the project: a table with basic information for every combination of sample and reference sequence, and a mean methylation heatmap.

[Figure: Empty workspace]

4.1 Preparation of the analysis project and loading data

First, add samples to the project by selecting "Add sample" in the "Analysis" menu.

[Figures: Add new sample; Name the sample]

Once the project has at least one sample, load reference sequences via "File" -> "Load reference sequences". BiQ Analyzer HT requires genomic (not in silico bisulfite-converted!) reference sequences of the sequenced loci, in which the potential methylation sites can be easily detected. The reference should originate from the DNA strand that was actually amplified after the bisulfite conversion. Each loaded reference will be added to each sample in the project.

BiQ Analyzer HT supports two ways of structuring the analysis project: either by samples or by reference sequences. The alternative views can be switched via the "Organize by" item in the "View" menu. Once the project is organized by samples, the sample summary panel is accessible when the focus of the project tree is on a sample node. The sample summary panel contains the subset of rows of the project summary panel that are relevant to the current sample. Similarly, the reference summary panel is accessible when the project is structured by references.
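The requirement for an unconverted reference can be illustrated with a small sketch. It is not part of BiQ Analyzer HT: the function name `find_cpg_sites` is hypothetical, and only the CpG context is considered. The sketch lists CpG positions in a genomic reference and shows that they can no longer be found once every C has been converted to T in silico.

```python
def find_cpg_sites(reference: str) -> list[int]:
    """Return 0-based positions of the C in every CpG dinucleotide."""
    seq = reference.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


genomic = "ACGTTCGA"
# Crude in silico conversion: every C becomes T (illustration only).
converted = genomic.replace("C", "T")

cpg_in_genomic = find_cpg_sites(genomic)      # sites are visible
cpg_in_converted = find_cpg_sites(converted)  # sites are lost
```

With the genomic sequence the two CpG sites are found, while the converted sequence yields none, which is why the tool needs the original genomic sequence.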

[Figure: Loading reference sequences]

Each loaded reference can be assigned to an existing genomic location by specifying the coordinates and the strand of the corresponding genomic region. The respective form is located in the reference summary panel. The genomic location can also be fetched from the FASTA header. For that, the header should contain the location in the form "range=chrN:NNNNNN-NNNNNNN" or "_chrN_NNNNNNN_NNNNNNN_+" (the latter coordinate specification is the default for the Fetch Sequences tool of the Galaxy toolkit).
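To check in advance whether a FASTA header carries a location in one of the two forms above, a sketch along the following lines may help. The regular expressions and the names `GenomicLocation` and `parse_location` are assumptions made for illustration; they do not reproduce the parser used inside BiQ Analyzer HT, and chromosome names that themselves contain underscores are not handled.

```python
import re
from typing import NamedTuple, Optional


class GenomicLocation(NamedTuple):
    chromosome: str
    start: int
    end: int
    strand: Optional[str]  # None when the header form carries no strand


# "range=chrN:NNNNNN-NNNNNNN"
_RANGE_FORM = re.compile(r"range=(chr\w+):(\d+)-(\d+)")
# "_chrN_NNNNNNN_NNNNNNN_+" (Galaxy Fetch Sequences style)
_GALAXY_FORM = re.compile(r"_(chr[^_\s]+)_(\d+)_(\d+)_([+-])")


def parse_location(header: str) -> Optional[GenomicLocation]:
    """Extract a genomic location from a FASTA header, or return None."""
    match = _RANGE_FORM.search(header)
    if match:
        chrom, start, end = match.groups()
        return GenomicLocation(chrom, int(start), int(end), None)
    match = _GALAXY_FORM.search(header)
    if match:
        chrom, start, end, strand = match.groups()
        return GenomicLocation(chrom, int(start), int(end), strand)
    return None
```

A header for which `parse_location` returns `None` would need its coordinates entered manually in the reference summary panel.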

[Figure: Reference summary panel]

Before loading into BiQ Analyzer HT, the bisulfite sequence reads should be prepared: the initial set of sequence reads from the sequencing machine should be split into batches by sample and reference sequence, with one multi-sequence FASTA file for each sample/reference combination. This is done by matching the sample-specific sequence tags and primer sequences in the bisulfite read sequences. (If the sequencing was done on an FLX (Roche 454) system, this can be done with the sff-tools included in the analysis software package.) The FASTA files with reads can be loaded into BiQ Analyzer HT in two ways. To load a single set of reads, select the corresponding leaf in the project tree and choose "Load sequence reads" in the "File" menu. To simplify loading, the "Load reads by filename" option was added. In this case the FASTA files with reads should have the same filenames as the FASTA files of the corresponding reference sequences. "Load reads by filename" should be selected once for each sample.
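Before using "Load reads by filename", it can be useful to verify that the read files really share their names with the reference files. The sketch below is only an illustration under an assumed layout (one directory with reference FASTA files and one directory with the read files of a single sample); the function `pair_by_filename` is hypothetical and not part of the tool.

```python
from pathlib import Path


def pair_by_filename(reference_dir: str, reads_dir: str) -> dict:
    """Map each shared filename to a (reference file, reads file) pair.

    Read files without an identically named reference file are left out.
    """
    references = {p.name: p for p in Path(reference_dir).iterdir() if p.is_file()}
    pairs = {}
    for reads_file in sorted(Path(reads_dir).iterdir()):
        if reads_file.is_file() and reads_file.name in references:
            pairs[reads_file.name] = (references[reads_file.name], reads_file)
    return pairs
```

Any read file missing from the returned mapping has no reference with the same filename and would have to be renamed or loaded individually.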

Because most high-throughput sequencing technologies sequence the submitted DNA fragments in both directions, each loaded read set can contain reads in opposite orientations. The BiQ Analyzer HT alignment algorithm automatically corrects the orientation of each read by aligning both the original read and its reverse complement to the reference sequence and selecting the variant that gives the higher alignment score.
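The orientation check can be sketched as follows. This is a deliberately simplified illustration: `toy_score` is a hypothetical ungapped score that counts matching positions and tolerates a T in the read opposite a C in the reference, whereas BiQ Analyzer HT uses a real alignment with gap penalties and a substitution matrix.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def toy_score(read: str, reference: str) -> int:
    """Count matching positions without gaps (stand-in for a real aligner).

    A T in the read opposite a C in the reference counts as a match,
    because bisulfite treatment converts unmethylated C to T.
    """
    score = 0
    for read_base, ref_base in zip(read.upper(), reference.upper()):
        if read_base == ref_base or (ref_base == "C" and read_base == "T"):
            score += 1
    return score


def orient_read(read: str, reference: str) -> str:
    """Return the read or its reverse complement, whichever scores higher."""
    forward = read.upper()
    reverse = reverse_complement(read)
    if toy_score(reverse, reference) > toy_score(forward, reference):
        return reverse
    return forward
```

For example, with the reference `ACGTTCGAAC`, the read `ATTCGAACAT` scores higher as its reverse complement `ATGTTCGAAT`, so that orientation is kept.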

Alternatively, a set of test data can be loaded either by clicking "Load test data" on the welcome panel or by selecting the corresponding item in the "File" menu. The test dataset includes data for ten amplicons in two test samples and can be downloaded separately as a zip archive.

Finally, the project data can be loaded into BiQ Analyzer HT as a table prepared in the user's favorite spreadsheet editor. The table should be stored in a tab-separated plain text file and have three columns: a column with sample identifiers, a column with full paths to the reference sequence FASTA files, and a column with full paths to the corresponding FASTA files with sequence reads. Thus the number of rows in the table should be at most the number of samples multiplied by the number of references in the project (or the total number of available files with reads). The table should also have a header row, because BiQ Analyzer HT skips the first row in any case.
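Such a table can also be generated by a script instead of a spreadsheet editor. The following sketch writes the three columns described above; the header labels and the function name `write_project_table` are assumptions for illustration, since the text only requires that a header row is present.

```python
import csv


def write_project_table(rows, handle) -> None:
    """Write (sample, reference FASTA path, reads FASTA path) rows as TSV.

    The header labels are placeholders; only the column order matters here.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample", "reference_fasta", "reads_fasta"])
    for sample, reference_path, reads_path in rows:
        writer.writerow([sample, reference_path, reads_path])
```

Calling it with an open text file, e.g. `write_project_table(rows, open("project.tsv", "w", newline=""))`, produces one line per sample/reference combination after the header.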

[Figure: Test project tree]

4.2 Setting up the analysis

After a leaf in the project tree is selected, a tab with a settings form appears in the BiQ Analyzer HT main panel. The settings form is divided into four categories: alignment, quality filtering, sorting and output. Alignment parameters include a gap penalty, a bonus for the correct alignment of CpG sites and a custom substitution matrix. The file that corresponds to the default matrix is available for download. The filtering parameters correspond to alignment and bisulfite quality measures (e.g. alignment score, sequence identity and bisulfite conversion rate), as well as to the extracted methylation information (mean methylation level of the read, fraction of unrecognized methylation sites etc.). The set of reads that pass the filtering can be sorted in a number of ways (e.g. by alignment score, sequence identity or methylation level). The output options include switches for the generation of various output components, color settings for the diagrams etc. Each setting has a tool tip with a detailed explanation. For the quantitative settings the range of acceptable values is also given in the form, next to the label.


The settings can be applied to the selected read set (by clicking the "Apply" button) or to all available read sets ("Apply to all").

[Figure: Apply for selected]

4.3 Running the analysis

The processing and analysis of the loaded data can be run for one selected set of bisulfite reads or for all bisulfite read sets. These options are located in the second section of the "Analysis" menu. Notifications about the current activity of the tool appear in the status pane at the bottom. As soon as the analysis is finished, the main application panel is updated and the results of the analysis are loaded. A running analysis can be stopped at any moment via "Analysis" -> "Terminate".

4.4 Inspecting the results

The BiQ Analyzer HT backend processes the loaded data and outputs DNA methylation information to the project output folder in several forms.

First of all, the results of the analysis are reflected in the project summary. Information about the processed read sets, e.g. the read counts and basic DNA methylation and bisulfite quality statistics, is written to the summary table, and the mean methylation values are used to update the corresponding cells of the project methylation heatmap.


The summary statistics are also available for each analyzed sample (as a summary table) and for each reference sequence (as a heatmap of averaged methylation profiles).

[Figure: Reference heatmap]

For each analyzed sample-reference combination a number of result tabs are added to the main application panel. The "Summary" tab gives brief information about the run, including the mean methylation level calculated for the amplicon and the elapsed analysis time. The "Results" tab gives a table with analysis information for each read considered. The colored alignment of the reads to the reference sequence, with highlighted methylation sites, is given in the "Alignment" pane. The methylation information is also represented as a methylation heatmap, which contains the methylation patterns per read, and as a pearl-necklace diagram, which summarizes the information for each analyzed methylation site. All of the analysis results are also written to the project output directory, which has subdirectories corresponding to the samples and reference sequences. Each of the subdirectories contains the respective set of result files (see the "BiQ Analyzer HT Command Line Interface" section below for more details).


The table in the "Results" tab contains analysis information for each analyzed read that passed the filtering. The columns of the table correspond to the columns of the tab-separated file results.tsv located in the corresponding subdirectory of the project output directory, and include the alignment score, sequence identity, methylation pattern, mean methylation level and other fields. A table containing the methylation information for each analyzed methylation site in a separate column can be exported via "File" -> "Export data table".
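Because results.tsv is a plain tab-separated file, it can be post-processed with a few lines of script. The sketch below averages one numeric column. It assumes that the file starts with a header row, and the column name passed in by the caller is a placeholder: the actual column labels should be taken from the file itself.

```python
import csv


def column_mean(handle, column: str) -> float:
    """Average a numeric column of a tab-separated table with a header row."""
    reader = csv.DictReader(handle, delimiter="\t")
    values = [float(row[column]) for row in reader if row[column] not in ("", None)]
    if not values:
        raise ValueError(f"no numeric values found in column {column!r}")
    return sum(values) / len(values)
```

The function accepts any open text handle, so it works both with a file opened from the project output directory and with an in-memory buffer.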


The methylation heatmap represents the extracted methylation patterns of the bisulfite reads graphically. The columns of the heatmap correspond to the methylation sites found in the reference sequence by matching the analyzed methylation context, while the rows correspond to the sequence reads.


The pearl-necklace diagram summarizes the methylation information for the whole set of filtered reads, site by site. For each identified methylation site the diagram has a colored rectangle showing the distribution of the methylated, unmethylated and unrecognized states of this site in the given set of bisulfite reads. In other words, the diagram gives a "mean" methylation profile of the read population.
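The per-site summary behind such a diagram amounts to counting states column by column. The sketch below illustrates this under an assumed encoding that is not taken from the tool: each read is a string with one character per methylation site, "1" for methylated, "0" for unmethylated and any other character for unrecognized.

```python
def site_distribution(patterns: list[str]) -> list[dict]:
    """Return, per site, the fractions of methylated, unmethylated and
    unrecognized states across all reads (hypothetical pattern encoding)."""
    if not patterns:
        return []
    n_sites = len(patterns[0])
    if any(len(p) != n_sites for p in patterns):
        raise ValueError("all patterns must cover the same number of sites")
    summary = []
    for column in zip(*patterns):  # one column per methylation site
        total = len(column)
        methylated = column.count("1")
        unmethylated = column.count("0")
        summary.append({
            "methylated": methylated / total,
            "unmethylated": unmethylated / total,
            "unrecognized": (total - methylated - unmethylated) / total,
        })
    return summary
```

For four reads with the patterns `10x`, `11x`, `00x` and `10x`, the first site is methylated in three quarters of the reads and the third site is unrecognized in all of them.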


The alignment viewer allows the user to inspect a quasi-multiple alignment of the sequence reads to the reference sequence of the bisulfite-sequenced amplicon, obtained by merging the pairwise alignments. The methylation sites in the alignment are highlighted according to their states. Accelerated scrolling in the viewer is enabled by holding Ctrl while scrolling with the mouse wheel.


All of the above tables and graphics can be exported either to the file system or to the system clipboard by selecting the corresponding item in the context ("right-click") menus. The state of the analysis project can be saved to the hard drive at any time by selecting "File" -> "Save project" in the menu or by pressing the respective button in the toolbar, so that the analysis can be resumed later.


5. BiQ Analyzer HT Command Line Interface

BiQ Analyzer HT features a command line interface. To start BiQ Analyzer HT in command line mode, the executable BiQ_Analyzer.jar should be launched with the "-nogui" argument in the following way:

java -jar [BiQ Analyzer Installation Directory]\BiQ_Analyzer.jar -nogui [OPTIONS]

Note that the java executable should be on the system path. The list of all available options is accessible via "-help". The command line interface follows the POSIX specification. The minimal set of required arguments includes "-rseq" (the genomic reference sequence in a single FASTA file) and "-bseq" (the bisulfite sequence reads, either in one FASTA file or as a directory of FASTA files). The output directory name can be specified with "-outdir"; by default BiQ Analyzer HT creates an output directory named "analysis_run". The output directory contains the following result files:

summary.dat, a short summary of the analysis run
results.tsv, a tab-separated table with the processing and analysis results (one row per analyzed read)
heatmap.png, the methylation heatmap
pearlNecklace.png, the pearl-necklace diagram summarizing the methylation information for each CpG
sourceSequences.mfa, the source FASTA sequences of the reads that passed the quality filters
alignment.mfa, a multi-sequence FASTA file containing the multiple alignment of the bisulfite reads to the genomic reference sequence

The BiQ Analyzer HT command line interface is based upon the new methylation analysis API and offers more options for processing, filtering and analysis of bisulfite sequence reads. Several option groups exist:

Alignment options ("-smat", "-gext") allow modification of the alignment algorithm parameters.
Filtering options allow the user to set maximal and minimal thresholds for quality measures (e.g. "-maxsi", "-minsi").
Sorting options ("-sortmisfrac") allow the user to set a criterion for sorting the output sequences. By default the reads are sorted by methylation level.
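When many read sets have to be processed, the command line call can be assembled by a script. The sketch below builds the argument list from the options named in this section ("-nogui", "-rseq", "-bseq", "-outdir"). The helper `build_command` is hypothetical, and passing each option value as a separate, space-separated argument is an assumption; the output of "-help" should be consulted for the exact syntax.

```python
from typing import Optional


def build_command(jar_path: str, reference_fasta: str, reads: str,
                  outdir: Optional[str] = None) -> list[str]:
    """Assemble the argument list for a command line run (assumed syntax)."""
    command = ["java", "-jar", jar_path, "-nogui",
               "-rseq", reference_fasta,  # genomic reference, single FASTA file
               "-bseq", reads]            # reads: FASTA file or directory
    if outdir is not None:
        command += ["-outdir", outdir]    # otherwise "analysis_run" is used
    return command
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the Python standard library, which avoids quoting problems with installation paths that contain spaces.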


6. Processing Bisulfite Reads with BiQ Analyzer 3.0

BiQ Analyzer 3.0 is an upgrade of the previously released BiQ Analyzer 2.0 (http://biq-analyzer.bioinf.mpi-inf.mpg.de). It is fully compatible with all previous versions of the tool and inherits the BiQ Analyzer 2.0 graphical user interface. BiQ Analyzer 3.0 can be started by selecting the respective shortcut in the BiQ Analyzer HT group of the system menu or by starting the BiQ Analyzer HT jar file with the "-biq3" command line option.

The user is asked to configure BiQ Analyzer HT at the first startup. It is crucial to configure the tool properly before using it. Most of the setup options are self-explanatory and inherited from the previous version. New options have been introduced for the BisAligner Client configuration, including the temporary directory, the substitution matrix file, and the BisAligner Server network address and port. By default the BisAligner Server running on localhost (port 8000) is used for the alignment jobs.

At startup the user is presented with a splash screen for selecting the preferred program mode. The BiQ Analyzer 2.0 mode is identical to the previous version of the tool, while the "HT mode" contains modifications that aid the analysis of large-scale bisulfite amplicon sequencing datasets. The mode can also be switched after startup by selecting an alignment mode. "Remote-NW" alignment corresponds to the HT mode, while the other alignment options belong to the BiQ Analyzer 2.0 mode.

The HT mode features a simplified processing pipeline. The reference sequence and the reads are loaded as before, after which the BisAligner client dialogue allows the user to specify the BisAligner Server that will carry out the alignment job, together with the basic alignment parameters. After the alignment is finished, the user is given a summary with basic information about the alignment, e.g. the distribution of sequence identity to the reference sequence and the average number of discovered methylation sites. Quality thresholds can be set here in order to filter out low-quality reads and reads that do not correspond to the current reference sequence. The "Pileup" view allows the user to take a closer look at the alignment of individual reads. In the pileup window individual reads can be excluded from further consideration. There are also several sorting options that simplify the examination of the pileup.

Once the filtering criteria are set, the alignment should be repeated without the reads that do not meet them. This is done with the "Recalculate" button. Once filtering is finished, the user can generate the analysis report by pressing the "Next" button. The analysis report features a heatmap representation of the processed sequence reads and a convenient structure.


7. Troubleshooting

As of May 2010 the project is in beta and may contain serious bugs. BiQ Analyzer HT may fail at several major points:

BisAligner Server
Several problems may occur when using the BisAligner Server. By default an instance of the Server is started together with the tool whenever the tool is started in HT mode. Firewall settings may prevent the BisAligner Server from operating normally and should therefore be configured to allow connections to port 8000. If the tool has been started in BiQ Analyzer 2.0 mode and the mode is then switched to HT, the server is not started and an attempt to send an alignment job to localhost will fail.
Memory limitations
As the number of sequences grows, the data structures that store the sequence pileup may exceed the available Java heap space. If the number of sequences reaches the order of 10,000 or more, the user may want to increase the initial and maximum Java heap size. This is done manually by editing the .bat file (Windows) or the shell script (Unix-like systems) that launches BiQ Analyzer; the script is located in the BiQ Analyzer installation directory (usually C:/Program Files/BiQ Analyzer HT/). The script essentially contains the following line:
java -jar -Xms100m -Xmx500m "BiQ_Analyzer.jar"

The available heap space can be extended by increasing the numbers after the -Xms and -Xmx command line options, which specify the initial and maximum size of the Java heap space (in megabytes), respectively.
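The edit of the launch line can be sketched as a small text substitution. The helper `set_heap_size` below is hypothetical and only rewrites the two options in a string; it does not locate or modify the launcher script itself, and the chosen values must fit into the memory actually available on the machine.

```python
import re


def set_heap_size(launch_line: str, initial_mb: int, maximum_mb: int) -> str:
    """Replace the -Xms/-Xmx values in a java launch line (sizes in MB)."""
    if initial_mb > maximum_mb:
        raise ValueError("initial heap size must not exceed the maximum")
    line = re.sub(r"-Xms\d+[kKmMgG]?", f"-Xms{initial_mb}m", launch_line)
    line = re.sub(r"-Xmx\d+[kKmMgG]?", f"-Xmx{maximum_mb}m", line)
    return line
```

Applied to the line shown above with 256 and 2048, the function returns the same line with `-Xms256m -Xmx2048m`.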

In case of exceptions and other unexpected behavior do not hesitate to contact us.