Cross-Tabulation tool

for Galaxy

Here are screen-shots of the tool's pages in Galaxy.

Simple version


What it does

This tool calculates a cross-tabulation for two variables in a input dataset.

Cross-Tabulation is commonly known as Data Pilot in OpenOffice Calc, and as Pivot Tables in Microsoft Excel.

For general information about cross-tabulations, see http://en.wikipedia.org/wiki/Cross_tabulation and http://en.wikipedia.org/wiki/Pivot_table

For advanced cross-tabulation options (e.g. more than two variables), see the hierarchical and the advanced versions of the cross-tabulation tool (available under the same category).


Example 1

Given the following (made up) list of intervals (columns are: chrom,start,end,strand,score):

chr2L       5109145 5158264 +       23
chr2L       5151748 5186162 +       84
chr2L       5109145 5205690 +       58
chr2L       2250532 2251270 -       19
chr2L       2156483 2178749 -       81
chr3L       6896829 6906661 +       17
chr2L       6686818 6709080 -       48
chr3L       6674636 6676635 -       15
chr2L       6663964 6674790 +       36
chr3LHet    6527448 6546972 -       49
chrU        60709   60739   +       91
...
...
...

We can use cross-tabulation to see a contingency table of chromosome vs. strand:

  • The strand information (column 4) will be listed as COLUMNS.
  • The chromsomes (column 1) will be listed in the ROWS.
  • We want to COUNT the occurences of each chrom/strand pair (when counting, it doesn't matter which column is used as the DATA column).
cshl_xtab_simple_count_intervals.png

We can use cross-tabulation to find the mean score for each chrom/strand pair:

  • The strand information (column 4) will be listed as COLUMNS.
  • The chromsomes (column 1) will be listed in the ROWS.
  • We want to calculate the MEAN score of each chrom/strand pair - the actual score value is found in column 5.
cshl_xtab_simple_mean_intervals.png

Example 2

With the Repeat Masker track for D. Melanogaster 3 (downloaded from UCSC's Table Browser):


585 1620    0       0       3       chrXHet 0       660     -203452 +       (TAGA)n Simple_repeat   Simple_repeat   1       657     0       1
585 211     259     12      24      chrXHet 1985    2104    -202008 +       A-rich  Low_complexity  Low_complexity  1       117     0       1
585 787     66      241     11      chrXHet 2860    3009    -201103 -       DNAREP1_DM      LINE    Penelope        0       594     435     1
585 1383    78      220     0       chrXHet 3012    3320    -200792 -       DNAREP1_DM      LINE    Penelope        -217    377     2       1
585 244     103     0       0       chrXHet 3737    3776    -200336 -       DNAREP1_DM      LINE    Penelope        -555    39      1       1
585 48      60      0       0       chrXHet 6431    6514    -197598 +       AT_rich Low_complexity  Low_complexity  1       83      0       1
585 29      47      0       0       chrXHet 6561    6604    -197508 +       AT_rich Low_complexity  Low_complexity  1       43      0       1
585 26      30      0       0       chrXHet 7625    7658    -196454 +       AT_rich Low_complexity  Low_complexity  1       33      0       1
585 2270    83      144     0       chrXHet 7907    8426    -195686 +       DNAREP1_DM      LINE    Penelope        1       594     0       1

We can use cross-tabulation to count how many families are in each chromosome:

  • The chromosomes (column 6) will be listed as COLUMNS.
  • The families (column 13) will be listed in the ROWS.
  • We want to COUNT the occurences of each chrom/class pair (when counting, it doesn't matter which column is used as the DATA column).
cshl_xtab_simple_count_chrom_vs_families.png

We can use cross-tabulation to see distribution of classes in the positive/negative strands:

  • The strand infromation (column 10) will be listed as COLUMNS.
  • The classes (column 12) will be listed as ROWS.
  • We want to COUNT the occurences of each family/strand pair (when counting, it doesn't matter which column is used as the DATA column).
cshl_xtab_simple_count_class_vs_strand.png

Hierarchical version


What it does

This tool calculates a cross-tabulation for two or more variables in a input dataset.

Cross-Tabulation is commonly known as Data Pilot in OpenOffice Calc, and as Pivot Tables in Microsoft Excel.

Please see the simple version of the cross-tabulation tool for basic usage and examples.


The following examples use the Repeat Masker track for D. Melanogaster 3 (downloaded from UCSC's Table Browser) as the input dataset:

585 1620    0       0       3       chrXHet 0       660     -203452 +       (TAGA)n Simple_repeat   Simple_repeat   1       657     0       1
585 211     259     12      24      chrXHet 1985    2104    -202008 +       A-rich  Low_complexity  Low_complexity  1       117     0       1
585 787     66      241     11      chrXHet 2860    3009    -201103 -       DNAREP1_DM      LINE    Penelope        0       594     435     1
585 1383    78      220     0       chrXHet 3012    3320    -200792 -       DNAREP1_DM      LINE    Penelope        -217    377     2       1
585 244     103     0       0       chrXHet 3737    3776    -200336 -       DNAREP1_DM      LINE    Penelope        -555    39      1       1
585 48      60      0       0       chrXHet 6431    6514    -197598 +       AT_rich Low_complexity  Low_complexity  1       83      0       1
585 29      47      0       0       chrXHet 6561    6604    -197508 +       AT_rich Low_complexity  Low_complexity  1       43      0       1
585 26      30      0       0       chrXHet 7625    7658    -196454 +       AT_rich Low_complexity  Low_complexity  1       33      0       1
585 2270    83      144     0       chrXHet 7907    8426    -195686 +       DNAREP1_DM      LINE    Penelope        1       594     0       1
...
...

Example 1

We can use hierarcial cross-tabulation to count how many families and classes are in each chromosome:

  • Classes and Families represent a hierarchy: Families are sub-categories of Classes.
  • The classes (column 12) will be listed as the first field for the ROWS.
  • The families (column 13) will be listed as the second field for the ROWS.
  • The chromosomes (column 6) will be listed as COLUMNS.
  • We want to COUNT the occurences of each chrom/class pair (when counting, it doesn't matter which column is used as the DATA column).
cshl_xtab_multilevel_chrom_vs_classes_family.png

Example 2

We can use hierarcial cross-tabulation to count how many classes are in each chromosome and strand:

  • chromosomes and strands represent a hierarchy: each chromosome has two sub-categories: positive and negative strand.
  • The chromosomes (column 6) will be listed as the first level COLUMNS.
  • The strand (column 10) will be listed as the second level COLUMNS.
  • The classes (column 12) will be listed as the ROWS.
  • We want to COUNT the occurences of each chrom-strand/class pair (when counting, it doesn't matter which column is used as the DATA column).
cshl_xtab_multilevel_chrom_strand_vs_classes.png

Example 3

We can combine the previous two examples into one, showing classes/families vs chrom/strands:

  • The chromosomes (column 6) will be listed as the first level COLUMNS.
  • The strand (column 10) will be listed as the second level COLUMNS.
  • The classes (column 12) will be listed as the ROWS.
  • The families (column 13) will be listed as the second field for the ROWS.
  • We want to COUNT the occurences of each chrom-strand/class pair (when counting, it doesn't matter which column is used as the DATA column).
cshl_xtab_multilevel_chrom_strand_vs_class_family.png

Advanced version


What it does

This tool calculates a cross-tabulation for two or more variables in a input dataset.

Cross-Tabulation is commonly known as Data Pilot in OpenOffice Calc, and as Pivot Tables in Microsoft Excel.

This tool allows finer control over the cross-tabulation of variables in the input dataset. Be sure to read the help section of the simple and hierarcial versions of the cross-tabulation tool, before using the advanced version.


Expression syntax for ROWS and COLUMNS

  • $N (dollar sign follwoed by a number) - fields in the input dataset (e.g. $6 = values from the sixth field in the input data set).
  • + - * / % ( ) ^ (basic arithmatic operators) - act as expected.
  • round(X), floor(X), ceil(X), frac(X) - returns the rounded, floored, ceil'ed and fraction of X.
  • , (comma) - separate levels of an hierarcial expression.

Expression syntax for DATA

The DATA expression must begin with an aggregation function, one of:

sum(X), count(X), average(X), min(X), max(X)

Where X is a valid expression (as detailed above).

Hierarcies are not supported in DATA expressions (so commas are not allowed).


Basic Expression Example 1

Cross-Tabulate column 10 vs. column 6, and sum the values from column 5:

This is equivalent to using the simple version of the cross-tabulation tool.

  • COLUMN expression: $10
  • ROW expression: $6
  • DATA expression: sum($5)

Basic Expression Example 2

Hierarcial Cross-Tabulation: column 10 and column 11 vs. column 6, and show the maximum value of each triplet from column 5: This is equivalent to using the hierarcial version of the cross-tabulation tool.

  • COLUMN expression: $10,$11
  • ROW expression: $6
  • DATA expression: max($5)

Advanced Expression Example 1 (using the dm3 repeat masker track):


585 1620    0       0       3       chrXHet 0       660     -203452 +       (TAGA)n Simple_repeat   Simple_repeat   1       657     0       1
585 211     259     12      24      chrXHet 1985    2104    -202008 +       A-rich  Low_complexity  Low_complexity  1       117     0       1
585 787     66      241     11      chrXHet 2860    3009    -201103 -       DNAREP1_DM      LINE    Penelope        0       594     435     1
585 1383    78      220     0       chrXHet 3012    3320    -200792 -       DNAREP1_DM      LINE    Penelope        -217    377     2       1
585 244     103     0       0       chrXHet 3737    3776    -200336 -       DNAREP1_DM      LINE    Penelope        -555    39      1       1
585 48      60      0       0       chrXHet 6431    6514    -197598 +       AT_rich Low_complexity  Low_complexity  1       83      0       1
585 29      47      0       0       chrXHet 6561    6604    -197508 +       AT_rich Low_complexity  Low_complexity  1       43      0       1
585 26      30      0       0       chrXHet 7625    7658    -196454 +       AT_rich Low_complexity  Low_complexity  1       33      0       1
585 2270    83      144     0       chrXHet 7907    8426    -195686 +       DNAREP1_DM      LINE    Penelope        1       594     0       1
...
...

What is the mean length of each class/family in each chromosome ?

  • COLUMN expression: $6 (the chromosome's column)
  • ROW expression: $12,$13 (hierarcial expression, with class and family columns)
  • DATA expression: average($8-$7) (arithmatic expression: average of the end coordinate [column 8] minus the start coordinate [column 7] = the length of the feature).
cshl_xtab_advanced_average_length.png

Advanced Example 2

Assume the following input dataset, representing intervals of small RNA sequences:

chr2L        1541319 1541354 -
chr2RHet     912306  912341  +
chrX         311331  8311366 +
chr3R        4825647 4825682 +
chr3R        1020972 1020994 +
chrUextra    7086730 7086751 +
chr2R        943547  943567  -
chr2R        943547  943566  -
chr2R        943543  943566  -
...
...

What is the length distribution of sequences for each chromsome and strand ?

  • COLUMN expression: $1,$4 (hierarcial expression, chromosome and strand)
  • ROW expression: $3-$2 (end coordinate - start cordinate = length)
  • DATA expression: count($1) (when counting, it doesn't matter which column is used).
cshl_xtab_advanced_length_distribution_count.png

Advanced Example 3

Similar intervals as in the previous example, but with an additional column: the number of readsmapped to this location (e.g. the first interval represents not one, but 502 sequences in the small RNA library):

chr2L        1541319 1541354 -       502
chr2RHet     912306  912341  +       11
chrX         311331  8311366 +       103
chr3R        4825647 4825682 +       402
chr3R        1020972 1020994 +       1034
chrUextra    7086730 7086751 +       95
chr2R        943547  943567  -       322
chr2R        943547  943566  -       8
chr2R        943543  943566  -       9
...
...

What is the length distribution of sequences for each chromsome and strand ?

  • COLUMN expression: $1,$4 (hierarcial expression, chromosome and strand)
  • ROW expression: $3-$2 (end coordinate - start cordinate = length)
  • DATA expression: sum($5) (sum the reads count from column 5)
cshl_xtab_advanced_length_distribution_sum.png