Haplogroup R Call Matrix

The linked files contain a matrix of the variants discovered in all processed BAM files, provided as a temporary solution so that individuals can compare their results. The files are zipped comma-separated text and contain calls at more than 125,000 variant locations on GRCh38. The analysis does not include vendor-supplied reports, because the VCF and variant compare reports remain ambiguous.
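
Because the files are zipped comma-separated text, they can be read without unpacking them first. The following is a minimal sketch using only the Python standard library. The function name `iter_matrix_rows` is hypothetical, and the UTF-8 encoding and the choice of the first archive member are assumptions, not documented properties of the files.

```python
import csv
import io
import zipfile


def iter_matrix_rows(zip_source, member=None, encoding="utf-8"):
    """Yield each CSV row (a list of strings) from a zipped matrix file.

    zip_source may be a path or a binary file-like object.
    Assumption: the archive holds a single CSV; if `member` is not given,
    the first entry in the archive is used.
    """
    with zipfile.ZipFile(zip_source) as archive:
        name = member or archive.namelist()[0]
        with archive.open(name) as raw:
            # newline="" lets the csv module handle line endings itself.
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            yield from csv.reader(text)
```

Iterating row by row keeps memory use low, which may matter for a matrix with more than 125,000 variant rows and one column per kit.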

A rough explanation of the file column layout:

  1. Aliases: The current name(s) associated with the chromosome Y coordinates.
  2. Contig: Either chrY or chrY_KI270740v1_random. The latter is a region of DYZ19 whose exact location within chrY is unknown.
  3. pos (GRCh38): The starting position of the variant on the GRCh38 reference.
  4. (blank): This column can be filtered for "Fail" or "->". A "Fail" value indicates that the coordinate cannot be mapped back to GRCh37.
  5. Contig: The calculated hg19 chromosome for the variant.
  6. pos (GRCh37): The starting position of the variant on the GRCh37 reference.
  7. ref: This is usually the reference allele, but may contain the ancestral value defined in ybrowse.org. This allows variants like P312 to call correctly.
  8. alt: This is usually the alternate allele defining the mutation. (See ref for explanation.)
  9. type: SNP, INDEL or MIXED. A MIXED value indicates that multiple mutations were detected at the site in the set, but only the most common is reported.
  10. combBED: Indicates whether the site is included in the combBED ranges defined in Adamov et al., "Defining a New Rate Constant for Y-Chromosome SNPs based on Full Sequencing Data". Markers in these regions are more likely to be correctly aligned and can be used to estimate branch ages.
    • LowQual: Phred-scaled quality score between 30 and 50
    • VeryLowQual: Phred-scaled quality score <= 30
    • REJECTED: Quality score < 10
    • LowQD: Variant Confidence / Quality by Depth is < 1.5
    • LowCoverage: Depth < 5
    • SnpCluster: Multiple SNPs found within 10 bases.
    • When a variant falls inside a known STR, palindromic arm or DYZ19, that is also noted here.
  12. Kit information includes the short haplogroup designation on the experimental tree, the kit identifier, the surname (if known), and the number of reads at the site of a variant. Zero indicates that the site was untested in the source BAM. A positive number is the number of alt alleles read. A negative number is the number of ref alleles read. When a diploid read is present due to poor alignment or testing errors, the count for each allele is included, e.g. T-7/C-1.
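
The read-count convention in the kit columns can be sketched in code. This is an illustrative parser, not the tool that produced the matrix. The names `KitCall` and `parse_kit_cell` are hypothetical, and the reading of a mixed cell such as T-7/C-1 as allele-count pairs separated by "/" is an assumption based on the single example above.

```python
import re
from typing import NamedTuple


class KitCall(NamedTuple):
    status: str   # "untested", "alt", "ref" or "mixed"
    counts: dict  # read counts keyed by "alt"/"ref", or by allele when mixed


# Assumed shape of one part of a mixed call: allele, hyphen, read count.
_MIXED_PART = re.compile(r"^([A-Za-z]+)-(\d+)$")


def parse_kit_cell(cell: str) -> KitCall:
    """Interpret one kit cell using the sign convention described above."""
    text = cell.strip()
    if "/" in text:
        counts = {}
        for part in text.split("/"):
            match = _MIXED_PART.match(part.strip())
            if match is None:
                raise ValueError(f"unrecognised mixed call: {cell!r}")
            counts[match.group(1)] = int(match.group(2))
        return KitCall("mixed", counts)
    reads = int(text)  # raises ValueError for anything that is not an integer
    if reads == 0:
        return KitCall("untested", {})
    if reads > 0:
        return KitCall("alt", {"alt": reads})
    return KitCall("ref", {"ref": -reads})
```

For example, `parse_kit_cell("-4")` reports four ref reads, while `parse_kit_cell("T-7/C-1")` reports a mixed call with seven T reads and one C read.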

The last four columns of the matrix files are support and summary information. The 'sort' column is used to group the variants currently on the experimental tree together. The 'positive' column conveys how many of the samples are positive for the variant. The 'negative' column conveys how many of the samples are negative for the variant. 'Ambiguous' totals the number of kits where a variant may be present, but the interpretation is not well defined due to alignment problems.
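
As a rough illustration of how the summary columns relate to the per-kit cells, the sketch below tallies one row of kit values. The function name `summarise_calls` is hypothetical. Counting every mixed cell (one containing "/") as ambiguous is an assumption; the actual rule used to fill the 'Ambiguous' column is not documented here.

```python
def summarise_calls(cells):
    """Tally positive, negative, ambiguous and untested kits for one variant.

    cells: iterable of kit cell strings, e.g. "5", "-3", "0" or "T-7/C-1".
    """
    totals = {"positive": 0, "negative": 0, "ambiguous": 0, "untested": 0}
    for cell in cells:
        text = cell.strip()
        if "/" in text:
            # Assumption: mixed allele counts are treated as ambiguous.
            totals["ambiguous"] += 1
            continue
        reads = int(text)
        if reads > 0:
            totals["positive"] += 1   # alt alleles read
        elif reads < 0:
            totals["negative"] += 1   # ref alleles read
        else:
            totals["untested"] += 1   # site not covered in the source BAM
    return totals
```

Recomputing these totals from the kit columns is one way to sanity-check a row after filtering the matrix down to a subset of kits.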

Looking for your Private Variants?

The Experimental Tree now provides a report of private mutations. Search for the kit# of interest and click the button. This should allow much more efficient retrieval of your information. See the FAQ for more details.

Available data sets:

13 Million Workbook (last updated 2019-06-18)

This Excel workbook contains a sample report from all tests with more than 13,000,000 callable bases. These samples approach the limits of what can be sequenced and correctly aligned with today's technology.

The document is intended as a health check of the underlying tree structure. It will be updated periodically as new samples qualify, the ancestral and derived states of variants are reassigned, and as the tree continues to evolve.