Friday, 11 February 2011

Maths in Paleontology (I): Data

''In every special doctrine of nature only so much science proper can be found as there is mathematics in it.'' - Immanuel Kant, Metaphysical Foundations of Natural Science (1786)

As a warning, the maths professor who had the thankless task of teaching us first-semester scientists-to-be the very basics of his field chose Kant's statement to open his first lecture on "higher" maths. However, when I started my studies in geology and paleontology, there was another saying among old-school geology teachers: "A bad mathematician makes a good geologist."

Many of my fellow students were more willing to believe the latter saying than the inconvenient alternative. (I always considered this belief outdated, and I got the feeling that geology as a science might have been shaped not only by the talents of its protagonists but also by their limitations in terms of exactness and rigour.)

Luckily, you were not necessarily considered a bad geologist if you were interested in maths, and the notion that modern geoscience involves maths and exact methods (e.g. quantitative data analysis, databases, multivariate statistics and geostatistics, geoinformatics and geographic information systems, 3D and 4D modelling, remote sensing) was clearly on the rise. Perhaps from a biologist's point of view this story would be different but, to tell you the truth, some of the biology-based paleontologists I got to know do not exactly live on the exact side either.

Apart from microscopy seminars and field and lab practicals, which teach you ways of data acquisition, some classes in statistics and data analysis during the first semesters give you an idea of how data are structured, how to sample, and how to deal with data in order to find new knowledge, e.g. a relationship between two phenomena not previously considered to be related.

At the very beginning you will learn that different types of data are used in paleontology and that you have to bring your data into shape for any kind of mathematical analysis tool, i.e. arrange them as a data table such as the following:

Specimen | Class | State of XYZ | No. of UVW | size L [mm] | size M [cm²]

Normally the rows of the table represent samples (or groups of samples or taxa), whereas the columns represent various features or measures. Such a feature may be membership of a certain class or category, or the presence, absence, or particular state of a character. Measured values may have a discrete distribution (e.g. natural numbers such as the number of teeth, segments, or body chambers) or a continuous distribution (e.g. length, area, angle, or temperature measurements).

Many kinds of data relevant to paleontologists can be arranged as tables, such as morphological and microstructural data, stable isotope and other geochemical data, geographical, sedimentological, and stratigraphic data, as well as taphonomic and paleoecological data. Some of these data have a special structure and can be assigned to one of the following types:

Compositional data...

... add up to 100%. Chemical compositions of fossils or faunal compositions are compositional data:

A [%] | 23 | 42 | 17 | 5 | 13

These data require careful consideration and a special kind of maths: because the parts must sum to 100%, all variables are necessarily correlated, and thus a presumed dependence, e.g. between brachiopod and echinoderm abundances, can be obscured (or mimicked) by variation in another group.
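One widely used way of handling this constraint is a log-ratio transform. The following is only a minimal sketch, assuming the centred log-ratio (clr) transform is an acceptable choice for your data; it uses the five percentages of sample A, read here as 23, 42, 17, 5 and 13, and the function name `clr` is my own.

```python
import math

def clr(parts):
    """Centred log-ratio transform of one composition (all parts must be > 0)."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]

# Percentages of sample A (assumed reading of the table row); they sum to 100.
sample_a = [23, 42, 17, 5, 13]
clr_a = clr(sample_a)

# The transformed values are centred on zero instead of being tied to a fixed sum of 100.
print([round(v, 3) for v in clr_a])
```

Note that the transform cannot deal with parts that are exactly zero; these need separate treatment before taking logarithms.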

Spatially or temporally correlated data

‘Spatial correlation’ means that values for data points close to each other are more similar than values for more distant data points: the faunal composition of an ecosystem in Arizona, for example, is more like that of a Nevada community than that of a Massachusetts community.

Locality | Easting (X) | Northing (Y) | Facies | Archosaurs [%] | Rhynchosaurs [%]

Geostatistics is the usual way to deal with spatially correlated data. Spatial correlation can also occur on much smaller scales, e.g. the shape and size of two skull bones in contact with each other can show a stronger dependence than the shape and size of bones that are further apart.
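A basic geostatistical tool for checking this is the empirical semivariance: half the mean squared difference between pairs of values a given distance apart. The sketch below assumes six hypothetical localities with made-up coordinates (in km) and archosaur percentages; the function name `semivariance` and the two distance classes are my own choices for illustration.

```python
import math
from itertools import combinations

# Hypothetical localities: (easting_km, northing_km, archosaurs_percent).
localities = [
    (0.0, 0.0, 60.0),
    (1.0, 0.0, 58.0),
    (0.0, 1.0, 61.0),
    (10.0, 10.0, 30.0),
    (11.0, 10.0, 33.0),
    (10.0, 11.0, 29.0),
]

def semivariance(points, min_dist, max_dist):
    """Half the mean squared difference over pairs with min_dist <= distance < max_dist."""
    squared_diffs = []
    for (x1, y1, z1), (x2, y2, z2) in combinations(points, 2):
        d = math.hypot(x2 - x1, y2 - y1)
        if min_dist <= d < max_dist:
            squared_diffs.append((z1 - z2) ** 2)
    if not squared_diffs:
        return None  # no pairs fall into this distance class
    return sum(squared_diffs) / (2 * len(squared_diffs))

near = semivariance(localities, 0.0, 5.0)   # pairs less than 5 km apart
far = semivariance(localities, 5.0, 50.0)   # pairs 5 to 50 km apart
print(near, far)
```

If the data are spatially correlated, the semivariance of nearby pairs is clearly smaller than that of distant pairs, as it is for these invented numbers.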

In paleontology temporal correlation is quite common, especially if your study considers different stratigraphic ages or sedimentological field data:

Population | Horizon | Ar/Ar age [Ma] | Facies | δ18O [‰] | Average size [mm]
A | 1 | 210 ± 1 | deltaic | -2.0 | 5.2
B | 2a | N/A | distal shelf | 1.4 | 6.4
C | 2c | 207 ± 2 | ? | 2.1 | 6.8
D | 4 | 200 ± 1 | deltaic | -2.2 | 6.0

As in stock market analysis, methods of time series analysis can be applied to interpret temporally correlated data (i.e. time series). Such data may be relevant for your study because they often indicate evolutionary trends (biological evolution in the stricter sense, but also the evolution of paleoenvironments) or cyclic processes with a certain periodicity, and they can form the basis for relating contemporaneous processes in the geological past (e.g. stratigraphic correlation of separate sedimentary successions).
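A first step in time series analysis is often to check how strongly each value depends on the previous one. Here is a minimal sketch of the sample autocorrelation, assuming an evenly spaced series; the size values are invented for illustration and the function name `autocorrelation` is my own.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of an evenly spaced series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance_sum = sum((x - mean) ** 2 for x in series)
    covariance_sum = sum(
        (series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag)
    )
    return covariance_sum / variance_sum

# Hypothetical average shell sizes [mm] from successive, evenly spaced horizons.
sizes = [5.2, 5.5, 5.9, 6.4, 6.6, 6.8, 6.5, 6.0]

r1 = autocorrelation(sizes, 1)  # dependence on the directly preceding horizon
print(round(r1, 2))
```

A clearly positive value at lag 1 means that neighbouring horizons resemble each other, i.e. the observations are not independent samples. Real stratigraphic data are rarely evenly spaced in time, so this simple form is only a starting point.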

Orientation data

For elongated fossils such as conical shells or long bones, the orientation of the fossil's long axis relative to the geographical coordinate system can be measured using a compass (with inclinometer). The orientation of bedding planes can be documented in a similar way. Such measurements are often used to deduce the former transport direction of an ancient sediment transport and deposition system (such as a river, delta, or alluvial fan). A data table with orientation data may look like this:

Specimen No. | Description | Length [cm] | Horizon | Azimuth | Dip
1 | long bone | 21 | 1 | N 20° E |
2 | rib | 12 | 1 | N 10° W |
3 | calamite stem | 80 | 2a | N 15° E |

“Azimuth” refers to the horizontal angle measured from north. Orientation data are distributed on a hemisphere. Mean values (e.g. the average orientation of long bones) and other distribution parameters cannot be derived by simply averaging the orientation angles; vector arithmetic has to be applied instead.
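To see why plain averaging fails, take the three azimuths from the table, N 20° E, N 10° W and N 15° E, written as 20°, 350° and 15° clockwise from north. The sketch below assumes that only the azimuth (not the dip) is of interest and that each measurement has a definite direction; the function name `mean_azimuth` is my own.

```python
import math

def mean_azimuth(azimuths_deg):
    """Mean direction of azimuths (degrees clockwise from north) via unit vectors."""
    east = sum(math.sin(math.radians(a)) for a in azimuths_deg)
    north = sum(math.cos(math.radians(a)) for a in azimuths_deg)
    return math.degrees(math.atan2(east, north)) % 360.0

# N 20° E, N 10° W and N 15° E expressed as 0-360° azimuths.
azimuths = [20.0, 350.0, 15.0]

naive_mean = sum(azimuths) / len(azimuths)  # plain arithmetic mean of the angles
vector_mean = mean_azimuth(azimuths)        # direction of the summed unit vectors
print(round(naive_mean, 1), round(vector_mean, 1))
```

The arithmetic mean comes out at roughly 128°, i.e. pointing southeast although all three fossils point roughly north, whereas the vector mean is about 8°. For long axes without a distinguishable "front" end (axial data), the angles would additionally have to be doubled before averaging and halved afterwards.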

Cladistic data

Morphology-based phylogeny conventionally involves cladistic methods, especially in vertebrate paleontology, which deals with a particularly character-rich group that is deemed suitable for cladistic approaches. These employ specialized software for the calculation of phylogenetic trees (e.g. PAUP, WinClada).

In cladistic datasets rows represent taxa, mostly species or genera of the group of interest, and columns represent characters (ordered by number), i.e. features of the skeleton which vary among the included taxa.


One of the main issues in cladistics is the definition of characters and the correct (unbiased) coding of morphological information. You can include qualitative differences ("bone X contacts bone Y but not bone Z" = character state “0”; "bone X contacts bones Y and Z" = character state “1”) and quantitative differences ("length of metatarsal 3 larger than or as large as length of metatarsal 4" = character state "0"; "mt3 is shorter than mt4" = "1"). Sometimes mixed character states like "0 or 1 [but not 2]" occur in a taxon and are coded accordingly.
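The coding step can be sketched in a few lines. This is only an illustration under my own assumptions: the taxa and metatarsal lengths are invented, the function name `code_metatarsal` is hypothetical, and a question mark is used for an entry that cannot be coded.

```python
def code_metatarsal(mt3_length, mt4_length):
    """Code the quantitative character: '0' if mt3 >= mt4, '1' if mt3 is shorter."""
    return "0" if mt3_length >= mt4_length else "1"

# Hypothetical taxa with made-up metatarsal lengths [mm]; None marks an unpreserved bone.
measurements = {
    "Taxon A": (52.0, 50.0),
    "Taxon B": (41.0, 47.0),
    "Taxon C": (None, 39.0),
}

matrix = {}
for taxon, (mt3, mt4) in measurements.items():
    if mt3 is None or mt4 is None:
        matrix[taxon] = "?"  # character cannot be coded for this taxon
    else:
        matrix[taxon] = code_metatarsal(mt3, mt4)

# A mixed state such as "0 or 1 [but not 2]" can be stored as a set of states.
polymorphic_entry = {"0", "1"}

print(matrix)
```

Writing the rule down as a function makes the coding decision explicit and repeatable, which helps with the bias problem mentioned above.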

Missing data...

...occur all the time in paleontology, either because specimens are not complete enough, because their geological age cannot be determined exactly, because specimens are too rare or valuable to be subjected to a destructive analysis method, or because they are for some reason no longer accessible. "N/A" ("not applicable"), empty entries, or question marks often symbolize missing data.
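Before any calculation, such markers have to be translated into something the software recognizes as "no value". A minimal sketch, assuming the Ar/Ar ages of populations A to D as I read them from the table further up (210 ± 1, N/A, 207 ± 2, 200 ± 1); the names `MISSING_MARKERS` and `parse_age` are my own.

```python
MISSING_MARKERS = {"", "N/A", "?"}

def parse_age(entry):
    """Return (age, uncertainty) in Ma from text like '210 ± 1', or None if missing."""
    text = entry.strip()
    if text in MISSING_MARKERS:
        return None
    age, uncertainty = text.split("±")
    return float(age), float(uncertainty)

# Ar/Ar age entries of populations A to D (assumed reading of the table).
raw_ages = ["210 ± 1", "N/A", "207 ± 2", "200 ± 1"]
ages = [parse_age(entry) for entry in raw_ages]

# Statistics are computed from the available values only.
known = [a[0] for a in ages if a is not None]
mean_age = sum(known) / len(known)
print(ages, round(mean_age, 1))
```

Simply skipping missing entries is the crudest option; whether that is acceptable depends on why the values are missing in the first place.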

Some introductory literature:

Borradaile, G. J. 2003. Statistics of Earth Science Data. Springer, Berlin, 280 pages. ISBN 3540436030

Swan, A. R. H. and M. Sandilands. 1995. Introduction to geological data analysis. Blackwell, Oxford, 446 pages. ISBN 0632032243