Open PhD student position in Digital Humanities, Helsinki/Finland

We are looking for an independent and highly motivated PhD candidate to join our Digital Humanities research team at the University of Helsinki, Finland, in fall 2016. The position provides a unique opportunity to gain widely applicable research skills and to do pioneering work in this rapidly emerging research area, which takes advantage of massive data collections and modern data analytics to study classical questions in the humanities. The position is open for applications now and can be filled as soon as a suitable candidate is found. For more info, see here.


Shakespeare! Cervantes! 400! A statistical analysis of the early modern printing of the Bard and Don Quixote

We are a team of an early modern intellectual historian (Mikko Tolonen) and a computational scientist and bioinformatician (Leo Lahti). Often on Friday nights we skype to discuss digital humanities, a field that nicely combines our mutual interests in science and philosophy. Today’s task was to do something to celebrate the 400th anniversary of Shakespeare & Cervantes.

We decided to build on our peculiar interest in library catalogues by focusing on the CERL Heritage of the Printed Book Catalogue (HPB) and the British Library's English Short-Title Catalogue (ESTC). We then carried out a brief but revealing analysis of the early modern publishing (1593–1800) of Shakespeare and Cervantes. Below are some graphs that we wanted to share, as you might find them interesting too.

In the ESTC and CERL catalogues, we have metadata on roughly 0.5 and 5 million documents, respectively, from 1470 to 1800. With a combination of automated and manual data processing, we identified 1271 documents for Shakespeare (ESTC 1042; CERL 229) and 488 documents for Cervantes (ESTC 94; CERL 394). All illustrations are based on the combined data from these two catalogues unless otherwise mentioned.

Relative publishing activity: Shakespeare

One thing that we have learned about authors' lives when analysing publishing activity is that printing usually ends (more rapidly than you would think) when the author kicks the bucket. That is, death is the end of popularity. Obviously, this is not the case for Shakespeare. But do note that the new rise in publishing Shakespeare (based on ESTC data) begins in the 1730s with the input of the famous Tonson publishing house (see also the Shakespeare publisher timeline below). The first graph illustrates the fraction of titles by Shakespeare relative to all other publishing activity in the ESTC catalogue.
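The decade-level fraction behind this kind of graph is simple to compute. Here is a minimal Python sketch; the original analysis used our R workflow, and the function name and toy data below are invented for illustration only:

```python
from collections import Counter

def relative_activity(author_years, all_years):
    """Fraction of an author's titles relative to all catalogued titles,
    grouped per decade. A toy sketch, not the actual ESTC workflow."""
    decade = lambda y: (y // 10) * 10
    author = Counter(decade(y) for y in author_years)  # author titles per decade
    total = Counter(decade(y) for y in all_years)      # all titles per decade
    return {d: author[d] / total[d] for d in sorted(total)}

# Hypothetical toy data: each year 1730-1749 appears twice in the catalogue,
# with two Shakespeare titles in the 1730s and one in the 1740s.
shakespeare = [1733, 1734, 1741]
estc_all = [1730 + i % 20 for i in range(40)]
print(relative_activity(shakespeare, estc_all))  # {1730: 0.1, 1740: 0.05}
```

The same per-decade normalization is what makes publishing activity comparable across periods with very different total printing volumes.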


Shakespeare play categories

We classified Shakespeare’s plays into tragedies, comedies and histories. Besides the 1730s peak, histories seem to be less popular than comedies and tragedies when published as single plays. Another observation: the early 18th century seems to be a more “tragedy-driven” era compared to a few decades later, in the high Enlightenment, when we witness the new rise of comedies.


Shakespeare title popularity

No real surprises here. Collected works and plays are of course an important means of accessing Shakespeare. But in the top-5 list of single plays, Hamlet, Romeo and Juliet, Macbeth and Othello are where you might expect to find them. Perhaps slightly surprising is that Julius Caesar beats The Merchant of Venice and The Merry Wives of Windsor.


Cervantes popularity

What is telling when comparing Cervantes on the continent with his popularity in the English-speaking world is that Galatea (Cervantes’s first published work) does very well on the continent but was not published in English during the early modern period. At the same time, it is very clear that Don Quixote is THE single most published work by any author in early modern Europe (including the English-speaking world).


Comparison of popular titles

On this timeline we see a very interesting contest. Don Quixote’s steady, train-like rise throughout the early modern era is impressive indeed. Hamlet sees an interesting peak in the English-speaking world in the 1750s, followed by the rise of the comedies and a rapid sinking of the publishing of the Danish prince. The same happens to Romeo and Juliet shortly after. Macbeth, on the other hand, seems to follow a different, upward path towards the late eighteenth century.


Shakespeare publisher timeline

There is great scholarship by Terry Belanger on Shakespeare’s copyrights in the eighteenth century. While we are well aware of the division of Shakespeare copyrights between different publishers and the use of printing congers, what we want to highlight here is the relevance of the Tonson publishing house and the role played by John Bell towards the later eighteenth century in promoting Shakespeare in Britain (Bell has been described as a ‘bibliographic nightmare’). The illustration is based on the ESTC catalogue, where we have manually cleaned up the publisher information, combining synonymous variants of the publisher names.


NB! Notes on methodology

The trick to getting this approach to work is to harmonize the catalogued fields so that we can trust the statistics that library catalogue data provide. For example, most of our time in this analysis was spent matching different entries of works between the ESTC and the HPB and removing hundreds of duplicate entries in the HPB data. We also took full advantage of our custom algorithms, implemented in the bibliographica R package, to automate this cleaning process for any library catalogue as far as possible. The idea is not that this “big data approach”, relying on library catalogue data, would be perfect in terms of including every single translation of Shakespeare and Don Quixote on the continent or in early modern Britain. But when the harmonizing of the catalogued fields is done properly, the approach gives us trustworthy results about the publishing trends that we are interested in. While the raw data are not (yet) openly available, the full preprocessing workflows for the ESTC and CERL are available via GitHub, as well as the full source code of this blog post.
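To give a flavour of what harmonizing a catalogue field involves, here is a minimal Python sketch of collapsing synonymous publisher-name variants. The actual rules live in the bibliographica R package; the normalization steps and the variant table below are a hypothetical toy example:

```python
import re

def harmonize_publisher(name):
    """Naive harmonization of early modern publisher-name variants.
    A toy illustration, not the bibliographica implementation."""
    s = name.lower()
    s = re.sub(r"[.,;:]", " ", s)               # drop punctuation
    s = re.sub(r"\bprinted (for|by)\b", "", s)  # strip imprint phrases
    s = re.sub(r"\s+", " ", s).strip()          # collapse whitespace
    # Map known spelling variants to one canonical form (hypothetical table).
    variants = {
        "j tonson": "jacob tonson",
        "tonson": "jacob tonson",
    }
    return variants.get(s, s)

for raw in ["Printed for J. Tonson", "TONSON;", "Jacob Tonson"]:
    print(harmonize_publisher(raw))  # all three map to "jacob tonson"
```

Only after variants like these are collapsed can counts per publisher (as in the timeline above) be trusted.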


RPA: a fully scalable preprocessing method for short oligonucleotide microarray atlases

How do you preprocess 20,000 CEL files (or more) on an ordinary desktop computer in a few hours? Our new Online-RPA algorithm – developed in collaboration with the EBI functional genomics group and recently published in Nucleic Acids Research (2013) – enables full utilization of the most comprehensive microarray data collections available to date. We hope it will be widely adopted by the microarray community, and we welcome feedback on the implementation.

Transcriptome-wide profiling data sets are now available on standardized microarray platforms (such as the Affymetrix HG-U133Plus2 array) for tens of thousands of samples, covering thousands of body sites and disease conditions through ArrayExpress and other genomic data repositories. The lack of scalable probe-level preprocessing techniques for very large gene expression atlas collections has formed a bottleneck for full utilization of these data resources.

The new online version of RPA (Robust Probabilistic Averaging) allows fully scalable analysis of contemporary (Affymetrix and other) short oligonucleotide microarray atlases of any size, up to arbitrarily large collections involving hundreds of thousands of samples. The scalability is achieved by sequential hyperparameter updates, which circumvent the extensive memory requirements of standard approaches. Unlike fRMA, our method is readily applicable to all short oligonucleotide platforms. It also outperforms the standard RMA (a special case of the general RPA model) even in moderately sized standard data sets, and it can be used as the default preprocessing method for short oligo microarrays.
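The sequential-update idea can be illustrated with a classic streaming estimator. Online-RPA's probabilistic model is of course more involved, but the memory argument is the same: each sample updates the current parameter estimates and is then discarded, so memory use stays constant regardless of atlas size. A Python analogy using Welford's online mean/variance algorithm (an illustration only, not the RPA model itself):

```python
def online_mean_var(stream):
    """Welford's online algorithm: update running mean and variance one
    sample at a time, with constant memory. Analogous in spirit to the
    sequential hyperparameter updates in Online-RPA."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n           # update running mean
        m2 += delta * (x - mean)    # update running sum of squared deviations
    variance = m2 / (n - 1) if n > 1 else 0.0  # sample variance
    return mean, variance

mean, var = online_mean_var(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
print(round(mean, 3), round(var, 3))  # 5.0 4.571
```

Because the input is consumed as a stream, the same code handles eight samples or hundreds of thousands without ever holding the full collection in memory.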


Online-RPA is freely available as an R/Bioconductor package. The wiki site provides installation instructions and usage examples. For feedback, issues, bug tracking, and pull requests, see the GitHub development version.



Selecting an open license in academia

A confusing number of open licenses are available. A key issue for academics is to ensure the widest possible reuse of the material by setting minimal restrictions on the end user. I have collected an Open Licensing Memo to help newcomers with selecting a license for scientific software, data and documents.

In summary, the FreeBSD and MIT software licenses are recommended, since they set minimal restrictions on the end user, promoting the core scientific standards of publicity, transparency and unrestricted reuse. The popular but more restrictive viral GPL licenses are for the same reason less preferable in academia, unless licensing compatibility issues mandate their use.

Open licensing helps to guarantee your own rights to your work, encourages reuse in a legally sustainable manner, and has advertising value for funding organisations, other scientists, fellow geeks, and laypeople. It is really simple, and widely encouraged.

Unrestricted reuse of research results is a cornerstone of science. Explicit policies – in part enforced by open licenses – are needed to uphold long-standing scientific standards in an evolving publication landscape where an increasing proportion of research details are embedded in the code and data accompanying traditional publications. Research institutions should grant researchers explicit rights to publish their code under an open license, to promote transparency and reproducibility. At the moment, the legal status is often unclear in this regard, although many researchers continue publishing their code under open licenses.


Defining the baseline for publication sharing

Recent blog posts by Bill Gasarch, and an improved version by Daniel Lemire, call for an explicit manifesto to promote standard sharing practices: store your work in an open archive such as arXiv; provide RSS feeds; post improvements to your work; make background material (slides) available; and so on. An explicit manifesto would help to define the baseline and bring visibility. The discussion of the minimum recommendations is ongoing.

It is intriguing to consider an ever-evolving publication model where a manuscript can be constantly refined over time and referred to by version numbers. Peer review is a key issue here. However, if the original author maintains responsibility and the audience has a chance to comment online, in the PLoS ONE style, perhaps the problem of peer review could be overcome. It is tempting to imagine a culture of scientific publishing where the peer-reviewed publication marks the beginning of a refinement process rather than an end.

Perhaps the publish-or-perish atmosphere in academic culture could be changed within the evolving publication landscape, where open publishing practices are providing new tools to advance quality and impact?


Deciphering the structure of life and society through open science

Understanding the functional organization of genetic information is a major challenge in modern biology. Following the initial publication of the human genome sequence in 2001, advances in high-throughput measurement technologies and efficient sharing of research material through community databases have opened up new views into the study of living organisms and the structure of life.

High-dimensional genomic observations are associated with high levels of complex and largely unknown sources of variation. By combining statistical evidence across multiple measurement sources with the wealth of background information in genomic data repositories, it has become possible to resolve some of the uncertainties associated with individual observations and to identify functional mechanisms that could not be detected from individual measurement sources. However, measuring all aspects of genome function is far beyond what any single research institution could afford. Therefore, sharing research data, methods, code, ideas, and other research material through centralized databases has become a central element of investigating the structure and function of the human genome. Contributing to these global efforts was also a key underlying motivation for my recent thesis, in which novel computational strategies with open-source implementations were developed to investigate various aspects of genome function through integrative analysis of heterogeneous genomic data sources.

Similar computational challenges are encountered in quantitative social science, where the limited availability of observations and of targeted computational tools forms a bottleneck in investigating such extremely complex and poorly understood systems. Can the open data movement foster the kind of global collaboration that we currently have in modern genomics, to understand the structure and function of human societies?


Machine learning open source software – ICML/MLOSS workshop June 25, 2010

Computational solutions developed in one application domain are often easily generalized to new problem settings. However, the poor availability of algorithmic implementations slows down the flow from theory to practice. An increasing proportion of the research details in modern data-intensive science are embedded in the code and data accompanying traditional publications. The lack of widely adopted standards for sharing these resources forms a serious bottleneck for transparency, reproducibility, and progress.

The Machine Learning Open Source Software (MLOSS) workshop provides a meeting point for the machine learning community to discuss potential solutions to these challenges. Victoria Stodden, a Science Commons fellow, gave a fascinating keynote talk on intellectual property issues in modern data-intensive science. This and other presentations are freely available. Widespread adoption of open source policies will have a remarkable impact on machine learning and its applications through an accelerated scientific process and enhanced reproducibility (see the recent position paper advocating the need to make computational code available to other scientists, data analysts, and the general public). The community website supports the movement towards open access by hosting >200 machine learning projects. Our Probabilistic Dependency Modelling Toolkit is one among the many.

Unrestricted access to algorithmic solutions will have wider implications for society by facilitating the emergence, flow, and application of computational ideas. The full potential of these resources will be realized only when the public at large has convenient access to shared data resources and tools for discovery.
