Data Science and the Fourth Paradigm: Data-Intensive Scientific Discovery

Data is being collected at unprecedented scale and speed in all industrial sectors and scientific domains; the Large Hadron Collider at CERN is one example. Scientific breakthroughs therefore increasingly depend on advanced data analytics capabilities that help researchers organize, explore, and analyze massive volumes of data in order to gain scientific knowledge. This process, termed data science, is quickly emerging as a key foundational discipline for most domain sciences, much as mathematics has been for centuries.

Indeed, data scientists are said to have “the sexiest job of the 21st century”. This event will feature three exciting talks that explain the basics of data science and how it enables a new paradigm of data-intensive scientific discovery, coined “the Fourth Paradigm” by the late Jim Gray, winner of the Turing Award – the “Nobel” prize of computer science. We invite everyone with an interest in data science, especially domain scientists using data intensively, to participate in this event.


14:00-14:10: Welcome

14:10-15:10: Democratizing Data Science by Bill Howe
Why does data science remain so difficult? Despite years of focused research in systems and methods, data science remains a high-touch, attention-intensive exercise. There has been a “Cambrian explosion” of big data systems proposed and evaluated in the last eight years, but relatively little understanding of how these systems or the ideas they implement compare and complement one another.

At the UW eScience Institute and in the UW Database Group, we are building platforms to reduce the complexity of this landscape, aiming to democratize access to advanced data management and analytics across all fields of science and across all levels of expertise.

In this talk, I'll describe findings from a multi-year deployment of a database-as-a-service system called SQLShare, and current research in the context of the Myria project, a federated data management and analytics system that supports multiple backend engines, iteration as a first-class citizen, new algorithms, built-in visualization and performance profiling, and a language interface that balances imperative and declarative features. I'll show use cases in oceanography, biology, and the social sciences.

15:10-15:25: Break

15:25-16:10: Data Science and some (classical) challenges by Søren Højsgaard
Once upon a time, statisticians (the data scientists of an earlier age) often distinguished between data from planned experiments and data from observational studies. They also focused on writing down the questions to be asked (sometimes phrased as hypotheses) in carefully prepared protocols before data was collected. The purpose was, amongst other things, to protect against jumping to unwarranted conclusions based on spurious data. Modern data science, together with powerful computers, offers new possibilities for the type and scale of problems that can be handled. Perhaps some of the good old practices should be revitalized in this context?

15:25-16:10: Extracting Value from Big Data – The Case of Vehicular Traffic Data by Christian S. Jensen
Almost all areas of everyday life are accompanied, guided, and influenced by computing and communication devices that are embedded in large networks, most notably the Internet. These devices produce increasing amounts of data. There is consensus that society and businesses can benefit substantially from the ability to base their functioning on these large amounts of data, often called big data. Notions such as “data-driven business” and “data-driven society” have been advanced, suggesting that entities that are able to base their decisions and operations on data can be more competitive than those that are not. Big data is slated to have a profound effect on society. We are in the middle of a revolution in the way we live, work, and interact.

As a case study, this presentation offers insight into some of the research at Daisy that aims to create value from big vehicular trajectory data. A central theme is to gain a more detailed and up-to-date understanding of a transportation infrastructure, which may in turn be used for improving the utilization of the infrastructure and for improving the infrastructure itself.

16:10-16:30: Plenary discussion


Bill Howe is the Associate Director of the UW eScience Institute and holds an Affiliate Faculty appointment in Computer Science & Engineering. His research interests are in data management, curation, analytics, and visualization in the sciences. Howe has received two Jim Gray Seed Grant awards from Microsoft Research for work on managing environmental data, has had two papers selected for VLDB Journal's "Best of Conference" issues (2004 and 2010), and co-authored what are currently the most-cited papers from both VLDB 2010 and SIGMOD 2012. Howe serves on the program and organizing committees for a number of conferences in the area of databases and scientific data management, and developed a first MOOC on data science that attracted over 200,000 students across two offerings. He has a Ph.D. in Computer Science from Portland State University and a Bachelor's degree in Industrial & Systems Engineering from Georgia Tech.

Søren Højsgaard is an associate professor of statistics and Head of the Department of Mathematical Sciences. He has 20 years of experience with applied statistics, in particular related to the life sciences. He is a long-time user and developer of the statistical program R; he has authored several R packages and a book about graphical models with R.

Christian S. Jensen is an Obel Professor of Computer Science at Aalborg University, Denmark. He was a Professor at Aarhus University for a 3-year period from 2010 to 2013, and he was previously at Aalborg University for two decades. He recently spent a 1-year sabbatical at Google Inc., Mountain View. His research concerns data management and data-intensive systems, and its focus is on temporal and spatio-temporal data management. Christian is an ACM and an IEEE fellow, and he is a member of the Academia Europaea, the Royal Danish Academy of Sciences and Letters, and the Danish Academy of Technical Sciences. He has received several national and international awards for his research. He is Editor-in-Chief of ACM TODS and was an Editor-in-Chief of The VLDB Journal from 2008 to 2014.


The event is aimed primarily at domain researchers and at private and public companies and institutions. A very limited number of places will be available for students.




Time and place

Date: 22 June 2015
Time: 14:00-16:30
Venue: Department of Computer Science, Aalborg University, Selma Lagerløfs Vej 300, room 0.1.95, 9220 Aalborg Øst
Price: The event is free of charge. Registration (and cancellation, if applicable) is nevertheless required.
Contact name: Marianne Bentzen
Contact e-mail
Registration deadline: 18 June 2015


The network's activities are co-financed by the Danish Ministry of Higher Education and Science (Uddannelses- og Forskningsministeriet) and run by a consortium consisting of:
Alexandra Instituttet . BrainsBusiness . CISS . Datalogisk Institut, Københavns Universitet . DELTA . DTU Compute, Danmarks Tekniske Universitet . Institut for Datalogi, Aarhus Universitet . IT-Universitetet . Knowledge Lab, Syddansk Universitet . Væksthus Hovedstadsregionen . Aalborg Universitet