
Is Data Science something?

Data has been around since people started counting and measuring, and systematic treatment of data has become the norm in recent decades with the rise of automated data processing. Yet Data Science seems to have arisen only very recently as a field of study and expertise. I was curious and wanted to understand the scope and objectives of this apparently new field of study.

Trying a MOOC

Data Science Johns Hopkins
This looked like an opportunity to combine investigating Data Science with trying out a Massive Open Online Course. My searches threw up a set of courses offered on Coursera by Johns Hopkins University. There are nine courses of four weeks each plus a ‘capstone’ project of seven weeks, offered together as a ‘specialisation’. I found this attractive because:
  • I could sign up course by course, so I could drop out at any point if it did not seem interesting or useful any more;
  • the courses can be taken free of charge, but, for a fee of €35 per course, Coursera also offers a ‘signature track’ whereby your identity is checked and linked to your course work so that they can issue a verified certificate when you complete the course;
  • this looked like serious stuff: Johns Hopkins is a highly ranked university, and the intro videos by the teachers gave the impression that the course was accessible but would really teach something.

At this point I have completed the first three courses in the specialisation.

Back to the command line

Our teachers aim to make sure that we are well grounded in the basics before we get on to the pretty things like making graphs. So for a generation who only know their computers through GUIs there is an introduction to the power of the command line. It’s a lifetime since I was there myself. One of my fondest memories of the command line is on a DECsystem-10 where the operating system had its roots in the sixties and the command:
make love # to create a file named ‘love’
evoked a system response
not war?

Initially we had to install software for programming in the language R, set up version control using Git and be able to share work and results via GitHub.

Why R?

The primary Data Science tool used on the courses is the programming language R. To R, everything is a vector or can be built from a set of vectors; even a single value is a vector of length one. It is free software in the same space as IBM’s SPSS.

R is a very open language, with a whole community not only maintaining the R base but also building new R packages that provide new functions. Anyone can do that, so there are thousands of packages available. For the newcomer this adds to the difficulty of getting to the point where you feel at least some R is at your fingertips. We started off learning the R base, the essential R syntax. And we had hardly achieved any level of R mastery when we were introduced to powerful packages with a wide scope that have their own syntax.

And on top of that there are the weird-at-first-sight things, like an R package that lets you query an array using SQL syntax.
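
The idea is less strange than it first looks. As a rough sketch only, in Python rather than R and using nothing beyond the standard library, a small table of rows can be loaded into an in-memory SQLite database and then queried with ordinary SQL. The table name ‘readings’, its columns and its values are all made up for illustration, and this is an analogue of what such an R package does, not its actual interface.

```python
import sqlite3

# Made-up sample rows: (city, year, rainfall_mm). Purely illustrative.
rows = [
    ("Leiden", 2013, 820),
    ("Leiden", 2014, 790),
    ("Lyon", 2013, 840),
    ("Lyon", 2014, 910),
]


def query_rows(rows, sql):
    """Load the rows into an in-memory SQLite table and run one SQL query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE readings (city TEXT, year INTEGER, rainfall_mm INTEGER)"
        )
        conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# Average rainfall per city, expressed as a plain SQL query.
result = query_rows(
    rows,
    "SELECT city, AVG(rainfall_mm) FROM readings GROUP BY city ORDER BY city",
)
print(result)
```

The attraction is that anyone who already knows SQL can ask questions of a table without first learning another query syntax.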

What have I learned?

I have hacked and puzzled my way through the exercises and tests in three four-week courses, spending much more time than the indicated five hours per week. What have I learned?

Technical skills

  • basic skills in R, but I need to do a lot of serious work in R to consolidate and stabilise that knowledge
  • version control techniques and use of GitHub
  • sources of technical help such as Stack Overflow
  • getting data from APIs on the web - that looks as though it could be fun
  • handling raw data in many formats

Data science

The concept of tidy data and the trajectory from raw data to tidy data.

One of the exciting revelations is the amount of data available ‘out there’ from all kinds of sources in the growing world of open data. But it is horrifying to see the messy ways in which some of this data is presented. And data needs to be tidy before it can be systematically analysed and queried. It was good to see that wheels are not being reinvented here. In his paper on tidy data, Hadley Wickham makes the link between the principles of tidy data and those of relational databases as described in Codd’s relational algebra. *

In tidy data (considered in the form of a table or array):
  • each variable should be in one column (and each column contains only one variable)
  • each observation is in one row (and each row contains only one observation)
  • each table comprises one “kind” of data
  • if there are multiple related tables a column in each table provides a reference to another table
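
To make these rules concrete, here is a minimal sketch in plain Python (not R, and not taken from the course or from Wickham’s paper). The messy table, the names ‘city’, ‘year’ and ‘rainfall_mm’ and all the numbers are invented for illustration. The messy version hides the variable ‘year’ in its column headers; the tidy version gives each variable its own column and each observation its own row.

```python
# Messy: one row per city, with the values of the variable "year"
# spread across the column headers. All data here is invented.
messy = [
    {"city": "Leiden", "2013": 820, "2014": 790},
    {"city": "Lyon", "2013": 840, "2014": 910},
]


def tidy(rows, id_column, variable_name, value_name):
    """Turn every non-id column into its own (variable, value) row."""
    tidy_rows = []
    for row in rows:
        for header, value in row.items():
            if header == id_column:
                continue
            tidy_rows.append({
                id_column: row[id_column],
                # The header text "2013" becomes the value 2013.
                variable_name: int(header),
                value_name: value,
            })
    return tidy_rows


tidy_rows = tidy(messy, "city", "year", "rainfall_mm")
for r in tidy_rows:
    print(r)
```

Two wide rows become four narrow ones, one per city and year, which is the shape that sorting, filtering and grouping tools expect.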

But, no matter how much data you have available and how skilled you may be in manipulating, analysing and presenting it, the key thing is the question. What question do you have whose answer can be found in this data? A good question is like a pearl of great price.

The course material

The main teaching material is in the form of short video lectures that are primarily a commentary on a fairly sparse slide presentation. At first it seemed to me that there was not much information being presented. But when it came to doing the weekly quiz tests it quickly became clear that there was a great deal covered in the lectures, although the points were only referred to once and sometimes very briefly. And there were references to sources with more detail.

As the courses progressed the weekly quiz tests shifted from a test of information acquired and retained to a test of R programming and data analysis skills, so that each question required work in R to find the answer. This turned out to be a very effective part of the learning process. Such quiz tests were automatically evaluated on the Coursera system. But there were also projects where the work was submitted via GitHub for evaluation by fellow students. And after the submission deadline each student was required to assess the work of four anonymous fellows. That was also a good learning process, sometimes with the sinking feeling of ‘oh dear, I got that badly wrong in my submission’.

The Coursera platform includes a discussion forum for students signed up on a course. This is a good source of assistance from fellow students and tutors. I probably should have spent more time there. It also gave some insight into the wide diversity of participants, with some really struggling to understand the course material and what they should be doing in assignments and now and then a whizz kid flying by boasting that he’d done five courses in parallel with no problem.

R In Action Robert Kabacoff
Although there is a good deal of information about R available online, at some stages of the course I felt lost and uncertain and wanted a tangible source to hand for everything I needed to know. So I acquired ‘The Art of R Programming’ by Norman Matloff as an ebook and at a later stage ‘R in Action’ by Robert Kabacoff on physical paper. I was delighted to discover that the paper book also gave access to an ebook edition provided by Manning, the publisher. There are many books on various aspects of R (an indication of the popularity of the language) and I found it difficult to make a choice among such bounty.

What’s next

I am currently evaluating whether to go further with the series of courses. The next one starts on 6 October, so I have a few days to think about it. And the question of the scope and definition of Data Science continues to intrigue me, so I am working on a wee blog post about that.
* Hadley Wickham, Tidy Data
E. F. Codd, The Relational Model for Database Management: Version 2

Updated 2014-10-02 12:13 CEST
Updated 2015-01-06 13:22 CET typos
