This post was originally published by Charlie Kufs at Towards Data Science
Things have changed a lot in fifty years.
Anonymous feline in witness protection program.
Those who cannot remember the past are condemned to repeat it.
History isn’t always clear-cut. It’s written by anyone with the will to write it down and the forum to distribute it. It’s valuable to understand different perspectives and the contexts that created them. The evolution of the term Data Science is a good example.
I learned statistics in the 1970s in a department of behavioral scientists and educators rather than a department of mathematics. At that time, the image of statistics was framed by academic mathematical-statisticians. They wrote the textbooks and controlled the jargon. Applied statisticians were the silent majority, a sizable group overshadowed by the academic celebrities. For me, reading Tukey’s 1977 book Exploratory Data Analysis was a revelation. He came from a background of mathematical statistics yet wrote about applied statistics, a very different animal.
My applied-statistics cohorts and I were a diverse group-educational statisticians, biostatisticians, geostatisticians, psychometricians, social statisticians, and econometricians, nary a mathematician in the group. We referred to ourselves collectively as data-scientists, a term we heard from our professor. We were all data scientists, despite our different educational backgrounds, because we all worked with data. But the term never stuck and faded away for over the years.
Feline crunching on number keys.
Applied statistics had been very important during World War II, most notably in code breaking but also in military applications and more mundane logistics and demographic analyses. After the war, the dominance of deterministic engineering analysis grew and drew most of the public’s attention. There were many new technologies in consumer goods and transportation, especially aviation and the space race, so statistics wasn’t on most people’s radar. Statistics was considered to be a field of mathematics. The public’s perception of a statistician was a mathematician, wearing a white lab coat, employed in a university mathematics department, who was working on who-knows-what.
One of the technologies that came out of WWII was ENIAC, which led to the IBM/360 mainframes of the early 1960s. These computers were still huge and complex, but compared to ENIAC, quite manageable. They were a technological leap forward and inexpensive enough to become part of most university campuses. Mainframes became the mainstays of education. Applied statisticians and programmers led the way; computer rooms across the country were packed with them.
In 1962, John Tukey wrote in “The Future of Data Analysis”
“For a long time, I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt…I have come to feel that my central interest is in data analysis, which I take to include, among other things: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.”
I read that paper as part of my graduate studies. Perhaps applied statisticians saw this paper as an opportunity to develop their own identity, apart from determinism and mathematics, and even mathematical statistics. But it really wasn’t an organized movement, it just evolved.
One of my cohorts and I discussing data science.
So as my cohorts and I understood it, the term data-sciences was really just an attempt to coin a collective noun for all the number-crunching, just as social-sciences was a collective noun for sociology. anthropology, and related fields. The data sciences included any field that analyzed data,regardless of the domain specialization,as opposed to pure mathematical manipulations. Mathematical statistics was NOT a data science because it didn’t involve data. Biostatistics, chemometrics, psychometrics, social and educational statistics, epidemiology, agricultural statistics, econometrics, and other applications were part of data science. Business statistics, outside of actuarial science, was virtually nonexistent. There were surveys but business leaders preferred to call their own shots. Data-driven business didn’t become popular until the 21stcentury. But if it had been a substantial field, it would have been a data science.
Computer programming might have involved managing data but to statisticians it was not a data science because it didn’t involve any analysis of data. There was no involved. At the time, it was called data processing. It involved getting data into a database and reporting them, but not analyzing them further. Naur (1974) had a different perspective. Naur was a computer scientist who considered data science to encompass dealing with existing data, and not how the data were generated or were to be analyzed. This was just the opposite of the view of applied statisticians. Different perspectives.
Programming in the 1950s and 1960s was evolving from the days of flipping switches on a mainframe behemoth, but was still pretty much limited to Fortran, COBOL, and a bit of Algol. There were issues with applied statisticians doing all their own programming. They tended to be less efficient than programmers and were sometimes unreliable. To paraphrase Dr. McCoy, I’m an applied statistician not a computer programmer.” This philosophy was reinforced by British statistician Michael Healy when he said:
No single statistician can be expected to have a detailed knowledge of all aspects of statistics and this has consequences for employers. Statisticians flourish best in teams-a lone applied statistician is likely to find himself continually forced against the edges of his competence.
M.J.R. Healy. 1973. The Varieties of Statistician. J. Royal Statistical Society. 136(1), p. 71–74.
So when the late 1960s brought statistical-software-packages, most notably BMDP and later SPSS and SAS, applied statisticians were in Heaven. Still, the statistical packages were expensive programs that could only run on mainframes, so only the government, universities, and major corporations could afford their annual licenses, the mainframes to run them, and the operators to care for the mainframes. I was fortunate. My university had all the major statistical packages that were available at the time, some of which no longer exist. We learned them all, and not just the coding. It was a real education to see how the same statistical procedures were implemented in the different packages.
Waiting for the mainframe to print out my analysis.
Throughout the 1970s, statistical analyses were done on those big-as-dinosaurs, IBM/360 mainframe computers. They had to be sequestered in their own climate-controlled quarters, waited on command and reboot by a priesthood of system operators. No food and no smoking allowed! Users never got to see the mainframes except, maybe, through a small window in the locked door. They used magnetic tapes. I saw ‘em.
Conducting a statistical analysis was an involved process. To analyze a data set, you first had to write your own programs. Some people used standalone programming languages, usually Fortran. Others used the languages of SAS or SPSS. There were no GUIs (Graphical User Interfaces) or code writing applications. The statistical packages were easier to use than the programming languages but they were still complicated
Once you had handwritten the data-analysis program, you had to wait in line for an available keypunch machine so you could transfer your program code and all your data onto 3¼-by-7⅜-inch computer punch-cards. After that, you waited so you could feed the cards through the mechanical card-reader. On a good day, it didn’t jam … much. Finally, you waited for the mainframe to run your program and the printer to output your results. Then the priesthood would transfer the printouts to bins for pickup. When you picked up your output sometimes all you got was a page of error codes. You had to decipher the codes, decide what to do next, and start the process all over again. Life wasn’t slower back then, it just required more waiting.
In the 1970s, personal computers, or what would eventually evolve into what we now know as PCs, were like mammals during the Jurassic period, hiding in protected niches while the mainframe dinosaurs ruled. Before 1974, most PCs were built by hobbyists from kits. The MITS Altair is generally acknowledged as the first personal computer, although there are more than a few other claimants. Consumer-friendly PCs were a decade away. (My first PC was a Radio Shack TRS-80, AKA Trash 80, that I got in 1980; it didn’t do any statistics but I did learn BASIC and word processing.) Big businesses had their mainframes but smaller businesses didn’t have any appreciable computing power until the mid-1980s. By that time, statistical software for PCs began to spring out of academia. There was a ready market of applied statisticians who learned on a mainframe using SAS and SPSS but didn’t have them in their workplaces.
Feline PC watching feline mainframe.
Statistical analysis changed a lot after the 1970s. Punch cards and their supporting machinery became extinct. Mainframes were becoming an endangered species, having been exiled to specialty niches by PCs that could sit on a desk. Secure, climate-controlled rooms weren’t needed nor were the operators. Now companies had IT Departments. The technicians sat in their own areas, where they could eat and smoke, and went out to the users who had a computer problem. It was as if all the Doctors left their hospital practices to make house calls.
Inexpensive statistical packages that ran on PCs multiplied like rabbits. All of these packages had GUIs; all were kludgy and even unusable by today’s standards. Even the venerable ancients, SAS and SPSS, evolved point-and-click faces (although you could still write code if you wanted). By the mid-1980s, you could run even the most complex statistical analysis in less time than it takes to drink a cup of coffee … so long as your computer didn’t crash.
PC sales had reached almost a million per year by 1980. But then in 1981, IBM introduced their 8088 PC. Over the next two decades, the number of IBM-compatible PCs that were sold increased annually to almost 200 million. From the early 1990s, sales of PCs had been fueled by Pentium-speed, GUIs, the Internet, and affordable, user-friendly software, including spreadsheets with statistical functions. MITS and the Altair were long gone, now seen only in museums, but Microsoft survived, evolved, and became the apex predator.
The maturation of the Internet also created many new opportunities. You no longer had to have access to a huge library of books to do a statistical analysis. There were dozens of websites with reference materials for statistics. Instead of purchasing one expensive book, you could consult a dozen different discussions on the same topic, free. No dead trees need clutter your office. If you couldn’t find website with what you wanted, there were discussion groups where you could post your questions. Perhaps most importantly, though, data that would have been difficult or impossible to obtain in the 1970s were now just a few mouse-clicks away, usually from the federal government.
So, with computer sales skyrocketing and the Internet becoming as addictive as crack, it’s not surprising that the use of statistics might also be on the rise. Consider the trends shown in this figure. The red squares represent the number of computers sold from 1981 to 2005. The blue diamonds, which follow a trend similar to computer sales, represent revenues for SPSS, Inc. So at least some of those computers were being used for statistical analyses.
Another major event in the 1980s was the introduction of Lotus 1–2–3. The spreadsheet software provided users with the ability to manage their data, perform calculations, and create charts. It was HUGE. Everybody who analyzed data used it, if for nothing else, to scrub their data and arrange them in a matrix. Like a firecracker, the life of Lotus 1–2–3 was explosive but brief. A decade after its introduction, it lost its prominence to Microsoft Excel, and by the time data science got sexy in the 2010s, it was gone.
With the availability of more computers and more statistical software, you might expect that there may be more statistical analyses being done. That’s a tough trend to quantify, but consider the increases in the numbers of political polls and pollsters. Before 1988, there were on-average only one or two presidential approval polls conducted each month. Within a decade, that number had increased to more than a dozen. In the figure, the green circles represent the number of polls conducted on presidential approval. This trend is quite similar to the trends for computer sales and SPSS revenues. Correlation doesn’t imply causation but sometimes it sure makes a lot of sense.
Perhaps even more revealing is the increase in the number of pollsters. Before 1990, the Gallup Organization was pretty much the only organization conducting presidential approval polls. Now, there are several dozen. These pollsters don’t just ask about Presidential approval, either. There are a plethora of polls for every issue of real importance and most of the issues of contrived importance. Many of these polls are repeated to look for changes in opinions over time, between locations, and for different demographics. And that’s just political polls. There has been an even faster increase in polling for marketing, product development, and other business applications. Even without including non-professional polls conducted on the Internet, the growth of polling has been exponential.
Statistics was going through a phase of explosive evolution. By the mid-1980s, statistical analysis was no longer considered the exclusive domain of professionals. With PCs and statistical software proliferating and universities requiring a statistics course for a wide variety of degrees, it became common for non-professionals to conduct their own analyses. Sabermetrics, for example, was popularized by baseball professionals who were not statisticians. Bosses who couldn’t program the clock on a microwave thought nothing of expecting their subordinates to do all kinds of data analysis. And they did. It’s no wonder that statistical analyses were becoming commonplace wherever there were numbers to crunch.
Big Data cat.
Against that backdrop of applied statistics came the explosion of data wrangling capabilities. Relational databases and Sequel (SQL) data retrieval became the vogue. Technology also exerted its influence. Not only were PCs becoming faster but, perhaps more importantly, hard disk drives were getting bigger and less expensive. This led to data warehousing, and eventually, the emergence of Big Data. Big data brought Data Mining and black-box modeling. BI (Business Intelligence) emerged in 1989, mainly in major corporations.
Then came the 1990s. Technology went into overdrive. Bulletin Boards Systems (BBSs) and Internet Relay Chat (IRC) evolved into instant messaging, social media, and blogging. The amount of data generated by and available from the Internet skyrocketed. Google and other search engines proliferated. Data sets were now not just big, they were BIG. Big Data required special software, like Hadoop, not just because of its volume but also because much of it was unstructured.
At this point, applied statisticians and programmers had symbiotic, though sometimes contentious, relationships. For example, data wranglers always put data into relational databases that statisticians had to reformat into matrices before they could be analyzed. Then, 1995–2000 brought the R programming language. This was notable for several reasons. Colleges that couldn’t afford the licensing and operational costs of SAS and SPSS began teaching R, which was free. This had the consequence of bringing programming back to the applied-statistics curriculum. It also freed graduates from worrying about having a way to do their statistical modeling at their new jobs wherever they might be.
Conducting a data analysis in the 1990s was nowhere near as onerous as it was twenty years before. You could work at your desk on your PC instead of camping out in the computer room. Many companies had their own data wranglers who built centralized data repositories for everyone to use. You didn’t have to enter your data manually very often, and if you did, it was by keyboarding rather than keypunching. Big companies had their big data but most data sets were small enough to handle in Access if not Excel. Cheap, GUI-equipped statistical software was readily available for any analysis Excel couldn’t handle. Analyses took minutes rather than hours. It took longer to plan an analysis than it did to conduct it. Anyone who took a statistics class in college began analyzing their own data. The 1990s produced a lot of cringeworthy statistical analyses and misleading charts and graphs. Oh, those were the days.
The 2000s brought more technology. Most people had an email account. You could bring a library of ebooks anywhere. Cell phones evolved into smartphones. Flash drives made datasets portable. Tablets augmented PCs and smartphones. Bluetooth facilitated data transfer. Then something else important happened — funding.
Funding for activities related to data science and big data became available from variety of sources, especially government agencies like the National Science Foundation, the Nation Institutes of Health, and the National Cancer Institute. (The NIH released its first strategic plan for data science in 2018.) The UK government also funds training in artificial intelligence and data science. So too do businesses and non-profit organizations. Major universities responded by expanding their programs to accommodate the criteria that would bring them the additional funding. What had been called applied statistics and programming were rebranded as data science and big data.
Donoho captured the sentiment of statisticians in his address at the 2015 Tukey Centennial workshop:
“ Data Scientist means a professional who uses scientific methods to liberate and create meaning from raw data. … Statistics means the practice or science of collecting and analyzing numerical data in large quantities.
To a statistician, [the definition of data scientist] sounds an awful lot like what applied statisticians do: use methodology to make inferences from data. … [the] definition of statistics seems already to encompass anything that the definition of Data Scientist might encompass …
The statistics profession is caught at a confusing moment: the activities which preoccupied it over centuries are now in the limelight, but those activities are claimed to be bright shiny new, and carried out by (although not actually invented by) upstarts and strangers.
Feline data scientist observing feline statistician.
The rest of the story of Data Science is more clearly remembered because it is recent. Most of today’s data scientists hadn’t even graduated from college by the 2010s. They might remember, though, the technological advances, the surge in social connectedness, and the money pouring into data science programs in anticipation of the money that would be generated from them. Those factors led to a revolution.
The average age of data scientists in 2018 was 30.5, the median was lower. The younger half of data scientists were just entering college in the 2000s, just when all that funding was hitting academia. (FWIW, I’m in the imperceptibly tiny bar on the upper left of the chart along with 193 others.) But KDnuggets concluded that:
“… rather than attracting individuals from new demographics to computing and technology, the growth of data science jobs has merely creating [sic] a new career path for those who were likely to become developers anyway.”
The event that propelled Data Science into the public’s consciousness, though, was undoubtedly the 2012 Harvard Business Review article that declared data scientist to be the sexiest job of the 21stcentury. The article by Davenport and Patil described a data scientist as “a high-ranking professional with the training and curiosity to make discoveries in the world of big data.” Ignoring the thirty-year history of the term, though not the concept which was new, the article notes that there were already “thousands of data scientists … working at both start-ups and well-established companies” in just five years. I doubt they were all high-ranking.
Davenport and Patil attributed the emergence of data-scientist as a job title to the varieties and volumes of unstructured Big Data in business. But a consistent definition of data scientist proved to be elusive. Six years later in 2018, KDnuggets described Data Science as an interdisciplinary field at the intersection of Statistics, Computer Science, Machine Learning, and Business, quite a bit more specific than the HBR article. There were also quite a few other opinions about what data science actually was. Everybody wanted to be on the bandwagon that was sexy, prestigious, and lucrative.
This post was originally published by Charlie Kufs at Towards Data Science