I was both excited and nervous about my first RStudio conference, which had just taken place in Austin, Texas. I was excited because RStudio kindly offered me a scholarship to attend, and I was nervous because most of the attendees were mid-career professionals with advanced STEM degrees – yes, I had made some charts and run some regressions in R, but my background as a recent college graduate in political science still seemed insufficient for a conference of this caliber.
However, all my fears and anxieties dissipated after a few hours at the conference. Everyone I met was excited to talk about their work, some going to great lengths to explain what were probably the most basic aspects of it. The rstudio::conf crowd was the most diverse and supportive group I have ever met, and I truly hope every student who works with data, whether in the humanities, the social sciences, or any other discipline, gets to attend.
When we talk about diversity, we typically mean diversity in race and ethnicity, gender, sexual orientation, or (dis)ability. While minorities in these categories were well-represented at this year’s conference, I was most struck by the diversity in industry and methodology, i.e., the fields people collected data in and the ways they approached data analysis.
Diversity in Industry
Talking to people at the conference gave me a whole new perspective on “data.” A telling anecdote is that four fellow students who also received the scholarship all kept notes titled “Cool People I Met at the Conference.” Here’s my own, non-exhaustive list of “cool people I met”:
A senior data scientist at one of the world’s largest pump companies (yes, a company that makes water pumps)
This data scientist told me his company’s senior leadership recognized the power of data and built a data science team several years ago for “predictive maintenance.” Hardware is installed on all new pumps sold, and the data scientists at company headquarters analyze when and where things are most likely to go wrong in the pumps.
I found our conversation fascinating because I had never thought that “data science” would be a possibility at pump companies, let alone the idea that the senior leadership of such a company would take the initiative. I also talked to the data scientist about compliance issues in overseas markets, since data security and industrial espionage are major concerns for regulators worldwide.
The lead meteorologist at a renewable energy company
We struck up a conversation after I recognized the logo on his business card – it’s the same company that supplies the gas and electricity for my apartment!
The meteorologist said he collected minute-by-minute weather data and used it to predict the amount of energy that could be generated from wind and solar. It was very inspiring to hear him talk because his eyes lit up every time he started talking about the technical details of his work. “I’ve been wanting to do this since I was 14,” he said multiple times.
Private contractors conducting national surveys on behalf of the U.S. federal government
I met several statisticians and programmers working for a private company that conducts surveys on behalf of the U.S. federal government. As a student of social science, I had used several surveys conducted by this company, so there was a lot for us to talk about.
Here’s a brief overview of their workflow: Once the company signs a contract with a branch of the government, the statisticians work out a sampling strategy to ensure a nationally representative sample (often with minority populations oversampled). Field workers then knock on doors to conduct the survey. Once the results are reported back to headquarters, a group responsible for data cleaning cleans the data (hats off to them!), and the statisticians apply the appropriate weights. The company also has many full-time programmers who build internal tools for the statisticians.
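To make the weighting step concrete, here is a minimal Python sketch of how oversampling and weights interact. It is only an illustration under my own assumptions: the group labels, population shares, and values are made up, and the company's actual weighting procedure is surely more involved than this simple share-based adjustment.

```python
def weighted_mean(values, weights):
    """Return the weighted average of values."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical population: group "B" is 10% of the population,
# but the sample deliberately oversamples it to 50% of respondents.
population_share = {"A": 0.9, "B": 0.1}
sample = [("A", 10.0)] * 50 + [("B", 20.0)] * 50  # (group, response) pairs

n = len(sample)
sample_share = {
    group: sum(1 for g, _ in sample if g == group) / n
    for group in population_share
}

# Each respondent's weight is population share divided by sample share,
# so oversampled groups are scaled back down.
weights = [population_share[g] / sample_share[g] for g, _ in sample]
values = [v for _, v in sample]

unweighted = sum(values) / n               # treats the sample as if it were the population
weighted = weighted_mean(values, weights)  # corrects for the oversampling

print(f"unweighted mean: {unweighted:.2f}")
print(f"weighted mean:   {weighted:.2f}")
```

In this toy sample the unweighted mean overstates the influence of group B, while the weighted mean pulls the estimate back toward what a proportionally drawn sample would give.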
When I asked whether they had been affected by the federal government shutdown, they said that their own teams were doing okay but that some of their colleagues had been told to stop working for now, since they would not be compensated for their work.
One statistician who worked mostly with health surveys shared many interesting stories regarding self-reported data. For example, she said people routinely report being taller and lighter than they actually are, so good health surveys now all require in-person visits by a field worker with a measuring tape and a scale.
Economists and data scientists at one of the largest job sites in the U.S.
I got to meet five people from this company, all working in different roles. Some use the site’s job posting data to analyze the U.S. labor market, while others study how employers use the site and which employers are more likely to place advertisements.
I was most impressed by the diverse academic training this group had received. Only one of the five people I met had an economics degree. The others had studied biology, sociology, law, and information studies. They said the company was very proud of building diverse teams.
The data analytics director at an international charity organization
I was surprised to learn that her organization had nearly 20 people working on data analytics. However, all they were tracking was the organization’s spending. The director lamented the lack of resources in her department and wished that her organization could carry out rigorous evaluation of its charity programs.
On the brighter side, the director said all the statistics were reported to her in a database that the organization built a few years ago, so her team no longer had to deal with Excel spreadsheets!
Many others I do not have space to document
- A research analyst at a non-profit funded by the U.S. Department of Defense
- A data visualization consultant who visualizes Canadian fisheries for local governments
- A lab scientist at a genomic lab who “probably runs millions of regressions a week”
- Marketers and software engineers at one of the largest mutual funds in the U.S.
- Turns out there are rules in finance that prohibit marketers from changing the format of their performance reports to mislead customers!
Diversity in Methodology
Besides the diversity in industry, I was also grateful for the diversity in methodology I was exposed to at the conference. The social sciences are primarily focused on causal inference, but speakers at the RStudio conference discussed many other approaches to data analysis.
Mindlessly regressing y on x is a cardinal sin in any social science discipline, but at the RStudio conference, I saw many scenarios where this approach is not only useful but also superior to all the alternatives. For the lab sciences, it is helpful for identifying potential relationships that could be tested for causation at a later stage in randomized experiments. For marketers, knowing that “A is a good predictor of B” is probably sufficient – finding out whether A causes B is likely overkill.
A/B Testing (or Online Experiments)
Social scientists conduct experiments as well, but the data science teams at Internet companies seem to have a different focus. Online A/B tests, such as changing the color of a button, are a lot less messy than the experiments many social scientists conduct in the field, so identification is more straightforward. Instead, data scientists at Internet companies are more interested in:
- building the infrastructure and fully automating the workflow so that A/B tests can be conducted in a consistent manner rather than on an ad-hoc basis;
- improving the data literacy of non-statisticians such as designers and product managers, so that they can commit to a pre-analysis plan and not vary the duration of the experiment once the experiment has started.
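The second point can be sketched in a few lines of Python. This is a hypothetical illustration rather than any company's actual tooling: the function names, the planned sample size, and the conversion counts are all my own assumptions. The idea is that the sample size is fixed before launch and the test is only evaluated once it has been reached, so nobody stops the experiment early because the numbers happen to look good.

```python
import math

# Hypothetical pre-analysis plan: the sample size per arm is fixed before launch.
PLANNED_N_PER_ARM = 1000


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (pooled variance)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def analyze_if_complete(conv_a, n_a, conv_b, n_b, planned_n=PLANNED_N_PER_ARM):
    """Run the test only once both arms have reached the planned sample size."""
    if n_a < planned_n or n_b < planned_n:
        return None  # experiment still running: no peeking at the results
    return two_proportion_z_test(conv_a, n_a, conv_b, n_b)


# Halfway through the experiment, the analysis refuses to run.
print(analyze_if_complete(50, 500, 65, 500))

# Once both arms reach the planned size, the test is evaluated exactly once.
z, p = analyze_if_complete(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

Fixing the stopping rule in code rather than leaving it to judgment is one simple way to keep the duration from drifting once the experiment has started.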
Brooke Watson, a data scientist at the ACLU, gave a moving presentation on how her team collected the data on the children being separated from their parents at the border. For several weeks, she would receive Excel spreadsheets with different column names and missing data on these children from co-workers and collaborators around the country. Her team spent months cleaning and validating the data, and in the end, probably still missed a lot of children in their dataset.
Although all her team was able to produce was descriptive statistics on a subset of these children – their age distribution and current location – these statistics were nonetheless crucial to the public’s understanding of the crisis. Ms. Watson said legislators from across the country called to get more information, and lots of citizens were inspired to act.
I gained a new appreciation for descriptive analysis after listening to the talk. It reminded me of social scientists who spend years and even decades working with archival sources and creating original datasets from materials they saved from the dumpster or found in a neglected corner of the local library.
Although 90% of our time is spent on data entry and data cleaning, we don’t typically get credit for it because these tasks are seen as “mechanical” work that lacks the rigor of a mathematical model or a programming function. But in reality, the tenacity that’s required to plow through the grunt work of data entry and cleaning is admirable in and of itself. Knowledge can be found in models and code, but it can also be found in an Excel spreadsheet or a chart of summary statistics.
My very unscientific survey of fellow scholarship recipients shows that the RStudio conference is one of the friendliest gatherings they have ever been to. I’m not going to detail all the free career advice and generous emotional support people at the conference gave me. I will say, however, that the organizers made a great effort to create a collegial, inclusive environment.
Hadley Wickham, the host of the event, said during his keynote that we should always form a “Pac-Man” circle so as to leave some space for others to join an existing conversation. That is exactly what many groups did during the conference.
A code of conduct was repeatedly brought up by the organizers, and attendees were all told whom to contact and where to report should any unpleasant incident occur. I also saw people using the Mother’s Room at the conference, and the lunch the organizers put together for scholarship recipients was very well received.
All of the above are commendable practices I hope other conferences will adopt. Finally, there were also the throwable microphones that greatly sped up the Q&A process. I would definitely recommend them to anyone hosting large events!
Apply for the scholarship!
The next RStudio conference will be held in San Francisco in January 2020. The scholarship application opens in August 2019 and closes in mid-September. Apply. Have fun!