Thoughts on Open Science/Access/Data
As various scientific fields move forward, we are beginning to notice a trend: openness. Research used to be conducted in private (to an extent), and research articles have historically been difficult to obtain. In many ways, both of these points are still true today. But the norm has been changing, and you can now find more research that's discussed openly, more articles available to the public, and so on. Overall, I have been strongly in favor of this trend, though I do think there are some things that need to be figured out with regard to privacy. In this post I want to take some time to discuss the benefits of conducting research in the open, and what some of my specific concerns are.
To really get a sense of what we're talking about, it's important to be clear about what we mean when using some of these terms. First and foremost is the general concept of "open science." While "open access" is probably a term more of you have heard (I'll get to it below), I think it's helpful to start here. The idea of open science is to make the research process itself open to the public. This can mean publicly available notes, research proposals, and so forth. Everything about the process is open to the public in some format, or at least some components of it are.
The idea is similar to an "open source" approach, in which software developers make all (or portions) of their source code available for public viewing (such as most of the code for my emotion and coping game on GitHub). There are several benefits to doing so, including easier contribution from those "outside" the project, more eyes to catch problems, easier sharing and learning from one another, and so on. It's not without its challenges (e.g., code can be improperly taken or used, and there are debates as to whether open source code is more or less secure). But overall, many developers are noticing the benefits of an open source approach and have at least some of their code publicly available.
The benefits for science can be similar. By having things more open, "citizen science" (or non-scientists contributing to the research process) becomes more feasible. Researchers can help critique one another and thus boost the quality of the research methods. Successful techniques and processes can be more easily replicated and improved upon so research can become more efficient. The list can go on.
But there are some challenges that open science faces. First, there isn't a platform equivalent to GitHub that researchers can easily use for this purpose. Sure, some academic "social networks" have tried to move in that direction, but more as an attempt to get researchers on their platform than to truly be a useful and open tool. Second, there is a lack of interest in the broader research community. Between those who are averse to change, those who feel a need to compete for funding, and so on, open science faces an uphill battle in terms of adoption within the research community. And third, there are some concerns related to privacy within some fields (which I'll come back to).
That all being said, the actual "open science" piece of the puzzle can be relatively easy and beneficial, at least in terms of keeping notes, the research process itself, and so on open. Where things get tricky is when we look at access and data.
When talking about "open access," the access is to published peer-reviewed articles, the kind found in academic journals like Nature or the Journal of Pediatric Psychology. Overall, very few people seem to be against the idea of published research articles being freely accessible to the public. Even for those of us in academia who are doing research, tracking down articles that are behind paywalls can be a huge pain and a waste of valuable time.
So why is open access not the norm? It's slowly moving in that direction, but an entire journal publishing industry needs to shift along with it. Journal article costs used to be necessary to help fund the publishers who coordinated the process of getting research collected and published in a physical format. But as things move to digital, it's unclear how this system will be sustained. Some journals give researchers an opportunity to pay upfront for their article to be open access, but the amounts can be unrealistic for researchers without decent grant funding (e.g., most students).
Nonetheless, things have been moving in that direction. Even when articles are technically behind a paywall, it has become increasingly easy to find free copies somewhere online (e.g., from university sites where the first author works, from sites like ResearchGate). This has forced the field to move in this direction, whether the publishers like it or not. The main thing that still needs to be figured out is copyright and who owns the rights to the article.
But as part of this movement towards open access, there has been a growing trend towards updating the publication process overall. For example, some journals now give approval to research studies in advance, based on their methods rather than their results. This can help to counteract the fact that articles are more likely to get published if they have "significant" results (which is, to my knowledge, still true in the case of traditional publication processes that examine articles after the full research study is completed and written up).
So overall, open access is a great thing, and scientific fields are moving in that direction. Over the next several years, we should continue to see a lot of (hopefully) good changes in this area.
Arguably one of the most beneficial, but also most difficult, aspects of an open science approach is having "open data." As the name implies, the idea is to make databases publicly available. This can allow others to run analyses that the researchers may not have considered, can make it easier for people to identify errors or outliers that need to be corrected, and so on.
But there are some valid concerns when it comes to open data. First, it's very easy to misuse data. Statistical analyses can be powerful, but can also be manipulated in ways to create "significant" results that are misleading (even, unfortunately, by professionals). If you've taken a statistics course, you've probably been told many times over how you need to be skeptical of reported results of statistics (e.g., charts in news that use scaling to exaggerate differences). If data is open, it's easier to take that data and keep running analyses until there are results that help to support whatever viewpoint a person has.
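To make the "keep running analyses" problem a bit more concrete, here's a rough sketch in Python. Everything in it is an assumption for the sake of illustration: the data is simulated random noise (so there is no real effect to find), the function names are made up, and I'm using a simple normal approximation for the p-value rather than a proper t-test. The point is only to show that if you test enough unrelated outcomes, some will come out "significant" by chance alone.

```python
import math
import random
import statistics


def two_sample_p_value(a, b):
    """Approximate two-sided p-value for a difference in means.

    Uses a normal approximation (an assumption made to keep this
    dependency-free); a real analysis would use a proper t-test.
    """
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    # Two-sided tail probability under the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


def count_false_positives(n_outcomes=200, n_per_group=50, alpha=0.05, seed=0):
    """Compare two groups of pure noise on many outcomes and count the 'hits'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_outcomes):
        # Both groups come from the same distribution: no true difference exists.
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sample_p_value(group_a, group_b) < alpha:
            hits += 1
    return hits


def chance_of_at_least_one(n_tests, alpha=0.05):
    """Chance that at least one of n independent no-effect tests looks significant."""
    return 1 - (1 - alpha) ** n_tests


if __name__ == "__main__":
    print("Significant results found in pure noise:", count_false_positives())
    print("Chance of at least one hit in 20 tests:", round(chance_of_at_least_one(20), 2))
```

With a 0.05 cutoff, the chance of at least one spurious "finding" across 20 independent tests works out to roughly 64%, which is why an open dataset plus a motivated analyst can be a risky combination.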
The second concern, and one that is relevant to me personally, is related to privacy. Many databases, at least in the social sciences, are composed of data provided by human participants. While this information can be "de-identified," that only removes obvious identifiers (e.g., name, date of birth). But even without those things, it may be possible for people to put response patterns together in a way that allows them to identify who a participant likely is. Depending on what is being studied, this can be risky. For example, participants in a research study for people with a specific diagnosis (e.g., schizophrenia, HIV) probably don't want the public to know about their diagnosis, at least not outside of their control.
To help illustrate how this might happen, let's consider an example. People with access to the data may know it was all collected at a certain university, so they'll know that participants likely live in that area. Maybe there's information about family structure, which can narrow things down a bit further. Keep going with level of education and other demographics, plus some responses on other measures, and you can see where a devoted person could start to make connections to real-world people. In an era when we have some people online (sometimes with the best of intentions) trying to track people down for various reasons, this is a major concern.
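Here's a small sketch of that narrowing-down process, again in Python. The records, field names, and function names are all hypothetical (I made them up for this example, and they don't come from any real study). The idea is to count how many participants share each combination of seemingly harmless demographic fields. Anyone who ends up in a group of one is, in principle, unique within the dataset. The smallest group size is sometimes called "k" in discussions of k-anonymity.

```python
from collections import Counter

# Entirely made-up, already "de-identified" records (no names or birth dates).
RECORDS = [
    {"id": "P01", "age_band": "30-39", "education": "bachelor", "household": "two parents"},
    {"id": "P02", "age_band": "30-39", "education": "bachelor", "household": "two parents"},
    {"id": "P03", "age_band": "30-39", "education": "bachelor", "household": "single parent"},
    {"id": "P04", "age_band": "40-49", "education": "high school", "household": "two parents"},
    {"id": "P05", "age_band": "40-49", "education": "high school", "household": "two parents"},
    {"id": "P06", "age_band": "40-49", "education": "doctorate", "household": "two parents"},
]


def group_sizes(rows, fields):
    """Count how many rows share each combination of values for the given fields."""
    return Counter(tuple(row[f] for f in fields) for row in rows)


def unique_participants(rows, fields):
    """Return the ids of rows whose combination of field values appears only once."""
    sizes = group_sizes(rows, fields)
    return [row["id"] for row in rows if sizes[tuple(row[f] for f in fields)] == 1]


def smallest_group(rows, fields):
    """Size of the smallest group; 1 means at least one person stands alone."""
    return min(group_sizes(rows, fields).values())


if __name__ == "__main__":
    fields = []
    for extra in ["age_band", "education", "household"]:
        fields.append(extra)
        print(fields, "-> unique:", unique_participants(RECORDS, fields),
              "| smallest group:", smallest_group(RECORDS, fields))
```

In this toy data, nobody is unique on age band alone, one person becomes unique once education is added, and two people are unique once household type is added as well. Real datasets have far more columns than this, so the same effect can kick in quickly.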
As another example, consider your own data. If a lot of the data that a company (e.g., your internet provider) stores about you were made public, but in anonymized (i.e., de-identified) form, how would you feel? Would you expect people to be able to track down which "anonymous" person is you? It's a serious concern, and something we need to figure out before making a lot of this data available.
But do the risks outweigh the benefits of sharing the data? Is there anything we, as researchers, can do to fully de-identify a database? I'm not sure, and I think that's something we need to actively try to figure out.
This is a bit of an abrupt end to the post, but that's because I'm not really sure where to go from here. I wrestle with this quite a bit as I debate how much information about my own research to share. If the data didn't represent people, I think it'd be much easier to justify making things open. It may be easier in cases when there's a national sample and limited geographical identifiers, but perhaps there are still ways to connect data to a person.
In the software world, there are "white hat" and "black hat" hackers. "White hat" hackers look for vulnerabilities in software with the goal of fixing them to improve security. "Black hat" hackers use software bugs/vulnerabilities to cause damage, steal data, and so on. Do we need "white hat" hackers in open science who can try to identify ways data may be linked to specific participants, or used for other inappropriate purposes? Maybe.
But let me know your thoughts. Do you think these concerns are valid? Do you have ideas about how open data can be best handled in the social sciences? Let me know in the comments!