
New publication: Increasing the equitability of data citation in paleontology: capacity building for the big data future (Smith & Raja et al., 2023; Paleobiology)

1/2/2024

Title: Increasing the equitability of data citation in paleontology: capacity building for the big data future
Authors: J.A. Smith, N.B. Raja, T. Clements, D. Dimitrijević, E. M. Dowding, E.M. Dunne, B.M. Gee, P.L. Godoy, E.M. Lombardi, L.P.A. Mulvey, P.S. Nätscher, C.J. Reddin, B. Shirley, R.C.M. Warnock, Á.T. Kocsis
Journal: Paleobiology
DOI: 10.1017/pab.2023.33

Figure 1 from the paper: "The current balance of credit distribution in paleontology (A) and a reimagined dynamic in which data-provisioning publications are equitably cited (B)."

General summary (pulled direct from the non-technical summary available with the paper): Researchers often use large databases to conduct their studies; however, they do not always provide credit, through citations, to the people who produced the data in the databases. In the field of paleontology, researchers use a large database called the Paleobiology Database (PBDB) to study global patterns and processes over millions of years. These studies use data from the PBDB and typically receive a greater number of citations than the original data-producing papers. This creates a situation where the hard work of collecting the data is not credited and rewarded in a fair way, even though this work is equally important to the field of paleontology. By fixing this issue and giving proper credit to data-producing papers, paleontology itself can be strengthened by increasing the incentives for producing data and at the same time creating more high-quality data for everyone to use.

Remember your roots

Paleontology has historically been a descriptive discipline, focused on describing new fossils, which in turn may represent new species and/or new occurrences (in space and/or time). This is hardly exclusive to paleontology – all natural history disciplines (e.g., geology, other life science sub-disciplines like herpetology or ichthyology) are rooted in simply describing observations. Even today, scientists are constantly reporting new modern observations, from new living species (e.g., the Lady Elliot Shrimp Goby) to new occurrences, like the deepest known occurrence of a fish, below 8,000 m. In paleontology, descriptions of new species and new occurrences are particularly common, even today, because of the incompleteness of the fossil record and the continued erosion and exposure of new fossil-bearing rocks. Such descriptive work is a great example of generating primary data, which are defined as being unique and novel; this is "data-generating" in the parlance that is used in the above figure ("data-provisioning" for most of the rest of the paper). This is contrasted with secondary data, which are existing data that are being reused in some form, such as by downloading data from a database that has aggregated primary data from many different sources; a good summary of the distinction can be found here.

Because primary data collection involves novel data generation, it is typically (though hardly always) more time-intensive and costly (this should not be misconstrued to mean that secondary data analyses are not or cannot be time-intensive or costly). In paleontology, the official publication of a novel occurrence involves a lot more than just the examination of the fossil and writing it up; it requires someone to have found the fossil to begin with, which is often a resource-intensive process that involves identifying potential fossil-bearing areas, surveying said areas, actually finding a fossil, collecting it, and preparing it. As a result, many data-generating paleontological studies are at a very specific scale (e.g., a single fossil or a set of fossils from one site). Some examples of my primary-data-generating studies from 2023:

Hart, L.J., Gee, B.M., Smith, P.M. and McCurry, M.R. 2023. A new chigutisaurid (Brachyopoidea, Temnospondyli) with soft tissue preservation from the Triassic Sydney Basin, New South Wales, Australia. Journal of Vertebrate Paleontology 42(6): e2232829. DOI: 10.1080/02724634.2023.2232829
Gee, B.M., Beightol, C.V. and Sidor, C.A. 2023. A new lapillopsid from Antarctica and a reappraisal of the phylogenetic relationships of early diverging stereospondyls. Journal of Vertebrate Paleontology 42(6): e2216260. DOI: 10.1080/02724634.2023.2216260
Kligman, B.T., Gee, B.M., Marsh, A.D., Nesbitt, S.J., Smith, M.E., Parker, W.G. and Stocker, M.R. 2023. Triassic stem caecilian supports dissorophoid origin of living amphibians. Nature 614: 102–107. DOI: 10.1038/s41586-022-05646-5

The next frontier

The vast majority of paleontological work has historically centered on generating primary data because that was the only thing that was really feasible until computers and other technological advancements permitted more comprehensive, and often intensive, analyses that are based on aggregating data from many sources. Today, such studies are far more tractable and thus far more common, not just for paleontology but across all disciplines (AI being the most prominent recent development that may further expand the research horizons). We also have a variety of openly available databases that aggregate hundreds of thousands of published records, such as the Paleobiology Database (PBDB). A secondary dataset could thus be generated in a matter of mere minutes but end up comprising thousands of unique records, each of which required hundreds of hours to produce. These secondary data studies can take on a variety of forms, such as meta-analyses, bibliometric analyses, or the umbrella term of 'big data' studies; all share a commonality of collating data from many sources in order to create a larger sample size that can be used to tackle questions at larger scales. Some examples from Emma Dunne, one of the other authors on this paper:

Dunne, E. M., Thompson, S. E., Butler, R. J., Rosindell, J., & Close, R. A. (2023). Mechanistic neutral models show that sampling biases drive the apparent explosion of early tetrapod diversity. Nature Ecology & Evolution 7(9): 1480–1489. DOI: 10.1038/s41559-023-02128-3
Dunne, E. M., Farnsworth, A., Greene, S. E., Lunt, D. J., & Butler, R. J. (2021). Climatic drivers of latitudinal variation in Late Triassic tetrapod diversity. Palaeontology 64(1): 101–117. DOI: 10.1111/pala.12514
Dunne, E. M., Close, R. A., Button, D. J., Brocklehurst, N., Cashmore, D. D., Lloyd, G. T., & Butler, R. J. (2018). Diversity change during the rise of tetrapods and the impact of the ‘Carboniferous rainforest collapse’. Proceedings of the Royal Society B: Biological Sciences 285(1872): 20172730. DOI: 10.1098/rspb.2017.2730

A numbers game

Academics, like many people, prefer hard numbers (quantitative) to relative assessments (qualitative). Therefore, we often rely on various imperfect numerical metrics and proxies to attempt to assess research (and researcher) quality. Two examples are impact factor (IF) and citation count. Impact factor is a journal-specific metric and is calculated as follows:

For a given year, the X-year impact factor is the ratio between the number of citations received in that year by publications in that journal that were published in the preceding X years and the total number of "citable items" published in that journal during the same preceding X years. There are various common choices of X (e.g., a two-year impact factor).
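To make the definition concrete, here is a minimal Python sketch of the calculation. The function name, its arguments, and the journal numbers are all hypothetical and chosen only for illustration; this is not how any indexing service actually implements the metric.

```python
def impact_factor(year, window, items_by_year, citations_by_pub_year):
    """X-year impact factor for `year`, where X = `window`.

    items_by_year:         {publication year: number of citable items}
    citations_by_pub_year: {publication year: citations received in `year`
                            by items published in that publication year}
    Both mappings are hypothetical inputs for this sketch.
    """
    preceding_years = range(year - window, year)  # e.g., 2021 and 2022 for 2023
    items = sum(items_by_year.get(y, 0) for y in preceding_years)
    citations = sum(citations_by_pub_year.get(y, 0) for y in preceding_years)
    if items == 0:
        raise ValueError("no citable items in the preceding window")
    return citations / items


# Made-up journal: 120 citable items in 2021-2022 that were cited 300 times in 2023.
items = {2021: 50, 2022: 70}
citations_in_2023 = {2021: 160, 2022: 140}
two_year_if = impact_factor(2023, 2, items, citations_in_2023)
print(two_year_if)
```

Because the numerator counts only citations that the indexer can see, any citation that goes untracked lowers the ratio directly.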

Journals with a higher IF are viewed as being more prestigious (selective about what they publish), and even though there are many reasons why a journal's IF does not have much of any bearing on the direct quality of a single article or the researchers who published it, IF remains a popular metric. For example, Nature and Science, two journals considered among the most prestigious journals across many disciplines, have two-year IFs of 64.8 and 56.9, respectively. By comparison, journals like the Journal of Paleontology and the Journal of Vertebrate Paleontology have two-year IFs below 3.0 and are thus considered much less prestigious. Because secondary data studies aggregate data, they can tackle questions that are "bigger picture" and are thus more appealing to selective journals that aren't interested in publishing a description of a new fossil (unless it's a dinosaur or something very cool). As a result, data-generating studies are often published in society journals (e.g., Journal of Paleontology, Acta Palaeontologica Polonica, Journal of Vertebrate Paleontology) that are not considered prestigious.

IF is based on citation count, but citation counts can also be used on their own (e.g., a researcher can list the citation counts for all of their articles on their CV). They can serve as a semi-quantitative assessment of the quality of research (on the premise that high-quality work is cited more often), but there are also many flaws with correlating citation counts with research(er) quality.

The problem

Most researchers are predominantly either data generators or data reusers; this does not mean that one cannot be both, but it is rare for people to be frequently engaged in both forms of analysis such that they are more or less a 50-50 split. Because of the aforementioned reliance on numerical metrics like journal IF and citation counts, researchers with a publishing portfolio that is skewed towards secondary data analysis are more likely to have publications in journals with higher IFs (read as more prestigious) and thus be more competitive / attractive in everything from hiring decisions to grant awards because of a perception that their work is more important, rigorous, "big picture," or applicable when compared to researchers who are primarily data generators publishing in lower-IF journals. However, if primary data generators become increasingly less likely to be funded, the generation of primary data grinds to a halt, and this has downstream effects on what analyses can be done using secondary data when people get stuck with the same old data. Many papers, for example, have demonstrated the growing crisis associated with a lack of funding for taxonomy (e.g., Agnarsson & Kuntner, 2007; Drew, 2011; Löbl et al., 2023).

There are many reasons why data-generating studies and researchers tend to be undervalued in contemporary academia, but one of them is the set of practices around properly citing sources. As with the citation of previous articles' findings or conclusions, secondary data studies should also be citing the data-generating studies that they rely on for the analysis, yet many fail to properly do so, leading to the gaps in quantitative metrics that in turn lead to the devaluation of data-generating research(ers).

The data

What this study did is put numbers on the qualitative observation that data-generating papers tend to be undercited (undercredited). In paleontology, many of the secondary data analyses rely on aggregating data from the Paleobiology Database, but many of them do not cite the data-generating studies in a way that can be tracked and thus properly credited (I think that some of my own publications have been used in PBDB studies but frankly have no clue which ones). Studies that use the PBDB (secondary data studies) are supposed to "register" in order to receive a PBDB publication number, so these can be easily identified and have their total and per year citation counts obtained/calculated via Google Scholar (orange bar below). From those studies, the data-generating studies that were used by the PBDB studies could be identified (nearly 50,000 unique studies from just 151 PBDB studies for which data could be recovered). The total and per year citation counts for those data-generating studies could then be obtained/calculated via Google Scholar (blue bar below). However, we know that Google Scholar (and every other scholarly aggregator) isn't checking things like Supplemental Information for citations, if they exist in the SI to begin with. The number of missed citations (i.e., not tracked by Google Scholar) of data-generating papers could subsequently be calculated from the datasets of those 151 studies (e.g., Smith et al., 2000, was used by 5 PBDB studies) and added to the total citation count (the blue-green bar below). Because not all of the PBDB studies in the focal interval (2001-2021; n=396) had their data recovered, the hidden usage/citation rate of data-generating papers could be extrapolated from the 151 PBDB studies that did have their data recovered, assuming similar usage rates over all 396 studies (the extra blue-green bar below).
These extrapolated numbers result in a mean citation rate that is nearly the same as that of PBDB studies, rather than the presently recorded rate that is only about one-third of that rate.
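The logic of adding and then extrapolating missed citations can be sketched in a few lines of Python. This is only an illustration of the reasoning described above, not the code used in the paper. The function name is hypothetical, and the tracked count of 20 is made up. The 5 uses, the 151 recovered studies, and the 396 total studies are the numbers quoted in the text above.

```python
def add_hidden_citations(tracked, hidden_in_recovered, n_recovered, n_total):
    """Return (with_additions, with_extrapolation) citation counts for one paper.

    tracked:             citations already counted by an indexer (hypothetical input)
    hidden_in_recovered: uses of the paper found in the recovered datasets
    n_recovered:         data-reusing studies whose data could be recovered
    n_total:             all data-reusing studies in the focal interval
    """
    if n_recovered <= 0 or n_total < n_recovered:
        raise ValueError("need 0 < n_recovered <= n_total")
    # Step 1: add the uses that were actually found in the recovered datasets.
    with_additions = tracked + hidden_in_recovered
    # Step 2 (assumption): studies without recovered data used the paper
    # at the same per-study rate as the studies with recovered data.
    per_study_rate = hidden_in_recovered / n_recovered
    with_extrapolation = tracked + per_study_rate * n_total
    return with_additions, with_extrapolation


# A paper with a made-up 20 tracked citations that was used by 5 of the
# 151 recovered PBDB studies, extrapolated to all 396 PBDB studies.
additions, extrapolated = add_hidden_citations(
    tracked=20, hidden_in_recovered=5, n_recovered=151, n_total=396
)
print(additions, round(extrapolated, 1))
```

The extrapolation step is only as good as its assumption of similar usage rates, which is why the figure shows the two adjusted rates separately.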

Figure 2 from the paper: "Citation rates for official Paleobiology Database (PBDB) publications and the data-provisioning publications used in those PBDB publications. Only data-provisioning publications from the same time frame (since 2001) as PBDB publications are included to standardize for temporal effects. Citations to data-provisioning publications (i.e., primary literature) are presented as the current rate (i.e., no additions for neglected citations), the projected rate when including citations from PBDB publications where data were available (k = 112; i.e., additions), and the projected rate when making those additions and extrapolating to the entire set of PBDB publications (k = 396; i.e., additions and extrapolated)."

Predictably, if the true citation count of data-generating papers can be demonstrated to be higher than what is presently tracked across scholarly aggregators like Google Scholar, the journals in which these papers are typically published should also have higher impact factors than what is presently reported, and this is shown below (see the discussion in the paper for more context about different year-intervals in impact factor calculations and other historical conditions for publishing). Collectively, these analyses provide clear evidence that data-generating studies are systemically undercited, which can produce misleading perceptions of interest in and quality of these studies and the journals they are published in (even if that is in large part because we rely too much on flawed metrics to assess these).

Figure 3 from the paper: "The effects of adding neglected citations from data reuse on journal impact factor (JIF; A, B) and general patterns in publishing trends in paleontology (C, D). A, The increase in JIF for the 55 journals categorized to paleontology by Clarivate, for the period of 2010 to 2019. Note, an outlier value of 172% in 2018 for PalZ was not plotted. B, Increases in JIF for the 10 paleontological journals most affected by neglected citations, only including those with complete data for the duration of 2010 to 2019. For raw data for all 55 paleontological journals from 1997 to 2021, see “7_paleo_journal_JIFcalculation.csv” in Smith et al. (2023a). C, The number of citable items published in paleontological journals each year. D, The number of citations to items published in paleontological journals each year."

Don't hate the player, hate the game

A lack of proper citation of data-generating studies by data-reusing studies can and certainly does result from poor research practices around citation (I often see people remove license/copyright and citation information from database downloads in my day job), but even well-intentioned researchers can frequently have their hands tied by various other factors, mainly on the end of the journals that we publish in. For example, these are some policies found across various journals that can lead to a lack of citations of primary data:

Page/text length limits: This is in part a holdover from the days of print-only articles where journals were naturally inclined to avoid accepting excessively long articles, as they would have to print hundreds to thousands more pages for an encapsulating volume. Even though electronic distribution now predominates, and in some cases, is the only medium through which articles are disseminated, the majority of journals retain limits on length, which can be imposed in a variety of ways, such as total page count, number of words, or other caps (e.g., limits on the number of figures). This is particularly common in high-profile journals; the guidelines from Nature are listed below as an example:
- The typical length of a 6-page article with 4 modest display items (figures and tables) is 2500 words (summary paragraph plus body text). The typical length of an 8-page article with 5-6 modest display items is 4300 words. A ‘modest’ display item is one that, with its legend, occupies about a quarter of a page (equivalent to ~270 words). If a composite figure (with several panels) needs to occupy at least half a page in order for all the elements to be visible, the text length may need to be reduced accordingly to accommodate such figures.
Reference limits: 'References' are the full list of scholarly items that an article cites. Some journals make implicit restrictions on references by including them in another limit (e.g., page limit); others will have an express limit. An example from Science, another high-profile journal:
- Research Articles should not exceed 5 printed pages in the journal. This length can accommodate 2000 to 3000 words of main text, in addition to an abstract, 3 to 5 display items (figures or tables) with brief legends, about 50 main-text references, and a structured acknowledgments section.
Restrictions on which references can be included: Numerical limits on content are highly variable across journals, but one attribute that any journal worth its salt will enforce is which references can be included in the Reference / Literature Cited section of the article, which is what is scraped to count citations on scholarly search engines like Google Scholar. All references listed in this section have to be cited in the main text of the article itself, which means that any citations that are only found in the Supplemental Information (alternatively Supplementary/Supporting and Materials) cannot have their full references listed in the formal Reference section. Especially for articles published in journals with page limits, the Supplemental Information is a valuable and often rather lengthy complement to the main text; in journals like Nature and Science, the article itself may only be a few pages, but the associated Supplemental Information can be hundreds of pages long with dozens to hundreds of citations/references that are not found in the main text. An author who wants to properly cite potentially hundreds to thousands of primary data sources would almost certainly have to relegate this information to the Supplemental Information. This matters because...
Scholarly indexers do not scrape Supplemental Information: In order to track citations efficiently, various scholarly indexers (e.g., Google Scholar, Web of Science) crawl the internet and look for matches to a known article. To stay efficient, they only crawl certain types of pages and only certain types of content to avoid picking up casual references, such as a mention on a Wikipedia page (or a blog like this one). However, this means that if a reference is only listed in the Supplemental Information of an article and not in the main text of said article, it will not be "counted" as a citation. Of course, someone could manually search for citations of an article in such supplemental documents if they really wanted a full citation count, but these documents in their entirety are also not typically indexed, so you would have to already have a good idea of which supporting documents potentially cited an article of interest.
- To give an example, we can look at a paper I was on that was published a year ago in Nature (Kligman et al., 2023). The main text, in typeset PDF format, is just 6 pages long; the Supporting Information (SI) text, in Word format, is over 180 pages. There are just 42 references in the main text, compared to a whopping 377 in the SI; of those 377, just 29 are also cited in the main text. In other words, there are nearly 350 unique references that were only cited in the SI (implying that they were useful for some point or another), most of which are descriptive anatomy (i.e., primary data-generating), that are not tracked as having been cited by any scholarly aggregator.
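The arithmetic behind that example is simple enough to check with a short Python sketch. The three counts come straight from the example above. The variable names are mine, and the share at the end rests on the assumption that an indexer picks up every main-text reference and nothing from the SI.

```python
# Reference counts for Kligman et al. (2023), as given in the example above.
main_text_refs = 42   # references listed in the main text
si_refs = 377         # references listed in the Supporting Information
shared_refs = 29      # SI references that are also cited in the main text

# References that appear only in the SI and are therefore invisible to indexers.
si_only_refs = si_refs - shared_refs

# Every distinct work cited anywhere in the paper.
unique_refs = main_text_refs + si_only_refs

# Assumption: indexers count all main-text references and no SI-only ones.
untracked_share = si_only_refs / unique_refs

print(si_only_refs, unique_refs, f"{untracked_share:.1%}")
```

Under that assumption, the large majority of the works that this one paper relied on receive no trackable citation from it.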

A way forward

Many of these hindrances are structural at the level of a journal or publisher, so it may seem that there is little to be done to more accurately and equitably track data reuse. However, there are a variety of actions that authors can take now to try and work around these existing limitations. One example:

Posting Supplemental Information to preprint servers like bioRxiv. Even though preprints are not peer-reviewed, they are indexed by scholarly aggregators, and thus anything that is cited in the preprint will be tracked as a citation. Although preprints by definition refer to a document that is not yet accepted for publication following peer review, preprint servers are also used to host 'post-prints,' or the accepted version of the manuscript. Sharing post-prints is a popular way of ensuring compliance with open-access publishing mandates when the journal in which an article is/will be published is not inherently open-access. Preprint servers are free to use and can be accessed by anyone, so they offer one workaround with the current infrastructure.

More broadly, proper data sharing is a precursor to proper data citation (data/clarification on data could not be obtained for 21% of sampled PBDB publications). While there is some optimism around funder/journal mandates for data sharing, similar mandates for open-access publishing of articles have not necessarily yielded the predicted or desired gains, in large part because the sheer volume and rate of growth of research output vastly exceeds the capacity of agencies to actually assess compliance and enforce policies. I have long been of the opinion that best practices have to be established, maintained, and self-policed by researchers, not by (semi-)legal apparatuses, especially in light of the volatility associated with resourcing for many national funding agencies. In many instances, even when data are shared and thus "check the boxes," they may not be sufficient (or even relevant) for reproducing the results of an associated study, and this too falls on the research community, rather than on external entities or policies, to address; if regulating bodies or publishers cannot even ensure a consistent check for a binary 'data shared/data not shared,' they certainly will not be able to assess whether the shared data are sufficient for reproducibility. Promoting healthy data sharing practices, including by according proper credit to data generators as a means of incentivizing data sharing, remains an essential part of a sustainable research ecosystem.

Finally, the biggest challenges lie in advocating for structural changes. For example, as research output rises, the incentives to perform peer review, which is practically never compensated or rewarded in any form, decline. Peer review is the primary mechanism that should be responsible for ensuring the rigor of a study, which includes proper data sharing and data citation, but there is not much incentive for reviewers to even check supplemental information files if doing so would require substantial time investment. Other target areas include advocating for changes in journal policies so that citations can be provided in a format that is discoverable and indexed by the aggregators that we all rely on to track citations (arguably the most crucial change), and for alternative means of evaluating research quality and performance. A few journals (e.g., Global Ecology & Biogeography) do allow for references cited only in Supplemental Information to be listed in the main-text References so that they are picked up by indexers. As journal editorial boards are made up of normal (well, somewhat normal) researchers, they also have some ability to direct journal policies in a variety of ways, from formatting prescriptions to journal scope. Development and adoption of other metrics of tracking and recognizing scholarly contributions in merit-based processes, in which "normal" researchers are also directly involved, will also be critical.


Temno Talk: a blog about all things temnospondyl