Why Google Scholar is now getting in the way

Once upon a time, Google Scholar was a godsend to the scientific community. When I started using it (around 2007, during its “beta” phase), the dominant search platform for scholarly articles was probably Web of Science, which only had access to basic article metadata, not the full text of the articles. Google Scholar was free, lightning-fast, and capable of full-text search. It really enabled scientists to “stand on the shoulders of giants,” such a fitting slogan, I thought.

Fast forward to 2022, and what was once a boon is now becoming a bane. I think Google Scholar is the reason we have barely seen any progress in bibliographic databases and academic search engines. It’s still the best thing out there, good enough to prevent any competitors from gaining traction. But anyone can easily sense that its developers are satisfied with maintaining the status quo. I can imagine why. Unless you have spent quite some time using Google Scholar like your career depends on it, it looks, on the surface, like a well-functioning and satisfactory service. After all, Google Scholar developers probably don’t eat their own dog food, so the gravity of these problems (or even their existence) might be perceived only by a subset of academics.

Google Scholar doesn’t provide related-term searching

With all the user data Google Scholar has access to, you would imagine it would be straightforward to know how certain keywords are related, or even intended to mean the same thing. But Google Scholar works very poorly in this regard.

In materials science and chemistry, we often need to search for papers related to a certain compound or material composition that is spelled in several different ways. For example, NMC (also commonly abbreviated as NCM) is one of the most widely deployed Li-ion battery positive electrode materials. Googling its chemical formula is a nightmare. The formula for NMC111 (which could also be NCM111, NMC333, or NCM333) is LixNi1/3Mn1/3Co1/3O2, but it could also be LiNi1/3Mn1/3Co1/3O2 (1 for Li instead of x). Co and Mn can switch order depending on the author. 1/3 could be denoted as 0.33 or 0.333. Some people like to use parentheses for the transition metals, as in Li(Ni1/3Mn1/3Co1/3)O2, occasionally with commas between the transition metals. These combinations alone give 2 x 2 x 3 x 3 = 36 different ways of expressing the same chemical formula (two choices for Li, two for the Co/Mn order, three for the fraction, and three for the grouping). How does Google fare here?
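To make that count concrete, here is a minimal Python sketch that enumerates the notations listed above. The function name and the exact spellings it generates are my own illustration of the four choices; real papers will contain further variations.

```python
from itertools import product

def nmc111_variants():
    """Enumerate the NMC111 spellings described in the text (illustrative only)."""
    li_options = ["Lix", "Li"]                               # x vs. 1 for lithium
    metal_orders = [("Ni", "Mn", "Co"), ("Ni", "Co", "Mn")]  # Mn and Co may swap
    fractions = ["1/3", "0.33", "0.333"]                     # ways of writing 1/3
    groupings = ["plain", "parens", "parens_commas"]         # bracket styles

    variants = []
    for li, metals, frac, grouping in product(
        li_options, metal_orders, fractions, groupings
    ):
        terms = [f"{metal}{frac}" for metal in metals]
        if grouping == "plain":
            body = "".join(terms)
        elif grouping == "parens":
            body = "(" + "".join(terms) + ")"
        else:
            body = "(" + ",".join(terms) + ")"
        variants.append(f"{li}{body}O2")
    return variants

variants = nmc111_variants()
print(len(variants))  # 36
for formula in variants[:3]:
    print(formula)
```

Each of these 36 strings would have to be typed in separately to catch every paper that uses it.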

Google Scholar is completely ignorant of chemical formulae, so you will only get results when a paper writes the formula exactly the way you typed it (see Figure). It gets worse because these formulae often pick up spaces between characters when parsed into plain text. Google doesn’t suggest any related keywords either; if you search only for “NMC,” it doesn’t even suggest “NCM” as a related or alternative search.

This problem could be tackled at many different levels. Even for the simplest solution, which is to just add some chemical formula-cognizant tools without touching the existing search engine, I would pay a subscription fee to have it (I know this is not Google’s business model. More on that below). There is so much more that could be done.
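As a rough idea of what the simplest kind of formula-cognizant tool could look like, the Python sketch below maps different spellings of a formula to one canonical key. Everything in it is an assumption for illustration: the helper name `canonical_formula`, treating “x” as 1, and snapping decimals to the nearest fraction with a denominator of at most 10. It also ignores subscripts after parentheses, so it is nowhere near a full chemical parser.

```python
import re
from fractions import Fraction

# An element symbol followed by an optional amount: "x", a fraction, or a number.
TOKEN = re.compile(r"([A-Z][a-z]?)(x|\d+/\d+|\d*\.\d+|\d+)?")

def canonical_formula(text):
    """Return a notation-independent key for a simple formula (hypothetical helper)."""
    # Drop whitespace, parentheses, and commas; this is only valid when no
    # subscript follows a closing parenthesis.
    cleaned = re.sub(r"[\s(),]", "", text)
    amounts = {}
    for symbol, raw in TOKEN.findall(cleaned):
        if raw in ("", "x"):
            value = Fraction(1)  # assumption: a missing or variable amount counts as 1
        else:
            # Snap decimals such as 0.33 or 0.333 to a nearby simple fraction.
            value = Fraction(raw).limit_denominator(10)
        amounts[symbol] = amounts.get(symbol, Fraction(0)) + value
    # Sort by element so that the Mn/Co order no longer matters.
    return "|".join(f"{element}{amounts[element]}" for element in sorted(amounts))

spellings = [
    "LixNi1/3Mn1/3Co1/3O2",
    "Li(Ni0.33,Co0.33,Mn0.33)O2",
    "Li Ni 0.333 Mn 0.333 Co 0.333 O 2",
]
for spelling in spellings:
    print(canonical_formula(spelling))
```

All three spellings reduce to the same key, which a search front end could then use to group or expand queries.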

Google Scholar: same material different results
Figure: Googling papers about NMC using various chemical notations. All three notations refer to the same material but completely different results are returned.

Google Scholar reinforces the citation-based bias in the literature

Google Scholar uses citation information to quantify the connectivity between papers. This connectivity determines the importance (as viewed from the search engine) of a paper, making it more likely to appear earlier in search results. In simple words, highly cited papers are favored in search results.

Citation-based judgement of papers is already a serious issue plaguing academia. Google Scholar is reinforcing that trend because citation-based weighting has a “the rich get richer and the poor get poorer” effect (another example of autocatalytic phenomena). Viral publications are favored in this scheme, which incentivizes researchers to do controversial or trendy research as opposed to rigorous or basic research.
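The rich-get-richer effect can be illustrated with a toy simulation. The Python sketch below is not Google Scholar’s ranking algorithm, which is not public; it simply assumes that each new paper cites one earlier paper, chosen with probability proportional to that paper’s existing citations plus one. The function names, the number of papers, and the random seed are arbitrary choices of mine.

```python
import random

def simulate_citations(n_papers=2000, seed=0):
    """Toy model: each new paper cites one earlier paper, favoring cited ones."""
    rng = random.Random(seed)
    citations = [0]  # paper 0 starts with no citations
    tickets = [0]    # one ticket per paper, plus one per citation it has received
    for new_paper in range(1, n_papers):
        cited = rng.choice(tickets)  # chance proportional to citations + 1
        citations[cited] += 1
        tickets.append(cited)        # the cited paper becomes more likely to be picked
        citations.append(0)
        tickets.append(new_paper)
    return citations

def top_share(citations, fraction=0.1):
    """Share of all citations held by the most-cited `fraction` of papers."""
    ranked = sorted(citations, reverse=True)
    top_n = max(1, int(len(ranked) * fraction))
    return sum(ranked[:top_n]) / sum(ranked)

citations = simulate_citations()
print(f"Top 10% of papers hold {top_share(citations):.0%} of all citations")
```

Even in this crude model, a small fraction of papers ends up with a disproportionate share of the citations, purely because early visibility compounds.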

Google Scholar has no obligations to customers because it’s “free”

Google Scholar being free was such a cool thing, especially when many scholars felt that Web of Science was so expensive and borderline predatory. However, we are now learning the lesson once again that nothing is really free after all.

When Google Scholar inadvertently discredits your work by not properly parsing it, there is no way to actively resolve the issue other than trying to make it as Google-compatible as possible. You can only hope that the web crawler one day figures it out and updates the database. Below are two of my personal examples, but this is a general problem especially damaging for early-career scientists.

One of my earlier papers was uploaded to arXiv prior to journal publication. The problem is, even after journal publication, Google Scholar somehow decided that my published paper was an unimportant duplicate of the arXiv version. So, in a Google Scholar search, the arXiv version would appear rather than the published version. This issue continued for over two years despite my doing everything I could. I put the published paper’s DOI on arXiv and also contacted Google Scholar support (who never replied). Apparently, it was a glitch that happened only to a subset of arXiv preprints. It was a great loss in visibility, leaving a sour taste especially because the paper was published in quite a prestigious journal.

Another example is my PhD thesis. Although I wrote my thesis using LaTeX and a widely used template, following all the Google Scholar guidelines for successful indexing, Google Scholar somehow failed to parse it. None of the citations I made in my thesis appear in Google Scholar. What is more puzzling is that papers that cited my thesis don’t appear either. This is despite Google Scholar having fetched the metadata of my thesis via my institution’s library server. After more than 4 years, this problem is still unresolved. I have no channel to report this problem and learn why it is happening. I don’t even know if it’s a problem with my thesis somehow, or if it’s another glitch on Google’s side.

These issues could be viewed as small user issues, but they point to a more fundamental problem: Google Scholar is a free service. Google has no responsibility to improve the service because nobody is explicitly paying for it. For example, we cannot demand, as customers, that Google implement chemical formula-cognizant searching. It is up to the user to find workarounds for missing features. What if one day Google decides to pull the plug? (Check out Google Cemetery.)

Google Scholar still does not recognize DOIs

Although the DOI (digital object identifier) has become the de facto standard for online publishing, Google Scholar seems to completely ignore this information. For example, if you cite a paper through its DOI but without full bibliographic info (which often happens when a paper is cited while it is online in a “just accepted” form, before official inclusion in an issue), then Google Scholar will fail to associate that citation with the published paper. This behavior is astonishing because DOIs allow unambiguous identification of a publication.
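Matching a citation by its DOI is not hard in principle. The Python sketch below pulls a DOI out of a reference string and looks it up in a small table of published records. The regular expression is a commonly used approximation for modern DOIs rather than a complete grammar, and the DOI, the reference text, and the lookup table are made-up placeholders.

```python
import re

# Approximate pattern for modern DOIs: "10.", a registrant code, "/", and a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")

def extract_doi(reference):
    """Return the first DOI found in a reference string, lowercased, or None."""
    match = DOI_PATTERN.search(reference)
    if match is None:
        return None
    # Strip trailing sentence punctuation; DOIs are case-insensitive, so lowercase.
    return match.group(0).rstrip(".,;").lower()

# Placeholder data: a published record keyed by DOI, and a "just accepted"
# citation that carries the DOI but no volume or page numbers.
published = {"10.5555/abc.123": "Example published record"}
reference = "A. Author, Journal of Examples, just accepted. https://doi.org/10.5555/ABC.123."

doi = extract_doi(reference)
print(doi, "->", published.get(doi, "no match"))
```

Because the DOI alone identifies the paper, the missing volume and page numbers do not prevent the citation from being linked to the published record.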

Where is the competition?

My best guess is that, as of today, academic search engines do not make much sense from a business perspective. However, recall that enormous funds go into the publication process of papers. Publishing giants like Elsevier make billions in profit annually with extraordinary profit margins, most of which can be traced back to taxpayers’ money. Academics are publishing so many papers (more than we should, in my opinion) that it is becoming increasingly overwhelming to find relevant and valuable information. Given this context, shouldn’t it be the highest priority to use public funds to make information more searchable, rather than to multiply the apparent quantity of information?

We should be taking at least a portion of those wasted funds and using them to create a better search platform and bibliographic database. If you have a good idea of how to make this happen, please let me know!

Last modified: Jul. 19th, 2022

Photo Credit: Rabie Madaci