Biblint: A utility to clean BibTeX files

BibTeX, combined with LaTeX, is a very good system for managing references when writing papers. It’s widely supported and very flexible, and it relieves the user from much of the tedium of getting citations correct.

However, it has a lot of rules and quirks that can make it difficult to produce a correct BibTeX entry. For example, many BibTeX style files lowercase titles, which can lead to words like “mRNA” being converted to “mrna”. BibTeX also has several different ways of expressing author names, some of which are unintuitive. These and other features often lead to mistakes in formatting references. Manually checking and fixing these problems can be time-consuming, boring, and itself error-prone.

To partially solve this problem, I wrote a utility called biblint that aims to automatically correct as many common BibTeX errors as possible, and to format BibTeX entries in a consistent way.

biblint is a command-line tool with three subcommands:

  • biblint clean in.bib > out.bib will produce a new .bib file with many common mistakes fixed.
  • biblint check in.bib will report on other errors that the clean command can’t automatically fix.
  • biblint dups in.bib tries to identify and report duplicate entries.

This software has a distinct viewpoint: .bib files processed by it should be used for reference information only. That means that extraneous information is filtered out by clean. For example, clean removes abstract fields (and in fact removes any non-standard field). The reason is that there are better places than BibTeX entries to store notes and other information about papers, and these fields clutter the .bib file. This viewpoint means that biblint is not for everyone. It also means that it typically removes data from an input .bib file, so be sure to keep your original .bib file around.

For the fields it keeps, biblint tries hard to preserve what it thinks is the intended meaning of the data. However, since much of what it has to do is interpret a combined “language” consisting of BibTeX mixed with a natural language like English, biblint can sometimes get confused or miss corrections that should be made. For example, while biblint will correctly ensure that the word “DNA” in a title is inside braces (to avoid it being converted to lowercase by a style file), it can’t do the same for “Hi-C”, since it can’t distinguish that from “Good-Natured” or other hyphenated phrases.
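To make the “DNA” versus “Hi-C” distinction concrete, here is a minimal sketch in Go of one possible rule: wrap a word in braces only if every letter in it is uppercase. This is not biblint's actual code; the function names isAcronym and protectAcronyms are hypothetical, and the rule is deliberately simpler than what a real tool would need (it ignores mixed-case words like “mRNA” and collapses runs of whitespace).

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isAcronym reports whether w consists only of uppercase letters and
// digits and contains at least two uppercase letters, e.g. "DNA".
// (Hypothetical helper, not part of biblint.)
func isAcronym(w string) bool {
	letters := 0
	for _, r := range w {
		switch {
		case unicode.IsUpper(r):
			letters++
		case unicode.IsDigit(r):
			// Digits are allowed but do not count as letters.
		default:
			return false
		}
	}
	return letters >= 2
}

// protectAcronyms wraps all-caps words in braces so that a BibTeX style
// file cannot lowercase them. Trailing punctuation stays outside the braces.
func protectAcronyms(title string) string {
	words := strings.Fields(title)
	for i, w := range words {
		core := strings.TrimRight(w, ".,;:!?")
		if isAcronym(core) {
			words[i] = "{" + core + "}" + w[len(core):]
		}
	}
	return strings.Join(words, " ")
}

func main() {
	fmt.Println(protectAcronyms("Sequencing DNA with Hi-C data"))
	// Prints: Sequencing {DNA} with Hi-C data
}
```

Under this rule “Hi-C” is left alone because it contains a lowercase letter. Loosening the rule to catch it would also catch “Good-Natured”, which is exactly the ambiguity described above.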

You can see in the README what kinds of transformations biblint undertakes. In summary, it attempts to:

  • make sure that titles are coded to be correctly capitalized
  • consistently format authors (always using the “von Lastname, First Middle” format, changing et al. to “and others”)
  • fix a number of formatting inconsistencies (“.” at the end of titles, extra spaces)
  • output .bib files in a consistent format (always using {}, ordering entries and fields consistently, putting @string and @preamble definitions at the top)
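As an illustration of the author formatting described in the list above, the following Go sketch rewrites “First Middle von Lastname” names into the “von Lastname, First Middle” form and turns a trailing “et al.” into “and others”. It is an assumption-laden simplification, not biblint's implementation: normalizeName and normalizeAuthors are hypothetical names, a “von” particle is detected only by a lowercase first letter, and names containing braces or “Jr” parts are not handled.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// normalizeName converts "First Middle von Last" to "von Last, First Middle".
// Names that already contain a comma are returned unchanged.
// (Hypothetical helper, not part of biblint.)
func normalizeName(name string) string {
	name = strings.TrimSpace(name)
	if strings.Contains(name, ",") {
		return name
	}
	words := strings.Fields(name)
	if len(words) < 2 {
		return name // single word, e.g. "others"
	}
	// The last-name part starts at the first lowercase-initial word
	// (a "von" particle) or, failing that, at the final word.
	start := len(words) - 1
	for i, w := range words[:len(words)-1] {
		if unicode.IsLower([]rune(w)[0]) {
			start = i
			break
		}
	}
	if start == 0 {
		return strings.Join(words, " ") // no first name, e.g. "van Gogh"
	}
	return strings.Join(words[start:], " ") + ", " + strings.Join(words[:start], " ")
}

// normalizeAuthors normalizes every name in a BibTeX author field.
// Splitting on " and " ignores braces, so a braced name containing
// " and " would be split incorrectly by this sketch.
func normalizeAuthors(field string) string {
	field = strings.TrimSpace(field)
	for _, suffix := range []string{" et al.", " et al"} {
		if strings.HasSuffix(field, suffix) {
			field = strings.TrimSuffix(field, suffix) + " and others"
			break
		}
	}
	parts := strings.Split(field, " and ")
	for i, p := range parts {
		parts[i] = normalizeName(p)
	}
	return strings.Join(parts, " and ")
}

func main() {
	fmt.Println(normalizeAuthors("Donald E. Knuth and Ludwig van Beethoven et al."))
	// Prints: Knuth, Donald E. and van Beethoven, Ludwig and others
}
```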

It does several other transformations as well. See the complete list here.

We have used biblint in our group for a little while, and it seems to be useful without too many bugs, but it is clearly alpha software, since it hasn’t been widely tested at this point. Please submit bug reports on GitHub if something doesn’t seem to work right.

One feature of BibTeX that biblint does not support at present is the # concatenation operator. This is unfortunate, since it is a valid part of the BibTeX language. I may add support for it in the future, but right now it is low priority, since it seems like it will require a lot of changes to the parser and the entry transformation code.
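For readers unfamiliar with the operator: in BibTeX, # joins quoted strings, braced strings, numbers, and @string macros into a single field value. The Go sketch below shows what evaluating such a value involves. It is only an illustration of the operator's meaning, not a design for biblint; evalConcat and splitTopLevel are hypothetical names, and the sketch assumes macro names are matched case-insensitively.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTopLevel splits expr on '#' characters that are outside both
// braces and double quotes.
func splitTopLevel(expr string) []string {
	var pieces []string
	depth, inQuote, start := 0, false, 0
	for i := 0; i < len(expr); i++ {
		switch c := expr[i]; {
		case c == '{':
			depth++
		case c == '}':
			depth--
		case c == '"' && depth == 0:
			inQuote = !inQuote
		case c == '#' && depth == 0 && !inQuote:
			pieces = append(pieces, expr[start:i])
			start = i + 1
		}
	}
	return append(pieces, expr[start:])
}

// evalConcat evaluates a field value such as `jan # "~1, " # {2020}`,
// looking up bare words in macros (the @string definitions).
func evalConcat(expr string, macros map[string]string) (string, error) {
	var out strings.Builder
	for _, piece := range splitTopLevel(expr) {
		piece = strings.TrimSpace(piece)
		n := len(piece)
		switch {
		case n >= 2 && piece[0] == '"' && piece[n-1] == '"',
			n >= 2 && piece[0] == '{' && piece[n-1] == '}':
			out.WriteString(piece[1 : n-1]) // strip the delimiters
		case n > 0 && strings.Trim(piece, "0123456789") == "":
			out.WriteString(piece) // bare number
		default:
			v, ok := macros[strings.ToLower(piece)]
			if !ok {
				return "", fmt.Errorf("undefined macro %q", piece)
			}
			out.WriteString(v)
		}
	}
	return out.String(), nil
}

func main() {
	macros := map[string]string{"jan": "January"}
	s, err := evalConcat(`jan # "~1, " # {2020}`, macros)
	fmt.Println(s, err)
	// Prints: January~1, 2020 <nil>
}
```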

A number of tests to validate the code are included (runnable on Mac or Linux). Use the biblint/test.sh script to run these tests. Another future goal is to add more unit tests to check individual functions, but this is something that will require more time to complete.

biblint is written in Go, which is an excellent language for this type of project: it’s expressive, easy to write, and has a standard library that is comprehensive enough that we can avoid any other dependencies. biblint can be obtained via GitHub, and it is distributed under a BSD license (see LICENCE.txt).

This project also allowed me to play around with writing a parser in Go, which was a fun little goal. I used Thorsten Ball’s book “Writing an Interpreter in Go” as a reference for that.

So, check biblint out, and let me know if you find biblint useful, encounter any bugs, or have any feature requests!