Biblint: A utility to clean BibTeX files

BibTeX, combined with LaTeX, is a very good system for managing references when writing papers. It’s widely supported and very flexible, and it relieves the user from much of the tedium of getting citations correct.

However, it has a lot of rules and quirks that can make it difficult to produce a correct BibTeX entry. For example, many BibTeX style files lowercase titles, which can lead to words like “mRNA” being converted to “mrna”. BibTeX has several different ways of expressing author names, some of which are unintuitive. These and other features often lead to mistakes in formatting references. Manually checking and fixing these problems can be time-consuming, boring, and itself error-prone.

To partially solve this problem, I wrote a utility called biblint that aims to automatically correct as many common BibTeX errors as possible, and to format BibTeX entries in a consistent way.

biblint is a command line tool, with 3 subcommands:

  • biblint clean in.bib > out.bib will produce a new .bib file with many common mistakes fixed.
  • biblint check in.bib will report on other errors that the clean command can’t automatically fix.
  • biblint dups in.bib tries to identify and report duplicate entries.

This software has a distinct viewpoint: .bib files processed by it should be used for reference information only. That means that extraneous information is filtered out by clean. For example, clean removes abstract fields (and in fact removes any non-standard field). The reason for this is that there are better ways to store notes and information about papers than BibTeX entries and the presence of these fields clutters the .bib file. This viewpoint means that biblint is not for everyone. It also means that it typically removes data from an input .bib file, so you’ll want to be sure to keep your original .bib file around.

For the fields it keeps, biblint tries hard to preserve what it thinks is the intended meaning of the data. However, since much of what it has to do is interpret a combined “language” consisting of BibTeX mixed with a natural language like English, biblint can sometimes get confused or miss corrections that should be made. For example, while biblint will correctly ensure that the word “DNA” in a title is inside braces (to avoid it being converted to lowercase by a style file), it can’t do the same for “Hi-C”, since it can’t distinguish that from “Good-Natured” or other hyphenated phrases.
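To make the brace-protection idea concrete, here is a minimal sketch in Go. It is not biblint's actual code: the function names `protectAcronyms` and `isAcronym` are hypothetical, and the rule (wrap any whitespace-separated word made up entirely of two or more uppercase letters) is a simplifying assumption. It deliberately leaves “Hi-C” alone, and it would also miss “mRNA” or an acronym followed by punctuation.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// isAcronym reports whether w consists only of uppercase letters
// and has at least two of them (hypothetical rule for this sketch).
func isAcronym(w string) bool {
	count := 0
	for _, r := range w {
		if !unicode.IsUpper(r) {
			return false
		}
		count++
	}
	return count >= 2
}

// protectAcronyms wraps every acronym in the title in braces so that a
// BibTeX style file will not lowercase it. Runs of whitespace are
// collapsed to single spaces as a side effect of strings.Fields.
func protectAcronyms(title string) string {
	words := strings.Fields(title)
	for i, w := range words {
		if isAcronym(w) {
			words[i] = "{" + w + "}"
		}
	}
	return strings.Join(words, " ")
}

func main() {
	fmt.Println(protectAcronyms("Sequencing DNA with Hi-C"))
}
```

A word that is already braced, such as “{DNA}”, is left unchanged because the brace is not an uppercase letter.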

You can see in the README what kinds of transformations biblint undertakes. In summary, it attempts to:

  • make sure that titles are coded to be correctly capitalized
  • consistently format authors (always using the “von Lastname, First Middle” format, changing et al. to “and others”)
  • fix a number of formatting inconsistencies (“.” at the end of titles, extra spaces)
  • output .bib files in a consistent format (always using {}, ordering entries and fields consistently, putting @string and @preamble definitions at the top)

It does several other transformations as well. See the complete list here.

We have used biblint in our group for a little while, and it seems to be useful without too many bugs, but it is clearly alpha software, since it hasn’t been widely tested at this point. Please submit bug reports on GitHub if something doesn’t seem to work right.

One feature of BibTeX that biblint does not support at present is the # concatenation operator. This is unfortunate, since it is a valid part of the BibTeX language. I may add support for it in the future, but right now it is low priority — it seems like it will require a lot of changes to the parser and the entry transformation code.

A number of tests to validate the code are included (runnable on Mac or Linux). Use the biblint/test.sh script to run these tests. Another future goal is to add more unit tests to check individual functions, but this is something that will require more time to complete.

biblint is written in Go, which is an excellent language for this type of project: it’s expressive, easy to write, and has a standard library that is comprehensive enough that we can avoid any other dependencies. biblint can be obtained via GitHub, and it is distributed under a BSD license (see LICENCE.txt).

This project also allowed me to play around with writing a parser in Go, which was a fun little goal. I used Thorsten Ball’s book “Writing an interpreter in Go” as a reference for that.

So, check biblint out, and let me know if you find biblint useful, encounter any bugs, or have any feature requests!

Go 2.0 for Teaching

Carl Kingsford and Phillip Compeau

In the Computational Biology Department in the School of Computer Science at Carnegie Mellon, we have piloted using Go as an introductory programming language in our course 02-201: Programming for Scientists. We chose Go for a number of reasons. Its well-defined syntax, similarity to C, Java, and related languages, its web-based playgrounds, its cross-platform compiler, and its easy-to-use build environment all make it an attractive language for an introductory course. It is also useful that Go has explicit pointers, which allow that important concept to be introduced, and the treatment of pointers is very nice — students can quickly learn to use pointers, understand them, and then mostly not worry about them. Go’s built-in parallelism allows us to get even novice programmers to experiment with parallel programming a bit by the end of the course. Students who take the course largely appreciate Go.  Nevertheless, there are some aspects of Go that could be improved to increase its suitability as an introductory programming language.

Why should Go care about teachability? There are two reasons. The first is natural selection: languages that are easier to learn have more programmers and often last longer. The second is Richard Feynman’s classic proverb that if one can’t reduce something “to the freshman level. That means we really don’t understand it.” A Go that maximizes teachability likely has had its rough edges well sanded. Two recent changes in Go were steps in this direction: a default GOPATH value, which eliminates a setup step that occurs early in the class and trips up some novices, and setting GOMAXPROCS to the number of processors by default, which allows parallel programs to actually run in parallel without an additional “make it work” step.

Challenges Using Go for Teaching

Below are a few of the main sticking points we’ve encountered that confuse students or make explaining Go more difficult than it needs to be. Most of these aspects of Go make sense when considering Go as a production language — generally, these are not “flaws” in the language. But they are suboptimal features when employing Go as an instructional language.

1. The “package main” statement. The first program a student writes is some kind of “hello, world” program. The first instruction of that program in Go is `package main`, which forces a teacher going through a short example program to open with “don’t worry about packages, we’ll cover them later.” Starting your explanation with an unexplained mystery is immediately off-putting. Of course, a short diversion into the value of packages and what they do could be undertaken, but this would fall on deaf ears: packages solve a problem that a true beginner doesn’t yet know exists.

2. The term “slice”. Slices are so fundamental in Go that, for interesting assignments, it makes sense to introduce slices early on. However, the name “slice” raises the question: slice of what? That necessitates introducing arrays, which in our experience are rarely used directly in Go. Ideally, slices would be introduced as an abstract list-like data structure, and the implementation detail that they are based on arrays would be covered later. The name “slice” is the only thing that currently prevents following that order cleanly. An instructor is forced either to drop down one level of abstraction to explain how slices are backed by arrays or to say something like “the name slice will make sense eventually.”

3. int vs. int32 vs. int64. The `int` type in Go is strange. It is an integer of undetermined size (depending on the machine) and not the same type as either int32 or int64. When first starting programming, it does not make sense to delve immediately into “bits” or word sizes, so naturally it makes sense to introduce the `int` type in isolation without mentioning `int32`, etc. However, variables of type intXX creep into student code (usually due to the use of a standard library routine) and then confusion arises since `int` and `int64` (or int32) are not the same type.
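As a small illustration of how an intXX value can creep in, the sketch below uses `strconv.ParseInt`, which returns an `int64`. The function name `parseSteps` is ours, chosen for the example.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseSteps parses a base-10 integer and returns it as an int.
func parseSteps(s string) (int, error) {
	n, err := strconv.ParseInt(s, 10, 64) // n has type int64, not int
	if err != nil {
		return 0, err
	}
	// "return n, nil" would not compile here: int64 and int are
	// distinct types, so an explicit conversion is required.
	return int(n), nil
}

func main() {
	steps, err := parseSteps("42")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(steps)
}
```

A student who has only seen `int` must now learn about sized integer types just to make this function compile.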

4. Easy number-string conversions. Early on, we want students to write programs that take user input. Often this input is in the form of numbers (e.g. how many steps of a random walk to take) on the command line. This requires the introduction of the strconv package and the Atoi / Itoa functions. We C programmers from 20 years ago appreciate seeing these functions live on (though the “A” in Atoi is somewhat unintuitive to new programmers), and crucially they allow instructors to talk about the difference between a string representing “42” and the integer 42. But that topic is prematurely forced since `int("42")` doesn’t work, while string([]byte), float64(int), float32(float64(x)), and rune(int64(x)) all work fine. Note that `float64(int8)` may allocate new memory, `float32(float64(x))` may lose data, and `string([]byte)` fundamentally changes how a language construct (`for … range`) operates on the data, so data loss, memory allocation, and language semantics are not inviolate reasons to disallow `int(string)` or `string(int)`. For the latter case, one might object that strings are large data structures, and the compiler can’t just create them without some explicit memory allocation like `new` or `make`. But this is not generally true in Go: `f(a)` where f expects an [x]int array type, and `s1 + s2` where s1 and s2 are strings both create new strings or arrays from existing data “behind the scenes”. In addition, the type assertion `x.(t)` may panic, so catastrophic failure of the conversion is not disqualifying in Go.
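For reference, this is roughly what a student has to write today to read a step count from the command line. It is a sketch only: the helper name `stepsFromArgs` and the usage message are made up for the example.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// stepsFromArgs reads the number of random-walk steps from the first
// command-line argument. args must not include the program name.
func stepsFromArgs(args []string) (int, error) {
	if len(args) < 1 {
		return 0, fmt.Errorf("usage: walk STEPS")
	}
	// int(args[0]) is not allowed in Go; the string must be parsed.
	steps, err := strconv.Atoi(args[0])
	if err != nil {
		return 0, fmt.Errorf("STEPS must be an integer: %v", err)
	}
	return steps, nil
}

func main() {
	steps, err := stepsFromArgs(os.Args[1:])
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("steps:", steps)
}
```

Before the student can run a single simulation, the program already involves a second package, a two-value return, and error handling.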

5. Lack of built-in precondition, postcondition, assertion and loop_invariant mechanisms. When getting students to think about breaking a big problem down into smaller steps, programmatically-checked preconditions and postconditions for functions are helpful. Several other courses at CMU use a variant of C called C0 that includes these features explicitly. Preconditions and assertions can be provided via a library. Postconditions are harder. Go’s defer mechanism is tantalizingly close; assuming assert() were defined:

defer func() {assert(final_value < 10)}()

provides a way to check postconditions. This has two problems: first, it’s long-winded. Second, it requires any variables involved in the postcondition to be declared before the defer statement executes, and named return values must be used if they are to be referred to. This is sometimes limiting, since postconditions often involve variables created during the course of the function (though C0 has the same limitations on its postconditions).
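A complete version of this pattern might look like the sketch below. The `assert` helper and the `clampToDigit` function are hypothetical and exist only to show the mechanics. Note that the result has to be a named return value so that the deferred closure can see it.

```go
package main

import "fmt"

// assert is a hypothetical helper; Go has no built-in assertion.
func assert(cond bool, msg string) {
	if !cond {
		panic("assertion failed: " + msg)
	}
}

// clampToDigit returns n limited to the range [0, 9].
func clampToDigit(n int) (result int) {
	// Postcondition: checked when the function returns, after
	// result has been set by the return statement.
	defer func() { assert(result >= 0 && result < 10, "0 <= result < 10") }()

	if n < 0 {
		return 0
	}
	if n > 9 {
		return 9
	}
	return n
}

func main() {
	fmt.Println(clampToDigit(42))
}
```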

6. Assignment to fields in a map of structs. Currently, the following code is illegal in Go:

type point struct {
  x,y int
}

m := make(map[int]point)
m[0] = point{}
m[0].x = 10 // *** illegal

This gives the error “cannot assign to struct field m[0].x in map”. Maps of structs arise when teaching using Go because we’d like to be able to have assignments that involve maps and data elements (like points) before teaching pointers. The strangeness above impedes this. Why? Non-addressability prohibits &m[0], but there is no reason that `m[0].x = 10` can’t desugar into

tmp := m[0]
tmp.x = 10
m[0] = tmp

This is the current workaround for the disallowed m[0].x=y, but it creates unnecessary ugliness in this common case.
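Packaged as a runnable program, the workaround looks like this. The helper name `setX` is ours, introduced only for illustration.

```go
package main

import "fmt"

type point struct {
	x, y int
}

// setX sets the x field of the point stored under key, using the
// copy-modify-store workaround because m[key].x = x is not allowed.
func setX(m map[int]point, key, x int) {
	tmp := m[key] // copy the struct out of the map
	tmp.x = x     // modify the copy
	m[key] = tmp  // store the modified copy back
}

func main() {
	m := make(map[int]point)
	m[0] = point{}
	setX(m, 0, 10)
	fmt.Println(m[0].x)
}
```

The other common way around the restriction is to store pointers, as in `map[int]*point`, but that is exactly the topic we would like to postpone.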

7. Serialization of random number generation. A natural assignment when introducing parallelism is a Monte Carlo simulation of some sort, where each goroutine does one trial, making a series of calls to obtain a random number from “math/rand”. When using the convenience functions, the calls to (e.g.) rand.Int() are serialized because they access the same source of randomness. So each goroutine needs to create its own source of randomness using:

r := rand.New(rand.NewSource(time.Now().UnixNano()))

which is somewhat unwieldy.
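Here is a sketch of the kind of assignment we have in mind: estimating pi by throwing random darts, with one generator per goroutine. The function name `estimatePi` and the choice to derive each goroutine's seed from a base seed plus its index are assumptions made for the example.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// estimatePi runs workers goroutines, each throwing trials random darts
// at the unit square with its own generator, and returns the estimate.
func estimatePi(workers, trials int, seed int64) float64 {
	hits := make([]int, workers) // one counter per goroutine
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			// Each goroutine owns its generator, so its calls are
			// not serialized behind a shared source.
			r := rand.New(rand.NewSource(seed + int64(w)))
			for i := 0; i < trials; i++ {
				x, y := r.Float64(), r.Float64()
				if x*x+y*y < 1 {
					hits[w]++
				}
			}
		}(w)
	}
	wg.Wait()

	total := 0
	for _, h := range hits {
		total += h
	}
	return 4 * float64(total) / float64(workers*trials)
}

func main() {
	fmt.Println(estimatePi(4, 100000, time.Now().UnixNano()))
}
```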

8. Linguistically, the statement `for x < max_x` is less than ideally clear. `For` is being used here in its C-derived meaning, not its English meaning. A small point, but when students are trying to grapple with flow control for the first time, it’s not clear that `for` introduces a loop, or what the statement `for x < max_x` means when read aloud. (A similar problem exists with the assignment operator “=” vs. “==”: to a novice `x = 10; y = 20; x = y` seems like a set of contradictory statements, and `if x = 30` seems completely reasonable. The use of = in this counterintuitive way is so ingrained now in programming that we have to live with it; we tell students that “equals-equals equals equals”). While it’s nice in some ways that there is only a single type of loop in Go, it would be better for teaching if the syntax more closely matched the pseudocode that is introduced at the start of a programming course. Pseudocode invariably uses the term “while”, providing evidence that this term more closely matches the concept.
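A short example of a condition-only loop, written the way Go requires today. The function `countHalvings` is made up for the illustration.

```go
package main

import "fmt"

// countHalvings returns how many times n can be halved (using integer
// division) before it drops to 1 or below.
func countHalvings(n int) int {
	steps := 0
	for n > 1 { // read aloud, this is "while n is greater than 1"
		n /= 2
		steps++
	}
	return steps
}

func main() {
	fmt.Println(countHalvings(8)) // prints 3
}
```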

Suggested Changes to Go

We suggest the following modifications, none of which break existing Go programs:

1. A file with a `main()` function that omits the `package main` declaration is assumed to be `package main`. The omission of `package main` could also be a signal to enable other language modifications, such as those discussed below, if they are considered inappropriate for common use. It could also import several packages by default (e.g. “fmt” and “edu”) if they are used in the code. In a minor way, this also reduces the friction of using Go for one-off scripts.

2. Rename “slices” to “lists”. This is a documentation change only. An additional advantage of the name “list” is that it further abstracts what are now called slices from their implementation (as parts of arrays) — this encapsulation would allow the idea of lists to grow more independent of their implementation.

3. Make int be a type alias to intXX (where XX is the machine word size) similar to rune and byte. For consistency, uint should similarly be an alias to uintXX. The behavior of programs that use `int` would be unchanged. `int` would still correspond to the word size of the machine and would still have an undetermined size. Programs that rely on a known int size would still need to use intXX. On (say) 64-bit machines int and int64 would be interchangeable. This is more consistent with `byte`.
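The contrast with `byte` can be seen in current Go. In the sketch below (the function name `sumBytes` is ours), a `[]byte` is passed where `[]uint8` is expected with no conversion, while `int` and `int64` still need one.

```go
package main

import "fmt"

// sumBytes is declared in terms of uint8, not byte.
func sumBytes(bs []uint8) int {
	total := 0
	for _, b := range bs {
		total += int(b)
	}
	return total
}

func main() {
	data := []byte("AB") // byte is an alias for uint8, so this is a []uint8
	fmt.Println(sumBytes(data))

	var n int = 7
	var m int64 = int64(n) // today this conversion is mandatory
	fmt.Println(m)
}
```

Under the proposal, the second conversion would become unnecessary on a machine where `int` is 64 bits wide.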

4.1. Support int(string) and floatXX(string) conversions that assume base-10 encoded integers. These would panic if the string could not be converted.  `v = int(s)` would be syntactic sugar for the following code:

v, err = strconv.Atoi(s)
if err != nil {
   panic(strconv.ErrSyntax)
}

The conversion v = floatXX(s) from string s would be equivalent to the code:

v, err = strconv.ParseFloat(s, XX)
if err != nil {
  panic(strconv.ErrSyntax)
}

For consistency, intXX(string) should probably work as well.

4.2. Support `string(int)` and `string(floatXX)` conversions that assume base-10. These are syntactic sugar for the calls to `strconv.Itoa()` and `strconv.FormatFloat(f, 'g', -1, XX)`. For user-defined types that implement Stringer, string(x) should call x.String().
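For comparison, these are the existing calls that the proposed conversions would stand in for. The wrapper names `formatInt` and `formatFloat` are hypothetical and only there to make the sketch easy to try.

```go
package main

import (
	"fmt"
	"strconv"
)

// formatInt is what string(n) would mean under the proposal.
func formatInt(n int) string {
	return strconv.Itoa(n)
}

// formatFloat is what string(f) would mean for a float64 under the
// proposal: 'g' format with the smallest precision that round-trips.
func formatFloat(f float64) string {
	return strconv.FormatFloat(f, 'g', -1, 64)
}

func main() {
	fmt.Println(formatInt(42))
	fmt.Println(formatFloat(2.5))
}
```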

The “string->number” conversions and `string(floatXX)` do not break any existing programs since they currently generate a compiler error. One caveat: `string(int)` already compiles today and yields the character with that code point (for example, `string(65)` is "A"), so redefining it would change the behavior of any program that relies on it. The downside of supporting `number(string)` is error handling: by panicking on error, it requires the programmer to ensure that the conversion will succeed before applying it (it’s similar in this respect to a type assertion `x.(t)`). This conversion is not ultimately the right way to do number parsing from user input data since typically the error should be dealt with rather than raising a panic. In this respect it’s like `println` in Go: a feature that exists only as an onramp to the right way to do things.

5.1. Add package `edu` to the standard library with the following functions:

  • edu.Assert(bool, string, ...interface{})
  • edu.LoopInvariant(bool, string, ...interface{})
  • edu.Requires(bool, string, ...interface{})
  • edu.Ensures(bool, string, ...interface{})

If the first argument is false, these functions print the given message (formatted as in Printf), likely with some information obtained from runtime.Caller(1), and panic. The last of these functions allows postconditions to be expressed using:
defer func() {edu.Ensures(...)}()

The package also should contain the following functions:

  • edu.AssertFunc(func()bool, string, ...interface{})
  • edu.LoopInvariantFunc(func()bool, string, ...interface{})
  • edu.RequiresFunc(func()bool, string, ...interface{})
  • edu.EnsuresFunc(func()bool, string, ...interface{})

These behave like the previous four functions, printing the message and panicking if the passed-in function returns false.
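To give a feel for how small such a package could be, here is a sketch of two of the functions, written in `package main` so that it runs as a single file. No `edu` package exists in the standard library; the shared helper `check`, the wording of the panic messages, and the example function `isqrt` are all our own assumptions.

```go
package main

import (
	"fmt"
	"runtime"
)

// check panics with a formatted message if cond is false.
func check(kind string, cond bool, format string, args ...interface{}) {
	if cond {
		return
	}
	msg := fmt.Sprintf(format, args...)
	// Caller(2) skips check and its wrapper (Requires or Ensures),
	// so the reported location is in the student's code.
	if _, file, line, ok := runtime.Caller(2); ok {
		msg = fmt.Sprintf("%s:%d: %s", file, line, msg)
	}
	panic(kind + " failed: " + msg)
}

// Requires checks a precondition.
func Requires(cond bool, format string, args ...interface{}) {
	check("precondition", cond, format, args...)
}

// Ensures checks a postcondition; call it from a deferred closure.
func Ensures(cond bool, format string, args ...interface{}) {
	check("postcondition", cond, format, args...)
}

// isqrt returns the largest r such that r*r <= n.
func isqrt(n int) (r int) {
	Requires(n >= 0, "n must be non-negative, got %d", n)
	defer func() {
		Ensures(r*r <= n && (r+1)*(r+1) > n, "r=%d is not the integer square root of %d", r, n)
	}()
	for (r+1)*(r+1) <= n {
		r++
	}
	return r
}

func main() {
	fmt.Println(isqrt(17)) // prints 4
}
```

Calling `isqrt(-1)` panics in `Requires` before the deferred postcondition is ever registered.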

5.2. Add a built-in function ensures(boolean, string, ...interface{}) that desugars into defer func() {edu.Ensures(boolean, string, ...interface{})}(). As with other builtin functions, `ensures` would not be a reserved word. Programs that use `ensures` as an identifier now would continue to work (but obviously won’t be able to use the new ensures function in the same scope).

6. Allow `m[key].field = value` as a conceptual shorthand for `tmp:=m[key]; tmp.field=value; m[key]=tmp`. Since the original statement is illegal currently, no existing Go programs break. One might object that copying an entire struct to change a field is expensive, but this is (a) no more expensive than the current workaround, (b) no more expensive than passing a struct to a function now, and (c) the compiler is free to optimize this statement.

7. Add a convenience function to “math/rand” that creates a new, non-concurrency-safe random number generator using the current Unix time:

func NewFromTime() *Rand {
  return New(NewSource(time.Now().UnixNano()))
}

We also propose the following change, which does break existing Go programs:

8. Make `while` be a synonym for `for`. Alternatively, `while` could require that the initialization and increment parts of the loop be empty. I.e. `while i := 0; i < 10; i++` would be illegal, and only `while COND` would be supported. From a teaching perspective, it is sufficient if this change is enabled only when `package main` is omitted.

Conclusion

Go is a great teaching language — it grows from introductory course to production systems well. Many of the points above deal with allowing better topological sorting of the concepts of Go: being able to introduce things in a logical, linear order, while continuously expanding the interesting assignments that can be given to the students.