October 2014 – Joshua Tan

Say that we are given a single room in Borges’ library, and that we would like to say something about what those books are about—perhaps we can cluster them by subject, determine a common theme, or state that the selection is rigorously random. One way to start would be to scan every book in the room, represent each one as a long string (an average book has about 500,000 characters, though each book in Borges’ library has 410 pages x 40 lines x 80 characters per line = 1,312,000 characters), then perform some sort of data analysis.

In this case the number of books in each room (32 per bookshelf x 5 shelves per wall x 4 walls = 640 books) is far less than the length in characters of each book, i.e. the dimension of the book if we represent it in vector form. We would like to pare down this representation so that we can analyze and compare just the most relevant features of these books, i.e. those that suggest their subjects, their settings, the language they are written in, and so on.

Typically, we simplify our representations by “throwing out the unimportant dimensions”, and the most typical way to do this to keep just the two to three dimensions that have the largest singular value. On a well-structured data set, singular value decomposition, aka PCA, might leave only 2-3 dimensions which together account for over 90% of the variance… but clearly this approach will never work on books so long as we represent them as huge strings of characters. No single character can say anything about the subject or classification of an entire book.

Another way of simplifying data is to look at the data in a qualitative way: by studying its shape or, more precisely, its connectedness. This is the idea behind persistent homology.

To be continued…