Archives for posts with tag: File naming

Building on Part 1 and Part 2, below are all of the special pagination codes currently in use at UNT for MagickNumbering the individual scans of book pages. Each code is followed by the text displayed in the drop-down list of our page-turning system and then by a possible file name. These are only examples; the actual file name would be whatever correctly represents the book digitized.

####00fc - Front Cover - 000100fc.tif
####00fi - Front Inside - 000200fi.tif
####00tp - Title Page - 000500tp.tif
####r### - <Roman numeral ###> - ####r###.tif
####r001 - I - 0005r001.tif
####r004 - IV - 0008r004.tif
####pt01 - Plate 1* - 0013pt01.tif
####0000 - <Blank in the drop-down> - 00360000.tif
####00bi - Back Inside - 003700bi.tif
####00bc - Back Cover - 003800bc.tif

*A plate is only notated with the pt## code if the plate is numbered in the book. Unnumbered plates are named using the blank ####0000 code.
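
The list above can be read as a lookup from pagination code to drop-down text. Below is a minimal Python sketch of that lookup, not the code our page-turning system actually runs. The names `SPECIAL_CODES`, `to_roman`, and `display_label` are made up for this illustration, and treating every other code as "strip the leading zeroes" is an assumption.

```python
import re

# Fixed codes and the text shown in the drop-down list (taken from the list above).
SPECIAL_CODES = {
    "00fc": "Front Cover",
    "00fi": "Front Inside",
    "00tp": "Title Page",
    "00bi": "Back Inside",
    "00bc": "Back Cover",
}

ROMAN_PAIRS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"), (90, "XC"),
    (50, "L"), (40, "XL"), (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(number: int) -> str:
    """Convert a positive integer to an upper-case Roman numeral."""
    parts = []
    for value, symbol in ROMAN_PAIRS:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


def display_label(code: str) -> str:
    """Return the drop-down text for a 4-character pagination code."""
    if code in SPECIAL_CODES:
        return SPECIAL_CODES[code]
    roman = re.fullmatch(r"r(\d{3})", code)
    if roman:
        return to_roman(int(roman.group(1)))
    plate = re.fullmatch(r"pt(\d{2})", code)
    if plate:
        return f"Plate {int(plate.group(1))}"
    # Assumption: anything else is shown with its leading zeroes removed,
    # so the blank code 0000 yields an empty label.
    return code.lstrip("0")


for code in ["00fc", "r004", "pt01", "0000", "0012"]:
    print(code, "->", repr(display_label(code)))
```

Keeping the fixed codes in a dictionary means a new code such as the tc idea discussed further down would be a one-line addition.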

In the last update of the MagickNumbering system, Roman numerals were made much easier to deal with using the r### pagination code. Previously, Roman numeral file names had to be hand-entered, and this functionality still lives in the system. Each line below shows the legacy file name, the equivalent r### file name, and the pagination displayed:

####000i.tif - ####r001.tif - I
####00iv.tif - ####r004.tif - IV
####xviii.tif - ####r018.tif - XVIII
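
For anyone sitting on a pile of legacy names, the mapping in these three lines could be automated. The following is a rough Python sketch under a few assumptions: the sequence is always 4 digits, legacy numerals are lower case, and the numerals are well formed (nothing here rejects a malformed numeral such as "iiii"). The function names `roman_to_int` and `modernize` are invented for this example.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

# 4 sequence digits, optional zero padding, then Roman numeral letters only.
# Special codes such as 00fc or 00tp contain other letters and do not match.
LEGACY_ROMAN = re.compile(r"(\d{4})0*([ivxlcdm]+)\.(\w+)")


def roman_to_int(numeral: str) -> int:
    """Convert a lower-case Roman numeral to an integer (no validation)."""
    values = [ROMAN_VALUES[ch] for ch in numeral]
    total = 0
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtracted (iv = 4).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def modernize(filename: str) -> str:
    """Rewrite a legacy Roman numeral file name to the r### form.

    Names that do not look like legacy Roman numerals are returned unchanged.
    """
    match = LEGACY_ROMAN.fullmatch(filename)
    if not match:
        return filename
    sequence, numeral, ext = match.groups()
    value = roman_to_int(numeral)
    if value > 999:
        raise ValueError(f"{numeral} does not fit in the r### code")
    return f"{sequence}r{value:03d}.{ext}"


for name in ["0005000i.tif", "000800iv.tif", "0022xviii.tif", "000100fc.tif"]:
    print(name, "->", modernize(name))
```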

One interesting feature of the system is that it does not break on the 9-character file name needed to describe XVIII using the legacy Roman numeral pagination code. This is due to some intelligent parsing of MagickNumbers by the system. The first 4 digits of the (ostensibly) 8-digit code are grabbed for the sequence. For the remaining characters, the system ignores any leading zeroes but displays anything else. Of course, the special pagination codes display something different than entered, but I have yet to encounter a page paginated “fc” or “tp”. I can see r100 as a possible pagination (gov docs are often funky), but in such a case I would probably resort to another feature in the system: it will also accept a text file with the pagination of each page written on a new line. This isn’t as useful as MagickNumbering the scans themselves because we use MagickNumbering as a quality control (QC) step. Checking the images against a separate text file is an added headache when visually inspecting the scans to make sure the one-thousand-and-fourteenth item really is page nine-hundred-eighty-three, or 10140983.tif.
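
The parsing described above (take the first 4 digits as the sequence, treat whatever is left as the pagination) might look something like the Python sketch below. It is a guess at the behavior, not the system's real code, and the names `split_magick_number` and `plain_display` are hypothetical.

```python
from pathlib import PurePath


def split_magick_number(filename: str, seq_width: int = 4) -> tuple[int, str]:
    """Split a file name into (sequence, pagination).

    The sequence is the first `seq_width` digits; the pagination is whatever
    remains of the name before the extension, however long it is.
    """
    stem = PurePath(filename).stem
    sequence, pagination = stem[:seq_width], stem[seq_width:]
    if len(sequence) != seq_width or not (sequence.isascii() and sequence.isdigit()):
        raise ValueError(f"no {seq_width}-digit sequence in {filename!r}")
    if not pagination:
        raise ValueError(f"no pagination code in {filename!r}")
    return int(sequence), pagination


def plain_display(pagination: str) -> str:
    """Drop leading zeroes and show the rest (special codes are not handled)."""
    return pagination.lstrip("0")


for name in ["10140983.tif", "0022xviii.tif", "00360000.tif"]:
    sequence, pagination = split_magick_number(name)
    print(name, sequence, repr(plain_display(pagination)))
```

Because the pagination is simply "everything after the sequence", the 9-character legacy name parses the same way as an 8-character one.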

This leads to another quandary: what happens if you have a book with over ten thousand pages (unlikely) or a serial set paginated to over ten thousand pages (highly likely with gov docs!)? MagickNumbers easily scale to handle this by switching from an 8-character file name comprising two 4-character codes to a 10-character file name comprising two 5-character codes. Such a file name handles a sequence and pagination up to 99,999. I have yet to deal with anything requiring a 12-character file name, but that would be the next step up.
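
As a rough illustration of that scaling, the Python sketch below picks the smallest code width (4, 5, 6, ...) that fits both the number of scans and the highest page number, then pads both halves to that width. The helper names `width_for` and `magick_number` are made up here, and choosing the width automatically is an assumption; in practice it is a decision made per project.

```python
def width_for(total_scans: int, highest_page: int, minimum: int = 4) -> int:
    """Return the smallest per-code width that fits both numbers."""
    largest = max(total_scans, highest_page)
    width = minimum
    while largest > 10 ** width - 1:
        width += 1
    return width


def magick_number(sequence: int, page: int, width: int = 4, ext: str = "tif") -> str:
    """Build a file name from two zero-padded codes of equal width."""
    limit = 10 ** width - 1
    if not (1 <= sequence <= limit and 0 <= page <= limit):
        raise ValueError(f"values must fit in {width} digits")
    return f"{sequence:0{width}d}{page:0{width}d}.{ext}"


width = width_for(total_scans=12000, highest_page=10500)
print(width, magick_number(1014, 983, width))
```

Only numeric pagination is covered here; how the special codes should be padded in a 5-character pagination code is not something this sketch tries to answer.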

There are a few codes I think could help expand the system and make the drop-down page lists more useful. The first would be tc for “Table of Contents”. In many books, the table of contents is numbered and, like an index, not referenced as often when you have full-text search capabilities, but many gov docs don’t number their tables of contents. The other is something to distinguish a blank page from an unpaginated page with content, such as an illustration. This can get confusing for the user quickly, though, and there are edge cases like gov docs with “page intentionally left blank” printed on them: is this a blank page or a page with content?

In part 4, I’ll mention some common mistakes I see people making and some ideas for programmatically fiddling with MagickNumbers. I also need to fully flesh out a workshop on manipulating images from the command-line using ImageMagick and am thinking this is a perfect place to do so.

Continuing from part 1: a simple example such as 12 images paginated 1 through 12 introduces MagickNumbers but does not showcase the extensible nature of the file naming standard. The next example shows how the files for a hardcover book with end pages, title page, and 14 pages (4 Roman numerals, 10 numbered), totaling 22 scans, are named, followed by the pagination info displayed in our page-turning system. The pagination info is derived from the 4-character pagination code.

000100fc.tif - Front Cover
000200fi.tif - Front Inside
00030000.tif - 
00040000.tif - 
000500tp.tif - Title Page
0006r002.tif - II
0007r003.tif - III
0008r004.tif - IV
00090001.tif - 1
00100002.tif - 2
00110003.tif - 3
00120004.tif - 4
00130005.tif - 5
00140006.tif - 6
00150007.tif - 7
00160008.tif - 8
00170009.tif - 9
00180010.tif - 10
00190000.tif -
00200000.tif -
002100bi.tif - Back Inside
002200bc.tif - Back Cover

The following new pagination codes are introduced above:

####00fc - Front Cover
####00fi - Front Inside cover
####0000 - unnumbered page that displays no information
           in the drop-down list of page numbers
####00tp - Title Page
####r### - Roman numeral
    r002 - II
    r003 - III
    r004 - IV
####00bi - Back Inside cover
####00bc - Back Cover

Note that in the example the title page is numbered Roman numeral I in the book, but instead of ####r001 the file has the special pagination code ####00tp. This is because we assume most users prefer having the ability to jump directly to the title page of a book.
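
The 22 file names in this example can be rebuilt from nothing more than the ordered list of pagination codes, since the sequence is just each scan's position in the book. A small Python sketch, with the function name `number_scans` invented for the purpose:

```python
def number_scans(codes: list[str], ext: str = "tif") -> list[str]:
    """Prefix each 4-character pagination code with its 1-based sequence."""
    names = []
    for sequence, code in enumerate(codes, start=1):
        if len(code) != 4:
            raise ValueError(f"pagination code must be 4 characters: {code!r}")
        names.append(f"{sequence:04d}{code}.{ext}")
    return names


# Pagination codes for the hardcover book above, in scan order.
codes = (
    ["00fc", "00fi", "0000", "0000", "00tp"]   # covers, end pages, title page
    + [f"r{n:03d}" for n in range(2, 5)]       # II through IV
    + [f"{n:04d}" for n in range(1, 11)]       # pages 1 through 10
    + ["0000", "0000", "00bi", "00bc"]         # end pages, back covers
)

for name in number_scans(codes):
    print(name)
```

Inserting a missed scan then means adding one code to the list and regenerating the names, rather than renaming every later file by hand.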

Still to come RE: MagickNumbers:

  • a complete list of our current pagination codes
  • possible additions to the pagination codes I have been considering suggesting
  • common problems easily solved by how MagickNumbers are created and used
  • common mistakes MagickNumber novices make that reduce file naming consistency
  • programmatic possibilities for writing and validating MagickNumbers

Setting consistent and intelligent file naming standards is always smart in any digitization project, but books raise unique problems. In most cases, books have an overall sequence and an internal pagination and at UNT both must be delineated before a user can access the item online. The overall sequence may or may not contain the front and back covers, inside covers, end pages, title page[s], blank pages, and many other possibilities in addition to pages actually paginated in the book. How does one begin organizing such a morass of items?

At UNT we use a system called MagickNumbers that utilizes an 8-character code to notate both the sequence, or order, and pagination of digitized items.* The first 4 digits set the sequence of the items while the last 4 characters denote the pagination code used by our page-turning system to display the pagination online.

SSSSPPPP.tif, where SSSS are the sequence digits and PPPP is the pagination code.

The example below depicts a simple scenario of 12 TIFF scans of 12 pages paginated 1 through 12, each followed by the pagination info displayed in our page-turning system:

00010001.tif -  1
00020002.tif -  2
00030003.tif -  3
00040004.tif -  4
00050005.tif -  5
00060006.tif -  6
00070007.tif -  7
00080008.tif -  8
00090009.tif -  9
00100010.tif - 10
00110011.tif - 11
00120012.tif - 12
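
The SSSSPPPP pattern is simple enough to generate with a couple of lines of code. Here is a minimal Python sketch that reproduces the 12 names above; the function name `magick_number` is just a label chosen for this example, and it only covers plain numeric pagination.

```python
def magick_number(sequence: int, page: int, ext: str = "tif") -> str:
    """Build an SSSSPPPP file name from a sequence number and a page number."""
    if not (1 <= sequence <= 9999 and 0 <= page <= 9999):
        raise ValueError("sequence and page must each fit in 4 digits")
    # Zero-pad both halves to 4 digits so the names sort in scan order.
    return f"{sequence:04d}{page:04d}.{ext}"


# 12 scans of 12 pages, paginated 1 through 12.
names = [magick_number(n, n) for n in range(1, 13)]
for name in names:
    print(name)
```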

There’s more to come, but I was sidetracked by starting a script to aid in MagickNumbering. We currently use the commercial program ACDSee Photo Manager to MagickNumber our book scans, and being able to provide an open source alternative may help others get on the road to file naming nirvana.

* MagickNumbers are used when we want to display the item as a series of “pages” as opposed to a more generalized “series”, such as the 2 images comprising the front and back of a scanned photograph.