I got a call from a lawyer I don’t know
on Sunday evening. He reported that he’d received production of ESI
from a financial institution and spent the weekend going through it.
He’d found TIFF images of the pages of electronic documents, but
couldn’t search them. He also found a lot of “Notepad documents.” He’d
sought native production, so thought it odd that they produced so many
pictures of documents and plain text files.
As it’s unlikely a bank would rely on
Windows Notepad as its word processor, I probed further and learned that
the production included folders of TIFF images, folders of .TXT
files (those “Notepad documents”) and folders of files with odd
extensions like .DAT and .OPT. My caller didn’t know what to do with
these.
By now, you’ve doubtless figured out
that my caller received an imaged production from an opponent who blew
off his demand for native forms and simply printed to electronic paper.
The producing party expected the requesting party to buy or own an
old-fashioned review tool capable of cobbling together page images with
extracted text and metadata in load files. Without such a tool, the
production would be wholly unsearchable and largely unusable.
When my caller protests, the other side will tell him how all those other
files represent the very great expense and trouble they’ve gone to in order
to make the page images searchable, as if furnishing load files to add
crude searchability to page images of inherently searchable electronic
documents constitutes some great favor.
It brings to mind that classic Texas comeback, “Don’t piss in my boot and tell me it’s raining.”
It also reminds me that not everyone
knows about load files, those unsung digital sherpas tasked to tote
metadata and searchable text otherwise lost when ESI is converted to
TIFF images. Grasping the fundamentals of load files is important to
fashioning a workable electronic production protocol, whether you’re
dealing with TIFF images, native file formats or a mix of the two. I’ve
been wanting to write about load files for a long time, but I’d avoided
it because I just hate the damn things! So, this post is a load (file) off my mind.
In simplest terms, load files carry data
that has nowhere else to go. They are called load files because they
are used to load data into, i.e., to “populate” a database. They first
appeared in discovery in the 1980s in order to add a crude level of
electronic searchability to paper documents. Then as now, paper
documents were scanned to TIFF image formats and the images subjected to
optical character recognition (OCR). Unlike Adobe PDF images, TIFF
images weren’t designed to integrate searchable text; consequently, the
text garnered using OCR was stored in simple ASCII[1]
text files named with the Bates number of the corresponding page image.
Compared to paper documents alone, imaging and OCR added
functionality. It was 20th century computer technology improving upon 19th century printing technology, and if you were a lawyer in the Reagan era, this was Star Wars stuff.
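For the curious, here is a rough Python sketch of how a tool might exploit that naming convention. It assumes a hypothetical layout in which each page image (say, 0000001.tif) has a same-named text file (0000001.txt) in a separate folder; the folder arguments and function names are mine for illustration, not any vendor's.

```python
from pathlib import Path


def pair_pages_with_text(image_dir, text_dir):
    """Map each Bates number to (page image, OCR text file or None)."""
    # Index the text files by Bates number, i.e. the file name minus ".txt".
    text_by_bates = {p.stem: p for p in Path(text_dir).glob("*.txt")}
    pairs = {}
    for image in sorted(Path(image_dir).glob("*.tif")):
        bates = image.stem  # "0000001" from "0000001.tif"
        pairs[bates] = (image, text_by_bates.get(bates))
    return pairs


def search_pages(pairs, term):
    """Return the Bates numbers of pages whose OCR text contains the term."""
    hits = []
    for bates, (_image, text_file) in pairs.items():
        if text_file is None:
            continue  # no OCR text was delivered for this page
        text = text_file.read_text(encoding="ascii", errors="replace")
        if term.lower() in text.lower():
            hits.append(bates)
    return hits
```

The point of the sketch is that the image itself is never searched; the only link between a picture of a page and its words is the shared Bates number in the file names.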
Metadata is “data about data.” While we
tend to think of metadata as a feature unique to electronic documents,
paper documents have metadata, too. They come from custodians, offices,
files, folders, boxes and other physical locations that must be
tracked. Still more metadata takes the form of codes, tags and
abstracts reflecting reviewers’ assessments of documents. Then as now,
all of this metadata needs somewhere to lodge as it accompanies page
images on their journey to document review database tools, also called
“review platforms,” like Concordance or Summation (venerable products
which survive to this day). This data goes into load files.
Finally, we employ load files as a sort of road map and as assembly instructions laying out, inter alia,
where document images and their various load files are located on disks
or other media used to store and deliver productions and how the
various pieces relate to one another.
So, to review, some load files carry
extracted text to facilitate search, some carry metadata about the
documents and some carry information about how the pieces of the
production are stored and how they fit together. Load files are used
because neither paper nor TIFF images are suited to carrying the same
electronic content; and if it weren’t electronic, you couldn’t load it
into a review tool or search it using a computer.
Before we move on, let’s spend a moment
on the composition of load files. If you were going to record many
different pieces of information about a group of documents, you might
create a table for that purpose. Possibly, you’d use the first column
of your table to give each document a number, then the next column for
the document’s name and then each succeeding column would carry
particular pieces of information about the document. You might make it
easier to tell one column from the next by drawing lines to delineate
the rows and columns, like so:
+---------+---------+------------+------------+----------+---------+
| BEGDOC  | ENDDOC  | FILENAME   | MODDATE    | AUTHOR   | DOCTYPE |
+---------+---------+------------+------------+----------+---------+
| 0000001 | 0000004 | Contract   | 01/12/2013 | J. Smith | docx    |
| 0000005 | 0000005 | Memo       | 02/03/2013 | R. Jones | docx    |
| 0000006 | 0000073 | Taxes_2013 | 04/14/2013 | H. Block | xlsx    |
| 0000074 | 0000089 | Policy     | 5/25/2013  | A. Dobey | pdf     |
+---------+---------+------------+------------+----------+---------+
Those lines separating rows and columns
serve as delimiters; that is, as a means to (literally) delineate one
item of data from the next. Vertical and horizontal lines serve as
excellent visual delimiters for humans, where computers work well with
characters like commas, tabs and such. So, if the data from the table
were contained in a load file, it might appear as follows:
BEGDOC,ENDDOC,FILENAME,MODDATE,AUTHOR,DOCTYPE
0000001,0000004,Contract,01/12/2013,J. Smith,docx
0000005,0000005,Memo,02/03/2013,R. Jones,docx
0000006,0000073,Taxes_2013,04/14/2013,H. Block,xlsx
0000074,0000089,Policy,5/25/2013,A. Dobey,pdf
Note how each comma replaces a column
divider and each line signifies another row. Note also that the first
or “header” row is used to define the type of data that will follow and
the manner in which it is delimited. When commas are used to separate
values in a load file, it’s called (not surprisingly) a “comma separated
values” or CSV file. CSV files are just one example of standard forms
used for load files. More commonly, load files adhere to formats
compatible with the Concordance and Summation review tools. Concordance
load files typically use the file extension DAT and the þ¶þ characters
as delimiters, e.g.:
Concordance Load File
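To make the þ¶þ pattern concrete, here is a minimal Python sketch that writes and reads one Concordance-style line. It rests on assumptions of mine: that each value is wrapped in þ, and that the character between values is the control character ASCII 20, which many viewers display as ¶. Real DAT files vary in encoding and delimiter settings and can carry embedded line breaks, none of which this sketch handles, so check the actual file before relying on it. The function names are illustrative only.

```python
FIELD_SEP = "\x14"  # assumed separator: ASCII 20, often displayed as ¶
QUOTE = "\u00fe"    # þ, assumed to wrap every field value


def format_dat_line(values):
    """Join values into one line shaped like þvalueþ¶þvalueþ."""
    return FIELD_SEP.join(QUOTE + str(v) + QUOTE for v in values)


def parse_dat_line(line):
    """Split one line on the separator and peel the þ from each end."""
    fields = line.rstrip("\r\n").split(FIELD_SEP)
    cleaned = []
    for field in fields:
        if len(field) >= 2 and field.startswith(QUOTE) and field.endswith(QUOTE):
            field = field[1:-1]
        cleaned.append(field)
    return cleaned


header = ["BEGDOC", "ENDDOC", "FILENAME", "MODDATE", "AUTHOR", "DOCTYPE"]
row = ["0000001", "0000004", "Contract", "01/12/2013", "J. Smith", "docx"]

line = format_dat_line(row)
record = dict(zip(header, parse_dat_line(line)))
print(record["AUTHOR"])  # J. Smith
```

Unusual delimiters like these are chosen because they rarely turn up inside document text, which is exactly where a comma gets a CSV file into trouble.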
Summation load files typically use the file extension DII, but do not
structure content in the same way as Concordance load files; instead,
Summation load files separate each record like so:
Summation Load File
; Record 1
@T 0000001
@DOCID 0000001
@MEDIA eDoc
@C ENDDOC 0000004
@C PGCOUNT 4
@C AUTHOR J. Smith
@DATESAVED 01/12/2013
@EDOC \NATIVE\Contract.docx
; Record 2
@T 0000005
@DOCID 0000005
@MEDIA eDoc
@C ENDDOC 0000005
@C PGCOUNT 1
@C AUTHOR R. Jones
@DATESAVED 02/03/2013
@EDOC \NATIVE\Memo.docx
; Record 3
@T 0000006
@DOCID 0000006
@MEDIA eDoc
@C ENDDOC 0000073
@C PGCOUNT 68
@C AUTHOR H. Block
@DATESAVED 04/14/2013
@EDOC \NATIVE\Taxes_2013.xlsx
; Record 4
@T 0000074
@DOCID 0000074
@MEDIA eDoc
@C ENDDOC 0000089
@C PGCOUNT 15
@C AUTHOR A. Dobey
@DATESAVED 05/25/2013
@EDOC \NATIVE\Policy.pdf
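A record-per-block layout like this can be read with a few lines of Python. The sketch below handles only the tokens that appear in the example above, and it assumes (as the example suggests) that each @T line opens a new record, that lines starting with a semicolon are comments, and that @C lines carry a field name followed by its value. The function name parse_dii is mine, and the real format has more tokens than this covers.

```python
def parse_dii(text):
    """Parse DII-style text into a list of dicts, one per record."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # blank lines and "; Record n" comments
        if not line.startswith("@"):
            continue  # this sketch ignores anything that is not a token line
        token, _, rest = line[1:].partition(" ")
        if token == "T":
            current = {"T": rest.strip()}  # @T starts a new record
            records.append(current)
        elif current is None:
            continue  # token before the first @T; skipped in this sketch
        elif token == "C":
            # "@C AUTHOR J. Smith" -> field "AUTHOR", value "J. Smith"
            name, _, value = rest.strip().partition(" ")
            current[name] = value.strip()
        else:
            current[token] = rest.strip()
    return records


# A raw string keeps the backslashes in the file path intact.
sample = r"""
; Record 1
@T 0000001
@DOCID 0000001
@MEDIA eDoc
@C ENDDOC 0000004
@C PGCOUNT 4
@C AUTHOR J. Smith
@DATESAVED 01/12/2013
@EDOC \NATIVE\Contract.docx
"""

records = parse_dii(sample)
print(records[0]["AUTHOR"])  # J. Smith
print(records[0]["EDOC"])    # \NATIVE\Contract.docx
```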
Just as placing data in the wrong row or
column of a table renders the table unreliable and potentially unusable,
errors in load files render the load file unreliable, and any database
it populates is potentially unusable. Just a single absent, misplaced
or malformed delimiter can result in numerous data fields being
incorrectly populated. Load files have always been an irritant and a
hazard, but the upside was that they supplied a measure of searchability
to unsearchable paper documents.
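One cheap sanity check before loading anything is to confirm that every row has as many fields as the header row. Here is a minimal Python sketch using the standard csv module on comma-delimited data like the example earlier; the function name and the deliberately damaged sample (a comma dropped after "Memo") are mine.

```python
import csv
import io


def find_misaligned_rows(text, delimiter=","):
    """Return (line number, fields found, fields expected) for bad rows."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return []  # empty input, nothing to check
    problems = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(header):
            problems.append((reader.line_num, len(row), len(header)))
    return problems


sample = """BEGDOC,ENDDOC,FILENAME,MODDATE,AUTHOR,DOCTYPE
0000001,0000004,Contract,01/12/2013,J. Smith,docx
0000005,0000005,Memo02/03/2013,R. Jones,docx
"""

print(find_misaligned_rows(sample))  # [(3, 5, 6)]
```

A count check catches missing or extra delimiters, but not values that landed in the wrong column while the total stayed the same, so it is a first screen rather than a guarantee.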
Fast forward to a post-personal computer, post-Internet era.
The overwhelming majority of documents
and communications are created and stored electronically, and only the
tiniest fraction of these will ever be printed. Electronic documents
are inherently searchable and do things that paper documents can’t, like
dynamically apply formulas to numbers (spreadsheets), animate text and
images (presentations) or carry messages and tracked changes made
visible or invisible at will (word processed documents). Electronic
documents also have complements of information within and without called
metadata that tends to be lost when electronic documents are printed or
imaged. Some of this metadata has evidentiary value (e.g., date and
time information) and some has organizational value (e.g., file names).
Because electronic documents are
inherently electronically searchable, there’s no need to image them or
use optical character recognition to extract searchable text. Moreover,
there’s less need for error-prone load files to populate review tools.
Despite these advantages, many lawyers prefer to approach electronic
documents in the same way they handled paper documents. That is, they
convert searchable electronic documents to non-searchable,
non-functional TIFF images and then attempt to graft on electronic
searchability by extracting text and metadata to load files.
So, why is an old, error-prone method of
data transfer still used in electronic discovery? Good question;
because it’s not cheaper, and it’s certainly not better. Mostly, it’s
just familiar.
To be fair, there’s a lingering need for
load files in e-discovery, too. Even native electronic files have
outside-the-file or “system” metadata that must be loaded into review
tools; plus, there’s still a need to keep track of such things as the
original monikers of renamed native files and the layout of the
production set on the production media. In e-discovery, load files, and
the headaches they bring, will be with us for a while; understanding load
files helps ease the pain.
[1] ASCII is an acronym for American Standard Code for Information
Interchange and describes one of the oldest and simplest standardized
ways to use numbers (particularly binary numbers expressed as ones and
zeroes) to denote a basic set of English language alphanumeric and
punctuation characters.
source;http://ballinyourcourt.wordpress.com/2013/07/17/a-load-file-off-my-mind/