My Data, my Device and Myself
Some simple rules on storing historical data on electronic devices
Storing data on an electronic device means creating a system of three interacting components: data, e-device and user. Each of these components has specific characteristics and requirements. A good combination of the three makes an efficient tool; a bad combination, at best, an unmanageable one.
The requirements and characteristics of these three components are contradictory, so that any data storage system is a compromise between conflicting demands. Consequently, no perfect data storage system will ever exist. Global efficiency can only be measured against the purpose for which the system was designed. Design must maximize the side which is most important in the designer's view, while preserving the fundamental requirements of the other two.
I. Interwoven requirements
Data have an internal structure which conveys information. Storage must obviously preserve this structure, as far as the electronic device allows.
Electronic devices do not understand concepts. Data, for them, are just sets of coded strings which they manipulate as black boxes: appending them to one another, arranging them along various patterns in accordance with the user's demands, but always managing them as external objects. Data storage, from the device's point of view, means providing empty boxes, endowed with formal properties, into which users input what they want. E-devices have no idea what this content may be. They only know the location of the box and, at best, some external attributes of the data, for instance whether they must be processed as character strings or as numbers, and little more.
Inputting a continuous flow of information into a storing e-device means fitting information into such boxes. Pouring all available information into one box is the easiest solution. It preserves all the richness of this information, but adds little as far as data analysis and management are concerned. Using various boxes means splitting the data, and makes it possible to create new information by combining the parts in different ways. To do that, users must split the information in accordance with its internal structure, and label each box as containing a different class of information, corresponding to a different concept. By instructing the device to arrange the boxes in different ways, or to select among them one class or another, users are able to disclose hidden patterns and to draw conclusions about the content and structure of the original information; or simply to manage it in the simplest possible way for whatever purpose they have in mind.
Obviously, an accurate analysis demands numerous boxes: the thinner the basic pieces, the keener the conclusions. The choice of the granularity of the information items inserted into each box depends on the user's goal. For scientific purposes, granularity must be fine, to maximize the possibilities of analysis. But fitting information into numerous boxes means cutting it into as many pieces as there are boxes, and breaking a continuous flow of information into independent pieces means breaking its structure, that is, changing its meaning. Such an operation goes against the main requirement of scientific work, the purpose of which is to account for observed phenomena and not for artifacts created by the tools used for observation.
The danger of breaking structures which convey information while managing them is nothing new. It is inherent in any scientific process. Every research community has addressed it by elaborating sets of hermeneutical rules, in accordance with its purpose and with the nature of the data it customarily handles. Any danger of misinterpretation can be reduced, if not wholly avoided, by strict compliance with these hermeneutical rules. The operation of splitting data must be carefully thought out and, as far as research is concerned, must be executed by, or under the control of, fully trained scientists. From which we must conclude that loading data into a database for research purposes is not a task for beginners and requires previous training.
We call the operation of breaking down information into discrete pieces for storage and management purposes atomization. Atomization transforms coarse information into ready-for-computer-handling data. The choice of the granularity of the pieces, that is, of the module of atomization, is the first and probably most important step in any process of data storage. Any further use of the data concerned depends on the chosen module and on the adequacy of the atomization to the nature of the data.
So far we have analyzed the interwoven requirements of data and e-devices. Users' requirements are no less exacting. On the one hand, users decide the purpose for which data have been gathered and are being processed: we made clear that this factor strictly conditions all data-processing operations. On the other hand, users, as recipients of the final results, are usually the last link in the chain of operations through which data are processed. Their physiological limitations become those of the system, especially when data must be visualized, either while loading them into the e-device or later, for control purposes, as is almost always the case with historical data. Visualizing means displaying on a screen. This display is limited and conditioned by the reader's physiology: letters, for instance, must be big enough to be easily read. Displays are also conditioned by the reader's cognitive capacity. One cannot embrace more than a limited amount of information at a time; the human mind has a strictly limited short-term memory; data must be displayed in a strict order and along predictable lines in order to minimize cognitive effort, especially when huge amounts of data must be scanned at a glance. Environmental conditions are equally important. Working in an archive repository with no or insufficient Internet connection, or working at full speed because hotel bills are piling up, is one thing; being seated in one's office with a 1 GB connection, while enjoying the quiet and friendly surroundings of a group of admiring students, is another.
After this general theoretical introduction, we can address the practical problem of making all three actors mentioned above work together in the most efficient possible way when processing historical text data1.
II. Storing historical data: a practical view
a) Images of book and manuscript pages
Digital images of texts are a storing device. Returning to the comparison we introduced above, they are one-box storing devices. The information conveyed by the image is stored as a block, identified only by a label indicating the source and, if necessary, the relevant credits. Its content, as such, cannot usually be further analyzed by digital means. It is coarse information, and little more.
In many cases, nevertheless, digital images are the only practical way of storing historical information, fundamentally when one is hastily working in a repository one cannot regularly access. It is also the way archives store information to be put online, because publishing online must leave open every possible use of the documents, and images are by far the medium which tampers least with the content of the source. Provided the reproduction is integral: selecting means tampering.
Images are obviously a useful, but provisional storage tool. They provide a good answer to environmental constraints, but do not make internal analysis possible.
b) Text processing
Text processing packages also work as storing devices. Like images, they are one-box storing devices. They are more manageable than images, because text search, copy-and-paste commands, added comments, changes of color and typographical disposition make it possible to distinguish and visually handle thematic blocks of information.
They also provide information formatted in a way which makes it ready for text-analysis devices. Still, like text images, text processing must be considered a provisional storage means, to be used in conditions where images are impractical, either because getting a summary is more efficient than copying the whole document, or because capturing images is not possible.
As with images, every set of text matter must be carefully labeled and stored in strict order. A table of contents is usually most useful. Note that images and text processing packages tend to generate many small files, with a strong propensity to get lost over time: handle them with care.
c) Text analysis devices
Text analysis devices are able to count occurrences and co-occurrences of words, expressions and stylistic figures inside a given text. We are not interested here in such skills, but in their ability to divide a text into various independent parts and label them, to be extracted on demand from the document and handled at the user's will.
Let us take for instance the two previous paragraphs of the present text (Fig. I). Let us imagine that some researcher is interested in the way I use the concept of "device" in relation to that of "image". Within my text, he would select all parts in which I handle the concept of "device" (or "tool", which he considers equivalent), and all parts in which I handle the concept of "image"; he would mark them and label each block "Image" or "Device". It is possible to mark any part as belonging to both sets, or to more.
Our researcher would then be able to call, by means of a simple query on the label, all text parts marked with that same label. The query "Device" would for instance provide the following result:
(1) Digital images of texts are a storing device
(2) Broaching on the comparison we introduced above, they are one-box storing devices.
(3) Images are obviously a useful, but provisional storage tool
(4) Text processing packages also work as storing devices
(5) which makes it ready for text analysis devices
That is, all text parts I previously marked with the label "Device". He could go further and display in blue all those parts which also feature in places he labeled "Image".
Fig. I. Creating labeled text blocks with a text-analysis package
In such a way users can extract from any text sets of labeled blocks. If we translate the label terminology into the "box" terminology, each labeled block is a box, defined by a label and by its location inside the text.
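The mechanics of such virtual boxes can be sketched in a few lines of Python. This is our own minimal illustration, not the actual data model of any text-analysis package: labeled blocks are stored as (label, start, end) offsets over the intact text, so the document itself is never split.

```python
# A minimal sketch of "virtual boxes": labeled blocks stored as offsets
# over the intact source text. The labels and offsets below are
# illustrative assumptions, not a real researcher's markup.

text = ("Digital images of texts are storing devices. "
        "Images are a useful, but provisional storage tool.")

# Each block: (label, start offset, end offset). The same span may
# carry several labels.
blocks = [
    ("Device", 0, 44),
    ("Image", 0, 44),
    ("Device", 45, len(text)),
    ("Image", 45, len(text)),
]

def query(label):
    """Return every text part previously marked with `label`."""
    return [text[start:end] for (lab, start, end) in blocks if lab == label]

for part in query("Device"):
    print(part)
```

The document is left untouched; a query simply reads the marked spans back out, which is why the boxes remain virtual.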
This method makes it unnecessary to materially split the document into parts. Boxes remain virtual. The process should be used when preserving the form of the source is of vital interest. Its main drawback is that preserving the form of the documents makes it impossible to expand their content so as to make explicit the implicit allusions they carry, a process which, in our view, must be the first step of any analysis. Another drawback is that, for the same reason, it does not allow the elimination of redundancy, so that the resulting database may be cumbersome and unwieldy to the point of being unmanageable. Such drawbacks can be largely alleviated by using this strategy in conjunction with databases proper (see further, part "d").
Text-analysis packages were for a long time plagued by poor ergonomics. Things have changed in recent years with products like "NVivo", a standard package for sociologists which we found highly useful for historical studies too.
d) Spreadsheets and databases
Spreadsheets simply do not serve as databases. Point-blank. Queries are too difficult to formulate, and the rigidity of the cell frame does not allow correct handling of complex data. They simply do not serve.
BUT… the cells of a spreadsheet materially provide ready-for-use boxes to store data in. Each cell is set on a line, and this line stands for an individual (a person, an object, whatever you want); in other words, it defines the subject from which to hang data. Each cell is set inside a column, and each column makes a different predicate with which to characterize the subject until you get a full description of it. Fig. II shows how the particulars of a list of suspects would look in a police database. Horizontal lines describe individuals; vertical columns, predicates.
Fig. II. Spreadsheet of police particulars (date: 1859; fictitious data, of course)
Take notice that the nature of the predicate is defined by the position of the column within the array: the first column always holds names, the sixth column noses, and so on.
The sheer simplicity and concreteness of spreadsheets lead many users to use them as databases. A confusion easy to understand, because spreadsheets work along the same principles as databases: the lines and columns of the spreadsheet correspond to the records and fields of the database (see further).
Given the similarity between both kinds, and our readers' greater familiarity with spreadsheets, we shall use examples from the latter to explain some basic principles which hold for both.
1) One field, one data
Fig. III. Polish refugees in Turkey – Mid-XIXth century (real data)
The problem lies with the "Biographie" column. Columns (3), (4) and (5) are also biographical events, and one does not understand why they stay out of column (6) (Biographie). But this is not the main point. The really questionable fact is the accumulation of various biographical data in column (6). Electronic devices handle words, not concepts. As things stand here, no electronic device would be able to separate the at least fifteen biographical events mentioned in Hieronim Abramowicz's Biographie cell, still less to handle them. One field, one data.
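The device's blindness can be made concrete with a short sketch. The cell content below is invented for illustration (it is not Abramowicz's real entry): the best an e-device can do with such a cell is a mechanical split on a separator, which still cannot tell a date from a place or a rank.

```python
# A "Biographie" cell as the e-device sees it: one opaque string.
# The content is invented, illustrative data, not the real database.

biographie_cell = ("born 1810, Wilno; officer, 1831 uprising; "
                   "emigrated to France, 1832; arrived Istanbul, 1849")

# A mechanical split on ";" is the only atomization the device can
# perform unaided, and it works only if the separator was used
# consistently. Each piece remains an unanalyzed string.
events = [piece.strip() for piece in biographie_cell.split(";")]

for event in events:
    print(event)
```

Even this split depends on a regularity (the semicolon) which real sources rarely offer; hence the rule: the user, not the device, must do the atomizing, one field per data.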
2) From columns to lines
In order to comply with this rule, the author of this database should have created fifteen columns to account for Abramowicz's Biographie. But we also saw that the position of a column within the array determines the nature of the data displayed in it. That means that to account for Achmetowicz's biography, the events of which are different from those which form Abramowicz's life course, more columns would have to be added. And so forth for every one of the hundreds of individuals mentioned in the database. Let us imagine a spreadsheet of, let us say, 1500 columns… In fact, I recently saw one of almost 500 columns. Unmanageable, beyond any hope of salvation. Common sense tells us something is wrong here. We broke the constraints imposed on the database by the user's limitations.
If the number of columns is a problem, the number of lines is not, or far less so, for many reasons: lines use less space on the screen than columns, simply because we write horizontally, so that more lines than columns can be displayed on a given screen, and e-devices browse lines faster than columns; users, too, have been trained from childhood to read lists disposed as lines. Let us then turn our columns into lines.
Abramowicz's biography becomes:
Fig. IV: Abramowicz's biography, converted to lines, first version
The problem is that in the one-line version, the line defined the individual. In the various-lines version, this reference is lost. Such a loss can be made up for by creating a new field, assigned to every entry, in which we input the name of the actor concerned (Fig. V). By the way, we also transfer the date to a specific field, a fundamental piece of information in historical documents: storing it apart makes chronological sorting easier.
Each biographical event is made a line. Name and date are converted into predicates of the event. Users may list as many biographical events as they want, belonging to as many actors as they wish, in the order they choose. A query on the name column provides all items assigned to a same actor.
Fig. V. Abramowicz's biography, converted to lines, revised version, with some clues to Achmetowicz's biography
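The columns-to-lines conversion just described can be sketched as follows. The biographical data are invented for illustration: one wide entry per individual becomes one line per event, each line carrying the actor's name and the date as predicates.

```python
# A sketch of the columns-to-lines conversion (wide to long format),
# with invented illustrative data.

# Wide form: one entry per individual, holding all his events.
wide = {
    "Abramowicz": [("1810", "born in Wilno"),
                   ("1831", "fought in the uprising")],
    "Achmetowicz": [("1795", "born in Crimea"),
                    ("1849", "arrived in Istanbul")],
}

# Long form: one line per event; name and date become predicates.
long_rows = [
    (name, date, event)
    for name, events in wide.items()
    for date, event in events
]

# A query on the name column gathers all items assigned to one actor.
abramowicz = [row for row in long_rows if row[0] == "Abramowicz"]
print(abramowicz)
```

Note that the long form grows only downwards: a new event for any actor adds a line, never a column, which is exactly what keeps the structure manageable.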
As far as biographies are concerned, we can do with three columns, because sources, apart from names and dates, only provide biographical events, that is, a single kind of data: values change, but all pieces of information can be considered under the sole heading of biographical data. The police particulars of Fig. II raise a different problem: every column has a different meaning. The three-column model we use in Fig. V does not take this difference into account.
When a similar question was raised about names, we solved it by adding to each record a column with the name of the individual concerned. Let us add, in the same way, a column describing the nature of the predicate. We may call it "Nature of the predicate", or "Descriptive dimension" or, using a more usual although less precise term, "Feature" (Fig. VI).
Fig. VI. Police particulars. Fig. II converted to lines
This model can be used to describe absolutely anything. A first leg names the subject, a second leg the class of the predicate (the feature), a third leg the value of the predicate. More fields may be added to store specific data made necessary by the context in which the information will be used: here, a date; in real research conditions we should probably add the mention of a source, which we do not display in Fig. VI, but which could easily be added on its right side. Every time a new predicate, a new value or a new subject is mentioned, it generates a new line. The e-device builds on demand any universe of data by means of simple queries on one or several fields.
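The three-leg model can be sketched in a few lines. The police particulars below are invented, in the spirit of Fig. II; the point is that every new predicate simply adds a line, and that simple queries on one or several fields rebuild any universe of data.

```python
# A minimal sketch of the subject / feature / value model, with
# invented police particulars. A date field is added, as in Fig. VI.

rows = [
    # (subject,  feature,  value,       date)
    ("Dupont",  "Name",   "Dupont",    "1859"),
    ("Dupont",  "Nose",   "aquiline",  "1859"),
    ("Dupont",  "Height", "5 ft 6 in", "1859"),
    ("Durand",  "Nose",   "flat",      "1859"),
]

def describe(subject):
    """An actor is the list of the predicates which describe him."""
    return [(f, v) for (s, f, v, d) in rows if s == subject]

def who_has(feature, value):
    """Query on two fields: rebuild a universe of data on demand."""
    return [s for (s, f, v, d) in rows if f == feature and v == value]

print(describe("Dupont"))
print(who_has("Nose", "flat"))
```

Nothing in this structure ever forces a new column: a suspect with an unheard-of feature just contributes one more line.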
This model can be interpreted in accordance with the concepts we defined in the first part of this paper. We started with data atomized on the basis of the subject: one individual, one line, and everything about the same individual on the same line. We ended with data atomized on the basis of the predicate: one predicate, one line. An actor, in this last configuration, can be defined as the list of the predicates which describe him.
This structure works on spreadsheets as well as in databases. A short presentation of databases will make clear why we recommend avoiding spreadsheets for data storage.
Spreadsheets use cells. Databases use fields. Both are similar from many points of view, as we saw. Cells are defined as the intersection of a row and a column. Fields are segments carved inside a record. The main difference lies in the fact that cells are attached to a relative position on the screen and cannot be moved, while fields can be displayed anywhere within the space allocated to the record on the screen, allowing the creation of specific layouts for specific purposes which make the display of data far more flexible… to the user's satisfaction, especially when the database has grown big. Another difference lies in the fact that, for reasons derived from the basic structure of both, queries are far easier in databases. When one handles complex data – and historical data tend to be complex – and makes thousands of queries each day to make sure that such and such a piece of data has not already been loaded, this is an all-important point. Last but not least, a decent database allows exporting data as one chooses, far more easily than a spreadsheet.
Data storage requires a specific formatting of the stored data. Storage has one specific function: it must make access easy, and nothing more. Processing data with analysis packages to extract conclusions from them usually means transforming this format. The storage format must be thought out so as to make this transformation easy. A good database package allows users to pick easily, one by one, the pieces of information they need and to copy them to any other package.
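The storage-to-analysis transformation can itself be sketched. With invented data in the subject/feature/value storage format recommended above, a few lines regroup the stored lines into one wide record per subject, the shape most statistical packages expect.

```python
# A sketch of the storage-to-analysis transformation, with invented
# data: lines stored as (subject, feature, value) are regrouped into
# one wide record per subject for statistical processing.

stored = [
    ("Abramowicz", "birth year", "1810"),
    ("Abramowicz", "rank", "officer"),
    ("Achmetowicz", "birth year", "1795"),
]

wide = {}
for subject, feature, value in stored:
    # Create the subject's record on first sight, then hang each
    # feature on it as a named column.
    wide.setdefault(subject, {})[feature] = value

print(wide["Abramowicz"])
```

The transformation is trivial precisely because the storage format was atomized cleanly in the first place; this, not any feature of the analysis package, is what makes the export easy.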
Spreadsheets are nevertheless fundamental tools for historians. Not to store data, but for statistical calculation. Learn how to get the best out of them as calculation tools, and do not try to use them to do things they were not designed for.
We referred, by the way, to "decent database" packages. Unfortunately, not all of them meet what we consider basic standard requirements. The most famous among them, Access, is simply outdated and does not serve our purpose because of its lack of flexibility. As a general point, we consider that no fully relational database is flexible enough to process historical data, which are always fuzzy, always incomplete and dubious. We personally use FileMaker, not because we feel any particular affinity with Apple (we don't, and FileMaker is a subsidiary of Apple), but because it is the only package which met our criteria.
Flee from ad hoc packages specially created for you by a friend who knows, or by an in-house engineer. They may be marvelous. But you must not depend on anybody but yourself. Building a database is a long-term venture. Use widely sold commercial packages. The digital world goes through a revolution every twenty or thirty years; it may be that within ten years nothing will be left of what we are presently used to. A package in general use by commercial firms will be updated accordingly. A less common one will be left to die of sheer obsolescence. In the same way, learn all you need to know to manage your database yourself. Engineers die and retire, like every mortal. You must be able to do everything by yourself when the time comes.
Summing up: think for yourself! And remember this most basic rule: technological improvements will never solve your problems with data management. The solution lies in a closer analysis of your data, not in any magic trick technology could do for you.
Conclusion: Think for yourself!
We suggested in this paper some useful tools and, so we believe, a master key to database structure; but in no way a set of solutions to be blindly implemented as a panacea. Before, during and after the making of your database, think!
Remember that you must keep three partners content: your data, your e-device and yourself. Let us analyze, for instance, Fig. III from these three points of view. The computer never felt happy with this solution, as we saw. The data felt neutral at first: information had been correctly extracted from the source. The user felt perfectly happy when she left the library where she had consulted the biographical dictionary from which she got her information; far less so when she tried to process her data and her computer made clear that it could not help. That means that as far as her goal was to collect information, the structure she used was adequate to the task, although inefficient for other purposes; and that the first version of her database, in which she atomized her data on the basis of the subject (one individual, one line), had to be transformed later and atomized on the basis of the predicate. When deciding what to do on first seeing the documents in the library where she was working, our user had to weigh all these parameters: an immediate loss of time in the library, implied by the more complex structure of the database, against a future loss of time, implied by the necessity of transferring the data to another format. Not an easy question to resolve.
There never is a unique solution. Everything depends on the context. Think for yourself!
1 We stick to text data in this concise introduction. Images and statistical arrays raise special issues which demand specific developments.