
Preparing your dataset for network analysis: a general introduction

Last week, your beloved Data Ninjas helped organize a workshop on how to get started with databases at the HNR2014 conference in Ghent. Due to popular demand, we’ll summarize the basic principles here. There’s nothing really SNA’y about building a database of course, but before you can go crazy with SNA, you have to have your data organized, so this is pretty important if you want to get something out of it easily. We’ve never really given this much thought actually: we’re pretty spoiled in this respect, because the database we work with, Trismegistos, was practically handed to us on a silver platter. Well, to Silke at least, two years ago. I actually helped build a large part of it for my PhD research, resulting in nightmares, insomnia and eventually a temporary ban on parsing names. Oh, those were the days… But we realize that others are not so fortunate and have to build theirs from scratch, which prompted our inner Mother Teresas to spread our miraculous database transubstantiation skillz among our peeps.
For starters, we use Filemaker, cuz, well, that’s what Trismegistos rolls with. It’s really easy to work with: in fact, when you create a new database (File > New database; see: it’s that easy!), you automatically enter ‘layout mode’ and a ‘field picker’ pops up, in which you can start creating a list of new fields, which you can then drag to your new database (fig. 1 & 2). 
Fig. 1: Filemaker ‘field picker’


Fig. 2: Filemaker layout mode
There are tons of options to tweak your layout, so you can play around with that if you want (we’ve got staff for that, so Trismegistos looks pretty neat!). 
Fig. 3: Trismegistos People
Once you’ve created all the fields you need, go to ‘browse mode’ (View > Browse mode), hit cmd+N (Mac) or ctrl+N (Windows) to create the first record, and start entering that data! If you’ve already got your stuff stored in Excel, for example, you can even import it directly and save yourself the hassle of retyping everything. Given the gazillion features available in Filemaker, it would take us ages to explain all the possibilities here. Luckily, they’ve got great tutorials of their own, so if you’re serious about getting started with Filemaker, get them here!
Of course, there are other database products out there that work just fine. Whatever software you’re comfortable with, the most important thing to consider, BEFORE you start building your database, is how you’re going to structure your data. And this of course depends on the type of data (texts, objects, people, places, dates, …) you’re using, and what you want to use it for.
Depending on the extent of your data, you can opt for a flat file or for a relational database. Flat file databases are simple: you make one record for each “thing” you’re studying. You can structure it in different ways though, so you have to think about which format is the most practical for you, because they all have some downsides. If, like us, you’re working with people that appear in texts, you’ve basically got four main options. To illustrate the possibilities at the workshop, we took a very very very very very small part of the Zenon archive (if you wanna learn more about this dashing gentleman, check out this blog post): four texts, each with a different date and provenance, with 12 different people (some are mentioned more than once).
So the first option is to make one record per text, with a separate field for the text, the people mentioned, the date and the provenance. In case of the Zenon texts, you’ll get four records, with four different fields. 
Fig. 4: option 1

This is probably the worst solution EVER. Stuffing all info on people in one field will make you miserable when looking for specific information later on. Same goes for texts in our field: they can be referred to by their inventory number, but also by their publication(s). Not all texts are precisely dated, so how to handle date ranges? Or places? Texts weren’t necessarily written, sent to, and found all in one place.
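To see why the single "people" field hurts, here is a minimal Python sketch (not Filemaker, and not how Trismegistos does it). The text labels and names such as "Person B" and "Zenonides" are invented placeholders, not actual data from the Zenon archive.

```python
# Option 1: one record per text, everyone crammed into a single "people" field.
# All values below are invented placeholders, not real Zenon archive data.
records = [
    {"text": "Text A", "people": "Zenon, Person B, Person C"},
    {"text": "Text B", "people": "Person B; Zenonides"},
    {"text": "Text C", "people": "zenon and Person D"},
]


def naive_search(name):
    """Return the texts whose 'people' field contains the name as a substring."""
    return [r["text"] for r in records if name in r["people"]]


hits = naive_search("Zenon")
print(hits)  # -> ['Text A', 'Text B']
```

Text B is a false hit (the made-up "Zenonides" contains "Zenon"), and Text C is missed because of the lowercase spelling and the free-form separator. With one person per field or per record, neither problem arises.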
Which brings us to option number two. This also consists of one record per text (so four in total), but this time you create separate fields for each individual, for the different types of places, and for the dates (fig. 5).

Fig. 5: option 2


Downside: if you’ve got texts with, say, 2,000+ individuals, you’ll be needing A LOT of fields. And what do you do if someone’s mentioned more than once? 
Fig. 6: option 3
Let’s try option 3 then (fig. 6), the reverse solution: one record per person, listing all the texts he’s mentioned in. So for the Zenon example, you’d get 12 different records, with multiple text fields (3 for Zenon, for example, since he appears in three out of the four texts). Same downside as with option 2 though: if a person appears in a lot of texts, you’re going to need a lot of extra fields, not only to list the text itself, but presumably you also want info on the date and provenance of each text and stuff like that, so you’ll be needing extra fields for those things as well. Can get messy. 
Option 4 (fig. 7): a separate record for each time a person is mentioned. This means that you might be copying your text info a couple of times, but since Filemaker’s got a nifty duplicate shortcut, this might just be the right choice for you. So, since our hottie Zenon appears in three different texts, and is mentioned twice in one of them, you’d need to make 4 records for Zenon: one for each of the two texts where he appears once, and two for the text where he’s mentioned twice, which will (probably) differ only in the line number.
Fig. 7: option 4
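Option 4 is easy to picture as a plain list of records. The Python sketch below assumes three fields (person, text, line) and uses invented placeholder values. Only the pattern "Zenon in three texts, twice in one of them" comes from the example above.

```python
from collections import Counter

# Option 4: one record per mention. Placeholder values only: the text labels,
# line numbers and "Person B/C" are made up for illustration.
mentions = [
    {"person": "Zenon",    "text": "Text A", "line": 1},
    {"person": "Person B", "text": "Text A", "line": 2},
    {"person": "Zenon",    "text": "Text B", "line": 1},
    {"person": "Zenon",    "text": "Text B", "line": 7},
    {"person": "Person C", "text": "Text B", "line": 3},
    {"person": "Zenon",    "text": "Text C", "line": 4},
]

# All records for one person: four for Zenon, two of them in the same text.
zenon_records = [m for m in mentions if m["person"] == "Zenon"]

# The distinct texts he appears in.
zenon_texts = sorted({m["text"] for m in zenon_records})

# Number of mentions per person across all texts.
mentions_per_person = Counter(m["person"] for m in mentions)

print(len(zenon_records), zenon_texts)
```

Because every mention is its own record, questions like "which texts?" or "how often?" become simple filters and counts, at the price of repeating the text info on each record.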
Now, if you’ve got texts with a lot of names, and people that tend to appear in lots of texts, like we do, it might be wise to start thinking about a relational database. This way, you set up separate databases for each entity (texts, people, places, …) and then link them together where needed. So for those Zenon texts, you’d need a database of texts consisting of four records, one for each text, and separate fields for the different places, dates, publications, and whatever info you want to include. A second database would then contain all 21 references to the individuals in the texts. To link these two databases, you need to have unique identifiers for each text and each reference to an individual. In Trismegistos, we use numeric identifiers (tex_id, ref_id, per_id, …). These numbers don’t actually mean anything; they just help structure your data and avoid ambiguities. So, say you have a text with 200 names: instead of copying the publication, date, provenance and all that to each of the 200 records referring to an individual, all you need is the text ID and Filemaker can pull out all the relevant information from your text database automatically. If you want, you can take things a step further and even create a separate database for places, one for dates, another for titles, you name it. Just make sure it’s worth the effort though! To give you an idea of how far you can go, this is what Trismegistos’ structure looks like in a simplified flowchart: 
Fig. 8: the Trismegistos platform
To scare you off completely, this is what the relational structure of just Trismegistos People (the three green databases in the above screenshot) looks like: 
Fig. 9: TM People
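If those diagrams look intimidating, the two-database idea (texts plus references, linked by an ID) is small at its core. The sketch below uses Python's built-in sqlite3 module rather than Filemaker. The identifier names tex_id, ref_id and per_id are the ones mentioned above, but the table names, the other columns and all the values are assumptions made up for illustration, not the actual Trismegistos schema.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE texts (
        tex_id      INTEGER PRIMARY KEY,
        publication TEXT,
        date_text   TEXT,
        provenance  TEXT
    );
    CREATE TABLE refs (
        ref_id INTEGER PRIMARY KEY,
        tex_id INTEGER REFERENCES texts(tex_id),
        per_id INTEGER,
        name   TEXT
    );
""")

# Placeholder rows: each text's info is stored once ...
con.executemany("INSERT INTO texts VALUES (?, ?, ?, ?)", [
    (1, "Publication A", "date A", "place A"),
    (2, "Publication B", "date B", "place B"),
])
# ... and each reference only carries the ID of the text it occurs in.
con.executemany("INSERT INTO refs VALUES (?, ?, ?, ?)", [
    (1, 1, 100, "Zenon"),
    (2, 1, 101, "Person B"),
    (3, 2, 100, "Zenon"),
])

# The join pulls the text info in for every reference via tex_id.
rows = con.execute("""
    SELECT refs.ref_id, refs.name, texts.publication, texts.provenance
    FROM refs JOIN texts ON texts.tex_id = refs.tex_id
    ORDER BY refs.ref_id
""").fetchall()

for row in rows:
    print(row)
```

Correcting a publication or provenance then means editing one row in texts, and every reference to that text picks up the change.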
In any case, whatever database format you decide on, the key to success is STANDARDIZATION. As our mighty Trismegistos overlord likes to say: computers are autistic. Cleopatra is not the same as Kleopatra, P. Oxy. 3 484 is not the same as P.Oxy. III 484. Make sure you’re consistent when entering data right from the start: this saves you a lot of trouble in the long run.
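If inconsistent spellings have already crept in, a small clean-up script can help. The Python sketch below is one possible approach, not the Trismegistos method. The variant lookup table and the target convention (no spaces inside the abbreviation, Arabic volume numbers) are arbitrary assumptions, so pick whatever convention suits your own data.

```python
import re

# Hypothetical lookup table: spelling variant (lowercase) -> preferred form.
NAME_VARIANTS = {"kleopatra": "Cleopatra", "cleopatra": "Cleopatra"}

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral):
    """Convert a Roman numeral such as 'III' or 'XIV' to an integer."""
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN[ch]
        # Subtractive notation: a smaller value before a larger one is subtracted.
        if i + 1 < len(numeral) and value < ROMAN[numeral[i + 1]]:
            total -= value
        else:
            total += value
    return total


def normalize_name(name):
    """Map known spelling variants to one preferred form; leave others as-is."""
    cleaned = name.strip()
    return NAME_VARIANTS.get(cleaned.lower(), cleaned)


def normalize_siglum(siglum):
    """Glue abbreviation parts together and write Roman volume numbers as Arabic."""
    tokens = siglum.split()
    abbrev = "".join(t for t in tokens if t.endswith("."))
    rest = [t for t in tokens if not t.endswith(".")]
    rest = [str(roman_to_int(t)) if re.fullmatch(r"[IVXLCDM]+", t) else t
            for t in rest]
    return " ".join([abbrev] + rest)


print(normalize_siglum("P. Oxy. 3 484"))   # -> P.Oxy. 3 484
print(normalize_siglum("P.Oxy. III 484"))  # -> P.Oxy. 3 484
print(normalize_name("Kleopatra"))         # -> Cleopatra
```

A script like this is a repair tool for existing mess. It is no substitute for consistent data entry from day one.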
Once you’ve got your database set up, and you have the data you need for your network analysis, there’s just one final step: exporting your data so you can load it into your preferred software. Excel is the best go-between program, as you can create xls files from Filemaker (File > Export records; make sure to select the right format of course!), and most SNA software supports these (or csv). And then you’re ready to roll! We’ve already given some tips on how to get started with these files in different SNA programs: check them out here and here and here.
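Depending on your SNA program, you may first need to turn the exported records into an edge list. Here is a rough Python sketch of one common way to do that: link two people whenever they appear in the same text. It assumes a CSV in the one-record-per-mention layout with columns named person and text. Those column names, the sample rows and the Source/Target/Weight header are assumptions, so check what your own software expects.

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Stand-in for a CSV exported from the database (one row per mention).
# Column names and values are invented for illustration.
exported = io.StringIO(
    "person,text\n"
    "Zenon,Text A\n"
    "Person B,Text A\n"
    "Zenon,Text B\n"
    "Zenon,Text B\n"
    "Person C,Text B\n"
    "Person B,Text B\n"
)

# Collect the set of people per text (repeat mentions collapse into one).
people_per_text = {}
for row in csv.DictReader(exported):
    people_per_text.setdefault(row["text"], set()).add(row["person"])

# Every pair of people sharing a text gets an edge; weight = number of shared texts.
edges = Counter()
for people in people_per_text.values():
    for a, b in combinations(sorted(people), 2):
        edges[(a, b)] += 1

# Write the edge list as CSV (to a string here; use open(...) for a real file).
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["Source", "Target", "Weight"])
for (a, b), weight in sorted(edges.items()):
    writer.writerow([a, b, weight])
edge_csv = out.getvalue()

print(edge_csv)
```

Whether co-occurrence in a text is a meaningful tie is a research decision, not a technical one. The script only does the bookkeeping.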
And remember boys and girls:

This post would not have been possible without the moral support, the spiritual guidance, the nerves of steel (while the rest of us bounce around flapping our arms in blind last-minute panic), the infinite wisdom and the bountiful benevolence of our mighty Trismegistos overlord. Your database dexterities are unrivaled. We tremble in your presence, oh omnipotent one.
