Discussion group for the members and faculty of the NEH Funded Institute for Digital Archaeology Method & Practice (http://digitalarchaeology.msu.edu) organized by Michigan State University’s Department of Anthropology and MATRIX: The Center for Digital Humanities and Social Sciences
Smith- Archaeology for Everyone: Digital Repository
December 13, 2015 at 5:23 pm #314
I’m starting my own thread here as I get some momentum. The past few months have mainly been a sloooow process of gaining institutional support for the project approach (with the ultimate goal of scaling this up to a digital repo for ALL of our agency digital archives). But I’m making distinct headway.
Here are my next steps as I see them:
– Nail down the true scope of the MSUDAI “bite sized” version of this project with help from mentors.
– Get KORA set up and ready for ingestion.
– Work with metadata schema to decide what’s right for this project.
– Decide what level of LOD is appropriate and figure out what I need to learn in order to make it happen.
I’m still soaking up basic skills like there’s no tomorrow. Looking at KORA installation (which I’m not even sure I’ll need for the smaller MSUDAI project), it looks like PHP will be a must. If anyone can think of other skills or topics I should be learning, please let me know. THANKS!!!
December 15, 2015 at 2:04 pm #319
The first 3 deliverables (in your visions document) are completely doable: KORA repository, metadata fields, processing PDF reports (sample). Some comments and suggestions about these first…
I recommend that you identify a subset of approximately 5 items for each type of digital object you envision including in the repository, for example, 5 scanned photographs, 5 reports in PDF format, 5 maps, and 5 of any other types of materials that will go into the repository. These samples will help you define metadata fields as well as workflows for moving a variety of materials into the repository.
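One way to pull such a test subset together is sketched below in Python. It is only an illustration under assumptions: it treats the file extension as a stand-in for the "type of digital object" (which will not always hold, since a map and a photograph can both be TIFFs), and the function name `pick_samples` is hypothetical.

```python
from collections import defaultdict
from pathlib import PurePath


def pick_samples(filenames, per_type=5):
    """Group file names by extension and keep at most `per_type` of each.

    Assumption: the extension approximates the object type. Names are
    taken in sorted order, so the result is repeatable.
    """
    groups = defaultdict(list)
    for name in sorted(filenames):
        ext = PurePath(name).suffix.lower() or "(none)"
        if len(groups[ext]) < per_type:
            groups[ext].append(name)
    return dict(groups)
```

Passing a directory listing to `pick_samples` gives a dictionary keyed by extension, which can serve as a first checklist of samples to describe and ingest.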
In terms of defining metadata, each record should have a logical focus and be about that one thing.
During the summer institute, Jon Frey and I raised some important points about this related to Frey’s ARCS project. In ARCS, the primary record is the archival object (field journal, inventory card, map, photograph) created during the archaeological research process. It took us a while to decide that the inventory card itself (and not the sherd the card describes) is the focus of the primary record in ARCS. To capture the broader archaeological context, ARCS also has records describing the Excavation/Survey Unit (field data collection unit, for example the trench), the Season (period of time), and the Project (overarching archaeological enterprise). These other records locate the archival object (for example a field journal) within its archaeological context.
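The record structure described above can be sketched as a small set of linked Python classes. This is an assumption-laden illustration, not the actual ARCS schema: the class and field names are hypothetical, and a season is reduced to a single year for brevity.

```python
from dataclasses import dataclass


@dataclass
class Project:
    """Overarching archaeological enterprise."""
    name: str


@dataclass
class Season:
    """Period of time within a project (simplified here to one year)."""
    project: Project
    year: int


@dataclass
class ExcavationUnit:
    """Field data collection unit, for example a trench."""
    season: Season
    label: str


@dataclass
class ArchivalObject:
    """Primary record: the archival object itself, not what it describes."""
    kind: str   # e.g. "inventory card", "field journal", "map"
    title: str
    unit: ExcavationUnit


def context_of(obj):
    """Walk up from an archival object to its archaeological context."""
    season = obj.unit.season
    return f"{season.project.name} / {season.year} / {obj.unit.label}"
```

The point of the sketch is that the primary record stays focused on one thing, while context is reached by following links rather than by repeating project or season details on every record.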
You will have to home in on the focus or purpose of your collection(s) (sherd or inventory card) in order to define the metadata fields you will use to describe the materials in your repository.
Regarding the last 3 deliverables/outcomes…
I need some clarification on what VCRIS is before I can comment on it.
Depending on what you mean by “laying the groundwork for integration with linked open data frameworks” like OpenContext and DINAA, I think this is doable. Eric Kansa should jump in here but I am thinking that you should be able to write documentation describing how your data might need to be transformed so that it can be published as LOD. That would be a very useful first step.
Finally, the plan for the “permanent, archival hosting of… digital material” is a project in itself. But from the opening paragraph in your initial thread, it is obvious that gaining institutional support for your project might require this piece. Therefore I suggest you research preservation options for the repository and add the full-blown planning as a next step in the project.
December 15, 2015 at 3:28 pm #320
Your project is definitely interesting and indeed achievable to a certain degree. We’re now using the LEAN project management principles for our digital work at the BM. Some of these could be used for your work on this; have a look at this link. The main one for you, I think, is build and iterate. Don’t try to do the whole lot in one go; as Catherine says, go for a subset of data and go from there. Measure the impact and iterate up.
LOD – I think you cannot really define this until you have a metadata model in place. Things you may be able to think about straightaway, though, could be units of standard measurement, geographical entities, and people. You can learn lots from some accessible resources such as Matthew Lincoln’s work on SPARQL. If you have questions specifically about LOD, fire away.
Learning PHP is not too hard. I recommend learning framework-based systems to really get to grips with the principles. I use Zend 1 & 2 for my work.
Formats of info – can you try to stay with open formats and not go for locked-down ones?
Dan
December 15, 2015 at 4:05 pm #322
Hi Catherine and Dan,
Thank you for this great feedback.
Your suggestions re: scale are very helpful. I plan to dive into this next week. As I start to organize the files, I’ll definitely have a better handle on the schema. Bound excavation reports are at the center of this project, so I’ll be starting from there.
VCRIS is our web-based site inventory system (sorry about the lack of explanation there). This is where the basic information about archaeological sites in VA is managed: geospatial data, description of sites, all of the bureaucratic goings on that affect each site, etc. It’s a new system, but it wasn’t really built to integrate well with anything else as far as I can tell. But all this is for further down the road.
Thanks for the PM advice as well as the plan for understanding metadata before getting into LOD weeds. Very helpful.
The documents I’m working with are to this point scanned into dreaded PDFs. I can definitely see moving the tabular data submitted into delimited text, but is there any better way to request final narrative reports from outside archaeologists that aren’t in PDFs moving forward?
December 15, 2015 at 4:11 pm #323
You can strip PDFs down to more usable data quite easily. I think there are various tools in the links list from the August event. Are they scanned to images or can you copy and paste text out?
December 15, 2015 at 4:24 pm #324
I’ve definitely amassed a lot of great tools to get data out of PDFs. I’ve got a few different scenarios: documents written a long time ago (typewriter, early word processor, dot matrix printed) and scanned. They are OCRed, but results are expectedly variable. The modern stuff is easy. Docs that are born in Word, etc. get OCRed right on the spot and converted to PDF, although it’s still hellish to get the formatting back out again.
So I guess my question is, should I process the old ones in some other way before I ingest the records? I will definitely investigate alternative ways to receive future documents, but this will be a culture shift, for sure.
December 15, 2015 at 4:26 pm #325
You could consider crowdsourcing the transcription of tricky PDFs via MicroPasts if it helps. There are already modules for that included. Or correction of OCR.
December 15, 2015 at 5:04 pm #328
- This reply was modified 4 years, 1 month ago by Daniel Pett.
^ We’ve got BINDERS and BINDERS of old artifact catalog printouts with truly amazing data trapped therein. I’ve been concocting all sorts of plans, thinking of MicroPasts.
I think we might need a MSUDAI spinoff projects thread. Aside from this repository project, I’ve got so much other fun stuff in the works based on what I have learned since August.
January 6, 2016 at 10:44 pm #393
Sorry to be late to the party. Jury duty and DINAA grant writing (speak of the devil) got in the way.
Jolene, I guess the entity reconciliation tool for DINAA would be useful so you can match trinomials in your documents to trinomials in DINAA. I’d like DINAA to point to documents in KORA that you archive. So, I’ll need a spreadsheet that has a list of KORA documents, their URIs (and page numbers, if linking to a PDF), and I can add those to the appropriate records in DINAA.
Beyond that, are you coming up with some sort of classification scheme for your documents that may be more specific than something like the Library of Congress Subject headings (http://id.loc.gov/authorities/subjects.html) and vocabularies available from the Getty (http://www.getty.edu/research/tools/vocabularies/lod/)? I’m asking not about periods or place (covered by PeriodO, DINAA + Geonames), but about more conceptual categories that may describe these documents (“slavery”, “industrialization”, “tobacco”, etc.). Those sorts of themes may be useful to organize and help guide users in browsing. If you define your own organization, it would be great to relate your concepts to more widely used standards like the LOC or a Getty vocabulary.
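A spreadsheet like the one described could be drafted with a short Python script along these lines. Treat it as a rough sketch under assumptions: the regular expression assumes Virginia trinomials look like "44" plus a two-letter county code plus one to four digits, the column names are made up, and the per-page text is assumed to come from whatever OCR or text extraction step is already in place.

```python
import csv
import io
import re

# Assumed shape of a Virginia trinomial: state number 44, two-letter
# county code, then one to four digits. Adjust to the real data.
TRINOMIAL = re.compile(r"\b44[A-Z]{2}\d{1,4}\b")


def find_trinomials(text):
    """Return the unique trinomials in `text`, in order of first appearance."""
    seen = []
    for match in TRINOMIAL.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


def link_rows(doc_uri, pages):
    """Yield (trinomial, document URI, page number) for one document.

    `pages` maps a page number to the text extracted from that page.
    """
    for page_number, text in sorted(pages.items()):
        for trinomial in find_trinomials(text):
            yield trinomial, doc_uri, page_number


def to_csv(rows):
    """Serialize rows as CSV text with a header line (hypothetical columns)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["trinomial", "document_uri", "page"])
    writer.writerows(rows)
    return buffer.getvalue()
```

OCR errors in older scans (for example a letter O read as a zero) would slip past a pattern like this, so the output is better treated as a draft list to check by hand than as a finished crosswalk.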
Enjoy the SHA!
-E
February 4, 2016 at 4:06 pm #492
I’ve been doing a lot of high-level thinking and planning, but I’m into KORA and about to set up my schema. Funny thing: I had kind of discounted the process of selecting which projects to use as an easy task. BUT IT HAS NOT BEEN! To recap, I’m aiming to select ~10 high-interest, low-risk data recovery reports and their accompanying media to include in the repository. I always knew that our data on data recovery-level work was uneven, but it is actually horrible. At least, it was until I spent a whole lot of time in a database rabbit hole cleaning it up.
The good news is that now I have a big long list of contenders. I’ve been immersed in statewide Virginia data for 10 years now, but I realized I still had no idea how to choose. So I made a very informal google forms poll and sent it out to a bunch of archaeologists. Totally worth it! And seeing all these truly amazing projects in one big list also really got me motivated to make this repository happen.
Now I’ve got at least a starting point. I’ve got to screen out the projects that include sensitive material (human remains, ceremonial practice) and I’m limiting this project to sites that have been destroyed or are otherwise well-known enough that publishing material wouldn’t jeopardize them any further. It’s been surprising how hard it has been to take this concrete step of just choosing some files.
The next overwhelming task ahead of me is designing schema for the repository. This is another case of “just get something down,” I know. One foot in front of the other.
February 28, 2016 at 9:55 pm #581
Thanks to Kate for the Commons reminders :). I’m still a little bit stalled for good reasons and more challenging ones. Like I mentioned before, I’m really trying to design this repository to be the framework for something much, much bigger. While I’m confident in that strategy, the big questions are proving to be mental obstacles for my moving forward. Ultimately, I just need to pick some darn files and upload them.
In other news, my superiors have seen all this new digital promise in me and have given me some [really big] projects. These include making an interactive web version of the Virginia Landmarks Register (paragraphs and images about historic places that used to be a print book) which I hope to make with Omeka, a location-aware mobile app for state highway markers (looking at KORA and mbira for this TEENSY little project- ha!), and some kind of interactive web map for historic resources along the Appalachian Trail. I’m trying so hard to not get completely over-committed and to adjust expectations when necessary.
So, back to acute project management mode. Time to make lists and check off tasks. I really love working independently in a lot of ways, but I find myself longing for a like-minded digital archaeology collaborator. Ah, well. The world is my oyster. I’m not complaining.
I may not be responding to all of your posts, but I really love the reflective parts. It’s helpful to get an idea of other people’s process. Keep it coming.
May 23, 2016 at 5:30 pm #700
I’m transferring my blog post questions here, since this forum is a little more user friendly for questions and answers.
- Am I in the right place with how I’m using Dublin Core fields? I’ve read a ton, but actually implementing a schema using Dublin Core isn’t covered (because it’s so basic). See scheme linked below.
- Are all of my specialized fields formatted in the most efficient way?
- I’m linking to the DINAA/OC uri for each site. Am I doing this right?
- Do I have to host my front end site on the same server as my KORA repository (currently it’s at MSU)? Or can the front end site be hosted somewhere else?
- What’s my first step in making a frontend using PHP (besides the Codecademy PHP course)?
Dan answered on Twitter already, with a “no” on q. 4 (good news) and advice on PHP frameworks. So I’m going to experiment with those this week.
Thanks!
May 23, 2016 at 6:56 pm #701
First off, read this:
UTF-8 will save you lots of pain.
Then read this:
You have to namespace your XML.
Make sure you validate!
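As a rough illustration of those three points, here is a minimal Python sketch that writes a Dublin Core record with namespaced elements, serializes it as UTF-8, and checks that it parses. Assumptions to flag: the `record` wrapper element and the function names are hypothetical, not part of KORA or any standard, and the check here is only for well-formedness. Validating against a schema would need something beyond the standard library, such as lxml.

```python
import xml.etree.ElementTree as ET

# Namespace URI for the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_record(fields):
    """Build a <record> element (hypothetical wrapper) whose children
    are namespaced Dublin Core elements such as dc:title."""
    record = ET.Element("record")
    for name, value in fields:
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


def serialize(record):
    """Return the record as UTF-8 bytes with an XML declaration."""
    return ET.tostring(record, encoding="utf-8", xml_declaration=True)


def is_well_formed(data):
    """Check that the bytes parse as XML (not a schema validation)."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False
```

Calling `serialize(build_record([("title", "..."), ("identifier", "...")]))` yields bytes in which each field appears as a `dc:`-prefixed element and non-ASCII characters are stored as UTF-8 rather than mangled.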
Dan
May 23, 2016 at 7:51 pm #702
Thanks, Dan! These are great links. Just what I needed to fill in some skill/understanding gaps. I’ve got some good time today where I can dig into this.
May 23, 2016 at 8:07 pm #703