

Frequently Asked Questions (FAQs)

Page history last edited by Dan Zambonini 15 years, 9 months ago

If your question isn't answered below, feel free to contact us.



Q: Why might this be useful?

Cross-collection search and cross-collection mapping can be performed easily, at no resource cost to the participating museums. It follows that participation in this type of project is not limited to the larger museums, or by technology or skill: any museum with collections online can be aggregated.
The system is also designed for generic data aggregation, not just museum objects. A powerful multi-data-type database can therefore be built that allows sophisticated mapping between the different items of content, adding value to individual records.

Q: How does the extraction work?  

A "template" is manually defined for each site, describing how to crawl the site and how to extract object data from relevant pages. Two spiders run concurrently. Spider A starts at a starting URL and looks for other valid URLs to crawl, together with valid URLs for objects (both are defined in the template). It logs both types of valid URL (other pages to crawl, and object pages), then moves on to the next valid URL to crawl.
Spider B loops through any valid 'object' pages (as logged by Spider A), and uses the rules defined in the "template" to extract object data, recording it to the database. Object data can be identified in the template using different pattern-matching techniques, including regular expressions, XPath expressions (all downloaded pages are converted to XHTML), meta tag content, and more.
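As a rough illustration of the two-spider design described above, here is a minimal Python sketch. It is not the project's actual code: the `TEMPLATE` structure, the field names, and the in-memory `PAGES` dictionary (standing in for downloaded pages) are all assumptions made for the example, and the two spiders are run one after the other rather than concurrently to keep it short.

```python
import re
from collections import deque

# Hypothetical in-memory "site" standing in for downloaded pages.
PAGES = {
    "/": '<a href="/list?page=1">Browse</a>',
    "/list?page=1": '<a href="/object/1">One</a> '
                    '<a href="/object/2">Two</a> '
                    '<a href="/about">About</a>',
    "/object/1": '<h1>Bronze axe</h1><p class="date">c. 900 BC</p>',
    "/object/2": '<h1>Silver coin</h1><p class="date">1603</p>',
    "/about": "<p>About us</p>",
}

# Hypothetical template: where to start, which URLs are worth crawling,
# which URLs are object pages, and how to pull each field out of a page.
TEMPLATE = {
    "start_url": "/",
    "crawl_url": re.compile(r"^/list\?page=\d+$"),
    "object_url": re.compile(r"^/object/\d+$"),
    "fields": {
        "title": re.compile(r"<h1>(.*?)</h1>"),
        "date": re.compile(r'<p class="date">(.*?)</p>'),
    },
}

HREF = re.compile(r'href="([^"]+)"')


def spider_a(template, pages):
    """Walk crawlable pages and log the object URLs found along the way."""
    to_crawl = deque([template["start_url"]])
    seen = {template["start_url"]}
    object_urls = []
    while to_crawl:
        url = to_crawl.popleft()
        for link in HREF.findall(pages.get(url, "")):
            if link in seen:
                continue
            if template["object_url"].match(link):
                seen.add(link)
                object_urls.append(link)   # logged for spider B
            elif template["crawl_url"].match(link):
                seen.add(link)
                to_crawl.append(link)      # another page to crawl
    return object_urls


def spider_b(template, pages, object_urls):
    """Apply the template's field patterns to each logged object page."""
    records = []
    for url in object_urls:
        record = {"url": url}
        for name, pattern in template["fields"].items():
            match = pattern.search(pages[url])
            record[name] = match.group(1) if match else None
        records.append(record)
    return records


object_urls = spider_a(TEMPLATE, PAGES)
records = spider_b(TEMPLATE, PAGES, object_urls)
```

In this sketch the "/about" link is ignored because it matches neither pattern in the template, which is how the template keeps the crawl confined to relevant pages.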

Q: How long does it take on average, per object/site?

The length of time needed to define a template for a site (which needs to be done only once) depends on how well-structured the HTML is for the object pages. For semantic, well-structured object pages, a template can be defined in about 10 minutes. For less structured pages, a template will normally take less than 30 minutes to create.
Once the template is defined, the spiders can crawl individual pages/objects at an arbitrary speed, but are currently limited to one page every 3-4 seconds to avoid "Denial of Service" style load on the remote servers (i.e. abiding by good spider etiquette). The data is therefore currently aggregated at about 15-20 objects per minute, per site. Multiple spiders can be run at once; we often have 4 or 5 sites being spidered simultaneously.
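The throughput figures above follow directly from the delay between requests. The short Python sketch below works through the arithmetic and shows one simple way such a delay could be enforced; the function names and the 10,000-object collection size are hypothetical and chosen only for illustration.

```python
import time


def objects_per_minute(seconds_per_page, sites=1):
    """Pages fetched per minute at a fixed delay, across `sites` parallel crawls."""
    return 60 / seconds_per_page * sites


def hours_to_crawl(object_count, seconds_per_page):
    """Time to fetch `object_count` pages from a single site, in hours."""
    return object_count * seconds_per_page / 3600


def throttled(urls, delay_seconds, sleep=time.sleep):
    """Yield URLs one at a time, pausing between them (not before the first)."""
    for index, url in enumerate(urls):
        if index:
            sleep(delay_seconds)
        yield url


slow = objects_per_minute(4)            # one page every 4 seconds
fast = objects_per_minute(3)            # one page every 3 seconds
five_sites = objects_per_minute(4, sites=5)
example_hours = hours_to_crawl(10_000, 4)  # hypothetical 10,000-object site
```

At one page every 4 seconds this gives 15 objects per minute, and at one every 3 seconds it gives 20, matching the range quoted above. Under the same assumptions, a hypothetical site with 10,000 object pages would take roughly 11 hours at the slower rate.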

Q: What are the legal issues in spidering someone else's content?

To be honest, we really don't know, and in the research we’ve done online it appears that, as usual, “it’s a woolly area”. Needless to say, we come in peace and have no intention of doing anything that damages reputation, content or whatever. If your museum is on our list of those that have been spidered, and you’re unhappy, then please get in touch and we’ll do what we can to respond in a timely manner. Before you leap to your email, though, please bear in mind that this is a prototype and we expect it to be consumed by a very, very small segment of museum professionals. In other words, please don’t consider this a threat; it is more an opportunity to look at a new way of aggregating, querying and exposing museum collections online. Please also bear in mind that Google, Yahoo! and pretty much every other web spider are doing exactly the same already and will have a far greater impact on your traffic figures!

Q: What are you going to do with it next?


  • Clean the data we've aggregated to date (improve the quality)
  • Normalise the data (e.g. dates, locations) so that they can be easily mapped/compared regardless of origin (mostly done now...)
  • Create a more sophisticated interface for querying/exposing the data
  • Create prototype applications/interfaces that demonstrate the value of this data repository
  • Create a simple interface that allows anyone to define templates for the system, therefore adding more data to the repository
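
To give a feel for the normalisation step mentioned in the list above, here is a minimal Python sketch that maps a few free-text date styles onto a comparable (start year, end year) pair. The function name, the handful of formats handled, and the convention of using negative numbers for BC years are all assumptions for illustration; real collection data would need many more cases.

```python
import re


def normalise_date(text):
    """Return (start_year, end_year), with BC years negative, or None."""
    t = text.strip().lower()
    # Drop a leading "circa" marker; the approximate flag is not kept here.
    t = re.sub(r"^(c\.|ca\.|circa)\s*", "", t)

    # Single year, optionally followed by BC/AD: "1603", "900 bc"
    m = re.fullmatch(r"(\d{1,4})\s*(bc|ad)?", t)
    if m:
        year = int(m.group(1))
        if m.group(2) == "bc":
            year = -year
        return (year, year)

    # Simple range: "1850-1860" or "1850 to 1860"
    m = re.fullmatch(r"(\d{1,4})\s*(?:-|to)\s*(\d{1,4})", t)
    if m:
        return (int(m.group(1)), int(m.group(2)))

    # Century: "19th century" becomes 1801 to 1900
    m = re.fullmatch(r"(\d{1,2})(?:st|nd|rd|th) century", t)
    if m:
        century = int(m.group(1))
        return ((century - 1) * 100 + 1, century * 100)

    return None  # unrecognised; leave for manual cleaning
```

Once dates from different sites share this form, records can be compared or filtered by year range regardless of how each museum originally wrote them.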

Q: I thought screen-scraping was a dirty word? What if the template changes?

The patterns that identify where each field is placed on the page can be defined so that they are 'relative', and therefore may still be valid even if the site's page layout changes. If the layout changes dramatically, it should only take another 10-30 minutes to re-define our template for that site, and this should be a rare occurrence.
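The difference between a position-based pattern and a 'relative' one can be shown with a small Python sketch using the standard library's limited XPath support. The two page snippets and both path expressions are invented for this example and are not taken from any real template.

```python
import xml.etree.ElementTree as ET

# Hypothetical object page before a redesign.
OLD_PAGE = """<html><body>
<table>
<tr><th>Title</th><td>Bronze axe</td></tr>
<tr><th>Date</th><td>c. 900 BC</td></tr>
</table>
</body></html>"""

# Hypothetical page after a redesign: a wrapper div and a new first row.
NEW_PAGE = """<html><body>
<div id="content">
<table>
<tr><th>Accession no.</th><td>1999.12</td></tr>
<tr><th>Title</th><td>Bronze axe</td></tr>
<tr><th>Date</th><td>c. 900 BC</td></tr>
</table>
</div>
</body></html>"""

# Position-based: depends on the exact nesting and row order.
ABSOLUTE = "./body/table/tr[1]/td"
# Relative: finds the row whose header cell reads "Title", wherever it is.
RELATIVE = ".//tr[th='Title']/td"


def extract(page, path):
    """Return the text of the first node matching `path`, or None."""
    node = ET.fromstring(page).find(path)
    return node.text if node is not None else None


old_absolute = extract(OLD_PAGE, ABSOLUTE)
new_absolute = extract(NEW_PAGE, ABSOLUTE)
old_relative = extract(OLD_PAGE, RELATIVE)
new_relative = extract(NEW_PAGE, RELATIVE)
```

Here the position-based path finds the title on the old page but nothing on the redesigned one, while the relative path finds "Bronze axe" on both because it anchors on the "Title" label rather than on the page structure.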

Q: Shouldn't this stuff be left to OAI / API type functionality?

Yes and No. Ideally, this data would be available through simple, standards-compliant, publicly marketed APIs. This would prevent duplication of data, and would allow each museum to control access to, and measure usage of, their data.
However, we've seen - on a number of occasions - that OAI style 'cross collection search' projects can be costly to implement, manage, and deliver, and rely on some technical know-how (or funding to outsource the know-how) at each individual museum. OAI APIs are also, typically, 'hidden' from the public, and are only known to those directly involved in the project.
This 'bottom up' approach requires no resource from the participating museums, no changes to technology or website, and no pre-project agreement on schemas. It also allows a single URL/place to be 'marketed' to give potential developers/application designers access to all the data.

Q: Doesn't this take traffic away from the individual sites?

We don’t think so, but not many studies have been done into how “off-site” browsing affects the “in-site” metrics. Already, users will be searching for, consuming, and embedding your images (and other content) via aggregators such as Google Images. This is nothing new.
Also, ask yourself how much of your current traffic derives from users coming to explicitly browse your online collections? 
The aim is that by syndicating your content out in a re-usable manner, whilst still retaining information about its source, an increasing number of third-party applications can be built on this data, each addressing specific user needs. As these applications become widely used, they drive traffic to your site that you otherwise wouldn't have received: "Not everyone who should be looking at collections data knows that they should be looking at collections data".

