Why was OpenScraper created?


Scraping can quickly become a mess, especially if you need to scrape several websites to end up with a single structured dataset. Usually you have to set up a separate scraper for each website, configure the spiders one by one, collect the data from every website, and then clean up all that raw material to obtain the one structured dataset you know is in there somewhere...

Yes, similar solutions already exist... but...

You have three main options when it comes to scraping the web:

  • use a proprietary and fairly expensive service (like Apify or import.io) and depend on an external provider;
  • ask a friend if you are lucky, or pay a developer or a company to do it for you if you have the money;
  • or, if you have the know-how, write your own code (for instance based on BeautifulSoup or Scrapy), adapt it to your own purposes, and usually end up as the only one (I mean the only developer around) able to use or adapt it.

A theoretical use case

Let's say you are a researcher, a journalist, a public servant in an administration, or a member of an association who wants to track some trend in society... Let's say you need data that is not easy to get, and you can't afford to spend thousands of euros on a private web scraping service.

You would have a list of websites from which you want to scrape similar information, each website having some URLs where that data is listed (in our first case, social innovation projects). You know that every item can be described in a similar way: a title, an abstract, an image, a list of tags, a URL, the name and URL of the source website, and so on...

To use OpenScraper, you would have to:

  • specify the data structure you expect ("title", "abstract", etc.);
  • add a new contributor (a source website): at least its name and the start_url from which the scraping will begin;
  • configure the spider for every contributor, i.e. specify the XPath for every field (XPath for "title", XPath for "abstract", etc.);
  • save the contributor's spider configuration, and click on the "run spider" button;
  • the data will then be stored in the OpenScraper database (MongoDB), so you can retrieve the structured data later (through an API endpoint or in a tabular format such as a .csv file).
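
To make the steps above more concrete, here is a minimal sketch of the idea in plain Python. It is not OpenScraper's actual code or configuration format: the dictionary keys (item_xpath, field_xpaths), the example contributor, and the sample page are all hypothetical, and the standard library's ElementTree (which only supports a subset of XPath and needs well-formed markup) stands in for a real spider and a real fetched page.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Step 1: the expected data structure (field names are illustrative).
FIELDS = ["title", "abstract", "tags"]

# Steps 2-3: one hypothetical contributor, with its start_url and one XPath per field.
CONTRIBUTOR = {
    "name": "Example Foundation",
    "start_url": "https://example.org/projects",
    "item_xpath": ".//div[@class='project']",  # one node per listed project
    "field_xpaths": {
        "title": "./h2",
        "abstract": "./p[@class='abstract']",
        "tags": "./ul/li",
    },
}

# Stand-in for the page that a spider would download from start_url.
SAMPLE_PAGE = """
<html><body>
  <div class="project">
    <h2>Community garden</h2>
    <p class="abstract">A shared garden run by residents.</p>
    <ul><li>urban</li><li>food</li></ul>
  </div>
  <div class="project">
    <h2>Repair cafe</h2>
    <p class="abstract">Volunteers fix broken appliances.</p>
    <ul><li>reuse</li></ul>
  </div>
</body></html>
"""


def scrape(page, contributor, fields):
    """Step 4: apply the contributor's XPaths and return one dict per item."""
    root = ET.fromstring(page.strip())
    records = []
    for item in root.findall(contributor["item_xpath"]):
        record = {
            "source_name": contributor["name"],
            "source_url": contributor["start_url"],
        }
        for field in fields:
            nodes = item.findall(contributor["field_xpaths"][field])
            # Several matches (e.g. tags) are joined into a single cell.
            record[field] = "|".join((node.text or "").strip() for node in nodes)
        records.append(record)
    return records


def to_csv(records, fields):
    """Step 5: export the structured records in a tabular format."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["source_name", "source_url"] + fields)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = scrape(SAMPLE_PAGE, CONTRIBUTOR, FIELDS)
print(to_csv(records, FIELDS))
```

The point of the sketch is that only the configuration changes from one website to the next: adding a new source means adding a new dictionary with its own XPaths, while the field list and the extraction code stay the same.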

An open scraper for more digital commons

To make that job a bit easier (and far cheaper), OpenScraper aims to provide an online GUI (a webapp on the client side), so you just have to set the field names (the data structure you expect), enter a list of websites to scrape, set the XPath for each field on each website, and finally click a button to run the scraper configured for each website...

... and tadaaaa, you'll have your data: you will be able to import it, share it, and visualize it (at least we're working on it as quickly as we can)... OpenScraper is developed as open source, and will provide documentation as well as a legal framework (licence and terms of use) aiming to make the core system of OpenScraper comply with the GDPR, in letter and in spirit.