How to use
CIS | Open Scraper v.1.4
OK, you need some structured data but you are just discovering the marvelous world of scraping.
So let's see - step by step - how we do it with Open Scraper...
- I/ Get prepared to scrape: familiarize yourself with what scraping is and is not
- II/ Define a datamodel: what is the expected structure of the data you'll scrape
- III/ Configure a scraper: how a scraper adapts itself to the website you are targeting
- IV/ Request your data: how to consume the dataset once the scraping is finished
I/ Get prepared to scrape
If you are using a scraper for the very first time, we invite you to check our presentation: "An open source tool to scrap'em all. An introduction to scraping and to Open Scraper".
This presentation summarizes the main purposes of scraping:
- What are data and a datamodel?
- Why can scraping data from the internet be useful, and for whom?
- What are the main problems when it comes to scraping several websites?
- How Open Scraper tries to offer a solution...
Other notions to get used to are those related to XPath, the main language Open Scraper uses to select items while scraping.
To learn more about XPath, check this section.
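To give a first feel for what "selecting items" with XPath means, here is a minimal sketch using only Python's standard library. The HTML snippet, class names and URLs are hypothetical, and `xml.etree.ElementTree` only understands a subset of XPath, so this illustrates the idea rather than what Open Scraper runs internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a page listing results.
PAGE = """
<html>
  <body>
    <ul>
      <li class="item"><a href="/projects/1">First project</a></li>
      <li class="item"><a href="/projects/2">Second project</a></li>
      <li class="ad">Sponsored</li>
    </ul>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)

# Select every <li> whose class attribute is "item", anywhere in the tree.
items = root.findall(".//li[@class='item']")

# For each selected block, read the link text and the link target.
results = [(li.find("a").text, li.find("a").get("href")) for li in items]
print(results)
```

The path expression picks the two item blocks and ignores the third `<li>`, which is the kind of filtering a scraper relies on to separate real results from the rest of the page.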
II/ Define a datamodel
The first thing to do before doing any scraping is to define the structure of the dataset you will work on.
By default, everybody can see the structure (the datamodel), but not edit it, by clicking on "List of the fields you want to extract", which leads to the "data structure overview".
If you are an admin or a member of the staff, you will be able to access "Edit your data model". You can then click on "edit data model" from the fields list to make changes to your datamodel.
You can click on "add a new field". Every field is composed of:
- the name of the field;
- the type of the field: date, ...;
- an option to see or delete the field;
- the degree of openness of the field.
By clicking on "Edit your data model", you can also edit any of the existing fields.
At the end of this process you have to save your fields.
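As a way to picture what a datamodel holds, the sketch below models a field as a small Python record. It is only an illustration: the class name `Field`, the attribute names, the "text" type and the "open" openness label are all hypothetical and are not taken from Open Scraper; only "date" appears in the list of types above.

```python
from dataclasses import dataclass


@dataclass
class Field:
    """Hypothetical representation of one field of a datamodel."""
    name: str        # the name of the field
    field_type: str  # the type of the field, e.g. "date"
    openness: str    # the degree of openness (label chosen for illustration)


# Hypothetical datamodel with two fields.
datamodel = [
    Field(name="title", field_type="text", openness="open"),
    Field(name="published_on", field_type="date", openness="open"),
]

# Field names should be unique so that each scraped value has one destination.
field_names = [field.name for field in datamodel]
has_unique_names = len(field_names) == len(set(field_names))
print(field_names, has_unique_names)
```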
III/ Configure a scraper
Once you have defined your datamodel you can configure your scrapers.
In the "List of all contributors" view, you will find a table listing all spiders already configured in Open Scraper, but you will be able to modify only those you created yourself (unless you are an admin, in which case you can modify every spider).
The "global fields"
Define here the website you want to scrape: the name of the website, its root_url, whether you need to go from a list to a detailed page, whether the website is reactive or an API, ...
The "Global fields" describe the website you want to scrape:
- name: the name you want to give to your spider
- licence: the licence under which the content of the website you will scrape is published
- page_url: the shortest url route of the website (the domain name)
- logo_url: a url ending in .jpg, .png or .svg pointing to the logo of the website, if any (use "copy the image link" in your browser and paste it)
- start_urls: the page(s) containing the list you want to scrape
- item_xpath: the xpath selecting each item's block in the list (must return a list of results)
- next_page: the xpath selecting the button leading to the next page of results
- parse_follow: whether all the data you need to scrape is present in the list or you need to go to a more detailed page
- follow_xpath: the xpath selecting the link leading to the detailed page (must end by an
- parse_reactive: whether the url route changes when you go to another page (not reactive) or stays the same (reactive)
- parse_api: whether the url you are scraping returns a normal page (no API) or JSON (REST API)
- api_pagination_root: the url route used as a base for going to the next page, if the website has an API and/or if different than the
- follow_api: the url route used as a base for going to the detailed page based on an item_id, if the website has an API and/or if different than the
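The global fields above can be pictured as a simple key/value configuration. The sketch below is a hypothetical example for a non-reactive, non-API website whose list page already holds all the data: every value is invented, the boolean encoding of the `parse_*` options is an assumption, and the choice of which fields are treated as required is ours, not Open Scraper's.

```python
# Hypothetical spider configuration; all values are invented for illustration.
spider_config = {
    "name": "Example projects",
    "licence": "CC BY-SA",
    "page_url": "https://example.org",
    "logo_url": "https://example.org/static/logo.png",
    "start_urls": ["https://example.org/projects"],
    "item_xpath": "//li[@class='item']",
    "next_page": "//a[@class='next']",
    "parse_follow": False,    # assumption: all data is present in the list
    "parse_reactive": False,  # assumption: the url changes from page to page
    "parse_api": False,       # assumption: the url returns a normal page
}

# Assumption: the fields we choose to treat as mandatory in this sketch.
REQUIRED = ("name", "licence", "page_url", "start_urls", "item_xpath")


def check_config(config):
    """Return a list of problems found in a spider configuration sketch."""
    problems = [f"missing field: {key}" for key in REQUIRED if not config.get(key)]
    logo = config.get("logo_url", "")
    if logo and not logo.lower().endswith((".jpg", ".png", ".svg")):
        problems.append("logo_url should end with .jpg, .png or .svg")
    return problems


print(check_config(spider_config))
```

Running a check like this before saving a configuration catches the most common slips, such as an empty `start_urls` list or a logo link that points to a web page instead of an image file.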
The "Custom fields" describe where to find the value to store for each field of your data model:
- < the name of your custom field >: the xpath selecting the value corresponding to the field. It must end with either:
  - text() if the value you need is some simple text,
  - @href if the value you need is a link hidden in an "a" tag,
  - @src or similar if the value you need is an image,