Advanced Sourcing for Monitoring
While most of your monitoring needs will be met by the main sourcing methods (query sources, the URL Store and custom sources), from time to time you may need to use the advanced settings for sources. This may be because the source requires a login and a password, or because you would like to track only a specific zone on a webpage.
This article explains why and how to set up advanced sourcing for monitoring in DI.12. To find out more about adding simple sources to a monitoring agent, please read this post on setting up a project.
Access Advanced Set Up
Inside a monitoring agent of a project, click on Advanced settings, in the top right corner, and then select Configure advanced sources:
This will open a lightbox with all the possible ways of setting up advanced sources:
Which one should I select?
This depends on the type of source you need to monitor.
- sites that require a login and password: choose an option with authentication
- information from a search box on a site: choose WebNews source behind a search form
- a defined part of a webpage: choose webpage source with area tracker
- several pages or information from the same website: see "Track several pages from the same website" below
- information contained in a newsletter you receive: see the last section of this article
- information on a cluster of invisible web search engines: choose invisible web 2
- forums, original postings and subsequent messages on the thread: see the last section of this article
In addition, you will notice that sources with authentication can be webnews, RSS feeds, websites or webpages.
To identify if the source is an RSS feed, webnews or a webpage, paste the URL into the custom source box on the monitoring agent. Digimind will detect which kind of source the URL is.
You will then see:
- for RSS feeds, a message that an RSS news feed has been found
- for webnews, a message that a feed has been detected on this page
- for webpages, a message that a page has been found
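If you would like to check a URL yourself before pasting it in, the sketch below shows one rough way to tell a feed from an ordinary page. It is only an illustration in Python and not how Digimind performs its detection: the function name `classify_source` and the three labels it returns are assumptions made for this example, and it works on text you have already downloaded.

```python
import re


def classify_source(body: str) -> str:
    """Guess the source type from the raw text of a downloaded document.

    Hypothetical helper for illustration; the labels are not Digimind's.
    """
    head = body.lstrip()[:2000].lower()

    # A document whose root element is <rss>, <feed> (Atom) or <rdf:RDF>
    # is itself a feed rather than an HTML page.
    if re.search(r"<(rss|feed|rdf:rdf)[\s>]", head) and "<html" not in head:
        return "rss feed"

    # An HTML page can advertise a feed through a <link> tag in its header.
    advertised = re.search(
        r'<link[^>]+type=["\']application/(rss|atom)\+xml["\']',
        body,
        re.IGNORECASE,
    )
    if advertised:
        return "page with a feed"

    return "webpage"
```

A page that advertises a feed is the closest match here to the "feed has been detected on this page" message, but that mapping is an assumption.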
Setting up a source with authentication
This is very useful for any sites to which you have a subscription.
The first step is to identify if the source is an RSS feed, webnews or a webpage (see the notes above).
Then, on the advanced configuration page, choose to add a Webnews source with authentication, an RSS source with authentication or a webpage with authentication:
The first step of the process is to indicate to Digimind the URL you would like to track, and also the page on which you log in.
Click on next to proceed to the next step, where you should see the login page open inside Digimind.
Enter your login details. This step is crucial so that Digimind can access the page you would like to monitor.
The URL you want to track should now open in a new lightbox.
Click on next. This will show a page on which you can check that the feed has been correctly entered. Click on validate to finish.
Setting up a web news source behind a search form
If you would like to get to information that is hidden behind a search request on a website, you will need to use this option. The output will always be webnews. Click on add a webnews source behind a search form:
On the next page enter the URL on which the search box is located:
Then click on next. On the page that opens, locate the search box and enter your search terms:
Check that you are getting results back:
Then click on Next at the top to check that the results can be correctly extracted. If so, you will see something like this:
Then click on validate to track any new articles added to this site that contain your keywords.
Track only a certain part of a webpage with area tracker
This will help you when there is just a small section of the page you would like to track, or if the changes are on a static page that does not include an RSS feed or webnews feed.
Go to the webpages section and select to add a webpage source with area tracker:
Paste the URL into the correct field:
In this case, the site I want to track (see below) contains some text in the middle about Coca-Cola's sponsorships, but also some related stories on the left, and further down the page there are some other details I am less interested in.
When I paste this URL into the area tracker, I will then be able to select the parts of the page I would like to track. To do so, click on the frames on the page and Digimind will track changes only to the sections that are pink:
Then click on validate to start tracking.
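To picture what tracking only the selected areas means, here is a minimal sketch in Python. It assumes the page has already been split into named areas (the names `main` and `related` are hypothetical) and compares a fingerprint of each tracked area between two visits. Digimind's own comparison method is not described in this article, so treat this as an illustration only.

```python
import hashlib


def fingerprint(areas):
    """Map each area name to a hash of its text, ignoring whitespace changes."""
    return {
        name: hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()
        for name, text in areas.items()
    }


def changed_areas(previous, current, tracked):
    """Return the tracked area names whose text differs between two visits."""
    before = fingerprint(previous)
    after = fingerprint(current)
    return [name for name in tracked if before.get(name) != after.get(name)]


# Two hypothetical snapshots of the same page.
first_visit = {"main": "Sponsorship of the games", "related": "Story A"}
second_visit = {"main": "Sponsorship of the games", "related": "Story B"}

# Only "main" is tracked, so the change in "related" is ignored.
print(changed_areas(first_visit, second_visit, tracked=["main"]))
```

In this example nothing is reported, because the only change is in an area that was not selected.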
Track several pages from the same website
This tracking method is very useful for tracking sections of your competitors' sites (such as their product offers) and institutional sites (such as Europa, the site for the European Union). You can specify the type of file format you would like to pick up (pdf, ppt, doc, etc.), which makes this perfect for tracking changes to legislation or product specifications.
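As a rough illustration of what filtering by file format amounts to, the Python sketch below keeps only URLs whose path ends in one of the wanted extensions. The function name `has_wanted_format` and the example.com URLs are hypothetical, and Digimind may identify file types differently (for example from the server's response rather than the URL).

```python
from posixpath import splitext
from urllib.parse import urlparse


def has_wanted_format(url, formats):
    """Return True if the URL path ends with one of the given extensions."""
    path = urlparse(url).path  # drops the query string and fragment
    extension = splitext(path)[1].lstrip(".").lower()
    return extension in {f.lower() for f in formats}


wanted = ["pdf", "ppt", "doc"]
for url in [
    "http://example.com/docs/spec.PDF?version=2",
    "http://example.com/index.html",
]:
    print(url, has_wanted_format(url, wanted))
```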
Go to the website section of the advanced sources:
Note: you can add websites that require a login and password. See the notes above on adding with authentication.
At the top of the page that opens, specify the limits of the crawl:
Digimind will track up to 200 pages at one time. As a result, it is a good idea to give the URL that is as close as possible to the information you need.
The depth indicates how many levels of links Digimind will follow. The URL you start with is level 1, the pages it links to are level 2, and the pages those link to are level 3. In other words, depth 3 means Digimind will click twice from the starting URL.
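The way depth and the page limit interact can be sketched with a small breadth-first walk over a made-up link map. This is a Python illustration under stated assumptions: the function name `pages_within_limits`, the sample paths and the order in which pages are picked are all hypothetical, and the article does not say how Digimind chooses which pages to keep when a site has more than 200.

```python
from collections import deque


def pages_within_limits(links, start, max_depth, max_pages=200):
    """Walk the link map breadth-first; the start URL counts as depth 1.

    Returns a dict mapping each reachable page to its depth.
    """
    visited = {start: 1}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        depth = visited[page]
        if depth >= max_depth:
            continue  # do not click any further from this page
        for target in links.get(page, []):
            if target in visited:
                continue
            if len(visited) >= max_pages:
                return visited  # page limit reached
            visited[target] = depth + 1
            queue.append(target)
    return visited


# Hypothetical site: /a links to /b and /c, /b links to /d, /d links to /e.
site = {"/a": ["/b", "/c"], "/b": ["/d"], "/d": ["/e"]}
print(pages_within_limits(site, "/a", max_depth=3))
```

With depth 3, the walk reaches `/d` (two clicks from `/a`) but not `/e`, which would need a third click.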
The next section of the page, criteria for page filtering, allows you to specify more closely which pages to open. This will help you to get to the information you need within the 200-page limit.
The URL pattern will help you to limit the clicks based on the URLs. Websites are often well structured, so pages that are of the same type, or that are subpages of the same page, have URLs that begin in the same way.
For example, on the Europa site, I want to track changes to European laws on customs agreements.
I start with this site: http://eur-lex.europa.eu/summary/chapter/customs.html?root_default=SUM_1_CODED=12
On this site, there are lots of places to click:
- the top bar (home, official journal, EU law and related documents, etc.)
- the section on the left (institutions and bodies, summaries of EU legislation, EuroVoc, etc.)
- the section on the right (My EUR-Lex, sign in, register, etc.)
- and the section I am interested in at the bottom (customs cooperation, custom controls, etc.)
In order to limit my crawl just to this section, I need to identify the root of the URL for these pages. They all start with the same string: http://eur-lex.europa.eu/summary/chapter/customs/
I can add this to the URL pattern field, finishing with a *. This gives me: http://eur-lex.europa.eu/summary/chapter/customs/*
Digimind will only open and track pages that start with this part of the URL.
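The effect of a pattern ending in * can be sketched in a few lines of Python. The sketch reads the pattern exactly as described above (a URL matches if it starts with the part before the *); the function name `matches_pattern` and the two sample page URLs are hypothetical and used only to show a match and a non-match.

```python
def matches_pattern(url, pattern):
    """Return True if the URL fits a pattern with an optional trailing *."""
    if pattern.endswith("*"):
        # Everything before the * must appear at the start of the URL.
        return url.startswith(pattern[:-1])
    return url == pattern


pattern = "http://eur-lex.europa.eu/summary/chapter/customs/*"

# Hypothetical URLs, for illustration only.
candidates = [
    "http://eur-lex.europa.eu/summary/chapter/customs/example-page.html",
    "http://eur-lex.europa.eu/example-homepage.html",
]
for url in candidates:
    print(url, matches_pattern(url, pattern))
```

Only the first URL starts with the root string, so it is the only one that would be kept under this reading of the pattern.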
The next section of the add website source page lets me specify when I want to be notified:
When I have made my choices, I can click on OK to start tracking this section of the website. Digimind will visit these pages once a week and let me know if there are any changes.
Track newsletters you receive, invisible web search engines and forums
For these kinds of sourcing, you need to contact our support team, so that they can set up your base correctly the first time you add this kind of source.
They will also be able to guide you on how to track these sources.