This is a simple guide to crawling news content from an authenticated ASP.NET website using Scrapy.
Scrapy is an application framework for crawling websites and extracting structured data, which can be used for a wide range of applications such as data mining, information processing, or historical archival.
1. Find the login URL
The first thing we need to do is find the URL of the login page. The URL has a structure like:
2. Find the archive links
Usually the archive page for a given date lists all the news articles with their titles and links, but shows only a preview of each article:
Simple crawl strategy:
+ crawl the links to all articles
+ crawl the article contents by following the crawled links
The URL structure for a date may look like:
We can use pandas to generate the range of dates we are interested in and substitute each date into the archive link. This gives us the URLs we need:
Here start_date and end_date should be datetime objects; pandas' date_range also accepts date strings such as "2015-01-01".
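A minimal sketch of the URL generation, assuming a hypothetical archive pattern with the date passed as a `date=YYYY-MM-DD` query parameter (the real site may format it differently):

```python
from datetime import datetime

import pandas as pd

# Hypothetical archive URL pattern; replace with the real site's structure.
ARCHIVE_URL = "https://news.example.com/Archive.aspx?date={date}"

start_date = datetime(2015, 1, 1)
end_date = datetime(2015, 1, 5)

# date_range() yields one Timestamp per day, both endpoints included.
dates = pd.date_range(start_date, end_date)

# Format each date the way the (assumed) URL pattern expects it.
urls = [ARCHIVE_URL.format(date=d.strftime("%Y-%m-%d")) for d in dates]

for url in urls:
    print(url)
```

The resulting list can be assigned to the spider's `start_urls`.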
3. Locate the XPath of the information we want to store
We can use the Firebug plugin in Firefox to locate the element:
We can also copy the XPath directly from Firebug:
If the copied XPath does not work once it is plugged into the parsing code, we will need to locate the element ourselves using a Selector:
There are two situations to consider when locating an element:
A. If the HTML tag has identifiers
For example, for <span class="textMed">some text</span> or <span id="middle block">some text</span>, we can select the content with:
B. If the HTML tag has no identifiers
Luckily, HTML is hierarchical, so we can use a selector list to locate the tag. For example, given <div><div>some text</div></div>, we want to select the inner div tag: