Crawl News Archive Using Scrapy

This is a simple guide to crawling news content from an authenticated ASP.NET website using Scrapy.

Scrapy is an application framework for crawling web sites and extracting structured data, which can be used for a wide range of useful applications such as data mining, information processing, or historical archival.

1. Find out the login URL

The first thing we need to do is find the URL of the login page; its structure looks like:

https://www.abc.com/Interactivex/Default.aspx
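Before writing the spider, it can help to confirm which fields the login form expects, for example with scrapy shell. The XPath below simply lists the form's input names; ASP.NET pages typically include hidden fields such as __VIEWSTATE, which FormRequest.from_response in step 4 fills in automatically:

    # Run: scrapy shell https://www.abc.com/Interactivex/Default.aspx
    # then inspect the login form to see which fields need to be submitted.
    response.xpath('//form/@action').extract()        # where the form posts to
    response.xpath('//form//input/@name').extract()   # field names, including hidden ones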

2. Find out the archive links

Usually the archive page for a given date lists all the news articles with titles and links, but shows only a preview of each article:

[screenshot: archive page listing article titles and preview links]

Simple crawl strategy (a short code sketch follows the example URL below):

+ crawl the links for all articles
+ crawl the article contents based on the crawled links

The URL structure for a given date may look like:

https://www.abc.com/interactivex/archive.aspx?Industry=0,18,3&TimePeriod=0&Refreshed=1&ArticleType=0&BeginDate=01/01/2005
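The strategy above maps naturally onto two callbacks: one that parses an archive page and yields a request per article link, and one that parses the article itself. Here is a minimal sketch under that assumption; the spider name, the link XPath //a[@class="headline"]/@href, and parse_article are illustrative placeholders, not the real site's structure, and login handling is omitted until step 4:

    import scrapy

    class ArchiveSketchSpider(scrapy.Spider):
        # Minimal sketch of the two-stage strategy. The XPath expressions are
        # assumptions to be replaced with the real paths found in step 3.
        name = 'archive_sketch'
        start_urls = [
            'https://www.abc.com/interactivex/archive.aspx?Industry=0,18,3'
            '&TimePeriod=0&Refreshed=1&ArticleType=0&BeginDate=01/01/2005',
        ]

        def parse(self, response):
            # Stage 1: collect the article links listed on the archive page.
            for href in response.xpath('//a[@class="headline"]/@href').extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse_article)

        def parse_article(self, response):
            # Stage 2: extract the content of each individual article.
            yield {
                'url': response.url,
                'title': response.xpath('//title/text()').extract_first(),
            }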

We can use pandas to generate the range of dates we are interested in and substitute each date into the archive link, generating the URLs we need:

    import pandas as pd

    def generate_urls_on_date_range(start_date, end_date):
        # Build one archive URL per day between start_date and end_date (inclusive).
        datelist = pd.date_range(start_date, end_date).tolist()
        dates = [dat.strftime('%m/%d/%Y') for dat in datelist]

        urls = []
        for dat in dates:
            urls.append('https://www.abc.com/interactivex/archive.aspx?Industry=0,18,3&TimePeriod=0&Refreshed=1&ArticleType=0&BeginDate=' + dat)
        return urls

The start_date and end_date need to be of type datetime, since date_range is a pandas function.
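For example, a month of archive URLs (the dates here are arbitrary) can be generated like this:

    from datetime import datetime

    # Arbitrary example range: January 2005, one URL per day (31 in total).
    start_urls = generate_urls_on_date_range(datetime(2005, 1, 1),
                                             datetime(2005, 1, 31))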

3. Locate the XPath of the information we want to store

We can use the Firebug plugin in Firefox to locate the element:

[screenshot: inspecting the element with Firebug]

We can also copy the XPath of the element directly from Firebug:

[screenshot: copying the XPath from Firebug]

If the copied XPath does not parse the page correctly after filling it into the following code, we will need to locate the element manually using a Selector:

    def parse(self, response):
        sel = Selector(response)
        sel_list = sel.xpath('The copied XPath for the element')
        for info in sel_list:
            yield {
                'info': info.extract()
            }
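Before relying on the copied XPath, it can be tested quickly in scrapy shell. The path used here is only a made-up example of what Firebug might copy, and pages behind the login may not render fully in the shell, so treat this as a rough check:

    # Inside `scrapy shell '<archive url>'`
    response.xpath('//*[@id="ctl00_content"]/span/text()').extract()  # hypothetical copied path
    # An empty list means the copied XPath does not match the raw HTML,
    # so we locate the element manually as described below.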

Here are two situations for finding the element location:

A. If there are identifiers on the HTML tag

For example, given <span class="textMed">some text</span> or <span id="middle block">some text</span>, we can select the content with:

    def parse(self, response):
        sel = Selector(response)
        first_info = sel.xpath('//span[@class="textMed"]/text()')
        second_info = sel.xpath('//span[@id="middle block"]/text()')
        yield {
            'first_info': first_info.extract(),
            'second_info': second_info.extract()
        }
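As an aside, the same attributes can be matched with Scrapy's CSS selectors, which some find easier to read; this is an equivalent alternative, not an extra step:

    # Equivalent selection with CSS selectors instead of XPath.
    first_info = sel.css('span.textMed::text')
    second_info = sel.css('span[id="middle block"]::text')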

B. If there are no identifiers on the HTML tag

Luckily, HTML is hierarchical, so we can use a selector list to locate the tag. For example, given <div><div>some text</div></div>, suppose we want to select the inner div tag:

    def parse(self, response):
        sel = Selector(response)
        sel_list = sel.xpath('//div')             # selector list for all div tags
        inner_div = sel_list.xpath('div/text()')  # relative XPath: text of the nested div
        yield {
            'info': inner_div.extract()
        }

There are many other ways to do this; for more details, check the Scrapy Selectors documentation.

4. Sample structure

Now we can generate the links for start_urls and fill in the XPath locations we are interested in inside the parse function, using the following code block:

    from scrapy.spiders import Spider
    from scrapy.http import Request, FormRequest
    from scrapy.selector import Selector


    class LoginSpider(Spider):
        name = 'abccrawler'
        allowed_domains = ['abc.com']
        login_page = 'https://www.abc.com/InteractiveX/default.aspx'

        # start_urls is the list of links that we want to crawl,
        # e.g. the output of generate_urls_on_date_range()
        start_urls = []

        def start_requests(self):
            return self.init_request()

        def init_request(self):
            # Request the login page first so we can submit the login form.
            return [Request(url=self.login_page, callback=self.login)]

        def login(self, response):
            return FormRequest.from_response(
                response,
                formdata={'username': 'dummy', 'password': 'dummy'},
                callback=self.check_login_response)

        def check_login_response(self, response):
            # "Sign out" is one possibility in the response page; "Log out" is also possible
            if 'Sign out' in response.body:
                self.log('Login successful!')
                for url in self.start_urls:
                    yield self.make_requests_from_url(url)
            else:
                self.log('Could not log in...')

        def parse(self, response):
            sel = Selector(response)
            # identify the information we want to crawl
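The body of parse depends on the XPaths located in step 3. A possible sketch is shown below; the class and id values are placeholders, not the real site's markup:

    def parse(self, response):
        sel = Selector(response)
        # Placeholder XPaths: replace with the ones located in step 3.
        yield {
            'url': response.url,
            'title': sel.xpath('//span[@class="textMed"]/text()').extract(),
            'body': sel.xpath('//div[@id="articleBody"]//text()').extract(),
        }

Finally, the crawl can be run and its output saved to JSON with:

    scrapy crawl abccrawler -o articles.json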