Web Scraping Using Scrapy and Deploying on Heroku

Hi guys,

Today I am sharing my experience and the code of a simple web crawler that uses Scrapy to scrape the following domains:

http://venturebeat.com/category/small-biz/

https://techcrunch.com/mobile/

http://www.infoworld.com/category/application-development/

The project is developed in the PyCharm IDE, stores the data in MongoDB, and is finally deployed to the Heroku Scheduler.

 

(1) Set up your PyCharm IDE environment

Refer to this [5].

Here’s my run/debug configuration. The project name is “caissSpider”.

[Screenshot: PyCharm run/debug configuration for the “caissSpider” project]
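One common trick, roughly what the linked answer [5] suggests, is to create a small runner script in the project root and point the PyCharm run/debug configuration at it. A minimal sketch (the file name runner.py is my own choice) looks like this:

from scrapy.cmdline import execute

# Run the spider exactly as "scrapy crawl caissSpider" would on the command line,
# so PyCharm's debugger can attach to the process
execute(['scrapy', 'crawl', 'caissSpider'])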

(2) The spider

There are lots of tutorials on Scrapy; one of the most useful projects for a beginner is the dirbot example [2].

Here’s my simple project structure.

[Screenshot: project structure]

Let’s study them briefly one by one.

2.1 items.py

from scrapy.item import Item, Field


class Website(Item):
    name = Field()
    description = Field()
    url = Field()
    img = Field()
    domain = Field()
    domainUrl = Field()

As you can see, this item class is like a collection in the database: it defines the fields of the data structure scraped from the web. Here, all the fields are simply stored as strings.
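For illustration, here is a minimal sketch of how such an item might be filled in; the values below are made up, only to show how the fields defined above are used:

from dirbot.items import Website

# Hypothetical example values, just to illustrate the fields defined above
item = Website()
item['name'] = "Some article title"
item['description'] = "A short summary of the article"
item['url'] = "http://example.com/article"
item['img'] = "http://example.com/thumbnail.jpg"
item['domain'] = "ExampleDomain"
item['domainUrl'] = "http://example.com/"

print(item['name'])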

2.2 settings.py

ITEM_PIPELINES = {'dirbot.pipelines.FilterWordsPipeline': 1}
SPIDER_MODULES = ['dirbot.spiders']
NEWSPIDER_MODULE = 'dirbot.spiders'
DEFAULT_ITEM_CLASS = 'dirbot.items.Website'

#Avoid <urlopen error timed out>
AWS_ACCESS_KEY_ID = ""
AWS_SECRET_ACCESS_KEY = ""

#MongoDB
MONGODB_SERVER = 'mongodb://<db username>:<db password>@<public url>:27017/test'
MONGODB_PORT = 27017
MONGODB_DB = 'test'
MONGODB_COLLECTION = 'news'

2.2.1. This file defines the settings of the spider. Note that you had better put

AWS_ACCESS_KEY_ID = ""
AWS_SECRET_ACCESS_KEY = ""

in order to avoid Scrapy’s “urlopen error timed out” errors.

2.2.2. Note that for MONGODB_SERVER you have to set up a database username and password for your MongoDB. Otherwise Scrapy will show a “user is empty” error.

2.2.3. Refer to this [3] to set the MongoDB username and password:

“This answer is for Mongo 3.2.1.

Terminal 1:

$ mongod --auth

Terminal 2 (in the mongo shell):

db.createUser({user:"admin_name", pwd:"1234", roles:["readWrite","dbAdmin"]})

If you want to add the user without roles (optional):

db.createUser({user:"admin_name", pwd:"1234", roles:[]})

To check whether you are authenticated or not:

db.auth("admin_name", "1234")

It should give you:

1”

2.2.4. For my spider, I store the data in the “news” collection of the “test” database.
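To double-check that the credentials and the connection string work before running the spider, a quick pymongo sketch like the one below can be run on its own (the URI placeholders are the same ones as in settings.py and must be filled in):

import pymongo

# Connect with the same URI format used in settings.py
client = pymongo.MongoClient('mongodb://<db username>:<db password>@<public url>:27017/test')
db = client['test']

# Count the documents currently stored in the "news" collection
print(db['news'].count())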

 

2.3 pipelines.py

from scrapy.exceptions import DropItem
from scrapy.conf import settings
import pymongo

class FilterWordsPipeline(object):
    """A pipeline for filtering out items which contain certain words in their
    description"""

    # put all words in lowercase
    words_to_filter = ['politics', 'religion']

    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

        #clean up previous one before new crawling
        self.collection.remove({})

    def process_item(self, item, spider):
        valid = True
        for word in self.words_to_filter:
            if word in unicode(item['description']).lower():
                valid = False
                raise DropItem("Contains forbidden word: %s" % word)

        if valid:
            self.collection.insert(dict(item))

        return item

The pipelines.py file aims to filter out sensitive words, clean up the current collection before storing data, and insert the newly scraped data into the collection.
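To make the filtering behaviour concrete, here is a tiny standalone sketch that mirrors the same word check without touching MongoDB (the sample descriptions are made up):

# Mirrors the word filter used in FilterWordsPipeline, without the MongoDB part
words_to_filter = ['politics', 'religion']

def is_valid(description):
    # An item is dropped as soon as its description contains a forbidden word
    return not any(word in description.lower() for word in words_to_filter)

print(is_valid("A new mobile app for small businesses"))  # True  -> kept
print(is_valid("A heated debate about politics"))         # False -> dropped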

2.4 requirements.txt

attrs==15.2.0
boto==2.39.0
cffi==1.6.0
cycler==0.10.0
enum34==1.1.2
futures==3.0.3
gunicorn==19.1.1
httplib2==0.9.2
idna==2.1
ipaddress==1.0.16
lxml==3.5.0
Scrapy==1.0.5

These are the packages needed to deploy to the Heroku Scheduler. Note that pymongo is also required by the pipeline, so make sure it is installed in the deployment environment as well.
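Locally, the same dependencies can be installed with pip (assuming a virtualenv is already active):

pip install -r requirements.txt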

 

2.5 scrapy.cfg

[settings]
default = dirbot.settings

 

2.6 setup.py

from setuptools import setup, find_packages

setup(
    name='dirbot',
    version='1.0',
    packages=find_packages(),
    entry_points={'scrapy': ['settings = dirbot.settings']},
)

I didn’t change anything in the above two files from the original dirbot project; they are simple setup scripts.

 

2.7 caissSpider.py

Finally, let’s look at the core spider file.

from scrapy.spiders import Spider
from dirbot.items import Website


class caissSpider(Spider):
    name = "caissSpider"
    allowed_domains = ["venturebeat.com", "techcrunch.com", "infoworld.com"]
    start_urls = [
        "http://venturebeat.com/category/small-biz/",
        "https://techcrunch.com/mobile/",
        "http://www.infoworld.com/category/application-development/",
    ]


    def parse(self, response):
        #(1)The number of articles we want to grab in each domain
        GRABNO = 3

        items = []

        #(2) For comparing and indicating which start_url that the response scraping from
        xpathSel = [
            # VentureBeat
            '//article',
            # Techcrunch
            '//li[@class="river-block "]//div[@class="block-content"]',
            # InfoWorld
            '//div[@class="river-well article"]',
        ]

        compareUrls = [
            "http://venturebeat.com/category/small-biz/",
            "https://techcrunch.com/mobile/",
            "http://www.infoworld.com/category/application-development/"
        ]

        sites = [
            "",
            "",
            "http://www.infoworld.com",
        ]

        domains = [
            "VentureBeat",
            "TechCrunch",
            "Infoworld"
        ]

        #extract the URL: str(response) looks like "<200 http://...>",
        #so slicing [5:-1] removes the leading "<200 " and the trailing ">"
        respon = str(response)[5:-1]

        #get the index of the URL in compareUrls
        indexes = [i for i, v in enumerate(compareUrls) if v == respon]
        index = indexes[0]

        #final xpath
        xpathReal = xpathSel[index]

        #(3) Process the response by xpath
        for i, sel in enumerate(response.xpath(xpathReal)):
            # print "sel url", sel.xpath('@href').extract()
            # print "sel desp", sel.xpath('text()').extract()

            if i < GRABNO:
                #Call the website item model
                item = Website()

                if domains[index] == "VentureBeat":
                    item['domain'] = "VentureBeat"
                    item['domainUrl'] = "http://venturebeat.com/category/small-biz/"
                    nameTmp = sel.xpath('//h2[@class="article-title"]/a/text()').extract()[i]
                    item['url'] = str(sel.xpath('//h2[@class="article-title"]/a/@href').extract()[i])
                    item['img'] = sel.xpath('//div[@class="article-media-thumbnail"]/a//img[@class="attachment-river-wide size-river-wide wp-post-image"]/@src').extract()[i]
                    despTmp = ""

                elif domains[index] == "TechCrunch":
                    item['domain'] = "TechCrunch"
                    item['domainUrl'] = "https://techcrunch.com/mobile/"
                    nameTmp = sel.xpath('//h2[@class="post-title"]/a/text()').extract()[i]
                    item['url'] = str(sel.xpath('//h2[@class="post-title"]/a/@href').extract()[i])
                    item['img'] = sel.xpath('//span[contains(@data-omni-sm-delegate, "gbl_river_image")]/a/img/@data-src').extract()[i]
                    despTmp = sel.xpath('//p[@class="excerpt"]/text()').extract()[i]

                else:
                    item['domain'] = "Infoworld"
                    item['domainUrl'] = "http://www.infoworld.com/category/application-development/"
                    nameTmp = sel.xpath('//div[@class="post-cont"]//h3/a/text()').extract()[i]
                    item['url'] = sites[index] + str(sel.xpath('//div[@class="post-cont"]//h3/a/@href').extract()[i])
                    item['img'] = sel.xpath('//figure[@class="well-img"]/a//img[contains(@class, "lazy carousel.idgeImage")]/@data-original').extract()[i]
                    despTmp = sel.xpath('//h4/text()').extract()[i]

                #Drop any non-ASCII characters if there are any
                try:
                    desp = despTmp.encode('ascii', 'ignore').decode('ascii')
                except:
                    desp = despTmp

                try:
                    name = nameTmp.encode('ascii', 'ignore').decode('ascii')
                except:
                    name = nameTmp

                item['description'] = desp
                item['name'] = name

                items.append(item)
            else:
                break

        return items

 

2.7.1. Note that start_urls contains the URLs that we want to scrape.

2.7.2. parse(self, response) will be called for each response from the URLs in start_urls, but the responses may not come back in the order the URLs are listed in start_urls. That is to say, the response from “infoworld” may arrive earlier than the one from “techcrunch”.

Thus, I am using simple parallel arrays (“xpathSel”, “compareUrls”, “sites”, “domains”) to keep the per-URL values aligned, and I look up the right index by comparing the response URL; a dictionary-based sketch of the same idea is shown below.
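As an alternative to the parallel arrays, the same lookup could be written as a single dictionary keyed by the start URL. This is only a sketch of the idea, not what the spider above actually uses:

# One dictionary keyed by start URL instead of four parallel arrays (sketch only)
SOURCES = {
    "http://venturebeat.com/category/small-biz/":
        {"xpath": '//article', "site": "", "domain": "VentureBeat"},
    "https://techcrunch.com/mobile/":
        {"xpath": '//li[@class="river-block "]//div[@class="block-content"]', "site": "", "domain": "TechCrunch"},
    "http://www.infoworld.com/category/application-development/":
        {"xpath": '//div[@class="river-well article"]', "site": "http://www.infoworld.com", "domain": "Infoworld"},
}

def lookup(response_url):
    # response.url gives the request URL directly, without parsing str(response)
    return SOURCES[response_url]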

2.7.3. For each site in start_urls, the spider will scrape the first GRABNO articles, which is 3 here.

2.7.4 XPath

XPath is an easy-to-learn path selector language for HTML/XML. While I am not an expert in XPath, here are some tips that I can share.

1) “/” selects a direct child of the current node.

2) “//” selects matching nodes anywhere below the current node, no matter how deep.

3) “[@class=...]” is useful when a plain “//.../” path doesn’t work. For example, when “//h2/a/text()” doesn’t work, you should try this:

//h2[@class="<some class name>"]/a/text()

4) “text()” grabs the text and “@href” grabs the URL. Remember to put “/” before them in some cases.

5) If you want to select a node whose class attribute only partially matches, use “[contains(@class, "<some words>")]” instead of “[@class="<full words>"]”.

To conclude, XPath is not hard to learn. You can figure it out quickly by scraping something, printing it out, and analyzing it; the interactive scrapy shell (shown below) is handy for exactly that.
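For example, a quick way to experiment with selectors is the interactive scrapy shell; the XPath below is the VentureBeat title selector from the spider above:

scrapy shell "http://venturebeat.com/category/small-biz/"

>>> # Try the selector interactively and inspect the first few results
>>> response.xpath('//h2[@class="article-title"]/a/text()').extract()[:3]
>>> response.xpath('//h2[@class="article-title"]/a/@href').extract()[:3]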

 

(3) Results in the MongoDB

The fields of the documents saved in MongoDB are exactly the same as those defined in items.py.

[Screenshot: documents in the “news” collection in MongoDB]

(4) Commands

To run the spider, there are three options; the latter two will also dump the results to an output file.

scrapy crawl caissSpider

To get csv file
scrapy crawl caissSpider -o caissSpider.csv -t csv

To get json file
scrapy crawl caissSpider -o caissSpider.json -t json

My .json file result:

[{"domain": "TechCrunch", "description": "Immediately, a startup that built mobile sales tools, will be shutting down at the end of the month, while part of the team will moveon to cloud-monitoringcompany New Relic.\nAccording to the official announcement, both the Immediately app (whichoffers a number of features, including phone logging and detecting when emails are opened) and the companys Gong app(a pocket ", "img": "https://tctechcrunch2011.files.wordpress.com/2015/05/immediately.jpg?w=210&h=158&crop=1&quality=85&strip=all", "url": "https://techcrunch.com/2016/07/15/immediately-shuts-down/", "domainUrl": "https://techcrunch.com/mobile/", "name": "Sales startup Immediately will shut down as team members join NewRelic"},
{"domain": "TechCrunch", "description": "New data released this morning on the mobile phenomenon Pokmon Go shows that the popular game isnt only the biggest in U.S. history  its also breaking records when itcomes to its ability to monetize and retain its users, as well. According to a report from SurveyMonkey,Pokmon Go is seeing retention rates atmore than double the industry average, and is pulling ", "img": "https://tctechcrunch2011.files.wordpress.com/2016/07/pokemongo-bulbasaur.jpg?w=210&h=158&crop=1&quality=85&strip=all", "url": "https://techcrunch.com/2016/07/15/encrypted-comms-company-silent-circle-closes-50m-series-c/", "domainUrl": "https://techcrunch.com/mobile/", "name": "Encrypted comms company Silent Circle closes $50M SeriesC"},
{"domain": "VentureBeat", "description": "", "img": "https://venturebeat.com/wp-content/uploads/2016/06/Europe-285x180.jpg", "url": "http://venturebeat.com/2016/07/15/marlin-equity-partners-closes-360-million-pot-its-first-dedicated-fund-for-europe/", "domainUrl": "http://venturebeat.com/category/small-biz/", "name": "Marlin Equity Partners closes $360 million pot, its first dedicated fund forEurope"},
{"domain": "TechCrunch", "description": "As President Obama approaches the end of his tenure in the White House, his team is launching a wireless networking research project that it hopes could be part of his wider legacy in the world of tech. Today, the Obama administration announced the Advanced Wireless Research Initiative, a group backed by $400 million in investment that will work on research aimed to maintain U.S. ", "img": "https://tctechcrunch2011.files.wordpress.com/2016/07/gettyimages-528840251.jpg?w=210&h=158&crop=1&quality=85&strip=all", "url": "https://techcrunch.com/2016/07/15/pokemon-gos-retention-rates-average-revenue-per-user-are-double-the-industry-average/", "domainUrl": "https://techcrunch.com/mobile/", "name": "Pokmon Gos retention rates, average revenue per user are double the industryaverage"},
{"domain": "VentureBeat", "description": "", "img": "https://venturebeat.com/wp-content/uploads/2016/07/Civil-Maps-Autonomous-Vehicle-285x180.jpg", "url": "http://venturebeat.com/2016/07/15/mapping-startup-civil-maps-raises-6-6m-from-ford-and-others-to-accelerate-self-driving-car-smarts/", "domainUrl": "http://venturebeat.com/category/small-biz/", "name": "Mapping startup Civil Maps raises $6.6M from Ford and others to accelerate self-driving carsmarts"},
{"domain": "Infoworld", "description": "The experimental language uses Go's underlying toolchain to deliver features that Google's language doesn't have yet", "img": "http://core0.staticworld.net/images/article/2015/09/20150918-att-logo-100615497-carousel.idge.jpg", "url": "http://www.infoworld.com/article/3095806/application-development/oden-uses-googles-golang-ecosystem-to-cook-up-a-new-language.html", "domainUrl": "http://www.infoworld.com/category/application-development/", "name": "Oden uses Google's Golang ecosystem to cook up a new language"},
{"domain": "VentureBeat", "description": "", "img": "https://venturebeat.com/wp-content/uploads/2016/05/wp-1464688609597-285x180.jpg", "url": "http://venturebeat.com/2016/07/13/findo-flint-capital-ai-search-infobesity/", "domainUrl": "http://venturebeat.com/category/small-biz/", "name": "Findo raises $4 million to solve your infobesity headache with A.I. and a smart searchassistant"},
{"domain": "Infoworld", "description": "The companies want to make it easier for enterprises to take IoT deployments from concept to reality ", "img": "http://core1.staticworld.net/images/article/2016/07/linux_course-100671653-carousel.idge.jpg", "url": "http://www.infoworld.com/article/3095493/internet-of-things/ibm-and-att-are-cozying-up-to-iot-developers.html", "domainUrl": "http://www.infoworld.com/category/application-development/", "name": "IBM and AT&T are cozying up to IoT developers"},
{"domain": "Infoworld", "description": "Packed with over 100 hours of instruction, this bundle will help you dive into Linux, the popular open-source operating system, and is currently on sale over 90% off its typical price.", "img": "http://core4.staticworld.net/images/article/2016/07/kick-100671483-carousel.idge.jpg", "url": "http://www.infoworld.comhttp://www.idganswers.com/question/29342/what-is-google-s-tensorflow-and-how-is-it-used-in-ai#src=ifw", "domainUrl": "http://www.infoworld.com/category/application-development/", "name": "What is Googles TensorFlow and how is it used in AI?"}]

 

(5) Heroku Deploy

We can easily set up a free Heroku Scheduler as illustrated in the image below. Since a web spider generates a lot of traffic and AWS charges by usage, deploying on Heroku will save some bucks if your spider only runs once per day.

First, we need to create a Procfile to tell Heroku which command runs this application. So create a Procfile and put the following command in it:

web: scrapy crawl caissSpider

Then we can set up a free Scheduler in Heroku.

[Screenshot: Heroku Scheduler configuration]

The scheduler will run the spider daily at 12:00 AM and push the results to the database you set up before.

(6) Post to your website

[Screenshot: the scraped articles posted on the website]

 

Congratulations! You now know how to write a web spider and deploy it as a product. Enjoy Scrapy and XPath!

 

Reference:

  1. http://doc.scrapy.org/en/latest/intro/tutorial.html
  2. https://github.com/scrapy/dirbot
  3. http://stackoverflow.com/questions/4881208/how-to-put-username-password-in-mongodb
  4. http://www.w3schools.com/xsl/xpath_syntax.asp
  5. http://stackoverflow.com/questions/21788939/how-to-use-pycharm-to-debug-scrapy-projects

 
