Web Scraping with Scrapy in Python (88/100 Days of Python)

Martin Mirakyan
Mar 30, 2023


Day 88 of the “100 Days of Python” blog post series covering web scraping with Scrapy

Web scraping is the process of extracting data from websites. It can be used for various purposes, including data mining, research, and analysis. Scrapy is a powerful and flexible web scraping framework written in Python.

Install Scrapy

First, you need to install Scrapy. You can do this using pip, the Python package manager. Open your command prompt or terminal and run the following command:

pip install scrapy

Create a Scrapy Project

Once Scrapy is installed, you can create a new Scrapy project. To create a new project, run the following command in your command prompt or terminal:

scrapy startproject myproject

This will create a new directory called myproject, which contains the basic structure of a Scrapy project.
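The generated layout typically looks like this (the exact boilerplate files may vary slightly between Scrapy versions):

```
myproject/
    scrapy.cfg            # deploy configuration
    myproject/            # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where your spiders live
            __init__.py
```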

Create a Spider

Spiders are the main component of Scrapy. A spider is a Python class that defines how to crawl a website and extract data from it. To create a new spider, go to your project directory and run the following command:

cd myproject
scrapy genspider myspider example.com

This will create a new spider called myspider, which is set up to crawl the website example.com. You can edit the spider code to define how to extract data from the website.

Define the Spider Rules

In the spider code, you need to define how to crawl the website. You do this with the start_urls attribute and the parse method: start_urls lists the initial URLs to crawl, while parse defines how to extract data from each downloaded page:

import scrapy


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response, **kwargs):
        title = response.css('title::text').extract_first()
        yield {'title': title}

Run the Spider

Once you’ve defined the spider rules, you can run the spider using the following command:

scrapy crawl myspider

This will start the spider and crawl the website. The scraped items will be printed in the console log output.

Save the Data

If you want to save the extracted data to a file, you can use Scrapy’s built-in feed exports. The quickest way is to pass the -o option when running the spider:

scrapy crawl myspider -o data.json

Alternatively, you can configure the export in your settings.py file. Recent Scrapy versions (2.1+) use the FEEDS setting; the older FEED_FORMAT/FEED_URI pair still works but is deprecated:

FEEDS = {
    'data.json': {'format': 'json'},
}

Either way, the extracted data will be saved to a file called data.json.

Advanced Scraping

Scrapy offers many advanced features for web scraping, including support for cookies, user agents, and proxies. You can also use Scrapy with other Python libraries, such as BeautifulSoup and Pandas, to further process and analyze the extracted data.

Crawling quotes.toscrape.com With Scrapy

We can extract information from a web page based on its HTML. As an example, let’s have a look at Quotes to Scrape which has many quotes from different people along with some attributes. This is a great page where we can practice scraping. If you navigate to the source of the page, you can find its HTML and all the attributes defined in it. You can do that by right-clicking on the content of the page and clicking on the View Page Source button.

After opening the page source, you will notice that every quote sits inside a <div> element with a class="quote" attribute. That gives us a single CSS selector we can use to loop through all the quotes with Scrapy.

So, the Spider will look something like this:

import scrapy


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response, **kwargs):
        for quote in response.css('.quote'):
            yield {
                'text': quote.css('.text::text').extract_first(),
                'author': quote.css('.author::text').extract_first(),
            }

It selects all the <div> elements with the quote class and, for each one, yields a dictionary whose values are extracted from the surrounding HTML.

In this example, we have extracted the text of the quote and its author. If you look at the HTML of the page, the author is always defined inside a <small> element with a class attribute author, while the quote itself is inside a <span> with a class attribute text. The suffix ::text is added to make sure we extract the actual text inside the element and not the element itself.

After running the code with scrapy crawl myspider (with the feed export configured), you’ll get a nice JSON file with all the quotes on the page:

[
{"text": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”", "author": "Albert Einstein"},
{"text": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”", "author": "J.K. Rowling"},
{"text": "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”", "author": "Albert Einstein"},
{"text": "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”", "author": "Jane Austen"},
{"text": "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", "author": "Marilyn Monroe"},
{"text": "“Try not to become a man of success. Rather become a man of value.”", "author": "Albert Einstein"},
{"text": "“It is better to be hated for what you are than to be loved for what you are not.”", "author": "André Gide"},
{"text": "“I have not failed. I've just found 10,000 ways that won't work.”", "author": "Thomas A. Edison"},
{"text": "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", "author": "Eleanor Roosevelt"},
{"text": "“A day without sunshine is like, you know, night.”", "author": "Steve Martin"}
]
