Web Scraping with Scrapy in Python (88/100 Days of Python)
Web scraping is the process of extracting data from websites. It can be used for various purposes, including data mining, research, and analysis. Scrapy is a powerful and flexible web scraping framework written in Python.
Install Scrapy
First, you need to install Scrapy. You can do this using pip, the Python package manager. Open your command prompt or terminal and run the following command:
pip install scrapy
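You can verify the installation by asking Scrapy for its version:

scrapy version

If everything went well, this prints the installed Scrapy version.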
Create a Scrapy Project
Once Scrapy is installed, you can create a new Scrapy project. To create a new project, run the following command in your command prompt or terminal:
scrapy startproject myproject
This will create a new directory called myproject, which contains the basic structure of a Scrapy project.
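The generated layout looks roughly like this (the exact files can vary slightly between Scrapy versions):

myproject/
    scrapy.cfg            # deploy configuration file
    myproject/            # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where your spiders live
            __init__.py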
Create a Spider
Spiders are the main component of Scrapy. A spider is a Python class that defines how to crawl a website and extract data from it. To create a new spider, go to your project directory and run the following command:
cd myproject
scrapy genspider myspider example.com
This will create a new spider called myspider, which is set up to crawl the website example.com. You can edit the spider code to define how to extract data from the website.
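The generated file (myproject/spiders/myspider.py) contains a minimal skeleton along these lines; the exact template varies between Scrapy versions:

import scrapy


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    def parse(self, response):
        # Extraction logic goes here
        pass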
Define the Spider Rules
In the spider code, you need to define how the website is crawled. You do this with the start_urls attribute and the parse method. The start_urls attribute lists the initial URLs to crawl, while the parse method defines how to extract data from each response:
import scrapy


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response, **kwargs):
        # Grab the text content of the page's <title> element
        title = response.css('title::text').extract_first()
        yield {'title': title}
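Before committing selectors to a spider, you can experiment with them in Scrapy's interactive shell, which fetches a page and drops you into a Python session with the response object ready to use:

scrapy shell 'https://example.com'
>>> response.css('title::text').extract_first()
'Example Domain'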
Run the Spider
Once you’ve defined the spider rules, you can run the spider using the following command:
scrapy crawl myspider
This will start the spider and crawl the website. The extracted data will be printed to the console.
Save the Data
If you want to save the extracted data to a file, you can use Scrapy's built-in Feed Exporter. To do this, add the following lines to your settings.py file:

FEED_FORMAT = 'json'
FEED_URI = 'data.json'

This will save the extracted data to a file called data.json.
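Note that FEED_FORMAT and FEED_URI are deprecated in Scrapy 2.1 and later in favor of the FEEDS setting, which maps output paths to exporter options:

FEEDS = {
    'data.json': {'format': 'json'},
}

Alternatively, you can skip the settings entirely and pass an output file on the command line: scrapy crawl myspider -O data.json (uppercase -O overwrites the file, lowercase -o appends to it).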
Advanced Scraping
Scrapy offers many advanced features for web scraping, including support for cookies, user agents, and proxies. You can also use Scrapy with other Python libraries, such as BeautifulSoup and Pandas, to further process and analyze the extracted data.
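For example, a few of the standard settings you might tune in settings.py look like this (the user-agent string below is just an illustration):

# Identify your crawler to the sites you visit
USER_AGENT = 'myproject (+https://example.com)'

# Respect robots.txt rules
ROBOTSTXT_OBEY = True

# Wait one second between requests to avoid overloading the server
DOWNLOAD_DELAY = 1

# Disable cookies if the site works without them
COOKIES_ENABLED = False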
Crawling quotes.toscrape.com With Scrapy
We can extract information from a web page based on its HTML. As an example, let's have a look at Quotes to Scrape, which lists quotes from different people along with some attributes. It's a great page for practicing scraping. If you navigate to the source of the page, you can find its HTML and all the attributes defined in it. You can do that by right-clicking on the content of the page and choosing View Page Source.
After opening the source of the page, you'll notice that every quote sits inside a <div> element with a class="quote" attribute. This lets us loop through all the quotes with Scrapy.
So, the Spider will look something like this:
import scrapy


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response, **kwargs):
        # Each quote on the page lives in a <div class="quote"> element
        for quote in response.css('.quote'):
            yield {
                'text': quote.css('.text::text').extract_first(),
                'author': quote.css('.author::text').extract_first(),
            }
It selects all the div elements with the quote class (the .quote CSS selector) and yields a dictionary for each one, with values extracted from its HTML.
In this example, we have extracted the text of the quote and its author. If you look at the HTML, the author is always defined inside a <span> with the class attribute author, and the quote itself is inside a <span> with the class attribute text. The ::text suffix is added to make sure we extract the actual text inside the <span> and not the <span> element itself.
After running the spider with scrapy crawl myspider, you'll get a nice JSON file (data.json) with all the quotes from the page:
[
{"text": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”", "author": "Albert Einstein"},
{"text": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”", "author": "J.K. Rowling"},
{"text": "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”", "author": "Albert Einstein"},
{"text": "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”", "author": "Jane Austen"},
{"text": "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", "author": "Marilyn Monroe"},
{"text": "“Try not to become a man of success. Rather become a man of value.”", "author": "Albert Einstein"},
{"text": "“It is better to be hated for what you are than to be loved for what you are not.”", "author": "André Gide"},
{"text": "“I have not failed. I've just found 10,000 ways that won't work.”", "author": "Thomas A. Edison"},
{"text": "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", "author": "Eleanor Roosevelt"},
{"text": "“A day without sunshine is like, you know, night.”", "author": "Steve Martin"}
]
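Note that this spider only scrapes the first page. Quotes to Scrape paginates its results, and each page links to the next one through an <a> element inside <li class="next">. A small extension (sketched here against the site's current markup) replaces the parse method above with a version that follows that link and handles the next page with the same callback:

    def parse(self, response, **kwargs):
        for quote in response.css('.quote'):
            yield {
                'text': quote.css('.text::text').extract_first(),
                'author': quote.css('.author::text').extract_first(),
            }
        # Follow the "Next" link, if there is one, and repeat
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)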
What’s next?
- If you found this story valuable, please consider clapping multiple times (this really helps a lot!)
- Hands-on Practice: Free Python Course
- Full series: 100 Days of Python
- Previous topic: Mocking and Fixtures in Python
- Next topic: Working with Excel Sheets and CSV Files Using Pandas for Data Processing