Asynchronous Web Scraping with Asyncio

04:15 PM - 05:15 PM on August 16, 2014, Room 701

Bugra Akyildiz

Audience level:
intermediate
Category:
Web Development

Description

Asynchronous programming can be loosely defined as a style in which a program performs other tasks while it waits for an I/O operation to complete. This is an important advantage because I/O is slow, and the CPU can work on other tasks while the I/O is in progress. Not only is the time spent waiting for I/O put to use, but tasks that are not I/O-bound can also be done efficiently.

By default, an I/O operation blocks the program in Python. To remove this disadvantage and provide asynchronous programming capabilities, the new asyncio (asynchronous I/O) module was introduced to the standard library in Python 3.4.
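As a rough illustration of the idea, here is a minimal sketch of three simulated I/O operations running concurrently on the asyncio event loop. It assumes a recent Python and uses the async/await syntax added in Python 3.5; in Python 3.4 itself, coroutines were written as generators with `@asyncio.coroutine` and `yield from`. The names `fake_io` and `run_all` are hypothetical, and `asyncio.sleep` merely stands in for a real I/O wait.

```python
import asyncio
import time


async def fake_io(name, delay):
    # asyncio.sleep stands in for a slow I/O operation (hypothetical workload).
    # While this coroutine waits, the event loop is free to run the others.
    await asyncio.sleep(delay)
    return name


async def run_all():
    # gather runs the three coroutines concurrently and returns
    # their results in the order the coroutines were passed in.
    return await asyncio.gather(
        fake_io("a", 0.2),
        fake_io("b", 0.2),
        fake_io("c", 0.2),
    )


if __name__ == "__main__":
    start = time.perf_counter()
    results = asyncio.run(run_all())
    elapsed = time.perf_counter() - start
    print(results, "took {:.2f}s".format(elapsed))
```

Because the three waits overlap, the total running time should be close to the longest single wait rather than the sum of all three.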

Abstract

Web scraping can be considered an I/O-bound task, as the program needs to wait for the server to respond before it can process the web page. However, parsing and extracting structured data from the raw HTML is the most time-consuming part, whereas retrieving the web pages, although I/O-bound, is not. Therefore, asynchronous I/O is a natural fit for scraping and processing web pages: while waiting for other pages to be retrieved, one can parse and extract the relevant parts of the pages that have already arrived.
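The pattern described above can be sketched as follows. This is only an illustration under stated assumptions, not the code used in the tutorial: the `FAKE_PAGES` table and the `fetch` coroutine are hypothetical stand-ins that simulate network latency with `asyncio.sleep` instead of making real HTTP requests, and `TitleParser`, `parse_title`, and `scrape` are names made up for this example. Parsing uses the standard library's `html.parser`.

```python
import asyncio
from html.parser import HTMLParser

# Hypothetical stand-in for the network: URL -> (delay in seconds, raw HTML).
FAKE_PAGES = {
    "http://example.com/a": (0.05, "<html><head><title>Page A</title></head></html>"),
    "http://example.com/b": (0.10, "<html><head><title>Page B</title></head></html>"),
}


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def parse_title(html):
    # Synchronous parsing step: extract structured data from raw HTML.
    parser = TitleParser()
    parser.feed(html)
    return parser.title


async def fetch(url):
    # Simulated download; a real scraper would await an HTTP client here.
    delay, html = FAKE_PAGES[url]
    await asyncio.sleep(delay)
    return url, html


async def scrape(urls):
    titles = {}
    # as_completed yields each download as soon as it finishes, so a page
    # is parsed while the remaining downloads are still in flight.
    for next_done in asyncio.as_completed([fetch(u) for u in urls]):
        url, html = await next_done
        titles[url] = parse_title(html)
    return titles


if __name__ == "__main__":
    print(asyncio.run(scrape(list(FAKE_PAGES))))
```

Note that `parse_title` runs on the event loop's thread, so a very expensive parse would delay other coroutines; this sketch keeps it inline for simplicity.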

In this tutorial, I will introduce some asynchronous programming concepts and show their capabilities and advantages. Then, I will introduce the asyncio module and provide an overview of it, focusing on coroutines and event loops. In the demonstration and hands-on session, we will scrape web pages from different sources and process them to get (semi-)structured data, all asynchronously.