We have all browsed Stack Overflow looking for solutions to the problems that steal our good night's sleep. In this small post, we will go through how to collect Stack Overflow questions, their respective tags, and the URLs of the questions listed under a main tag. For this we will solely use Scrapy, a popular scraping framework which is based on Python, of course.
When I say main tag, I mean the tags you can find here: https://stackoverflow.com/tags
Out of those tags, let's choose one main tag, say 'python'. The real work begins after clicking on the python tag link: we see a list of questions related to it, which can be found here: https://stackoverflow.com/questions/tagged/python
So, let's start building a Scrapy project to collect those questions and save them in a CSV file. The versions I used here are Python 3.8 and Scrapy 2.2.1.
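In case Scrapy is not installed yet, a plain pip install is enough; pinning the exact version I used is optional:
$ pip install scrapy==2.2.1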
First step: Create a scrapy project
$ scrapy startproject stackcrawler
$ cd stackcrawler
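For orientation, the startproject command generates a skeleton that looks roughly like this (the files we will touch are settings.py, items.py and the spiders directory; stack_spider.py is the file we create in the fourth step):
stackcrawler/
    scrapy.cfg                # deploy configuration
    stackcrawler/
        __init__.py
        items.py              # item definitions (third step)
        middlewares.py
        pipelines.py
        settings.py           # project settings (second step)
        spiders/
            __init__.py
            stack_spider.py   # our spider (fourth step)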
Second step: Uncommenting some lines in settings.py (to avoid getting banned)
- Uncomment CONCURRENT_REQUESTS
- Uncomment HTTPCACHE_ related variables
- Uncomment AUTOTHROTTLE_ related variables
- Uncomment DEFAULT_REQUEST_HEADERS
- Keep ROBOTSTXT_OBEY set to True
This should do it, but keep in mind not to send too many requests to Stack Overflow here; a sketch of what the relevant part of settings.py can look like follows below.
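The exact values are up to you; the ones below are only illustrative defaults, assuming the standard settings generated by startproject:
# settings.py (excerpt) -- values are illustrative, adjust as needed
ROBOTSTXT_OBEY = True

# Keep concurrency modest so we do not hammer the site
CONCURRENT_REQUESTS = 8

# Send a sensible set of default headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Let Scrapy adapt the crawl speed to the server's responses
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Cache responses locally so re-runs do not hit the site again
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'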
Third step: Creating an Item class to represent the fields to store (items.py)
import scrapy


class StackSpiderItem(scrapy.Item):
    title = scrapy.Field()
    tags = scrapy.Field()
    urls = scrapy.Field()
Fourth step: Creating a spider file inside the spiders directory (spiders/stack_spider.py)
from scrapy.spiders import CrawlSpider

from stackcrawler.items import StackSpiderItem


class StackSpider(CrawlSpider):
    name = "stack"
    allowed_domains = ['stackoverflow.com']
    # 50 listing pages, 50 questions per page, sorted by newest
    start_urls = [
        f'https://stackoverflow.com/questions/tagged/python?page={page}&sort=newest&pagesize=50'
        for page in range(1, 51)
    ]

    def parse(self, response):
        # Each question on the listing page sits in a div with class "summary"
        questions = response.xpath('//div[@class="summary"]')
        for question in questions:
            item = StackSpiderItem()
            item['urls'] = question.xpath('h3/a/@href').extract()[0]
            item['title'] = question.xpath('h3/a/text()').extract()[0]
            item['tags'] = question.xpath('div[2]/a/text()').extract()
            yield item
This spider file is the main file we execute to get all the questions and their tags. In start_urls, you can see that I iterate through 50 pages and collect 50 questions from each of those pages.
The parse method is the callback Scrapy invokes for every response downloaded from start_urls. It contains the XPaths that reach the main container of the question summaries, here represented by questions. We then iterate over it and extract the details of each individual question. I do not want to go through identifying the XPaths here; I will leave that to you for a better understanding ✌.
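If you want to check those XPaths yourself before running the full crawl, the Scrapy shell is handy. This is just a quick, optional sketch; the selectors mirror the ones in the spider, and if Stack Overflow changes its markup they will need adjusting:
$ scrapy shell 'https://stackoverflow.com/questions/tagged/python'
>>> questions = response.xpath('//div[@class="summary"]')
>>> questions[0].xpath('h3/a/text()').get()         # title of the first question
>>> questions[0].xpath('h3/a/@href').get()          # relative URL of the first question
>>> questions[0].xpath('div[2]/a/text()').getall()  # its tags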
And finally, to get the results in CSV format, execute:
$ scrapy crawl stack -o python.csv
There are also other formats for saving the results. You can find them in the built-in exporters reference: https://docs.scrapy.org/en/latest/topics/exporters.html#built-in-item-exporters-reference .
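For example, changing the output file extension is enough to switch the exporter; these all ship with Scrapy:
$ scrapy crawl stack -o python.json  # JSON array
$ scrapy crawl stack -o python.jl    # JSON lines, one item per line
$ scrapy crawl stack -o python.xml   # XML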
I hope this was of some help in getting started with scraping and gave you a small glance at Scrapy. Please try it out, and if you encounter any problems, write to me in the comments and I would be more than happy to help you get through them.
Good day!