Stealthy Scraping: A Guide to Proxy Integration with Scrapy-Playwright

Web scraping demands a careful balance between efficiency and anonymity. Scrapy-Playwright, which combines Scrapy with Playwright's browser automation, makes proxy integration straightforward. In this guide, we dive into the technical details of integrating rotating proxies into Scrapy-Playwright to protect your identity and keep data extraction running smoothly.

To understand how this proxy rotation works, we need to know how Playwright operates and, in particular, the difference between a page and a context: a context is an isolated browser session (think of an incognito window with its own cookies and settings), while a page is a single tab living inside a context.

As we've seen before, integrating Playwright with Scrapy is a fairly simple process: you add the "playwright" key to the request's meta dictionary, and scrapy-playwright renders the request in a browser page (a tab) that belongs to a context (an isolated session); unless you say otherwise, requests share a single default context. A minimal setup looks like the sketch below.
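As a quick refresher, here is roughly what that baseline looks like. The download handler and reactor settings are the ones documented by scrapy-playwright; the spider name and start URL are just placeholders.

    # settings.py -- route HTTP(S) downloads through scrapy-playwright
    DOWNLOAD_HANDLERS = {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    }
    TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

    # spider.py -- a plain request rendered through Playwright
    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        start_urls = ["https://example.com"]

        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(url, meta={"playwright": True})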
Once a context is open, its IP address and other parameters are fixed for the lifetime of that context and won't change; that is also what makes Playwright somewhat more resource-hungry when combined with proxy rotation.
The trick to rotating proxies is therefore to open a new context for each request (or to decide on a fixed number of requests per IP). That way we get one context per proxy, preferably through a rotating proxy endpoint. See the request example code below:

    def start_requests(self):
        # Counter used to build a unique context name and proxy session ID
        # per request; the starting value is arbitrary.
        i = 330
        for url in self.start_urls:
            i += 1
            yield scrapy.Request(
                url=url,
                headers=self.iheaders,  # custom headers defined elsewhere on the spider
                meta={
                    # Render this request through Playwright.
                    "playwright": True,
                    # A uniquely named context, so a fresh browser session
                    # (and therefore a fresh proxy) is created for it.
                    "playwright_context": f"new{i}",
                    # Passed to browser.new_context(); this is where the
                    # proxy for that context is configured.
                    "playwright_context_kwargs": {
                        "proxy": {
                            "server": "[proxy provider endpoint]",
                            "username": f"[proxy username]-session-{i}",
                            "password": "[proxy password]",
                        }
                    },
                },
            )

In the example above, we create a new context for each request (named "new331", "new332", and so on) and give it proxy credentials, so every request goes out through a new proxy. However, if the site's protection is strong and you keep reusing the same context for further pages, it may get blocked after a few requests, so plan your requests-per-IP budget accordingly.
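If you want to release a context (and its proxy session) as soon as you are done with it, scrapy-playwright can hand the Playwright page object to your callback so you can close it explicitly. The sketch below assumes the request also sets "playwright_include_page": True in its meta; the parsing logic is left as a placeholder.

    async def parse(self, response):
        # Available because the request set "playwright_include_page": True.
        page = response.meta["playwright_page"]

        # ... extract your data from `response` as usual ...

        # Close the page and its context so the browser session (and the
        # proxy tied to it) is released instead of lingering in memory.
        await page.close()
        await page.context.close()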

With this, you can use proxy rotation with Playwright, and you can go further and move the logic into a custom downloader middleware to keep your spider code clean, as sketched below.
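Here is one way such a middleware could look. Treat it as a sketch under a few assumptions: the class name, the PROXY_* setting names, and the per-request session counter are illustrative, not something provided by scrapy-playwright itself.

    # middlewares.py -- inject a fresh Playwright context and proxy session
    # into every outgoing request.
    import itertools


    class PlaywrightProxyRotationMiddleware:
        def __init__(self, endpoint, username, password):
            self.endpoint = endpoint
            self.username = username
            self.password = password
            self.sessions = itertools.count(1)

        @classmethod
        def from_crawler(cls, crawler):
            settings = crawler.settings
            # PROXY_ENDPOINT / PROXY_USERNAME / PROXY_PASSWORD are hypothetical
            # setting names -- use whatever your project defines.
            return cls(
                settings.get("PROXY_ENDPOINT"),
                settings.get("PROXY_USERNAME"),
                settings.get("PROXY_PASSWORD"),
            )

        def process_request(self, request, spider):
            session = next(self.sessions)
            request.meta.setdefault("playwright", True)
            request.meta["playwright_context"] = f"session-{session}"
            request.meta["playwright_context_kwargs"] = {
                "proxy": {
                    "server": self.endpoint,
                    "username": f"{self.username}-session-{session}",
                    "password": self.password,
                }
            }
            return None  # let Scrapy continue handling the request

Enable it through DOWNLOADER_MIDDLEWARES in settings.py, and your spiders no longer need to build the Playwright meta by hand.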
On the downside, creating many concurrent contexts consumes a lot of RAM and CPU, so if you are hosting the spider on a low-budget VPS, make sure to tweak the concurrency settings to fit your resources rather than crash your setup.
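For example, something along these lines in settings.py; the exact numbers depend entirely on your hardware, and the PLAYWRIGHT_* limits are settings exposed by recent versions of scrapy-playwright.

    # settings.py -- keep browser resource usage within the machine's limits.
    # The numbers below are just a conservative starting point for a small VPS.
    CONCURRENT_REQUESTS = 4               # standard Scrapy concurrency cap
    PLAYWRIGHT_MAX_CONTEXTS = 4           # max concurrent browser contexts
    PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 1  # one page (tab) per context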
