The Differences Between Newbie & Pro-Level Web Scraper Coders
- Checks & Balances
Newbie
A newbie uses no checks and balances. The assumption is: if it works on my machine, it will work in production.
Pro
a. A pro has carefully looked at every breaking point imaginable in the code and checks whether any of them can bring the whole operation down: the web server blocking his IP, rate limits, the site changing its code, the internet connection going down, the disk filling up, etc.
b. A pro builds in alerts and packs essential information into them so he can debug failures easily.
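As an illustration of that mindset, here is a minimal Python sketch. The `send_alert` helper, retry counts, and backoff numbers are assumptions, not a prescribed implementation; the point is retrying transient failures and attaching enough context to debug the alert:

```python
import logging
import time

import requests

logging.basicConfig(level=logging.INFO)

def send_alert(subject, detail):
    # Hypothetical helper: wire this to email, Slack, PagerDuty, etc.
    logging.error("ALERT: %s | %s", subject, detail)

def fetch_with_checks(url, retries=3, backoff=5):
    """Fetch a URL, retrying transient failures and alerting with context."""
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code in (403, 429):
                # Likely an IP block or rate limit -- put the evidence in the alert
                send_alert("Possible block/rate limit",
                           f"url={url} status={resp.status_code} attempt={attempt}")
            else:
                resp.raise_for_status()
                return resp.text
        except requests.RequestException as exc:
            # Network down, DNS failure, timeout, etc.
            send_alert("Request failed", f"url={url} attempt={attempt} error={exc!r}")
        time.sleep(backoff * attempt)
    raise RuntimeError(f"Giving up on {url} after {retries} attempts")
```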
- Code & Architecture
Newbie
A newbie spends too much time on code and too little time on the architecture.
Pro
A pro spends much time researching and experimenting with different frameworks and libraries like Scrapy, Puppeteer, Selenium, and Beautiful Soup to see which best suits his current needs.
- Framework
Newbie
A newbie doesn’t use a framework because it is not in his ‘favorite’ programming language, and writes code without following any best practices.
Pro
A pro knows that a framework might have a small learning curve, but that cost is quickly and heavily offset by all the abstractions it provides.
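For instance, a minimal Scrapy spider (pointed at the public practice site quotes.toscrape.com) gets request scheduling, retries, throttling, and export pipelines for free; the spider only declares what to extract:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    """Scrapy handles scheduling, retries, throttling, and item export;
    this class only declares what to pull out of each page."""
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Scrapy follows pagination for us -- no manual loop needed
        yield from response.follow_all(response.css("li.next a"), self.parse)
```

Run it with `scrapy runspider quotes_spider.py -o quotes.json` and you get structured output without writing any of the plumbing.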
- Being Like a Bot
Newbie
A newbie doesn’t work on ‘pretending to be human’ enough.
Pro
A pro works at appearing more human than an actual human: realistic headers, rotated user agents, and randomized, human-like delays between requests.
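A rough sketch of what that care looks like in Python with requests; the User-Agent pool and the delay range are illustrative choices, not magic values:

```python
import random
import time

import requests

# A small pool of realistic User-Agent strings (extend as needed)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.1 Safari/605.1.15",
]

def human_get(url, session=None):
    """Fetch a page while looking less like a bot: rotated UA,
    browser-like headers, and a randomized pause between requests."""
    session = session or requests.Session()
    headers = {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml",
    }
    time.sleep(random.uniform(2, 6))  # humans don't request pages every 50 ms
    return session.get(url, headers=headers, timeout=10)
```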
- Choosing Proxy
Newbie
A newbie uses free proxy servers available on the internet.
Pro
A pro knows there is no free lunch. If the project is important, he knows there is no practical way he can build and maintain a rotating proxy infrastructure himself, so he opts for a service like Proxies API.
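With a hosted rotating proxy, the client code stays trivial. The endpoint and credentials below are placeholders, not any provider's real interface; consult your provider's documentation (Proxies API included) for the actual call:

```python
import requests

# Placeholder endpoint/credentials -- replace with your provider's values.
PROXY = "http://USERNAME:PASSWORD@rotating-proxy.example.com:8080"

def fetch_via_rotating_proxy(url):
    """Each request exits through a different upstream IP, so the
    target site never sees a burst of traffic from one address."""
    proxies = {"http": PROXY, "https": PROXY}
    return requests.get(url, proxies=proxies, timeout=15)
```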
- Expect the Unexpected
Newbie
A newbie doesn’t factor in that the target website might change its code.
Pro
A pro expects it. He puts a timestamp on every website he has written a scraper for and writes a ‘Hello World’ test case for each that should pass no matter what; if it doesn’t, he sends himself an alert to update his code.
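One way to sketch this in Python; the registry structure and the check strings are hypothetical, and quotes.toscrape.com is a public practice site:

```python
from datetime import datetime, timezone

import requests

# Hypothetical registry: target URL, a trivial "hello world" check,
# and the date the scraper was last verified against the live site.
SCRAPER_REGISTRY = [
    {"url": "http://quotes.toscrape.com/", "must_contain": "Quotes to Scrape",
     "last_verified": "2024-01-15"},
]

def smoke_test():
    """A 'Hello World' check per site: if even this fails, the site has
    changed (or is blocking us) and the scraper needs attention."""
    for entry in SCRAPER_REGISTRY:
        resp = requests.get(entry["url"], timeout=10)
        ok = resp.ok and entry["must_contain"] in resp.text
        if not ok:
            print(f"ALERT {datetime.now(timezone.utc):%Y-%m-%d %H:%M} "
                  f"-- {entry['url']} failed (last verified {entry['last_verified']})")

if __name__ == "__main__":
    smoke_test()
```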
- Scraping Process
Newbie
A newbie uses RegEx or some other rudimentary method to scrape data.
Pro
A pro uses CSS selectors or XPath to retrieve data predictably; they tolerate many changes in the target HTML, so the code will probably still work.
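A small Beautiful Soup example of the difference: the selector names the structure (“the price inside a product”), not a brittle text pattern:

```python
from bs4 import BeautifulSoup

html = """
<div class="product">
  <h2 class="title">Blue Widget</h2>
  <span class="price">$19.99</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# CSS selectors target meaning, so cosmetic HTML changes rarely break them.
for product in soup.select("div.product"):
    title = product.select_one("h2.title").get_text(strip=True)
    price = product.select_one("span.price").get_text(strip=True)
    print(title, price)
```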
- Normalization Of Data
Newbie
A newbie doesn’t normalize the data he downloads.
Pro
Downloading from multiple websites means duplicate data, the same data in multiple formats, etc. A pro puts in normalization code to make sure the end data is as uniform as possible.
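A simplified sketch of such normalization in Python; the record shape and the price heuristic are assumptions, and real pipelines need locale-aware parsing:

```python
import re

def normalize_price(raw):
    """'$19.99', '19,99', ' 19.99 ' -> a float in one convention."""
    cleaned = re.sub(r"[^\d.,]", "", raw).replace(",", ".")
    return float(cleaned)

def normalize_record(record):
    return {
        "name": record["name"].strip().lower(),
        "price": normalize_price(record["price"]),
    }

records = [
    {"name": " Blue Widget ", "price": "$19.99"},   # site A
    {"name": "blue widget",   "price": "19,99"},    # site B, same product
]

# Normalize first, then deduplicate on the now-uniform fields
seen, unique = set(), []
for rec in map(normalize_record, records):
    key = (rec["name"], rec["price"])
    if key not in seen:
        seen.add(key)
        unique.append(rec)
print(unique)  # one record instead of two near-duplicates
```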
- Crawling Speed
Newbie
A newbie doesn’t work on scaling the spiders: no concurrency, no running multiple spiders with Scrapyd, no rotating proxies to make more requests per second.
Pro
A pro is always looking to make the crawling process faster and more reliable.
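In Scrapy, much of this tuning lives in settings.py. The setting names below are real Scrapy settings; the values are illustrative and should be tuned per target:

```python
# Scrapy settings (settings.py) that trade politeness for throughput.
CONCURRENT_REQUESTS = 32            # total parallel requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # don't hammer any single host
DOWNLOAD_DELAY = 0.25               # base delay between requests (seconds)
AUTOTHROTTLE_ENABLED = True         # back off automatically under server strain
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
RETRY_ENABLED = True                # retry transient failures instead of dying
RETRY_TIMES = 3
```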
- IP Blockage
Newbie
A newbie doesn’t believe that he will ever get IP blocked until he is.
Pro
A pro expects this almost to be a guarantee, especially for big sites like Amazon, Reddit, Yelp, etc. He puts in measures like Proxies API (Rotating Proxies) to help completely negate this risk.
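A heuristic sketch of detecting a block and failing over to another proxy; the marker strings and status codes are assumptions, since real blocks vary by site:

```python
import requests

BLOCK_MARKERS = ("captcha", "unusual traffic", "access denied")

def looks_blocked(resp):
    """Heuristic block detection: status codes plus telltale page text."""
    if resp.status_code in (403, 429, 503):
        return True
    return any(marker in resp.text.lower() for marker in BLOCK_MARKERS)

def fetch(url, proxy_pool):
    """Try each proxy in turn until one gets through.
    proxy_pool is a list of proxy URLs (or a single rotating endpoint)."""
    for proxy in proxy_pool:
        resp = requests.get(url, proxies={"http": proxy, "https": proxy},
                            timeout=15)
        if not looks_blocked(resp):
            return resp
    raise RuntimeError(f"All proxies blocked for {url}")
```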
The author is the founder of Proxies API, a proxy rotation API service.