Today we will put a microcontroller, normally a general-purpose device, to a rather unusual use. Specifically, I will describe how I made a web scraper out of a microcontroller (if it can be called that, since the data is actually processed on my own backend).
As many of you know, a web scraper is a tool that retrieves information from a website. The most sought-after information is email addresses and phone numbers. Recently, however, scraping has also been used for a variety of statistical tasks, where the scraper collects data on products and their prices and evaluates them.
Similar tools are even used as robots for trading cryptocurrencies or conventional currencies. At the statistically most appropriate moment, the robot sells one currency and buys another suitable one with a rising trend. In my case, I was interested in getting information from websites, specifically phone numbers and email addresses.
Web scraper via microcontrollers
So let us go straight to the implementation. I used three kinds of microcontroller boards. For sites served over unencrypted HTTP, I used an Arduino with an Ethernet shield (it supports HTTP only, not HTTPS). I also used the NodeMCU board with an integrated ESP8266 chip and the ESP32 Devkit v1 DOIT, both of which can handle HTTPS as well. Compared with the ESP8266, the ESP32 has two cores, a clock 160 MHz faster and nearly 400 KB more RAM, and it can also connect to 802.1X (WPA / WPA2 Enterprise) networks.
All three boards work in a dead-simple way. Their only job is to connect to the target site and forward its source code, line by line, to my own site, where I process it and extract the information I need with a regular expression. Because the boards cannot run client-side scripts such as JavaScript, they are not detected by JavaScript-based tools for catching scrapers. Devices such as a Raspberry Pi can be recorded, because they run these scripts when emulating a browser. My boards will not show up in Google Analytics, the Smartlook viewer and similar tools.
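The server-side extraction step described above could look roughly like the following. This is a minimal sketch in Python, used here only for illustration; the patterns and the names `EMAIL_RE`, `PHONE_RE` and `extract_contacts` are assumptions and not the actual expression used on the site.

```python
import re

# Deliberately simple patterns for illustration only. Real addresses and
# phone formats vary far more than these two expressions cover.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d /-]{7,}\d")


def extract_contacts(lines):
    """Collect unique emails and phone numbers from forwarded source lines."""
    emails, phones = [], []
    for line in lines:
        for email in EMAIL_RE.findall(line):
            if email not in emails:  # keep first occurrence, preserve order
                emails.append(email)
        for phone in PHONE_RE.findall(line):
            if phone not in phones:
                phones.append(phone)
    return emails, phones
```

Processing line by line matches the way the boards forward the page, so the server never needs the whole document in memory at once.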
The only way to record that a page has been requested (a GET request) is on the server side, in PHP or another server-side language. Requests can be made with different HTTP versions, from 1.0 through 1.1 and, on newer ESP32 boards, HTTP/2, depending on what the target site supports. On my site, I process the source lines of the other site in PHP, using a regular expression that I have rewritten and gradually improved. Today I am able to capture email addresses written in the various formats used on the web to protect them from robots, scrapers and crawlers.
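To make the idea of "protected" address formats concrete, here is a small sketch of how such text might be normalized before a regular expression is applied. It is written in Python for illustration, and the helper name `deobfuscate` and the handled formats (bracketed "at" and "dot", HTML character entities) are assumptions; the formats the real expression covers are not listed in this article.

```python
import html
import re

# Matches "[at]", "(at)", "{at}" with optional surrounding whitespace.
AT_RE = re.compile(r"\s*[\[({]\s*at\s*[\])}]\s*", re.IGNORECASE)
# Matches "[dot]", "(dot)", "{dot}" with optional surrounding whitespace.
DOT_RE = re.compile(r"\s*[\[({]\s*dot\s*[\])}]\s*", re.IGNORECASE)


def deobfuscate(text):
    """Turn common obfuscated email spellings back into plain addresses."""
    text = html.unescape(text)   # "&#64;" becomes "@", "&#46;" becomes "."
    text = AT_RE.sub("@", text)  # "john [at] example" becomes "john@example"
    text = DOT_RE.sub(".", text) # "example [dot] com" becomes "example.com"
    return text
```

A substitution this naive will also rewrite ordinary prose that happens to contain "(at)", so in practice the result would still need to be validated by the email pattern afterwards.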
Boards can access content "behind" a login via HTTP Authentication, if the host allows it. Over time, you can also build more advanced applications that track statistics from a site, for example the temperature from a particular weather site, the bitcoin exchange rate or sports results, and this data can be collected and worked with for years.
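As an illustration of what an authenticated request looks like on the wire, the sketch below builds the raw text of a GET request with an `Authorization` header. It assumes the Basic scheme (the article does not say which HTTP Authentication scheme is used), and the helper names `basic_auth_header` and `build_get_request` are hypothetical. It is written in Python; on a board, the same text would be written to the TCP client.

```python
import base64


def basic_auth_header(user, password):
    """Return an HTTP Basic Authorization header line for the credentials."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Authorization: Basic {token}"


def build_get_request(host, path, user=None, password=None):
    """Build the raw text of an HTTP/1.1 GET request, optionally with Basic auth."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",          # mandatory in HTTP/1.1
        "Connection: close",      # ask the server to close after the response
    ]
    if user is not None:
        lines.append(basic_auth_header(user, password))
    # Header lines end with CRLF; an empty line terminates the header block.
    return "\r\n".join(lines) + "\r\n\r\n"
```

Basic authentication only encodes the credentials, it does not encrypt them, so it is best combined with HTTPS on the boards that support it.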
Multiple boards can browse one page at the same time. Generally, I make one visit every 6-24 hours, for example to news pages, whose content can then be used to create an RSS feed placed on my site, or for a similar purpose. Processed data can be stored immediately in a database or in tables in XML or CSV format, and it can be used practically right away.
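Storing the processed data in CSV form could be sketched as follows. This is a Python illustration only; the function name `contacts_to_csv` and the column names `source_url`, `kind` and `value` are hypothetical and not taken from the actual backend.

```python
import csv
import io


def contacts_to_csv(rows):
    """Serialize (source_url, kind, value) rows to CSV text with a header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["source_url", "kind", "value"])  # header row
    for row in rows:
        writer.writerow(row)  # csv handles quoting of commas and quotes
    return buf.getvalue()
```

Writing to an in-memory buffer keeps the sketch independent of the file system; the returned text can be saved to a file or sent on to a database import.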
I hope you liked the article and learned about a new use for Arduino, ESP boards and open hardware as a whole. Data is in demand today, and in the future this method of acquiring it will lead to the construction of farms of similar hardware, with the data processed in the backend of each farm, which can then trade in mailing lists and phone-number lists with millions of entries.
NOTE: Downloading and using data from another site must be permitted by that site's host 🙂
More about my scraper on NodeMCU: https://arduino.php5.sk/web-scraper.php
To read the content of a site, just take the WebClient example (Arduino) and adapt it yourself: https://www.arduino.cc/en/Tutorial/WebClient
(ESP8266 HTTP): https://github.com/esp8266/Arduino/blob/master/libraries/ESP8266WiFi/examples/WiFiClient/WiFiClient.ino
(ESP8266 HTTPS): https://gist.github.com/9SQ/200c796672b0f4db173e
(ESP32 HTTP): https://github.com/espressif/arduino-esp32/blob/master/libraries/WiFi/examples/WiFiClient/WiFiClient.ino
(ESP32 HTTPS): https://github.com/espressif/arduino-esp32/blob/master/libraries/WiFiClientSecure/examples/WiFiClientSecure/WiFiClientSecure.ino
(ESP32 under 802.1x HTTP): https://github.com/espressif/arduino-esp32/blob/master/libraries/WiFi/examples/WiFiClientEnterprise/WiFiClientEnterprise.ino