Web scraping and JavaScript: a walkthrough using Puppeteer

Joaquin Correa
4 min read · Jul 30, 2021

A while ago I read my friend’s blog post about web scraping. While I liked it, I instantly realized there was something missing: a front-end perspective. So, given that front-end development is my preference at the time of writing, I decided to write this quick introduction to web scraping using JavaScript instead of Ruby.

Photo by Anita Jankovic on Unsplash

-Basic concept

Suppose we want to get data from the web. What we would usually do is look for an API that someone else has hopefully designed for us and that contains the data we want for our apps. This, however, may not always be the case, and that is where web scraping comes in. In plain English, web scraping means retrieving data directly from a webpage ourselves, so we can use it in our programs.

-What do we need?

Besides JavaScript, we will need Puppeteer. Puppeteer is a Node library that ‘talks’ to Chromium, the open-source browser project behind Chrome. To install it locally in your project, use the following command:

npm i puppeteer

-Let’s code

For this example, we want to get current data from the Premier League table, more specifically the team names and their respective points.

After installing Puppeteer, let’s import it into a variable named puppeteer. Next, declare an anonymous asynchronous function that gets called instantly. Add a try/catch block inside the body of the function:

const puppeteer = require('puppeteer')

;(async () => {
  try {} catch (err) { console.log(err) }
})()
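Before wiring in Puppeteer, it can help to sanity-check this async-IIFE-with-try/catch shape on its own. Here the await-heavy scraping work is replaced with a hypothetical resolved promise, and the function is given a name so its result can be inspected:

```javascript
// The same async try/catch shape as above, extracted into a named
// function. 'ready' is just placeholder data standing in for real work.
const run = async () => {
  try {
    const value = await Promise.resolve('ready')
    return value
  } catch (err) {
    console.log(err)
    return null
  }
}

run().then((result) => console.log(result)) // prints "ready"
```

Any error thrown inside the try block ends up in the catch, instead of crashing the process with an unhandled rejection.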

Inside the try block, let’s create a mini browser using the .launch() method, and then create a brand new empty page with the .newPage() method.

const puppeteer = require('puppeteer')

;(async () => {
  try {
    const browser = await puppeteer.launch()
    const webpage = await browser.newPage()
  } catch (err) {
    console.log(err)
  }
})()

The next thing to do is to tell the webpage to visit the Goal website through the .goto() method. Then add a variable that will hold the HTML of that specific page, obtained with the .content() method, and print it to the terminal with console.log().

const puppeteer = require('puppeteer')

;(async () => {
  try {
    const browser = await puppeteer.launch()
    const webpage = await browser.newPage()
    await webpage.goto('https://www.goal.com/en/premier-league/table/2kwbbcootiqqgmrzs6o5inle5')

    const html = await webpage.content()
    console.log(html)
  } catch (err) {
    console.log(err)
  }
})()

In my case, I named my file index.js. Now run it locally with the node command followed by the name of your JavaScript file in the terminal, like this:

node index.js

It prints the HTML of the whole Goal webpage!

At this point we may have run into one or more errors when launching the browser, so it is worth looking those errors up, since the setup process can differ between operating systems.
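For example, a frequent failure on Linux and CI machines is Chromium refusing to start because of sandbox restrictions. A sketch of a common workaround (only worth trying if the plain .launch() call fails in your environment) is to pass extra launch arguments:

```javascript
// Replaces the plain puppeteer.launch() inside the try block.
// These flags disable Chromium's sandbox, which some locked-down
// environments do not allow; skip them when the default launch works.
const browser = await puppeteer.launch({
  headless: true,
  args: ['--no-sandbox', '--disable-setuid-sandbox'],
})
```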

So far we have created a browser, told it to go to a specific webpage, and taken all of its HTML. Now we are set to use this tool more in depth. The meat and potatoes of this example are in the following lines.

First, let’s comment out or delete the last two lines regarding the HTML. Next, in the editor, add a call to the .evaluate() method:

let standings = await webpage.evaluate(() => {})

Now let’s go to the Goal website to inspect, with the devtools and the document.querySelectorAll() method, the exact HTML elements that contain the data we want for our standings function. This process may take a while depending on the website’s HTML.

On the Goal website, it looks like the nodes containing the team names are the ones with the class name ‘.widget-match-standings__team--full-name’.
And the nodes whose innerText holds the current points number have the class name ‘.widget-match-standings__pts’.

Looks like we found them. With that info in mind, let’s return to the editor and write the callback just as if we were working inside the DOM:

let standings = await webpage.evaluate(() => {
  const teams = [...document.querySelectorAll('.widget-match-standings__team--full-name')].map((teamNode) => teamNode.innerText)
  const points = [...document.querySelectorAll('.widget-match-standings__pts')].map((pointsNode) => pointsNode.innerText)
})

Now let’s add a return statement that joins each team with its points in an object:

let standings = await webpage.evaluate(() => {
  const teams = [...document.querySelectorAll('.widget-match-standings__team--full-name')].map((teamNode) => teamNode.innerText)
  const points = [...document.querySelectorAll('.widget-match-standings__pts')].map((pointsNode) => pointsNode.innerText)
  return teams.map((team, i) => ({ team: team, points: points[i + 1] }))
})
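The i + 1 offset is worth a closer look: it suggests the points selector matches one extra node before the first team's points (presumably the ‘Pts’ column header, though that is an assumption on my part). With hypothetical stand-in arrays, the pairing works like this:

```javascript
// Stand-in data imitating the scraped innerText values; 'Pts' plays the
// role of the extra header node that shifts everything down by one.
const teams = ['Manchester City', 'Manchester United', 'Liverpool']
const points = ['Pts', '86', '74', '69']

// Same mapping as in the evaluate() callback above.
const standings = teams.map((team, i) => ({ team: team, points: points[i + 1] }))
console.log(standings)
// → [ { team: 'Manchester City', points: '86' }, ... ]
```

If the selector on your target page does not match a header node, the plain points[i] would be the correct pairing instead.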

Lastly, add a console.log() below the function that prints all of our Premier League data from the standings variable to the console, and call the .close() method on the browser. With that, our whole code should look like this:
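Reconstructed from the steps above (the original post showed the finished file as an image), the whole script would look roughly like this, including the i + 1 offset and the class names as assumed earlier:

```javascript
const puppeteer = require('puppeteer')

;(async () => {
  try {
    const browser = await puppeteer.launch()
    const webpage = await browser.newPage()
    await webpage.goto('https://www.goal.com/en/premier-league/table/2kwbbcootiqqgmrzs6o5inle5')

    // Runs in the page context, so the DOM APIs are available here.
    let standings = await webpage.evaluate(() => {
      const teams = [...document.querySelectorAll('.widget-match-standings__team--full-name')].map((teamNode) => teamNode.innerText)
      const points = [...document.querySelectorAll('.widget-match-standings__pts')].map((pointsNode) => pointsNode.innerText)
      return teams.map((team, i) => ({ team: team, points: points[i + 1] }))
    })

    console.log(standings)
    await browser.close()
  } catch (err) {
    console.log(err)
  }
})()
```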

If we run it and check the console:

⚽⚽⚽

Et voilà! It returns the data as an array of objects with the teams and their respective points. Now we can use this data in our projects whenever we need it!

-Further reading

Puppeteer is a powerful tool to have for cases like ours. This is why I recommend going over its documentation.

My friend’s post about web scraping using the Ruby language and URI.
