Web scraper.
Here is how I managed to run a web scraper on AWS Lambda with puppeteer-core & chromium-min in CW41.

Monday starts AWESOME! I finally get a 200 OK status for my web scraper. And that by creating an AWS Lambda function, initialising Node, installing puppeteer-core & chromium-min in compatible versions, having the .tar file on my S3 bucket, and accessing all that from my Next.js app.
Here’s how I’ve done it:
Open a new VS Code window and the terminal inside.
Create a new directory on your PC:
mkdir <your dir name>
cd into the directory & initialise a Node project:
npm init -y
Inside this project create a file called
index.js
In your terminal, still inside the project directory, install the following packages:
npm install puppeteer-core@<version>
npm install @sparticuz/chromium-min@<version>
Since puppeteer frequently releases new versions, @sparticuz/chromium-min might not always align with the latest one. So make sure to check which Chrome version the latest @sparticuz/chromium-min release supports, and then check which puppeteer version ships that same Chromium version, so the two align. You might do this by trying to install e.g. @sparticuz/chromium-min@140 and, if that works, then installing the matching puppeteer version. In this case that was puppeteer-core@24.22.3.
Then in the index.js file write your puppeteer function.
I think it is important to set
pipe: true
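For reference, here's a minimal sketch of what that index.js function can look like. The bucket URL, the pack filename, and the event.url field are my assumptions for illustration; adjust them to your own setup:

```javascript
// index.js — sketch of the Lambda handler (bucket URL and pack name are placeholders)
const chromium = require("@sparticuz/chromium-min");
const puppeteer = require("puppeteer-core");

// URL of the chromium .tar pack you uploaded to your S3 bucket (assumption)
const CHROMIUM_PACK_URL =
  "https://<your-bucket>.s3.amazonaws.com/chromium-v140.0.0-pack.tar";

exports.handler = async (event) => {
  const browser = await puppeteer.launch({
    args: chromium.args,
    defaultViewport: chromium.defaultViewport,
    // chromium-min downloads and extracts the pack from this URL at cold start
    executablePath: await chromium.executablePath(CHROMIUM_PACK_URL),
    headless: chromium.headless,
    pipe: true, // talk to the browser over a pipe instead of a WebSocket
  });

  try {
    const page = await browser.newPage();
    // assumes the invoking app passes { "url": "https://…" } as the event
    await page.goto(event.url, { waitUntil: "networkidle2" });
    const title = await page.title();
    return { statusCode: 200, body: JSON.stringify({ title }) };
  } finally {
    await browser.close();
  }
};
```

This only grabs the page title as a placeholder; the actual scraping logic goes where the title is read.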
Again go to your terminal and, inside the directory created in the first step, run:
zip -r <nameofnewzip>.zip .
…to zip your entire project, including node_modules.
This now needs to go to AWS.
Create an AWS account and go to their Lambda service. Create a new function and upload your .zip to this function.
Then go to AWS's S3 service, create a bucket & upload the suitable .tar file to the bucket. You'll find the right file in the assets of the respective release: https://github.com/Sparticuz/chromium/releases
You need to update the URL to your bucket in the Lambda function and assign the correct policies to your bucket. You'll have to fiddle with the access rights to your bucket & Lambda function yourself. I assigned a policy to the bucket that I think was right and created an IAM user with access keys who can execute the Lambda function. But I'm no expert in this, so please double-check! These access keys now live in my .env file and are used by this library I installed in my Next.js app:
@aws-sdk/client-lambda
There are many tutorials on how to use this library in your Next.js application.
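As a rough sketch of that last step: assuming the Lambda function is named my-scraper and the access keys live in the env vars shown (both names are my assumptions, not from the tutorials), invoking it from Next.js server code looks roughly like this:

```javascript
// Sketch of invoking the Lambda from a Next.js server route
// ("my-scraper" and the env var names are placeholders)
const { LambdaClient, InvokeCommand } = require("@aws-sdk/client-lambda");

const client = new LambdaClient({
  region: process.env.AWS_REGION,
  credentials: {
    accessKeyId: process.env.AWS_ACCESS_KEY_ID,
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
  },
});

async function scrape(url) {
  const command = new InvokeCommand({
    FunctionName: "my-scraper",        // assumed Lambda function name
    Payload: JSON.stringify({ url }),  // becomes the event the handler receives
  });
  const response = await client.send(command);
  // response.Payload is a Uint8Array; decode it back into JSON
  return JSON.parse(Buffer.from(response.Payload).toString());
}
```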
The rest of the week
I implemented a skeleton to show while scraping the web and loading the product. But I 100% vibe coded it and it's shitty. So I reverted to the commit before and live without a skeleton for now. I just disabled the input field for the URL while the Lambda function is loading. Then I implemented simple rate limiting based on this article.
All together, on one of the last days of this week I pushed to production and wanted to show friends what I did, but on the train to work I found out I still had to update the environment variables in production, and therefore the functionality is failing. I'll update this tonight and am then very eager to hear feedback on my project. I feel like I've implemented all the necessary surrounding elements to now be able to build actual features, like scraping the price of the product from the page. I also feel confident to submit my project as my CS50 project in the near future. The next thing I need to improve is the login & sign-up.
See you next week! ✌🏽
Disclaimer: This blog is not AI polished. Not even corrected with ChatGPT. Raw from my techie brain trying to be artsy.