Ben's Bites
← Back
.md

Create a simple AI-enhanced website scraper

Use Zapier and Airtable to create a simple web scraping tool that uses AI to extract specific information.

beginner free
Tool: ZapierTool: AirtableTool: ChatGPT Topic: Data ExtractionTopic: Automation

2024-11-13

Web scraping used to be a complex task, especially when you want to extract a particular data from a page, and especially when done at scale.

With advancements in AI, it can now read and interpret websites just like a human, quickly and accurately extracting the information you need. To show how this works, we'll create a simple web scraper using Airtable and Zapier.

In this tutorial, you will learn how to:

  • Set up an Airtable to manage the scraping
  • Build a zap to scrape content from a website with PhantomJSCloud
  • Use the OpenAI API to extract the required information from the webpage content
  • Add the parsed information back to Airtable

You’ll need:

Let’s see how it’s done.

Step 1: Set up an Airtable to manage the scraping

For this tutorial, we’re going to run the process from an Airtable. However, you could easily use Notion, a CRM app or even Google Sheets.

Go to Airtable and click to create a new Airtable base from scratch.

__wf_reserved_inherit

You can then customise it to your needs, but at a minimum, you need the following columns:

  • URL - The URL of the website you want to scrape.
  • Status - Create one dropdown option here called “Go”. Switching a record to this status will be used to trigger the automation.
  • Query - A prompt to tell the AI exactly what information you want to parse.
  • Parsed Information - We’ll get Zapier to put the AI response here.
__wf_reserved_inherit

The link I’ve included is the pricing page for Beehiiv. The question we will ask AI is “What's the $ price of their most expensive plan (excluding enterprise)?

Next, under the “View” tab you should rename the current view to ‘All’, then right-click and select “Duplicate view”, and call the new view ‘Go’.

__wf_reserved_inherit

Then apply a filter to this new view to only show records where ‘Status’ is ‘Go.

__wf_reserved_inherit

Step 2: Build a zap to scrape content from a website with PhantomJSCloud

Now we have our Airtable and an example scraping task ready, we need a Zapier workflow to make it work.

Head over to Zapier, then in the dashboard, in the top left, click the big orange “Create” button and select ‘Zaps’.

__wf_reserved_inherit

Inside the zap editor, click on the box that says “Trigger” and select/search for ‘Airtable’ as the app. Then select ‘New Record’ as the trigger event.

Configure the step by selecting the base, table and view you just created in Airtable.

__wf_reserved_inherit

Records will only appear in this view when you change the status field to “Go”. Make sure you only do that once you’ve filled in the other columns which Zapier will need.

Test the zap step and you’ll see your example record pulled through.

__wf_reserved_inherit

Now we need to parse content from the website.

To do this, we’re going to use a tool called PhantomJsCloud. Go ahead, sign up, and on the dashboard you’ll see your API key which we’ll need in the next part. The free plan allows you to parse around 500 pages per day.

__wf_reserved_inherit
💡 Note: I’ve obscured my API key in the screenshot above. Yours will be a random string of letters.

PhantomJsCloud doesn’t have a Zapier integration, so instead we’re going to use a webhook step inside Zapier to send and receive data from the API.

In the zap editor, click the “Action” box. Then search for "Webhook by Zapier" and select ‘Custom Request’ as the action event.

When configuring the step, these are the important fields.

  • Method
    Select ‘POST’.
  • URL
    Input
    https://phantomjscloud.com/api/browser/v2/XX/
    (replacing the string between the last two slashes with your API key).
  • Data
    Input the following code, replacing {{1.URL}} with a mapping of the ‘Link’ value from the trigger step.

__wf_reserved_inherit
💡 Tip: You’ll notice that we’re parsing the website in HTML. The alternative is plain text. In our case, HTML often gives useful context (like the position of information in tables). This should ensure more accurate responses.

Test the step, and after a few seconds, you should see the text from the website.

__wf_reserved_inherit
💡 Note: Sometimes the returned text will include some ‘junk’ like header and footer links, ads and sidebars etc. It’s not an issue as the AI is smart enough to ignore these.

Step 3: Use the OpenAI API to extract the required information from the webpage content

With the raw content parsed from the website, we’re now ready to have AI find and summarise the specific information we need.

💡 Tip: Although you probably know ChatGPT as a separate chat app built by OpenAI, the ChatGPT integration in Zapier lets you interact with the underlying APIs and functionality that powers ChatGPT.

Add a new action step to your zap, with “ChatGPT” as the app and ‘Conversation’ as the action event.

When configuring the step, these are the important fields.

  • User message
    Use the following text and map the “Text” value from the webhook step and the “Query” value from the trigger step.
Scraped content: {{2.Text}}

User query: {{1.Query}}

Answer:
__wf_reserved_inherit
  • Model
    Select the “gpt-4o-mini” model, OpenAI’s cheapest and quickest model.
__wf_reserved_inherit
  • Assistant Instructions
    You can experiment here to get exactly what you want, but this prompt should work well.
You're going to read HTML content scraped from a website.

Look at the user query and output an answer.

Make your answer concise with just enough information to answer the question.

- Make sure you base the answer only on information contained in the scraped content.

- Don't guess or rely on any other source of information.

- If you're not 100% sure about your answer, output "Unknown".
__wf_reserved_inherit
  • Max Tokens
    This sets the maximum output length. Anything about 1,000 should be suitable.
__wf_reserved_inherit

Go ahead and test the step, then scroll down towards the end in the “Data out” tab, you’ll see the response generated by ChatGPT.

__wf_reserved_inherit

A quick check on their website shows us this is correct.

__wf_reserved_inherit

Step 4: Add the parsed information back to Airtable

The last part of the process is to add the AI output back to the Airtable record.

Add a new step to your zap with “Airtable” as the app and ‘Update Record’ as the action event.

Select the same base and table as you did in the trigger and for the “Record” field, map the ‘ID’ value from the trigger step.

__wf_reserved_inherit

Then in the “Parsed Information” field map the ‘Reply’ value from the ChatGPT step.

__wf_reserved_inherit

Test the zap, head back to Airtable, and you’ll see the response added.

__wf_reserved_inherit

This tutorial was created by Andrew.