Extract Structured Data From Any Webpage Using AI – LLM Scraper

An open-source TypeScript library that helps developers convert webpages into structured data using large language models.

LLM Scraper is an open-source TypeScript library that takes the messiness of webpages (HTML, Markdown, text, and even images) and turns them into structured data using Large Language Models (LLMs). It can be useful for tasks like web scraping, data mining, and content analysis.

The library uses function calling to extract data and supports a range of LLMs, including local GGUF models as well as OpenAI and Groq chat models. Additionally, it features schema definition with Zod, full type-safety with TypeScript, and is built on the Playwright framework. Its streaming support when crawling multiple pages makes it efficient for large-scale data extraction.


How to use it:

1. Install the required dependencies with NPM:

# NPM
$ npm i zod playwright llm-scraper

2. Import and initialize your LLM.

// For an OpenAI model:
import OpenAI from 'openai'
const model = new OpenAI()

// For a local LLM:
import { LlamaModel } from 'node-llama-cpp'
const model = new LlamaModel({
  modelPath: 'model.gguf',
})

3. Create a new browser instance and attach LLMScraper to it:

import { chromium } from 'playwright'
import LLMScraper from 'llm-scraper'

const browser = await chromium.launch()
const scraper = new LLMScraper(browser, model)

4. Here is an official example showing how to extract the top stories from Hacker News using the OpenAI API:

import { chromium } from 'playwright'
import { z } from 'zod'
import OpenAI from 'openai'
import LLMScraper from 'llm-scraper'

// Launch a browser instance
const browser = await chromium.launch()

// Initialize LLM provider
const llm = new OpenAI()

// Create a new LLMScraper
const scraper = new LLMScraper(browser, llm)

// Define schema to extract contents into
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe('Top 5 stories on Hacker News'),
})

// URLs to scrape
const urls = ['https://news.ycombinator.com']

// Run the scraper
const pages = await scraper.run(urls, {
  model: 'gpt-4-turbo',
  schema,
  mode: 'html',
  closeOnFinish: true,
})

// Stream the result from LLM
for await (const page of pages) {
  console.log(page.data)
}

5. This example shows how to scrape content from multiple URLs:

import { chromium } from 'playwright'
import { z } from 'zod'
import OpenAI from 'openai'
import LLMScraper from 'llm-scraper'

// Launch a browser instance
const browser = await chromium.launch()

// Initialize LLM provider
const llm = new OpenAI()

// Create a new LLMScraper
const scraper = new LLMScraper(browser, llm)

// Define schema to extract contents into
const schema = z.object({
  title: z.string().describe('Title of the webpage'),
})

// URLs to scrape
const urls = [
  'https://ft.com',
  'https://text.npr.org',
  'https://meduza.io',
  'https://theguardian.com',
]

// Run the scraper
const pages = await scraper.run(urls, {
  model: 'gpt-4-turbo',
  schema,
  mode: 'text',
  closeOnFinish: true,
})

// Stream the result from LLM
for await (const page of pages) {
  console.log(`Page Title: ${page.data?.title}`)
}

6. Use a local LLM instead:

import { chromium } from 'playwright'
import { LlamaModel } from 'node-llama-cpp'
import { z } from 'zod'
import LLMScraper from 'llm-scraper'

// Launch a browser instance
const browser = await chromium.launch()

const modelPath =
  '/Users/mish/jan/models/tinyllama-1.1b/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf'

const llm = new LlamaModel({ modelPath })

// Initialize a new LLMScraper with local model
const scraper = new LLMScraper(browser, llm)

// Define schema to extract contents into
const schema = z.object({
  h1: z.string().describe('The main heading of the page'),
})

// URLs to scrape
const urls = ['https://example.com', 'https://browserbase.com']

// Run the scraper
const pages = await scraper.run(urls, {
  schema,
  mode: 'text',
  closeOnFinish: true,
})

// Stream the result from LLM
for await (const page of pages) {
  console.log(page.data)
}
