Skip to content

Data Scraping

Learn how to efficiently extract data from tables, whether they are small, paginated, or infinitely scrolled.

Extracting All Data

The most efficient way to scrape a table is using table.map(). It processes rows in chunks and follows pagination up to the configured maxPages limit.

typescript
// Define the data shape you want to extract
interface User {
  id: string;
  name: string;
  email: string;
}

const allUsers = await table.map<User>(async ({ row }) => {
  return {
    id: await row.getCell('ID').innerText(),
    name: await row.getCell('Name').innerText(),
    email: await row.getCell('Email').innerText()
  };
});

console.log(`Extracted ${allUsers.length} users`);

Handling Large Datasets

For very large tables (1000+ rows), accumulation in memory might be too heavy. You can process data in chunks or write to a file directly.

typescript
import fs from 'fs';

const stream = fs.createWriteStream('users.csv');
stream.write('ID,Name,Email\n');

await table.forEach(
  async ({ row }) => {
    const id = await row.getCell('ID').innerText();
    const name = await row.getCell('Name').innerText();
    const email = await row.getCell('Email').innerText();
    
    stream.write(`${id},${name},${email}\n`);
  }
);

stream.end();

Scraping Specific Columns

If you only need values from a single column, use table.map(). Increase maxPages when you want to scan beyond the first page.

typescript
const emails = await table.map(
  ({ row }) => row.getCell('Email').innerText(),
  { maxPages: 5 }
);

// Get and transform values (e.g., parse currency)
const salaries = await table.map(async ({ row }) => {
  const text = await row.getCell('Salary').innerText();
  return parseFloat(text.replace('$', '').replace(',', ''));
});

Handling Dynamic Content

Some tables load data lazily. You might need to wait for cell content to be non-empty.

typescript
const allData = await table.map(
  async ({ row }) => {
    // Wait for specific cell to have content
    await expect(row.getCell('Status')).not.toBeEmpty();
    // Or wait for a specific condition
    await row.getCell('Status').locator('.badge').waitFor();
    
    // Now extract data...
    return row.toJSON();
  }
);

Exporting to JSON

You can easily dump the current page or specific rows to JSON.

typescript
// Dump current page
const pageData = await table.findRows({}).then(r => r.toJSON());

// Dump specific rows
const activeUsers = await table.findRows({ Status: 'Active' });
const json = await activeUsers.toJSON();

// Write to file
fs.writeFileSync('active-users.json', JSON.stringify(json, null, 2));