Text Extraction

This guide covers extracting text from PDF pages, including position information and search.

Simple Extraction

Get all text from a page as a single string:

const page = pdf.getPage(0);
const { text } = page.extractText();

console.log(text);

The text is extracted in reading order (top-to-bottom, left-to-right for LTR languages).

Extract All Pages

const allText: string[] = [];

for (const page of pdf.getPages()) {
  const { text } = page.extractText();
  allText.push(text);
}

const fullDocument = allText.join("\n\n");

Text with Positions

Get detailed information about each text span:

const { lines } = page.extractText();

for (const line of lines) {
  for (const span of line.spans) {
    console.log({
      text: span.text,
      x: span.bbox.x, // Left edge in points
      y: span.bbox.y, // Bottom edge in points
      width: span.bbox.width, // Bounding box width
      height: span.bbox.height, // Bounding box height
      fontSize: span.fontSize,
    });
  }
}
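
One use for per-span data is picking out title-like text by font size. The sketch below works on plain objects shaped like the spans above; `SpanLike`, `LineLike`, and `largestText` are illustrative names, not part of the library, and the sample data is made up.

```typescript
// Hypothetical minimal shapes: only the fields this sketch reads.
interface SpanLike {
  text: string;
  fontSize: number;
}
interface LineLike {
  spans: SpanLike[];
}

// Return the text of every span set in the largest font size found.
function largestText(lines: LineLike[]): string[] {
  let max = 0;
  for (const line of lines) {
    for (const span of line.spans) {
      max = Math.max(max, span.fontSize);
    }
  }

  const result: string[] = [];
  for (const line of lines) {
    for (const span of line.spans) {
      if (span.fontSize === max) result.push(span.text);
    }
  }
  return result;
}

// Sample data standing in for the lines returned by extractText().
const samplePage: LineLike[] = [
  { spans: [{ text: "Quarterly Report", fontSize: 24 }] },
  { spans: [{ text: "Revenue grew", fontSize: 11 }, { text: "in Q3", fontSize: 11 }] },
];

console.log(largestText(samplePage)); // ["Quarterly Report"]
```

Font size alone is a rough signal; a real document may set headings in bold at body size, so treat the result as a hint.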

Coordinate System

PDF coordinates have their origin at the bottom-left corner of the page:

x increases going right
y increases going up

To convert to a top-left origin (like screen coordinates), subtract the box's top edge (y + height) from the page height:

const pageHeight = page.height;

for (const line of lines) {
  const screenY = pageHeight - line.bbox.y - line.bbox.height;
  console.log(`"${line.text}" at screen position (${line.bbox.x}, ${screenY})`);
}
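
The same flip can be wrapped in a small helper so it is applied consistently. This is a sketch that assumes a bounding box shaped like the `bbox` objects above; `Box` and `toTopLeft` are illustrative names, not library exports.

```typescript
// Hypothetical shape mirroring the bbox objects shown above.
interface Box {
  x: number;
  y: number; // bottom edge, measured up from the bottom of the page
  width: number;
  height: number;
}

// Flip a bottom-left-origin box into top-left-origin coordinates.
// The returned y is the distance from the top of the page to the box's top edge.
function toTopLeft(box: Box, pageHeight: number): Box {
  return { ...box, y: pageHeight - box.y - box.height };
}

// A 12pt-high box whose bottom edge sits 700pt above the page bottom,
// on a page that is 792pt tall (US Letter).
const flipped = toTopLeft({ x: 72, y: 700, width: 100, height: 12 }, 792);
console.log(flipped.y); // 80
```

Only `y` changes; `x`, `width`, and `height` are the same in both coordinate systems.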

Search Text

Find text matching a literal string or a regular expression:

String Search

const results = page.findText("invoice");

for (const match of results) {
  console.log("Found at:", match.bbox);
}

Regex Search

// Find invoice numbers
const invoices = page.findText(/INV-\d{6}/g);

// Find email addresses
const emails = page.findText(/[\w.-]+@[\w.-]+\.\w+/g);

// Case-insensitive
const terms = page.findText(/important/gi);
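
Patterns are easier to debug on a plain string than on a PDF. The sketch below runs the same expressions through the standard `String.prototype.match`; it assumes, without verifying, that `findText` applies a regular expression with the same matching semantics. The sample sentence is made up.

```typescript
// Sample text standing in for extracted page text.
const sample = "Contact billing@example.com about INV-004217 and INV-12.";

// With the g flag, match() returns every match, or null when there are none.
const invoiceIds: string[] = sample.match(/INV-\d{6}/g) ?? [];
const emailAddresses: string[] = sample.match(/[\w.-]+@[\w.-]+\.\w+/g) ?? [];

console.log(invoiceIds); // ["INV-004217"]
console.log(emailAddresses); // ["billing@example.com"]
```

`INV-12` is skipped because the pattern requires exactly six digits after the prefix.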

Search Results

Each result includes:

interface TextMatch {
  text: string; // The matched text
  bbox: {
    x: number; // Left edge
    y: number; // Bottom edge
    width: number;
    height: number;
  };
  pageIndex: number; // Page where found
  charBoxes: BoundingBox[]; // Individual character positions
}

Search Entire Document

function searchDocument(pdf: PDF, pattern: string | RegExp) {
  const results: TextMatch[] = [];

  for (const page of pdf.getPages()) {
    const pageResults = page.findText(pattern);
    results.push(...pageResults);
  }

  return results;
}

const allMatches = searchDocument(pdf, /confidential/gi);
console.log(`Found ${allMatches.length} matches`);
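
Because every match carries a `pageIndex`, document-wide results can be summarized per page. This sketch runs on sample data rather than a real PDF; `MatchLike` and `countByPage` are illustrative names that use only fields from the `TextMatch` shape above.

```typescript
// Only the fields this sketch needs from the TextMatch shape above.
interface MatchLike {
  text: string;
  pageIndex: number;
}

// Tally matches per page; keys are zero-based page indices.
function countByPage(matches: MatchLike[]): Record<number, number> {
  const counts: Record<number, number> = {};
  for (const match of matches) {
    counts[match.pageIndex] = (counts[match.pageIndex] ?? 0) + 1;
  }
  return counts;
}

// Sample data standing in for the result of searchDocument().
const sampleMatches: MatchLike[] = [
  { text: "Confidential", pageIndex: 0 },
  { text: "confidential", pageIndex: 0 },
  { text: "CONFIDENTIAL", pageIndex: 2 },
];

const perPage = countByPage(sampleMatches);
for (const key of Object.keys(perPage)) {
  const pageIndex = Number(key);
  // Report one-based page numbers for readers.
  console.log(`Page ${pageIndex + 1}: ${perPage[pageIndex]} match(es)`);
}
```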

Extract by Region

Extract text from a specific area of the page:

const { lines } = page.extractText();

// Define region (in points from bottom-left)
const region = { x: 50, y: 700, width: 200, height: 50 };

// Keep only lines whose bounding box lies entirely inside the region
const inRegion = lines.filter(
  line =>
    line.bbox.x >= region.x &&
    line.bbox.x + line.bbox.width <= region.x + region.width &&
    line.bbox.y >= region.y &&
    line.bbox.y + line.bbox.height <= region.y + region.height,
);

const regionText = inRegion.map(line => line.text).join(" ");
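
The filter above keeps only lines that fit entirely inside the region, so a line that crosses the region's edge is dropped. Where partial overlap should count, a standard rectangle-intersection test can replace the containment check. This is a sketch on sample data; `Box` and `intersects` are illustrative names, not library exports.

```typescript
// Hypothetical shape mirroring the bbox objects used in this guide.
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// True when the two boxes overlap by some positive area.
// Boxes that only share an edge do not count as overlapping.
function intersects(a: Box, b: Box): boolean {
  return (
    a.x < b.x + b.width &&
    a.x + a.width > b.x &&
    a.y < b.y + b.height &&
    a.y + a.height > b.y
  );
}

const searchRegion: Box = { x: 50, y: 700, width: 200, height: 50 };

// Sample data standing in for the lines returned by extractText().
const sampleLines = [
  { text: "inside", bbox: { x: 60, y: 710, width: 100, height: 12 } },
  { text: "crosses the right edge", bbox: { x: 200, y: 720, width: 120, height: 12 } },
  { text: "far below", bbox: { x: 60, y: 100, width: 100, height: 12 } },
];

const overlapping = sampleLines
  .filter(line => intersects(line.bbox, searchRegion))
  .map(line => line.text);

console.log(overlapping); // ["inside", "crosses the right edge"]
```

Containment suits tightly drawn regions such as form fields; intersection is more forgiving when the region is drawn by hand.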

Working with Lines

Text is automatically grouped into lines. Each line contains spans with the same baseline:

const { lines } = page.extractText();

// Lines are already sorted top-to-bottom
for (const line of lines) {
  console.log(`Line at y=${line.baseline}: "${line.text}"`);

  // Access individual spans within the line
  for (const span of line.spans) {
    console.log(`  Span: "${span.text}" (font: ${span.fontName}, size: ${span.fontSize})`);
  }
}
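
Baselines also make it possible to guess paragraph breaks from vertical spacing. The sketch below assumes `baseline` is a number in points and that lines arrive top-to-bottom, as described above; `BaselineLine`, `groupIntoParagraphs`, and the 18pt default gap are illustrative choices, not library features.

```typescript
// Hypothetical minimal shape: only the fields this sketch reads.
interface BaselineLine {
  text: string;
  baseline: number;
}

// Start a new paragraph whenever the distance between consecutive
// baselines exceeds maxGap points.
function groupIntoParagraphs(lines: BaselineLine[], maxGap = 18): string[] {
  const paragraphs: string[] = [];
  let current: string[] = [];
  let previous: number | undefined;

  for (const line of lines) {
    if (previous !== undefined && Math.abs(previous - line.baseline) > maxGap) {
      paragraphs.push(current.join(" "));
      current = [];
    }
    current.push(line.text);
    previous = line.baseline;
  }

  if (current.length > 0) paragraphs.push(current.join(" "));
  return paragraphs;
}

// Sample data: two lines 14pt apart, a 36pt gap, then two more lines.
const sampleBaselines: BaselineLine[] = [
  { text: "Text extraction returns lines", baseline: 700 },
  { text: "in reading order.", baseline: 686 },
  { text: "A larger gap starts", baseline: 650 },
  { text: "a new paragraph.", baseline: 636 },
];

console.log(groupIntoParagraphs(sampleBaselines));
// ["Text extraction returns lines in reading order.", "A larger gap starts a new paragraph."]
```

A sensible `maxGap` depends on the document's font size and leading, so it is worth tuning per document.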

Handle Encoding Issues

Most modern PDFs include ToUnicode maps, which map the character codes in the file back to Unicode text. Older PDFs may lack them, so check the output for signs of failed decoding:

const { text } = page.extractText();

// Check for extraction issues
if (text.includes("\uFFFD") || text.trim().length === 0) {
  console.warn("Text extraction may be incomplete");
  console.warn("PDF may use non-embedded fonts or lack ToUnicode maps");
}
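
A single replacement character does not necessarily mean the whole page is unusable. The sketch below grades the output by the share of U+FFFD characters; `assessExtraction` and the 5% threshold are illustrative assumptions, not library behavior.

```typescript
type ExtractionQuality = "empty" | "suspect" | "ok";

// Classify extracted text by the proportion of U+FFFD replacement characters.
// maxBadRatio is an arbitrary default; tune it for your documents.
function assessExtraction(text: string, maxBadRatio = 0.05): ExtractionQuality {
  const trimmed = text.trim();
  if (trimmed.length === 0) return "empty";

  // Splitting on the character yields one more piece than there are occurrences.
  const bad = trimmed.split("\uFFFD").length - 1;
  return bad / trimmed.length > maxBadRatio ? "suspect" : "ok";
}

console.log(assessExtraction("   ")); // "empty"
console.log(assessExtraction("Invoice total: 42.00")); // "ok"
console.log(assessExtraction("\uFFFD\uFFFDab")); // "suspect"
```

An `"empty"` result often points to a scanned page, which is the OCR case listed under Known Limitations.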

Known Limitations

Issue	Cause	Workaround
Garbled text	Missing ToUnicode map	None - PDF must include mapping
Empty extraction	Text is actually images	Use OCR (external tool)
Wrong order	Complex layouts	Use position data to reorder
Missing CJK text	Predefined CMap	None - only Identity-H/V encodings are supported
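
For the "Wrong order" case, one simple way to reorder with position data is to bucket lines into rows by their bottom edge and then sort each row left to right. This is a sketch on sample data; `PositionedLine`, `reorderByPosition`, and the 2pt tolerance are illustrative assumptions, not library features.

```typescript
// Hypothetical minimal shape mirroring the lines used in this guide.
interface PositionedLine {
  text: string;
  bbox: { x: number; y: number; width: number; height: number };
}

// Order lines top-to-bottom, then left-to-right within a row.
// Lines whose bottom edges lie within `tolerance` points of the first
// line in a row are treated as part of that row.
function reorderByPosition(lines: PositionedLine[], tolerance = 2): PositionedLine[] {
  // y grows upward in PDF space, so the largest y comes first.
  const byHeight = [...lines].sort((a, b) => b.bbox.y - a.bbox.y);

  const ordered: PositionedLine[] = [];
  let row: PositionedLine[] = [];
  let rowY = 0;

  const flushRow = (): void => {
    row.sort((a, b) => a.bbox.x - b.bbox.x);
    ordered.push(...row);
    row = [];
  };

  for (const line of byHeight) {
    if (row.length > 0 && rowY - line.bbox.y > tolerance) flushRow();
    if (row.length === 0) rowY = line.bbox.y;
    row.push(line);
  }
  flushRow();

  return ordered;
}

// Sample data in a scrambled order.
const scrambled: PositionedLine[] = [
  { text: "Right cell", bbox: { x: 300, y: 700, width: 80, height: 12 } },
  { text: "Left cell", bbox: { x: 50, y: 701, width: 80, height: 12 } },
  { text: "Heading", bbox: { x: 50, y: 750, width: 120, height: 18 } },
];

console.log(reorderByPosition(scrambled).map(line => line.text));
// ["Heading", "Left cell", "Right cell"]
```

This handles table-like rows but not multi-column text, where each column needs to be read fully before the next; that case needs column detection first.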

Performance Tips

For large documents:

// Process pages one at a time to limit memory
for (let i = 0; i < pdf.getPageCount(); i++) {
  const page = pdf.getPage(i);
  const { text } = page.extractText();

  // Process text immediately
  processPageText(i, text);

  // Text data can now be garbage collected
}
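
When only part of a document matters, stopping at the first hit avoids extracting the remaining pages. The sketch below relies only on the `getPageCount`, `getPage`, and `extractText` calls used in this guide, described through minimal structural types; `PdfLike`, `PageLike`, and `firstPageContaining` are illustrative names.

```typescript
// Minimal structural types covering only the calls used in this guide.
interface PageLike {
  extractText(): { text: string };
}
interface PdfLike {
  getPageCount(): number;
  getPage(index: number): PageLike;
}

// Return the index of the first page whose text contains `needle`, or -1.
// Pages after the first hit are never extracted.
function firstPageContaining(pdf: PdfLike, needle: string): number {
  for (let i = 0; i < pdf.getPageCount(); i++) {
    const { text } = pdf.getPage(i).extractText();
    if (text.indexOf(needle) !== -1) return i;
  }
  return -1;
}
```

Any object with these three methods satisfies the types, so the helper can be exercised with a stub before it is pointed at a real document.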

Next Steps

Forms

Pages
