LibPDF

Text Extraction

This guide covers extracting text from PDF pages, including position information and search.

Simple Extraction

Get all text from a page as a single string:

const page = pdf.getPage(0);
const { text } = page.extractText();

console.log(text);

The text is extracted in reading order (top-to-bottom, left-to-right for LTR languages).

Extract All Pages

const allText: string[] = [];

for (const page of pdf.getPages()) {
  const { text } = page.extractText();
  allText.push(text);
}

const fullDocument = allText.join("\n\n");

Text with Positions

Get detailed information about each text span:

const { lines } = page.extractText();

for (const line of lines) {
  for (const span of line.spans) {
    console.log({
      text: span.text,
      x: span.bbox.x, // Left edge in points
      y: span.bbox.y, // Bottom edge in points
      width: span.bbox.width, // Bounding box width
      height: span.bbox.height, // Bounding box height
      fontSize: span.fontSize,
    });
  }
}
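One thing the per-span data makes possible is a rough heading heuristic: the spans with the largest font size on a page are often the title. The sketch below is illustrative only. `SpanLike` and `largestSpans` are hypothetical names, not part of the library, and the interface only mirrors the `text` and `fontSize` fields shown above.

```typescript
// Hypothetical minimal shape; mirrors the span fields used in this guide.
interface SpanLike {
  text: string;
  fontSize: number;
}

// Return every span whose font size equals the largest size present.
// Returns an empty array when there are no spans.
function largestSpans<T extends SpanLike>(spans: T[]): T[] {
  let max = -Infinity;
  for (const span of spans) {
    if (span.fontSize > max) max = span.fontSize;
  }
  return spans.filter(span => span.fontSize === max);
}

// Hand-written sample data standing in for spans from extractText().
const sampleSpans: SpanLike[] = [
  { text: "Quarterly Report", fontSize: 24 },
  { text: "Revenue grew in the third quarter.", fontSize: 11 },
];

console.log(largestSpans(sampleSpans).map(span => span.text));
```

Font size alone is a weak signal; documents that set body text and headings at the same size need other cues such as position or font name.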

Coordinate System

PDF coordinates start from the bottom-left corner:

  • x increases going right
  • y increases going up

To convert to top-left origin (like screen coordinates):

const pageHeight = page.height;

for (const line of lines) {
  const screenY = pageHeight - line.bbox.y - line.bbox.height;
  console.log(`"${line.text}" at screen position (${line.bbox.x}, ${screenY})`);
}
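The same conversion can be wrapped in a small reusable helper. This is a minimal sketch: `Box` and `toTopLeft` are hypothetical names introduced for illustration, and `Box` only mirrors the `bbox` fields used in this guide.

```typescript
// Hypothetical shape mirroring the bbox fields used in this guide.
interface Box {
  x: number;
  y: number; // bottom edge, measured from the bottom of the page
  width: number;
  height: number;
}

// Convert a PDF-space box (bottom-left origin) to top-left-origin coordinates.
// The returned y is the distance from the top of the page to the box's top edge.
function toTopLeft(box: Box, pageHeight: number): Box {
  return {
    x: box.x,
    y: pageHeight - box.y - box.height,
    width: box.width,
    height: box.height,
  };
}

// A 200x50 pt box whose bottom edge sits 700 pt up a 792 pt tall page.
const converted = toTopLeft({ x: 50, y: 700, width: 200, height: 50 }, 792);
console.log(converted.y); // 792 - 700 - 50 = 42
```

Only `y` changes; `x`, `width`, and `height` are the same in both coordinate systems.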

Search Text

Find text matching a pattern:

const results = page.findText("invoice");

for (const match of results) {
  console.log(`Found at:`, match.bbox);
}

The pattern can also be a regular expression:

// Find invoice numbers
const invoices = page.findText(/INV-\d{6}/g);

// Find email addresses
const emails = page.findText(/[\w.-]+@[\w.-]+\.\w+/g);

// Case-insensitive
const terms = page.findText(/important/gi);

Search Results

Each result includes:

interface TextMatch {
  text: string; // The matched text
  bbox: {
    x: number; // Left edge
    y: number; // Bottom edge
    width: number;
    height: number;
  };
  pageIndex: number; // Page where found
  charBoxes: BoundingBox[]; // Individual character positions
}

Search Entire Document

function searchDocument(pdf: PDF, pattern: string | RegExp) {
  const results: TextMatch[] = [];

  for (const page of pdf.getPages()) {
    const pageResults = page.findText(pattern);
    results.push(...pageResults);
  }

  return results;
}

const allMatches = searchDocument(pdf, /confidential/gi);
console.log(`Found ${allMatches.length} matches`);
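Once matches from every page are in one array, it is often useful to group them by page. The following sketch relies only on the `pageIndex` field from the interface above; `MatchLike` and `groupByPage` are hypothetical names, and the sample data is hand-written rather than real search output.

```typescript
// Minimal stand-in for the TextMatch fields used here (an assumption).
interface MatchLike {
  text: string;
  pageIndex: number;
}

// Group matches by the page they were found on, preserving match order.
function groupByPage<T extends MatchLike>(matches: T[]): Map<number, T[]> {
  const byPage = new Map<number, T[]>();
  for (const match of matches) {
    const bucket = byPage.get(match.pageIndex);
    if (bucket) {
      bucket.push(match);
    } else {
      byPage.set(match.pageIndex, [match]);
    }
  }
  return byPage;
}

// Hand-written sample standing in for the result of searchDocument().
const sampleMatches: MatchLike[] = [
  { text: "Confidential", pageIndex: 0 },
  { text: "confidential", pageIndex: 2 },
  { text: "CONFIDENTIAL", pageIndex: 0 },
];

groupByPage(sampleMatches).forEach((matches, pageIndex) => {
  console.log(`Page index ${pageIndex}: ${matches.length} match(es)`);
});
```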

Extract by Region

Extract text from a specific area of the page:

const { lines } = page.extractText();

// Define region (in points from bottom-left)
const region = { x: 50, y: 700, width: 200, height: 50 };

const inRegion = lines.filter(
  line =>
    line.bbox.x >= region.x &&
    line.bbox.x + line.bbox.width <= region.x + region.width &&
    line.bbox.y >= region.y &&
    line.bbox.y + line.bbox.height <= region.y + region.height,
);

const regionText = inRegion.map(line => line.text).join(" ");
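The filter above keeps only lines that lie entirely inside the region, so a line that crosses the region's edge is dropped. If partial hits should count, an overlap test is a common alternative. This is a sketch under that assumption; `Rect` and `overlaps` are hypothetical names, not library API.

```typescript
// Hypothetical shape mirroring the bbox fields used in this guide.
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// True when the two rectangles share any area.
// Rectangles that only touch along an edge do not count as overlapping.
function overlaps(a: Rect, b: Rect): boolean {
  return (
    a.x < b.x + b.width &&
    a.x + a.width > b.x &&
    a.y < b.y + b.height &&
    a.y + a.height > b.y
  );
}

const searchRegion: Rect = { x: 50, y: 700, width: 200, height: 50 };

// This box starts left of the region but extends into it.
const crossing: Rect = { x: 40, y: 710, width: 100, height: 12 };
console.log(overlaps(crossing, searchRegion)); // true
```

To use it with extracted lines, replace the containment predicate with `line => overlaps(line.bbox, region)`.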

Working with Lines

Text is automatically grouped into lines. Each line contains spans with the same baseline:

const { lines } = page.extractText();

// Lines are already sorted top-to-bottom
for (const line of lines) {
  console.log(`Line at y=${line.baseline}: "${line.text}"`);

  // Access individual spans within the line
  for (const span of line.spans) {
    console.log(`  Span: "${span.text}" (font: ${span.fontName}, size: ${span.fontSize})`);
  }
}

Handle Encoding Issues

Most modern PDFs include ToUnicode maps for proper text extraction. Older PDFs may lack them, so check the result for replacement characters or an empty string:

const { text } = page.extractText();

// Check for extraction issues
if (text.includes("\uFFFD") || text.length === 0) {
  console.warn("Text extraction may be incomplete");
  console.warn("PDF may use non-embedded fonts or lack ToUnicode maps");
}
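A single stray replacement character does not always mean the whole page is unusable. One possible refinement is to measure what fraction of the text is U+FFFD and act on a threshold. The helper name `replacementRatio` and the 5% cutoff below are assumptions for illustration, not library behavior.

```typescript
// Fraction of UTF-16 code units in `text` that are U+FFFD (replacement character).
// Returns 0 for an empty string.
function replacementRatio(text: string): number {
  if (text.length === 0) return 0;
  const count = text.split("\uFFFD").length - 1;
  return count / text.length;
}

// Hand-written sample: 2 of 4 characters failed to decode.
const ratio = replacementRatio("ab\uFFFD\uFFFD");
console.log(ratio); // 0.5

// The 0.05 threshold is an arbitrary example value.
if (ratio > 0.05) {
  console.warn("Large share of undecodable characters; consider OCR");
}
```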

Known Limitations

Issue | Cause | Workaround
--- | --- | ---
Garbled text | Missing ToUnicode map | None - PDF must include mapping
Empty extraction | Text is actually images | Use OCR (external tool)
Wrong order | Complex layouts | Use position data to reorder
Missing CJK text | Predefined CMap | Only Identity-H/V supported
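For the "Wrong order" case, a simple reordering pass over the position data might look like the sketch below. It groups lines whose bottom edges fall within a tolerance into one row, then reads rows top-to-bottom and each row left-to-right. `LineLike`, `reorderLines`, and the 2 pt default tolerance are assumptions for illustration. Multi-column layouts need more than this.

```typescript
// Hypothetical minimal shape; mirrors the line fields used in this guide.
interface LineLike {
  text: string;
  bbox: { x: number; y: number; width: number; height: number };
}

// Reorder lines top-to-bottom, then left-to-right within each row.
// Lines whose bottom edge is within `tolerance` points of the row's
// first line are treated as part of the same row.
function reorderLines<T extends LineLike>(lines: T[], tolerance = 2): T[] {
  // In PDF space a larger y is higher on the page, so sort descending.
  const sorted = lines.slice().sort((a, b) => b.bbox.y - a.bbox.y);

  const rows: { y: number; items: T[] }[] = [];
  for (const line of sorted) {
    const row = rows[rows.length - 1];
    if (row && Math.abs(row.y - line.bbox.y) <= tolerance) {
      row.items.push(line);
    } else {
      rows.push({ y: line.bbox.y, items: [line] });
    }
  }

  const result: T[] = [];
  for (const row of rows) {
    row.items.sort((a, b) => a.bbox.x - b.bbox.x);
    result.push(...row.items);
  }
  return result;
}

// Hand-written sample: two fragments on one row, one line below.
const shuffled: LineLike[] = [
  { text: "Total:", bbox: { x: 50, y: 680, width: 40, height: 12 } },
  { text: "2024-01-15", bbox: { x: 300, y: 700, width: 60, height: 12 } },
  { text: "Date:", bbox: { x: 50, y: 701, width: 30, height: 12 } },
];

console.log(reorderLines(shuffled).map(line => line.text));
// [ 'Date:', '2024-01-15', 'Total:' ]
```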

Performance Tips

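For large documents, process one page at a time so that each page's extracted text can be released before the next page is read: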

// Process pages one at a time to limit memory
for (let i = 0; i < pdf.getPageCount(); i++) {
  const page = pdf.getPage(i);
  const { text } = page.extractText();

  // Process text immediately
  processPageText(i, text);

  // Text data can now be garbage collected
}
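The same page-at-a-time pattern can also stop early, which avoids extracting pages that are never needed. The sketch below is written against two hypothetical structural interfaces, `PageLike` and `DocLike`, that mirror only the `getPageCount`, `getPage`, and `extractText` calls used in the loop above; `forEachPageText` is an illustrative helper, not a library function.

```typescript
// Hypothetical structural types mirroring the calls used in this guide.
interface PageLike {
  extractText(): { text: string };
}
interface DocLike {
  getPageCount(): number;
  getPage(index: number): PageLike;
}

// Visit each page's text in order. Returning false from `visit` stops
// the loop, so later pages are never extracted.
// Returns the number of pages that were extracted.
function forEachPageText(
  doc: DocLike,
  visit: (index: number, text: string) => boolean | void,
): number {
  const count = doc.getPageCount();
  for (let i = 0; i < count; i++) {
    const { text } = doc.getPage(i).extractText();
    if (visit(i, text) === false) return i + 1;
  }
  return count;
}

// In-memory stand-in for a document, used only to demonstrate the helper.
const fakePages = ["Cover page", "Signature: J. Doe", "Appendix"];
const fakeDoc: DocLike = {
  getPageCount: () => fakePages.length,
  getPage: index => ({ extractText: () => ({ text: fakePages[index] }) }),
};

// Stop at the first page that mentions a signature.
const extracted = forEachPageText(fakeDoc, (index, text) => {
  return text.indexOf("Signature") === -1;
});
console.log(`Extracted ${extracted} of ${fakePages.length} pages`);
// Extracted 2 of 3 pages
```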