Text Extraction
Extract text from PDFs with position information and search capabilities.
Simple Extraction
Get all text from a page as a single string:
const page = pdf.getPage(0);
const { text } = page.extractText();
console.log(text);
Text is returned in reading order: top to bottom, and left to right for left-to-right (LTR) languages.
Extract All Pages
const allText: string[] = [];
for (const page of pdf.getPages()) {
  const { text } = page.extractText();
  allText.push(text);
}
const fullDocument = allText.join("\n\n");
Text with Positions
Get detailed information about each text span:
const { lines } = page.extractText();
for (const line of lines) {
  for (const span of line.spans) {
    console.log({
      text: span.text,
      x: span.bbox.x, // Left edge in points
      y: span.bbox.y, // Bottom edge in points
      width: span.bbox.width, // Bounding box width
      height: span.bbox.height, // Bounding box height
      fontSize: span.fontSize,
    });
  }
}
Coordinate System
PDF coordinates start from the bottom-left corner:
- x increases going right
- y increases going up
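Because y grows upward, a box's `y` is its bottom edge and its top edge is `y + height`. The following is a minimal sketch of that arithmetic. The `BoundingBox` shape is assumed from the `bbox` fields used in this guide, and `bboxEdges` is a hypothetical helper, not part of the library.

```typescript
// Assumed shape, mirroring the bbox fields shown in this guide.
interface BoundingBox {
  x: number; // Left edge in points
  y: number; // Bottom edge in points
  width: number;
  height: number;
}

// Hypothetical helper: derive all four edges in PDF (bottom-left origin) space.
function bboxEdges(bbox: BoundingBox) {
  return {
    left: bbox.x,
    right: bbox.x + bbox.width,
    bottom: bbox.y,
    top: bbox.y + bbox.height, // y grows upward, so top is the larger value
  };
}

const edges = bboxEdges({ x: 72, y: 700, width: 100, height: 12 });
console.log(edges); // { left: 72, right: 172, bottom: 700, top: 712 }
```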
To convert to a top-left origin (as used by screen coordinates):
const pageHeight = page.height;
for (const line of lines) {
  const screenY = pageHeight - line.bbox.y - line.bbox.height;
  console.log(`"${line.text}" at screen position (${line.bbox.x}, ${screenY})`);
}
Search Text
Find text on a page that matches a string or a regular expression.
String Search
const results = page.findText("invoice");
for (const match of results) {
  console.log("Found at:", match.bbox);
}
Regex Search
// Find invoice numbers
const invoices = page.findText(/INV-\d{6}/g);
// Find email addresses
const emails = page.findText(/[\w.-]+@[\w.-]+\.\w+/g);
// Case-insensitive
const terms = page.findText(/important/gi);
Search Results
Each result includes:
interface TextMatch {
  text: string; // The matched text
  bbox: {
    x: number; // Left edge
    y: number; // Bottom edge
    width: number;
    height: number;
  };
  pageIndex: number; // Page where found
  charBoxes: BoundingBox[]; // Individual character positions
}
Search Entire Document
function searchDocument(pdf: PDF, pattern: string | RegExp) {
  const results: TextMatch[] = [];
  for (const page of pdf.getPages()) {
    const pageResults = page.findText(pattern);
    results.push(...pageResults);
  }
  return results;
}
const allMatches = searchDocument(pdf, /confidential/gi);
console.log(`Found ${allMatches.length} matches`);
Extract by Region
Extract text from a specific area of the page:
const { lines } = page.extractText();
// Define region (in points from bottom-left)
const region = { x: 50, y: 700, width: 200, height: 50 };
// Keep only lines whose bounding box lies entirely inside the region
const inRegion = lines.filter(
  line =>
    line.bbox.x >= region.x &&
    line.bbox.x + line.bbox.width <= region.x + region.width &&
    line.bbox.y >= region.y &&
    line.bbox.y + line.bbox.height <= region.y + region.height,
);
const regionText = inRegion.map(line => line.text).join(" ");
Working with Lines
Text is automatically grouped into lines. Each line contains spans with the same baseline:
const { lines } = page.extractText();
// Lines are already sorted top-to-bottom
for (const line of lines) {
  console.log(`Line at y=${line.baseline}: "${line.text}"`);
  // Access individual spans within the line
  for (const span of line.spans) {
    console.log(`  Span: "${span.text}" (font: ${span.fontName}, size: ${span.fontSize})`);
  }
}
Handle Encoding Issues
Most modern PDFs include ToUnicode maps, which allow character codes to be mapped back to Unicode text. Older PDFs may lack them, so check the result for signs of trouble:
const { text } = page.extractText();
// Check for extraction issues
if (text.includes("\uFFFD") || text.length === 0) {
  console.warn("Text extraction may be incomplete");
  console.warn("PDF may use non-embedded fonts or lack ToUnicode maps");
}
Known Limitations
| Issue | Cause | Workaround |
|---|---|---|
| Garbled text | Missing ToUnicode map | None; the PDF must include the mapping |
| Empty extraction | Text is rendered as images | Use OCR (external tool) |
| Wrong order | Complex layouts | Use position data to reorder |
| Missing CJK text | Predefined CMap | None; only Identity-H/V are supported |
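As a rough illustration of the "use position data to reorder" workaround, the sketch below groups lines into rows by their bottom edge and sorts each row left to right. It assumes lines expose `text` and `bbox` as shown earlier in this guide. `PositionedLine`, `reorderLines`, and the 2-point `tolerance` default are hypothetical names and values chosen for illustration. This is a single-column heuristic; multi-column layouts would need column detection first.

```typescript
// Assumed shapes, mirroring the fields used in this guide.
interface BoundingBox {
  x: number; // Left edge in points
  y: number; // Bottom edge in points
  width: number;
  height: number;
}
interface PositionedLine {
  text: string;
  bbox: BoundingBox;
}

// Hypothetical helper: top-to-bottom, then left-to-right within each row.
// Lines whose bottom edges differ by at most `tolerance` points from the
// first line of a row are treated as belonging to that row.
function reorderLines(lines: PositionedLine[], tolerance = 2): PositionedLine[] {
  // Larger y means higher on the page, so sort descending by y.
  const byY = [...lines].sort((a, b) => b.bbox.y - a.bbox.y);
  const ordered: PositionedLine[] = [];
  let row: PositionedLine[] = [];
  let rowY = 0;

  // Emit the current row sorted left to right, then start a new one.
  const flush = () => {
    row.sort((a, b) => a.bbox.x - b.bbox.x);
    ordered.push(...row);
    row = [];
  };

  for (const line of byY) {
    if (row.length > 0 && Math.abs(rowY - line.bbox.y) > tolerance) flush();
    if (row.length === 0) rowY = line.bbox.y;
    row.push(line);
  }
  flush();
  return ordered;
}

const sample: PositionedLine[] = [
  { text: "right", bbox: { x: 300, y: 700, width: 80, height: 12 } },
  { text: "below", bbox: { x: 50, y: 680, width: 80, height: 12 } },
  { text: "left", bbox: { x: 50, y: 701, width: 80, height: 12 } },
  { text: "title", bbox: { x: 50, y: 750, width: 200, height: 18 } },
];
console.log(reorderLines(sample).map(l => l.text).join(" | "));
// title | left | right | below
```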
Performance Tips
For large documents:
// Process pages one at a time to limit memory
for (let i = 0; i < pdf.getPageCount(); i++) {
  const page = pdf.getPage(i);
  const { text } = page.extractText();
  // Process text immediately
  processPageText(i, text);
  // Text data can now be garbage collected
}
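The loop above can also be wrapped in a generator so that callers pull one page at a time and can stop early. This is a sketch under stated assumptions: `TextDocument` and `TextPage` are hypothetical minimal interfaces covering only the `getPageCount`, `getPage`, and `extractText` calls used in this guide. `iteratePageText` and `findFirstPageContaining` are illustrative helpers, and the stub document exists only to make the example runnable. Whether the library itself caches extracted text is not covered here.

```typescript
// Hypothetical minimal interfaces covering only the calls used in this guide.
interface TextPage {
  extractText(): { text: string };
}
interface TextDocument {
  getPageCount(): number;
  getPage(index: number): TextPage;
}
interface PageText {
  pageIndex: number;
  text: string;
}

// Yield text one page at a time; a page is extracted only when requested.
function* iteratePageText(doc: TextDocument): Generator<PageText> {
  for (let i = 0; i < doc.getPageCount(); i++) {
    yield { pageIndex: i, text: doc.getPage(i).extractText().text };
  }
}

// Stop at the first page containing the term; later pages are never extracted.
function findFirstPageContaining(doc: TextDocument, term: string): number {
  for (const { pageIndex, text } of iteratePageText(doc)) {
    if (text.includes(term)) return pageIndex;
  }
  return -1;
}

// Stub document for demonstration; counts how many pages were extracted.
let extracted = 0;
const pageTexts = ["alpha", "beta", "gamma"];
const stubDoc: TextDocument = {
  getPageCount: () => pageTexts.length,
  getPage: (i: number) => ({
    extractText: () => {
      extracted++;
      return { text: pageTexts[i] ?? "" };
    },
  }),
};

const hit = findFirstPageContaining(stubDoc, "beta");
console.log(hit, extracted); // 1 2
```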