|:------------------------------------------------------: |
| ⚡︎ B o x L a n g ⚡︎
| Dynamic : Modular : Productive
|:------------------------------------------------------: |
Copyright Since 2023 by Ortus Solutions, Corp
www.boxlang.io | www.ortussolutions.com
A powerful BoxLang module that provides HTML parsing and cleaning capabilities using Jsoup. This module enables developers to safely parse, manipulate, and clean HTML content with ease.
Install the module with either CommandBox or the BoxLang installer scripts:
# BoxLang Installer Script
install-bx-module bx-jsoup
# CommandBox
box install bx-jsoup
htmlParse() parses an HTML string and returns a BoxDocument object for manipulation. BoxDocument extends Jsoup's Document class with additional BoxLang-specific methods.
Parameters:
html (string, required): The HTML string to parse
Returns: A BoxDocument object with methods for HTML manipulation
Example:
// Parse HTML content
htmlContent = "<html><head><title>My Page</title></head><body><h1>Hello World</h1></body></html>";
doc = htmlParse( htmlContent );
// Access document properties
title = doc.title(); // Returns "My Page"
bodyText = doc.body().text(); // Returns "Hello World"
// Use CSS selectors
htmlContent = "<ul><li class='item'>Item 1</li><li class='item'>Item 2</li></ul>";
doc = htmlParse( htmlContent );
items = doc.select( ".item" ); // Returns elements with class 'item'
// Extract text content
textContent = doc.text(); // Returns plain text without HTML tags
// Enhanced BoxDocument methods
htmlContent = "<div class='container'><h1>Title</h1><p>Content</p></div>";
doc = htmlParse( htmlContent );
// Convert to JSON
jsonString = doc.toJSON(); // Compact JSON
prettyJson = doc.toJSON( true ); // Pretty-printed JSON
// Convert to XML
xmlString = doc.toXML(); // Compact XML
prettyXml = doc.toXML( true, 2 ); // Pretty-printed XML with 2-space indentation
Standard Jsoup Document Methods:
title() – Get the contents of the <title> tag
select(selector) – Find elements using CSS selectors
text() – Get the combined text of the entire document
outerHtml() – Get the HTML of the entire document
body() – Get the <body> element
head() – Get the <head> element
getElementById(id) – Find an element by its ID attribute
getElementsByTag(tagName) – Get all elements with the given tag
getElementsByClass(className) – Get all elements with the given class
getElementsByAttribute(attrName) – Get elements that have the specified attribute
html() – Get the inner HTML of the document body
createElement(tagName) – Create a new element with the given tag
Enhanced BoxDocument Methods:
toJSON() – Convert the document to a compact JSON representation
toJSON(prettyPrint) – Convert to JSON with optional pretty-printing
toXML() – Convert the document to a compact XML representation
toXML(prettyPrint, indentFactor) – Convert to XML with optional pretty-printing and custom indentation
Enhanced Methods Examples:
// Sample HTML for examples
htmlContent = `
<div class="product" id="item-1">
<h2>Product Name</h2>
<p class="description">Product description here</p>
<span class="price">$19.99</span>
</div>
`;
doc = htmlParse( htmlContent );
// Convert to JSON (compact)
jsonCompact = doc.toJSON();
// Result: {"tag":"html","children":[{"tag":"head"},{"tag":"body","children":[{"tag":"div","attributes":{"class":"product","id":"item-1"},"children":[{"tag":"h2","children":[{"text":"Product Name"}]},{"tag":"p","attributes":{"class":"description"},"children":[{"text":"Product description here"}]},{"tag":"span","attributes":{"class":"price"},"children":[{"text":"$19.99"}]}]}]}]}
// Convert to JSON (pretty-printed)
jsonPretty = doc.toJSON( true );
// Result: Formatted JSON with proper indentation
// Convert to XML (compact)
xmlCompact = doc.toXML();
// Result: <html><head></head><body><div class="product" id="item-1"><h2>Product Name</h2>...</div></body></html>
// Convert to XML (pretty-printed with 4-space indentation)
xmlPretty = doc.toXML( true, 4 );
// Result:
// <html>
// <head></head>
// <body>
// <div class="product" id="item-1">
// <h2>Product Name</h2>
// <p class="description">Product description here</p>
// <span class="price">$19.99</span>
// </div>
// </body>
// </html>
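The standard Jsoup lookup methods listed earlier (getElementById, getElementsByTag, getElementsByClass, getElementsByAttribute) are not covered by the examples so far. The following is a minimal sketch of how they might be combined. It assumes the object returned by htmlParse() exposes the usual Jsoup Document and Elements methods. The sample markup and variable names are made up for illustration.

```boxlang
// Sketch only: sample markup and variable names are hypothetical
htmlContent = "<div id='main'><p class='lead'>Intro</p><p>Details</p><a href='/docs'>Docs</a></div>";
doc = htmlParse( htmlContent );

// Look up a single element by its id attribute
mainDiv = doc.getElementById( "main" );
mainText = mainDiv.text(); // Combined text of everything inside the div

// Collect elements by tag name or by class name
paragraphs = doc.getElementsByTag( "p" );
paragraphCount = paragraphs.size(); // 2
leadText = doc.getElementsByClass( "lead" ).text(); // "Intro"

// Find elements that carry a given attribute, then read its value
// (attr() on a result set reads from the first element that has the attribute)
linkTarget = doc.getElementsByAttribute( "href" ).attr( "href" ); // "/docs"

// Inner HTML of the <body> element
bodyHtml = doc.body().html();
```

Methods that return several elements hand back a Jsoup Elements collection, so calls such as size(), text(), and attr() above operate on the whole result set.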
htmlClean() cleans and sanitizes HTML content to prevent XSS attacks and ensure safe rendering.
Parameters:
html (string, required): The HTML string to clean
safeList (string, optional): The safety level to apply (default: "relaxed")
preserveRelativeLinks (boolean, optional): Whether to preserve relative links (default: false)
baseUri (string, optional): Base URI for resolving relative links (default: "")
Returns: A cleaned HTML string
Safelist Options:
none: Maximum cleaning, removes all tags and returns plain text only
simpletext: Allows very limited inline formatting tags like <b>, <i>, <br>
basic: Basic cleaning, removes all tags except for a few safe ones
basicwithimages: Basic cleaning but allows images
relaxed: More lenient cleaning, allows more tags (default)
Examples:
// Basic cleaning with default "relaxed" safelist
dirtyHtml = "<script>alert('XSS')</script><p>Hello World!</p>";
cleanHtml = htmlClean( dirtyHtml );
// Result: "<p>Hello World!</p>"
// Strict cleaning with "basic" safelist
cleanHtml = htmlClean(
html: "<img src='image.jpg' /><script>alert('XSS')</script><p>Hello World!</p>",
safeList: "basic"
);
// Result: "<p>Hello World!</p>"
// Allow images with "basicwithimages" safelist
cleanHtml = htmlClean(
html: "<img src='image.jpg' /><script>alert('XSS')</script><p>Hello World!</p>",
safeList: "basicwithimages"
);
// Result: "<img src='image.jpg' /><p>Hello World!</p>"
// Plain text only with "none" safelist
cleanHtml = htmlClean(
html: "<p><strong>Bold text</strong> and <em>italic text</em></p>",
safeList: "none"
);
// Result: "Bold text and italic text"
// Preserve relative links
cleanHtml = htmlClean(
html: "<a href='page.html'>Link</a>",
preserveRelativeLinks: true
);
// Result: "<a href='page.html'>Link</a>"
// Convert relative links to absolute
cleanHtml = htmlClean(
html: "<a href='page.html'>Link</a>",
baseUri: "https://example.com/",
preserveRelativeLinks: false
);
// Result: "<a href='https://example.com/page.html'>Link</a>"
Clean user-generated content before storing or displaying:
userContent = "<p>Great article! <script>alert('hack')</script></p>";
safeContent = htmlClean( userContent );
// Store or display safeContent safely
Parse and extract data from HTML content:
scrapedHtml = "<div class='product'><h2>Product Name</h2><span class='price'>$19.99</span></div>";
doc = htmlParse( scrapedHtml );
productName = doc.select( ".product h2" ).text();
price = doc.select( ".price" ).text();
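When a selector matches several elements, the values can be collected in one call instead of being joined into a single string. The sketch below assumes the bundled Jsoup version provides eachText() and eachAttr() on its Elements collection (available in Jsoup 1.10.3 and later). The markup and variable names are illustrative only.

```boxlang
// Sketch only: sample markup and variable names are hypothetical
scrapedHtml = "<ul><li class='product'><a href='/p/1'>Widget</a></li><li class='product'><a href='/p/2'>Gadget</a></li></ul>";
doc = htmlParse( scrapedHtml );

links = doc.select( ".product a" );

// eachText() returns the text of every matched element as a list
productNames = links.eachText(); // ["Widget", "Gadget"]

// eachAttr() does the same for an attribute value
productUrls = links.eachAttr( "href" ); // ["/p/1", "/p/2"]

// first() narrows the result set to its first match
firstName = links.first().text(); // "Widget"
```

Both list-returning calls produce Java lists, which may need converting before they are passed to code that expects a native BoxLang array.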
Convert HTML to different formats using BoxDocument's enhanced methods:
// Parse HTML content
htmlContent = `
<article>
<header>
<h1>Article Title</h1>
<meta name="author" content="John Doe">
</header>
<section class="content">
<p>First paragraph of the article.</p>
<p>Second paragraph with <em>emphasis</em>.</p>
</section>
</article>
`;
doc = htmlParse( htmlContent );
// Convert to structured JSON for API responses
jsonData = doc.toJSON( true );
// Use jsonData in REST APIs or data processing
// Convert to XML for legacy systems
xmlData = doc.toXML( true, 2 );
// Use xmlData for XML-based integrations
// Extract plain text for search indexing
textContent = doc.text();
// Use textContent for full-text search
Clean HTML emails before sending:
emailTemplate = "<p>Hello {{name}}, <script>malicious()</script></p>";
cleanTemplate = htmlClean( emailTemplate, "basic" );
// Process cleanTemplate safely
BoxLang is a professional open-source project and it is completely funded by the community and Ortus Solutions, Corp. Ortus Patreons get many benefits like a cfcasts account, a FORGEBOX Pro account and so much more. If you are interested in becoming a sponsor, please visit our patronage page: https://patreon.com/ortussolutions
"I am the way, and the truth, and the life; no one comes to the Father, but by me (JESUS)" Jn 14:1-12