Extract All Links From an HTML String using lambdasoup
Task
Web Programming / Dealing with HTML / Extract All Links From an HTML String
Opam Packages Used
- lambdasoup Tested with version: 1.1.1 — Used libraries: lambdasoup
Code
The find_links function:
- Takes an HTML string as input
- Parses it into a document structure
- Extracts and returns all href attributes from anchor tags
Key components:
- Soup.parse - Converts an HTML string into a traversable document structure
- $$ operator - Performs CSS-style selector queries on the document
- "a[href]" - Selects all <a> tags that have an href attribute
- Soup.R.attribute - Extracts the value of a specified attribute from an element
let find_links html_content =
  let document_node = Soup.parse html_content in
  Soup.(document_node $$ "a[href]")
  |> Soup.to_list
  |> List.map (Soup.R.attribute "href")
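Because the "a[href]" selector only matches anchors that carry an href attribute, Soup.R.attribute has a value to return for every selected element. As a minimal sketch of an alternative, assuming the same lambdasoup package, the option-returning Soup.attribute can be combined with List.filter_map (OCaml 4.08 or later) to select every <a> tag and skip the ones without an href. The name find_links_opt and the inline HTML snippet are hypothetical and not part of the original recipe.

```ocaml
(* Hypothetical variant of find_links: selects every <a> element and keeps
   only those that have an href, using the option-returning Soup.attribute
   instead of Soup.R.attribute. *)
let find_links_opt html_content =
  let document_node = Soup.parse html_content in
  Soup.(document_node $$ "a")
  |> Soup.to_list
  |> List.filter_map (Soup.attribute "href")

(* Illustrative input: the second anchor has no href and is skipped. *)
let () =
  find_links_opt
    {|<a href="/one">One</a><a name="anchor">No link</a><a href="/two">Two</a>|}
  |> List.iter print_endline
```

Filtering in the selector, as the recipe does, keeps the pipeline shorter. The option-based form may be preferable when the selector cannot guarantee that the attribute is present.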
Sample HTML document containing some hyperlinks.
let html_content = {|
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Sample HTML Page</title>
</head>
<body>
<header>
<h1>My Cool Learning Links</h1>
</header>
<main>
<section>
<H2>Click a link to get started!</H2>
<ul>
<li><a href="https://ocaml.org/docs">The Ocaml.org Learning Page</a></li>
<li><a href="https://pola.rs/">Pola.rs: Modern Python Dataframes</a></li>
<li><a href="https://www.nonexistentwebsite.com">It used to work.com</a></li>
</ul>
</section>
</main>
</body>
</html>|}
Expected output shows one URL per line:
- https://ocaml.org/docs
- https://pola.rs/
- https://www.nonexistentwebsite.com
let () =
  find_links html_content
  |> List.iter print_endline