Module Mechaml.Agent

Scraping agent

Mechaml is a web agent that lets you:

  • Fetch and parse HTML pages
  • Analyze, fill and submit HTML forms
  • Manage cookies, headers and redirections

It is built on top of Cohttp, Lwt and Lambdasoup.

type t
type http_status_code = Cohttp.Code.status_code
type http_headers = Cohttp.Header.t

Operations on HTTP responses

module HttpResponse : sig ... end

The HttpResponse module defines a type and operations to extract content and metadata from a server response.

type result = t * HttpResponse.t

Main operations

val init : ?max_redirect:int -> unit -> t

Create a new empty agent. ~max_redirect sets how many consecutive times the agent automatically follows the Location header on HTTP 302 or 303 responses; the limit prevents redirect loops. Set it to 0 to disable automatic redirection.

Perform a GET request to the specified URI. get "http://www.site/some/url" agent sends an HTTP GET request and returns the updated state of the agent together with the server's response.

val get : string -> t -> result Lwt.t
val get_uri : Uri.t -> t -> result Lwt.t
val click : Page.Link.t -> t -> result Lwt.t

Same as get, but works directly with links instead of URIs.
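As an illustration of how the agent state is threaded through successive requests, here is a minimal sketch. It assumes the mechaml and lwt libraries are available; the helper names fetch and fetch_both are hypothetical and not part of the library.

```ocaml
(* Sketch only: requires the mechaml and lwt libraries. *)
open Mechaml

(* Hypothetical helper: fetch [url] with a fresh agent and return the
   updated agent together with the server's response. *)
let fetch (url : string) : Agent.result Lwt.t =
  let agent = Agent.init ~max_redirect:5 () in
  Agent.get url agent

(* Hypothetical helper: fetch two pages in sequence. The agent returned
   by the first request is reused for the second one, so any state it
   accumulated (cookies, for instance) is carried over. *)
let fetch_both (first : string) (second : string) : Agent.result Lwt.t =
  let open Lwt.Infix in
  fetch first >>= fun (agent, _first_response) ->
  Agent.get second agent
```

Nothing is sent until the resulting promise is run, for example with Lwt_main.run (fetch_both url1 url2).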

Send a raw post request to the specified URI

val post : string -> string -> t -> result Lwt.t
val post_uri : Uri.t -> string -> t -> result Lwt.t
val submit : Page.Form.t -> t -> result Lwt.t

Submit a filled form
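A raw POST could be sketched as follows. The signatures above do not name their arguments, so this sketch assumes, by analogy with post_uri, that the first string is the URI and the second is the request body. The helper names post_raw and post_raw_uri are hypothetical.

```ocaml
(* Sketch only: requires the mechaml, lwt and uri libraries. *)
open Mechaml

(* Hypothetical helper: send [body] as a raw POST to [url] with a fresh
   agent. Assumption: the URI comes first, then the body. *)
let post_raw (url : string) (body : string) : Agent.result Lwt.t =
  Agent.post url body (Agent.init ())

(* Same request, going through an explicit Uri.t value. *)
let post_raw_uri (url : string) (body : string) : Agent.result Lwt.t =
  Agent.post_uri (Uri.of_string url) body (Agent.init ())
```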

Save some downloaded content in a file

val save_image : string -> Page.Image.t -> t -> result Lwt.t

save_image "/path/to/myfile.jpg" image agent loads the image using get, opens myfile.jpg, writes the content to it asynchronously, and returns the result.

val save_content : string -> string -> unit Lwt.t

save_content "/path/to/myfile.html" content writes the specified content to a file using Lwt's asynchronous IO.
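Because save_content needs neither an agent nor a network connection, it is easy to try in isolation. The sketch below assumes the mechaml and lwt.unix libraries are available; write_snapshot and snapshot_path are hypothetical names, and the HTML string is made-up sample data.

```ocaml
(* Sketch only: requires the mechaml and lwt.unix libraries. *)
open Mechaml

(* Hypothetical helper: write [html] to [path] asynchronously. *)
let write_snapshot (path : string) (html : string) : unit Lwt.t =
  Agent.save_content path html

(* A temporary file is used so the example does not touch real paths. *)
let snapshot_path : string = Filename.temp_file "mechaml_snapshot" ".html"

(* The write only happens once the promise is run. *)
let () =
  Lwt_main.run (write_snapshot snapshot_path "<html><body>hello</body></html>")
```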

Cookies

(see Cookiejar)

Return the current Cookiejar

Set the current Cookiejar

Add a single cookie to the current Cookiejar

Remove a single cookie from the Cookiejar

Headers
val client_headers : t -> Cohttp.Header.t

Return the default headers sent with every HTTP request.

val set_client_headers : Cohttp.Header.t -> t -> t

Use the specified headers as the new default headers

val add_client_header : string -> string -> t -> t

Add a single key/value pair to the default headers

val remove_client_header : string -> t -> t

Remove a single key/value pair from the default headers
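The header functions all take the agent as their last argument and return a new agent, so they compose naturally with the |> operator. A minimal sketch, assuming the mechaml and cohttp libraries are available; the header names and values are made up for illustration.

```ocaml
(* Sketch only: requires the mechaml and cohttp libraries. *)
open Mechaml

(* Build an agent, add two illustrative headers, then drop one again. *)
let agent_with_headers : Agent.t =
  Agent.init ()
  |> Agent.add_client_header "X-Example" "1"
  |> Agent.add_client_header "X-Temporary" "to-be-removed"
  |> Agent.remove_client_header "X-Temporary"

(* Inspect the resulting default headers through Cohttp's Header API. *)
let example_value : string option =
  Cohttp.Header.get (Agent.client_headers agent_with_headers) "X-Example"

let temporary_present : bool =
  Cohttp.Header.mem (Agent.client_headers agent_with_headers) "X-Temporary"
```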

Redirection
val set_max_redirect : int -> t -> t

Set the maximum number of consecutive redirections, to avoid infinite loops (use 0 to disable automatic redirection)

val default_max_redirect : int

The default maximum number of consecutive redirections
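The redirection limit can be chosen when the agent is created or changed later. A short sketch, assuming the mechaml library is available; the binding names and the limit of 3 are arbitrary.

```ocaml
(* Sketch only: requires the mechaml library. *)
open Mechaml

(* An agent that never follows redirections automatically. *)
let no_redirect : Agent.t = Agent.init ~max_redirect:0 ()

(* An agent created with the default limit, then restricted to 3 hops. *)
let strict : Agent.t = Agent.init () |> Agent.set_max_redirect 3

(* Restore the library default on an existing agent. *)
let back_to_default : Agent.t =
  Agent.set_max_redirect Agent.default_max_redirect strict
```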

The Agent Monad

This module defines a monad that implicitly manages the state corresponding to the agent inside the Lwt monad. It is essentially the state monad (for Agent.t) stacked on top of the Lwt monad.

module Monad : sig ... end
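To make the idea of a state monad concrete, here is a self-contained toy version in plain OCaml. It is not the Monad module itself: the agent type is a stand-in for Agent.t, results are plain values rather than Lwt promises, and every name (agent, m, fake_get, run, program) is hypothetical.

```ocaml
(* Toy stand-in for Agent.t: it only counts the requests made so far. *)
type agent = { requests : int }

(* A stateful computation takes an agent and returns the updated agent
   together with a value. In the real setting this pair would live
   inside Lwt.t. *)
type 'a m = agent -> agent * 'a

(* Wrap a plain value without touching the state. *)
let return (x : 'a) : 'a m = fun agent -> (agent, x)

(* Run [m], then feed its value and its updated agent to [f]. *)
let bind (m : 'a m) (f : 'a -> 'b m) : 'b m =
  fun agent ->
    let agent', x = m agent in
    f x agent'

let ( >>= ) = bind

(* Hypothetical action: pretend to fetch [url], bump the request
   counter, and return the length of the URL as a fake result. *)
let fake_get (url : string) : int m =
  fun agent -> ({ requests = agent.requests + 1 }, String.length url)

(* Run a computation from an initial agent. *)
let run (agent : agent) (m : 'a m) : agent * 'a = m agent

(* Two requests in sequence; the agent is never mentioned explicitly. *)
let program : int m =
  fake_get "http://a" >>= fun a ->
  fake_get "http://bb" >>= fun b ->
  return (a + b)

let final_agent, total = run { requests = 0 } program
```

The point of the pattern is visible in program: each step sees the agent left behind by the previous one, yet the code never passes it around by hand.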