package mechaml

Module Mechaml.Agent

Scraping agent

Mechaml is a web agent that lets you:

  • Fetch and parse HTML pages
  • Analyze, fill and submit HTML forms
  • Manage cookies, headers and redirections

It is built on top of Cohttp, Lwt and Lambdasoup.

type t
type http_status_code = Cohttp.Code.status_code
type http_headers = Cohttp.Header.t

Operations on HTTP responses

module HttpResponse : sig ... end

The HttpResponse module defines a type and operations to extract content and metadata from the server response.

type result = t * HttpResponse.t

Main operations

val init : ?max_redirect:int -> unit -> t

Create a new empty agent. ~max_redirect indicates how many times the agent will automatically and consecutively follow the Location header in case of an HTTP 302 or 303 response code, to avoid a redirect loop. Set to 0 to disable automatic redirection.
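The redirect budget described above can be pictured with a small standalone model. The sketch below does not use Mechaml at all: `reply`, `follow` and `toy_server` are hypothetical names, and returning the last redirect once the budget is spent is an assumption of this sketch rather than documented agent behaviour.

```ocaml
(* Toy model of a bounded redirect chain; not Mechaml's implementation. *)
type reply =
  | Body of string      (* a final response *)
  | Redirect of string  (* a 302/303-style response carrying a Location *)

(* Follow at most [max_redirect] consecutive redirects, then stop and
   return whatever reply was received last. *)
let rec follow ~max_redirect server url =
  match server url with
  | Redirect next when max_redirect > 0 ->
    follow ~max_redirect:(max_redirect - 1) server next
  | reply -> reply

(* Hypothetical server: /a -> /b -> /c, and /loop redirects to itself. *)
let toy_server = function
  | "/a" -> Redirect "/b"
  | "/b" -> Redirect "/c"
  | "/loop" -> Redirect "/loop"
  | _ -> Body "done"
```

With a budget of 0 the first reply comes back untouched, which mirrors the "set to 0 to disable" rule, and the self-redirecting /loop path terminates once the budget is used up.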

The following functions perform a GET request to the specified URI. get "http://www.site/some/url" agent sends an HTTP GET request and returns the updated state of the agent together with the server response.

val get : string -> t -> result Lwt.t
val get_uri : Uri.t -> t -> result Lwt.t
val click : Page.Link.t -> t -> result Lwt.t

Same as get, but works directly with links instead of URIs.
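As a minimal sketch of how init and get fit together, the snippet below builds a fresh agent and fetches one page. It assumes the mechaml, lwt and lwt.unix libraries are available; `fetch` and `fetch_blocking` are hypothetical helper names, and nothing is requested until one of them is actually called.

```ocaml
open Mechaml

(* Build a fresh agent and fetch one page. The request is only carried
   out once the returned promise is driven by an Lwt scheduler. *)
let fetch (url : string) : Agent.result Lwt.t =
  Agent.init () |> Agent.get url

(* Blocking wrapper for scripts that do not otherwise use Lwt:
   returns the updated agent paired with the server response. *)
let fetch_blocking (url : string) : Agent.result =
  Lwt_main.run (fetch url)
```

The first component of the result is the updated agent; passing it to the next request, rather than the original one, is what lets state accumulate across calls.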

The following functions send a raw POST request to the specified URI.

val post : string -> string -> t -> result Lwt.t
val post_uri : Uri.t -> string -> t -> result Lwt.t
val submit : Page.Form.t -> t -> result Lwt.t

Submit a filled form
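A hedged sketch of a raw POST followed by a GET that reuses the agent returned by the POST. It assumes that the second string argument of post is the request body (the signature alone does not say so); the URLs, the payload and the name `login_then_fetch` are made up for illustration.

```ocaml
open Mechaml

(* Hypothetical endpoint and payload, for illustration only. *)
let login_url = "http://www.site/login"
let login_body = "user=alice&password=secret"

(* Send the raw body, then request another page with the agent that
   the POST returned, so whatever state it gathered is carried over. *)
let login_then_fetch (agent : Agent.t) : Agent.result Lwt.t =
  Lwt.bind
    (Agent.post login_url login_body agent)
    (fun (agent, _login_response) ->
       Agent.get "http://www.site/private" agent)
```

Whether the server also expects a Content-Type header depends on the endpoint; if it does, one could be added to the agent's default headers before posting.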

Save downloaded content to a file

val save_image : string -> Page.Image.t -> t -> result Lwt.t

save_image "/path/to/myfile.jpg" image agent loads the image using get, opens myfile.jpg, writes the content to it asynchronously and then returns the result.

val save_content : string -> string -> unit Lwt.t

save_content "/path/to/myfile.html" content writes the specified content to a file using Lwt asynchronous I/O.
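A small sketch of save_content on its own, using only the signature shown above. `save_snippet` and `save_snippet_blocking` are hypothetical names, and lwt.unix is assumed to be available for Lwt_main.run.

```ocaml
open Mechaml

(* Write a small HTML snippet to [path]. The write happens when the
   returned promise is driven by an Lwt scheduler. *)
let save_snippet (path : string) : unit Lwt.t =
  Agent.save_content path "<html><body>hello</body></html>"

(* Blocking wrapper for scripts that are not otherwise using Lwt. *)
let save_snippet_blocking (path : string) : unit =
  Lwt_main.run (save_snippet path)
```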

Cookies

(see Cookiejar)

Return the current Cookiejar

Set the current Cookiejar

Add a single cookie to the current Cookiejar

Remove a single cookie from the Cookiejar

Headers
val client_headers : t -> Cohttp.Header.t

Return the default headers sent when performing HTTP requests

val set_client_headers : Cohttp.Header.t -> t -> t

Use the specified headers as new default headers

val add_client_header : string -> string -> t -> t

Add a single key/value pair to the default headers

val remove_client_header : string -> t -> t

Remove a single key/value pair from the default headers
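The header functions can be tried without any network access. The sketch below adds a custom header, reads it back through Cohttp's header API, then removes it; "X-Example" is a made-up header name chosen so it cannot clash with whatever defaults a fresh agent may carry.

```ocaml
open Mechaml

(* Add a custom header to a fresh agent. "X-Example" is a made-up name. *)
let agent = Agent.init () |> Agent.add_client_header "X-Example" "demo"

(* Read it back through Cohttp's header API. *)
let before = Cohttp.Header.get (Agent.client_headers agent) "X-Example"

(* Remove it again; the function returns a new agent value. *)
let stripped = Agent.remove_client_header "X-Example" agent
let after = Cohttp.Header.get (Agent.client_headers stripped) "X-Example"
```

Because every function takes the agent last and returns an agent, these calls chain naturally with the |> operator.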

Redirection
val set_max_redirect : int -> t -> t

Set the maximum number of consecutive redirections (to avoid infinite loops). Use 0 to disable automatic redirection.

val default_max_redirect : int

The default maximum number of consecutive redirections

The Agent Monad

This module defines a monad that implicitly threads the agent state through a sequence of operations while staying inside the Lwt monad. It is essentially the state monad (for Agent.t) stacked on top of the Lwt monad.

module Monad : sig ... end
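To illustrate the state-passing idea, here is a standard-library-only sketch of a state monad over a toy agent. It is not the Monad module itself: the Lwt layer is left out, and `agent`, `fake_get`, `return` and `bind` are names local to this example, not taken from the hidden signature.

```ocaml
(* Toy agent state: it only counts requests. Not Mechaml's Agent.t. *)
type agent = { requests : int }

(* A computation takes the current agent and returns the updated agent
   together with a value, the same shape as [t -> result] minus Lwt. *)
type 'a m = agent -> agent * 'a

let return (x : 'a) : 'a m = fun agent -> (agent, x)

(* Run [m], then feed its value and its updated agent to [f]. *)
let bind (m : 'a m) (f : 'a -> 'b m) : 'b m =
  fun agent ->
    let agent', x = m agent in
    f x agent'

let ( >>= ) = bind

(* Hypothetical "request": bumps the counter and returns a fake body. *)
let fake_get (url : string) : string m =
  fun agent -> ({ requests = agent.requests + 1 }, "body of " ^ url)

(* Two requests chained without mentioning the agent explicitly. *)
let both : (string * string) m =
  fake_get "/a" >>= fun a ->
  fake_get "/b" >>= fun b ->
  return (a, b)

(* Supplying the initial agent runs the whole chain. *)
let final_agent, bodies = both { requests = 0 }
```

The functions of this module already have the shape `t -> (t * 'a) Lwt.t`, which is why a stacked state-and-Lwt monad is a natural fit for sequencing them.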